[tahoe-dev] tree size increased?
zooko
zooko at zooko.com
Fri Dec 28 21:13:58 PST 2007
Brian:
There are lots of issues here, among them how we weigh downloadable
size, disk space, and Desert Island installation against each other. Also,
there is the issue of aesthetics. I think that it bothers you
aesthetically to have redundant copies of a file in the tree
(setuptools_darcs, or the setuptools 0.6c7-py2.5 egg, for example).
I value aesthetics, too, and I also value having the tahoe source
package be appealing to other people (especially you) even when this
issue isn't particularly important to my aesthetics.
Anyway, here are some numbers about the effect that compressing
internal tarballs has on untarred disk space and on the size of the
compressed downloadable allmydata-tahoe tarball.
A. "current": current configuration (uncompressed tarballs within
uncompressed tarballs)
B. "compress": gzip -9 compress tahoe/misc/dependencies/*.tar
C. "deep compress": recursively untar all tarballs within tarballs
and gzip -9 them all
D. "deep 7z compress": recursively untar all tarballs and 7z -mx=9
them all (note that this doesn't actually work for the Desert Island
build since setuptools doesn't automatically un-7z source tarballs
that it finds -- this is solely for comparison purposes)
E. "no deps": rm misc/dependencies/*.tar*
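For anyone who wants to take measurements of this kind themselves, here is a minimal sketch using only the Python standard library. It is not the procedure used for the numbers below: the function names are made up for illustration, rzip and lrzip have no stdlib equivalent and are left out, and xz-style LZMA at its default preset stands in for 7zip as an approximation.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` under several stdlib compressors."""
    return {
        "none": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        # Assumption: LZMA (xz container, default preset) as a stand-in for 7zip.
        "lzma": len(lzma.compress(data)),
    }


def tarball_sizes(path: str) -> dict:
    """Read an uncompressed tarball from `path` and measure it (hypothetical helper)."""
    with open(path, "rb") as f:
        return compressed_sizes(f.read())
```

Calling tarball_sizes() on an uncompressed allmydata-tahoe.tar built under each of the configurations above would give one row of comparable numbers per configuration.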
compression used on the overall allmydata-tahoe.tar tarball -->
         none       gzip      bzip2       rzip      lrzip       7zip
   ----------  ---------  ---------  ---------  ---------  ---------
A. 11,243,520  5,146,208  3,933,778  1,702,507  1,617,109  1,585,380
B.  6,430,720  5,147,929  4,970,992  4,321,795  4,038,092  4,023,087
C.  6,440,960  5,161,208  4,955,399  3,257,477  3,174,995  3,165,861
D.  6,205,440  4,930,134  4,899,728  4,757,996  4,778,605  4,768,450
E.  2,252,800    978,474    906,879    795,064    774,494    769,483
table 1. size of allmydata.tahoe.tar.$COMPRESSION tarball in bytes
It's interesting how much worse the older compression algorithms are
at taking advantage of the huge redundant pieces spread far apart.
It's also interesting to see that 7zip is even better than lrzip on
this input and has the added advantages of being streamable, faster
and ported to more platforms (lrzip works only on Linux).
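A synthetic illustration of the long-range redundancy effect, as a sketch rather than a reproduction of the measurements above: two identical 1 MiB blocks of random bytes are placed back to back. DEFLATE (gzip) only looks back 32 KiB and bzip2 works in blocks of at most 900 kB, so neither can see the second copy as a repeat, whereas LZMA with a multi-megabyte dictionary can. The function name and the block size are assumptions chosen for the example.

```python
import bz2
import lzma
import os
import zlib


def long_range_demo(block_size: int = 1 << 20) -> dict:
    """Compress two identical random blocks and report the output sizes."""
    block = os.urandom(block_size)  # incompressible on its own
    data = block + block            # the redundancy is block_size bytes apart
    return {
        "input": len(data),
        "deflate": len(zlib.compress(data, 9)),    # 32 KiB window
        "bzip2": len(bz2.compress(data, 9)),       # 900 kB blocks
        "lzma": len(lzma.compress(data, preset=6)),  # 8 MiB dictionary
    }


if __name__ == "__main__":
    for name, size in long_range_demo().items():
        print(f"{name:>8} {size:>10,}")
```

With these settings the DEFLATE and bzip2 outputs stay close to the full 2 MiB input, while the LZMA output is close to the size of a single block.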
> I think you're optimizing the wrong thing here.. I have a dozen tahoe trees
> on my laptop, and now they consume over half a gigabyte (588MB versus the
> previous 312MB), but I only ever download the .tar.gz maybe once a month, and
> even for users downloading it once a day, 2.37MB is not worth reducing.
As the table above shows, compressing misc/dependencies/*.tar saves
about 5 MB per tree of disk space, has no effect on the
allmydata-tahoe.tar.gz, and increases the allmydata-tahoe.tar.7z from
about 1.6 MB to about 4 MB. Whether this is a win or a loss depends on
whether you value disk space or fast downloads more. Actually I think that the concern
you mentioned wasn't due to this added 5 MB (which would have
increased your 12 trees from 312 MB to 373 MB, not to 588 MB), but
rather the addition of bundled easy_installable dependencies in order
to enable Desert Island installation. So perhaps the trade-off that
you are weighing is more disk space usage vs. Desert Island
installation, than disk space usage vs. compressed tarball size. I
just went back and added the E. row to inform us about that issue.
rm'ing the deps saves 9 MB of disk space per tree. It also makes the
downloadable much smaller, but of course if the user also has to
download some of those dependencies then it quickly becomes a net
loss of human time.
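The disk-space arithmetic in this paragraph can be checked against the "none" column of table 1. This is a small sketch, assuming 1 MB = 1,000,000 bytes and a dozen trees as in the quoted message; the variable names are invented for the example.

```python
# Untarred sizes in bytes, taken from the "none" column of table 1.
A_CURRENT = 11_243_520
B_COMPRESS = 6_430_720
E_NO_DEPS = 2_252_800

MB = 1_000_000  # assumption: decimal megabytes
TREES = 12      # "a dozen tahoe trees"

# Per-tree saving from compressing misc/dependencies/*.tar (A vs. B).
saving_compress_mb = (A_CURRENT - B_COMPRESS) / MB
# Per-tree saving from removing the bundled dependencies (A vs. E).
saving_no_deps_mb = (A_CURRENT - E_NO_DEPS) / MB

# Extra space across all trees attributable to uncompressed inner tarballs.
extra_from_uncompressed_mb = TREES * saving_compress_mb
# Increase actually reported in the quoted message.
observed_increase_mb = 588 - 312

print(f"compress saves {saving_compress_mb:.1f} MB per tree")
print(f"no deps saves  {saving_no_deps_mb:.1f} MB per tree")
print(f"uncompressed tarballs explain {extra_from_uncompressed_mb:.0f} MB "
      f"of the {observed_increase_mb} MB increase")
```

The uncompressed inner tarballs account for only a fraction of the observed increase, which is why the bundled dependencies themselves look like the larger factor.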
> (maybe we should consider auto-creating a .tar.gz which contains the support
> tarballs but not put them in the SCM tree).
That's a good idea.
It would certainly solve the disk-space usage problem, and I think it
appeals more aesthetically. I'm not going to work on this before
0.7.0 (instead I'm going to work on The Roadmap [1], e.g. fixing
the automatic .deb builds (#246) and lots and lots of documentation).
Here, I just created #249 -- "move bundled dependencies out of
revision control history and make them optional".
Regards,
Zooko
[1] http://allmydata.org/trac/tahoe/roadmap
tickets mentioned in this message:
http://allmydata.org/trac/tahoe/ticket/246
http://allmydata.org/trac/tahoe/ticket/249