[Tahoe-dev] storage efficiency improved
Brian Warner
warner-tahoe at lothar.com
Fri Jul 13 17:59:17 PDT 2007
I just finished fixing a trio of bugs/design-problems (#80 is the umbrella
ticket, #84, #85, and #81 are the three working tickets) that resulted in
tahoe storage servers taking up drastically more space than was sensible. The
biggest problem comes from the fact that most disk filesystems out there
(like ext3) have a minimum allocation size: every file consumes at least one
disk block, and for large drives those disk blocks are probably 4096 bytes long.
Tahoe was storing small amounts of metadata in separate files, so each share
consumed a minimum of 32kB, even for a share that was storing a 1-byte file.
100 shares at 32kB each was a minimum of 3.2MB per file, regardless of size.
Ouch.
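
The arithmetic behind those numbers can be sketched as follows. This is only
a back-of-the-envelope model, not code from Tahoe: the function name and
parameters are hypothetical, and it assumes that each of the 7 metadata/data
files plus the share directory itself occupied one 4096-byte block.

```python
BLOCK_SIZE = 4096  # bytes per disk block, typical for ext3 on large drives


def old_min_disk_usage(num_shares=100, fs_objects_per_share=8,
                       block_size=BLOCK_SIZE):
    """Minimum bytes on disk for one uploaded file under the old layout.

    fs_objects_per_share=8 is an assumption: 7 files plus 1 directory,
    each rounded up to a full disk block.
    """
    per_share = fs_objects_per_share * block_size
    return num_shares * per_share


if __name__ == "__main__":
    # Per-share floor (about 32kB) and per-file floor (about 3.2MB).
    print(8 * BLOCK_SIZE, old_min_disk_usage())
```

The same function with num_shares=10 and fs_objects_per_share=1 gives the
floor for a layout with ten single-file shares.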
We improved this in three ways:
#84: produce fewer shares in small networks. The default was 25-out-of-100
encoding, but if you only have 3 or 4 peers then that's just silly. The
new approach lets the Introducer provide a network-wide default, and
the new default value is 3-out-of-10. This provides a 10x space
improvement for small files, because there are fewer shares, so fewer
tiny files.
#85: store each share in a single file, rather than spreading the metadata
out among 7 files and a directory. This reduces the space by about 8x for
files smaller than about 10kB.
#81: implement a new kind of storage, called URI:LIT, in which the contents
of the file are included as a literal string *inside* the URI. This is
used for files that are 55 bytes or shorter. This also has the
side-effect of fixing #81 (crash when uploading a 0-byte file).
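
To illustrate the URI:LIT idea, here is a minimal sketch of packing a small
file's contents directly into a URI string and getting them back out. The
55-byte threshold comes from the description above; everything else (the
function names, the "URI:LIT:" prefix layout, and the use of unpadded
lowercase base32) is an assumption made for illustration and may not match
the real encoding.

```python
import base64

LIT_THRESHOLD = 55       # bytes; files this size or smaller become literals
LIT_PREFIX = "URI:LIT:"  # hypothetical prefix layout


def make_literal_uri(data: bytes):
    """Return a literal URI for small data, or None if it is too large."""
    if len(data) > LIT_THRESHOLD:
        return None  # caller would fall back to normal share-based upload
    encoded = base64.b32encode(data).decode("ascii").lower().rstrip("=")
    return LIT_PREFIX + encoded


def read_literal_uri(uri: str) -> bytes:
    """Recover the file contents from a literal URI, with no server access."""
    if not uri.startswith(LIT_PREFIX):
        raise ValueError("not a literal URI")
    encoded = uri[len(LIT_PREFIX):].upper()
    padding = "=" * (-len(encoded) % 8)  # restore the stripped base32 padding
    return base64.b32decode(encoded + padding)


if __name__ == "__main__":
    uri = make_literal_uri(b"hello tahoe")
    print(uri, read_literal_uri(uri))
```

Note that a 0-byte file falls out naturally in this scheme: it becomes a URI
with an empty body, and nothing needs to be uploaded at all.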
Our current guess is that disk utilization (totalled across all peers) now
looks something like:
  filesizes           disk utilization
  0 - 55 bytes        0 (URI:LIT is not stored on the storage servers)
  55 bytes - ~10kB    41kB
  ~10kB - ~21kB       82kB
and above that it should be about 3x (plus a constant overhead). We'll be
doing some more math and testing on this next week.
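
A rough model that reproduces these estimates is sketched below. It is a
guess at the arithmetic, not measured behaviour: it assumes 3-out-of-10
encoding, one file per share, 4096-byte disk blocks, and a hypothetical
per-share overhead of 700 bytes (picked only so that the block boundaries
fall near the ~10kB and ~21kB marks). The function name and all parameters
are made up for this example.

```python
import math


def estimated_disk_usage(file_size, k=3, n=10, block_size=4096,
                         share_overhead=700, lit_threshold=55):
    """Estimate total bytes consumed across all peers for one file.

    share_overhead is an assumed figure for per-share metadata, not a
    number taken from the storage format.
    """
    if file_size <= lit_threshold:
        return 0  # URI:LIT: contents live in the URI, nothing is stored
    # Each of the n shares carries roughly 1/k of the file plus overhead,
    # and the filesystem rounds every share file up to whole blocks.
    share_bytes = math.ceil(file_size / k) + share_overhead
    blocks_per_share = math.ceil(share_bytes / block_size)
    return n * blocks_per_share * block_size


if __name__ == "__main__":
    for size in (0, 55, 56, 10_000, 20_000, 1_000_000):
        print(size, estimated_disk_usage(size))
```

For large files the rounding and overhead stop mattering in this model, and
the total approaches n/k times the file size.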
I'll be upgrading our testnet to the new scheme shortly (which will flush all
the current data out of there, since we've changed the storage format so
drastically). Now that this problem is out of the way, we'll probably be
doing an 0.5.0 release next week once we've implemented some more CLI tools.
cheers,
-Brian
tickets mentioned:
#80: http://allmydata.org/trac/tahoe/ticket/80 storage format is inefficient
#81: http://allmydata.org/trac/tahoe/ticket/81 implement URI:LIT
#84: http://allmydata.org/trac/tahoe/ticket/84 produce fewer shares
#85: http://allmydata.org/trac/tahoe/ticket/85 one storage file per share