#346 closed enhancement (fixed)
increase share-size field to 8 bytes, remove 12GiB filesize limit
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 1.3.0 |
Component: | code-storage | Version: | 0.8.0 |
Keywords: | | Cc: |
Launchpad Bug: | | |
Description
The version=0 storage.ShareFile disk format currently uses a 4-byte field to store the size of the share, which limits total file size (when k=3) to 12GiB. We should implement a version=1 which uses an 8-byte field, to remove this limit.
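For reference, a quick back-of-the-envelope for the 12GiB figure, using the k=3 and 4-byte-field facts above:

```python
per_share_limit = 2**32            # largest share a 4-byte size field can describe: 4 GiB
k = 3                              # shares needed to reconstruct the file
overall_limit = k * per_share_limit
print(overall_limit // 2**30)      # -> 12 (GiB)
```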
The ShareFile class will need to read the version number first, then use that to decide how to read the size and num_leases fields. Then it should set self._data_offset and self._lease_offset accordingly.
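A minimal sketch of that dispatch, assuming the v0 header is three 4-byte big-endian words (version, share size, lease count) and that the proposed v1 simply widens the size field to 8 bytes; the exact field order and lease layout here are assumptions, not the actual on-disk format:

```python
import struct

def read_share_header(f):
    """Sketch: parse the share-file header, dispatching on the version word.

    Assumes v0 = (4-byte version, 4-byte size, 4-byte num_leases) and a
    hypothetical v1 = (4-byte version, 8-byte size, 4-byte num_leases).
    """
    f.seek(0)
    (version,) = struct.unpack(">L", f.read(4))
    if version == 0:
        share_size, num_leases = struct.unpack(">LL", f.read(8))
        data_offset = 0x0c
    elif version == 1:
        share_size, num_leases = struct.unpack(">QL", f.read(12))
        data_offset = 0x10
    else:
        raise ValueError("unknown share-file version %d" % version)
    lease_offset = data_offset + share_size
    return share_size, num_leases, data_offset, lease_offset
```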
I estimate that this change (plus testing) should take about one day.
The next limitation will be in the share structure (as opposed to the on-disk structure), implemented in WriteBucketProxy, in the 4-byte offset for hash tree sections. I estimate that this represents a limit of about 9TB. After that, I *think* we're using 8-byte offsets for everything, so we'll be good for something like 16EiB.
Change History (13)
comment:1 Changed at 2008-06-01T21:18:47Z by warner
It's a pity that this isn't fixed yet: somebody is uploading an 18GB (18491058490) file through the prodtahoe3 helper right now (SI=5bsz7dptf235innpmcvmfjea74), and it's going to fail badly in a few days when they finally finish pushing the ciphertext to the helper.
OTOH, 18GB is a stupidly large file.. I know my sense of "reasonable" file sizes is different than the rest of the world, but files of that size are hard to deal with even locally: it must take hours just to read the contents out from disk.
comment:2 Changed at 2008-06-02T18:39:56Z by zooko
General note: in the future, I think it is probably a good idea to use 8 bytes for all counts and sizes. Some fields would be okay with 4 bytes, but instead of spending the time to figure out whether 4 bytes will be safe, you should probably just use 8 bytes and move on to other topics.
Occasionally there may actually be some field which gets used in such a way that conserving the 4 extra bytes is valuable enough that we should take the time to think it through and decide to go with 4 bytes (or 2, 3, 5, or 6) instead of our 8-byte default. I'm not aware of any field like that in the current Tahoe design -- everything should currently default to 8 bytes, as far as I can think off the top of my head.
(And yes, Brian needs to get over his feeling that people are wrong to use files that large.)
comment:3 Changed at 2008-06-02T19:00:35Z by warner
The sad tale of 5bsz7 gets worse: struct.pack does not raise an exception when you give it a number that won't fit in the field you're creating; it just truncates the number. So this 18GB upload is going to look like it succeeds, and all the share data will be there, but the share size will be wrong. As a result, later attempts to read past the (size % 4GB) point will get a precondition exception, and the file will be unretrievable.
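For illustration, here is what that truncation does to the 18GB upload. (The silent wrapping was the behaviour of the struct module in the Python 2.x of that era; modern Pythons raise struct.error instead, so the mask below reproduces the old behaviour explicitly.)

```python
import struct

size = 18491058490                            # the ~18GB file being uploaded
wrapped = size & 0xFFFFFFFF                   # what the old struct.pack(">L", size) silently stored
stored = struct.unpack(">L", struct.pack(">L", wrapped))[0]
print(stored, stored == size % 2**32)         # ~1.3GB: the recorded share size is wrong
```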
In looking at storage.py to see what could go wrong, I think I may have missed a limitation: the "segment_size" and "data_size" fields in the share structure (in WriteBucketProxy) are also 4 bytes long, not just the offsets of the hash trees. That imposes a 4GB size limit on the overall file.
Hm. So I wonder if those 4.7GB file uploads that were taking place actually succeeded or not. Rats.
comment:4 Changed at 2008-06-02T19:11:07Z by zooko
So if this is causing a silent failure, where the file wasn't actually uploaded but the user is led to believe it was, then we need to elevate this issue from "major" to "critical".
comment:5 Changed at 2008-06-02T23:10:10Z by warner
I've looked over the code again, and we're ok with files up to 12GiB. The variable names in WriteBucketProxy are confusing: they do not make it clear whether a value is per-share or per-file. The segment_size and data_size fields in question turn out to be per-share, so the fact that they (and the 4-byte entries in the offset table) are limited to 4GiB only limits each share to 4GiB, meaning an overall file size limit of 12GiB.
I'm running a test now with a 5GiB file to confirm. It took about 2 hours to upload (on a local testnet, entirely on one machine), and will take about 16 hours to download.
To get past the 12GiB limit, there are actually two places in the share format that need fixing: the self._data_size value, stored in the third word of the share in a 4-byte field, needs to be expanded. This is in addition to the 4-byte entries in the offset table that I identified earlier.
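A hypothetical sketch of the shape of that fix: parameterize the packing of the per-share size words and the offset table on the field width, so a v2 writer can use 8-byte fields throughout. The field names and ordering here are illustrative, not the actual WriteBucketProxy layout.

```python
import struct

def pack_share_fields(segment_size, data_size, offsets, fieldstruct=">L"):
    """Illustrative only: pack the per-share size words and offset table
    with a configurable field width (">L" = 4 bytes, ">Q" = 8 bytes)."""
    fields = [segment_size, data_size] + list(offsets)
    return b"".join(struct.pack(fieldstruct, f) for f in fields)

# v1-style packing: every field is capped at 2**32 - 1 (the 12GiB problem)
pack_share_fields(131072, 2 * 2**30, [0, 100, 200], fieldstruct=">L")
# v2-style packing: the same fields, now good up to 2**64 - 1
pack_share_fields(131072, 40 * 2**30, [0, 100, 200], fieldstruct=">Q")
```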
In addition, it would be a great idea to change the variable names in WriteBucketProxy and ReadBucketProxy to avoid this sort of confusion in the future.
comment:6 Changed at 2008-06-03T19:33:00Z by warner
The 5GB file (not 5GiB) downloaded successfully after about 19 hours. Checksums match. So 5GB works in 1.1.
comment:7 Changed at 2008-09-24T13:51:47Z by zooko
I mentioned this ticket as one of the most important-to-me improvements that we could make in the Tahoe code: http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
comment:8 Changed at 2008-12-31T23:50:05Z by zooko
- Resolution set to fixed
- Status changed from new to closed
Fixed by 6c4019ec33e7a253, but not by the expected technique of defining a new server-side share file format with an 8-byte share data size field and teaching the server to read the old format as well as the new one while writing the new one. Instead, it was fixed by using os.path.getsize() rather than the share data size field.
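Roughly what that workaround looks like; the header layout and the per-lease record size below are assumptions for illustration, not taken from the patch:

```python
import os
import struct

def share_data_size(home):
    """Sketch: derive the share data size from the file's size on disk
    instead of trusting the stored (possibly wrapped) 4-byte size field."""
    HEADER_SIZE = 0x0c     # assumed: three 4-byte words (version, size, num_leases)
    LEASE_SIZE = 72        # assumed size of one lease record at the end of the file
    with open(home, "rb") as f:
        version, old_size, num_leases = struct.unpack(">LLL", f.read(HEADER_SIZE))
    filesize = os.path.getsize(home)
    # everything between the header and the trailing leases is share data
    return filesize - HEADER_SIZE - num_leases * LEASE_SIZE
```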
comment:9 Changed at 2009-01-02T22:52:23Z by warner
If the server can now handle larger shares, it needs to advertise this fact or else the clients won't take advantage of it. That means the StorageServer.VERSION dictionary needs to be updated, in particular the "maximum-immutable-share-size" value should be raised.
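Something along these lines, presumably; the protocol-URI key and the surrounding structure are assumptions here, and the point is just that "maximum-immutable-share-size" needs to be raised:

```python
# Sketch of the advertisement described above; other version entries omitted.
VERSION = {
    "http://allmydata.org/tahoe/protocols/storage/v1": {
        # previously bounded by the old 4-byte share-size field (~4GiB);
        # raise it so clients know they may upload larger shares
        "maximum-immutable-share-size": 2**64 - 1,
    },
}
```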
In addition, the part of 6c4019ec33e7a253 which provides backwards-compatibility by writing size mod 2**32 into the old share-size slot smells funny: what is this "4294967295" number? The fact that it is odd (and not even) is suspicious. Why not just use 2**32?
Other than that, nice patch! I'm glad there was enough redundancy left over in the original share-format to allow this trick to work.
We should arrange to do a manual test of uploading a 13GiB file to a private grid and see if it works. I expect that test will take several hours, making it unsuitable for the regular unit test suite, but it would be nice to know it worked at least once.
comment:10 Changed at 2009-01-10T01:57:58Z by warner
c7cd3f38e7b8768d adds the new advertisement, so all that's left is to do a manual test.
comment:11 Changed at 2009-01-13T03:19:58Z by warner
cc50e2f4aa96dd66 adds code to use a WriteBucketProxy_v2 when necessary; the absence of that code was what was causing my manual test to fail. A new test is now running. I estimate it will take about 5 hours to complete, but it seems to have gotten off to a decent start.
comment:12 Changed at 2009-02-10T07:09:10Z by zooko
- Milestone changed from eventually to 1.3.0
This was fixed for 1.3.0.
comment:13 Changed at 2009-12-26T01:42:56Z by zooko
The current limit comes from a 64-bit unsigned field that holds the byte offset of the next data element after the share contents on the storage server's disk. See the implementation and the in-line docs in [src/allmydata/immutable/layout.py@3864].
This means that each individual share is limited to a few bytes less than 2^64^, so the overall file is limited to k*2^64^ (where k is the number of shares). There might be some other limitation that I've forgotten about, but we haven't encountered it in practice, and people have many times uploaded files in excess of 12 GiB.
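For concreteness (k=3 is just an example, and in practice each share tops out a few bytes below 2^64^):

```python
per_share_limit = 2**64            # ceiling implied by the 8-byte offset field
k = 3                              # shares needed to reconstruct the file
overall_limit = k * per_share_limit
print(overall_limit // 2**60)      # -> 48 (EiB) for k=3
```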