Changes between Version 2 and Version 3 of KnownIssues


Timestamp:
2008-06-10T23:27:18Z
Author:
zooko
Comment:

new version

  • KnownIssues

    v2 v3  
    11= Known Issues =
    22
    3 This page describes known problems for recent releases of Tahoe. Issues are
    4 fixed as quickly as possible, however users of older releases may still need
    5 to be aware of these problems until they upgrade to a release which resolves
    6 it.
     3Below is a list of known issues in recent releases of Tahoe, and how to manage
     4them.
    75
    8 == Issues in [Tahoe 1.1 milestone:1.1.0] (not quite released) ==
    96
    10 === Servers which run out of space ===
     7== issues in Tahoe v1.1.0, released 2008-06-10 ==
    118
    12 If a Tahoe storage server runs out of space, writes will fail with an
    13 {{{IOError}}} exception. In some situations, Tahoe-1.1 clients will not react
    14 to this very well:
     9=== issue 1: server out of space when writing mutable file ===
    1510
    16  * If the exception occurs during an immutable-share write, that share will
    17    be broken. The client will detect this, and will declare the upload as
    18    failing if insufficient shares can be placed (this "shares of happiness"
    19    threshold defaults to 7 out of 10). The code does not yet search for new
    20    servers to replace the full ones. If the upload fails, the server's
    21    upload-already-in-progress routines may interfere with a subsequent
    22    upload.
    23  * If the exception occurs during a mutable-share write, the old share will
    24    be left in place (and a new home for the share will be sought). If enough
    25    old shares are left around, subsequent reads may see the file in its
    26    earlier state, known as a "rollback" fault. Writing a new version of the
    27    file should find the newer shares correctly, although it will take
    28    longer (more roundtrips) than usual.
     11If a v1.0 or v1.1.0 storage server runs out of disk space then its attempts to
     12write data to the local filesystem will fail.  For immutable files, this will
     13not lead to any problem (the attempt to upload that share to that server will
     14fail, the partially uploaded share will be deleted from the storage server's
     15"incoming shares" directory, and the client will move on to using another
     16storage server instead).
    2917
    30 The out-of-space handling code is not yet complete, and we do not yet have a
    31 space-limiting solution that is suitable for large storage nodes. The
    32 "sizelimit" configuration uses a /usr/bin/du -style query at node startup,
    33 which takes a long time (tens of minutes) on storage nodes that offer 100GB
    34 or more, making it unsuitable for highly-available servers.
     18If the write was an attempt to modify an existing mutable file, however, a
     19problem will result: when the attempt to write the new share fails due to
     20insufficient disk space, then it will be aborted and the old share will be left
     21in place.  If enough such old shares are left, then a subsequent read may get
     22those old shares and see the file in its earlier state, which is a "rollback"
     23failure.  With the default parameters (3-of-10), six old shares will be enough
     24to potentially lead to a rollback failure.
    3525
    36 In lieu of 'sizelimit', server admins are advised to set the
    37 NODEDIR/readonly_storage (and remove 'sizelimit', and restart their nodes) on
    38 their storage nodes before space is exhausted. This will stop the influx of
    39 immutable shares. Mutable shares will continue to arrive, but since these are
    40 mainly used by directories, the amount of space consumed will be smaller.
     26==== how to manage it ====
    4127
    42 Eventually we will have a better solution for this.
     28Make sure your Tahoe storage servers don't run out of disk space.  This means
     29refusing storage requests before the disk fills up. There are a couple of ways
     30to do that with v1.1.
    4331
    44 == Issues in Tahoe 1.0 ==
     32First, there is a configuration option named "sizelimit" which will cause the
     33storage server to do a "du" style recursive examination of its directories at
     34startup, and then if the sum of the size of files found therein is greater than
     35the "sizelimit" number, it will reject requests by clients to write new
     36immutable shares.
    4537
    46 === Servers which run out of space ===
     38However, that can take a long time (something on the order of a minute of
     39examination of the filesystem for each 10 GB of data stored in the Tahoe
     40server), and the Tahoe server will be unavailable to clients during that time.
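
The "du"-style check described above can be sketched as follows. This is only an illustration of the idea, not Tahoe's actual code: the function names (`directory_usage`, `should_reject_immutable`, `estimated_scan_minutes`) are hypothetical, and the time estimate simply applies the rough "a minute per 10 GB" figure quoted above.

```python
import os
import tempfile


def directory_usage(root):
    """Sum the sizes of all files under root, like a recursive "du"."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def should_reject_immutable(root, sizelimit):
    """True if the space already used exceeds the configured limit (in bytes)."""
    return directory_usage(root) > sizelimit


def estimated_scan_minutes(stored_bytes, minutes_per_10gb=1.0):
    """Rough startup delay, assuming about one minute per 10 GB stored."""
    return stored_bytes / (10 * 10**9) * minutes_per_10gb


if __name__ == "__main__":
    # Small demonstration on a throwaway directory.
    with tempfile.TemporaryDirectory() as storage:
        with open(os.path.join(storage, "share0"), "wb") as f:
            f.write(b"x" * 100)
        print(directory_usage(storage), should_reject_immutable(storage, 50))
        print(estimated_scan_minutes(100 * 10**9))
```

Because the whole tree has to be walked before the answer is known, the cost grows with the amount of data stored, which is why the check is slow on large servers.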
    4741
    48 In addition to the problems described above, Tahoe-1.0 clients which
    49 experience out-of-space errors while writing mutable files are likely to
    50 think the write succeeded, when it in fact failed. This can cause data loss.
     42Another option is to set the "readonly_storage" configuration option on the
     43storage server before startup.  This will cause the storage server to reject
     44all requests to upload new immutable shares.
    5145
    52 === Large Directories or Mutable files in a specific range of sizes ===
     46Note that neither of these configurations affects mutable shares: even if
     47sizelimit is configured and the storage server currently has greater space used
     48than allowed, or even if readonly_storage is configured, servers will continue
     49to accept new mutable shares and will continue to accept requests to overwrite
     50existing mutable shares.
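
As a minimal sketch of the rule just described (not the real server logic), the decision can be written as a small function. The name `accepts_share` and its parameters are hypothetical and chosen only for illustration.

```python
def accepts_share(kind, readonly_storage=False, over_sizelimit=False):
    """Model of whether a v1.1 storage server accepts a share write.

    kind is "immutable" or "mutable". Both configuration options only
    stop new immutable shares; mutable shares are always accepted.
    """
    if kind == "mutable":
        return True
    if kind == "immutable":
        return not (readonly_storage or over_sizelimit)
    raise ValueError("unknown share kind: %r" % (kind,))
```

For example, `accepts_share("mutable", readonly_storage=True)` is still `True`, which is why neither option fully protects against the "rollback" failure once the disk is actually full.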
    5351
    54 A mismatched pair of size limits causes a problem when a client attempts to
    55 upload a large mutable file with a size between 3139275 and 3500000 bytes.
    56 (Mutable files larger than 3.5MB are refused outright). The symptom is very
    57 high memory usage (3GB) and 100% CPU for about 5 minutes. The attempted write
    58 will fail, but the client may think that it succeeded. This size corresponds
    59 to roughly 9000 entries in a directory.
     52Mutable files are typically used only for directories, and are usually much
     53smaller than immutable files, so if you use one of these configurations to stop
     54the influx of immutable files while there is still sufficient disk space to
     55receive an influx of (much smaller) mutable files, you may be able to avoid the
     56potential for "rollback" failure.
    6057
    61 This was fixed in 1.1, as ticket #379. Files up to 3.5MB should now work
    62 properly, and files above that size should be rejected properly. Both servers
    63 and clients must be upgraded to resolve the problem, although once the client
    64 is upgraded to 1.1 the memory usage and false-success problems should be
    65 fixed.
     58A future version of Tahoe will include a fix for this issue.  Here is
     59[http://allmydata.org/pipermail/tahoe-dev/2008-May/000630.html the mailing list
     60discussion] about how that future version will work.
    6661
    67 === pycryptopp compile errors resulting in corruption ===
    6862
    69 Certain combinations of compiler, linker, and pycryptopp versions may cause
    70 corruption errors during decryption, resulting in corrupted plaintext.
     63== issues in Tahoe v1.1.0 and v1.0.0 ==
    7164
     65=== issue 2: pyOpenSSL and/or Twisted defect resulting in false alarms in the unit tests ===
     66
     67The combination of Twisted v8.1.0 and pyOpenSSL v0.7 causes the Tahoe v1.1 unit
     68tests to fail, even though the behavior of Tahoe itself which is being tested is
     69correct.
     70
     71==== how to manage it ====
     72
     73If you are using Twisted v8.1.0 and pyOpenSSL v0.7, then please ignore XYZ in
     74XYZ.  Downgrading to an older version of Twisted or pyOpenSSL will cause those
     75false alarms to stop happening.
     76
     77
     78== issues in Tahoe v1.0.0, released 2008-03-25 ==
     79
     80(Tahoe v1.0 was superseded by v1.1, which was released 2008-06-10.)
     81
     82=== issue 3: server out of space when writing mutable file ===
     83
     84In addition to the problems caused by insufficient disk space described above,
     85v1.0 clients which are writing mutable files when the servers fail to write to
     86their filesystem are likely to think the write succeeded, when it in fact
     87failed. This can cause data loss.
     88
     89==== how to manage it ====
     90
     91Upgrade the client to v1.1, or make sure that servers are always able to write to
     92their local filesystem (including that there is space available) as described in
     93"issue 1" above.
     94
     95
     96=== issue 4: server out of space when writing immutable file ===
     97
     98Tahoe v1.0 clients which are using v1.0 servers that are unable to write to their
     99filesystem during an immutable upload will correctly detect the first failure,
     100but if they retry the upload without restarting the client, or if another client
     101attempts to upload the same file, the second upload may appear to succeed when
     102it hasn't, which can lead to data loss.
     103
     104==== how to manage it ====
     105
     106Upgrading either or both of the client and the server to v1.1 will fix this
     107issue.  It can also be avoided by ensuring that the servers are always able to
     108write to their local filesystem (including that there is space available) as
     109described in "issue 1" above.
     110
     111
     112=== issue 5: large directories or mutable files in a specific range of sizes ===
     113
     114If a client attempts to upload a large mutable file with a size greater than
     115about 3,139,000 and less than or equal to 3,500,000 bytes then it will fail but
     116appear to succeed, which can lead to data loss.
     117
     118(Mutable files larger than 3,500,000 bytes are refused outright.)  The symptom of the
     119failure is very high memory usage (3 GB of memory) and 100% CPU for about 5
     120minutes, before it appears to succeed, although it hasn't.
     121
     122Directories are stored in mutable files, and a directory of approximately 9000
     123entries may fall into this range of mutable file sizes (depending on the size of
     124the filenames or other metadata associated with the entries).
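
The size ranges above can be summarised with a small classifier. This is an illustrative sketch only: the function name is hypothetical, and the lower bound of 3,139,275 bytes is the figure given in the previous revision of this page (the current text says "about 3,139,000"), so treat the exact boundary as approximate.

```python
# Assumed boundaries, in bytes. The lower one is approximate.
LOWER_BOUND = 3139275
UPPER_BOUND = 3500000


def v1_0_mutable_write_outcome(size):
    """Describe what a v1.0 client/server pair does for a mutable file size."""
    if size > UPPER_BOUND:
        return "refused outright"
    if size > LOWER_BOUND:
        return "fails but appears to succeed"
    return "works"
```

A directory whose mutable file falls in the middle range is the dangerous case, because the client gets no indication that the write was lost.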
     125
     126==== how to manage it ====
     127
     128This was fixed in v1.1, under ticket #379.  If the client is upgraded to v1.1,
     129then it will fail cleanly instead of falsely appearing to succeed when it tries
     130to write a file whose size is in this range.  If the server is also upgraded to
     131v1.1, then writes of mutable files whose size is in this range will succeed.
     132(If the server is upgraded to v1.1 but the client is still v1.0 then the client
     133will still suffer this failure.)
     134
     135
     136=== issue 6: pycryptopp defect resulting in data corruption ===
     137
     138Versions of pycryptopp earlier than pycryptopp-0.5.0 had a defect which, when
     139compiled with some compilers, would cause AES-256 encryption and decryption to
     140be computed incorrectly.  This could cause data corruption.  Tahoe v1.0
     141required, and came with a bundled copy of, pycryptopp v0.3.
     142
     143==== how to manage it ====
     144
     145You can detect whether pycryptopp-0.3 has this failure when it is compiled by
     146your compiler.  Run the unit tests that come with pycryptopp-0.3: unpack the
     147"pycryptopp-0.3.tar" file that comes in the Tahoe v1.0 {{{misc/dependencies}}}
     148directory, cd into the resulting {{{pycryptopp-0.3.0}}} directory, and execute
     149{{{python ./setup.py test}}}.  If the tests pass, then your compiler does not
     150trigger this failure.
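
The manual steps above could be automated along these lines. This is a hedged sketch, not a supported tool: the function name and its parameters are hypothetical, and the tarball and directory names are the ones mentioned above. pycryptopp 0.3 dates from 2008, so the `python` argument probably needs to point at an interpreter old enough to build it rather than at the one running this script.

```python
import os
import subprocess
import sys
import tarfile


def pycryptopp_tests_pass(tarball, workdir,
                          unpacked_name="pycryptopp-0.3.0",
                          python=sys.executable):
    """Unpack tarball into workdir, run "setup.py test", report success."""
    with tarfile.open(tarball) as archive:
        archive.extractall(workdir)
    source_dir = os.path.join(workdir, unpacked_name)
    # Equivalent to: cd pycryptopp-0.3.0 && python ./setup.py test
    result = subprocess.run([python, "./setup.py", "test"], cwd=source_dir)
    return result.returncode == 0
```

A call might look like `pycryptopp_tests_pass("misc/dependencies/pycryptopp-0.3.tar", "/tmp/pycryptopp-check")`, where the second path is an arbitrary scratch directory that must already exist. A `True` result means the tests passed with your compiler.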
     151
     152Tahoe v1.1 requires, and comes with a bundled copy of, pycryptopp v0.5.1, which
     153does not have this defect.