If a v1.0 or v1.1.0 storage server runs out of disk space then its attempts to
write data to the local filesystem will fail. For immutable files, this will
not lead to any problem (the attempt to upload that share to that server will
fail, the partially uploaded share will be deleted from the storage server's
"incoming shares" directory, and the client will move on to using another
storage server instead).

If the write was an attempt to modify an existing mutable file, however, a
problem will result: when the attempt to write the new share fails due to
insufficient disk space, it will be aborted and the old share will be left in
place. If enough such old shares are left, then a subsequent read may get
those old shares and see the file in its earlier state, which is a "rollback"
failure. With the default parameters (3-of-10), six old shares will be enough
to potentially lead to a rollback failure. Writing a new version of the file
should find the newer shares correctly, although it will take longer (more
roundtrips) than usual.

The out-of-space handling code is not yet complete, and we do not yet have a
space-limiting solution that is suitable for large storage nodes. The
"sizelimit" configuration uses a /usr/bin/du-style query at node startup,
which takes a long time (tens of minutes) on storage nodes that offer 100GB
or more, making it unsuitable for high-availability servers.
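
The rollback risk can be quantified with a small combinatorial sketch. This
is illustrative Python, not Tahoe code, and it assumes the reader picks its
shares uniformly at random, which simplifies the real client's
server-selection logic:

{{{
#!python
from math import comb

def rollback_probability(k, n_old, n_new):
    # A reader needs any k shares; it sees the old version only if all k
    # shares it happens to use are stale ones left behind by the failed
    # write.
    return comb(n_old, k) / comb(n_old + n_new, k)

# Default 3-of-10 encoding with six stale shares left behind:
print(rollback_probability(k=3, n_old=6, n_new=4))  # 20/120, about 0.17
}}}

So even when four new shares were placed successfully, roughly one such
naive read in six could still assemble only stale shares.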
=== issue 2: pyOpenSSL and/or Twisted defect resulting in false alarms in the unit tests ===

The combination of Twisted v8.1.0 and pyOpenSSL v0.7 causes the Tahoe v1.1
unit tests to fail, even though the Tahoe behavior being tested is itself
correct.

==== how to manage it ====

If you are using Twisted v8.1.0 and pyOpenSSL v0.7, then please ignore XYZ in
XYZ. Downgrading to an older version of either Twisted or pyOpenSSL will stop
those false alarms from happening.


== issues in Tahoe v1.0.0, released 2008-03-25 ==

(Tahoe v1.0 was superseded by v1.1, which was released 2008-06-10.)

=== issue 3: server out of space when writing mutable file ===

In addition to the problems caused by insufficient disk space described above,
when a v1.0 client writes a mutable file and the servers fail to write to
their filesystem, the client is likely to think the write succeeded when it
in fact failed. This can cause data loss.

==== how to manage it ====

Upgrade the client to v1.1, or make sure that the servers are always able to
write to their local filesystem (including that there is space available), as
described in "issue 1" above.
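
One way to follow that advice is to watch free space on each storage server
before it fills up. A minimal operator-side sketch (hypothetical code, not
part of Tahoe; the 1 GiB threshold is an arbitrary assumption to be tuned
per node):

{{{
#!python
import shutil

MIN_FREE_BYTES = 1 * 1024**3  # assumed threshold; tune for your node

def has_headroom(storage_dir):
    # True if the filesystem holding storage_dir still has room to accept
    # new shares without risking out-of-space write failures.
    return shutil.disk_usage(storage_dir).free >= MIN_FREE_BYTES
}}}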


=== issue 4: server out of space when writing immutable file ===

Tahoe v1.0 clients which are using v1.0 servers that are unable to write to
their filesystem during an immutable upload will correctly detect the first
failure, but if they retry the upload without restarting the client, or if
another client attempts to upload the same file, the second upload may appear
to succeed when it hasn't, which can lead to data loss.

==== how to manage it ====

Upgrading either or both of the client and the server to v1.1 will fix this
issue. It can also be avoided by ensuring that the servers are always able to
write to their local filesystem (including that there is space available), as
described in "issue 1" above.


=== issue 5: large directories or mutable files in a specific range of sizes ===

If a client attempts to upload a mutable file with a size greater than about
3,139,000 bytes and less than or equal to 3,500,000 bytes then the upload
will fail but appear to succeed, which can lead to data loss.

(Mutable files larger than 3,500,000 bytes are refused outright.) The symptom
of the failure is very high memory usage (3 GB of memory) and 100% CPU for
about 5 minutes, before it appears to succeed, although it hasn't.

Directories are stored in mutable files, and a directory of approximately
9000 entries may fall into this range of mutable file sizes (depending on the
size of the filenames and other metadata associated with the entries).
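
For operators who cannot upgrade immediately, a client-side pre-check is a
possible stopgap. This is a hypothetical sketch, not part of Tahoe; the
bounds come from the numbers above:

{{{
#!python
DANGER_LOW = 3139000   # approximate exclusive lower bound, per this report
DANGER_HIGH = 3500000  # inclusive upper bound; larger files are refused outright

def mutable_size_is_safe(size_in_bytes):
    # Refuse to attempt mutable writes whose size falls in the range that
    # fails while appearing to succeed on v1.0.
    return not (DANGER_LOW < size_in_bytes <= DANGER_HIGH)

print(mutable_size_is_safe(3200000))  # False: in the danger zone
}}}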
==== how to manage it ====

This was fixed in v1.1, under ticket #379. If the client is upgraded to v1.1,
then it will fail cleanly instead of falsely appearing to succeed when it
tries to write a file whose size is in this range. If the server is also
upgraded to v1.1, then writes of mutable files whose size is in this range
will succeed. (If the server is upgraded to v1.1 but the client is still
v1.0, then the client will still suffer this failure.)


=== issue 6: pycryptopp defect resulting in data corruption ===

Versions of pycryptopp earlier than pycryptopp v0.5.0 had a defect which,
when the library was compiled with certain compilers, caused AES-256
encryption and decryption to be computed incorrectly. This could cause data
corruption. Tahoe v1.0 required, and came with a bundled copy of, pycryptopp
v0.3.

==== how to manage it ====

You can detect whether pycryptopp v0.3 has this failure when it is compiled
by your compiler. Run the unit tests that come with pycryptopp v0.3: unpack
the "pycryptopp-0.3.tar" file that comes in the Tahoe v1.0
{{{misc/dependencies}}} directory, cd into the resulting
{{{pycryptopp-0.3.0}}} directory, and execute {{{python ./setup.py test}}}.
If the tests pass, then your compiler does not trigger this failure.
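
Those unit tests work largely as known-answer tests: they run the primitives
on fixed inputs and compare the output against independently published
values, which is what catches a compiler-induced miscomputation. The same
idea, sketched with a stdlib hash so it runs without pycryptopp installed
(the vector below is the published FIPS 180-2 SHA-256 test value for "abc"):

{{{
#!python
import hashlib

# Published known-answer vector: SHA-256("abc") from FIPS 180-2.
KNOWN_ANSWER = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

def crypto_self_test():
    # If the library (or the compiler that built it) miscomputes the
    # primitive, the digest will not match the published answer.
    return hashlib.sha256(b"abc").hexdigest() == KNOWN_ANSWER

print(crypto_self_test())  # True on a correct build
}}}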
Tahoe v1.1 requires, and comes with a bundled copy of, pycryptopp v0.5.1,
which does not have this defect.