Opened at 2015-02-17T19:45:53Z
Last modified at 2015-02-17T20:54:37Z
#2386 new defect
updating unhealthy MDMF files: likely problem — at Initial Version
Reported by: | warner | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | code-mutable | Version: | 1.10.0 |
Keywords: | | Cc: | |
Launchpad Bug: | | | |
Description
While studying #2034 at today's meeting, we identified a probable design flaw in the way that optimized MDMF updates are supposed to work. The consequence is that, when updating a single block of a large MDMF file, it is necessary to retrieve data from all possible shares. If any shares are unrecoverable, the file cannot be updated correctly.
Ideally, for any mutable file, if you can retrieve 'k' (or more) shares, you should be able to produce new versions of each share that you could retrieve. In addition, the MDMF-specific efficiency goal is that you should only need to fetch O(deltasize) bytes from each share (proportional to the size of the modification you're making, quantized by segment size).
But because of this flaw, the updater must either fetch O(filesize) data (e.g. the whole share) from 'k' shares, or it must fetch a small amount (O(log(filesize))) from all 'N' shares.
We haven't confirmed it yet, but the current implementation probably assumes that it has fetched data from all 'N' shares, and thus updating an "unhealthy" file is likely to fail. Note that repairing the file first should fix this, since after repair, all N shares will be available to the updater. But ideally any change to a mutable file will automatically repair it.
Tahoe's Transverse Hash Trees: A Refresher
When encoding files, Tahoe starts by encrypting the file, then breaking it up into "numsegs" segments (typically 128KiB each). Each segment is then erasure-coded into blocks, one for each of the N shares that the encoding settings dictate (so there will be numsegs*N blocks total). All of the blocks for a given share are concatenated together: this makes up the bulk of the share that will be delivered to a server. The rest of the share contains verification data used to detect corruption of those blocks.
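For concreteness, here is a small sketch of how those counts fall out of the encoding parameters (parameter and function names here are illustrative, not Tahoe's actual identifiers):

```python
import math

# Hypothetical parameters for illustration; Tahoe's defaults are k=3, N=10,
# with 128 KiB segments.
SEGSIZE = 128 * 1024
k, N = 3, 10

def layout(filesize: int):
    """Return (numsegs, total_blocks, blocksize) for a file of `filesize` bytes."""
    numsegs = max(1, math.ceil(filesize / SEGSIZE))
    # each segment is erasure-coded into N blocks of roughly SEGSIZE/k bytes
    blocksize = math.ceil(SEGSIZE / k)
    return numsegs, numsegs * N, blocksize

print(layout(10 * 2**20))   # a 10 MiB file -> (80, 800, 43691)
```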
For each share, the block hash tree is a Merkle hash tree whose leaves are the hashes of each block. The block-hash-tree has numsegs leaves (padded up to a power of two, as usual for Merkle trees). There is one block-hash-tree per share, and each one is different. Each share has a different block-hash-tree-root. The entire block-hash-tree, including its root, is stored in the share.
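A minimal sketch of building such a tree, using plain SHA-256 for brevity (Tahoe's real implementation, allmydata.hashtree, uses tagged SHA-256d hashes):

```python
import hashlib

def h(data: bytes) -> bytes:
    # plain SHA-256 for brevity; Tahoe uses tagged SHA-256d (allmydata.hashtree)
    return hashlib.sha256(data).digest()

def merkle_levels(leaves):
    """Build a Merkle tree bottom-up. Returns a list of levels:
    levels[0] is the (padded) leaf row, levels[-1] is [root]."""
    n = 1
    while n < len(leaves):
        n *= 2
    level = list(leaves) + [h(b'')] * (n - len(leaves))  # pad to a power of two
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

# block-hash-tree for one share: leaves are the hashes of that share's blocks
blocks = [b'block0', b'block1', b'block2']            # numsegs = 3
bht = merkle_levels([h(blk) for blk in blocks])
block_hash_tree_root = bht[-1][0]
```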
The block-hash-tree-roots, one per share, are used as the leaves of the share hash tree. The share-hash-tree has N leaves (padded again). Each share includes the share-hash-tree "Merkle chain", sometimes called the "uncle chain", which is the O(log(N)) subset of hash-tree nodes necessary to validate the leaf (the block-hash-tree-root for that share).
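Extracting the uncle chain is then a walk from the leaf toward the root, collecting the sibling at each level. Continuing the sketch above (the share-hash-tree case just uses the N block-hash-tree-roots as its leaves):

```python
def uncle_chain(levels, leafnum):
    """Return the O(log N) sibling hashes needed to validate leaf `leafnum`
    against the root."""
    chain = []
    i = leafnum
    for level in levels[:-1]:       # every level except the root
        chain.append(level[i ^ 1])  # sibling at this level
        i //= 2
    return chain

# hypothetical share-hash-tree over N=4 block-hash-tree-roots
roots = [h(b'root%d' % i) for i in range(4)]
sht = merkle_levels(roots)
chain_for_sh2 = uncle_chain(sht, 2)   # stored inside share 2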
In immutable files, the share-hash-tree-root is included in the "UEB" (URI Extension Block), which is replicated into every share. The hash of the UEB is put into the filecap. With this UEB hash, you can verify every bit of every block of every share, and thus detect file corruption (and use an alternate share). For mutable files, the share-hash-tree-root is included in the RSA-signed block of data (next to the sequence number), providing the same kind of integrity checking. This signed block is replicated into every share.
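Verification recombines a leaf with its uncle chain and compares the result to the trusted root (signed, or reached via the UEB); continuing the sketch:

```python
def validate_leaf(leaf, leafnum, chain, expected_root):
    """Hash a leaf together with its uncle chain; True iff it reproduces
    the trusted root."""
    node, i = leaf, leafnum
    for sibling in chain:
        node = h(sibling + node) if i % 2 else h(node + sibling)
        i //= 2
    return node == expected_root

assert validate_leaf(roots[2], 2, chain_for_sh2, sht[-1][0])
```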
Imagine the file as a line of data sitting on a table, extending sideways into the X axis. Chop it up into segments. Now erasure-code each segment into blocks, which extend sideways into the Y axis. Each share is a line in the X axis, sitting on the table, paralleling the original file. Each block-hash-tree is a pointy-on-top triangle that lives in the X+Z plane, extending upwards from the blocks that make up the share, culminating in the block-hash-tree-root. The share-hash-tree triangle lives above that in the Y+Z plane: its leaves are the block-hash-tree-roots, and it culminates in the share-hash-tree-root.
Rebuilding the Hash Trees During Update
MDMF encrypts each segment independently, with a new salt. So any modification to the file results in at least one encrypted segment being modified, which means every share has a modified block, which means the chain of hashes from that block's leaf up to the root changes in every block-hash-tree, every block-hash-tree-root is changed, and so the entire share-hash-tree is changed, finally leading to a new share-hash-tree-root.
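A rough sketch of why that ripple happens (the KDF and cipher below are stand-ins, not MDMF's actual AES-CTR construction): each write picks a fresh salt, so re-encrypting a segment changes its ciphertext, and thus its blocks, even when most of the plaintext is unchanged:

```python
import os, hashlib

def encrypt_segment(datakey: bytes, segment_plaintext: bytes):
    """Stand-in for MDMF segment encryption: fresh random salt, per-segment
    key derived from (salt, datakey), SHA-256-based keystream instead of
    Tahoe's real AES-CTR."""
    salt = os.urandom(16)
    segkey = hashlib.sha256(salt + datakey).digest()
    keystream = b''
    counter = 0
    while len(keystream) < len(segment_plaintext):
        keystream += hashlib.sha256(segkey + counter.to_bytes(8, 'big')).digest()
        counter += 1
    ciphertext = bytes(p ^ s for p, s in zip(segment_plaintext, keystream))
    return salt, ciphertext
```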
MDMF's performance goal is to only require O(deltasize) bandwidth for making file modifications. The main work is to read-decrypt-modify-encrypt-write the handful of segments that cover the plaintext being changed. That part is fine. However, we also have to publish new integrity-checking data that matches the modified blocks.
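Which segments go through that read-decrypt-modify-encrypt-write cycle is simple arithmetic over the modified byte range, e.g.:

```python
SEGSIZE = 128 * 1024

def affected_segments(offset: int, length: int) -> range:
    """Segments that must be read-decrypt-modify-encrypt-written for a
    change of `length` bytes starting at `offset`."""
    first = offset // SEGSIZE
    last = (offset + length - 1) // SEGSIZE
    return range(first, last + 1)

# e.g. a 10-byte edit at offset 1_000_000 touches only segment 7
assert list(affected_segments(1_000_000, 10)) == [7]
```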
Consider e.g. share number 2. The mutable-modify code can download the entire block-hash-tree from sh2 without retrieving the blocks themselves. For each modified block, it can update the corresponding block-hash-tree leaf and recompute the tree. This updated tree (which will only differ along the paths from the modified leaves to the root) can be published along with the rest of the share. So far, so good.
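Continuing the Merkle sketch above: recomputing the tree after a leaf change only touches the hashes on that leaf's path to the root:

```python
def update_leaf(levels, leafnum, new_leaf_hash):
    """Replace one leaf and recompute only the path from it to the root;
    all other nodes are untouched. Returns the new root."""
    levels[0][leafnum] = new_leaf_hash
    i = leafnum
    for depth in range(len(levels) - 1):
        parent = i // 2
        left = levels[depth][2 * parent]
        right = levels[depth][2 * parent + 1]
        levels[depth + 1][parent] = h(left + right)
        i = parent
    return levels[-1][0]

new_block_hash_tree_root = update_leaf(bht, 1, h(b'block1-modified'))
```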
But to rebuild the share-hash-tree, we need the block-hash-tree-root from all N shares. And since we're modifying a block (in every share), we're recomputing all the block-hash-trees. So, to be specific, we need the block-hash-tree uncle chain for the modified leaf from all N shares. This is N*O(log(filesize)) data, and meets our performance goals.
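As a back-of-the-envelope example: a 1 GiB file with 128 KiB segments has 8192 leaves per block-hash-tree, so each uncle chain is 13 hashes, about 416 bytes with 32-byte hashes; across N=10 shares that is roughly 4 KiB of hash data for the whole update, independent of file size.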
But this only works in the ideal case, where we were able to locate all N shares. We want this case anyway, because it means we can probably update all N shares, which gives us maximum redundancy and doesn't leave any old versions lying around (where they might confuse somebody into retrieving stale data).
If the file wasn't completely healthy, we might not be able to get all N shares. Retrieving merely 'k' shares is enough to read the file, and it would be nice if that were also enough to modify the file.
Options
- if modify() cannot find all N shares, perform a repair first (to get N shares), then try again
- if modify() cannot find all N shares, download the entire file, re-encode the missing blocks, and rebuild the block-hash-trees for all shares
- switch to a new encoding scheme that rotates the transverse trees (new ticket to be added describing this)
For now, we will probably just add a precondition that asserts we've seen block-hash-tree data from all N shares during modify, so that modifying an unhealthy file will throw a sensible error. Our concern is that the code might assume it has N leaves when it actually has fewer, and could produce corrupt share-hash-trees or something unpleasant.
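A minimal sketch of such a precondition (names here are hypothetical, not the actual publish-path code):

```python
class UnhealthyFileError(Exception):
    pass

def check_bht_coverage(found_shnums, total_shares):
    """Proposed guard: refuse to rebuild the share-hash-tree unless
    block-hash-tree data was seen for all N shares, rather than silently
    building a tree with missing leaves."""
    missing = sorted(set(range(total_shares)) - set(found_shnums))
    if missing:
        raise UnhealthyFileError(
            "cannot update unhealthy MDMF file: no block-hash-tree data "
            "for shares %s; repair the file first" % missing)
```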
Triggering a repair is probably the easiest short-term fix. Downloading the entire file would be a drag, and a significant architectural change. A new encoding scheme would be a lot of work, and would probably only fix this one problem.