[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2386: updating unhealthy MDMF files: likely problem

Tue Feb 17 20:54:38 UTC 2015

#2386: updating unhealthy MDMF files: likely problem
------------------------------+-----------------------
     Reporter:  warner        |      Owner:
         Type:  defect        |     Status:  new
     Priority:  normal        |  Milestone:  undecided
    Component:  code-mutable  |    Version:  1.10.0
   Resolution:                |   Keywords:
Launchpad Bug:                |
------------------------------+-----------------------

Old description:

> While studying #2034 at today's meeting, we identified a probable design
> flaw in the way that optimized MDMF updates are supposed to work. The
> consequence is that, when updating a single block of large MDMF file, it
> is necessary to retrieve data from all possible shares. If any shares are
> unrecoverable, the file cannot be updated correctly.
>
> Ideally, for any mutable file, if you can retrieve 'k' (or more) shares,
> you should be able to produce new versions of each share that you could
> retrieve. In addition, the MDMF-specific efficiency goal is you should
> only need to fetch `O(deltasize)` bytes from each share (proportional to
> the size of the modification you're making, quantized by segment size).
>
> But because of this flaw, the updater must either fetch `O(filesize)`
> data (e.g. the whole share) from 'k' shares, or it must fetch a small
> amount (`O(log(filesize))`) from all 'N' shares.
>
> We haven't confirmed it yet, but the current implementation probably
> assumes that it has fetched data from all 'N' shares, and thus updating
> an "unhealthy" file is likely to fail. Note that repairing the file first
> should fix this, since after repair, all N shares will be available to
> the updater. But ideally any change to a mutable file will automatically
> repair it.
>

> == Tahoe's Transverse Hash Trees: A Refresher ==
>
> When encoding files, Tahoe starts by encrypting the file, then breaking
> it up into "numsegs" '''segments''' (typically 128KiB each). Each segment
> is then erasure-coded into '''blocks''', one for each of the N shares
> that the encoding settings dictate (so there will be `numsegs*N` blocks
> total). All of the blocks for a given share are concatenated together:
> this makes up the bulk of the share that will be delivered to a server.
> The rest of the share contains verification data used to detect
> corruption of those blocks.
>
> For each share, the '''block hash tree''' is a Merkle hash tree whose
> leaves are the hashes of each block. The block-hash-tree has `numsegs`
> leaves (padded up to a power of two, as usual for Merkle trees). There is
> one block-hash-tree per share, and each one is different. Each share has
> a different '''block-hash-tree-root'''. The entire block-hash-tree,
> including its root, is stored in the share.
>
> The block-hash-tree-roots, one per share, are used as the leaves of the
> '''share hash tree'''. The share-hash-tree has `N` leaves (padded again).
> Each share includes the share-hash-tree "Merkle chain", sometimes called
> the "uncle chain", which is the `O(log(N))` subset of hash-tree nodes
> necessary to validate the leaf (the block-hash-tree-root for that share).
>
> In immutable files, the '''share-hash-tree-root''' is included in the
> "UEB" (URI Extension Block), which is replicated into every share. The
> hash of the UEB is put into the filecap. With this UEB hash, you can
> verify every bit of every block of every share, and thus detect file
> corruption (and use an alternate share). For mutable files, the share-
> hash-tree-root is included in the RSA-signed block of data (next to the
> sequence number), providing the same kind of integrity checking. This
> signed block is replicated into every share.
>
> Imagine the file as a line of data sitting on a table, extending sideways
> into the X axis. Chop it up into segments. Now erasure-code each segment
> into blocks, which extend sideways into the Y axis. Each share is a line
> in the X axis, sitting on the table, paralleling the original file. Each
> block-hash-tree is a pointy-on-top triangle that lives in the X+Z plane,
> extending upwards from the blocks that make up the share, culminating in
> the block-hash-tree-root. The share-hash-tree triangle lives above that
> in the Y+Z plane: its leaves are the block-hash-tree-roots, and it
> culminates in the share-hash-tree-root.
>
> == Rebuilding the Hash Trees During Update ==
>
> MDMF encrypts each segment independently, with a new salt. So any
> modification to the file results in at least one encrypted segment being
> modified, which means every share has a modified block, which means the
> uncle chain in every block-hash-tree is changed, and every block-hash-
> tree-root is changed, so the entire share-hash-tree is changed, finally
> leading to a new share-hash-tree-root.
>
> MDMF's performance goal is to only require `O(deltasize)` bandwidth for
> making file modifications. The main work is to read-decrypt-modify-
> encrypt-write the handful of segments that cover the plaintext being
> changed. That part is fine. However, we also have to publish new
> integrity-checking data that matches the modified blocks.
>
> Given e.g. share number 2. The mutable-modify code can download the
> entire block-hash-tree from sh2 without retrieving the blocks themselves.
> For each modified block, it can update the corresponding block-hash-tree
> leaves, and recompute the tree. This updated tree (which will only differ
> along the uncle chain) can be published along with the rest of the share.
> So far, so good.
>
> But to rebuild the share-hash-tree, we need the block-hash-tree-root from
> ''all N shares''. And since we're modifying a block (in every share),
> we're recomputing all the block-hash-trees. So, to be specific, we need
> the block-hash-tree uncle-chain for all N shares. This is
> `N*O(ln(filesize))` data, and meets our performance goals.
>
> But this only works in the ideal case, where we were able to locate all N
> shares. We want this case anyways, because it means we can probably
> ''update'' all N shares, which gives us maximum redundancy and doesn't
> leave any old versions lying around (where they might confuse somebody
> into retrieving stale data).
>
> If the file wasn't completely healthy, we might not be able to get all N
> shares. Retrieving merely 'k' shares is enough to read the file, and it
> would be nice if that were also enough to modify the file.
>
> == Options ==
>
> * if `modify()` cannot find all N shares, perform a repair first (to get
> N shares), then try again
> * if `modify()` cannot find all N shares, download the entire file, re-
> encode the missing blocks, and rebuild the block-hash-trees for all
> shares
> * switch to a new encoding scheme that rotates the transverse trees (new
> ticket to be added describing this)
>
> For now, we will probably just add a precondition that asserts we've seen
> block-hash-tree data from all N shares during modify, so that modifying
> an unhealthy file will throw a sensible error. Our concern is that the
> code might assume it has N leaves when it actually has fewer, and could
> produce corrupt share-hash-trees or something unpleasant.
>
> Triggering a repair is probably the easiest short-term fix. Downloading
> the entire file would be a drag, and a significant architectural change.
> A new encoding scheme would be a lot of work, and would probably only fix
> this one problems.

New description:

 While studying #2034 at today's meeting, we identified a probable design
 flaw in the way that optimized MDMF updates are supposed to work. The
 consequence is that, when updating a single block of large MDMF file, it
 is necessary to retrieve data from all possible shares. If any shares are
 unrecoverable, the file cannot be updated correctly.

 Ideally, for any mutable file, if you can retrieve 'k' (or more) shares,
 you should be able to produce new versions of each share that you could
 retrieve. In addition, the MDMF-specific efficiency goal is you should
 only need to fetch `O(deltasize)` bytes from each share (proportional to
 the size of the modification you're making, quantized by segment size).

 But because of this flaw, the updater must either fetch `O(filesize)` data
 (e.g. the whole share) from 'k' shares, or it must fetch a small amount
 (`O(log(filesize))`) from all 'N' shares.

 We haven't confirmed it yet, but the current implementation probably
 assumes that it has fetched data from all 'N' shares, and thus updating an
 "unhealthy" file is likely to fail. Note that repairing the file first
 should fix this, since after repair, all N shares will be available to the
 updater. But ideally any change to a mutable file will automatically
 repair it.

 == Tahoe's Transverse Hash Trees: A Refresher ==

 When encoding files, Tahoe starts by encrypting the file, then breaking it
 up into "numsegs" '''segments''' (typically 128KiB each). Each segment is
 then erasure-coded into '''blocks''', one for each of the N shares that
 the encoding settings dictate (so there will be `numsegs*N` blocks total).
 All of the blocks for a given share are concatenated together: this makes
 up the bulk of the share that will be delivered to a server. The rest of
 the share contains verification data used to detect corruption of those
 blocks.

 For each share, the '''block hash tree''' is a Merkle hash tree whose
 leaves are the hashes of each block. The block-hash-tree has `numsegs`
 leaves (padded up to a power of two, as usual for Merkle trees). There is
 one block-hash-tree per share, and each one is different. Each share has a
 different '''block-hash-tree-root'''. The entire block-hash-tree,
 including its root, is stored in the share.

 The block-hash-tree-roots, one per share, are used as the leaves of the
 '''share hash tree'''. The share-hash-tree has `N` leaves (padded again).
 Each share includes the share-hash-tree "Merkle chain", sometimes called
 the "uncle chain", which is the `O(log(N))` subset of hash-tree nodes
 necessary to validate the leaf (the block-hash-tree-root for that share).

 In immutable files, the '''share-hash-tree-root''' is included in the
 "UEB" (URI Extension Block), which is replicated into every share. The
 hash of the UEB is put into the filecap. With this UEB hash, you can
 verify every bit of every block of every share, and thus detect file
 corruption (and use an alternate share). For mutable files, the share-
 hash-tree-root is included in the RSA-signed block of data (next to the
 sequence number), providing the same kind of integrity checking. This
 signed block is replicated into every share.

 Imagine the file as a line of data sitting on a table, extending sideways
 into the X axis. Chop it up into segments. Now erasure-code each segment
 into blocks, which extend sideways into the Y axis. Each share is a line
 in the X axis, sitting on the table, paralleling the original file. Each
 block-hash-tree is a pointy-on-top triangle that lives in the X+Z plane,
 extending upwards from the blocks that make up the share, culminating in
 the block-hash-tree-root. The share-hash-tree triangle lives above that in
 the Y+Z plane: its leaves are the block-hash-tree-roots, and it culminates
 in the share-hash-tree-root.

 == Rebuilding the Hash Trees During Update ==

 MDMF encrypts each segment independently, with a new salt. So any
 modification to the file results in at least one encrypted segment being
 modified, which means every share has a modified block, which means the
 uncle chain in every block-hash-tree is changed, and every block-hash-
 tree-root is changed, so the entire share-hash-tree is changed, finally
 leading to a new share-hash-tree-root.

 MDMF's performance goal is to only require `O(deltasize)` bandwidth for
 making file modifications. The main work is to read-decrypt-modify-
 encrypt-write the handful of segments that cover the plaintext being
 changed. That part is fine. However, we also have to publish new
 integrity-checking data that matches the modified blocks.

 Given e.g. share number 2. The mutable-modify code can download the entire
 block-hash-tree from sh2 without retrieving the blocks themselves. For
 each modified block, it can update the corresponding block-hash-tree
 leaves, and recompute the tree. This updated tree (which will only differ
 along the uncle chain) can be published along with the rest of the share.
 So far, so good.

 But to rebuild the share-hash-tree, we need the block-hash-tree-root from
 ''all N shares''. And since we're modifying a block (in every share),
 we're recomputing all the block-hash-trees. So, to be specific, we need
 the block-hash-tree uncle-chain for all N shares. This is
 `N*O(ln(filesize))` data, and meets our performance goals.

 But this only works in the ideal case, where we were able to locate all N
 shares. We want this case anyways, because it means we can probably
 ''update'' all N shares, which gives us maximum redundancy and doesn't
 leave any old versions lying around (where they might confuse somebody
 into retrieving stale data).

 If the file wasn't completely healthy, we might not be able to get all N
 shares. Retrieving merely 'k' shares is enough to read the file, and it
 would be nice if that were also enough to modify the file.

 == Options ==

 * if `modify()` cannot find all N shares, perform a repair first (to get N
 shares), then try again
 * if `modify()` cannot find all N shares, download the entire file, re-
 encode the missing blocks, and rebuild the block-hash-trees for all shares
 * switch to a new encoding scheme that rotates the transverse trees
 (#2387)

 For now, we will probably just add a precondition that asserts we've seen
 block-hash-tree data from all N shares during modify, so that modifying an
 unhealthy file will throw a sensible error. Our concern is that the code
 might assume it has N leaves when it actually has fewer, and could produce
 corrupt share-hash-trees or something unpleasant.

 Triggering a repair is probably the easiest short-term fix. Downloading
 the entire file would be a drag, and a significant architectural change. A
 new encoding scheme would be a lot of work, and would probably only fix
 this one problems.

--

Comment (by warner):

 add link to transverse-block-hash-tree ticket #2387

--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2386#comment:1>
Tahoe-LAFS <https://Tahoe-LAFS.org>
secure decentralized storage