[tahoe-lafs-trac-stream] [Tahoe-LAFS] #2386: updating unhealthy MDMF files: likely problem
Tahoe-LAFS
trac at tahoe-lafs.org
Tue Feb 17 20:54:38 UTC 2015
#2386: updating unhealthy MDMF files: likely problem
------------------------------+-----------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: normal | Milestone: undecided
Component: code-mutable | Version: 1.10.0
Resolution: | Keywords:
Launchpad Bug: |
------------------------------+-----------------------
New description:
While studying #2034 at today's meeting, we identified a probable design
flaw in the way that optimized MDMF updates are supposed to work. The
consequence is that, when updating a single block of a large MDMF file, it
is necessary to retrieve data from all possible shares. If any shares are
unrecoverable, the file cannot be updated correctly.
Ideally, for any mutable file, if you can retrieve 'k' (or more) shares,
you should be able to produce new versions of each share that you could
retrieve. In addition, the MDMF-specific efficiency goal is that you should
only need to fetch `O(deltasize)` bytes from each share (proportional to
the size of the modification you're making, quantized by segment size).
But because of this flaw, the updater must either fetch `O(filesize)` data
(e.g. the whole share) from 'k' shares, or it must fetch a small amount
(`O(log(filesize))`) from all 'N' shares.
We haven't confirmed it yet, but the current implementation probably
assumes that it has fetched data from all 'N' shares, and thus updating an
"unhealthy" file is likely to fail. Note that repairing the file first
should fix this, since after repair, all N shares will be available to the
updater. But ideally any change to a mutable file will automatically
repair it.
== Tahoe's Transverse Hash Trees: A Refresher ==
When encoding files, Tahoe starts by encrypting the file, then breaking it
up into "numsegs" '''segments''' (typically 128KiB each). Each segment is
then erasure-coded into '''blocks''', one for each of the N shares that
the encoding settings dictate (so there will be `numsegs*N` blocks total).
All of the blocks for a given share are concatenated together: this makes
up the bulk of the share that will be delivered to a server. The rest of
the share contains verification data used to detect corruption of those
blocks.
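To make the layout arithmetic concrete, here is a tiny sketch (parameters are illustrative; k=3, N=10 are Tahoe's usual defaults, and 128KiB is the typical segment size mentioned above):
{{{#!python
import math

k, N = 3, 10                 # erasure-coding parameters (illustrative defaults)
SEGSIZE = 128 * 1024         # typical segment size

def layout(filesize):
    numsegs = max(1, math.ceil(filesize / SEGSIZE))
    blocksize = math.ceil(SEGSIZE / k)   # each block is ~1/k of a segment
    total_blocks = numsegs * N           # one block per segment per share
    share_bulk = numsegs * blocksize     # the concatenated blocks of one share
    return numsegs, total_blocks, share_bulk

# A 1GiB file: 8192 segments, 81920 blocks, ~341MiB of block data per share.
print(layout(1 << 30))
}}}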
For each share, the '''block hash tree''' is a Merkle hash tree whose
leaves are the hashes of each block. The block-hash-tree has `numsegs`
leaves (padded up to a power of two, as usual for Merkle trees). There is
one block-hash-tree per share, and each one is different. Each share has a
different '''block-hash-tree-root'''. The entire block-hash-tree,
including its root, is stored in the share.
The block-hash-tree-roots, one per share, are used as the leaves of the
'''share hash tree'''. The share-hash-tree has `N` leaves (padded again).
Each share includes the share-hash-tree "Merkle chain", sometimes called
the "uncle chain", which is the `O(log(N))` subset of hash-tree nodes
necessary to validate the leaf (the block-hash-tree-root for that share).
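Here is a minimal self-contained sketch of both trees and the uncle chain (toy hashing and padding; the real implementation lives in `allmydata.hashtree` and uses tagged SHA-256d hashes):
{{{#!python
from hashlib import sha256

def h(data):
    return sha256(data).digest()

def pad_to_pow2(leaves):
    n = 1
    while n < len(leaves):
        n *= 2
    return leaves + [h(b"pad")] * (n - len(leaves))  # toy filler leaf

def build_tree(leaves):
    """Return all tree levels, from the leaf level up to [root]."""
    levels = [pad_to_pow2(leaves)]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        levels.append([h(lvl[i] + lvl[i + 1]) for i in range(0, len(lvl), 2)])
    return levels

def uncle_chain(levels, leafnum):
    """The O(log n) sibling hashes needed to validate one leaf."""
    chain, i = [], leafnum
    for lvl in levels[:-1]:        # every level except the root
        chain.append(lvl[i ^ 1])   # the sibling at this level
        i //= 2
    return chain

# One block-hash-tree per share: its leaves hash that share's blocks.
blocks_sh2 = [b"block%d" % seg for seg in range(5)]   # toy share, 5 segments
bht_levels = build_tree([h(b) for b in blocks_sh2])
bht_root_sh2 = bht_levels[-1][0]

# The share-hash-tree: its N leaves are the block-hash-tree-roots.
N = 10
bht_roots = [bht_root_sh2] + [h(b"other-share-%d" % i) for i in range(1, N)]
sht_levels = build_tree(bht_roots)
sht_root = sht_levels[-1][0]       # this is the value that gets signed
print(len(uncle_chain(sht_levels, 0)))   # 4 hashes: log2(16), after padding
}}}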
In immutable files, the '''share-hash-tree-root''' is included in the
"UEB" (URI Extension Block), which is replicated into every share. The
hash of the UEB is put into the filecap. With this UEB hash, you can
verify every bit of every block of every share, and thus detect file
corruption (and use an alternate share). For mutable files, the share-
hash-tree-root is included in the RSA-signed block of data (next to the
sequence number), providing the same kind of integrity checking. This
signed block is replicated into every share.
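With those helpers, checking one share's block-hash-tree-root against the trusted (signed, or UEB-derived) root is a short climb up the uncle chain, continuing the sketch above (real hashes are tagged, but the shape is the same):
{{{#!python
def verify_leaf(leaf, leafnum, chain, trusted_root):
    """Hash upward along the uncle chain; accept iff we land on the root."""
    node, i = leaf, leafnum
    for sibling in chain:
        # even index: we are the left child; odd index: the right child
        node = h(node + sibling) if i % 2 == 0 else h(sibling + node)
        i //= 2
    return node == trusted_root

chain = uncle_chain(sht_levels, 0)
print(verify_leaf(bht_root_sh2, 0, chain, sht_root))   # True
}}}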
Imagine the file as a line of data sitting on a table, extending sideways
into the X axis. Chop it up into segments. Now erasure-code each segment
into blocks, which extend sideways into the Y axis. Each share is a line
in the X axis, sitting on the table, paralleling the original file. Each
block-hash-tree is a pointy-on-top triangle that lives in the X+Z plane,
extending upwards from the blocks that make up the share, culminating in
the block-hash-tree-root. The share-hash-tree triangle lives above that in
the Y+Z plane: its leaves are the block-hash-tree-roots, and it culminates
in the share-hash-tree-root.
== Rebuilding the Hash Trees During Update ==
MDMF encrypts each segment independently, with a new salt. So any
modification to the file results in at least one encrypted segment being
modified, which means every share has a modified block, which means the
uncle chain in every block-hash-tree is changed, and every block-hash-
tree-root is changed, so the entire share-hash-tree is changed, finally
leading to a new share-hash-tree-root.
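To see why one write forces new blocks everywhere, here is a sketch of per-segment encryption (the KDF and stream cipher below are stdlib stand-ins, not Tahoe's actual tagged-hash KDF and AES-CTR):
{{{#!python
import os
from hashlib import sha256

def segment_key(salt, datakey):
    # Stand-in KDF: the real per-segment key derivation is a tagged hash.
    return sha256(salt + datakey).digest()

def toy_stream_encrypt(key, plaintext):
    # Illustrative hash-based keystream, standing in for AES-CTR.
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(plaintext, stream))

datakey = os.urandom(16)
segment = b"some plaintext segment"

# Each publish of a segment draws a *fresh* salt, so even republishing
# identical plaintext yields new ciphertext, hence new blocks in every
# share, hence new block hashes, uncle chains, and tree roots.
ct1 = toy_stream_encrypt(segment_key(os.urandom(16), datakey), segment)
ct2 = toy_stream_encrypt(segment_key(os.urandom(16), datakey), segment)
print(ct1 != ct2)   # True (with overwhelming probability)
}}}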
MDMF's performance goal is to only require `O(deltasize)` bandwidth for
making file modifications. The main work is to read-decrypt-modify-
encrypt-write the handful of segments that cover the plaintext being
changed. That part is fine. However, we also have to publish new
integrity-checking data that matches the modified blocks.
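The "handful of segments" covering a given byte-range edit is simple to compute (sketch):
{{{#!python
SEGSIZE = 128 * 1024

def touched_segments(offset, length):
    """Segment numbers a write of `length` bytes at `offset` falls into."""
    first = offset // SEGSIZE
    last = (offset + length - 1) // SEGSIZE
    return range(first, last + 1)

# A 100-byte edit in the middle of a 1GiB file touches one segment, so only
# that segment is read, decrypted, modified, re-encrypted, and re-encoded:
# O(deltasize) work, not O(filesize).
print(list(touched_segments(500 * 1024 * 1024, 100)))   # [4000]
}}}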
Consider, e.g., share number 2. The mutable-modify code can download the entire
block-hash-tree from sh2 without retrieving the blocks themselves. For
each modified block, it can update the corresponding block-hash-tree
leaves, and recompute the tree. This updated tree (which will only differ
along the uncle chain) can be published along with the rest of the share.
So far, so good.
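Continuing the Merkle sketch above: swap one leaf, rebuild, and diff the trees, and only the leaf-to-root path differs, which is exactly the region that must be re-published:
{{{#!python
# (reuses h() and build_tree() from the earlier sketch)
def changed_nodes(old_levels, new_levels):
    """(level, index) of every node that differs between two trees."""
    diffs = []
    for depth, (old, new) in enumerate(zip(old_levels, new_levels)):
        diffs += [(depth, i) for i in range(len(old)) if old[i] != new[i]]
    return diffs

leaves = [h(b"block%d" % i) for i in range(8)]
old_levels = build_tree(leaves)
leaves[5] = h(b"block5-modified")     # one block of this share changed
new_levels = build_tree(leaves)

# Only the path from leaf 5 to the root changes: 4 nodes for 8 leaves.
print(changed_nodes(old_levels, new_levels))
# [(0, 5), (1, 2), (2, 1), (3, 0)]
}}}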
But to rebuild the share-hash-tree, we need the block-hash-tree-root from
''all N shares''. And since we're modifying a block (in every share),
we're recomputing all the block-hash-trees. So, to be specific, we need
the block-hash-tree uncle-chain for all N shares. This is
`N*O(log(filesize))` data, and meets our performance goals.
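Back-of-the-envelope, assuming 32-byte hashes (numbers illustrative):
{{{#!python
import math

def integrity_fetch_bytes(filesize, segsize=128 * 1024, N=10, hashsize=32):
    """Rough cost of fetching the block-hash-tree uncle chain from N shares."""
    numsegs = max(1, math.ceil(filesize / segsize))
    chain_len = max(1, math.ceil(math.log2(numsegs)))
    return N * chain_len * hashsize

# 1GiB file: 8192 segments, 13-hash chains, 10 shares -> 4160 bytes total,
# versus ~1GiB to download and re-encode everything from scratch.
print(integrity_fetch_bytes(1 << 30))
}}}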
But this only works in the ideal case, where we were able to locate all N
shares. We want this case anyway, because it means we can probably
''update'' all N shares, which gives us maximum redundancy and doesn't
leave any old versions lying around (where they might confuse somebody
into retrieving stale data).
If the file wasn't completely healthy, we might not be able to get all N
shares. Retrieving merely 'k' shares is enough to read the file, and it
would be nice if that were also enough to modify the file.
== Options ==
* if `modify()` cannot find all N shares, perform a repair first (to get N
shares), then try again
* if `modify()` cannot find all N shares, download the entire file, re-
encode the missing blocks, and rebuild the block-hash-trees for all shares
* switch to a new encoding scheme that rotates the transverse trees
(#2387)
For now, we will probably just add a precondition that asserts we've seen
block-hash-tree data from all N shares during modify, so that modifying an
unhealthy file will throw a sensible error. Our concern is that the code
might assume it has N leaves when it actually has fewer, and could produce
corrupt share-hash-trees or something unpleasant.
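Such a precondition might look like the following (hypothetical names, sketching the intent rather than the real publish-path code):
{{{#!python
class NotEnoughSharesError(Exception):
    pass

def require_all_bht_roots(known_bht_roots, total_shares):
    """Refuse to rebuild the share-hash-tree from a partial set of leaves.

    `known_bht_roots` maps share number -> block-hash-tree root. Rebuilding
    with fewer than `total_shares` leaves would silently yield a
    share-hash-tree that fails verification against the signed root.
    """
    missing = set(range(total_shares)) - set(known_bht_roots)
    if missing:
        raise NotEnoughSharesError(
            "MDMF update needs block-hash-tree data from all %d shares; "
            "missing %s -- repair the file and retry"
            % (total_shares, sorted(missing)))
}}}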
Triggering a repair is probably the easiest short-term fix. Downloading
the entire file would be a drag, and a significant architectural change. A
new encoding scheme would be a lot of work, and would probably only fix
this one problem.
--
Comment (by warner):
add link to transverse-block-hash-tree ticket #2387
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2386#comment:1>