#819 closed defect (fixed)

allmydata.test.test_repairer.Verifier.test_corrupt_crypttext_hashtree failed

Reported by: zooko Owned by: warner
Priority: critical Milestone: undecided
Component: code Version: 1.5.0
Keywords: integrity Cc:
Launchpad Bug:

Description

http://allmydata.org/buildbot/builders/hardy-amd64/builds/295/steps/test/logs/stdio

The unit test corrupted a share and the verifier failed to notice that the share was corrupted.

That test corrupts the share in a random way each time -- it invokes _corrupt_crypttext_hash_tree() which calls corrupt_field() which corrupts the ciphertext hash tree in a randomly chosen way. Unfortunately it doesn't log the way that it chose to corrupt it or the random seed that it used so to reproduce this we'll probably have to run with the trial --until-failure option or something. Hm, I see that there is a debug flag that you can pass to corrupt_field() to get it to log what it is doing. I'll try that.

Change History (11)

comment:1 Changed at 2009-10-26T21:52:27Z by zooko

  • Status changed from new to assigned

comment:2 Changed at 2009-12-06T00:26:04Z by davidsarah

  • Priority changed from major to critical

Seems important to get to the bottom of this; it could be exploitable if there is some predictable set of possible corruptions that are not detected.

comment:3 Changed at 2009-12-06T01:11:38Z by zooko

I edited my test suite to print out the random seed used in that test and ran the test with --until-failure. The next step is for me to figure out what kind of corruption is being randomly generated and is evading the verifier and then write a separate, fully deterministic test case that exercises that case.

comment:4 Changed at 2009-12-07T02:09:14Z by warner

What was the seed? If you set the seed to that value, can you reproduce the failure? If you can help me reproduce it, I can help track it down.

My hunch is that the verifier is not quite looking at the whole share. The Verifier (whose job is to fetch and verify every byte of every share) shares a lot of code with the Downloader (whose job is to fetch as few bytes as possible). So any shortcuts that the Downloader takes must be worked around by the Verifier, and it's easy to miss some.

The new Downloader that I'm slowly developing (for #287) will be even more specialized for minimum-fetching, so I think it'll soon be time to stop sharing much code between the two pieces.

comment:5 Changed at 2010-01-10T01:17:40Z by zooko

It occurs to me that there is something that these two modules will still share -- for every bit of data that the downloader does need, it wants to verify that bit's correctness, and the verifier wants to verify the bit's correctness in the same way that the downloader does, and the downloader wants to verify that bit's correctness in the same way that the verifier does. It would be good if the downloader doesn't download any bits that it doesn't need, but of course for various reasons it might end up downloading some that are redundant or unneeded and I guess it doesn't need to check on those ones.

comment:6 Changed at 2010-01-10T01:23:40Z by zooko

  • Owner changed from zooko to warner
  • Status changed from assigned to new

Okay I added a deterministic unit test. If you flip the 7th bit of the 9th byte of the cryptext share hash tree the then verifier says that the share data is all perfect, which is wrong wrong wrong. :-) 876c4a153b53b08c The test is marked as TODO. I'm going to work on other things now and see if Brian takes this bug.

comment:7 Changed at 2010-01-10T08:53:46Z by warner

  • Status changed from new to assigned

yay determinism!

the test fails for me if I run it alone, but passes repeatedly if I run the whole allmydata.test.test_repairer.Verifier group. Weird.

Ah, this is because it was sharing a basedir with the previous test (that makes it re-use the servers from the previous test, making them interfere with each other). This was also a problem with repair_from_deletion_of_1 and _of_7. I've pushed a fix for these in 1ff49a15bbe2ca6a.

I'll try to investigate the failure tomorrow.

comment:8 Changed at 2010-01-10T18:45:51Z by warner

ok, first thing I've learned is that _corrupt_crypttext_hash_tree_byte_9_bit_7 is actually corrupting the 9th byte of the *container*, not the share data or the crypttext-hash-tree field in particular. It correctly extracts the offset of the hash tree, but then ignores it and goes right for data[9]. This lands in the container's "share data size", which (according to Footnote 1 in immutable/layout.py) "as of Tahoe v1.3.0 these fields are not used when reading". So that's why this test fails to fail.

I'll poke at it some more and see if I can find a crypttext-hash-tree bit to flip that the verifier misses.

comment:9 Changed at 2010-01-10T20:24:54Z by warner

Ok, we're all good. We learned that several of the corrupt-and-verify unit tests were failing to include the container offset size, and thus were corrupting something other than what they intended. Several pieces of the share are unused, so we don't expect to detect corruption in them. When the offset problem caused our tests to corrupt one of these unused sections, the verifier didn't see problems, so the test failed.

I wrote a new test (test_each_byte), which sequentially corrupts each byte, runs the Verifier, then replaces the original share. This test is commented out, because it takes 140s to run on my laptop, but it reveals the following spans of bytes in which the Verifier doesn't detect corruption (of a 56-byte file, which turns into a 2753 byte share):

  • 4-7 (container share data length, unused and invisible to client)
  • 8-11 (container number of leases, invisible to client)
  • 16-19 (block size, unused)
  • 20-23 (share data size, unused)
  • 28-31 (offset of plaintext hash tree, unused)
  • 67-546 (reserved for plaintext hash tree, unused)
  • 1988-2680 (unused portion of space reserved for UEB)
  • 2681-2752 (container leases, invisible to client)

This is exactly what we'd expect. The _corrupt_crypttext_hash_tree code was (because of the missing offset) occasionally flipping bits in the unused plaintext_hash_tree section which immediately preceded it, which is not checked by the Verifier, hence the intermittent failures.

It's worth another ticket to look into removing the space allocation for that plaintext hash tree: we currently reserve that space but never write to it. If we stopped allocating it, we could save about 32*2*ln2(numsegs) bytes per share. On the other hand, I'd like to use it for a safely encrypted form of the plaintext hash tree, so I'm not going to prioritize the deallocation ticket.

comment:10 Changed at 2010-01-10T20:27:16Z by zooko

  • Resolution set to fixed
  • Status changed from assigned to closed

fixed by e7a9c000bfb76619 (corrupt the right parts of the share when you are testing whether the verifier notices)

comment:11 Changed at 2010-01-10T20:42:38Z by warner

my test_each_byte was added in 1ed5bbdcb6908f0a

Note: See TracTickets for help on using tickets.