[tahoe-dev] verification of subset of file == proof of retrievability
Zooko Wilcox-O'Hearn
zooko at zooko.com
Wed Jun 13 05:57:59 UTC 2012
Folks:
Over on the Bitcoin discussion forums (warning: wretched hive of scum
and villainy), someone was asserting that they wanted a "proof of
retrievability" protocol and saying that, while they hadn't looked,
they were pretty sure Tahoe-LAFS didn't do it right:
https://bitcointalk.org/index.php?topic=2236.msg847771#msg847771
I was mildly annoyed by this, because actually we have some extremely
strong features along those lines.
However, when I wrote a reply explaining exactly what we do have, I
was forced to admit that it isn't fully there yet. We already have
verification of complete files (although see #568), but we don't have
verification of a randomly-chosen subset of a file, which would be a
"Proof of Retrievability". See below for the message I posted to the
Bitcoin forum.
See also an old rant of mine complaining that academic cryptographers
have failed to study the papers and documentation of Tahoe-LAFS
closely enough to realize that there is most of a
proof-of-retrievability in there:
https://lafsgateway.zooko.com/uri/URI:DIR2-RO:d73ap7mtjvv7y6qsmmwqwai4ii:tq5tqejzulg7yj4h7nxuurpiuuz5jsgvczmdamcalpk2rc6gmbsq/klog.html#[[HAIL%3A%20A%20High-Availability%20and%20Integrity%20Layer%20for%20Cloud%20Storage]]
https://tahoe-lafs.org/trac/tahoe-lafs/ticket/568 ("make immutable
check/verify/repair and mutable check/verify work given only a verify
cap")
------- message I posted to the Bitcoin forum
> Correct, to verify every bit of every share you need to download every bit of every share. It's expensive - hopefully future versions of Tahoe-LAFS will implement a probabilistic "proof of retrievability" protocol like the one you suggest.
Downloading only a subset of a file is already implemented, but there
is no command that says "pick a random segment of this file, download
it from that server, and let me know whether it passed its integrity
checks". You can approximate it with the current Tahoe-LAFS client,
like this (lines beginning with "$" are commands typed at a bash
prompt):
1. Pick a random spot in the file. Let's say the file size is
29153345 bytes (about 29 MB):
$ FILESIZE=29153345
$ python -c "import random;print random.randrange(0, $FILESIZE)"
2451799
2. Fetch the segment that contains that point. Segments are (unless
you've tweaked the configuration in a way that nobody does) 128 KiB in
size, so this will download the 128 KiB segment of the file that
contains byte number 2451799 and check the integrity of all 128 KiB:
$ curl --range 2451799-2451799 http://localhost/uri/URI:CHK:jwq3f6lkcioyxeuxlt3exlulqe:sccvpp27agfz32lqjghq2djaxetcuo7luko5dhrpdgs7bfidbasa:1:1:29153345 | hexdump -C
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     1    0     1    0     0      0      0 --:--:--  0:00:01 --:--:--     0
00000000  a9                                                |.|
00000001
As you can see, it took only a second and emitted only one byte to
stdout, but behind the scenes it downloaded and verified the integrity
of the entire 128 KiB segment containing that byte.
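The two steps above could be automated with a short script against the
local gateway. Here is a rough sketch (the gateway URL and function
names are my own placeholders, this uses Python 3's urllib rather than
Tahoe-LAFS internals, and I haven't run it against a real grid):

```python
import random
import urllib.request
import urllib.error

SEGMENT_SIZE = 128 * 1024  # Tahoe-LAFS default segment size

def segment_bounds(offset, filesize, segment_size=SEGMENT_SIZE):
    """Return the inclusive (start, end) byte range of the segment
    containing the given offset, as used in an HTTP Range header."""
    start = (offset // segment_size) * segment_size
    end = min(start + segment_size, filesize) - 1
    return start, end

def spot_check(gateway, cap, filesize):
    """Fetch one byte at a random offset through the gateway.  The
    Tahoe-LAFS client verifies the whole enclosing segment as a side
    effect of serving the range.  Returns True if the request
    succeeded, False otherwise."""
    offset = random.randrange(0, filesize)
    req = urllib.request.Request(
        "%s/uri/%s" % (gateway, cap),
        headers={"Range": "bytes=%d-%d" % (offset, offset)})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status in (200, 206)
    except urllib.error.URLError:
        return False

# The pure part can be checked without a grid: byte 2451799 of a
# 29153345-byte file falls in the segment spanning bytes
# 2359296-2490367.
print(segment_bounds(2451799, 29153345))  # → (2359296, 2490367)
```

You would call spot_check() with something like
("http://localhost:3456", "URI:CHK:...", 29153345) against a running
gateway.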
If you are spreading the data among multiple servers with
Tahoe-LAFS's awesome erasure coding feature, then this will download
the data from the 3 fastest servers (unless you've changed the default
setting of "3" to some other number). There is
no good way to force it to download the data from specific servers in
order to test them -- it always picks the fastest servers. You can see
which server(s) it used by looking at the "Recent Uploads and
Downloads" page on the web user interface, which will also tell you a
bunch of performance statistics about this download.
In short, this feature is *almost* there. We just need someone to
write some code to do this automatically in the client (which is
written in Python) instead of as a series of bash commands. Also this
code should download one (randomly chosen) block from every server it
can find instead of from just the three fastest servers, and it should
print out a useful summary of what it tried and which servers had good
shares.
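What that automated client command might look like, in outline:
repeat the random probe some number of times and print a summary.
This is only a sketch in that spirit (the probe goes through the
gateway, so it cannot yet target specific servers -- that part would
need client internals -- and all names here are hypothetical):

```python
import random
import urllib.request
import urllib.error

def probe_once(gateway, cap, filesize):
    """One probabilistic retrievability probe: request a single random
    byte; the client verifies the whole enclosing segment to serve it."""
    offset = random.randrange(0, filesize)
    req = urllib.request.Request(
        "%s/uri/%s" % (gateway, cap),
        headers={"Range": "bytes=%d-%d" % (offset, offset)})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status in (200, 206)
    except urllib.error.URLError:
        return False

def summarize(results):
    """Render a one-line summary in the style of 'tahoe check'."""
    good = sum(results)
    status = "Healthy" if good == len(results) else "UNHEALTHY"
    return "Summary: %s (%d/%d probes passed)" % (status, good, len(results))

# Against a live grid you would do something like:
#   results = [probe_once("http://localhost:3456", cap, filesize)
#              for _ in range(10)]
print(summarize([True, True, True]))  # → Summary: Healthy (3/3 probes passed)
```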
Oh, there is a different function which does print out a useful
summary of results -- the "verify" feature. But that downloads and
tests every block instead of just one randomly chosen block. Another
way to implement this would be to add an option to "verify" indicating
how many blocks it should check:
$ time tahoe check --verify URI:CHK:jwq3f6lkcioyxeuxlt3exlulqe:sccvpp27agfz32lqjghq2djaxetcuo7luko5dhrpdgs7bfidbasa:1:1:29153345
Summary: Healthy
storage index: 7qhuoagk4z4ugsjkjgjcre6sx4
good-shares: 1 (encoding is 1-of-1)
wrong-shares: 0
real 1m2.705s
user 0m0.570s
sys 0m0.060s