[tahoe-dev] [tahoe-lafs] #753: use longer storage index / cap for collision resistance
tahoe-lafs
trac at allmydata.org
Fri Jul 10 02:41:24 PDT 2009
#753: use longer storage index / cap for collision resistance
---------------------------+------------------------------------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: major | Milestone: undecided
Component: code-encoding | Version: 1.4.1
Keywords: | Launchpad_bug:
---------------------------+------------------------------------------------
As we work on new encoding schemes (like DSA-based mutable files), I'm
thinking that we want to put a lower bound on the cap/SI length to
maintain
reasonable margin against collisions. 256 bits would be more than enough.
128
is ok, but a bit tight. 92 bits would make me nervous.
robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation
before it) because, for something that is meant as a backup system, even
the
slightest possibility of the CHK-based indexing scheme mapping two
documents
to the same storage index was too high for him. (I believe he would be
more satisfied with a scheme that used centrally-allocated guaranteed-
unique
storage indices, which we could do but would require more coordination and
longer caps, since we could no longer use a randomly-generated readkey to
derive the storage index. In exchange for a controllable but non-zero
probability of collision, we get to avoid central coordination and use
smaller caps).
The specific places where collisions could occur are:
* mapping from file contents to CHK-derived readkey
* mapping from readkey (CHK-derived or randomly generated) to storage
index
* mapping from randomly-generated mutable writekey to storage index
The "birthday paradox" determines the chance of collision. If I'm doing my
math right, if you want less than '{{{p}}}' chance of getting any
collisions
when selecting items out of a namespace of size '{{{N}}}', then you can't
select more than {{{C = sqrt(2*N*p)}}} items. This is called a "paradox"
(where "surprise" would be a better term) because that square root causes
C
to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to
C=19.
In the Tahoe context, {{{C}}} is the number of files you can add to the
grid.
In the current case, our 128-bit storage index (N=2**128) means that p=0.5
gets us a nice large 2**64 number of files, except that p=0.5 is
insufficient
margin: we'd much prefer a vanishingly small chance of collision, like
p=2**-64. Fortunately we get two bits of margin for every one bit we
reduce
from C. The table looks like:
||N||numfiles||prob(collision)||
||96||2**48||2**-1 (0.5)||
||96||2**40||2**-16||
||96||2**32||2**-32||
||96||2**24||2**-48||
||128||2**64||2**-1 (0.5)||
||128||2**56||2**-16||
||128||2**48||2**-32||
||128||2**32||2**-64||
||256||2**128||2**-1 (0.5)||
||256||2**96||2**-64||
Note that our {{{N}}} is the minimum of the storage-index size and the
top-most cap value (i.e. the readkey for immutable files, or the writekey
for
mutable files). So a DSA-based mutable file with a 92-bit writecap gives
us
an {{{N}}} of 2**92, even if it is expanded into a storage-index of 128 or
256 bits.
Also note that the allmydata.com grid currently has something like 10M
objects in it, about C=2**23.
So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap
would be nice, it's too short to provide enough collision margin. I want
to
be able to put trillions of files into a grid, and I want a the chance of
collision to be so small that I don't ever need to worry about it, and 96
bits isn't really there. 128 bits is probably good enough, but doesn't
have
enough margin to be obviously and unquestionably safe (C=2**32 is a lot of
files but you can imagine people wanting more, p=2**-64 is a tiny
probability
but you can imagine people wanting a bit better). 256 would be plenty (but
of
course I want my filecaps to be shorter than that).
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/753>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
More information about the tahoe-dev
mailing list