[tahoe-dev] [tahoe-lafs] #753: use longer storage index / cap for collision resistance
tahoe-lafs
trac at allmydata.org
Sat Jul 11 03:25:36 PDT 2009
#753: use longer storage index / cap for collision resistance
---------------------------+------------------------------------------------
Reporter: warner | Owner:
Type: defect | Status: new
Priority: major | Milestone: undecided
Component: code-encoding | Version: 1.4.1
Keywords: | Launchpad_bug:
---------------------------+------------------------------------------------
New description:
As we work on new encoding schemes (like DSA-based mutable files), I'm thinking that we want to put a lower bound on the cap/SI length to maintain a reasonable margin against collisions. 256 bits would be more than enough. 128 is ok, but a bit tight. 92 bits would make me nervous.

robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation before it) because, for something that is meant as a backup system, even the slightest possibility of the CHK-based indexing scheme mapping two documents to the same storage index was too high for him. (I believe he would be more satisfied with a scheme that used centrally-allocated guaranteed-unique storage indices, which we could do but would require more coordination and longer caps, since we could no longer use a randomly-generated readkey to derive the storage index. In exchange for a controllable but non-zero probability of collision, we get to avoid central coordination and use smaller caps.)
The specific places where collisions could occur (sketched in code just below) are:

 * mapping from file contents to CHK-derived readkey
 * mapping from readkey (CHK-derived or randomly generated) to storage index
 * mapping from randomly-generated mutable writekey to storage index
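A minimal sketch of those three mappings, with plain truncated SHA-256 standing in for the tagged hashes Tahoe actually uses, and all real key-derivation details omitted:

{{{
#!python
import os
from hashlib import sha256

def H(data: bytes, bits: int) -> bytes:
    # Stand-in for Tahoe's tagged hashes: truncated SHA-256.
    return sha256(data).digest()[: bits // 8]

file_contents = b"example plaintext"

# 1. file contents -> CHK-derived readkey
readkey = H(file_contents, 128)

# 2. readkey (CHK-derived or random) -> storage index
storage_index = H(readkey, 128)

# 3. randomly-generated mutable writekey -> storage index
writekey = os.urandom(16)                  # 128 random bits
mutable_storage_index = H(writekey, 128)
}}}

A collision in any one of these mappings sends two distinct objects to the same place.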
The "birthday paradox" determines the chance of collision. If I'm doing my
math right, if you want less than '{{{p}}}' chance of getting any
collisions
when selecting items out of a namespace of size '{{{N}}}', then you can't
select more than {{{C = sqrt(2*N*p)}}} items. This is called a "paradox"
(where "surprise" would be a better term) because that square root causes
C
to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to
C=19.
In the Tahoe context, {{{C}}} is the number of files you can add to the
grid.
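A couple of lines of Python reproduce both the birthday number and the file-count bound (my arithmetic, nothing from the codebase):

{{{
#!python
import math

def max_items(N: float, p: float) -> float:
    # Largest C keeping collision probability below p, namespace size N.
    return math.sqrt(2 * N * p)

print(max_items(365, 0.5))                 # ~19.1, the birthday number
print(math.log2(max_items(2**128, 0.5)))   # 64.0, i.e. C = 2**64 at p=0.5
}}}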
In the current case, our 128-bit storage index (N=2**128) means that p=0.5 gets us a nice large 2**64 number of files, except that p=0.5 is insufficient margin: we'd much prefer a vanishingly small chance of collision, like p=2**-64. Fortunately we get two bits of margin for every one bit we reduce from C. The table looks like:
||N (bits)||numfiles C|| ||prob(collision) p||
||96||2**48||->||2**-1 (0.5)||
||96||2**40||->||2**-17||
||96||2**32||->||2**-33||
||96||2**24||->||2**-49||
||128||2**64||->||2**-1 (0.5)||
||128||2**56||->||2**-17||
||128||2**48||->||2**-33||
||128||2**32||->||2**-65||
||192||2**96||->||2**-1||
||192||2**80||->||2**-33||
||192||2**64||->||2**-65||
||256||2**128||->||2**-1 (0.5)||
||256||2**96||->||2**-65||
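Inverting the same formula gives {{{p = C**2 / (2*N)}}}, which regenerates every row of the table; working in log2 keeps the tiny probabilities exact:

{{{
#!python
def p_log2(n_bits: int, c_log2: int) -> int:
    # log2 of p = C**2 / (2*N) when C and N are both powers of two.
    return 2 * c_log2 - (n_bits + 1)

for n_bits, c_log2 in [(96, 48), (96, 40), (96, 32), (96, 24),
                       (128, 64), (128, 56), (128, 48), (128, 32),
                       (192, 96), (192, 80), (192, 64),
                       (256, 128), (256, 96)]:
    print(f"N=2**{n_bits}  C=2**{c_log2}  ->  p=2**{p_log2(n_bits, c_log2)}")
}}}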
Note that our {{{N}}} is the minimum of the storage-index size and the top-most cap value (i.e. the readkey for immutable files, or the writekey for mutable files). So a DSA-based mutable file with a 92-bit writecap gives us an {{{N}}} of 2**92, even if it is expanded into a storage-index of 128 or 256 bits.

Also note that the allmydata.com grid currently has something like 10M objects in it, about C=2**23.
So, I'm thinking that as nice as a short 96-bit DSA mutable writecap would be, it's too short to provide enough collision margin. I want to be able to put trillions of files into a grid, and I want the chance of collision to be so small that I don't ever need to worry about it, and 96 bits isn't really there. 128 bits is probably good enough, but doesn't have enough margin to be obviously and unquestionably safe (C=2**32 is a lot of files but you can imagine people wanting more, p=2**-64 is a tiny probability but you can imagine people wanting a bit better). 256 would be plenty (but of course I want my filecaps to be shorter than that).
--
Comment(by warner):
As before, I think I'd like to continue using "storage index" for what you're calling the "file identifier", but yeah, split out "server selector" or "peer-selection index" or some similar term for the purpose of determining which servers you're going to be talking to. One way of describing this would be "we used to use the storage-index as the peer-selection index, but these days we put two separate values in the filecap".

I am also starting to think of these as separate concepts, but remember that we've yet to actually implement such a split.
Sylvan's concern was about availability: he considered a backup system to be broken if its design has a built-in probability of file unrecoverability. It's easier to see the problem if we set the encryption-key and hash lengths to infinity, but restrict the storage index to, say, four bits. Then upload two files, and try to download one of them.. you've got a 1/16 chance of getting a download failure because your two files had the same storage-index, you downloaded the wrong bits, and now they won't pass the integrity check.
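That 1/16 is easy to sanity-check with a quick simulation (my own toy code, not anything from the tree):

{{{
#!python
import random

# Two uploads, each landing on a random 4-bit storage index; count how
# often the second upload collides with the first.
trials = 100_000
collisions = sum(random.randrange(16) == random.randrange(16)
                 for _ in range(trials))
print(collisions / trials)   # hovers around 1/16 = 0.0625
}}}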
Also, when we talk about this, we should be careful to distinguish between the failure modes of mutable files versus immutable files.. they're very distinct. And, collisions at different levels have very different consequences: if the storage index is too small, we'll get availability failures; if the immutable encryption key or mutable writekey is too small, we'll get confidentiality failures. I've been assuming that we'll keep the security parameters sufficiently large.. this ticket was specifically about the availability concerns resulting from a too-small storage index.
If we compress the filecap by deriving the storage-index from the writekey, then clearly we're limited by {{{min(len(writekey),len(storage-index))}}}.
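To make that {{{min()}}} bound concrete, here's a hedged sketch of such a derivation (SHA-256 standing in for whatever tagged hash the real format would use):

{{{
#!python
from hashlib import sha256

def derive_storage_index(writekey: bytes, si_bits: int = 256) -> bytes:
    # Illustrative only; the real format would use a tagged hash.
    return sha256(writekey).digest()[: si_bits // 8]

# Pretend this 12-byte value carries only 92 bits of entropy:
writekey = b"\x01" * 12
si = derive_storage_index(writekey)   # 256 bits wide...
# ...but at most 2**92 distinct writekeys exist, so at most 2**92
# distinct storage indexes can ever be produced: N = min(2**92, 2**256).
}}}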
Mostly I ticketed this issue because it's something I want to keep in mind as we design the next revision of the filecap format. If we don't already have a wiki page for it, I'll add one to organize our ideas.. I think they're currently spread across half a dozen tickets.
I updated the table in the description: I think 192-bit caps will let us have an effectively-infinite number of files (2**64) with an effectively-zero chance of collision (2**-65).
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/753#comment:3>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid
More information about the tahoe-dev mailing list