[tahoe-dev] [tahoe-lafs] #753: use longer storage index / cap for collision resistance

Sat Jul 11 03:25:36 PDT 2009

#753: use longer storage index / cap for collision resistance
---------------------------+------------------------------------------------
 Reporter:  warner         |           Owner:           
     Type:  defect         |          Status:  new      
 Priority:  major          |       Milestone:  undecided
Component:  code-encoding  |         Version:  1.4.1    
 Keywords:                 |   Launchpad_bug:           
---------------------------+------------------------------------------------

Old description:

> As we work on new encoding schemes (like DSA-based mutable files), I'm
> thinking that we want to put a lower bound on the cap/SI length to
> maintain
> reasonable margin against collisions. 256 bits would be more than enough.
> 128
> is ok, but a bit tight. 92 bits would make me nervous.
>
> robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation
> before it) because, for something that is meant as a backup system, even
> the
> slightest possibility of the CHK-based indexing scheme mapping two
> documents
> to the same storage index was too high for him. (I believe he would be
> more satisfied with a scheme that used centrally-allocated guaranteed-
> unique
> storage indices, which we could do but would require more coordination
> and
> longer caps, since we could no longer use a randomly-generated readkey to
> derive the storage index. In exchange for a controllable but non-zero
> probability of collision, we get to avoid central coordination and use
> smaller caps).
>
> The specific places where collisions could occur are:
>
>  * mapping from file contents to CHK-derived readkey
>  * mapping from readkey (CHK-derived or randomly generated) to storage
> index
>  * mapping from randomly-generated mutable writekey to storage index
>
> The "birthday paradox" determines the chance of collision. If I'm doing
> my
> math right, if you want less than '{{{p}}}' chance of getting any
> collisions
> when selecting items out of a namespace of size '{{{N}}}', then you can't
> select more than {{{C = sqrt(2*N*p)}}} items. This is called a "paradox"
> (where "surprise" would be a better term) because that square root causes
> C
> to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to
> C=19.
> In the Tahoe context, {{{C}}} is the number of files you can add to the
> grid.
>
> In the current case, our 128-bit storage index (N=2**128) means that
> p=0.5
> gets us a nice large 2**64 number of files, except that p=0.5 is
> insufficient
> margin: we'd much prefer a vanishingly small chance of collision, like
> p=2**-64. Fortunately we get two bits of margin for every one bit we
> reduce
> from C. The table looks like:
>
> ||N||numfiles||prob(collision)||
> ||96||2**48||2**-1 (0.5)||
> ||96||2**40||2**-16||
> ||96||2**32||2**-32||
> ||96||2**24||2**-48||
> ||128||2**64||2**-1 (0.5)||
> ||128||2**56||2**-16||
> ||128||2**48||2**-32||
> ||128||2**32||2**-64||
> ||256||2**128||2**-1 (0.5)||
> ||256||2**96||2**-64||
>

> Note that our {{{N}}} is the minimum of the storage-index size and the
> top-most cap value (i.e. the readkey for immutable files, or the writekey
> for
> mutable files). So a DSA-based mutable file with a 92-bit writecap gives
> us
> an {{{N}}} of 2**92, even if it is expanded into a storage-index of 128
> or
> 256 bits.
>
> Also note that the allmydata.com grid currently has something like 10M
> objects in it, about C=2**23.
>
> So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap
> would be nice, it's too short to provide enough collision margin. I want
> to
> be able to put trillions of files into a grid, and I want a the chance of
> collision to be so small that I don't ever need to worry about it, and 96
> bits isn't really there. 128 bits is probably good enough, but doesn't
> have
> enough margin to be obviously and unquestionably safe (C=2**32 is a lot
> of
> files but you can imagine people wanting more, p=2**-64 is a tiny
> probability
> but you can imagine people wanting a bit better). 256 would be plenty
> (but of
> course I want my filecaps to be shorter than that).

New description:

 As we work on new encoding schemes (like DSA-based mutable files), I'm
 thinking that we want to put a lower bound on the cap/SI length to
 maintain
 reasonable margin against collisions. 256 bits would be more than enough.
 128
 is ok, but a bit tight. 92 bits would make me nervous.

 robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation
 before it) because, for something that is meant as a backup system, even
 the
 slightest possibility of the CHK-based indexing scheme mapping two
 documents
 to the same storage index was too high for him. (I believe he would be
 more satisfied with a scheme that used centrally-allocated guaranteed-
 unique
 storage indices, which we could do but would require more coordination and
 longer caps, since we could no longer use a randomly-generated readkey to
 derive the storage index. In exchange for a controllable but non-zero
 probability of collision, we get to avoid central coordination and use
 smaller caps).

 The specific places where collisions could occur are:

  * mapping from file contents to CHK-derived readkey
  * mapping from readkey (CHK-derived or randomly generated) to storage
 index
  * mapping from randomly-generated mutable writekey to storage index

 The "birthday paradox" determines the chance of collision. If I'm doing my
 math right, if you want less than '{{{p}}}' chance of getting any
 collisions
 when selecting items out of a namespace of size '{{{N}}}', then you can't
 select more than {{{C = sqrt(2*N*p)}}} items. This is called a "paradox"
 (where "surprise" would be a better term) because that square root causes
 C
 to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to
 C=19.
 In the Tahoe context, {{{C}}} is the number of files you can add to the
 grid.

 In the current case, our 128-bit storage index (N=2**128) means that p=0.5
 gets us a nice large 2**64 number of files, except that p=0.5 is
 insufficient
 margin: we'd much prefer a vanishingly small chance of collision, like
 p=2**-64. Fortunately we get two bits of margin for every one bit we
 reduce
 from C. The table looks like:

 ||N||numfiles|| ||prob(collision)||
 ||96||2**48||->||2**-1 (0.5)||
 ||96||2**40||->||2**-17||
 ||96||2**32||->||2**-33||
 ||96||2**24||->||2**-49||
 ||128||2**64||->||2**-1 (0.5)||
 ||128||2**56||->||2**-17||
 ||128||2**48||->||2**-33||
 ||128||2**32||->||2**-65||
 ||192||2**96||->||2**-1||
 ||192||2**80||->||2**-33||
 ||192||2**64||->||2**-65||
 ||256||2**128||->||2**-1 (0.5)||
 ||256||2**96||->||2**-65||

 Note that our {{{N}}} is the minimum of the storage-index size and the
 top-most cap value (i.e. the readkey for immutable files, or the writekey
 for
 mutable files). So a DSA-based mutable file with a 92-bit writecap gives
 us
 an {{{N}}} of 2**92, even if it is expanded into a storage-index of 128 or
 256 bits.

 Also note that the allmydata.com grid currently has something like 10M
 objects in it, about C=2**23.

 So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap
 would be nice, it's too short to provide enough collision margin. I want
 to
 be able to put trillions of files into a grid, and I want a the chance of
 collision to be so small that I don't ever need to worry about it, and 96
 bits isn't really there. 128 bits is probably good enough, but doesn't
 have
 enough margin to be obviously and unquestionably safe (C=2**32 is a lot of
 files but you can imagine people wanting more, p=2**-64 is a tiny
 probability
 but you can imagine people wanting a bit better). 256 would be plenty (but
 of
 course I want my filecaps to be shorter than that).

--

Comment(by warner):

 as before, I think I'd like to continue using "storage index" for what
 you're
 calling the "file identifier", but yeah split out "server selector" or
 "peer-selection index" or some similar term for the purpose of determining
 which servers you're going to be talking to. One way of describing this
 would
 be "we used to use the storage-index as the peer-selection index, but
 these
 days we put two separate values in the filecap".

 I am also starting to think of these as separate concepts, but remember
 that
 we've yet to actually implement such a split.

 Sylvan's concern was about availability: he considered a backup system to
 be
 broken if its design has a built-in probability of file unrecoverability.
 It's easier to see the problem if we set the encryption-key and hash
 lengths
 to infinity, but restrict the storage index to say four bits. Then upload
 two
 files, and try to download one of them.. you've got a 1/16 chance of
 getting
 a download failure because your two files had the same storage-index, you
 downloaded the wrong bits, and now they won't pass the integrity check.

 Also, when we talk about this, we should be careful to distinguish between
 the failure modes of mutable files versus immutable files.. they're very
 distinct. And, collisions at different levels have very different
 consequences: if the storage index is too small, we'll get availability
 failures; if the immutable encryption key or mutable writekey is too
 small,
 we'll get confidentiality failures. I've been assuming that we'll keep the
 security parameters sufficiently large.. this ticket was specifically
 about
 the availability concerns resulting from a too-small storage index.

 If we compress the filecap by deriving the storage-index from the
 writekey,
 then clearly we're limited by {{{min(len(writekey),len(storage-index))}}}.

 Mostly I ticketed this issue because it's something I want to keep in mind
 as
 we design the next revision of the filecap format. If we don't already
 have a
 wiki page for it, I'll add one to organize our ideas.. I think they're
 currently spread across half a dozen tickets.

 I updated the table in the description: I think 192-bit caps will let us
 have
 an effectively-infinite number of files (2**64) with an effectively-zero
 chance of collision (2**-65).

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/753#comment:3>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid