#753 new defect

use longer storage index / cap for collision resistance — at Initial Version

Reported by: warner Owned by:
Priority: major Milestone: undecided
Component: code-encoding Version: 1.4.1
Keywords: newcaps security Cc:
Launchpad Bug:

Description

As we work on new encoding schemes (like DSA-based mutable files), I'm thinking that we want to put a lower bound on the cap/SI length to maintain reasonable margin against collisions. 256 bits would be more than enough. 128 is ok, but a bit tight. 92 bits would make me nervous.

robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation before it) because, for something that is meant as a backup system, even the slightest possibility of the CHK-based indexing scheme mapping two documents to the same storage index was too high for him. (I believe he would be more satisfied with a scheme that used centrally-allocated guaranteed-unique storage indices, which we could do but would require more coordination and longer caps, since we could no longer use a randomly-generated readkey to derive the storage index. In exchange for a controllable but non-zero probability of collision, we get to avoid central coordination and use smaller caps).

The specific places where collisions could occur are:

  • mapping from file contents to CHK-derived readkey
  • mapping from readkey (CHK-derived or randomly generated) to storage index
  • mapping from randomly-generated mutable writekey to storage index

The "birthday paradox" determines the chance of collision. If I'm doing my math right, if you want less than 'p' chance of getting any collisions when selecting items out of a namespace of size 'N', then you can't select more than C = sqrt(2*N*p) items. This is called a "paradox" (where "surprise" would be a better term) because that square root causes C to be surprisingly low: for birthdays (in which N=365), p=0.5 leads to C=19. In the Tahoe context, C is the number of files you can add to the grid.
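The bound above is easy to check with a few lines of Python (illustrative only; C = sqrt(2*N*p) is the standard small-p birthday approximation):

```python
import math

def max_items(N, p):
    """Largest number of selections C from a namespace of size N that
    keeps the collision probability below p, using the birthday-bound
    approximation C = sqrt(2*N*p)."""
    return math.sqrt(2 * N * p)

# The classic birthday example: N=365, p=0.5 gives C ~= 19 people.
print(round(max_items(365, 0.5)))          # -> 19

# The Tahoe context: a 128-bit namespace with p=0.5 allows ~2^64 files.
print(math.log2(max_items(2**128, 0.5)))   # -> 64.0
```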

In the current case, our 128-bit storage index (N=2^128) means that p=0.5 gets us a nice large 2^64 number of files, except that p=0.5 is insufficient margin: we'd much prefer a vanishingly small chance of collision, like p=2^-64. Fortunately we get two bits of margin for every one bit we reduce from C. The table looks like:

N    | num files | prob(collision)
-----|-----------|----------------
96   | 2^48      | 2^-1 (0.5)
96   | 2^40      | 2^-16
96   | 2^32      | 2^-32
96   | 2^24      | 2^-48
128  | 2^64      | 2^-1 (0.5)
128  | 2^56      | 2^-16
128  | 2^48      | 2^-32
128  | 2^32      | 2^-64
256  | 2^128     | 2^-1 (0.5)
256  | 2^96      | 2^-64
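A quick sketch (my code, not from the ticket) of how rows like these are produced: working in log2, the approximation p ~= C^2/(2N) gives log2(p) = 2*log2(C) - log2(N) - 1, which agrees with the table to within one bit (the table rounds the exponents):

```python
def log2_collision_prob(N_bits, C_bits):
    """log2 of the collision probability when drawing 2**C_bits items
    from a namespace of 2**N_bits, via the approximation p ~= C^2/(2*N)."""
    return 2 * C_bits - N_bits - 1

for N_bits, C_bits in [(96, 48), (96, 40), (96, 32), (96, 24),
                       (128, 64), (128, 56), (128, 48), (128, 32),
                       (256, 128), (256, 96)]:
    print(f"N=2^{N_bits}  numfiles=2^{C_bits}  "
          f"prob=2^{log2_collision_prob(N_bits, C_bits)}")
```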

Note that our N is the minimum of the storage-index size and the top-most cap value (i.e. the readkey for immutable files, or the writekey for mutable files). So a DSA-based mutable file with a 92-bit writecap gives us an N of 2^92, even if it is expanded into a storage index of 128 or 256 bits.

Also note that the allmydata.com grid currently has something like 10M objects in it, about C=2^23.
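For scale, a rough estimate (my arithmetic, using the same birthday approximation) of the collision probability for a grid of that size at each candidate index length:

```python
import math

num_objects = 10_000_000          # ~10M objects in the grid, i.e. C ~= 2^23
C_bits = math.log2(num_objects)   # ~23.25
for N_bits in (96, 128, 256):
    # p ~= C^2 / (2*N), expressed as a power of two
    print(f"N=2^{N_bits}: p ~= 2^{2 * C_bits - N_bits - 1:.0f}")
```

Even at N=96 the probability for today's grid is around 2^-50; the concern in this ticket is the margin left for growth toward trillions of files.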

So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap would be nice, it's too short to provide enough collision margin. I want to be able to put trillions of files into a grid, and I want the chance of collision to be so small that I don't ever need to worry about it, and 96 bits isn't really there. 128 bits is probably good enough, but doesn't have enough margin to be obviously and unquestionably safe (C=2^32 is a lot of files but you can imagine people wanting more, p=2^-64 is a tiny probability but you can imagine people wanting a bit better). 256 would be plenty (but of course I want my filecaps to be shorter than that).

Change History (0)
