id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc,launchpad_bug
753,use longer storage index / cap for collision resistance,warner,,"As we work on new encoding schemes (like DSA-based mutable files), I'm thinking that we want to put a lower bound on the cap/SI length to maintain a reasonable margin against collisions. 256 bits would be more than enough. 128 is ok, but a bit tight. 92 bits would make me nervous.

robk's friend Sylvan expressed concerns about Tahoe (and mnet/mojonation before it) because, for something that is meant as a backup system, even the slightest possibility of the CHK-based indexing scheme mapping two documents to the same storage index was too high for him. (I believe he would be more satisfied with a scheme that used centrally-allocated guaranteed-unique storage indices, which we could do, but that would require more coordination and longer caps, since we could no longer use a randomly-generated readkey to derive the storage index. In exchange for a controllable but non-zero probability of collision, we get to avoid central coordination and use smaller caps.)

The specific places where collisions could occur are:

 * mapping from file contents to CHK-derived readkey
 * mapping from readkey (CHK-derived or randomly generated) to storage index
 * mapping from randomly-generated mutable writekey to storage index

The ""birthday paradox"" determines the chance of collision. If I'm doing my math right: if you want less than {{{p}}} chance of getting any collisions when selecting items out of a namespace of size {{{N}}}, then you can't select more than {{{C = sqrt(2*N*p)}}} items. This is called a ""paradox"" (where ""surprise"" would be a better term) because that square root makes C surprisingly low: for birthdays (N=365), p=0.5 leads to C=19. In the Tahoe context, {{{C}}} is the number of files you can add to the grid.
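As a sanity check on that square-root bound, here is a minimal Python sketch (the {{{max_items}}} helper name is mine, not part of any Tahoe code):

{{{
#!python
import math

# Birthday bound from the text: selecting C items from a namespace of
# size N keeps the probability of any collision below p as long as
# C <= sqrt(2*N*p).
def max_items(N, p):
    return math.sqrt(2 * N * p)

# Classic birthday example: N=365, p=0.5 gives C of about 19.
print(int(max_items(365, 0.5)))

# A 128-bit namespace at p=0.5 allows 2^64 files.
print(max_items(2**128, 0.5) == 2**64)
}}}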
In the current case, our 128-bit storage index (N=2^128^) means that p=0.5 gets us a nice large 2^64^ number of files, except that p=0.5 is insufficient margin: we'd much prefer a vanishingly small chance of collision, like p=2^-64^. Fortunately we get two bits of margin for every one bit we remove from C. The table looks like:

||N||numfiles|| ||prob(collision)||
||96||2^48^||->||2^-1^ (0.5)||
||96||2^40^||->||2^-17^||
||96||2^32^||->||2^-33^||
||96||2^24^||->||2^-49^||
||128||2^64^||->||2^-1^ (0.5)||
||128||2^56^||->||2^-17^||
||128||2^48^||->||2^-33^||
||128||2^32^||->||2^-65^||
||192||2^96^||->||2^-1^||
||192||2^80^||->||2^-33^||
||192||2^64^||->||2^-65^||
||256||2^128^||->||2^-1^ (0.5)||
||256||2^96^||->||2^-65^||

Note that our {{{N}}} is the minimum of the storage-index size and the top-most cap value (i.e. the readkey for immutable files, or the writekey for mutable files). So a DSA-based mutable file with a 92-bit writecap gives us an {{{N}}} of 2^92^, even if it is expanded into a storage index of 128 or 256 bits. Also note that the allmydata.com grid currently has something like 10M objects in it, about C=2^23^.

So, I'm thinking that as much as a nice short 96-bit DSA mutable writecap would be nice, it's too short to provide enough collision margin. I want to be able to put trillions of files into a grid, and I want the chance of collision to be so small that I never need to worry about it, and 96 bits isn't really there. 128 bits is probably good enough, but doesn't have enough margin to be obviously and unquestionably safe (C=2^32^ is a lot of files, but you can imagine people wanting more; p=2^-64^ is a tiny probability, but you can imagine people wanting a bit better). 256 would be plenty (but of course I want my filecaps to be shorter than that).",defect,new,major,undecided,code-encoding,1.4.1,,newcaps security,,