wiki:NewImmutableEncodingDesign

Version 1 (modified by warner, at 2009-08-25T09:59:45Z) (diff)

new page about immutable encoding design

The NewCapDesign page describes desired features of the next filecap design. This page is for designing the encoding format for these new immutable files.

Features

  • as described on NewCapDesign#filecaplength, we probably need 128bit confidentiality "C" bits, 256bit integrity "I" bits, and 128bit storage-collision resistance. There are encoding schemes that can combine the C and I bits (at the expense of convergence, or certain forms of offline attenuation).
  • we may define a "server-selection-index" (which is used to permute or otherwise narrow the list of servers to be used) to be separate from the "storage-index" (which is used to identify a specific share on whichever servers we actually talk to). This may involve a separate field in the filecap, or it may continue to be derived from the storage index.
  • some encoding schemes allow the readcap to be attenuated to a verifycap offline
  • in general, we don't care how long the verifycap is
  • the server should be able to validate the entire share by itself, without the readcap. In general, this means that the storage-index must also be the verifycap.
    • note that this implies that the storage-index cannot be computed until the end of encoding, when all shares have been generated, the share hash tree has been built, and its root has been added to the UEB.
      • this implies that we can't use the storage-index to detect convergence with earlier uploads of the same file. To retain convergence may require a lookup table on the server (mapping hash-of-readkey to storage-index, or something)
      • it also implies that storage-index can't be used as a server-selection-index, which again points to using hash-of-readkey as SSI (to retain convergence of server-selection). Setting the storage-index at the end of upload requires a new uploader protocol, which uses an "upload handle" for the data transfer, and finishes with a "now commit this share to storage-index=X" message.
      • the original CHK design uses hash-of-readkey as storage-index, which has all these good properties except server-side full share validation. (servers can compare share contents against the UEB, and we could put a copy of the UEB hash into the share, but servers would continue to be unable to make sure the share was in the right place)
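Setting the storage-index at the end of upload, as described above, implies a two-phase uploader protocol. A minimal sketch of what that protocol shape might look like (all class and method names here are hypothetical, not an existing Tahoe interface):

```python
from hashlib import sha256
import os

class StorageServer:
    """Hypothetical server side of a two-phase upload protocol."""
    def __init__(self):
        self._pending = {}  # upload-handle -> accumulated share bytes
        self._shares = {}   # storage-index -> committed share bytes

    def start_upload(self):
        # hand back an opaque "upload handle" for the data transfer
        handle = os.urandom(16)
        self._pending[handle] = b""
        return handle

    def write(self, handle, data):
        self._pending[handle] += data

    def commit(self, handle, storage_index):
        # the final "now commit this share to storage-index=X" message
        self._shares[storage_index] = self._pending.pop(handle)

# client side: the storage-index is only known once encoding finishes
# (here a truncated hash of the whole share stands in for the verifycap)
server = StorageServer()
handle = server.start_upload()
for block in [b"block1", b"block2"]:
    server.write(handle, block)
si = sha256(b"block1" + b"block2").digest()[:16]
server.commit(handle, si)
```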

Options

note: all cap-length computations assume the integrity-providing "I" field is 256bits long, and the confidentiality-providing "C" field is 128bits long. If we decide on different values, the sums below should be updated.

One: current CHK design

Readcaps consist of two main pieces: C bits and I bits, plus:

  • k (which improves the accuracy of the initial number of queries to send out)
  • N (which improves the guessed upper bound on number of queries to send out, and used to be required by the abandoned TahoeThree algorithm)
  • filesize (advisory only, used by deep-size measurements in lieu of fetching share data to measure filesize)

SI = H(C), SSI = SI. Verifycap is SI+I.

  • SSI and SI are known ahead of time, uploader protocol starts with SI
  • good convergence
  • long caps (128+256+len(k+N+filesize)) ~= 400bits
  • server cannot verify entire share
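For concreteness, the field sizes above can be checked with a short sketch. The flat tuple and untagged SHA-256 here are illustrative assumptions, not Tahoe's actual tagged-hash or cap-encoding scheme:

```python
from hashlib import sha256
import os

def make_chk_readcap(readkey, ueb_hash, k, N, filesize):
    """Illustrative CHK-style readcap: C bits, I bits, plus k/N/filesize."""
    assert len(readkey) == 16   # 128-bit confidentiality "C" field
    assert len(ueb_hash) == 32  # 256-bit integrity "I" field
    return (readkey, ueb_hash, k, N, filesize)

def storage_index(readkey):
    # SI = H(C), truncated to 128 bits; SSI = SI in this design
    return sha256(readkey).digest()[:16]

readkey = os.urandom(16)
ueb_hash = os.urandom(32)
cap = make_chk_readcap(readkey, ueb_hash, k=3, N=10, filesize=12345)
si = storage_index(readkey)
verifycap = si + ueb_hash  # verifycap is SI+I

# C + I alone account for 128 + 256 = 384 bits; encoding k, N, and
# filesize pushes the total to roughly 400 bits
assert (len(readkey) + len(ueb_hash)) * 8 == 384
```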

Two: Zooko's scheme

Readcaps contain one crypto value that combines the C and I fields. (I forget how this worked... it was clever, but I think it had some fatal flaw, like not being able to derive a storage-index from the readcap without first retrieving shares, or something. One of us will dig up the notes on it and describe it here.)

  • short caps
  • convergence problems

Others?

Ideas

It might be possible to have the uploader give the server two values, at different stages of the upload process, which together would allow full validation of the resulting share. Using a single value (the verifycap) as a storage index would be cleaner, but might not be strictly necessary.

The servers could maintain a table, mapping from one sort of index to another, if that made it easier for the upload process to proceed (or to achieve convergence). For example, H(readkey) is known at the beginning of upload, but the I bits aren't known until the end. If the client could use SSI=H(readkey) and then ask each server for the storage-index of any shares which used that H(readkey), it could achieve convergence and still use the I bits as the storage-index. The servers would be obligated to maintain a table with one entry per bucket (so probably ~20M entries), and errors/malicious behavior in this table would cause convergence failures (which are hardly fatal).
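A minimal sketch of such a server-side table (class and method names are hypothetical):

```python
from hashlib import sha256

class ConvergenceTable:
    """Maps H(readkey) -> storage-index, so a client can rediscover an
    earlier upload of the same file before it knows the I bits."""
    def __init__(self):
        self._table = {}

    def register(self, h_readkey, storage_index):
        self._table[h_readkey] = storage_index

    def lookup(self, h_readkey):
        # returns None if no share with this H(readkey) exists here;
        # a wrong or missing answer costs convergence, not integrity
        return self._table.get(h_readkey)

readkey = b"\x01" * 16
h_readkey = sha256(readkey).digest()
table = ConvergenceTable()
first_try = table.lookup(h_readkey)        # first upload: no match
table.register(h_readkey, b"SI-from-I-bits")
second_try = table.lookup(h_readkey)       # later upload: converges
```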

The SSI can be much shorter than the SI. It only needs to be long enough to provide good load-balancing properties. It could be included explicitly in the filecap. Alternate (non-TahoeTwo) peer-selection strategies could encode whatever per-file information they needed into the SSI, assuming some sort of tradeoff between cap length (i.e. SSI length) and work done by the downloader to find the right servers.
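For example, a short SSI could drive a TahoeTwo-style permutation by sorting the server list on a hash of the SSI and each server id. This is a sketch of the general idea, not the exact permutation Tahoe uses:

```python
from hashlib import sha256

def permute_servers(ssi, server_ids):
    # each file's SSI induces a different, but stable, server ordering,
    # which is all the SSI needs to provide for load-balancing
    return sorted(server_ids, key=lambda sid: sha256(ssi + sid).digest())

servers = [b"server-a", b"server-b", b"server-c"]
ssi = b"\x00" * 8  # e.g. a 64-bit SSI carried explicitly in the filecap
order1 = permute_servers(ssi, servers)
order2 = permute_servers(ssi, servers)
```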