.. -*- coding: utf-8-with-signature -*-

==========
Tahoe URIs
==========

1. `File URIs`_

   1. `CHK URIs`_
   2. `LIT URIs`_
   3. `Mutable File URIs`_

2. `Directory URIs`_
3. `Internal Usage of URIs`_

Each file and directory in a Tahoe-LAFS file store is described by a "URI".
There are different kinds of URIs for different kinds of objects, and there
are different kinds of URIs to provide different kinds of access to those
objects. Each URI is a string representation of a "capability" or "cap", and
there are read-caps, write-caps, verify-caps, and others.

Each URI provides both ``location`` and ``identification`` properties.
``location`` means that holding the URI is sufficient to locate the data it
represents (this means it contains a storage index or a lookup key, whatever
is necessary to find the place or places where the data is being kept).
``identification`` means that the URI also serves to validate the data: an
attacker who wants to trick you into using the wrong data will be
limited in their abilities by the identification properties of the URI.

Some URIs are subsets of others. In particular, if you know a URI which
allows you to modify some object, you can produce a weaker read-only URI and
give it to someone else, and they will be able to read that object but not
modify it. Directories, for example, have a read-cap which is derived from
the write-cap: anyone with read/write access to the directory can produce a
limited URI that grants read-only access, but not the other way around.

``src/allmydata/uri.py`` is the main place where URIs are processed. It is
the authoritative definition point for all the URI types described
herein.

File URIs
=========

The lowest layer of the Tahoe architecture (the "key-value store") is
responsible for mapping URIs to data. This is basically a distributed
hash table, in which the URI is the key, and some sequence of bytes is
the value.

There are two kinds of entries in this table: immutable and mutable. For
immutable entries, the URI represents a fixed chunk of data. The URI itself
is derived from the data when it is uploaded into the grid, and can be used
to locate and download that data from the grid at some time in the future.

For mutable entries, the URI identifies a "slot" or "container", which can be
filled with different pieces of data at different times.

It is important to note that the values referenced by these URIs are just
sequences of bytes, and that **no** filenames or other metadata are retained at
this layer. The file store layer (which sits above the key-value store layer)
is entirely responsible for directories and filenames and the like.

CHK URIs
--------

CHK (Content Hash Keyed) files are immutable sequences of bytes. They are
uploaded in a distributed fashion using a "storage index" (for the "location"
property), and encrypted using a "read key". A secure hash of the data is
computed to help validate the data afterwards (providing the "identification"
property). All of these pieces, plus information about the file's size and
the number of shares into which it has been distributed, are put into the
"CHK" URI. The storage index is derived by hashing the read key (using a
tagged SHA-256d hash, then truncated to 128 bits), so it does not need to be
physically present in the URI.

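The derivation of the storage index from the read key can be sketched in a
few lines of Python. This is a minimal illustration, not the code in
``src/allmydata``: the tag value ``EXAMPLE_TAG`` and the netstring framing of
the tag are assumptions made for this example (this document specifies
neither), and the helper names are hypothetical.

```python
import hashlib


def netstring(data: bytes) -> bytes:
    """Frame data as a netstring: b"<length>:<data>,"."""
    return b"%d:%s," % (len(data), data)


def sha256d(data: bytes) -> bytes:
    """SHA-256d: SHA-256 applied to the SHA-256 digest of the data."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


# Hypothetical tag, chosen for this example only.
EXAMPLE_TAG = b"example_key_to_storage_index_v1"


def storage_index_from_read_key(read_key: bytes, tag: bytes = EXAMPLE_TAG) -> bytes:
    """Hash the 16-byte read key under a tag and keep the first 128 bits."""
    if len(read_key) != 16:
        raise ValueError("read key must be 16 bytes")
    # Assumed framing: the tag is prefixed to the key as a netstring.
    return sha256d(netstring(tag) + read_key)[:16]


storage_index = storage_index_from_read_key(bytes(16))
print(len(storage_index))  # 16 bytes, i.e. 128 bits
```

Because the hash is one-way, anyone holding the read key can recompute the
storage index, while the storage index alone does not reveal the read key.
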
The current format for CHK URIs is the concatenation of the following
strings::

  URI:CHK:(key):(hash):(needed-shares):(total-shares):(size)

Where (key) is the base32 encoding of the 16-byte AES read key, (hash) is the
base32 encoding of the SHA-256 hash of the URI Extension Block,
(needed-shares) is an ASCII decimal representation of the number of shares
required to reconstruct this file, (total-shares) is the same representation
of the total number of shares created, and (size) is an ASCII decimal
representation of the size of the data represented by this URI. All base32
encodings are expressed in lower-case, with the trailing '=' signs removed.

For example, the following is a CHK URI, generated from a previous version of
the contents of :doc:`architecture.rst<../architecture>`::

  URI:CHK:ihrbeov7lbvoduupd4qblysj7a:bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733

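To make the field layout concrete, the following sketch splits a CHK URI
into its five fields using only the Python standard library. It is an
illustration of the format described above, not the parser in
``src/allmydata/uri.py``; the names ``CHKFields``, ``b32decode_unpadded`` and
``parse_chk_uri`` are hypothetical.

```python
import base64
from typing import NamedTuple


class CHKFields(NamedTuple):
    key: bytes           # 16-byte AES read key
    ueb_hash: bytes      # 32-byte SHA-256 hash of the URI Extension Block
    needed_shares: int
    total_shares: int
    size: int


def b32decode_unpadded(text: str) -> bytes:
    """Decode lower-case base32 whose trailing '=' padding was removed."""
    padding = "=" * (-len(text) % 8)
    return base64.b32decode(text.upper() + padding)


def parse_chk_uri(uri: str) -> CHKFields:
    parts = uri.split(":")
    if len(parts) != 7 or parts[:2] != ["URI", "CHK"]:
        raise ValueError("not a CHK URI")
    key, ueb_hash = (b32decode_unpadded(p) for p in parts[2:4])
    needed, total, size = (int(p) for p in parts[4:7])
    return CHKFields(key, ueb_hash, needed, total, size)


example = ("URI:CHK:ihrbeov7lbvoduupd4qblysj7a:"
           "bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733")
fields = parse_chk_uri(example)
print(len(fields.key), len(fields.ueb_hash))                   # 16 32
print(fields.needed_shares, fields.total_shares, fields.size)  # 3 10 28733
```
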
Historical note: The name "CHK" is somewhat inaccurate and continues to be
used for historical reasons. "Content Hash Key" means that the encryption key
is derived by hashing the contents, which gives the useful property that
encoding the same file twice will result in the same URI. However, this is an
optional step: by passing a different flag to the appropriate API call, Tahoe
will generate a random encryption key instead of hashing the file: this gives
the useful property that the URI or storage index does not reveal anything
about the file's contents (except filesize), which improves privacy. The
URI:CHK: prefix really indicates that an immutable file is in use, without
saying anything about how the key was derived.


LIT URIs
--------

LITeral files are also immutable sequences of bytes, but they are so short
that the data is stored inside the URI itself. These are used for files of 55
bytes or shorter, which is the point at which the LIT URI is the same length
as a CHK URI would be.

LIT URIs do not require an upload or download phase, as their data is stored
directly in the URI.

The format of a LIT URI is simply a fixed prefix concatenated with the base32
encoding of the file's data::

  URI:LIT:bjuw4y3movsgkidbnrwg26lemf2gcl3xmvrc6kropbuhi3lmbi

The LIT URI for an empty file is "URI:LIT:", and the LIT URI for a 5-byte
file that contains the string "hello" is "URI:LIT:nbswy3dp".

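Since a LIT URI is only a prefix plus unpadded lower-case base32, encoding
and decoding can be sketched with the standard library alone. The function
names below are hypothetical and the sketch does not enforce the 55-byte
limit; it only illustrates the format described above.

```python
import base64

LIT_PREFIX = "URI:LIT:"


def make_lit_uri(data: bytes) -> str:
    """Encode file contents as a LIT URI (lower-case base32, no '=' padding)."""
    encoded = base64.b32encode(data).decode("ascii")
    return LIT_PREFIX + encoded.lower().rstrip("=")


def read_lit_uri(uri: str) -> bytes:
    """Recover the file contents stored inside a LIT URI."""
    if not uri.startswith(LIT_PREFIX):
        raise ValueError("not a LIT URI")
    body = uri[len(LIT_PREFIX):]
    padding = "=" * (-len(body) % 8)  # restore the stripped padding
    return base64.b32decode(body.upper() + padding)


print(make_lit_uri(b""))                 # URI:LIT:
print(make_lit_uri(b"hello"))            # URI:LIT:nbswy3dp
print(read_lit_uri("URI:LIT:nbswy3dp"))  # b'hello'
```
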
Mutable File URIs
-----------------

The other kind of DHT entry is the "mutable slot", in which the URI names a
container into which data can be placed, and from which it can be retrieved,
without changing the identity of the container.

These slots have write-caps (which allow read/write access), read-caps (which
only allow read access), and verify-caps (which allow a file checker/repairer
to confirm that the contents exist, but do not let it decrypt the
contents).

Mutable slots use public key technology to provide data integrity, and put a
hash of the public key in the URI. As a result, the data validation is
limited to confirming that the data retrieved matches *some* data that was
uploaded in the past, but not *which* version of that data.

The format of the write-cap for mutable files is::

  URI:SSK:(writekey):(fingerprint)

Where (writekey) is the base32 encoding of the 16-byte AES encryption key
that is used to encrypt the RSA private key, and (fingerprint) is the base32
encoded 32-byte SHA-256 hash of the RSA public key. For more details about
the way these keys are used, please see :doc:`mutable`.

The format for mutable read-caps is::

  URI:SSK-RO:(readkey):(fingerprint)

The read-cap is just like the write-cap except it contains the other AES
encryption key: the one used for encrypting the mutable file's contents. This
second key is derived by hashing the writekey, which allows the holder of a
write-cap to produce a read-cap, but not the other way around. The
fingerprint is the same in both caps.

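The one-way relationship between the two caps can be sketched as follows.
This is an illustration under stated assumptions, not the real derivation:
the tag ``EXAMPLE_TAG`` and the use of a single SHA-256 truncated to 16 bytes
are placeholders chosen for this example, and all function names are
hypothetical. The actual hash construction is defined by the implementation.

```python
import base64
import hashlib

# Hypothetical tag, chosen for this example only.
EXAMPLE_TAG = b"example_writekey_to_readkey_v1"


def b32encode_unpadded(data: bytes) -> str:
    return base64.b32encode(data).decode("ascii").lower().rstrip("=")


def b32decode_unpadded(text: str) -> bytes:
    return base64.b32decode(text.upper() + "=" * (-len(text) % 8))


def derive_read_key(write_key: bytes) -> bytes:
    """One-way step: hash the write key and keep 16 bytes for the read key."""
    return hashlib.sha256(EXAMPLE_TAG + write_key).digest()[:16]


def read_cap_from_write_cap(write_cap: str) -> str:
    """Turn URI:SSK:(writekey):(fingerprint) into the matching URI:SSK-RO: cap."""
    prefix, kind, write_key, fingerprint = write_cap.split(":")
    if (prefix, kind) != ("URI", "SSK"):
        raise ValueError("not a mutable-file write-cap")
    read_key = derive_read_key(b32decode_unpadded(write_key))
    # The fingerprint is copied unchanged; only the key field differs.
    return "URI:SSK-RO:%s:%s" % (b32encode_unpadded(read_key), fingerprint)


# Made-up key and fingerprint values, used only to exercise the sketch.
write_cap = "URI:SSK:%s:%s" % (b32encode_unpadded(bytes(16)),
                               b32encode_unpadded(bytes(32)))
print(read_cap_from_write_cap(write_cap))
```

Going the other way would require inverting the hash, which is why a
read-cap holder cannot recover the write-cap.
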
Historical note: the "SSK" prefix is a perhaps-inaccurate reference to
"Sub-Space Keys" from the Freenet project, which uses a vaguely similar
structure to provide mutable file access.


Directory URIs
==============

The key-value store layer provides a mapping from URI to data. To turn this
into a graph of directories and files, the "file store" layer (which sits on
top of the key-value store layer) needs to keep track of "directory nodes",
or "dirnodes" for short. :doc:`dirnodes` describes how these work.

Dirnodes are contained inside mutable files, and are thus simply a particular
way to interpret the contents of these files. As a result, a directory
write-cap looks a lot like a mutable-file write-cap::

  URI:DIR2:(writekey):(fingerprint)

Likewise directory read-caps (which provide read-only access to the
directory) look much like mutable-file read-caps::

  URI:DIR2-RO:(readkey):(fingerprint)

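Because each kind of cap in this document has a distinct prefix, a string
can be classified by prefix alone. The sketch below covers only the six
prefixes described here; the table, the access labels, and the function name
``classify_cap`` are illustrative assumptions rather than part of
``src/allmydata/uri.py``.

```python
# Prefix -> (kind of object, access granted), per the formats in this document.
CAP_PREFIXES = {
    "URI:CHK:": ("immutable file", "read"),
    "URI:LIT:": ("literal file", "read"),
    "URI:SSK:": ("mutable file", "write"),
    "URI:SSK-RO:": ("mutable file", "read"),
    "URI:DIR2:": ("directory", "write"),
    "URI:DIR2-RO:": ("directory", "read"),
}


def classify_cap(uri: str):
    """Return (kind, access) for a URI string, based only on its prefix."""
    for prefix, description in CAP_PREFIXES.items():
        if uri.startswith(prefix):
            return description
    raise ValueError("unrecognized URI prefix")


print(classify_cap("URI:DIR2-RO:readkey:fingerprint"))  # ('directory', 'read')
print(classify_cap("URI:LIT:nbswy3dp"))                 # ('literal file', 'read')
```

Each prefix ends with a colon, so ``URI:SSK:`` never matches a string that
starts with ``URI:SSK-RO:``, and likewise for the two ``DIR2`` prefixes.
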
Historical note: the "DIR2" prefix is used because the non-distributed
dirnodes in earlier Tahoe releases had already claimed the "DIR" prefix.


Internal Usage of URIs
======================

The classes in ``src/allmydata/uri.py`` are used to pack and unpack these
various kinds of URIs. Three Interfaces are defined (IURI, IFileURI, and
IDirnodeURI) which are implemented by these classes, and string-to-URI-class
conversion routines have been registered as adapters, so that code which
wants to extract e.g. the size of a CHK or LIT URI can do::

  print(IFileURI(uri).get_size())

If the URI does not represent a CHK or LIT URI (for example, if it was for a
directory instead), the adaptation will fail, raising a TypeError inside the
IFileURI() call.

Several utility methods are provided on these objects. The most important is
``to_string()``, which returns the string form of the URI. Therefore
``IURI(uri).to_string() == uri`` is true for any valid URI. See the IURI class
in ``src/allmydata/interfaces.py`` for more details.
