zfec -- efficient, portable erasure coding tool
===============================================

Generate redundant blocks of information such that if some of the blocks are
lost then the original data can be recovered from the remaining blocks. This
package includes command-line tools, a C API, a Python API, and a Haskell
API.


Intro and Licence
-----------------

This package implements an "erasure code", or "forward error correction
code".

You may use this package under the GNU General Public License, version 2 or,
at your option, any later version. Alternatively, you may use this package
under the Transitive Grace Period Public Licence, version 1.0 or, at your
option, any later version. (That is, you may choose whichever of the two
licences you prefer.) See the file COPYING.GPL for the terms of the GNU
General Public License, version 2. See the file COPYING.TGPPL.rst for the
terms of the Transitive Grace Period Public Licence, version 1.0.

The most widely known example of an erasure code is the RAID-5 algorithm,
which allows the stored data to be recovered completely after the loss of
any one hard drive. The algorithm in the zfec package has a similar effect,
but instead of tolerating the loss of only a single element, it can be
parameterized to choose in advance the number of elements whose loss it can
tolerate.

This package is largely based on the old "fec" library by Luigi Rizzo et al.,
which is a mature and optimized implementation of erasure coding. The zfec
package makes several changes from the original "fec" package, including
addition of the Python API, refactoring of the C API to support zero-copy
operation, a few clean-ups and optimizations of the core code itself, and the
addition of a command-line tool named "zfec".

Installation
------------

This package is managed with the "setuptools" package management tool. To
build and install the package directly into your system, just run ``python
./setup.py install``. If you prefer to keep the package limited to a
specific directory so that you can manage it yourself (perhaps by using the
"GNU stow" tool), then give it these arguments: ``python ./setup.py install
--single-version-externally-managed
--record=${specificdirectory}/zfec-install.log
--prefix=${specificdirectory}``

To run the self-tests, execute ``python ./setup.py test`` (or, if you have
Twisted Python installed, run ``trial zfec`` for nicer output and more test
options). This will run the tests of the C API, the Python API, and the
command-line tools.

To run the tests of the Haskell API: ``runhaskell haskell/test/FECTest.hs``

Note that you must install the library before running the Haskell API tests,
because the interpreter cannot process FEC.hs on its own, as it takes a
reference to an FFI function.


Community
---------

The source is currently available via darcs on the web with the command::

  darcs get https://tahoe-lafs.org/source/zfec/trunk

More information on darcs is available at http://darcs.net

Please post about zfec to the Tahoe-LAFS mailing list and contribute patches:

<https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev>


Overview
--------

This package performs two operations, encoding and decoding. Encoding takes
some input data and expands its size by producing extra "check blocks", also
called "secondary blocks". Decoding takes some data -- any combination of
blocks of the original data (called "primary blocks") and "secondary
blocks" -- and produces the original data.

The encoding is parameterized by two integers, k and m. m is the total
number of blocks produced, and k is how many of those blocks are necessary to
reconstruct the original data. m is required to be at least 1 and at most
256, and k is required to be at least 1 and at most m.
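
As a toy illustration of the k-of-m idea, consider k=2, m=3 with a single
XOR parity block as the one secondary block. This is not zfec's actual
algorithm (zfec uses Galois-field arithmetic and supports arbitrary k and m);
it only shows how any k of the m blocks suffice:

```python
# Toy k=2, m=3 erasure code: two primary blocks plus one XOR parity
# block.  Any two of the three blocks recover the original data.
# This is NOT zfec's algorithm -- just an illustration of the idea.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def toy_encode(data):
    half = len(data) // 2           # assume even length for simplicity
    p0, p1 = data[:half], data[half:]
    parity = xor_bytes(p0, p1)      # the single secondary block
    return {0: p0, 1: p1, 2: parity}

def toy_decode(blocks):
    """Recover the original data from any 2 of the 3 blocks."""
    if 0 in blocks and 1 in blocks:
        return blocks[0] + blocks[1]
    if 0 in blocks:                 # block 1 lost: rebuild it from parity
        return blocks[0] + xor_bytes(blocks[0], blocks[2])
    # block 0 lost: rebuild it from parity
    return xor_bytes(blocks[1], blocks[2]) + blocks[1]

blocks = toy_encode(b"abcdwxyz")
del blocks[1]                       # lose one block
print(toy_decode(blocks))           # b'abcdwxyz'
```

A single XOR parity block can only tolerate one loss; zfec's code tolerates
the loss of any m-k blocks.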

(Note that when k == m then there is no point in doing erasure coding -- it
degenerates to the equivalent of the Unix "split" utility, which simply
splits the input into successive segments. Similarly, when k == 1 it
degenerates to the equivalent of the Unix "cp" utility -- each block is a
complete copy of the input data.)

Note that each "primary block" is a segment of the original data, so its size
is 1/k'th of the size of the original data, and each "secondary block" is of
the same size, so the total space used by all the blocks is m/k times the
size of the original data (plus some padding to fill out the last primary
block to be the same size as all the others). In addition to the data
contained in the blocks themselves there are also a few pieces of metadata
which are necessary for later reconstruction. Those pieces are: 1. the
value of k, 2. the value of m, 3. the sharenum of each block, 4. the
number of bytes of padding that were used. The "zfec" command-line tool
compresses these pieces of data and prepends them to the beginning of each
share, so each sharefile produced by the "zfec" command-line tool is between
one and four bytes larger than the share data alone.
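
The sizes described above can be computed directly; here is a small sketch
(the function name is illustrative, not part of zfec's API):

```python
import math

def share_sizes(data_len, k, m):
    """Compute block size, padding, and total encoded size for the
    m/k expansion described above."""
    block_size = math.ceil(data_len / k)  # each block is 1/k'th of the data
    padding = block_size * k - data_len   # zero-fill for the last primary block
    total = block_size * m                # m blocks of equal size
    return block_size, padding, total

# 1000 bytes with k=3, m=5: blocks of 334 bytes, 2 bytes of padding,
# and 1670 bytes total, i.e. an m/k = 5/3 expansion.
print(share_sizes(1000, 3, 5))            # (334, 2, 1670)
```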

The decoding step requires as input k of the blocks which were produced by
the encoding step. The decoding step produces as output the data that was
earlier input to the encoding step.


Command-Line Tool
-----------------

The bin/ directory contains two Unix-style command-line tools, "zfec" and
"zunfec". Execute ``zfec --help`` or ``zunfec --help`` for usage
instructions.


Performance
-----------

To run the benchmarks, execute the included bench/bench_zfec.py script with
optional --k= and --m= arguments.

On my Athlon 64 2.4 GHz workstation (running Linux), the "zfec" command-line
tool encoded a 160 MB file with m=100, k=94 (about 6% redundancy) in 3.9
seconds, whereas the "par2" tool encoded the file with about 6% redundancy in
27 seconds. zfec encoded the same file with m=12, k=6 (100% redundancy) in
4.1 seconds, whereas par2 encoded it with about 100% redundancy in 7 minutes
and 56 seconds.

The underlying C library in benchmark mode encoded from a file at about 4.9
million bytes per second and decoded at about 5.8 million bytes per second.

On Peter's fancy Intel Mac laptop (2.16 GHz Core Duo), it encoded from a file
at about 6.2 million bytes per second.

On my even fancier Intel Mac laptop (2.33 GHz Core Duo), it encoded from a
file at about 6.8 million bytes per second.

On my old PowerPC G4 867 MHz Mac laptop, it encoded from a file at about 1.3
million bytes per second.

Here is a paper analyzing the performance of various erasure codes and their
implementations, including zfec:

http://www.usenix.org/events/fast09/tech/full_papers/plank/plank.pdf

Zfec shows good performance on different machines and with different values
of k and m. It also has a nice small memory footprint.


API
---

Each block is associated with a "blocknum". The blocknum of each primary
block is its index (starting from zero), so the 0'th block is the first
primary block, which is the first few bytes of the file, the 1'st block is
the next primary block, which is the next few bytes of the file, and so on.
The last primary block has blocknum k-1. The blocknum of each secondary
block is an arbitrary integer between k and 255 inclusive. (When using the
Python API, if you don't specify which secondary blocks you want when
invoking encode(), then it will by default provide the blocks with ids from
k to m-1 inclusive.)
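
In other words (illustrative helper functions, not part of zfec's API):

```python
def is_primary(blocknum, k):
    # primary blocks are the k segments of the original data, ids 0..k-1
    return 0 <= blocknum < k

def default_secondary_blocknums(k, m):
    # the Python API's default when no blocknums are requested:
    # the secondary blocks, ids k..m-1
    return list(range(k, m))

print([b for b in range(5) if is_primary(b, 3)])   # [0, 1, 2]
print(default_secondary_blocknums(3, 5))           # [3, 4]
```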

- C API

  fec_encode() takes as input an array of k pointers, where each pointer
  points to a memory buffer containing the input data (i.e., the i'th buffer
  contains the i'th primary block). There is also a second parameter which
  is an array of the blocknums of the secondary blocks which are to be
  produced. (Each element in that array is required to be the blocknum of a
  secondary block, i.e. it is required to be >= k and < m.)

  The output from fec_encode() is the requested set of secondary blocks,
  which are written into output buffers provided by the caller.

  Note that this fec_encode() is a "low-level" API in that it requires the
  input data to be provided in a set of memory buffers of exactly the right
  sizes. If you are starting instead with a single buffer containing all of
  the data then please see easyfec.py's "class Encoder" as an example of how
  to split a single large buffer into the appropriate set of input buffers
  for fec_encode(). If you are starting with a file on disk, then please see
  filefec.py's encode_file_stringy_easyfec() for an example of how to read
  the data from a file and pass it to "class Encoder". The Python interface
  provides these higher-level operations, as does the Haskell interface. If
  you implement functions to do these higher-level tasks in other languages,
  please send a patch to tahoe-dev@tahoe-lafs.org so that your API can be
  included in future releases of zfec.

  fec_decode() takes as input an array of k pointers, where each pointer
  points to a buffer containing a block. There is also a separate input
  parameter which is an array of blocknums, indicating the blocknum of each
  of the blocks which is being passed in.

  The output from fec_decode() is the set of primary blocks which were
  missing from the input and had to be reconstructed. These reconstructed
  blocks are written into output buffers provided by the caller.
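
The buffer-splitting step that easyfec.py performs for the low-level API can
be sketched in pure Python (a simplified stand-in for "class Encoder", not
zfec's actual code):

```python
def split_into_primary_blocks(data, k):
    """Pad data to a multiple of k and cut it into k equal segments,
    the shape of input that a low-level encoder like fec_encode()
    expects.  Returns the k blocks and the number of padding bytes."""
    block_size = (len(data) + k - 1) // k      # ceiling division
    padlen = block_size * k - len(data)
    padded = data + b"\x00" * padlen           # zero-fill the last block
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(k)]
    return blocks, padlen

blocks, padlen = split_into_primary_blocks(b"hello world", 4)
print(blocks, padlen)   # [b'hel', b'lo ', b'wor', b'ld\x00'] 1
```

Note that padlen is exactly the piece of metadata (item 4 in the Overview)
that must be recorded so the padding can be stripped after decoding.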


- Python API

  encode() and decode() take as input a sequence of k buffers, where a
  "sequence" is any object that implements the Python sequence protocol (such
  as a list or tuple) and a "buffer" is any object that implements the Python
  buffer protocol (such as a string or array). The contents that are
  required to be present in these buffers are the same as for the C API.

  encode() also takes a list of desired blocknums. Unlike the C API, the
  Python API accepts blocknums of primary blocks as well as secondary blocks
  in its list of desired blocknums. encode() returns a list of buffer
  objects which contain the blocks requested. For each requested block which
  is a primary block, the resulting list contains a reference to the
  appropriate primary block from the input list. For each requested block
  which is a secondary block, the list contains a newly created string object
  containing that block.

  decode() also takes a list of integers indicating the blocknums of the
  blocks being passed in. decode() returns a list of buffer objects which
  contain all of the primary blocks of the original data (in order). For
  each primary block which was present in the input list, the result list
  simply contains a reference to the object that was passed in the input
  list. For each primary block which was not present in the input, the
  result list contains a newly created string object containing that primary
  block.

  Beware of a "gotcha" that can result from the combination of mutable data
  and the fact that the Python API returns references to inputs when
  possible.

  Returning references to its inputs is efficient since it avoids making an
  unnecessary copy of the data, but if the object which was passed as input
  is mutable and if that object is mutated after the call to zfec returns,
  then the result from zfec -- which is just a reference to that same object
  -- will also be mutated. This subtlety is the price you pay for avoiding
  data copying. If you don't want to have to worry about this then you can
  simply use immutable objects (e.g. Python strings) to hold the data that
  you pass to zfec.
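
This gotcha can be reproduced without zfec at all; any function that returns
references to mutable inputs behaves the same way (the function here is a
hypothetical stand-in, not zfec's API):

```python
def pass_through(blocks):
    # like zfec's encode()/decode(), return references to the input
    # objects rather than copies
    return list(blocks)

block = bytearray(b"AAAA")          # a mutable buffer
result = pass_through([block])

block[0] = ord("X")                 # mutate the input after the call...
print(bytes(result[0]))             # b'XAAA' -- the result changed too!

safe = pass_through([bytes(block)]) # immutable bytes sidestep the issue
block[1] = ord("Y")
print(bytes(safe[0]))               # b'XAAA' -- unaffected
```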

- Haskell API

  The Haskell code is fully Haddocked. To generate the documentation, run
  ``runhaskell Setup.lhs haddock``.


Utilities
---------

The filefec.py module has a utility function for efficiently reading a file
and encoding it piece by piece. This module is used by the "zfec" and
"zunfec" command-line tools from the bin/ directory.


Dependencies
------------

A C compiler is required. To use the Python API or the command-line tools, a
Python interpreter is also required. We have tested it with Python v2.4,
v2.5, v2.6, and v2.7. For the Haskell interface, GHC >= 6.8.1 is required.


Acknowledgements
----------------

Thanks to the author of the original fec lib, Luigi Rizzo, and the folks that
contributed to it: Phil Karn, Robert Morelos-Zaragoza, Hari Thirumoorthy, and
Dan Rubenstein. Thanks to the Mnet hackers who wrote an earlier Python
wrapper, especially Myers Carpenter and Hauke Johannknecht. Thanks to Brian
Warner and Amber O'Whielacronx for help with the API, documentation,
debugging, compression, and unit tests. Thanks to Adam Langley for improving
the C API and contributing the Haskell API. Thanks to the creators of GCC
(starting with Richard M. Stallman) and Valgrind (starting with Julian
Seward) for a pair of excellent tools. Thanks to my coworkers at Allmydata
-- http://allmydata.com -- Fabrice Grinda, Peter Secor, Rob Kinninmont, Brian
Warner, Zandr Milewski, Justin Boreta, and Mark Meras for sponsoring this
work and releasing it under a Free Software licence. Thanks to Jack Lloyd,
Samuel Neves, and David-Sarah Hopwood.


Related Works
-------------

Note: a Unix-style tool like "zfec" does only one thing -- in this case
erasure coding -- and leaves other tasks to other tools. Other Unix-style
tools that go well with zfec include `GNU tar`_ for archiving multiple files
and directories into one file, `lzip`_ for compression, and `GNU Privacy
Guard`_ for encryption or `b2sum`_ for integrity. It is important to do
things in order: first archive, then compress, then either encrypt or
integrity-check, then erasure code. Note that if GNU Privacy Guard is used
for privacy, then it will also ensure integrity, so the use of b2sum is
unnecessary in that case. Note also that you need to do integrity checking
(such as with b2sum) on the blocks that result from the erasure coding in
*addition* to doing it on the file contents! (There are two different subtle
failure modes -- see "more than one file can match an immutable file cap" on
the `Hack Tahoe-LAFS!`_ Hall of Fame.)

The `Tahoe-LAFS`_ project uses zfec as part of a complete distributed
filesystem with integrated encryption, integrity, remote distribution of the
blocks, directory structure, backup of changed files or directories, access
control, immutable files and directories, proof-of-retrievability, and repair
of damaged files and directories.

`fecpp`_ is an alternative to zfec. It implements an algorithm that is
bitwise-compatible with zfec's, and it is BSD-licensed.

.. _GNU tar: http://directory.fsf.org/project/tar/
.. _lzip: http://www.nongnu.org/lzip/lzip.html
.. _GNU Privacy Guard: http://gnupg.org/
.. _b2sum: https://blake2.net/
.. _Tahoe-LAFS: https://tahoe-lafs.org
.. _Hack Tahoe-LAFS!: https://tahoe-lafs.org/hacktahoelafs/
.. _fecpp: http://www.randombit.net/code/fecpp/


Enjoy!

Zooko Wilcox-O'Hearn

2013-05-15

Boulder, Colorado