[tahoe-lafs-trac-stream] [tahoe-lafs] #1354: compression (e.g. to efficiently store sparse files)

tahoe-lafs trac at tahoe-lafs.org
Thu Feb 3 21:52:51 PST 2011


#1354: compression (e.g. to efficiently store sparse files)
-------------------------------+--------------------------------------------
     Reporter:  zooko          |       Owner:           
         Type:  enhancement    |      Status:  new      
     Priority:  major          |   Milestone:  undecided
    Component:  code-encoding  |     Version:  1.8.2    
   Resolution:                 |    Keywords:           
Launchpad Bug:                 |  
-------------------------------+--------------------------------------------

Comment (by warner):

 Hm. You could compress data a chunk at a time, watching the output size
 until it grew above some min-segment-size threshold, then flush the
 compression stream and declare end-of-segment. Then start again with the
 remaining data, repeat. Now you've got a list of compressed segments and a
 table with the amount of plaintext that went into each one. You encrypt
 the table with the readcap and store it in each share. You also store
 (unencrypted) the table of ciphertext segment sizes. (Unlike the
 uncompressed case, the plaintext-segment-size table will differ
 significantly from the ciphertext-segment-size table.)

 Alacrity would rise: you'd have to download the whole encrypted-segment-
 size table (which is O(filesize), although the multiplier is very small,
 something like 8 bytes per segment). There's probably a clever O(log(N))
 scheme lurking in there somewhere, but I expect it'd involve adding
 roundtrips (you store multiple layers of offset tables: first you fetch
 the coarse one that tells you which parts of the next-finer-grained table
 you need to fetch, then the last table you fetch has actual segment
 offsets).
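 To see why the whole table is needed up front, here is a sketch of the
 flat-table lookup: mapping a plaintext byte offset to a segment
 requires the cumulative plaintext sizes of every earlier segment
 (function and variable names are assumptions):

```python
import bisect
import itertools

def locate(offset, plaintext_sizes):
    """Map a plaintext byte offset to (segment index, offset within
    that segment's plaintext) using the per-segment size table."""
    # cumulative end offset of each segment's plaintext
    ends = list(itertools.accumulate(plaintext_sizes))
    if offset >= ends[-1]:
        raise IndexError("offset past end of file")
    seg = bisect.bisect_right(ends, offset)
    start = ends[seg - 1] if seg else 0
    return seg, offset - start
```

 The O(log(N)) variant would replace the single `ends` list with layered
 coarse/fine tables, trading table-download size for extra roundtrips.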

 This scheme requires a compression library that either avoids deep
 pipelining or is willing to tell you how much more compressed output would
 be emitted if you did a flush() right now. I don't think most libraries
 have this property. You declare a segment finished as soon as you've
 emitted, say, 1MB of compressed data, then tell the compressor to flush
 the pipeline and add whatever else it gives you to the segment. The
 concern is that you could wind up with a segment significantly larger
 than 1MB if the pipeline is deep.
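 As one data point, zlib does let you drain its pipeline mid-stream with
 Z_SYNC_FLUSH, so you can at least observe after the fact how much extra
 output a flush emits (this probe is illustrative; it doesn't tell you
 the flushed size *before* you flush):

```python
import zlib

comp = zlib.compressobj()
# zlib buffers aggressively: most of the output for highly
# compressible input only appears when the pipeline is drained
head = comp.compress(b"x" * (1 << 20))       # 1 MiB of input
tail = comp.flush(zlib.Z_SYNC_FLUSH)         # drain, keep stream open
print(len(head), len(tail))  # bytes before vs. at the flush
```

 With Z_SYNC_FLUSH the output so far is byte-aligned and decodable, so
 everything fed in before the flush is recoverable from head + tail.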

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1354#comment:3>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage


More information about the tahoe-lafs-trac-stream mailing list