[tahoe-lafs-trac-stream] [tahoe-lafs] #2018: padding to hide the size of plaintexts

tahoe-lafs trac at tahoe-lafs.org
Mon Feb 10 17:24:49 UTC 2014


#2018: padding to hide the size of plaintexts
------------------------------+--------------------------------------------
     Reporter:  zooko         |      Owner:  nejucomo
         Type:  enhancement   |     Status:  new
     Priority:  normal        |  Milestone:  undecided
    Component:  code-encoding |    Version:  1.10.0
   Resolution:                |   Keywords:  confidentiality privacy
Launchpad Bug:                |              compression newcaps research
------------------------------+--------------------------------------------

Comment (by nickm):

 32*ceil(log2(F)) doesn't really hide file sizes so well; are you sure it's
 what you mean?  If a file is 1.1MB long, should the maximum padding
 _really_ be only 32 * 21 == 672 bytes?  That doesn't seem big enough to
 obscure the true file size.
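 
 (For scale, a quick back-of-the-envelope check in Python, using a few
 arbitrary example sizes:)
 {{{
 # Maximum padding under the 32*ceil(log2(F)) proposal, as a fraction
 # of the file size, for a few example sizes.
 for F in (1000, 100000, 1100000, 100000000):
     max_pad = 32 * (F - 1).bit_length()   # == 32*ceil(log2(F)) for F > 1
     print("%11d bytes -> max pad %4d bytes (%.4f%% of the file)"
           % (F, max_pad, 100.0 * max_pad / F))
 }}}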

 Here's a simple experiment I just did.  It's not a very good one, but
 with any luck others can refine it into something more scientifically
 sound.  I took all the music files on my computer (about 16000 of them)
 and built a database of their file sizes.  (I chose music files because
 their sizes fall in a similar range to one another without being overly
 homogeneous, and because I had a lot of them.  More thought would
 identify a better corpus.)

 {{{
 % find ~/Music/*/ -type f -print0 | xargs -0 ls -l | perl -ne '@a = split; print "$a[4]\n";' > sizes
 % wc -l sizes
 16275 sizes
 }}}

 At the start, nearly every file size was unique.
 {{{
 % sort -n sizes | uniq -c  |grep '^ *1 ' |wc -l
 16094
 }}}

 Next I tried a "round the size up to the nearest multiple of max_pad"
 rule, taking max_pad as 32*ceil(log2(F)).
 {{{
 % python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l
 9671
 }}}
 So, more than half of my files still have a unique size.  That's probably
 not so good.
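 
 (fsize.py itself isn't included in this message; a minimal stand-in for
 the round-up rule above, reading one size per line on stdin and printing
 the padded size, might look like:)
 {{{
 import sys

 for line in sys.stdin:
     F = int(line)
     if F <= 1:
         print(F)  # degenerate sizes; log2 is zero or undefined
         continue
     # (F - 1).bit_length() == ceil(log2(F)) for F > 1, with no float error
     max_pad = 32 * (F - 1).bit_length()
     # round F up to the nearest multiple of max_pad
     print(((F + max_pad - 1) // max_pad) * max_pad)
 }}}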

 (I chose "rounding up" rather than "randomizing" since it makes the
 attacker's job strictly harder and the analysis much simpler: rounding
 leaks nothing beyond which bucket a file falls in, whereas an attacker
 who observes the same file padded with fresh randomness more than once
 can narrow down its true size.)

 Next I tried the rule: Let N = ceil(log2(F)).  If N >= 5, let X = N - 5;
 otherwise let X = N.  Let MaxPad = 2^X.  Round the file size up to the
 nearest multiple of MaxPad.  This time:
 {{{
 % python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l
 65
 }}}
 Only 65 files had unique sizes.  Note also that the padding overhead
 stays modest: for large files, MaxPad is at most 1/16 (and at least
 1/32) of the file size.
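 
 (The corresponding sketch for this second rule, with the same
 stdin/stdout convention as above; only the MaxPad computation changes:)
 {{{
 import sys

 def pad(F):
     if F <= 1:
         return F
     N = (F - 1).bit_length()        # ceil(log2(F)) for F > 1
     X = N - 5 if N >= 5 else N
     max_pad = 2 ** X
     return ((F + max_pad - 1) // max_pad) * max_pad  # round up

 for line in sys.stdin:
     print(pad(int(line)))
 }}}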


 (A better experiment would pick a more informative metric than "how many
 files have a unique size?"  The count of unique-sized files doesn't
 capture the notion that having three files with the same padded size is
 better than having two.  Maybe something entropy-based would be the
 right thing here.)
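 
 (For instance: under a uniform prior over the corpus, the Shannon
 entropy of the padded-size distribution is exactly the expected number
 of bits the observed size leaks about which file it is: zero if
 everything pads to one size, log2(n) if every size is unique.  A sketch
 that reads padded sizes on stdin:)
 {{{
 import sys
 from collections import Counter
 from math import log

 counts = Counter(int(line) for line in sys.stdin)
 n = sum(counts.values())
 H = -sum(c / float(n) * log(c / float(n), 2) for c in counts.values())
 print("%.3f bits leaked (maximum possible: %.3f)" % (H, log(n, 2)))
 }}}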

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2018#comment:15>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

