[tahoe-lafs-trac-stream] [tahoe-lafs] #2018: padding to hide the size of plaintexts
tahoe-lafs
trac at tahoe-lafs.org
Mon Feb 10 17:24:49 UTC 2014
#2018: padding to hide the size of plaintexts
-------------------------+-------------------------------------------------
     Reporter:  zooko          |      Owner:  nejucomo
         Type:  enhancement    |     Status:  new
     Priority:  normal         |  Milestone:  undecided
    Component:  code-encoding  |    Version:  1.10.0
   Resolution:                 |   Keywords:  confidentiality privacy
                               |              compression newcaps research
Launchpad Bug:                 |
-------------------------+-------------------------------------------------
Comment (by nickm):
32*ceil(log2(F)) doesn't really hide file sizes so well; are you sure it's
what you mean? If a file is 1.1MB long, should the maximum padding
_really_ be only 32 * 21 == 672 bytes? That doesn't seem big enough to
obscure the true file size.
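For concreteness, here's that arithmetic as a quick Python check (this
just illustrates the formula under discussion; it isn't code from the
ticket):
{{{
import math

F = int(1.1 * 2**20)                    # a 1.1 MiB file
max_pad = 32 * math.ceil(math.log2(F))  # ceil(log2(F)) == 21
print(max_pad)                          # prints 672
}}}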
Here's a simple experiment I just did. It's not a very good one, but with
any luck people can improve it into something more scientifically sound. I
took all the music files on my computer (about 16000 of them) and built a
database of their file sizes. (I chose music files because their sizes are
already in a similar range to one another without being overly homogeneous,
and because I had a lot of them. More thought would identify a better
corpus.)
{{{
% find ~/Music/*/ -type f -print0 | xargs -0 ls -l | perl -ne '@a = split; print "$a[4]\n";' > sizes
% wc -l sizes
16275 sizes
}}}
At the start, nearly every file size was unique.
{{{
% sort -n sizes | uniq -c |grep '^ *1 ' |wc -l
16094
}}}
Next I tried a "round the size up to the nearest multiple of max_pad"
rule, taking max_pad as 32*ceil(log2(F)).
{{{
% python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l
9671
}}}
So, more than half of my files still have a unique size. That's probably
not so good.
(I chose "rounding up" rather than "randomizing" since it makes the
attacker's job strictly harder, and the analysis much simpler.)
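The fsize.py script itself isn't included in this message; here is a
minimal sketch of this first rule, assuming fsize.py reads one size per
line on stdin as the invocation above suggests (the name padded_size is
illustrative):
{{{
import math
import sys

def padded_size(F):
    # Round F up to the nearest multiple of
    # max_pad = 32 * ceil(log2(F)).
    max_pad = 32 * max(1, math.ceil(math.log2(F)))  # guard against F == 1
    return ((F + max_pad - 1) // max_pad) * max_pad

for line in sys.stdin:
    line = line.strip()
    if line:
        print(padded_size(int(line)))
}}}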
Next I tried the rule: Let N = ceil(log2(F)). If N >= 5, let X = N - 5;
otherwise let X = N. Let MaxPad = 2^X^. Round the file size up to the
nearest multiple of MaxPad. This time:
{{{
% python ~/Music/fsize.py < sizes | sort -n |uniq -c | grep '^ *1 ' |wc -l
65
}}}
Only 65 files had unique sizes. Note also that the padding overhead is
about 1/32 of the file size, for large files.
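For comparison, the same hypothetical sketch adapted to this second rule
(again, the real fsize.py isn't shown here):
{{{
import math
import sys

def padded_size_v2(F):
    # N = ceil(log2(F)); X = N - 5 when N >= 5, else X = N;
    # MaxPad = 2**X; round F up to the nearest multiple of MaxPad.
    N = math.ceil(math.log2(F))
    X = N - 5 if N >= 5 else N
    max_pad = 2 ** X
    return ((F + max_pad - 1) // max_pad) * max_pad

for line in sys.stdin:
    line = line.strip()
    if line:
        print(padded_size_v2(int(line)))
}}}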
(A better experiment would pick a better metric than "how many files have
a unique size?" The count of unique-sized files doesn't capture the notion
that having three files with the same padded size is better than having
two files with the same padded size. Maybe something entropy-based would
be the right thing here.)
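For instance (a sketch of one possible such metric, not something proposed
in the ticket), the Shannon entropy of the padded-size distribution could
serve: lower entropy means an observed size narrows down the file less.
{{{
import math
from collections import Counter

def size_entropy(padded_sizes):
    # Shannon entropy (in bits) of the empirical distribution of padded
    # sizes; lower entropy means an observed size reveals less about
    # which file was stored.
    counts = Counter(padded_sizes)
    total = len(padded_sizes)
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
}}}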
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2018#comment:15>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage