[tahoe-dev] measure your convergence
zooko
zooko at zooko.com
Thu Mar 20 13:38:34 PDT 2008
Folks:
Ever wondered how much storage space you would save if you and your
friends coalesced all of your identical files?
Wonder no longer! Now you can find out! Install the "dupfilefind"
utility [*] and run it with command-line arguments like:
dupfilefind --ignore-dirs="," --min-size=32 --profiles
(It probably works on all operating systems.)
It will recursively examine all files reachable from the current
working directory and spew out a series of "hashcode filesize" pairs,
where the hashcode is the least significant 8 bits of the adler32
checksum of the first 8192 bytes of the file.
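
For the curious, here is a minimal sketch of how such a "hashcode filesize" pair could be computed. It is not taken from the dupfilefind source; the names profile, walk_profiles and PREFIX_BYTES are made up for illustration, and the real tool's handling of symlinks, unreadable files and output formatting may differ.

```python
import os
import zlib

PREFIX_BYTES = 8192  # only this many leading bytes are checksummed


def profile(path):
    """Return (hashcode, filesize) for one file, following the description above."""
    with open(path, "rb") as f:
        prefix = f.read(PREFIX_BYTES)
    # Keep only the least significant 8 bits of the adler32 checksum.
    hashcode = zlib.adler32(prefix) & 0xFF
    return hashcode, os.path.getsize(path)


def walk_profiles(root="."):
    """Yield a profile for every regular file reachable from root.

    Assumption: symlinks and unreadable files are skipped; the real
    utility may treat them differently.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                yield profile(path)
            except OSError:
                continue
```

With only 8 bits of checksum, many unrelated files will get the same hashcode, so the file size does most of the work of telling files apart.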
It will also report any pairs of separate files on your system that
are identical to each other.
Send the output to your friends, or to me -- zooko at zooko.com -- and
we'll find out approximately how many of your files are shared with
other people who submit results. (Please compress your output with a
good compressor like 7zip or rzip or bzip2.)
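
A rough sketch of how two submissions could be compared follows. It assumes each profile line is two whitespace-separated decimal integers ("hashcode filesize"), which is a guess at the output format, and it skips every other line. The function names are hypothetical. Since the hashcode is only 8 bits, a match means "possibly the same file", so the count is an upper-bound estimate.

```python
from collections import Counter


def parse_profiles(text):
    """Collect (hashcode, filesize) pairs from one person's output.

    Assumption: a profile line is exactly two decimal integers; any
    other line (such as a duplicate-file notice) is ignored.
    """
    pairs = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            pairs[(int(fields[0]), int(fields[1]))] += 1
        except ValueError:
            continue
    return pairs


def count_possibly_shared(mine, theirs):
    """Count pairs present in both submissions (multiset intersection)."""
    common = mine & theirs
    return sum(common.values())
```

Using a multiset rather than a plain set means that a pair you hold twice and your friend holds once is counted once, not twice.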
You take full responsibility for leaking all this information about
your files -- namely the 8-bit adler32 sum of each file's first 8192
bytes, and its file size. Also, if duplicate files are detected on
your system, their device numbers and inode numbers.
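
If you want to see what a device number and inode number look like before sending anything, this small sketch reads them with os.stat. The helper names are made up, and this is not how dupfilefind itself is written. On some platforms (notably older Pythons on Windows) st_ino may be reported as 0, so treat the result with care there.

```python
import os


def file_identity(path):
    """Return the (device number, inode number) pair for a path."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


def same_underlying_file(path_a, path_b):
    """True if both paths name the same file on disk (e.g. hard links).

    Two separate files with identical contents have different
    identities, which is what distinguishes a real duplicate from a
    second name for one file.
    """
    return file_identity(path_a) == file_identity(path_b)
```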
Regards,
Zooko
[*] To install the dupfilefind utility, either download this tarball:
http://pypi.python.org/packages/source/d/dupfilefind/dupfilefind-1.1.2.tar.gz#md5=af8de6f3ead053e326389a9a87b0a11d
untar it, cd into the resulting directory, and run:
python setup.py install
or else install the easy_install tool:
http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install
and then run:
easy_install dupfilefind