[tahoe-dev] using tahoe as a backend driver for irods storage system
Zooko O'Whielacronx
zooko at zooko.com
Tue Dec 6 03:13:54 UTC 2011
Dear Jimmy and Terrell:
Very interesting! I hope you proceed with this.
I would like to second Terrell's suggestion about performance: "try it
and see". There are some detailed resources about performance, but I
don't advise that you look at these, *yet*:
* https://tahoe-lafs.org/trac/tahoe-lafs/wiki/Performance
* https://tahoe-lafs.org/trac/tahoe-lafs/ticket/932 (benchmark Tahoe-LAFS compared to nosql dbs)
* https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst
Keep in mind that Tahoe-LAFS itself may get performance improvements
in future releases. Most of the performance problems are not "deep"
slowdowns caused by some unique Tahoe-LAFS feature, but rather
"shallow" slowdowns, such as the client unnecessarily pausing between
sending successive network requests. The new MDMF format that Kevan
wrote accidentally came in at about three times as fast as the current
immutable (CHK) format on a LAN. That probably indicates that someone
could optimize the CHK format to be *more* than three times as fast as
it is now, without having to change the format.
Since performance optimization is one of those things that the open
source community is especially good at -- the goals are relatively
easy for everyone to agree on, and it is fun and rewarding -- I'm
hopeful that someone will volunteer to work on that.
There is one simple strategy you can take when you are doing this to
avoid unnecessary performance problems: don't store things under a
relative pathname, like $DIRCAP/foo/bar, or $DIRCAP/foo, when you can
instead store those things directly under their cap.
So for example, uploading a file and getting back its immutable (CHK)
cap and then using that cap to later download that file will always be
somewhat faster than uploading a file *into a directory* and then
using the cap to the directory plus the filename to download that
file. It is easy to understand why: the latter sort of upload -- the
sort that uploads the file into a directory under a filename -- is
implemented by first doing the former sort of upload -- uploading the
file to an immutable cap -- and then downloading the current version
of the directory, rewriting it to add the presence of the new child,
and then uploading a new version of that directory. This is obviously
always slower than just doing the first step, and it can be a *lot*
slower if it is an SDMF-format directory and there are many child
entries in it.
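To make the difference concrete, here is a rough sketch of the two access patterns against the Tahoe-LAFS web API, which exposes PUT /uri (upload, returning the new cap), GET /uri/$CAP, and PUT or GET /uri/$DIRCAP/path. It is only an illustration: the gateway address is the usual local default but is an assumption about your setup, the helper names (cap_url, child_url, and so on) are made up for this example, and percent-escaping the cap is done here on the assumption that the gateway decodes it as usual.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed address of a local Tahoe-LAFS gateway; adjust for your node.
GATEWAY = "http://127.0.0.1:3456"


def cap_url(cap, gateway=GATEWAY):
    """URL that addresses an object directly by its cap."""
    return f"{gateway}/uri/{quote(cap, safe='')}"


def child_url(dircap, path, gateway=GATEWAY):
    """URL that addresses a child by relative path under a directory cap."""
    parts = [quote(p, safe="") for p in path.split("/") if p]
    return "/".join([cap_url(dircap, gateway)] + parts)


def upload_immutable(data, gateway=GATEWAY):
    """Fast path: upload bytes and get back the immutable (CHK) cap."""
    req = Request(f"{gateway}/uri", data=data, method="PUT")
    with urlopen(req) as resp:
        return resp.read().decode("ascii").strip()


def upload_into_directory(dircap, path, data, gateway=GATEWAY):
    """Slow path: the gateway uploads the file, then also has to fetch,
    modify, and re-upload the directory that gains the new child."""
    req = Request(child_url(dircap, path, gateway), data=data, method="PUT")
    with urlopen(req) as resp:
        return resp.read().decode("ascii").strip()


def download(url):
    """Fetch an object from either kind of URL."""
    with urlopen(url) as resp:
        return resp.read()
```

A backend driver following the advice above would call upload_immutable(), keep the returned cap in its own catalog, and later fetch with download(cap_url(cap)), rather than going through download(child_url(dircap, "foo/bar")).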
I think that was the performance issue that I brought up with respect
to git-annex -- that it was storing things under their filenames
within a directory and I wondered if it could be changed to store
things under their cap.
Regards,
Zooko