[tahoe-dev] [tahoe-lafs] #329: dirnodes could cache encrypted/serialized entries for speed
tahoe-lafs
trac at allmydata.org
Thu Jul 2 11:27:29 PDT 2009
#329: dirnodes could cache encrypted/serialized entries for speed
---------------------------+------------------------------------------------
Reporter: warner | Owner: kevan
Type: enhancement | Status: new
Priority: minor | Milestone: undecided
Component: code-dirnodes | Version: 0.8.0
Keywords: dirnode | Launchpad_bug:
---------------------------+------------------------------------------------
Comment(by warner):
I usually write code like this by adding a new unit test. You can run a
specific unit test with a minimum of fuss (and startup time and extraneous
noise) by doing e.g.
{{{make quicktest TEST=allmydata.test.test_dirnode.DeepStats.test_stats}}}.
That will get all the PYTHONPATH stuff set up for you. Anything you
{{{print}}} from the unit test will get displayed, so I use this for
one-off tools all the time.
Also, you can use {{{misc/run-with-pythonpath.py}}}, which will set up the
right environment and then run the command of your choice. For example, if
you wrote a {{{foo.py}}} to do the stuff you just described, then you could
invoke {{{python misc/run-with-pythonpath.py python foo.py}}}.
I'm disappointed that the sys.path technique you tried didn't work; it
used to.
Your approach to {{{_pack_contents}}} sounds great! Note that the
{{{serialized}}} form should contain the whole
{{{name+rocap+encrwcap+metadata}}} string (i.e. be sure to include the
name). That way, if {{{_pack_contents}}} sees that it has a pre-serialized
string available, it can just append that to the list of entries that it's
building, and doesn't need to re-serialize the childname either. At the end
of that loop, it should do a single {{{"".join(entries)}}} to perform the
final assembly (rather than doing incremental {{{+=}}} operations, which
would be a lot slower).
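
A minimal sketch of that loop, under stated assumptions: {{{serialize_entry}}}
and the three-tuple layout are hypothetical stand-ins, and the entry format
shown is a toy one, not the real {{{name+rocap+encrwcap+metadata}}} encoding.
It only illustrates reusing a cached string when one exists and joining once
at the end.

```python
def serialize_entry(name, child, metadata):
    # Hypothetical stand-in for the real per-entry serializer. The toy
    # format below is NOT the actual dirnode encoding.
    return "%s:%s:%s," % (name, child, sorted(metadata.items()))


def pack_contents(children):
    """Pack a mapping of name -> (serialized_or_None, child, metadata)."""
    entries = []
    # Sorted only so that this sketch produces deterministic output.
    for name in sorted(children):
        serialized, child, metadata = children[name]
        if serialized is None:
            # New or modified entry: no cached form, serialize from scratch.
            serialized = serialize_entry(name, child, metadata)
        # Otherwise reuse the cached string, which already includes the name.
        entries.append(serialized)
    # One join at the end instead of repeated += on a growing string.
    return "".join(entries)


children = {
    "a.txt": ("a.txt:URI-A:[],", "URI-A", {}),  # cached form is reused
    "b.txt": (None, "URI-B", {"size": 3}),      # must be serialized
}
packed = pack_contents(children)
```

Because the cached string carries the name as well, the loop never touches
the child or its metadata for unmodified entries.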
> If we pass contents packed with an unmodified _pack_contents to the new
> _unpack_contents, we should get the same underlying directory structure
> back (i.e. all expected child node names are present, and metadata values
> are as they should be). Is there anything else we'd want to check?
We should make sure that the rocap/rwcap is the same too. Doing
{{{child.get_uri()}}} and comparing it against a constant is plenty.
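
A round-trip check along those lines might look like the sketch below. It is
only an illustration: {{{FakeNode}}}, {{{toy_pack}}} and {{{toy_unpack}}} are
hypothetical stand-ins (a JSON toy format, not the real dirnode encoding), and
the URI strings are made-up placeholders. In the real test the packer and
unpacker would be {{{_pack_contents}}} and the new {{{_unpack_contents}}}.

```python
import json
import unittest


class FakeNode(object):
    """Hypothetical stand-in for a child node; only get_uri() is modelled."""

    def __init__(self, uri):
        self._uri = uri

    def get_uri(self):
        return self._uri


def toy_pack(children):
    # Toy packer for this sketch only: name -> [uri, metadata] as JSON.
    return json.dumps(
        dict((name, [child.get_uri(), metadata])
             for name, (child, metadata) in children.items()),
        sort_keys=True)


def toy_unpack(packed):
    # Toy unpacker: rebuilds name -> (node, metadata).
    return dict((name, (FakeNode(uri), metadata))
                for name, (uri, metadata) in json.loads(packed).items())


class RoundTrip(unittest.TestCase):
    def test_round_trip(self):
        original = {
            u"a.txt": (FakeNode(u"URI:CHK:aaaa"), {u"ctime": 1}),
            u"sub": (FakeNode(u"URI:DIR2:bbbb"), {}),
        }
        children = toy_unpack(toy_pack(original))
        # All expected child names are present.
        self.assertEqual(sorted(children), [u"a.txt", u"sub"])
        # Caps are compared against constants, metadata against known values.
        child, metadata = children[u"a.txt"]
        self.assertEqual(child.get_uri(), u"URI:CHK:aaaa")
        self.assertEqual(metadata, {u"ctime": 1})
        child, metadata = children[u"sub"]
        self.assertEqual(child.get_uri(), u"URI:DIR2:bbbb")
        self.assertEqual(metadata, {})


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RoundTrip)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```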
> We should be able to interact with Adder, Deleter, and MetadataSetter in
> the same way that we do now. Are there already tests to adequately verify
> this, or should I plan on writing those, too?
test_dirnode.py should already have test coverage for the modifiers. It
exercises all the {{{NewDirectoryNode}}} methods, like {{{set_uri}}} and
{{{list}}}. If they pass, then you've updated the modifier classes
successfully.
Also, if it works on your platform (it's somewhat touchy right now), use
our test-coverage tools: {{{make quicktest-figleaf figleaf-output}}}, and
check to see that all the code you've touched is actually being run. The
buildbot has a link to the current coverage data (generated under py2.4; it
seems that py2.5 gives slightly different answers). I think it should be
pretty easy to make these changes and achieve the same or better coverage
than before.
Also, I don't know if it'd be worth it, but you could get fancy and,
instead of changing the data structure from a list of two-tuples to a list
of three-tuples, you could:
 * create a subclass of {{{list}}}
 * have it store an auxiliary value (the unmodified serialized string)
   for each key
 * override {{{__setitem__}}} to remove that auxiliary value
 * add a {{{get_both_items}}} method, which takes a key and returns a
   two-tuple of (aux-serialized, value) (which is really (aux-serialized,
   (child,metadata)))
 * add a {{{set_both_items}}} method which takes (key, aux-serialized,
   (child,metadata))
 * then the Modifier classes like {{{Adder}}} and {{{Deleter}}} don't have
   to change: they'll set the entries that have changed and leave the rest
   alone
 * now {{{_pack_contents}}} uses {{{get_both_items}}} and prefers the
   pre-serialized form if it's still there
Since {{{NewDirectoryNode.list}}} returns whatever {{{_unpack_contents}}}
returns, you might manage to write less code if you use this technique.
Otherwise you'll need to find all the callers of {{{_read}}} and update
them to tolerate the new three-tuple (and add a wrapper to {{{list}}} to
convert it into two-tuple form for external callers, so that it continues
to honor the same interface defined in
{{{allmydata.interfaces.IDirectoryNode}}}).
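
The container described above could be sketched roughly as follows. Assumptions
are flagged: the class name {{{AuxValueDict}}} is hypothetical, and the sketch
subclasses {{{dict}}} rather than {{{list}}} because every operation in the
description is keyed by child name; a list-based variant would need different
bookkeeping. Only {{{get_both_items}}} and {{{set_both_items}}} come from the
description itself.

```python
class AuxValueDict(dict):
    """Mapping that remembers an auxiliary value (e.g. the unmodified
    serialized string) for each key until that key is reassigned.

    Hypothetical name; a sketch, not the project's implementation.
    """

    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._aux = {}

    def __setitem__(self, key, value):
        # A plain assignment means the entry changed, so any cached
        # serialized form for this key is stale and must be dropped.
        self._aux.pop(key, None)
        dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        self._aux.pop(key, None)
        dict.__delitem__(self, key)

    def set_both_items(self, key, serialized, value):
        # Used by the unpacker: store the value and its serialized form.
        dict.__setitem__(self, key, value)
        self._aux[key] = serialized

    def get_both_items(self, key):
        # Returns (aux-serialized or None, value).
        return (self._aux.get(key), dict.__getitem__(self, key))


children = AuxValueDict()
children.set_both_items("a.txt", "SERIALIZED-A", ("child-a", {}))
before = children.get_both_items("a.txt")

# A modifier that only knows about plain dict assignment:
children["a.txt"] = ("child-a2", {"mtime": 1})
after = children.get_both_items("a.txt")
```

One caveat with this approach: built-in {{{dict}}} methods such as
{{{update}}}, {{{pop}}} and {{{setdefault}}} do not go through an overridden
{{{__setitem__}}}, so a real version would either override those too or make
sure the modifiers never use them.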
Oh, and of course, it would be a good idea to actually measure the time
that serialization/deserialization takes before doing any of this work.
Create 10/100/1k/10k/100k entries at random, call {{{_pack_contents}}}
under {{{timeit.py}}} or some other sort of how-long-does-it-take loop,
and figure out an {{{A+Bx}}} curve (I'd expect the serialization time to
be some constant A plus some per-entry time B; let's figure out what A and
B are). Let's make sure that serialization is the culprit: it might be
that call to {{{_create_node}}} in unpack that we should focus on. And
let's set a reasonable goal: maybe we should be able to
unpack+modify+pack 10k entries in less than a second, or something.
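
One way to set up that measurement is sketched below, assuming a stand-in
{{{pack}}} function in place of the real {{{_pack_contents}}} and made-up
entry data ({{{make_entries}}} is hypothetical). It times each size with the
standard {{{timeit}}} module and fits {{{A+Bx}}} by ordinary least squares.
The sizes stop at 10k here just to keep the run short.

```python
import random
import timeit


def make_entries(n, rng):
    # Hypothetical random entries; real ones would be (name, child, metadata).
    return [("child-%d" % rng.randrange(10 ** 9), "URI:%d" % i)
            for i in range(n)]


def pack(entries):
    # Placeholder for _pack_contents, which is what should really be timed.
    return "".join("%s=%s," % (name, uri) for name, uri in entries)


def fit_line(xs, ys):
    """Least-squares fit of y = A + B*x; returns (A, B)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def measure(sizes, repeat=3):
    rng = random.Random(0)
    times = []
    for n in sizes:
        entries = make_entries(n, rng)
        # Best of a few runs, to reduce noise from other processes.
        best = min(timeit.repeat(lambda: pack(entries),
                                 number=1, repeat=repeat))
        times.append(best)
    return times


sizes = [10, 100, 1000, 10000]
times = measure(sizes)
a, b = fit_line(sizes, times)
print("A = %.6f s, B = %.9f s/entry" % (a, b))
```

Timing the unpack side the same way (including the {{{_create_node}}} call)
would show which half dominates before any caching work is done.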
--
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/329#comment:11>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid