[tahoe-dev] [tahoe-lafs] #329: dirnodes could cache encrypted/serialized entries for speed

Thu Jul 2 11:27:29 PDT 2009

#329: dirnodes could cache encrypted/serialized entries for speed
---------------------------+------------------------------------------------
 Reporter:  warner         |           Owner:  kevan    
     Type:  enhancement    |          Status:  new      
 Priority:  minor          |       Milestone:  undecided
Component:  code-dirnodes  |         Version:  0.8.0    
 Keywords:  dirnode        |   Launchpad_bug:           
---------------------------+------------------------------------------------

Comment(by warner):

 I usually write code like this by adding a new unit test. You can run a
 specific unit test with a minimum of fuss (and startup time and extraneous
 noise) by doing e.g.
 {{{make quicktest TEST=allmydata.test.test_dirnode.DeepStats.test_stats}}}.
 That will get all the PYTHONPATH stuff set up for you. Anything you
 {{{print}}} from the unit test will get displayed, so I use this for
 one-off tools all the time.

 Also, you can use {{{misc/run-with-pythonpath.py}}}, which will set up the
 right environment and then run the command of your choice. For example, if
 you wrote a {{{foo.py}}} to do the stuff you just described, then you
 could invoke {{{python misc/run-with-pythonpath.py python foo.py}}}.

 I'm disappointed that the sys.path technique you tried didn't work; it
 used to.

 Your approach to {{{_pack_contents}}} sounds great! Note that the
 {{{serialized}}} form should contain the whole
 {{{name+rocap+encrwcap+metadata}}} string (i.e. be sure to include the
 name). That way, if {{{_pack_contents}}} sees that it has a pre-serialized
 string available, it can just append that to the list of entries that it's
 building, and doesn't need to re-serialize the childname either. At the end
 of that loop, it should do a single {{{"".join(entries)}}} to perform the
 final assembly (rather than doing incremental {{{+=}}} operations, which
 would be a lot slower).
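
 A minimal sketch of that loop, to make the shape concrete. This is not the
 real {{{_pack_contents}}}: the {{{netstring}}} and {{{serialize_entry}}}
 helpers, the {{{(cached, fields)}}} layout, and the entry format are all
 assumptions made up for illustration, and encryption of the rwcap is left
 out entirely.

```python
def netstring(s):
    # Length-prefixed framing "<len>:<data>,"; a stand-in for the real format.
    return "%d:%s," % (len(s), s)


def serialize_entry(name, rocap, encrwcap, metadata):
    # Hypothetical serializer: one string covering name+rocap+encrwcap+metadata.
    return netstring("".join(netstring(f) for f in (name, rocap, encrwcap, metadata)))


def pack_contents(children):
    """Pack a dict of name -> (cached_serialized_or_None, fields).

    `fields` is (rocap, encrwcap, metadata) and is only consulted when no
    cached string is available for that child.
    """
    entries = []
    for name in sorted(children):
        cached, fields = children[name]
        if cached is not None:
            # Unmodified entry: reuse the string saved at unpack time.
            # It already contains the child name.
            entries.append(cached)
        else:
            rocap, encrwcap, metadata = fields
            entries.append(serialize_entry(name, rocap, encrwcap, metadata))
    # One join at the end instead of repeated += on a growing string.
    return "".join(entries)
```

 Because the cached string includes the name, an unmodified child costs one
 list append and nothing else.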

 > If we pass contents packed with an unmodified _pack_contents to the new
 _unpack_contents, we should get the same underlying directory structure
 back (i.e.: all expected child node names are present, and metadata values
 are as they should be. Is there anything else we'd want to check?)

 We should make sure that the rocap/rwcap are the same too. Doing
 {{{child.get_uri()}}} and comparing it against a constant is plenty.

 > We should be able to interact with Adder, Deleter, and MetadataSetter in
 the same way that we do now. Are there already tests to adequately verify
 this, or should I plan on writing those, too?

 test_dirnode.py should already have test coverage for the modifiers. It
 exercises all the {{{NewDirectoryNode}}} methods, like {{{set_uri}}} and
 {{{list}}}. If they pass, then you've updated the modifier classes
 successfully.

 Also, if it works on your platform (it's somewhat touchy right now), use
 our test-coverage tools: {{{make quicktest-figleaf figleaf-output}}}, and
 check to see that all the code you've touched is actually being run. The
 buildbot has a link to the current coverage data (generated under py2.4; it
 seems that py2.5 gives slightly different answers). I think it should be
 pretty easy to make these changes and achieve the same or better coverage
 than before.

 Also, I don't know if it'd be worth it, but you could get fancy: instead
 of changing the data structure from a list of two-tuples to a list of
 three-tuples, you could:

  * create a subclass of {{{list}}}
   * have it store an auxiliary value (the unmodified serialized string)
     for each key
   * override {{{__setitem__}}} to remove that auxiliary value
   * add a {{{get_both_items}}} method, which takes a key and returns a
     two-tuple of (aux-serialized, value) (which is really (aux-serialized,
     (child,metadata)))
   * add a {{{set_both_items}}} method which takes (key, aux-serialized,
     (child,metadata))
  * then the Modifier classes like {{{Adder}}} and {{{Deleter}}} don't have
    to change: they'll set the entries that have changed and leave the rest
    alone
  * now {{{_pack_contents}}} uses {{{get_both_items}}} and prefers the
    pre-serialized form if it's still there

 Since {{{NewDirectoryNode.list}}} returns whatever {{{_unpack_contents}}}
 returns, you might manage to write less code if you use this technique.
 Otherwise you'll need to find all the callers of {{{_read}}} and update
 them to tolerate the new three-tuple (and add a wrapper to {{{list}}} to
 convert it into two-tuple form for external callers, so that it continues
 to honor the same interface defined in
 {{{allmydata.interfaces.IDirectoryNode}}}).
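
 The container described above could look roughly like the sketch below.
 Since the entries are looked up by key, the sketch assumes a {{{dict}}}
 base class keyed by child name; that choice, the class name
 {{{CachingDict}}}, and the extra {{{__delitem__}}} override are
 assumptions for illustration, not existing Tahoe code.

```python
class CachingDict(dict):
    """Maps childname -> (child, metadata), plus an optional cached
    serialized string per key that is dropped when the entry changes."""

    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self._aux = {}  # childname -> unmodified serialized string

    def __setitem__(self, key, value):
        # A plain assignment means the entry changed: forget the cached form.
        dict.__setitem__(self, key, value)
        self._aux.pop(key, None)

    def __delitem__(self, key):
        # Not in the original list of steps; added so a deleted-then-re-added
        # key cannot pick up a stale cached string.
        dict.__delitem__(self, key)
        self._aux.pop(key, None)

    def get_both_items(self, key):
        # Returns (aux-serialized or None, (child, metadata)).
        return (self._aux.get(key), dict.__getitem__(self, key))

    def set_both_items(self, key, serialized, value):
        # Used at unpack time, when the serialized form is known to match.
        dict.__setitem__(self, key, value)
        self._aux[key] = serialized
```

 One caveat with this approach: on a {{{dict}}} subclass, built-in methods
 such as {{{update}}}, {{{pop}}}, and {{{setdefault}}} do not go through
 the overridden {{{__setitem__}}}, so any modifier that uses them would
 need those methods overridden as well.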

 Oh, and of course, it would be a good idea to actually measure the time
 that serialization/deserialization takes before doing any of this work.
 Create 10/100/1k/10k/100k entries at random, call {{{_pack_contents}}}
 under {{{timeit.py}}} or some other sort of how-long-does-it-take loop,
 and figure out an {{{A+Bx}}} curve (I'd expect the serialization time to
 be some constant A plus some per-entry time B times the number of entries
 x; let's figure out what A and B are). Let's make sure that serialization
 is the culprit: it might be that call to {{{_create_node}}} in unpack that
 we should focus on. And let's set a reasonable goal: maybe we should be
 able to unpack+modify+pack 10k entries in less than a second, or something.
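
 A rough sketch of such a measurement loop, using the standard
 {{{timeit}}} module and a least-squares fit for A and B. The {{{pack}}}
 function here is only a stand-in so the script runs by itself; the
 entry sizes and helper names are made up, and the real
 {{{_pack_contents}}} would have to be substituted to learn anything
 about dirnodes.

```python
import random
import string
import timeit


def make_entries(n, rng):
    # Random (name, rocap, encrwcap, metadata) strings; lengths are arbitrary.
    def rand(k):
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(k))
    return [(rand(12), rand(60), rand(60), rand(40)) for _ in range(n)]


def pack(entries):
    # Stand-in for _pack_contents: netstring-style framing of every field.
    return "".join("%d:%s," % (len(f), f) for e in entries for f in e)


def fit_line(xs, ys):
    # Ordinary least squares for y = A + B*x; returns (A, B).
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def measure(sizes, repeats=3, seed=0):
    rng = random.Random(seed)
    times = []
    for n in sizes:
        entries = make_entries(n, rng)
        # Take the best of several runs to reduce scheduling noise.
        times.append(min(timeit.repeat(lambda: pack(entries),
                                       number=1, repeat=repeats)))
    return fit_line(sizes, times)


if __name__ == "__main__":
    a, b = measure([10, 100, 1000, 10000, 100000])
    print("A = %.6f s, B = %.9f s/entry" % (a, b))
```

 Running the same loop separately over unpack (including
 {{{_create_node}}}) and over pack would show which side dominates.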

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/329#comment:11>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid