[tahoe-lafs-trac-stream] [tahoe-lafs] #1456: High latency for 'tahoe get' if 'tahoe put' in parallel

Sun Jul 31 12:33:32 PDT 2011

#1456: High latency for 'tahoe get' if 'tahoe put' in parallel
-------------------------+-------------------------------------------------
     Reporter:  T_X      |      Owner:  T_X
         Type:  defect   |     Status:  new
     Priority:  critical |  Milestone:  undecided
    Component:  code     |    Version:  1.8.2
   Resolution:           |   Keywords:  download upload latency performance
Launchpad Bug:           |              gateway vm kvm vpn trickle
-------------------------+-------------------------------------------------
Changes (by zooko):

 * keywords:  latency performance gateway vm kvm vpn trickle => download
     upload latency performance gateway vm kvm vpn trickle
 * owner:  somebody => T_X
 * priority:  major => critical

Comment:

 T_X: thank you for the bug report. It sounds like it might be a serious
 problem in Tahoe-LAFS. I'm glad you've taken the effort to record detailed
 measurements and include notes about how you tried to make a minimal,
 reproducible case. I especially appreciate that you included your test
 script—very good!

 However, I'm still confused and need more help from you to understand
 what's going on. Could you summarize in one paragraph of English (no more
 than 3 or 4 sentences) what is wrong and how you know it is happening?

 You're observing dramatically high latency on {{{tahoe get}}} in some
 cases. In fact, in 10 runs of {{{tahoe get}}}
 ([attachment:tahoe-stats-2.log]), it took this many seconds:
 {{{
 1       15.38
 2       213.31
 3       564.83
 4       11.87
 5       11.99
 6       11.99
 7       12.56
 8       12.11
 9       12.50
 10      12.83
 }}}

 The fact that it took 560 seconds to do a {{{tahoe get}}} (after which it
 completed successfully instead of erroring out?) is definitely an
 indication of something very wrong. I'm still hoping it turns out to be
 something wrong in your test rig or scripts rather than in Tahoe-LAFS, but
 we'll see. :-)
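
 To put a number on how far off the slow runs are, here is a rough sketch
 that compares each of the ten timings above against the median run. The
 10x threshold is my own assumption for what counts as "slow"; nothing in
 the ticket defines it.

```python
from statistics import median

# Wall-clock seconds for the 10 `tahoe get` runs listed above.
timings = [15.38, 213.31, 564.83, 11.87, 11.99, 11.99, 12.56, 12.11, 12.50, 12.83]

# Assumption: a run is "slow" if it takes more than 10x the median run time.
SLOW_FACTOR = 10

typical = median(timings)

# Map run number (1-based) to how many times slower than the median it was.
slow_runs = {
    run: round(seconds / typical, 1)
    for run, seconds in enumerate(timings, start=1)
    if seconds > SLOW_FACTOR * typical
}

print(f"median: {typical:.2f} s")
for run, ratio in slow_runs.items():
    print(f"run {run}: {ratio}x the median")
```

 With these numbers only runs 2 and 3 are flagged, at roughly 17 and 45
 times the median of about 12.5 seconds.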

 So, that's a question. How do we know that the runs that took an order of
 magnitude longer completed successfully? As far as I can tell from a quick
 scan of [attachment:test-run.sh#L16 your script], it isn't checking the
 return value or inspecting the resulting downloaded file to be sure it
 worked.
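
 One way to answer that would be to record the exit code and verify the
 downloaded bytes on every run. The following is only a sketch, not a
 drop-in replacement for your script: the helper name `timed_download` and
 the idea of comparing against a known SHA-256 digest of the uploaded file
 are my assumptions. The only Tahoe-LAFS behavior it relies on is the
 `tahoe get REMOTE_FILE LOCAL_FILE` form of the command.

```python
import hashlib
import subprocess
import time
from pathlib import Path


def timed_download(cmd, out_path, expected_sha256):
    """Run `cmd` and return (seconds, exit_code, content_ok).

    `cmd` is an argv list that is expected to write the download to
    `out_path`. `expected_sha256` is the hex digest of the original file.
    """
    out = Path(out_path)
    # Remove any old copy so a stale file is never mistaken for a fresh download.
    out.unlink(missing_ok=True)

    start = time.monotonic()
    exit_code = subprocess.run(cmd).returncode
    seconds = time.monotonic() - start

    # A run only counts as successful if the command exited cleanly AND the
    # file on disk matches the digest of what was uploaded.
    content_ok = (
        exit_code == 0
        and out.is_file()
        and hashlib.sha256(out.read_bytes()).hexdigest() == expected_sha256
    )
    return seconds, exit_code, content_ok


# Hypothetical usage (REMOTE_FILE and the digest are placeholders):
#   seconds, code, ok = timed_download(
#       ["tahoe", "get", "REMOTE_FILE", "out.bin"], "out.bin", "<sha256 hex>")
#   print(f"{seconds:.2f}s exit={code} content_ok={ok}")
```

 Logging all three values per run would show directly whether the 213 and
 564 second runs ended in success, in an error exit, or in a wrong or
 missing file.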

 (Note that it would be just as major a problem in Tahoe-LAFS if it waited
 560 seconds and then failed as if it waited 560 seconds and then succeeded,
 but it would help to understand which is happening.)

 Thanks!

-- 
Ticket URL: <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1456#comment:1>
tahoe-lafs <http://tahoe-lafs.org>
secure decentralized storage