[volunteergrid2-l] gaps in stats page

Sun Mar 25 21:42:07 UTC 2012

Hi Christoph, hello Brian,

> Another interesting one is inxoy6uiulkr2uwm6s3rmz6jzyiywkvi, which
> took 7 seconds to reply (even though both my node, and that one, are 
> physically and network-wise relatively close). By coincidence, that
> is also the node which is showing loads of gaps on the stats page ( 
> https://vg2-stats.rosenkeller.org/ ), even though I assume it's not 
> being constantly restarted, but online 24/7.

inoxy6... is the cerezal server node. I've been observing the stats
page and the response latencies for some time. Sometimes response times
are long; the longest I have seen so far was 43 seconds.
Among the "normal" times (those below 1000 ms),
cerezal is usually a bit above average (not surprising, as it is not
a fast system), but I have rarely seen it show these very large latencies
when looking at mapupdate MODE_READ times from my client
node. It would probably be helpful to see detailed latencies for other
operations as well.

Regarding the gaps that show up on the stats page, I didn't see any
gaps in the service itself, with one exception: there was a real interruption
of service late on Friday evening while I was running some benchmarks on the
ARM CPU. It seems that responses to the stats server sometimes arrive too
late. I did not see any correlation with the load on the
NAS (it is mostly idle, but I do some backups and rsyncing).
The gaps disappeared for a while when I restarted tahoe,
then appeared again.

BTW, I also tried changing process priorities with the Debian ionice
command, giving the tahoe process real-time I/O priority. The result
is visible in the stats for March 20 and 21, starting where the
available disk space increases. So it may help, but it does not
eliminate the gaps.

What catches my attention on the stats page is that the
gaps seem to start showing up when server capacity disappears.

Regards,

Johannes

On Sat, 24 Mar 2012 18:16:25 +0100
Christoph Langguth <christoph at rosenkeller.org> wrote:

> On Fri, 23 Mar 2012 12:22:59 -0700, Brian Warner wrote:
> > Hi folks.. I just joined the mailing list so I could hear from y'all
> > about how Tahoe is working for you. In particular, I've heard some
> > anecdotal reports about serious latency problems. Would folks mind if I
> > attached a non-server client to VG2 and uploaded a few tiny
> > files/directories to see what's going on? (I kind of suspect a bug in
> > which TCP connections are getting silently lost, but the client doesn't
> > realize it yet, and the uploader or the mutable-file-publisher is then
> > waiting on a very slow TCP timeout).
> >
> Hi Brian,
> 
> most definitely, I would much appreciate it.
> 
> To demonstrate (one of) the problems, I'm attaching a screenshot of 
> what is happening on a deep-check --repair --add-lease run I started
> a few minutes ago. The screenshot was taken about 10 minutes after 
> starting, and it's still stuck at the initial queries stage,
> supposedly waiting for a reply from the 16th server. This time,
> something(tm) seems to have timed out after exactly 16 minutes and 59
> seconds; then afterwards, the actual deep-check went through pretty
> quickly. Interestingly enough, this problem seems to appear most of
> the time, but not all of the time -- but if it does, it's always the
> connection with ej3fwcecqssij4ljf6esjkmflhook6jk. (Ted, don't be
> offended, I'm just stating what I observed :-) ).
> Another interesting one is inxoy6uiulkr2uwm6s3rmz6jzyiywkvi, which
> took 7 seconds to reply (even though both my node, and that one, are 
> physically and network-wise relatively close). By coincidence, that
> is also the node which is showing loads of gaps on the stats page ( 
> https://vg2-stats.rosenkeller.org/ ), even though I assume it's not 
> being constantly restarted, but online 24/7.
> 
> Again, folks: don't take this personally. I'm only describing the 
> situation and trying to find out what is going wrong, and why it
> seems to be mostly related to a few nodes, while others seem
> rock-stable. This *could* of course all be network-related, but from
> my experience, I doubt it. Network issues are usually intermittent
> and not regularly repeating. Maybe Brian can indeed gain some
> valuable insight and shed some light on this, so I'm definitely in
> favor of his proposal.