[tahoe-dev] Sighting reports [correction]

Brian Warner warner at lothar.com
Wed Jul 14 21:04:15 UTC 2010


On 7/13/10 5:45 PM, Kyle Markley wrote:
>>> When a storage server accepts responsibility for a share during peer
>>> selection, it makes a placeholder file of the same size as it was
>>> asked to store.
>>
>> This is wrong -- the storage server writes a small placeholder file,
>> not a placeholder file of the same size as the share file.

To be precise, it opens a file in
BASEDIR/storage/shares/incoming/$PREFIX/$SI/$SHNUM (next to the place
where the final share will eventually live, which is
BASEDIR/storage/shares/$PREFIX/$SI/$SHNUM), fills that file with share
data, and finally moves it into place once the client closes the
share.
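
As a rough illustration (not the actual storage-server code, which has
its own helpers and fills the file incrementally), the
write-to-incoming-then-rename pattern looks something like this, with
hypothetical names:

  import os

  def write_share(basedir, prefix, si, shnum, data):
      # sketch only: write into incoming/, then rename into place
      incoming = os.path.join(basedir, "storage", "shares", "incoming",
                              prefix, si, str(shnum))
      final = os.path.join(basedir, "storage", "shares",
                           prefix, si, str(shnum))
      if not os.path.isdir(os.path.dirname(incoming)):
          os.makedirs(os.path.dirname(incoming))
      with open(incoming, "wb") as f:
          f.write(data)  # in reality the client fills this incrementally
      if not os.path.isdir(os.path.dirname(final)):
          os.makedirs(os.path.dirname(final))
      os.rename(incoming, final)  # move into place when the share closes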

The storage server remembers how much space is "allocated", i.e.
claimed by share-write operations that have started but not yet
finished, and counts it against the available disk space. Each
share-write operation must commit to the share size at the beginning.
This accounting is probably conservative, especially when two shares
are being uploaded at the same time.
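
The accounting is roughly this shape (a hedged sketch with made-up
names, not the real code):

  class SpaceTracker:
      def __init__(self, reserved_space=0):
          self.reserved_space = reserved_space
          self.allocated = 0  # bytes promised to in-progress share writes

      def remaining_space(self, disk_free):
          # count the reserve and the in-flight allocations against
          # whatever the filesystem says is free
          return disk_free - self.reserved_space - self.allocated

      def allocate(self, share_size, disk_free):
          # each share-write commits to its full size up front
          if share_size > self.remaining_space(disk_free):
              raise IOError("not enough remaining space for this share")
          self.allocated += share_size

      def close_share(self, share_size):
          # the bytes are now on disk, so stop double-counting them
          self.allocated -= share_size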

> I should have included more information in my report. None of the
> storage servers are anywhere close to full -- they all have several
> GiB remaining, and the file being uploaded is only about 2 MiB. The
> shares definitely should not have been rejected due to insufficient
> space.

That *is* weird. It sounds to me like the connections to those servers
have been lost, or the servers are otherwise marked as unusable, but
the web status page is presenting different information.

The next step is to obtain a "flogfile" from your node. This should
capture some of the information about the server-selection process
(note: I don't know how much the new server-selection code logs about
its decision-making inputs, but the flogfile is the right place for it
to deliver this information).

There are two basic ways to collect this information:

1:
  run "flogtool tail --save-to=stuff.flog $BASEDIR/private/logport.furl"
  start your backup, wait for it to fail
  kill the flogtool process
  bzip2 stuff.flog
  attach stuff.flog.bz2 to the ticket

or 2:
  start the backup, wait for it to fail
  go to the webapi Welcome page and hit the "Report An Incident" button
  find the most recent .flog.bz2 file in $BASEDIR/logs/incidents/
  attach that file to the ticket

The first method will capture every single log message from the time
that "flogtool tail" connects to the time you kill the flogtool process.
This could be gigabytes of data. You might use "flogtool filter" to
remove events from before the failing upload: we probably don't need
information from more than a few minutes before it started.
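
(Purely to illustrate the kind of trimming "flogtool filter" does: if
you had the events as a plain list of dicts with a "time" field, which
is NOT flogtool's actual on-disk format, you would keep only the recent
ones like this:)

  import time

  def recent_events(events, minutes=5, now=None):
      # keep only events from the last few minutes before "now"
      cutoff = (now if now is not None else time.time()) - minutes * 60
      return [e for e in events if e.get("time", 0) >= cutoff]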

The second method will capture just the saved log events: foolscap keeps
a set of circular buffers, one per severity level, and new messages push
out old ones. In practice, this tends to do a pretty good job of
capturing the interesting pieces while keeping the flogfiles small
(usually about 15kB).
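
The idea is roughly this (an illustrative sketch, not foolscap's actual
implementation):

  from collections import deque

  class SeverityBuffers:
      def __init__(self, size_per_level=100):
          # one bounded buffer per severity level; deque(maxlen=...)
          # silently drops the oldest event when a new one arrives
          self.size = size_per_level
          self.buffers = {}

      def add(self, level, event):
          buf = self.buffers.setdefault(level, deque(maxlen=self.size))
          buf.append(event)

      def dump(self):
          # roughly what ends up saved in an incident file
          return dict((level, list(buf))
                      for level, buf in self.buffers.items())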

> allmydata.interfaces.UploadUnhappinessError: shares could be placed on
> only 3 server(s) such that any 2 of them have enough shares to recover
> the file, but we were asked to place shares on at least 4 such
> servers. (placed all 4 shares, want to place shares on at least 4
> servers such that any 2 of them have enough shares to recover the
> file, sent 4 queries to 4 peers, 3 queries placed some shares, 1
> placed none (of which 1 placed none due to the server being full and 0
> placed none due to an error))

Ugh, I really don't like that error message. Having tried to piece
together what the heck it's saying, I'm starting to think that it would
be more useful (especially to us) to include a brief summary of the
peer-selection decisions. (Some of the mutable-publish code does this
too.) Something like:

  abcde:sh1+sh2,cdefg:sh2+sh3,fghij:sh4,ijklm:

where "abcde" is a server ID, and server "ijklm" has no shares
allocated. Or maybe just a record of the queries we've sent out and
their responses:

  abcde:1+2?1,cdefg:2+3?2,abcde:3?3,fghij:3+4?error

(indicating that we asked abcde to hold sh1+sh2 and it accepted only
sh1, that we later asked it to hold sh3 and it accepted, that the query
to fghij for sh3+sh4 failed with an error, etc)
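
For concreteness, rendering the first form from a placement map could
be as simple as this (server IDs and share numbers are just the
hypothetical ones above):

  def summarize_placement(placements):
      # placements: dict mapping server ID -> list of allocated share numbers
      parts = []
      for server, shares in sorted(placements.items()):
          shlist = "+".join("sh%d" % s for s in sorted(shares))
          parts.append("%s:%s" % (server, shlist))
      return ",".join(parts)

  print(summarize_placement({"abcde": [1, 2], "cdefg": [2, 3],
                             "fghij": [4], "ijklm": []}))
  # -> abcde:sh1+sh2,cdefg:sh2+sh3,fghij:sh4,ijklm: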


cheers,
  -Brian

