#2251 new defect

"Database or disk is full" error stops backups

Reported by: CyberAxe Owned by: daira
Priority: normal Milestone: undecided
Component: code-storage Version: 1.10.0
Keywords: database full space logging Cc: zooko
Launchpad Bug:

Description (last modified by daira)

http://sebsauvage.net/paste/?86d37574a47fd008#Xcj23YbPlxSMZ63yIVfune6X+kxkHQSCQ9Hm7kJmqkI=

[...]
C:\allmydata-tahoe-1.10.0\bin>Backup-Now

C:\allmydata-tahoe-1.10.0\bin>ECHO OFF
Traceback (most recent call last):
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\runner.py", line 156, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\runner.py", line 141, in runner
    rc = cli.dispatch[command](so)
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\cli.py", line 574, in backup
    rc = tahoe_backup.backup(options)
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\tahoe_backup.py", line 325, in backup
    return bu.run()
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\tahoe_backup.py", line 211, in process
    newdircap = mkdir(create_contents, self.options)
  File "c:\allmydata-tahoe-1.10.0\src\allmydata\scripts\tahoe_backup.py", line 47, in mkdir
    raise HTTPError("Error during mkdir", resp)
HTTPError: Error during mkdir: 500 Internal Server Error
"Traceback (most recent call last):
  File \"C:\\allmydata-tahoe-1.10.0\\support\\Lib\\site-packages\\foolscap-0.6.4-py2.7.egg\\foolscap\\call.py\", line 753, in receiveClose
    self.request.fail(f)
  File \"C:\\allmydata-tahoe-1.10.0\\support\\Lib\\site-packages\\foolscap-0.6.4-py2.7.egg\\foolscap\\call.py\", line 95, in fail
    self.deferred.errback(why)
  File \"C:\\allmydata-tahoe-1.10.0\\support\\Lib\\site-packages\\twisted-12.3.0-py2.7-win-amd64.egg\\twisted\\internet\\defer.py\", line 422, in errback
    self._startRunCallbacks(fail)
  File \"C:\\allmydata-tahoe-1.10.0\\support\\Lib\\site-packages\\twisted-12.3.0-py2.7-win-amd64.egg\\twisted\\internet\\defer.py\", line 489, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"C:\\allmydata-tahoe-1.10.0\\support\\Lib\\site-packages\\twisted-12.3.0-py2.7-win-amd64.egg\\twisted\\internet\\defer.py\", line 576, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"c:\\allmydata-tahoe-1.10.0\\src\\allmydata\\immutable\\upload.py\", line 604, in _got_response
    return self._loop()
  File \"c:\\allmydata-tahoe-1.10.0\\src\\allmydata\\immutable\\upload.py\", line 516, in _loop
    return self._failed(msg)
  File \"c:\\allmydata-tahoe-1.10.0\\src\\allmydata\\immutable\\upload.py\", line 617, in _failed
    raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: server selection failed for <Tahoe2ServerSelector for upload 66ww6>:
shares could be placed or found on only 0 server(s). We were asked to place shares on at least 1 server(s) such that
any 1 of them have enough shares to recover the file. (placed 0 shares out of 1 total (1 homeless), want to place
shares on at least 1 servers such that any 1 of them have enough shares to recover the file, sent 1 queries to 1
servers, 0 queries placed some shares, 1 placed none (of which 0 placed none due to the server being full and 1
placed none due to an error)) (last failure (from <ServerTracker for server xfia4afr and SI 66ww6>) was:
[Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
<RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File \"/home/customer/LAFS_source/src/allmydata/storage/server.py\", line 258, in <lambda>
    canary))
  File \"/home/customer/LAFS_source/src/allmydata/storage/backends/cloud/cloud_backend.py\", line 175, in make_bucket_writer
    d.addCallback(lambda ign: BucketWriter(account, immsh, canary))
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 306, in addCallback
    callbackKeywords=kw)
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 295, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File \"/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File \"/home/customer/LAFS_source/src/allmydata/storage/backends/cloud/cloud_backend.py\", line 175, in <lambda>
    d.addCallback(lambda ign: BucketWriter(account, immsh, canary))
  File \"/home/customer/LAFS_source/src/allmydata/storage/bucket.py\", line 28, in __init__
    share.get_allocated_data_length(), SHARETYPE_IMMUTABLE)
  File \"/home/customer/LAFS_source/src/allmydata/storage/account.py\", line 54, in add_share
    self._leasedb.add_new_share(storage_index, shnum, used_space, sharetype)
  File \"/home/customer/LAFS_source/src/allmydata/storage/leasedb.py\", line 176, in add_new_share
    (si_s, shnum, prefix, backend_key, used_space, sharetype, STATE_COMING))
sqlite3.OperationalError: database or disk is full
]'>
])
"

This happened while trying to back up one very small test file. I have 38 GB of free space locally, on Windows.
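For anyone trying to reproduce the bottom-most error locally: SQLite raises this exact message whenever it cannot allocate another database page, whether the disk is actually full or a size limit is hit. A minimal sketch (not Tahoe-LAFS code; the table name is made up) that provokes it by capping the page count:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Artificially cap the database at 2 pages to simulate a full disk.
con.execute("PRAGMA max_page_count = 2")
con.execute("CREATE TABLE shares (data BLOB)")  # uses pages 1 (schema) and 2 (table root)

try:
    # A 4096-byte blob needs an overflow page, exceeding the cap.
    con.execute("INSERT INTO shares VALUES (zeroblob(4096))")
except sqlite3.OperationalError as e:
    print(e)  # "database or disk is full" -- the same error leasedb.py hit
```

This also shows why the leasedb write was the first thing to fail: any INSERT that needs a fresh page surfaces the generic "database or disk is full" error, with no indication of which filesystem is out of space.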

Change History (6)

comment:1 Changed at 2014-06-30T19:48:09Z by zooko

This is probably related to #2254: it is happening on the same server, and both error messages have to do with the server's filesystem being full.

comment:2 Changed at 2014-06-30T22:07:43Z by Zancas

After the issue was reported to me, I began investigating the state of CyberAxe's server, starting at approximately 2014-06-27T19:45Z.

I ssh'd into the server and ran:

Last login: Mon Apr  7 21:32:11 2014 from 184-96-237-182.hlrn.qwest.net
 ubuntu@ip-10-185-214-61:~$ df
 Filesystem     1K-blocks    Used Available Use% Mounted on
 /dev/xvda1       8256952 7838464         0 100% /
 udev              299036       8    299028   1% /dev
 tmpfs              60948     160     60788   1% /run
 none                5120       0      5120   0% /run/lock
 none              304720       0    304720   0% /run/shm

Here's how big the storageserver/storage db's were:

customer@ip-10-185-214-61:~/storageserver/storage$ ls -l ./*
-rw------- 1 customer customer        0 May  1 09:05 ./accounting_crawler.state
-rw------- 1 customer customer    16384 Apr 30 14:27 ./bucket_counter.state
-rw-r--r-- 1 customer customer 21267456 Apr 14 05:16 ./leasedb.sqlite
-rw-r--r-- 1 customer customer    32768 May 26 08:46 ./leasedb.sqlite-shm
-rw-r--r-- 1 customer customer  1246104 May 25 08:55 ./leasedb.sqlite-wal

Here's what I pasted into IRC as the result of an experiment using 'du':

<zancas> customer@ip-10-185-214-61:~$ du -sh {storageserver,introducer}/logs
<zancas> 6.1G    storageserver/logs
<zancas> 20K     introducer/logs
<zooko> aha
<zancas> customer@ip-10-185-214-61:~/storageserver/logs$ pwd
<zancas> /home/customer/storageserver/logs
<zancas> customer@ip-10-185-214-61:~/storageserver/logs$ ls -l twistd.log.999
<zancas> -rw------- 1 customer customer 1000057 Apr 26 20:43 twistd.log.999

I tar'd and bzip2'd the log files, and scp'd them to my local dev machine.
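The archiving step can be sketched roughly as follows (a self-contained demo with throwaway paths, not the exact commands run on the server):

```shell
# Demo of the cleanup: bundle rotated twistd logs into a bzip2 tarball,
# which can then be copied off the box and the originals deleted to free
# the root filesystem. All paths here are a throwaway demo layout.
DEMO=$(mktemp -d)
mkdir "$DEMO/logs"
echo "sample log line" > "$DEMO/logs/twistd.log.999"

tar -cjf "$DEMO/logs.tar.bz2" -C "$DEMO" logs
tar -tjf "$DEMO/logs.tar.bz2"   # lists logs/ and logs/twistd.log.999

# On the real server the tarball would then be copied off, e.g.:
# scp logs.tar.bz2 devbox:backups/   # 'devbox' is a placeholder host
# rm logs/twistd.log.*
rm -rf "$DEMO"
```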

Here's a 'df' I ran on Saturday, before informing CyberAxe that we had a fix to test:

customer@ip-10-185-214-61:~/storageserver/logs$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  1.5G  6.1G  19% /
udev            293M  8.0K  293M   1% /dev
tmpfs            60M  160K   60M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            298M     0  298M   0% /run/shm

Here's the result of a df run at about 2014-06-30T22:09:30Z:

customer@ip-10-185-214-61:~/storageserver/logs$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  1.5G  6.1G  20% /
udev            293M  8.0K  293M   1% /dev
tmpfs            60M  160K   60M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            298M     0  298M   0% /run/shm
customer@ip-10-185-214-61:~/storageserver/logs$ 

If I understand correctly, the second df run above occurred *after* CyberAxe attempted another backup, which failed as described in ticket:2254.

comment:3 Changed at 2014-07-11T17:26:50Z by CyberAxe

I tested this and can now back up and restore files, after Za cleaned out the server's log files.

comment:4 Changed at 2014-07-24T00:25:47Z by daira

  • Cc daira removed
  • Component changed from unknown to code-frontend-cli
  • Description modified (diff)
  • Keywords file removed

comment:5 Changed at 2014-07-24T00:26:46Z by daira

  • Component changed from code-frontend-cli to code-storage
  • Keywords logging added

comment:6 Changed at 2014-07-24T00:29:06Z by daira

The underlying problem is #1927.

Note: See TracTickets for help on using tickets.