Opened at 2013-02-21T01:09:38Z
Last modified at 2020-10-30T12:35:44Z
#1921 closed defect
leasedb: NonExistentShareError: can't find [share] in `shares` table — at Version 7
Reported by: | davidsarah | Owned by: | davidsarah |
---|---|---|---|
Priority: | major | Milestone: | 1.15.0 |
Component: | code-storage | Version: | 1.9.2 |
Keywords: | leasedb accounting-crawler blocks-cloud-merge | Cc: | |
Launchpad Bug: |
Description (last modified by zooko)
The attached incident was seen when running a test with a large number of uploads with the OpenStack cloud backend. The most relevant part seems to be:
local#18690 23:06:18.999: storage: allocate_buckets lp6ibjsxe6vf6ern6v3soepjh4 local#18691 23:06:18.999: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flp%2Flp6ibjsxe6vf6ern6v3soepjh4%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']} local#18692 23:06:18.999: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830> local#18693 23:06:19.504: OpenStack list objects response: 200 OK local#18694 23:06:19.505: OpenStack list read 201 bytes, parsed as 1 items local#18695 23:06:19.505: OpenStack get object request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lpcjuif2ixx6khivy6zxdmvofe/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']} local#18696 23:06:19.506: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320> local#18697 23:06:19.506: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x46b6c68> local#18698 23:06:19.520: OpenStack list objects response: 200 OK local#18699 23:06:19.521: OpenStack list read 2 bytes, parsed as 0 items local#18700 23:06:19.523: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x4433830> local#18701 23:06:19.537: OpenStack put object request PUT https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test/shares/lp/lp6ibjsxe6vf6ern6v3soepjh4/0 {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'Content-Type': ['application/octet-stream'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']} local#18702 23:06:19.537: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0> local#18703 23:06:20.020: OpenStack get object response: 200 OK local#18704 23:06:20.022: share SI=lp6ibjsxe6vf6ern6v3soepjh4 shnum=0 unexpectedly disappeared [INCIDENT-TRIGGER] local#18705 23:06:20.059: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x442b320> local#18706 23:06:20.062: OpenStack list objects request GET https://storage101.dfw1.clouddrive.com/v1/MossoCloudFS_a888b70a-4771-40c0-8403-e921454e03fd/test?format=json&prefix=shares%2Flq%2F {'User-Agent': ['Tahoe-LAFS OpenStack client'], 'X-Auth-Token': ['be1fea33-921b-47be-b95b-99d4cc5139ea']} local#18707 23:06:20.062: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x4341e60> local#18708 23:06:20.521: OpenStack put object response: 201 Created local#18709 23:06:20.527: an inbound callRemote that we [n4zt] executed (on behalf of someone else, TubID uzie) failed local#18710 23:06:20.527: reqID=6970, rref=<allmydata.storage.bucket.BucketWriter object at 0x31a9050>, methname=RIBucketWriter.close local#18711 23:06:20.527: args=[] local#18712 23:06:20.527: kwargs={} local#18713 23:06:20.527: the LOCAL failure was: FAILURE: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/foolscap/eventual.py", line 26, in _turn cb(*args, **kwargs) File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/util/deferredutil.py", line 55, in _with_log op(res) File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 368, in callback self._startRunCallbacks(result) File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks self._runCallbacks() --- <exception caught here> --- File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 551, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/bucket.py", line 62, in _got_used_space self._account.add_or_renew_default_lease(storage_index, shnum) File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 58, in add_or_renew_default_lease return self.add_or_renew_lease(storage_index, shnum, renewal_time, expiration_time) File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/account.py", line 63, in add_or_renew_lease renewal_time, expiration_time) File "/home/davidsarah/tahoe/bitsys/allmydata-tahoe-1.9.0.post363/src/allmydata/storage/leasedb.py", line 250, in add_or_renew_leases raise NonExistentShareError(si_s, shnum) allmydata.storage.leasedb.NonExistentShareError: can't find SI='lp6ibjsxe6vf6ern6v3soepjh4' shnum=0 in `shares` table ] local#18714 23:06:20.533: Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x433fef0> local#18715 23:06:20.545: storage: aborting write to share None
I'll avoid speculation in the ticket description, but I don't think this is specific to OpenStack.
Change History (8)
Changed at 2013-02-21T01:10:40Z by davidsarah
comment:1 follow-up: ↓ 2 Changed at 2013-02-21T01:35:31Z by davidsarah
comment:2 in reply to: ↑ 1 Changed at 2013-02-21T03:44:11Z by davidsarah
Replying to davidsarah:
[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]
This has been fixed; header values are no longer logged.
comment:3 Changed at 2013-02-27T23:53:05Z by zooko
- Keywords crawlers added
comment:4 Changed at 2013-03-01T01:40:35Z by davidsarah
- Keywords accounting-crawler added; openstack cloud-backend crawlers removed
I believe this was fixed in commits 4cd54e36 ("leasedb/accounting crawler: only treat stable shares as disappeared or unleased") and 9ebc0e8b ("OpenStack: fix a type error introduced by the fix to #1921") on the 1909-cloud-openstack branch. Note that this is not on the 1818-leasedb branch, and in general I need to review all patches on 1909-cloud-openstack 1819-cloud-merge to see whether they are applicable to 1818-leasedb. I have just noted this on #1818.
The problem was indeed not specific to OpenStack (or the cloud backend). The leasedb design doc had the correct design, which was for the accounting crawler to treat shares in states other than STABLE as leased, but that requirement had been missed in the implementation. That reminds me that we need better tests for edge cases in the accounting crawler.
I'll leave this ticket open until the fix is on the 1818-leasedb branch.
comment:5 Changed at 2013-03-03T22:48:26Z by davidsarah
I saw another instance of this after the patch that was supposed to have fixed it :-( I haven't investigated that yet.
comment:6 Changed at 2013-05-25T22:31:56Z by daira
- Description modified (diff)
#1987 may be the same bug as this. I'm not marking them as duplicates because I'm not sure of that yet.
comment:7 Changed at 2013-07-04T18:48:32Z by zooko
- Description modified (diff)
Okay, now for the speculation:
Notice that the accounting crawler is running concurrently with a PUT. The accounting crawler is what is causing the list requests for prefix 'lp', 'lq', etc.
What triggers the incident is that the crawler sees a share that unexpectedly disappeared, i.e. it has a leasedb entry but no chunk object(s):
Almost immediately afterward, the concurrent PUT request fails. That request is for the same share that the crawler saw "disappear". It seems as though what happened is that there was a race between the crawler and the share creation:
I think this will only happen when the accounting crawler is looking at a prefix while a share is being uploaded to it (that is not difficult to reproduce if you do enough uploads).
[And sigh, I just noticed that I leaked the Auth-Token. It's fine, the Auth-Token is only valid for 24h and I'll delete the container contents after that anyway. But that does need fixing.]