#2015 closed defect (fixed)
leasedb: high file descriptor usage exposed by tests
Reported by: | daira | Owned by: | markberger |
---|---|---|---|
Priority: | major | Milestone: | 1.14.0 |
Component: | code-storage | Version: | cloud-branch |
Keywords: | leasedb tests reliability file-descriptor leak | Cc: | |
Launchpad Bug: |
Description
Running tests on the leasedb or cloud-merge branches may cause the trial process to run out of file descriptors. This happens more readily on platforms where the default fd limit is small (e.g. 256 on Mac OS X), or when using synchronous = NORMAL or FULL (as opposed to OFF) options to sqlite for the leasedb.
If there is a file descriptor leak, we should fix it or work around it; otherwise we should use iputil.increase_rlimits() to increase the number of fds available on platforms where the limits are too small (as we do in node processes).
See also #1870 for a possibly-related performance regression on the leasedb branches.
Attachments (2)
Change History (21)
comment:1 follow-up: ↓ 2 Changed at 2013-07-06T18:17:46Z by zooko
Changed at 2013-07-06T18:18:20Z by zooko
comment:2 in reply to: ↑ 1 ; follow-up: ↓ 3 Changed at 2013-07-07T19:53:34Z by daira
Replying to zooko:
fdleakfinder definitely reports that the sqlite files opened during test cases are left open instead of closed at the end of this test case.
That's understandable, because there is no code to close them! (This shouldn't matter for a node which only has one DB connection open; it only matters for tests.)
comment:3 in reply to: ↑ 2 Changed at 2013-07-07T21:53:56Z by zooko
Replying to daira:
Replying to zooko:
fdleakfinder definitely reports that the sqlite files opened during test cases are left open instead of closed at the end of this test case.
That's understandable, because there is no code to close them! (This shouldn't matter for a node which only has one DB connection open; it only matters for tests.)
I would have assumed that the fds would be closed automatically when the Python objects were garbage-collected.
markberger: it sounds like there might be a bug in the Python sqlite module such that it leaks fds. Or at least a misunderstanding on my part. You could investigate this hypothesis by writing a program that opens a million sqlite databases in a row while losing its reference to teh previous one. Something like this (untested, written by cutting and pasting a bit of code from dbutil.py):
for i in range(1000000): sqlite3.connect("db-test-file-%d" % i)
If that script runs out of fds, then indeed the underlying database files are not getting closed when the accompanying Python objects get garbage collected. (Or the Python objects are not getting garbage-collected in a timely way, due to reference cycles preventing their de-allocation by reference-counting…)
comment:4 follow-up: ↓ 5 Changed at 2013-07-08T14:45:30Z by markberger
I called dbtuil.get_db 4000 times like in the above example and I did not receive any fds errors.
for i in range(4000): dbfile = "test-db-%d" % i conn = get_db(dbfile, create_version=(LEASE_SCHEMA_V1, 1), journal_mode="WAL", synchronous="NORMAL")
comment:5 in reply to: ↑ 4 Changed at 2013-07-08T15:06:39Z by zooko
markberger: interesting! Did it create 4000 files in the local filesystem? Could you cause it to hang (for example with time.sleep(1000)) and see how many fds it has open with the lsof command-line tool?
I guess I'll run your script under strace (on my linux system) and use my fdleakfinder script on it and see what I learn…
comment:6 follow-up: ↓ 7 Changed at 2013-07-08T15:56:25Z by markberger
With my script the total number of fds appears to remain constant during execution when I call sudo lsof | wc -l from another terminal. This isn't the case when running the test suite (as your fdleakfinder has shown).
comment:7 in reply to: ↑ 6 Changed at 2013-07-08T16:46:29Z by zooko
Replying to markberger:
With my script the total number of fds appears to remain constant during execution when I call sudo lsof | wc -l from another terminal. This isn't the case when running the test suite (as your fdleakfinder has shown).
Okay, markberger. Good job disproving the hypothesis that there is a leak in the Python sqlite module, or even in our allmydata/util/dbutil.py! Now what's the next step to find the problem? The fdleakfinder outputs show which specific files are still being held open after tests end…
comment:8 Changed at 2013-07-08T17:03:28Z by zooko
By the way, here is the output from fdleakfinder on markberger's script that opened 4000 dbs one after another:
--- high water mark: 15 {3: ('"/home/zooko/playground/tahoe-lafs/src/allmydata/__init__.py", O_RDONLY', '1373301887.690948 open("/home/zooko/playground/tahoe-lafs/src/allmydata/__init__.py", O_RDONLY) = 3 <0.000023>\n'), 4: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/__init__.py", O_RDONLY', '1373301887.804392 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/__init__.py", O_RDONLY) = 4 <0.000023>\n'), 5: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/formless/webform.py", O_RDONLY', '1373301888.392319 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/formless/webform.py", O_RDONLY) = 5 <0.000023>\n'), 6: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/static.py", O_RDONLY', '1373301888.394230 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Nevow-0.10.0-py2.7.egg/nevow/static.py", O_RDONLY) = 6 <0.000023>\n'), 7: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/web/http.py", O_RDONLY', '1373301888.785862 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/web/http.py", O_RDONLY) = 7 <0.000031>\n'), 8: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/reactor.py", O_RDONLY', '1373301888.818183 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/reactor.py", O_RDONLY) = 8 <0.000033>\n'), 9: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/default.py", O_RDONLY', '1373301888.819526 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/default.py", O_RDONLY) = 9 <0.000031>\n'), 10: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/epollreactor.py", O_RDONLY', '1373301888.820879 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/epollreactor.py", O_RDONLY) = 10 <0.000032>\n'), 11: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py", O_RDONLY', '1373301888.822672 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py", O_RDONLY) = 11 <0.000032>\n'), 12: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", O_RDONLY', '1373301888.863697 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", O_RDONLY) = 12 <0.000033>\n'), 13: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/_newtls.py", O_RDONLY', '1373301888.865874 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/internet/_newtls.py", O_RDONLY) = 13 <0.000032>\n'), 14: ('"/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/protocols/tls.py", O_RDONLY', '1373301888.869115 open("/home/zooko/playground/tahoe-lafs/support/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-linux-x86_64.egg/twisted/protocols/tls.py", O_RDONLY) = 14 <0.000031>\n'), 15: ('"/usr/lib/python2.7/dist-packages/OpenSSL/__init__.py", O_RDONLY', '1373301888.884666 open("/usr/lib/python2.7/dist-packages/OpenSSL/__init__.py", O_RDONLY) = 15 <0.000037>\n'), 16: ('"/usr/lib/python2.7/plat-linux2/DLFCN.py", O_RDONLY', '1373301888.898436 open("/usr/lib/python2.7/plat-linux2/DLFCN.py", O_RDONLY) = 16 <0.000029>\n'), 17: ('"/usr/lib/python2.7/plat-linux2/DLFCN.pyc", O_RDONLY', '1373301888.898649 open("/usr/lib/python2.7/plat-linux2/DLFCN.pyc", O_RDONLY) = 17 <0.000030>\n')}
comment:9 Changed at 2013-07-09T17:17:24Z by zooko
- Owner changed from markberger to zooko
- Status changed from new to assigned
I'm working on this right now. FYI here's the patch I'm testing.
Changed at 2013-07-09T17:18:08Z by zooko
comment:10 Changed at 2013-07-11T06:15:56Z by zooko
- Keywords test-needed added
Here is a patch for your enjoyment:
https://github.com/zooko/tahoe-lafs/tree/2015-high-fd-usage
However, it is not yet suitable for merge to trunk since it lacks a unit test.
comment:11 Changed at 2013-07-11T17:22:07Z by markberger
The patch also causes the following tests to fail:
[ERROR] Traceback (most recent call last): File "/Users/markberger/Code/tahoe-fork/src/allmydata/storage/server.py", line 258, in <lambda> canary)) File "/Users/markberger/Code/tahoe-fork/src/allmydata/storage/backends/disk/disk_backend.py", line 167, in make_bucket_writer bw = BucketWriter(account, immsh, canary) File "/Users/markberger/Code/tahoe-fork/src/allmydata/storage/bucket.py", line 28, in __init__ share.get_allocated_data_length(), SHARETYPE_IMMUTABLE) File "/Users/markberger/Code/tahoe-fork/src/allmydata/storage/account.py", line 54, in add_share self._leasedb.add_new_share(storage_index, shnum, used_space, sharetype) File "/Users/markberger/Code/tahoe-fork/src/allmydata/storage/leasedb.py", line 167, in add_new_share self._cursor.execute("INSERT OR REPLACE INTO `shares`" exceptions.AttributeError: LeaseDB instance has no attribute '_cursor' allmydata.test.test_storage.AccountingCrawlerWithDiskBackend.test_basic allmydata.test.test_storage.AccountingCrawlerWithDiskBackend.test_expire_age allmydata.test.test_storage.AccountingCrawlerWithDiskBackend.test_expire_cutoff_date allmydata.test.test_storage.AccountingCrawlerWithDiskBackend.test_limited_history
comment:12 Changed at 2013-07-11T20:37:01Z by markberger
- Keywords review-needed added; test-needed removed
On this branch (https://github.com/markberger/tahoe-lafs/tree/2015-high-fd-usage) there is a patch by daira which fixes the problem that caused the tests to fail. I've also added a unit test for this ticket which ensures that the database connection and the cursor are closed when the service stops.
comment:13 Changed at 2013-07-12T05:13:25Z by zooko
Good job, Mark_B! I hope someone reviews this soon… I personally am going to focus on imminent commercial launch at https://LeastAuthority.com instead, for the next week or so.
Are you going to move on to… Something else, now? Oh, I know, go back to fixing the regression in new-server-selector (#1382) about contacting too many servers?
comment:14 Changed at 2013-07-12T14:58:56Z by daira
- Keywords review-needed removed
- Owner changed from zooko to markberger
- Status changed from assigned to new
Reviewed with comments on github. Please squash the commits from and including zooko's into a single commit. (Ask on irc if you don't know how.)
Zooko: tut tut, no chitchat unrelated to the ticket in ticket comments :-)
comment:15 Changed at 2013-07-12T15:34:50Z by markberger
The commits have been squashed and I opened a pull request to LeastAuthority?/1819-cloud-merge branch. The pull request is here: https://github.com/LeastAuthority/tahoe-lafs/pull/1
comment:16 Changed at 2013-07-12T19:18:30Z by daira
- Milestone changed from 1.11.0 to soon
- Resolution set to fixed
- Status changed from new to closed
- Version changed from 1.10.0 to cloud-branch
Fixed in e6da7e2e on the 1819-cloud-merge branch.
comment:17 Changed at 2014-11-27T03:51:41Z by daira
- Milestone changed from soon to 1.12.0
comment:18 Changed at 2016-03-22T05:02:25Z by warner
- Milestone changed from 1.12.0 to 1.13.0
Milestone renamed
comment:19 Changed at 2016-06-28T18:17:14Z by warner
- Milestone changed from 1.13.0 to 1.14.0
renaming milestone
I learned that to recreate the big performance regression, you need to apply this patch to current 1819-cloud-merge:
I used my fdleakfinder (https://github.com/zooko/fdleakfinder) on the full test suite, with optimized sqlite settings (journal_mode="WAL", synchronous="NORMAL", see below), and with fd limit bumped up to the max that my linux allows -- 2000.
fdleakfinder definitely reports that the sqlite files opened during test cases are left open instead of closed at the end of this test case. It is reporting the "high water mark", which is a list of all the fds that were open *at one particular* time. See attached output from fdleakfinder.