[tahoe-dev] configuring sqlite efficiency and durability

Zooko Wilcox-O'Hearn zooko at zooko.com
Tue Dec 4 02:13:22 UTC 2012


Folks:

One of the unit tests
(allmydata.test.test_cli.Cp.test_copy_using_filecap) takes more than
1000 seconds to complete on the leasedb branch, which uses sqlite for
tracking leases instead of putting leases as metadata appended to the
ends of share container files. The same test takes about 13 seconds on
the master branch.

I instrumented the calls to "leasedb" and observed that there were
3240 lease updates during the course of that test. There are 10
servers in the tests, and after the test is over there are 78 shares
left per server. Our storage servers and clients are both in the habit
of updating the leases whenever the client does anything with the
file, such as updating the contents of a mutable file. I think this
behavior should be changed. It isn't part of the explicit interface;
it is just an "opportunistic" sort of thing that storage servers and
clients have been doing. But I think the cost may be a bit more
significant with leasedb than it was with the old scheme. Also, in the
future you may be willing to write updates into a file and yet
unwilling to pay for that file's preservation!

But aside from that, even if leasedb gets updated only when a lease is
explicitly added-or-renewed by a client, we would like to be able to
support a high load of clients doing that. According to the rough
estimates mentioned above, the current leasedb branch can do about 3.2
lease updates per second. The master branch can do closer to 250 lease
updates per second. (This is all on my MacBook Pro 5,3 with a spinning
disk drive.)

So, I experimented with two of the settings of sqlite that trade off
durability for performance:

Write-Ahead Logging:

http://www.sqlite.org/wal.html

the "synchronous" setting:

http://www.sqlite.org/pragma.html#pragma_synchronous
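
Both are controlled with PRAGMA statements. Here is a minimal sketch,
using Python's standard sqlite3 module, of how they can be applied and
checked; the filename is just illustrative, not the real leasedb path:

    import sqlite3

    conn = sqlite3.connect("leases.sqlite")   # illustrative filename
    c = conn.cursor()

    # The journal_mode pragma returns the mode actually in effect, so we
    # can confirm that WAL really got enabled (it reports the old mode
    # if the change could not be made, e.g. on some network filesystems).
    (mode,) = c.execute("PRAGMA journal_mode = WAL;").fetchone()
    assert mode.lower() == "wal", mode

    # With WAL, synchronous = NORMAL fsyncs the log only at checkpoints
    # rather than on every transaction commit.
    c.execute("PRAGMA synchronous = NORMAL;")

Note that journal_mode = WAL is persistent (it is recorded in the
database file and stays in effect for later connections), whereas
synchronous is a per-connection setting and has to be re-issued every
time the database is opened.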

The results are interesting. All numbers here are the total wall-clock
time to run that unit test which adds-or-renews leases 3240 times:

synchronous = FULL (default), journal_mode = default (rollback journal): 1042.4 seconds

synchronous = FULL (default), journal_mode = WAL: 308.5 seconds

synchronous = OFF, journal_mode = default (rollback journal), best of 5 runs: 14.9 seconds

synchronous = OFF, journal_mode = WAL, best of 5 runs: 18.1 seconds

synchronous = NORMAL, journal_mode = WAL, best of 5 runs: 19.7 seconds
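
For anyone who wants to reproduce this comparison without running the
whole unit test, a rough micro-benchmark along the following lines
(with a made-up one-table schema, not the real leasedb schema) measures
the same kind of per-transaction overhead; the absolute numbers will of
course depend on the machine and disk:

    import os, sqlite3, time

    def bench(journal_mode, synchronous, n=3240, path="bench.sqlite"):
        # start from a clean slate each run
        for suffix in ("", "-wal", "-shm"):
            if os.path.exists(path + suffix):
                os.remove(path + suffix)
        conn = sqlite3.connect(path)
        c = conn.cursor()
        c.execute("PRAGMA journal_mode = %s;" % journal_mode)
        c.execute("PRAGMA synchronous = %s;" % synchronous)
        c.execute("CREATE TABLE leases (shareid INTEGER PRIMARY KEY, renewed REAL)")
        conn.commit()
        start = time.time()
        for i in range(n):
            # one small transaction per add-or-renew, which is assumed to
            # approximate what the storage server does per lease update
            c.execute("INSERT OR REPLACE INTO leases VALUES (?, ?)",
                      (i % 78, time.time()))
            conn.commit()
        elapsed = time.time() - start
        print("%-6s / %-6s: %6.1f s  (%.1f updates/s)"
              % (journal_mode, synchronous, elapsed, n / elapsed))
        conn.close()

    bench("DELETE", "FULL")    # the defaults: rollback journal, full sync
    bench("WAL", "FULL")
    bench("WAL", "NORMAL")
    bench("DELETE", "OFF")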

The meaning of these flags can be gleaned by reading sqlite's
excellent documentation, as linked above. The two most promising
settings to my way of thinking are (synchronous=FULL, journal_mode=WAL)
and (synchronous=NORMAL, journal_mode=WAL). In the first setting, we
get full atomicity and durability while processing about 10 lease
updates per second (3240 updates in 308.5 seconds). In the second
setting we get full atomicity but not durability while processing
about 160 lease updates per second (3240 updates in 19.7 seconds).

In the latter setting we *still* have full atomicity, so there is no
risk of database corruption even in case of power outage, but we lose
durability. That means that the db state may roll back to an earlier
state, effectively erasing some of the most recent transactions. The
failure mode that we might worry about there, as Brian explained to me
today, is that the storage server updates the leasedb to add/renew a
lease, then sends a message back to the storage client saying that it
was successful, then the kernel or underlying machine/VM crashes, and
when it restarts that transaction has been erased, so there isn't a
fresh lease on that share even though the client was told that there
was.

Now, this is not that bad of a failure mode, and it probably would
happen very rarely. The reason it is not a bad failure mode is that
clients already need to be able to handle worse failures, such as a
server telling them that their lease is good and then (accidentally or
maliciously) deleting the share entirely. This failure mode is not as
bad as that one, since if the client, or a different client,
re-establishes the lease before the time expires, the share will be
preserved. The reason that it is not a common failure mode is that it
could only happen if there were a kernel panic or a power outage at
the wrong instant, right after the server sent a message to the client
acknowledging the lease renewal, but before the lease updates reached
the durable Write-Ahead Log. In fact, as David-Sarah pointed out
today, since we update the leasedb and then write out the full share
data before acking on file upload, the window of opportunity for this
failure is probably zero on file upload. It may be non-zero, but is
probably still very short, on lease renewals that are not part of the
initial share upload.

There is a similar argument to be made for the other use of sqlite in
Tahoe-LAFS, the backupdb. There, the cost of a transaction being
rolled back after it had already been committed (due to a kernel panic
or power outage at the wrong instant) is merely that you will hash the
file again the next time you decide whether or not to upload it to the
storage servers. So that is even less of a problem.

So in short, I'm in favor of using synchronous=NORMAL,
journal_mode=WAL for all current uses of sqlite in Tahoe-LAFS. Let me
know if you see a flaw in this reasoning!

The code will look like this, plus some comments:

@@ -31,6 +32,10 @@ def get_db(dbfile, stderr=sys.stderr,
     # The default is unspecified according to
     # <http://www.sqlite.org/foreignkeys.html#fk_enable>.
     c.execute("PRAGMA foreign_keys = ON;")

+    c.execute("PRAGMA journal_mode = WAL;")
+
+    c.execute("PRAGMA synchronous = NORMAL;")
+
     if must_create:
         c.executescript(schema)
         c.execute("INSERT INTO version (version) VALUES (?)",
(target_version,))

Regards,

Zooko

