#1240 closed defect (fixed)

remove ResponseCache in favour of MDMFSlotReadProxy's cache

Reported by: davidsarah
Owned by: davidsarah
Priority: major
Milestone: 1.10.0
Component: code-mutable
Version: 1.8.0
Keywords: mutable cache reviewed
Cc:
Launchpad Bug:

Description (last modified by lebek)

MDMFSlotReadProxy can already be initialised with share data, which it uses in preference to fetching from the server. By exploiting this behaviour we can remove ResponseCache, which does essentially the same thing.
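
For reference, the read proxy's constructor already takes the share data as its final argument (the positional form below matches the diff in comment:32; the variable names are illustrative):

    # A minimal sketch: seeding a read proxy with already-fetched share
    # bytes, so later reads are served from its local cache instead of
    # issuing slot_readv calls to the server.
    from allmydata.mutable.layout import MDMFSlotReadProxy

    reader = MDMFSlotReadProxy(server.get_rref(), storage_index, shnum,
                               share_bytes)  # share_bytes is illustrative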

ResponseCache had no functional tests of how it is used (as noted in 1045#comment:39, deactivating the cache by commenting out the line that adds entries to it currently causes all tests to pass). The MDMFSlotReadProxy cache will require a test of this type.

One way to test that the cache is working would be to download a file, then break all of the servers holding its shares, then download it again in order to check that the shares are retrieved from cache rather than from the servers.
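
A rough sketch of such a test, assuming a NoNetworkGrid-style harness and an existing mutable filenode `node` (the helper names get_all_serverids and break_server are assumptions about the test fixture, not confirmed APIs):

    # Sketch only: download once to warm the cache, break every server,
    # then download again. If the second download succeeds, the shares
    # must have been served from the cache rather than the servers.
    d = node.download_best_version()
    def _break_and_redownload(content):
        for serverid in self.g.get_all_serverids():  # assumed helper
            self.g.break_server(serverid)            # assumed helper
        return node.download_best_version()
    d.addCallback(_break_and_redownload)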

Change History (34)

comment:1 Changed at 2011-01-06T02:02:17Z by davidsarah

  • Milestone changed from soon to 1.8.2
  • Owner set to davidsarah
  • Status changed from new to assigned

comment:2 Changed at 2011-01-18T04:17:56Z by davidsarah

  • Milestone changed from 1.8.2 to 1.9.0

Ran out of time for 1.8.2.

comment:3 Changed at 2011-08-13T23:26:35Z by davidsarah

  • Milestone changed from 1.9.0 to 1.10.0

comment:4 Changed at 2012-04-01T07:59:06Z by amiller

  • Keywords review-needed added; test-needed removed

I wrote a test that shows the cache is accessed during downloads from a NoNetworkGrid. https://github.com/amiller/tahoe-lafs/pull/5/files

I initially tried a "break-the-servers" test as per the ticket description, but that fails because the client then finds no available versions. This seems consistent with the goals expressed in #465, which mentions performance but not availability.

comment:5 Changed at 2012-04-01T20:46:44Z by zooko

  • Owner changed from davidsarah to zooko
  • Status changed from assigned to new

comment:6 Changed at 2012-04-01T20:46:50Z by zooko

  • Status changed from new to assigned

comment:7 Changed at 2012-04-04T19:48:49Z by amiller

After spending some time with Zooko tracing the use of the ResponseCache, we have a new strategy: remove it altogether. Instead, we can make better use of the "_data" cache that already exists in !MDMFSlotReadProxy.

To begin with, here is a test that runs against trunk and counts the remote calls to 'slot_readv', i.e., the cache misses.

https://github.com/amiller/tahoe-lafs/commit/21008a4d3eb3c491fbc48d9fb0e8e3eadd3c45a4

(It's probably not good form for this test to just print counters to the console, so I'll fix that as soon as someone suggests a better alternative.)
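
The counting itself can be as simple as wrapping each local storage server's slot_readv; a sketch (the iteration helper is a stand-in for whatever the NoNetworkGrid actually exposes):

    # Sketch: count slot_readv invocations, i.e. cache misses. Only the
    # wrapping idea matters here; get_storage_servers() is hypothetical.
    counts = {"slot_readv": 0}
    def add_counter(ss):
        orig = ss.remote_slot_readv
        def counting_slot_readv(*args, **kwargs):
            counts["slot_readv"] += 1
            return orig(*args, **kwargs)
        ss.remote_slot_readv = counting_slot_readv
    for ss in get_storage_servers():
        add_counter(ss)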

comment:8 Changed at 2012-04-04T20:42:02Z by amiller

  • Keywords test-needed added

This commit removes ResponseCache. Instead, the !MDMFSlotReadProxys created by ServerMap are kept around as instance variables. Retrieve is able to access these rather than creating its own instance. As a result, the MDMFSlotReadProxy._data cache obviates the need for the additional logic of ResponseCache.

Notably, by removing some logic that produced false cache misses (https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/src/allmydata/mutable/retrieve.py?rev=a56e639346a4bb38#L293), the number of cache misses drops from 84 to 60 in test_deepcheck_mdmf.

https://github.com/amiller/tahoe-lafs/commit/cbdfb002a4bd8d16ec3b3faa608cfeb9e6325bfd
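
In outline, the change looks like this (a sketch; only the proxies attribute and its key shape, visible in the comment:32 diff at the end of this ticket, are taken from the actual patch):

    # Sketch: the servermap update keeps every reader it creates, so
    # Retrieve can later reuse the reader -- and its populated _data
    # cache -- instead of building a fresh, empty proxy.
    class ServerMap:
        def __init__(self):
            # (verinfo, serverid, storage_index, shnum) -> MDMFSlotReadProxy
            self.proxies = {}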

comment:9 Changed at 2012-04-18T08:38:25Z by amiller

  • Keywords test-needed removed

My test branch (amiller-1240-tests) differs from trunk only minimally; in particular, the only test it removes is the one for ResponseCache itself. The feature branch (amiller-1240) also passes these tests now.

The only difference is that the number of slot_readv calls in trunk is 84, which goes down to 60 after replacing ResponseCache with this simpler alternative (according to the printouts in test_dirnode.Dirnode.test_deepcheck_cachemisses).

Test branch Feature branch

comment:10 Changed at 2012-04-18T15:17:00Z by davidsarah

  • Milestone changed from 1.11.0 to 1.10.0

comment:11 Changed at 2012-05-05T10:52:36Z by lebek

  • Owner changed from zooko to lebek
  • Status changed from assigned to new

comment:12 Changed at 2012-05-05T14:38:43Z by lebek

  • Description modified (diff)
  • Summary changed from add functional test of ResponseCache to remove ResponseCache in favour of MDMFSlotReadProxy's cache

comment:13 Changed at 2012-05-05T15:50:58Z by lebek

  • Keywords review-needed removed
  • Owner changed from lebek to amiller

Thanks amiller, I just reviewed this - a net-negative-lines-of-code patch which also increases cache hits, very nice.

Everything looks good to me, just two points:

  • test_deepcheck_cachemisses is functionally identical to test_deepcheck. The print of the slot_readv call count should be removed and replaced with unittest assertions that verify the cache is accessed when it should be (I think you mention this above; see the sketch after this list)
  • in src/allmydata/mutable/retrieve.py you import allmydata.util.spans, but the module is never used
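
For the first point, something along these lines would do (a sketch; the counter plumbing is whatever the test already uses, and 60 is the count reported in comment:9):

    # Sketch: assert on the call count instead of printing it.
    self.failUnlessEqual(counts["slot_readv"], 60,
                         "unexpected number of slot_readv calls "
                         "(i.e. cache misses)")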

That's all.

Last edited at 2012-05-05T15:53:49Z by lebek

comment:14 Changed at 2012-05-05T21:31:17Z by amiller

  • Keywords review-needed added
  • Owner changed from amiller to lebek

I got rid of the lines in test_deepcheck_cachemisses that duplicate test_deepcheck functionality. I also got rid of the print statement in the test - now it will fail if the cache isn't hit as often as expected. I also had to fix a minor tuple bug, which this test would have caught!

Test branch Feature branch

comment:15 Changed at 2012-05-13T18:28:44Z by warner

Hrm, I'm not 100% sure about retaining the readproxy for a long time. In the long run, I'd like to get rid of those proxies and move their functionality into the publish/retrieve code itself, arranged like I did with the new immutable downloader.

But, for now, having fewer moving parts definitely seems like a good thing. I'll look this over and see if I can review+land it.

comment:16 Changed at 2012-06-18T15:49:38Z by zooko

  • Keywords reviewed added; review-needed removed

I consider amiller's work (which I helped with initially) and lebek's review sufficient for merging this to trunk. Removing review-needed and adding reviewed, which means that someone with commit privs should go ahead and do that...

comment:17 Changed at 2012-07-17T13:29:01Z by zooko

  • Owner changed from lebek to zooko
  • Status changed from new to assigned

I will figure out how to merge this patch to trunk. Last time I got confused by how one should use git to do this. If I get confused by git again, I'll just apply the diff...

comment:18 Changed at 2012-09-04T16:05:16Z by warner

  • Owner changed from zooko to warner
  • Status changed from assigned to new

comment:19 Changed at 2012-09-04T23:22:02Z by warner

I rebased and rewrote this patch, fixed up the leftover debugging settings and removed the trailing whitespace. It doesn't pass tests yet (several in test_mutable are failing, at least). Branch is here: https://github.com/warner/tahoe-lafs/tree/1240-2

comment:20 Changed at 2012-09-04T23:40:26Z by warner

  • Owner changed from warner to amiller

Kicking this back to amiller to fix the test failures. In at least one of them (test_retrieve_surprise), a call to clear the old ResponseCache was removed, but no call to clear the new cache (inside the readproxy?) was added. Not sure what's going on with the other ones.

comment:21 Changed at 2012-09-05T03:27:39Z by davidsarah

  • Keywords reviewed removed

comment:22 Changed at 2012-10-23T22:23:40Z by amiller

After several weeks, this mystery is mostly solved. In an earlier commit ff8a0a, I introduced an error that prevented the cache from being populated correctly.

The error was something along the lines of store(H(data), data) followed by get(H'(data)), which resulted in the cache being less effective (but not incorrect). This loss of effectiveness was successfully detected by test_deepcheck_cachemisses in that commit.
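
Illustratively (hypothetical names; the point is only that the store key and the lookup key were derived differently):

    # Store under one key derivation...
    cache[key_a(verinfo)] = data
    # ...but look up under another: always a miss, so the data is simply
    # refetched. Wrong answers are impossible; the cache just goes unused.
    cached = cache.get(key_b(verinfo))   # -> None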

In a later commit, 1c50a4d, I fixed this error, but in doing so broke three tests that expected a larger number of server reads (i.e., cache misses) during a file download. Very small files, such as 'content1' in these tests, are loaded entirely into the cache during the 'get blocks' protocol - the actual file download produces no additional reads.

In the newest commit 8d68da2 (which is rebased against a recent trunk) I applied the simplest fix to two of these tests, which is to increase the size of the test content past the initial cache fill. The third test, test_corrupt_all_block_hash_tree_late, is fixed by adding 'corrupt_early=True' to the corruption test. I am not entirely sure whether this changes the semantics of the test, so I leave that as a question for the next reviewer.


comment:23 Changed at 2012-10-23T22:25:57Z by amiller

  • Keywords review-needed added
  • Owner amiller deleted

comment:24 Changed at 2012-11-01T16:45:26Z by amiller

  • Keywords test-needed added; review-needed removed
  • Owner set to amiller

Kicking it back to myself, since adding 'corrupt_early=True' does in fact change the semantics of the test (this is the reason there's both a 'corrupt' and a 'corrupt_late' version of each corruption test).

The first two tests were fixed by increasing the length of the content so that a second server fetch is necessary. However, the third test involves corruption of the block hash tree (BHT), which is typically small and doesn't exceed the first cache fill. Since there's no guarantee in general that the BHT will fit in the first cache fill, it would be wrong simply to remove this test.

Ways to proceed:

  1. Introduce a way to disable the cache for a particular test. This would also require a unit test to assert whether the cache is used or not.
  2. Figure out how to force the block-hash-tree to be huge, so that it exceeds the initial cache fill on its own?

comment:25 Changed at 2012-11-01T19:48:46Z by davidsarah

  1. Have the test patch the cache size?
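
For example (a sketch; the patched name _read_size is a guess at where the ~4000-byte initial fetch is configured, not a confirmed attribute):

    # Sketch: shrink the initial fetch so even a small block hash tree
    # falls outside the first cache fill. Both dotted names are guesses.
    import mock
    with mock.patch("allmydata.mutable.servermap.ServermapUpdater._read_size",
                    10, create=True):
        return self._corrupt_and_retrieve()   # hypothetical test body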

comment:26 follow-up: Changed at 2012-11-08T15:38:37Z by amiller

  • Keywords review-needed added; test-needed removed

I tried increasing the content size (up to the limit of my patience for running the test), but it didn't increase the size of the hash tree past the 4000 bytes it would take to push it out of the initial cache fill. What's an estimate of the file size this would require?

I added a function to MDMFSlotReadProxy to clear its cache, as well as one to clear all the proxies in a servermap. test_corrupt_all calls this before attempting to retrieve data. All the tests pass once again: d754999a
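
Roughly (a sketch of the shape only, not the actual d754999a code; _data is assumed to be where the prefetched share bytes live):

    # Sketch: drop the prefetched bytes so the next read goes back to the
    # server, letting the corruption tests observe the corrupted share.
    class MDMFSlotReadProxy:
        def clear_cache(self):
            self._data = ""   # assumed cache attribute

    class ServerMap:
        def clear_all_caches(self):
            for reader in self.proxies.values():
                reader.clear_cache()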

comment:27 in reply to: ↑ 26 ; follow-up: Changed at 2012-11-08T16:19:36Z by amiller

The reason the block_hash_tree didn't extend past 4000 bytes is that this test is for an SDMF file, which has exactly one segment, and therefore the block_hash_tree never exceeds the initial cache fill. So the best solution after all is simply to remove this test: 06fe92

Note that there is a similar test for MDMF files. This test already passes, because for this test data the hash tree exceeds the initial read.

comment:28 Changed at 2012-11-08T16:44:32Z by davidsarah

  • Owner changed from amiller to davidsarah
  • Status changed from new to assigned

comment:29 in reply to: ↑ 27 ; follow-up: Changed at 2012-11-08T18:44:02Z by davidsarah

Replying to amiller:

So the best solution after all is simply to remove this test: 06fe92

+1, but I suggest leaving a comment saying why no test corresponding to the MDMF one is needed.

comment:30 in reply to: ↑ 29 Changed at 2012-12-20T16:41:04Z by amiller

Replying to davidsarah:

+1, but I suggest leaving a comment saying why no test corresponding to the MDMF one is needed.

I added a brief comment in the MDMF version explaining why its SDMF counterpart is unnecessary in this commit: 168ec4f4

comment:31 Changed at 2012-12-20T16:52:16Z by davidsarah

  • Keywords reviewed added; review-needed removed

+1

comment:32 Changed at 2012-12-26T23:48:22Z by davidsarah

I made the following change in mutable/retrieve.py when rebasing for trunk:

-            # Reuse the SlotReader from the servermap
-            #print 'retrieving version:', hash(self.verinfo)
-            try:
-                reader = self.servermap.proxies[(self.verinfo,
-                                                 server.get_serverid(),
-                                                 self._storage_index, shnum)]
-            except KeyError:
-                reader = MDMFSlotReadProxy(server.get_rref(),
-                                           self._storage_index, shnum, None)
+            # Reuse the SlotReader from the servermap.
+            key = (self.verinfo, server.get_serverid(),
+                   self._storage_index, shnum)
+            if key in self.servermap.proxies:
+                reader = self.servermap.proxies[key]
+            else:
+                reader = MDMFSlotReadProxy(server.get_rref(),
+                                           self._storage_index, shnum, None)

comment:33 Changed at 2012-12-29T02:40:42Z by davidsarah

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixed in 4563ba456b1b2d64.

comment:34 Changed at 2013-01-02T07:04:52Z by zooko

+0 on the change in comment:32.
