Opened at 2008-06-14T17:18:11Z
Last modified at 2010-10-24T16:56:07Z
#465 new enhancement
add a mutable-file cache
Reported by: | warner | Owned by: |
---|---|---|---
Priority: | major | Milestone: | eventually
Component: | code-mutable | Version: | 1.1.0
Keywords: | performance cache mutable confidentiality memory | Cc: |
Launchpad Bug: | | |
Description (last modified by warner)
If the mutable-file retrieval process were allowed to keep an on-disk cache (indexed by SI+roothash, populated by either publish or retrieve), then many directory-traversal operations would run considerably faster.
The cache could store plaintext, or ciphertext (or, if/when we implement deep-traversal caps, it could contain the intermediate form). For a node that runs on behalf of a single user, the plaintext cache would be fastest. For a webapi node that works for multiple users, I wouldn't feel comfortable holding on to the plaintext for any longer than necessary (i.e. we want to maintain forward secrecy), so if a cache still seemed useful then I'd want it to be just a ciphertext cache.
The cache should only store one version per SI, so once we publish or discover a different roothash, that should replace the old cache entry. The cache should have a bounded size, with a random-discard policy. Of course we need some efficient way to manage that size: doing 'du -s' on the cache directory would be slow for a large cache, so either we should keep the cache from getting that large or do something more clever.
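The random-discard policy could be as simple as counting directory entries and evicting victims at random until the bound is met. A minimal sketch (the function name and signature are hypothetical, not Tahoe APIs):

```python
import os
import random

def enforce_cache_bound(cachedir, max_entries):
    """Randomly discard cache files until the directory holds at most
    max_entries files. Counting directory entries avoids the slow
    'du -s'-style walk mentioned above. (Illustrative sketch only.)"""
    entries = os.listdir(cachedir)
    while len(entries) > max_entries:
        victim = random.choice(entries)
        os.remove(os.path.join(cachedir, victim))
        entries.remove(victim)
```

Counting files rather than bytes keeps the bookkeeping cheap; a byte-based bound would require stat'ing every entry or maintaining a running total.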
It would probably be enough to declare that the cache is implemented as a single directory, with one file per SI, where each file contains the base32-encoded roothash + newline + plaintext/ciphertext. The size bound is imposed by limiting the number of files in this directory, which is counted at startup.
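The proposed file layout could look like the following sketch. Function names are hypothetical, and the stdlib `base64.b32encode` (lowercased) stands in for Tahoe's own base32 codec:

```python
import base64
import os

def _si_filename(cachedir, si):
    # one file per storage index, named by its base32 form
    return os.path.join(cachedir, base64.b32encode(si).decode("ascii").lower())

def cache_put(cachedir, si, roothash, data):
    """Store one version per SI: base32 roothash + newline + data.
    Writing a new version simply replaces the old cache entry."""
    encoded = base64.b32encode(roothash).lower()
    with open(_si_filename(cachedir, si), "wb") as f:
        f.write(encoded + b"\n" + data)

def cache_get(cachedir, si, roothash):
    """Return the cached data only if the stored roothash matches."""
    try:
        with open(_si_filename(cachedir, si), "rb") as f:
            header, _, body = f.read().partition(b"\n")
    except FileNotFoundError:
        return None
    if header != base64.b32encode(roothash).lower():
        return None  # stale version; caller should replace it
    return body
```

Because the roothash header is checked on every read, a stale entry (left over from an older version of the file) can never be returned as a hit.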
Note that because the cache is indexed by (SI,roothash), it is an accurate cache: a servermap-update is always performed (incurring one round-trip), and only the retrieve phase is bypassed upon a cache hit. This only helps improve retrieves of directories that are too large to fit in the initial read that the servermap-update performs (since there is already a small share-cache that holds these reads), which probably means 6 children or more per directory. These not-so-small directories could be fetched in a single round-trip instead of two RTTs.
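The accurate-cache flow described above can be sketched as follows; `update_servermap` and `retrieve_shares` are hypothetical stand-ins for the real mutable-file machinery, passed in as callables:

```python
# maps SI -> (roothash, data); one version per SI
_cache = {}

def accurate_retrieve(si, update_servermap, retrieve_shares):
    """Sketch of the (SI,roothash)-indexed flow: the servermap update
    always runs (one round-trip); only the share-retrieve round-trip
    is skipped on a cache hit."""
    roothash = update_servermap(si)         # always costs one RTT
    hit = _cache.get(si)
    if hit is not None and hit[0] == roothash:
        return hit[1]                       # hit: skip the retrieve RTT
    data = retrieve_shares(si, roothash)    # miss: second RTT
    _cache[si] = (roothash, data)           # replaces any older version
    return data
```

Since the roothash comes from a fresh servermap update, a hit can never serve data that other writers have since replaced, which is what makes the cache "accurate".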
If we allowed the cache to be indexed by just SI (or if we were to introduce a separate cache that mapped from SI to somewhat-current-roothash), it would be an inaccurate cache, implementing a tradeoff between up-to-dateness and performance. In this mode, the node would be allowed to cache the state of a directory for some amount of time. We'd get zero-RTT retrieves for some directories, but we'd also run the risk of not noticing updates that someone else had made.
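An SI-only inaccurate cache would trade freshness for zero-RTT reads, typically via a bounded freshness window. A minimal sketch (class and parameter names are hypothetical):

```python
import time

class InaccurateCache:
    """SI-indexed cache with a freshness window: entries younger than
    max_age_s are served with zero round-trips, at the risk of not
    noticing updates that someone else has made. (Sketch only.)"""

    def __init__(self, max_age_s):
        self.max_age_s = max_age_s
        self._entries = {}  # si -> (timestamp, roothash, data)

    def get(self, si, now=None):
        now = time.time() if now is None else now
        entry = self._entries.get(si)
        if entry is None or now - entry[0] > self.max_age_s:
            return None  # absent or too old: fall back to the network
        return entry[2]

    def put(self, si, roothash, data, now=None):
        now = time.time() if now is None else now
        self._entries[si] = (now, roothash, data)
```

The `max_age_s` knob is exactly the up-to-dateness/performance tradeoff described above: a longer window means more zero-RTT retrieves and a longer window of potential staleness.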
Change History (4)
comment:1 Changed at 2008-06-14T18:22:26Z by warner
- Description modified (diff)
comment:2 Changed at 2009-09-24T05:54:51Z by zooko
comment:3 Changed at 2010-02-11T03:44:13Z by davidsarah
- Keywords performance cache mutable confidentiality added
- Milestone changed from undecided to eventually
comment:4 Changed at 2010-10-24T16:56:07Z by davidsarah
- Keywords memory added
See also #1045 (Memory leak during massive file upload (or download) through SFTP frontend).
It appears that the problem is due to the current design of the ResponseCache in source:allmydata/mutable/common.py, and might be solved by replacing that cache (which stores share responses) with a mutable-file cache as described in this ticket.
If you like this ticket, you might also like #606 (backupdb: add directory cache), #316 (add caching to tahoe proper?), and #300 (macfuse: need some sort of caching).