Opened at 2011-09-04T17:25:23Z
Last modified at 2011-10-17T18:52:19Z
#1526 closed defect
make sure the new MDMF extension field is forward-compatible and safe — at Version 5
Reported by: | zooko | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 1.9.0 |
Component: | code-mutable | Version: | 1.9.0a1 |
Keywords: | forward-compatibility mdmf design-review-needed review-needed | Cc: | kevan, warner, davidsarah |
Launchpad Bug: |
Description (last modified by davidsarah)
In #393 we've added "extension fields" to the MDMF caps. If I recall correctly, the original motivation was that future writers of MDMFs might want to include the K and segsize parameters in the cap itself (in addition to inside the share data) and future readers might want to take advantage of that information so that they don't need to do an extra round trip to learn that information by fetching some of the share data.
It is important that such future writers of MDMFs don't exclude older (v1.9) readers of MDMFs from being able to read those caps or even (gah!) cause them to get the wrong data or to incur an error when they try to read those caps. Therefore we need a way to include data in the caps which older (v1.9) readers will reliably and safely ignore (and still be able to read the file data correctly) but future readers can use if they want.
I thought we had decided to make a generic field for "extensions" in the MDMF caps, and not to make the current (1.9) reader or writer actually use this extension field yet. But the current code in trunk constrains that field instead of allowing it to be generically extensible, and it seems to try to use the numbers contained therein for its K and segsize values in some (?) cases.
The MDMF filecap format is currently defined as:
URI:MDMF:$writekey:$fingerprint[(:$extension)*]
and all existing caps are created with two extension fields: the first is K (as an integer), the second is segsize (also as an integer). The intention is to allow additional extension fields to be added in the future, perhaps hints that could speed up a download.
But the current code first constrains each extension field to contain only the characters 0-9 and :, and then it requires there to be exactly two such fields.
Future versions of MDMF writers can't use the extensions then, to communicate anything other than the currently defined K and segsize fields. They might allow some future use if each field can have a larger set of characters, and if there is space for messages that Tahoe-LAFS v1.9 readers will parse and then ignore.
I'd like to have this in 1.9 because then 1.9 will be tolerant of caps generated by future versions that have a different number of extension fields.
As for the use of that field to initialize the K and segsize values, I haven't read through the code carefully enough to see if it does that correctly and if it has good tests. This is potentially complicated.
What, for example, happens if the segsize indicated in the cap and the segsize indicated in the version info differ? Can the segsize or the K change in different versions of the same MDMF? (I'm pretty sure it can't, but if it can't then maybe the value in the cap should be the only place that K or segsize exist.) Does the current trunk MDMF reader actually really use this value? Scanning through the code, I don't think so but I'm not 100% sure yet.
There are three possibilities that I think we should consider for v1.9.0:
Proposal 1 (extension field for future use—currently unused)
- 1. a. Loosen the constraint-checking on the extension field in MDMF caps to allow a larger character class and have almost no constraints except those necessary for safe and easy parsing to find where the field begins and ends.
- 1. b. Eliminate all code which uses the contents of the extension field when reading.
- 1. c. Reconsider whether to populate the contents of that extension field with K:segsize when generating a URL. (The way K and segsize are encoded into the extension field has, of course, to fit into the constraints of the extension field. In addition to that, it should not consume the entire extension field, but should allow a safe and easy way for other fields to be added into the extension field such that they can be unambiguously parsed apart from the K and segsize fields.)
- 1. d. Think about whether this proposal will lead to unsafety/insecurity or forward-compatibility problems.
Proposal 2 (K and segsize in cap):
- 2. a. Define part of the MDMF cap to hold K and segsize. This is in fact exactly the same as the "extension field" in the current trunk, but we stop calling it the "extension field" and start calling it K and segsize.
- 2. b. Eliminate all code which uses K and segsize values from anywhere other than the cap when reading.
- 2. c. Think about whether this proposal will lead to unsafety/insecurity or forward-compatibility problems.
Proposal 3 (K and segsize in cap, plus an extension field):
- Do all of Proposal 2 to encode K and segsize into the cap, and then also do proposal 1.a. to provide an extension field for future use.
[Note: Zooko was talking below about slightly different versions of the proposals -- he wasn't sure whether the existing code included K and segsize, which in fact it does.]
I think that Proposal 1 is the least likely to delay or destabilize Tahoe-LAFS v1.9 especially if we leave out the optional 1.c. step. If v1.9 does not attempt to use the extension field in any way other than telling where it begins and ends, then future MDMF users will not have to worry that what they put in there will cause problems for old v1.9 users. By removing all the code that does anything with the extension field (aside from the regex which allows the extension field to be present in an MDMF cap), we can simplify the current 1.9 alpha code for easier review.
I'd like to hear your opinion about this! (Especially if you are Kevan, David-Sarah, or Brian.)
Change History (5)
comment:1 Changed at 2011-09-04T19:16:09Z by davidsarah
comment:2 Changed at 2011-09-04T21:19:56Z by warner
comment:3 Changed at 2011-09-04T21:37:46Z by kevan
I like the conservatism of precisely specified and restrictive extension fields; it makes me more comfortable with the idea of extension fields, since it's easier for me to reason about the form of a valid cap and feel confident that we haven't inadvertently allowed caps that we don't want to allow. It is kind of silly to add a restriction like that before we even start writing the code that might one day use the extensions, though, and my objection isn't anything more than a vague unease with the idea.
I think that the code at the moment simply populates the k and segsize values -- it doesn't try to use them. Actually, aside from the restrictive nature of the extension fields at the moment, proposal 1 is what the current code does.
As I understand it, the extension fields are meant to advisory. That is, any part of Tahoe-LAFS that uses them to work with a file should tolerate cases in which they're wrong. So we'd treat the remote share's representation of the segsize, k, or whatever else we choose to stick in the cap as canonical, and defer to it in the event of an inconsistency.
comment:4 Changed at 2011-09-04T21:48:35Z by davidsarah
I'm a bit confused as to what optimizations are available from including K and segsize in the cap. Doesn't the downloader get K and segsize on the first round-trip anyway? Why would it be helpful to know them before that?
For any future extensions, I would prefer each field to be labelled. A one-letter label as the first character of the field would probably be sufficient. The advantage of this is that if we decide that a particular field is no longer necessary, we can just omit it without making the parse ambiguous.
comment:5 Changed at 2011-09-04T22:09:23Z by davidsarah
- Description modified (diff)
Current trunk does put K and segsize into the extension field when generating caps (src/allmydata/mutable/filenode.py@5227#L702).