[tahoe-lafs-trac-stream] [tahoe-lafs] #510: use plain HTTP for storage server protocol

Sun Oct 27 01:04:26 UTC 2013

#510: use plain HTTP for storage server protocol
------------------------------+---------------------------------
     Reporter:  warner        |      Owner:  zooko
         Type:  enhancement   |     Status:  new
     Priority:  major         |  Milestone:  2.0.0
    Component:  code-storage  |    Version:  1.2.0
   Resolution:                |   Keywords:  standards gsoc http
Launchpad Bug:                |
------------------------------+---------------------------------

Comment (by simeon):

 I've typed all this, but I'm getting tired again. Hopefully I have
 transferred the important core points now, and I will attempt sometime to
 come back and clarify/summarise from this sketch. Let me know if you think
 I am beating a different path, and that summary can be put somewhere else
 where it won't clutter your system. :) For now, thanks for your thoughts
 and good luck with your project.

 Replying to [comment:32 simeon]:
 > Replying to [comment:31 simeon]:
 > > "...everything else done at a client end, apart from exchanging
 > > objects between backend nodes and deprecation."
 > >
 > > Perhaps since deprecation is handled by the server, then
 > > locking/blocking should also be
 >
 > Oh. And hashsumming of data, and making symlinks between a filename and
 > a hashsum db blob.

 > I think the URI must be capable of specifying the following (question
 > marks where I'm not sure the item is useful):

 I forgot the salt! ;-) ... and the user ID for mutable files, which should
 definitely be capable of being named the same as an immutable file in
 every other respect, so that a user can easily create a file in their
 directory with a name that they can see directly corresponds with a file
 in the cache.

 So adding salt we have

 * optional salt
 * ? file size or size-class ?
 * ? checkdigit (a one-byte checksum of the hashsum, to detect user
   typos) ?
 * hash algorithm (if omitted, default to a pre-configured value)
 * truncatable hashsum (if omitted, query based on filename, if provided,
   and if allowed by config)
 * optional sequence number in case of hash collisions (server
   maintained; not guaranteed to be consistent between nodes or across
   time; when dups exist but none is specified, return a list somehow to
   allow the client to choose)
 * optional filename
 Salt has to be specified and stored where the server can find it easily,
 (ie as part of the object label in my model) if the server checks
 hashsums. If it does not, it could be hidden in client-only metadata, but
 I think for users it's useful to keep it as an obvious part of the object
 id ... we call the URI ... which they would see as 'filename'.

 For usability I think this is too many fields; filesize and checkdigit
 probably have to be dropped. There are two ways to make the URI: either
 strict field ordering with field counting, hoping there is no need to
 extend the schema later, or explicit query labels.

 == Lots of examples to ponder ==

 So what might a worst-case bungle of a compact, usable and intuitive
 implementation look like? I'm just gonna use an example, cause I
 forget how to do RFC syntax:

 lafs:secretsalt;4k;XYZQabc1230;1;My+theory+by+Anne+Elque.html.gz

 Where secretsalt is a custom salt; 4 is a single-byte file-size class; k
 is a checkdigit for the complete hashsum; XYZQabc1230 is the first 11
 chars of the hashsum; 1 is because the current db was DoSed with a
 manufactured item that happens to hit the same hashsum as the file we
 want; and the title is "My theory by Anne Elque".

 With query labels, it goes something like

 lafs:s=secretsalt;z=4;d=k;h=XYZQabc1230;f=My+theory+by+Anne+Elque;t=htmlgz

 Or we can drop z and d, roll the sequence number into the filename, and
 accept a default salt. Then in the normal case, where the item does not
 have duplicate-hashsum siblings, field counting can infer that the salt
 is the default, and things look OK even with field identifiers:

 lafs:h=XYZQabc1230;f=My+theory+by+Anne+Elque.html.gz

 lafs:s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html.gz

 For an automated transaction, the server need not pass a filename (nor the
 salt, perhaps? That depends on how we store items with different salts,
 and how much time we want to spend doing lookups), so a request could
 just be for

 lafs:h=XYZQabc1230

 Want to save two bytes? Make the default field h=

 lafs:XYZQabc1230

 Users who are familiar might find this more convenient than a URI that
 mentions a title.
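
 To make the labelled form above concrete, here is a minimal parsing
 sketch in Python. It is only an illustration of the idea, not a proposal
 for the real syntax. The function name parse_lafs_uri and the
 KNOWN_FIELDS set are made up for this example, and it assumes two things
 that are still open questions in this comment: that a bare value means
 h=, and that f= comes last and swallows the rest of the string.

```python
from urllib.parse import unquote_plus

# Field labels taken from the examples in this comment; the set itself is
# an assumption of this sketch, not an agreed schema.
KNOWN_FIELDS = {"u", "t", "s", "z", "d", "h", "f"}


def parse_lafs_uri(uri):
    """Split a 'lafs:' URI with explicit query labels into a dict."""
    scheme, sep, rest = uri.partition(":")
    if scheme != "lafs" or not sep or not rest:
        raise ValueError("not a lafs: URI: %r" % uri)
    fields = {}
    parts = rest.split(";")
    for index, part in enumerate(parts):
        key, eq, value = part.partition("=")
        if not eq or key not in KNOWN_FIELDS:
            # No recognised label: treat the whole part as the hashsum,
            # i.e. the "default field is h=" shortcut.
            key, value = "h", part
        if key == "f":
            # Assumption: the filename always comes last, so any further
            # semicolons belong to the filename itself.
            value = ";".join([value] + parts[index + 1:])
            fields["f"] = unquote_plus(value)
            break
        fields[key] = value
    return fields


print(parse_lafs_uri("lafs:s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html.gz"))
print(parse_lafs_uri("lafs:XYZQabc1230"))
```

 With this reading, lafs:h=XYZQabc1230 and lafs:XYZQabc1230 parse to the
 same thing, which is the two-byte saving mentioned above.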

 If no h is provided, only an f, then optionally have the server do a
 lookup. This could be optimised by keeping a set of mutable, possibly
 non-public files with reverse-lookup mappings of filename to hashid.

 Maybe the gz is a separate transform rather than a file extension, removed
 when passing the object back to the user? I'm unsure which way is better.
 The implications are not that important, but they can become complex: eg
 if a file is already a gzip, would it be gzipped again when stored by the
 db? Without a way to specify so, clients might do this. I don't think the
 server backend should add gzip, but as discussed in an earlier posting,
 the client should, iff the content is compressible.

 It's easy actually to test and compress if needed, so the client probably
 should test, but not testing might make the SCHEMA simpler, since the
 client would always assume it has/hasn't been gzipped. Not having a
 separate field means counting on the file extension as being correct, and
 this leads to problems if users misuse the file extension...
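
 The "test and compress if needed" step could look roughly like this on
 the client side. This is a sketch only: the name maybe_gzip, the 10%
 saving threshold, and returning 'gz' as the value for the t= field are
 all assumptions of the example, not anything decided in this ticket.

```python
import gzip
import os


def maybe_gzip(data, min_saving=0.1):
    """Return (payload, transform), where transform is 'gz' or None.

    The data is compressed once and the result is kept only if it is at
    least min_saving (a fraction) smaller than the input.
    """
    compressed = gzip.compress(data)
    if len(compressed) <= len(data) * (1 - min_saving):
        return compressed, "gz"
    return data, None


# Highly repetitive text compresses well, random bytes do not.
text_payload, text_transform = maybe_gzip(b"My theory by Anne Elque. " * 400)
random_payload, random_transform = maybe_gzip(os.urandom(10000))
print(text_transform, random_transform)
```

 Because the outcome is recorded separately from the filename, a file
 that was already a .gz when the user stored it is not confused with one
 the client chose to compress.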

 So this is a possible solution:

 lafs:t=gz;s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html

 on a web UI, this might be presented as

 http://lafs-ui.com/htmlgz/secretsalt/XYZQabc1230;My+theory+by+Anne+Elque

 and in the user's directory

 http://lafs-ui.com/AElque@lafsmail.com/secretsalt/XYZQabc1230;My+theory+by+Anne+Elque

 or probably they either use the default salt, or have a personal default
 configured somewhere in a client, so it can be shorter, like

 http://lafs-ui.com/AElque@lafsmail.com/XYZQabc1230;My+theory+by+Anne+Elque

 Does the lafs schema need to support user ids?

 lafs:u=AElque@lafsmail.com;t=gz;s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html

 Personally I think the email is preferable to a PGP ID, easier for a user
 to understand. But if using an ID, I guess this might be

 lafs:u=0xA72B89345;t=gz;s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html

 I'm probably misusing the request here, not sure how you intend to expose
 the user directory. I'm guessing it makes sense only for mutable files, so
 the h= and s= go away; if the user wants to represent them, they are
 stored in the f= field:

 lafs:u=AElque@lafsmail.com;t=gz;f=My+theory+by+Anne+Elque.html
 lafs:u=AElque@lafsmail.com;t=gz;f=s=secretsalt;h=XYZQabc1230;f=My+theory+by+Anne+Elque.html

 Are we confused yet? ;-) I'm guessing it doesn't make sense to allow the
 schema to reference the mutable files, or else they need a specific
 schema. Certainly in any case, the user-definable field of the 'filename'
 needs to either be defined as potentially including strings that duplicate
 syntax elements, or the server does need logic to disallow or escape such
 entities. Myself, I prefer to say, filename always comes last, and can
 include any legal HTTP URI filename field characters. But a simple other
 way is to ensure that semi-colon is escaped within the field. I'm not
 gonna re-write the above URIs that way though, you have to use your
 imagination.
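
 For the "escape the semi-colon within the field" option, a rough sketch
 of what a client might do when building the URI. The function name
 build_lafs_uri is hypothetical, and using standard percent-encoding via
 Python's urllib.parse.quote_plus is my assumption; the ticket has not
 settled on an escaping rule.

```python
from urllib.parse import quote_plus, unquote_plus


def build_lafs_uri(hashsum, filename=None, salt=None):
    """Assemble a labelled lafs: URI, escaping user-supplied fields."""
    parts = []
    if salt is not None:
        parts.append("s=" + quote_plus(salt))
    parts.append("h=" + hashsum)
    if filename is not None:
        # quote_plus turns ';' into %3B and '=' into %3D, so a filename
        # that imitates the syntax cannot add or override fields.
        parts.append("f=" + quote_plus(filename))
    return "lafs:" + ";".join(parts)


tricky = "s=secretsalt;h=XYZQabc1230;f=My theory by Anne Elque.html"
uri = build_lafs_uri("XYZQabc1230", filename=tricky)
print(uri)
print(unquote_plus(uri.split(";f=", 1)[1]) == tricky)
```

 The cost is that the escaped form is uglier for a human to read, which
 is the reason I lean towards "filename always comes last" instead.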

 http://lafs-ui.com/AElque@lafsmail.com/XYZQabc1230;My+theory+by+Anne+Elque

 http://lafs-ui.com/AElque@lafsmail.com/My+theory+by+Anne+Elque

 Both of the above might be symlinks to the cache object item XYZQabc1230.
 The user might specifically include the truncated hashsum that would be
 seen on the cache item, but it's not needed because their homedir is a
 bunch of links/redirects to cache objects. Or is that too insecure? Again,
 my thoughts revolve around a public editing system, yours around a private
 data store. In this case, I would want to adopt the semantics that you
 would use.

 I fear that having differing representations in web UI is
 counterproductive, so it's best if the syntax can be as brief, compact and
 intuitive and filename-friendly as possible. :)

 Is it better then, to omit the (pretend) directory paths, and use fields
 instead? I think this is uglier, less clear, so which is actually better,
 ambiguity in the name of clarity?

 http://lafs-ui.com/u=AElque@lafsmail.com;t=gz;h=XYZQabc1230;f=My+theory+by+Anne+Elque

 I like the directory paths for some parameters, fields for others. Perhaps
 directory paths are OK where they would not be preserved in a save-file
 anyway? That means maybe more like this:

 http://lafs-ui.com/AElque@lafsmail.com/My+theory+by+Anne+Elque

 http://lafs-ui.com/gzhtml/h=XYZQabc1230;f=My+theory+by+Anne+Elque

 Two views of the same object, with the 'client' being a web UI on the site
 lafs-ui.com, and one version being returned with mime-type html using HTTP
 header "Content-Encoding: gzip", the second returned using mime-type
 octet-stream and without the compression-encoding header, so the web
 browser would provide the user the raw blob.

 http://lafs-ui.com/gzhtml/h=XYZQabc1230
 http://lafs-ui.com/cache/h=XYZQabc1230

 Do a lookup for files by name?

 http://lafs-ui.com/search/f=My+theory+by+Anne+Elque

 When the user presses 'save file', it's gonna depend on the context of
 their client how much of this gets passed through. I guess this means the
 server sets the Content-Disposition filename field to something
 comprehensive, and the client software filters it if the user prefers.
 Users who use things like lynx or telnet to grab a file have to whittle
 the fields down themselves. It's harder (unreliable) to reconstruct a
 field unless the server provides it, so I doubt that it's worth saving
 bytes by omitting or making the suggested filename minimalist. But
 perhaps it can be taken from context, returned filled with only the
 fields the client mentioned? Then an internode-style transmission by
 hashsum-only has very little overhead, since the entire field can be
 omitted.

 In a user client, it should be configurable, to anything from the full
 lafs:blah down to just the filename portion, "My theory by Anne Elque",
 but as mentioned, the client shouldn't try to reconstruct fields that the
 server has not described, since that is not going to be easy to make
 correct in every case.

 == On disk ==

 On disk on the server, the files can be stored efficiently using the
 hashsum to distribute the objects equally across a bunch of
 subdirectories. This is the real benefit of using the hash in the
 filename, as I guess you were aware at least tangentially, since I guess
 you understand the load-balancing aspect wrt distributed nodes. But it
 makes it easy on the filesystem to do look-ups as well.

 The server node makes a translation when the client requests hashsum
 XYZQblah, to X/Y/XYZQblah. Or for a larger installation, more
 subdirectories, perhaps XY/ZQ/blah, or X/Y/Z/XYZQblah. This is hidden from
 the user, and unless they are locally storing massive quantities of
 unsorted files in their homedir, they wouldn't want or need to emulate it.
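
 The translation is simple enough to sketch. The function name shard_path
 and its parameters are invented for this illustration, and the allowed
 character set is an assumption (whatever encoding the hashsum really
 uses, it must not be able to contain a path separator).

```python
import re
from pathlib import PurePosixPath


def shard_path(hashsum, levels=2, width=1, keep_prefix=True):
    """Map a hashsum to a relative path such as X/Y/XYZQblah.

    levels is the number of subdirectory levels and width the number of
    characters used per level. With keep_prefix=False the characters
    already used for the directories are dropped from the leaf name.
    """
    if not re.fullmatch(r"[A-Za-z0-9_-]+", hashsum):
        raise ValueError("unexpected characters in hashsum")
    used = levels * width
    if len(hashsum) <= used:
        raise ValueError("hashsum too short for this layout")
    dirs = [hashsum[i * width:(i + 1) * width] for i in range(levels)]
    leaf = hashsum if keep_prefix else hashsum[used:]
    return PurePosixPath(*dirs, leaf)


print(shard_path("XYZQblah"))                                # X/Y/XYZQblah
print(shard_path("XYZQblah", levels=3))                      # X/Y/Z/XYZQblah
print(shard_path("XYZQblah", width=2, keep_prefix=False))    # XY/ZQ/blah
```

 Since the hashsum characters are (ideally) uniformly distributed, each
 subdirectory ends up with roughly the same number of objects.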

 So I'm guessing if custom salts are allowed that could be implemented on
 the server path as a root directory or parent directory tree of a similar
 type ... specifics would depend on how many users, how many files?

 And on the server the full hash should be stored in the filename.
 Optionally it would be nice but not important to enable the user to
 configure to see the full hash in the 'save as' dialogue too.

 The human-readable portion of the filename, I think, is made using a link
 to that original object, resolved internally by the server to avoid
 excessive lag, with the desired/suggested filename field appended. POSIX
 hardlinks vs symlinks vs hackish text-file redirects have differing
 implications for ease of management, cross-platform compatibility, and
 consistency (can one version disappear, or be locked or blocked, while
 another remains?). Ultimately it might be nice to have the different
 mechanisms as configurable policy options; otherwise the most sensible
 approach is to take the simplest path, I guess.
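
 As an illustration of the symlink variant only (one of the three
 mechanisms mentioned, picked because it is the shortest to show), here
 is a sketch. The directory names "cache" and "names" and the function
 link_name_to_blob are made up for the example, and it assumes a platform
 where os.symlink is permitted.

```python
import os
import tempfile


def link_name_to_blob(root, hashsum, filename):
    """Create root/names/<filename> as a symlink to root/cache/<hashsum>."""
    if os.sep in filename or filename in (".", ".."):
        raise ValueError("filename must be a single path component")
    names = os.path.join(root, "names")
    os.makedirs(names, exist_ok=True)
    blob = os.path.join(root, "cache", hashsum)
    link = os.path.join(names, filename)
    # A relative target keeps the tree relocatable as a whole.
    os.symlink(os.path.relpath(blob, names), link)
    return link


def demo():
    """Store one blob in a throwaway directory and read it via its name."""
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache")
        os.makedirs(cache)
        with open(os.path.join(cache, "XYZQabc1230"), "w") as blob:
            blob.write("blob contents")
        link = link_name_to_blob(root, "XYZQabc1230",
                                 "My theory by Anne Elque.html")
        with open(link) as handle:
            return os.readlink(link), handle.read()


print(demo())
```

 Note the consistency implication: with a symlink, removing or blocking
 the cache object leaves a dangling name behind, whereas a hardlink would
 keep the content alive under the name.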

 == The other fields ==

 Below I've ranted a bit about the various fields that ultimately above I
 decided were too problematic to bother the user with. This may be
 interesting if you want to understand why, and it may be worth having the
 API support the ideas, perhaps they have uses behind the scenes or in
 special use-cases.

 Another reason might be to help keep the schema overall compatible with
 RFC 6920, even if it involves some translation that can be automated at
 least in one direction (it's obviously not trivial to reconstruct a
 hashsum in a URI that has been truncated to a length that 6920 doesn't
 support, for example). I don't want to even think about it right now.

 Here is my analysis. I also don't want to revisit this right now, so it
 may have errors and stuff.

 It's only in the case of a hash collision with differing content that the
 sequence number field would get used. Probably this option could be left
 out of an implementation, but it's good I think to plan for perhaps
 supporting it, in case an efficient mechanism is found for this DoS attack
 vector. (My gaze hovers over the bitcoin generators at this point.)

 The file-size classifier byte is basically for the same purpose. I see
 bitcoin spawning methods for quickly generating massive numbers of hashes,
 which, who knows, some people might be able to leverage as a library
 for hash collisions. (They use SHA-256, but I think that can fairly easily
 be changed.) But these items are of a small, perhaps even fixed, size I
 think. Including a filesize classification byte MAY be a way to get such
 hash collisions to not matter. On the other hand, I am not that sure it's
 worth it, and dunno if others have done a rigorous analysis of such a
 strategy.

 Checkdigit is for the human-typo factor. If the checkdigit does not match
 then the user can be alerted. However ... it's only able to be checked
 when the full hash is known, or else it would vary depending on the
 truncation length of the hashsum. And since it can't easily be verified by
 hand anyway, I'm again not all that sure if it's useful enough to be worth
 using. If the bitspace of the hashsum truncation length is sufficient,
 there should not be two items that are close enough together to be only a
 mere typo apart. So a dud request leads to a lookup in either case, and an
 error in either case.

 Only if the client end is given the full hashsum by the user typing it in,
 does the benefit arise, where eg a js could alert the user prior to the
 client making the query to the server. I think this is useless, really. :)
 On the other hand, if you did implement things that way, it might be
 considered a life-saver by the poor human!
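
 For what it's worth, the client-side check could be as small as the
 sketch below. The checksum used here (sum of the bytes modulo 256, shown
 as two hex digits) is purely my assumption for illustration; no
 algorithm has been proposed in this ticket, and this one catches a
 single mistyped character but not two swapped characters.

```python
def check_digit(full_hashsum):
    """One-byte checksum of the full hashsum, as two hex digits."""
    total = sum(full_hashsum.encode("ascii"))
    return format(total % 256, "02x")


def looks_mistyped(full_hashsum, digit):
    """True if the typed hashsum does not match the given checkdigit."""
    return check_digit(full_hashsum) != digit.lower()


# The 11-character string stands in for a full hashsum here.
digit = check_digit("XYZQabc1230")
print(looks_mistyped("XYZQabc1230", digit))   # False: matches
print(looks_mistyped("XYZQabc1231", digit))   # True: one character wrong
print(looks_mistyped("YXZQabc1230", digit))   # False: a swap goes unnoticed
# The digit depends on the whole string, so a truncated hashsum cannot be
# checked against a digit computed over the full one.
print(check_digit("XYZQabc") == digit)        # False
```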

 The usage I envisage would primarily be people copy-pasting URIs, or using
 phones to capture QRCode style links, more often than they would manually
 type one in. The main reason for truncating the hashsum is not to make it
 easier to type, but to make it easier to read! Humans stop reading after a
 lot of incomprehensible cruft. ;-) I hope this is not one of those cases.
 ;-) The URI has to fit neatly into the address bar, including enough of
 the actual human-generated filename for the reader to notice that it has
 one.

 So I think checkdigit and filesize are important options for the URI
 schema to support, but maybe not essential to a given implementation.

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/510#comment:33>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage