[tahoe-dev] [Python-Dev] PEP 383 and Tahoe [was: GUI libraries]

Wed May 6 13:10:48 PDT 2009

Stephen J. Turnbull wrote:

> what's important for this discussion is what appears at the interfaces (user-client, client-server).
...
> But that means that what I saw before with Unix ls is no longer what I see in Tahoe, *or in the destination Unix system with ls*.

I still don't understand why you say that.  Let me back up and see if
we have the same model of the components and interfaces.

There is a system, which either offers a unicode-safe interface
(Windows, Mac) or a bytes interface (Linux, Solaris).  If it offers a
bytes interface then it also offers a declared encoding.  Tahoe runs
as a user-space process and relies on Python (version 2) to tell Tahoe
what this declared encoding is with sys.getfilesystemencoding() (for
the filesystem) and sys.getdefaultencoding() (for command-line
arguments and stdin/stdout).
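
For concreteness, here is a minimal sketch of how a Python process can
ask for those two values.  Both functions live in the standard sys
module; the helper name declared_encodings is made up for this
illustration and is not Tahoe code.

```python
import sys

def declared_encodings():
    """Return the encodings Python reports for this process."""
    return {
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
    }

if __name__ == "__main__":
    for role, name in sorted(declared_encodings().items()):
        print("%s: %s" % (role, name))
```

The values depend on the platform, the locale and the Python version,
so the same script can report different encodings on different
machines.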

Then, a Tahoe client writes something down, which must include a valid
unicode filename as its primary key, and which also may have optional
metadata.

Then, another Tahoe client reads what was written and loads it into
its memory in-process.

The standard interface for that second Tahoe client to emit
information is the WUI/WAPI (Web User Interface / Web Application
Programming Interface).  See
http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq
for an example.  An HTTP client contacts the Tahoe client (which acts
as an HTTP server) and sends an HTTP request and receives an answer
which includes a view of the Tahoe filesystem such as a directory
listing.

Then, there are at least five interfaces for connecting the WAPI up to
other things:

1. The CLI (Command Line Interface) is a process that contacts the
Tahoe client over HTTP and, for its "tahoe ls" and related commands,
emits results on stdout/stderr.

2. The CLI's "tahoe cp" command reads files and directories over HTTP
and writes them directly into the local filesystem (when the target of
the cp is the local filesystem).

3. The Windows native client serves CIFS/SMB to the local Windows
operating system and presents the results returned by the Tahoe
client.

4. The iPhone client presents the results on iPhone.

5. The FUSE plugins present the results to the Linux or Mac VFS layer.

6. There may be others that I don't fully appreciate.  Probably the
two Ruby libraries are sensitive to the decisions we make in this
design, but I'm not sure.

Okay, that's the setting.  Now the five possible requirements are:

Requirement 1 "valid unicode filename": This is mandated by backwards
compatibility with the current Tahoe clients, and by the fact that the
five or more external components listed above expect valid unicode in
the "filename" slot.

Requirement 2 "faithful unicode if decodable": If a filename decodes
with the getfilesystemencoding(), then we'll use the resulting unicode
as the filename.

Requirement 3 "no file left behind": If a filename doesn't decode with
the getfilesystemencoding(), then we'll invent a unicode string with
which to refer to that file, so that the file will at least be present
even if badly named.

(Note that these first three requirements already require Tahoe to
implement some handling of collisions, when the unicode string we
invented to name a file with an undecodable name happens to be the
same as the name of another file in the same directory.)
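
To make Requirements 2 and 3 and the collision problem concrete, here
is a rough sketch.  The function names and the " (2)", " (3)" suffix
scheme are assumptions of mine for illustration, not anything Tahoe
implements, and it assumes the raw names passed in are distinct.

```python
def name_for_tahoe(raw, encoding):
    """Return (unicode_name, decoded_cleanly) for a filename bytestring."""
    try:
        # Requirement 2: faithful unicode if decodable.
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        # Requirement 3: invent a unicode name so no file is left behind.
        return raw.decode(encoding, "replace"), False

def unique_names(raw_names, encoding):
    """Map each raw name in one directory to a distinct unicode name."""
    decoded = [(raw, name_for_tahoe(raw, encoding)) for raw in raw_names]
    # Cleanly decoded names keep their names, so reserve them first.
    taken = set(name for _, (name, ok) in decoded if ok)
    result = {}
    for raw, (name, ok) in decoded:
        if not ok:
            # ASSUMPTION: resolve collisions with a numeric suffix.
            candidate, n = name, 1
            while candidate in taken:
                n += 1
                candidate = u"%s (%d)" % (name, n)
            taken.add(candidate)
            name = candidate
        result[raw] = name
    return result
```

For example, under utf-8 the undecodable names b"\xff" and b"\xfe"
both become U+FFFD after replacement, and a file whose name really is
U+FFFD would collide with both, so two of the three have to be
renamed.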

Requirement 5 "no loss of information": A future cyborg archaeologist
can dig into the Tahoe metadata and figure out what the bits were
before the filesystem was copied into Tahoe.

Possible Requirement 4 "round trip == faithful bytes": This is the
tricky one.  The motivation is that if you have a Linux or Solaris
system, and you do a backup with Tahoe, and then later do a restore
with Tahoe, you want the same bytestrings for all your filenames to be
restored, even if your locale was set such that those bytestrings were
undecodable when you did the backup, or even if your locale was set so
that the bytestrings were decodable but were mojibake.  On the other
hand, if this requirement is satisfied by default then what you see
when you view a Tahoe directory through the WUI, "tahoe ls", etc. will
be different from what you get when you restore that Tahoe directory
to your local filesystem.  Also, since everyone is moving toward
utf-8, they may consider ill-encoded filenames to be a problem that
they would like to learn about as early as possible, such as when they
are doing the original backup into Tahoe.  Also, at least one person
has told me that he would be horrified for a "tahoe cp -r tahoe:
hislocalsystem/" to insert filenames into his local system which were
*not* valid encodings in his filesystem.  He has the exact opposite
requirement of "round trip": that even if the original filenames were
ill-encoded, he doesn't want Tahoe to write ill-encoded filenames into
his system.
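
Here is a small sketch of why the round trip is tricky.  It models a
backup followed by a restore as a decode with one encoding followed by
an encode with another; the function name round_trip is hypothetical,
and real Tahoe behavior involves more than this.

```python
def round_trip(raw, backup_encoding, restore_encoding):
    """Decode at backup time, then re-encode at restore time."""
    return raw.decode(backup_encoding).encode(restore_encoding)

raw = b"caf\xc3\xa9"  # the utf-8 bytes for "cafe" with an acute accent

# Same encoding on both ends: the original bytes come back.
same = round_trip(raw, "utf-8", "utf-8")

# Decodable but mojibake: latin-1 accepts any bytes, so the backup
# succeeds, but a utf-8 restore then writes different bytes.
mangled = round_trip(raw, "latin-1", "utf-8")

print(same == raw)      # True
print(mangled == raw)   # False
```

In the mojibake case nothing ever fails to decode, so no error handler
gets a chance to notice; the bytes are silently different after the
restore.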

So I'm having a hard time making up my mind about this one, and at the
moment I'm leaning toward making it an option like
'--handle-ill-encoded-filenames' with a default value of 'mangle' and
alternative values of 'forcebytes', 'stop', or 'skipfile'.  (Which, by the way,
is rather like a suggestion Brian Warner made quite a while back.)

My current thinking is that if 'mangle' is set then we should emulate
the behavior of Nautilus and, to a lesser extent, of GNU ls, which is
to decode while replacing undecodable bytes with the U+FFFD char, and
then append " (badly encoded filename)" to the end of the filename.
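
A minimal sketch of that option, to pin down what I mean.  The
'mangle' branch follows the description above.  What 'forcebytes',
'stop' and 'skipfile' do here is only my guess from their names
(return the raw bytes, re-raise the error, return None), and the
function name handle_ill_encoded is hypothetical.

```python
SUFFIX = u" (badly encoded filename)"

def handle_ill_encoded(raw, encoding, mode="mangle"):
    """Return a name for `raw`, or None if the file should be skipped."""
    try:
        return raw.decode(encoding)  # decodable: no policy needed
    except UnicodeDecodeError:
        if mode == "mangle":
            # Replace undecodable bytes with U+FFFD and mark the name.
            return raw.decode(encoding, "replace") + SUFFIX
        if mode == "forcebytes":
            return raw   # ASSUMPTION: pass the original bytes through
        if mode == "skipfile":
            return None  # ASSUMPTION: the caller omits this file
        if mode == "stop":
            raise        # ASSUMPTION: abort the whole operation
        raise ValueError("unknown mode: %r" % (mode,))
```

With encoding "utf-8", the latin-1 bytes b"caf\xe9.txt" come out of
'mangle' as u"caf\ufffd.txt (badly encoded filename)".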

Okay, even though you've written much more which deserves a response,
I'm going to stop here and send this just to see if you and I (and
everyone else) are on the same page.

As I currently understand it, what you see on Unix (using GNU ls or
Nautilus, for example) will be what you see on Tahoe and on the target
localsystem, unless you pass
--handle-ill-encoded-filenames=forcebytes, in which case it depends on
the original and target systems' encodings matching.

Regards,

Zooko