#221 closed defect (fixed)

give proper filenames on download

Reported by: zooko Owned by: warner
Priority: major Milestone: 1.1.0
Component: code-frontend-web Version: 0.7.0
Keywords: Cc: secorp, dreid@…
Launchpad Bug:

Description (last modified by zooko)

As reported by Jonathan Tapicer:

The file download url should send a header with the correct file so the
browser shows it for saving, sending the 'filename' parameter to the Tahoe
node via GET seems to have no effect, the filename is always a long
incoherent string of characters. For example, try this link:
http://tahoebs1.allmydata.com:8011/uri/URI%3ACHK%3A7u9ffi6gzsoi7qzj55783qu9kw%3Adb9y3ep7n1s3nui3h1bk34riqcrk4xjtowjo57nikdfpzo8ojamy%3A3%3A10%3A68560?filename=foo.txt&save=true

According to Brian and/or RobK, the only way to really get the browser to give the right filename is to append "/foo.txt" to the end of the URL, and teach the tahoe backend (webish.py) to ignore that trailing filename-shaped thing.

Change History (18)

comment:1 Changed at 2007-12-04T21:40:19Z by zooko

  • Component changed from unknown to code-frontend-web
  • Description modified (diff)
  • Owner nobody deleted

comment:2 Changed at 2007-12-04T21:52:36Z by zooko

  • Owner set to warner

This feature is urgent for the project that Jonathan is working on, so I put it into the v0.7.0 Milestone. Brian: how long do you think it will take to implement this? Based on my newly earned knowledge of webish.py, I think it can be done in 1 day.

comment:3 Changed at 2007-12-04T21:53:36Z by zooko

  • Cc secorp added

Now using the cool new "e-mail from trac" feature to Cc: Peter.

comment:4 Changed at 2007-12-05T06:28:07Z by warner

The proposals we have on the table are:

  1. GET /download/$URI/$filename Which would only be for single-component pathnames, i.e. /download/abd123f/foo.txt and not /download/abd123f/subdir/foo.txt
  2. GET /download/$URI[/$SUBPATH]/$filename Which would be for multi-component pathnames. The important feature is that the last component of the URL is *always* the filename that we want the browser to see, and is *never* used by the tahoe node to find a child.
  3. GET /uri/$URI[/$SUBPATH]?filename=$filename This is what we currently have and it doesn't work. I believe the browser uses a default filename of $SUBPATH?filename=$filename, which is a mess.

The first option is only for single-component pathnames, but that's just the sort of URL we create by default for file downloads right now, so it wouldn't be that much of a limitation. The second option is less limiting but I find it surprising to have most of the URL components be for tahoe and then the last one be just for the benefit of the browser.
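As a rough illustration of the first option from the client's side, here is a minimal sketch of how a URL producer might build such a link. The helper name `build_download_url` and the base address are hypothetical, and the `/download/` prefix is only the proposal under discussion here, not an implemented endpoint.

```python
from urllib.parse import quote


def build_download_url(base: str, uri: str, filename: str) -> str:
    """Build a proposal-1 style link: /download/$URI/$filename.

    Hypothetical helper. The final path component exists only so the
    browser has a sensible name to suggest when saving.
    """
    # safe="" also percent-encodes ':' and '/', so the URI and the
    # filename each stay a single path component.
    return "%s/download/%s/%s" % (
        base.rstrip("/"),
        quote(uri, safe=""),
        quote(filename, safe=""),
    )


print(build_download_url("http://localhost:8123", "URI:CHK:abc", "foo.txt"))
```

Percent-encoding the URI this way matches the `URI%3ACHK%3A...` form used in the links quoted in this ticket.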

comment:5 Changed at 2007-12-05T20:20:58Z by zooko

Let me say up front that there does not exist a nice option, since there are two namespaces (the tahoe namespace and the namespace which is defined by the producer of the URL) that both need to use "/final_component" in some cases. So we are trying to choose the least surprising among the not-so-nice options.

I agree with your feeling about why the second option is surprising.

The surprising thing about the first option is that if you are a programmer writing HTML or javascript, or writing code in some other language that produces HTML or javascript when it is run, then you normally use GET /uri/$URI[/$SUBPATH], even when you are passing save=true, or even if the $SUBPATH is empty. However, if you pass save=true and $SUBPATH is empty at the same time, then you'll get this bad behavior. So you need to add a clause so that you do it as GET /uri/$URI[/$SUBPATH] unless you are downloading the result and the $SUBPATH is empty, in which case you do GET /download/$URI/$LOCAL_FILE_NAME instead.

Oh, and in fact, this same problem can apply to files which are viewed instead of downloaded! If you view a file (such as this one) in your browser and then try to save that file to disk, it will give you a long, ugly suggested file name. So here is a proposal which offers good save-as file names for clicking a view link followed by "File -> Save As" just as well as for clicking a download link:

1.b. There are two kinds of "GET":

  • GET /uri/$DIR_URI[/$SUBPATH]
  • GET /named/$FILE_URI/$LOCAL_FILE_NAME

This is just like option 1 except that it is not called "download" and it is orthogonal to the save=true option. For example, the HTML directories served up by webish.py could include links like these.

Hopefully a user might be able to see from the URL that the "/foo.txt" means something different in these two URLs: http://localhost:8123/uri/DR_ysr4tryfm88rhk1od1zpo53r9wd5wb8e5xizwzg6ou5ifxuc/foo.txt, http://localhost:8123/named/MR_hu6fnak1cge5zkz9eiysfy66iwuwggtcpxc3bir6cwo6o3bf/foo.txt.

1.c. You can allow the use of /named/ even when there is a subpath:

  • GET /uri/$URI[/$SUBPATH]
  • GET /named/$URI[/$SUBPATH]/$LOCAL_FILE_NAME

4. You can use // to separate the tahoe namespace from the local save-as name:

  • GET /uri/$URI[/$SUBPATH]
  • GET /uri/$URI[/$SUBPATH]//$LOCAL_FILE_NAME

(Note that there is precedent for using // to indicate a boundary between nested namespaces -- it's the separator between "scheme" and "authority" in URIs.)

4.b. You can combine /named/ and // for redundant signals:

  • GET /uri/$URI[/$SUBPATH]
  • GET /named/$URI[/$SUBPATH]//$LOCAL_FILE_NAME

Okay, at this point I vote for option 4.b., I await Brian's feedback, and I request that Jonathan tell us which of these (especially 4.b) would make sense for his use.

comment:6 Changed at 2007-12-07T12:43:41Z by zooko

  • Milestone changed from 0.7.0 to 0.7.1

Brian: what do you think of proposal 4.b? Jonathan said (in e-mail) that he liked it.

Since ticket #222 solves Jonathan's immediate problem and is easier to do than this ticket, I'm putting #222 into Milestone v0.7.0 and bumping this one out.

comment:7 Changed at 2008-01-23T02:31:07Z by zooko

  • Milestone changed from 0.7.1 to undecided

comment:8 Changed at 2008-03-08T00:42:09Z by warner

I'm slightly uncomfortable with 4/4b (using a double-slash to separate what you're accessing from what you want to call it). I like having a distinction, but a double-slash means:

  • you can no longer have empty subdirectory names. Granted, this would be confusing and dumb, but we haven't prohibited it yet.
  • the twisted.web implementation would have an unusual code path. Basically you iterate over path components.. if you get a non-empty string, you perform a child lookup. But if you get an empty string, you stop with the child that you already have (hopefully a file) and consume the rest of the path (asserting that it is of length one) for use as the filename. Hmm, maybe that isn't so tough after all.
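The code path described in the second bullet can be sketched in plain Python. This is only an illustration under stated assumptions: `split_local_name` is a hypothetical helper, it works on a path that has already been split on "/", and it is not taken from webish.py.

```python
from typing import List, Optional, Tuple


def split_local_name(segments: List[str]) -> Tuple[List[str], Optional[str]]:
    """Split path segments into (child lookups, save-as name).

    Hypothetical helper: an empty segment is what a '//' in the URL
    leaves behind once the path has been split on '/'.
    """
    lookups: List[str] = []
    for index, segment in enumerate(segments):
        if segment == "":
            # Stop doing child lookups; the rest of the path is the filename.
            rest = segments[index + 1:]
            if len(rest) != 1:
                raise ValueError("expected exactly one name after '//'")
            return lookups, rest[0]
        # Non-empty segment: this would be a child lookup in the node.
        lookups.append(segment)
    return lookups, None


print(split_local_name("subdir/foo.txt//bar.txt".split("/")))
```

Note that under this rule a path with a trailing slash also produces an empty segment and is rejected, which is one concrete form of the "no empty subdirectory names" restriction in the first bullet.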

I don't like 1c, because that would lead to something like: http://localhost:8123/named/DR_usr4tryf/foo.txt/foo.txt

I don't think I like /named for some reason (it's only used for GET, never for PUT, so something emphasizing the read- or download-ness seems better). That's not a strong feeling, though.

Hm, the '4' /uri/$URI[/$SUBPATH]//$FILENAME approach is growing on me. It seems like we might be stealing a big chunk of the namespace for a relatively trivial purpose, however.. we might want to use that same syntax later for indicating which version of a multiversioned LDMF file you want to retrieve (like the '@@' syntax that Clearcase uses for this purpose).

Would it be an unreasonable restriction to say that you can only use this local-name feature for file URIs and not for subpaths? I guess that means I'm leaning towards '1'.

Ah, so many choices..

comment:9 Changed at 2008-03-08T03:39:13Z by dreid

  • Cc dreid@… added

The Content-Disposition header should work correctly with most modern web browsers. At the very least it works on Safari 3 and Firefox 2. It is not part of the HTTP spec but it is a widely implemented way of hinting to browsers what the default filename should be. It's mentioned in RFC 2616, Section 19.5.1:

19.5.1 Content-Disposition

   The Content-Disposition response-header field has been proposed as a
   means for the origin server to suggest a default filename if the user
   requests that the content is saved to a file. This usage is derived
   from the definition of Content-Disposition in RFC 1806 [35].

        content-disposition = "Content-Disposition" ":"
                              disposition-type *( ";" disposition-parm )
        disposition-type = "attachment" | disp-extension-token
        disposition-parm = filename-parm | disp-extension-parm
        filename-parm = "filename" "=" quoted-string
        disp-extension-token = token
        disp-extension-parm = token "=" ( token | quoted-string )

   An example is

        Content-Disposition: attachment; filename="fname.ext"

   The receiving user agent SHOULD NOT respect any directory path
   information present in the filename-parm parameter, which is the only
   parameter believed to apply to HTTP implementations at this time. The
   filename SHOULD be treated as a terminal component only.

   If this header is used in a response with the application/octet-
   stream content-type, the implied suggestion is that the user agent
   should not display the response, but directly enter a `save response
   as...' dialog.

   See section 15.5 for Content-Disposition security issues.

The filename at the end of the URL will ensure the most wide-ranging support and also provide a hint to humans as to the contents of the URL. But you might find the Content-Disposition header a less intrusive change in the meantime.
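For illustration, here is a minimal sketch of building the header value in the form shown in the RFC example above. The helper name `content_disposition` is hypothetical, and the sketch assumes an ASCII filename; it does not attempt any encoding of non-ASCII names.

```python
import posixpath


def content_disposition(filename: str) -> str:
    """Return a Content-Disposition value suggesting `filename`.

    Hypothetical helper, assuming an ASCII filename.
    """
    # The RFC asks user agents to treat the name as a terminal component
    # only; stripping directory parts on the server side as well is a
    # defensive choice made for this sketch.
    name = posixpath.basename(filename.replace("\\", "/"))
    # Escape double quotes so the value remains a valid quoted-string.
    escaped = name.replace('"', '\\"')
    return 'attachment; filename="%s"' % escaped


print("Content-Disposition: " + content_disposition("fname.ext"))
```

A server would send the returned string as the value of the Content-Disposition response header.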

comment:10 Changed at 2008-03-08T15:45:49Z by zooko

Hm, I thought that we already set the Content-Disposition header, but docs/webapi.txt@2219 says that we do so only if ?save=on.

comment:11 Changed at 2008-03-08T15:56:24Z by zooko

Hm, I thought that we already set the Content-Disposition header, but docs/webapi.txt@2219#L168 says that we do so only if ?save=on. Trying it with wget --save-headers http://tahoebs1.allmydata.com:8123/uri/URI%3ACHK%3Az76adrcsxcud72oqjw62dzbuyy%3A57plpxz3skec4qnbhe43pzdfe2hjg5lh44wsfqxzg7y7klub2syq%3A3%3A10%3A319262?filename=IraqMedia_Oct03_rpt.pdf shows that the Content-Disposition header is not set, and trying it with wget --save-headers http://tahoebs1.allmydata.com:8123/uri/URI%3ACHK%3Az76adrcsxcud72oqjw62dzbuyy%3A57plpxz3skec4qnbhe43pzdfe2hjg5lh44wsfqxzg7y7klub2syq%3A3%3A10%3A319262?filename=IraqMedia_Oct03_rpt.pdf&save=on doesn't show it either. Oh wait, I obviously don't understand what wget's --save-headers option is supposed to do -- it never shows me any headers. On the other hand, curl's --dump-header filename.txt does what I expect, and shows that we do indeed set the Content-Disposition header if ?save=on.

I also tested it with Firefox 2 on Mac OS X and it worked. I guess the limitation of this approach, though, is that you can't give someone a URL which they can either view or save and have them get a reasonable filename when saving.

comment:12 Changed at 2008-03-08T22:18:04Z by zooko

SamB had a couple of suggestions for how to set the default save filename without also triggering the browser to save the file immediately:

<SamB> well... you could TRY using an "inline" content-disposition...   [14:23] 
<SamB> but I don't think it's likely to help... 
<SamB> the other thing is that it might help not to have a content-type of 
       application/octet-stream                                         [14:24] 

comment:13 Changed at 2008-05-09T01:15:15Z by warner

I'd like to make some progress on this one.

When we last left our intrepid heroes, they were pondering the following alternatives:

  • 1b : GET /named/$FILE_URI/$LOCAL_FILE_NAME
  • 1c : GET /named/$URI[/$SUBPATH]/$LOCAL_FILE_NAME
  • 4 : GET /uri/$URI[/$SUBPATH]//$LOCAL_FILE_NAME
  • 4b : GET /named/$URI[/$SUBPATH]//$LOCAL_FILE_NAME

I rejected 1c because of the broken-looking duplication in e.g. /named/dir/foo.txt/foo.txt, and I'm not favorable towards 4 or 4b for similar reasons: /uri/dir/foo.txt//foo.txt or /named/dir/foo.txt//foo.txt.

I'm ok with 1b, but I'd suggest "file" instead of "named", to emphasize the single-file-ness, since it would only be used for individual files. An example of this would look like: http://localhost:8123/file/IR_hu6fnak1cge5zkz9eiysfy66iwu/foo.txt

Zooko, where are your thoughts these days?

If we can get sufficient consensus on this, I'll implement it tomorrow.

comment:14 Changed at 2008-05-09T01:16:59Z by warner

  • Status changed from new to assigned

comment:15 Changed at 2008-05-09T01:29:37Z by warner

  • Milestone changed from undecided to 1.1.0

Fixing this will probably close #385 too, as long as the log-sanitizer recognizes /named or /file as it does /uri.

comment:16 Changed at 2008-05-09T18:29:55Z by zooko

I'm in favor of 4. Your objections to 4 seem to be:

  • "you can no longer have empty subdirectory names. Granted, this would be confusing and dumb, but we haven't prohibited it yet."

So since you wrote that, we have prohibited empty subdirectory names, haven't we? In any case I don't mind doing so, and in fact I think I prefer to do so.

  • "the twisted.web implementation would have an unusual code path. Basically you iterate over path components.. if you get a non-empty string, you perform a child lookup. But if you get an empty string, you stop with the child that you already have (hopefully a file) and consume the rest of the path (asserting that it is of length one) for use as the filename. Hmm, maybe that isn't so tough after all."

What do you think of this, now?

  • "It seems like we might be stealing a big chunk of the namespace for a relatively trivial purpose, however.. we might want to use that same syntax later for indicating which version of a multiversioned LDMF file you want to retrieve (like the '@@' syntax that Clearcase uses for this purpose)."

Well, maybe we can use '@@' for that if we later want to? :-)

  • "the broken-looking duplication" ... "e.g. /uri/dir/foo.txt//foo.txt"

I don't mind this. Tahoe programmers need to learn the difference between the first foo.txt, which specifies which child of dir, and the second, which specifies what name your web browser uses for that file. It is convenient that you can make the last // optional and then the web browser will use the former for the latter by default. In particular, this allows you to just append //webbrowsername to any Tahoe URL, e.g.:

uri/URI%3ACHK%3Awaqatup4yk7dyoosuyaux6vwzu%3Alzoa4c2phsp3x7ws47bofsbihjm5avxqo35qrqscgqvtwrjotyra%3A3%3A10%3A313227//wiki.html

means that the capability in question actually is the wiki page, and the name wiki.html is just what you want your web browser to call it.

Okay, overall I'm pretty happy with 4.b, and I hope that your objections, above, are not too strong.

The alternative, if I understand correctly, is 1.b, where the Tahoe filesystem's namespace is denoted by a top-level /uri/ (soon to be renamed /cap/), and the web browser's name for it is denoted by a top-level /named/ (or some such) and it is allowed only when the cap is pointing directly at the file. I like the way that this unifies the two namespaces in the case that the URL includes only a capability -- in that case there is no way to write that you want the web browser to use a different filename from the one that the Tahoe directory uses. But I don't like the way that it "special cases" the case that the URL includes only a capability -- in that case instead of doing something that seems "natural" like appending a name to the URL, you have to change the top-level name from /cap/ to /named/. Argh.

I would like to agonize over this for a little while longer, please. :-)

Also I would like to invite tahoe-dev to notice this ticket, as some of them may have a useful insight to cut the Gordian knot of my ambivalence.

comment:17 Changed at 2008-05-14T21:05:56Z by warner

Hurray! Hurrah! We have consensus!

/file/FILECAP/@@named=/FILENAME

The actual implementation will simply ignore anything after FILECAP, so the exact syntax is:

/file/FILECAP[/IGNORED..]

I'll implement this now. The code will simply have a locateChild method that ignores the rest of the path segments.
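A minimal sketch of that idea, modelled on the convention that locateChild(ctx, segments) returns a (resource, remaining segments) pair. No web framework is imported, and the class names `FileRoot` and `FileResource` are hypothetical stand-ins, not the code from the changeset that closed this ticket.

```python
from typing import Sequence, Tuple


class FileResource:
    """Stand-in for whatever would serve a single file's contents."""

    def __init__(self, filecap: str) -> None:
        self.filecap = filecap


class FileRoot:
    """Hypothetical handler for /file/FILECAP[/IGNORED..]."""

    def locateChild(
        self, ctx: object, segments: Sequence[str]
    ) -> Tuple[FileResource, Tuple[str, ...]]:
        if not segments or not segments[0]:
            raise ValueError("missing FILECAP")
        # Only the first segment selects the file. Everything after it is
        # dropped, so the browser can take its save-as name from the last
        # URL component without the node ever looking it up.
        return FileResource(segments[0]), ()


resource, remaining = FileRoot().locateChild(None, ("FILECAP", "@@named=", "foo.txt"))
print(resource.filecap, remaining)
```

Returning an empty tuple of remaining segments signals that the whole path has been consumed, which is what lets any trailing components be ignored.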

comment:18 Changed at 2008-05-14T23:19:54Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

Closed, by 304abfee32a06e05.
