#1373 new defect

'tahoe cp' should not make links to existing immutable files when the encoding parameters have changed

Reported by: davidsarah
Owned by:
Priority: major
Milestone: undecided
Component: code-frontend-cli
Version: 1.8.2
Keywords: tahoe-cp preservation availability rebalancing usability
Cc:
Launchpad Bug:

Description

Whenever tahoe cp copies an existing Tahoe immutable file, it will link to the file rather than re-uploading it (see the need_to_copy_bytes method of TahoeFileSource). This is arguably the wrong behaviour when the current encoding parameters are not the same as those of the existing file.

This makes it more difficult to rebalance a file by copying it (see this tahoe-dev thread).

Change History (4)

comment:1 Changed at 2011-02-28T20:02:38Z by davidsarah

Hmm, the encoding parameters of the existing file can be obtained from its URI, but how does tahoe cp know the current encoding parameters? It can't get them from tahoe.cfg, because its node directory may just have a node.url and no tahoe.cfg. Uploading a small file (just larger than the LIT threshold) and looking at the resulting URI would do it, but that's inefficient.
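As a rough illustration of reading the parameters out of an existing file's URI, here is a minimal sketch. It assumes the immutable CHK cap layout `URI:CHK:<key>:<UEB hash>:<k>:<N>:<size>`; the names `EncodingParams` and `chk_encoding_params` are hypothetical and are not part of the Tahoe-LAFS codebase.

```python
from typing import NamedTuple, Optional


class EncodingParams(NamedTuple):
    k: int  # shares needed to reconstruct the file
    n: int  # total shares produced


def chk_encoding_params(cap: str) -> Optional[EncodingParams]:
    """Return (k, N) for an immutable CHK cap, or None for anything else.

    Assumed layout: URI:CHK:<key>:<UEB hash>:<k>:<N>:<size>
    LIT caps and mutable caps carry no encoding parameters, so they
    yield None here.
    """
    fields = cap.strip().split(":")
    if len(fields) != 7 or fields[:2] != ["URI", "CHK"]:
        return None
    try:
        return EncodingParams(k=int(fields[4]), n=int(fields[5]))
    except ValueError:
        # Non-numeric k or N: treat the cap as unparseable.
        return None
```

This only covers the "existing file" half of the comparison; where the current parameters come from is the open question raised in this comment.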

We may need a new web-API operation to get the encoding parameters.

Last edited at 2011-02-28T20:02:59Z by davidsarah

comment:2 follow-up: Changed at 2011-03-01T18:09:06Z by warner

I think 'tahoe cp' should not re-encode by default. Instead, I'd like to see an option like --recode, or maybe something that means "I really care about the current gateway's current encoding parameters, and I'm willing to lose convergence and spend bandwidth to make sure new files match those parameters", because I think most of the time people won't care enough to spend those things, and would prefer to have 'cp' continue to share existing immutable files.

I agree that "tahoe cp -r" is a reasonable tool to do deep-modifications of files, things that will result in new filecaps.

Also, if we're going to use a CLI command to do this, the brains of the operation should probably be in the gateway anyways: you wouldn't want the CLI tool to download the full contents of the file and then send them right back to the same gateway. Instead, you'd want a webapi command that takes an existing filecap and some new encoding parameters, and does the download/reencode/upload locally, then gives you back a new filecap. So it's not so much that the webapi needs to expose the encoding parameters; rather, it needs to expose a "copy iff encoding-parameters don't match my current defaults" operation.

That said, a good first step would probably be to make it possible for webapi clients to control the encoding parameters on a per-upload basis, probably by adding ?encoding=k,N arguments to the PUT command. Then we can get some experience with how people want to use this sort of control, which I think should inform the creation of tools like deep-reencode.
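To make the proposal concrete, a client might build the request URL roughly as follows. This is only a sketch of the idea in this comment: the `encoding=k,N` query argument is the suggestion being discussed here, not an existing webapi feature, and `put_url_with_encoding` is a hypothetical helper. It targets the unlinked-upload endpoint `PUT /uri`.

```python
def put_url_with_encoding(node_url: str, k: int, n: int) -> str:
    """Build a URL for an unlinked upload (PUT /uri) that carries the
    PROPOSED ?encoding=k,N argument. The gateway would have to be
    extended to understand it; this only formats the request URL.
    """
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= N, got k=%d, N=%d" % (k, n))
    # node_url is what a CLI would read from node.url, e.g.
    # "http://127.0.0.1:3456/".
    return "%s/uri?encoding=%d,%d" % (node_url.rstrip("/"), k, n)
```

A caller would then issue an HTTP PUT of the file body to the returned URL.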

comment:3 in reply to: ↑ 2 ; follow-up: Changed at 2011-03-01T23:42:48Z by davidsarah

Replying to warner:

> I think 'tahoe cp' should not re-encode by default. Instead, I'd like to see an option like --recode, or maybe something that means "I really care about the current gateway's current encoding parameters, and I'm willing to lose convergence and spend bandwidth to make sure new files match those parameters", because I think most of the time people won't care enough to spend those things, and would prefer to have 'cp' continue to share existing immutable files.

I disagree, because I don't think that changes in the encoding parameters will be common enough for the bandwidth and space usage to be a significant issue. The user wouldn't have changed the parameters if they didn't want new files to have the new parameters. A copy of a file made by tahoe cp is logically a new file.

> I agree that "tahoe cp -r" is a reasonable tool to do deep-modifications of files, things that will result in new filecaps.
>
> Also, if we're going to use a CLI command to do this, the brains of the operation should probably be in the gateway anyways: you wouldn't want the CLI tool to download the full contents of the file and then send them right back to the same gateway.

It wouldn't do that. It would compare the parameters in the URI that it is copying with the current parameters. There's no need for any web-API request to compare parameters per-file, because the URI has already been obtained from a download of the parent directory (or from the command-line).
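The per-file check described here could look something like the sketch below. It is not the existing need_to_copy_bytes method; `should_reupload` is a hypothetical name, the CHK cap layout `URI:CHK:<key>:<UEB hash>:<k>:<N>:<size>` is assumed, and the current parameters are simply passed in, since how the CLI would learn them is still undecided in this ticket.

```python
def should_reupload(source_cap: str, current_k: int, current_n: int) -> bool:
    """Decide whether a copy should re-upload the bytes instead of
    linking to the existing immutable file.

    Returns True only when the source is a CHK cap whose (k, N) differ
    from the current parameters. Any other cap type returns False,
    i.e. this sketch leaves the existing linking behaviour alone.
    """
    fields = source_cap.strip().split(":")
    if len(fields) != 7 or fields[:2] != ["URI", "CHK"]:
        return False
    try:
        file_k, file_n = int(fields[4]), int(fields[5])
    except ValueError:
        # Unparseable parameters: do not guess, keep linking.
        return False
    return (file_k, file_n) != (current_k, current_n)
```

No network request is involved: the decision uses only the cap string, which the CLI already holds from the parent directory listing or the command line.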

comment:4 in reply to: ↑ 3 Changed at 2011-03-01T23:49:58Z by davidsarah

Replying to davidsarah:

> Replying to warner:
>
> > Also, if we're going to use a CLI command to do this, the brains of the operation should probably be in the gateway anyways: you wouldn't want the CLI tool to download the full contents of the file and then send them right back to the same gateway.
>
> It wouldn't do that. It would compare the parameters in the URI that it is copying with the current parameters.

I misinterpreted what you said here; I thought you were referring to the operation to check whether the parameters have changed.

I don't see why the reencoding operation can't be optimized later. Creating files with the requested parameters is a correctness issue, in my view, so it takes precedence over the bandwidth usage between the gateway and the CLI process (which is normally on a local connection and so not a big problem).
