IRC discussion (illustrative sketches of the ideas raised here follow the log):

> [20:43] <warner> as far as overall design goes, when we get around to rewriting the uploader, I think it should have a separate code path that it follows as soon as it sees any evidence of shares already being present
> [20:44] <warner> we want the new-upload case to work quickly, but the moment we see an alreadygot= share, we should switch into a mode where we search fairly thoroughly for all the existing shares (without allocating anything), come up with a sharemap, then replace the earlier (speculative) allocations with the ideal ones
> [20:44] <warner> (dealing with allocation failures along the way)
> [20:45] <warner> it's probably worth losing some of the pipelining along this path to simplify the code
> [20:45] <zooko> neat idea
> [20:45] <zooko> I'm not sure I agree.
> [20:45] <zooko> Okay, afk...
> [20:46] <warner> of course, second-upload (when we detect existing shares) is really a form of repair, better than normal repair because we have the full plaintext available locally
> [20:48] <warner> so maybe what ought to happen is that we rewrite the repairer, and make the uploader's "hey there are already shares there" path do: abandon the current allocations, hand the alreadygot info and the plaintext/ciphertext filehandle to the repairer, start the repairer, wait for its results
> [20:50] <kpreid> second-upload is a common case if you're storing a lot of files in tahoe and locally and 'moving'/copying back and forth
> [20:50] <warner> hm, true
> [20:51] <kpreid> and given immutable directories...
> [20:51] <warner> we've got two sorts of frontend-duplicate detectors that might move some pressure off the uploader: the helper, and the backupdb
> [20:51] <kpreid> some might want to have efficient upload-the-same-thing-again, as for backups
> [20:51] <warner> it's kind of an open question as to where the responsibilities ought to lie
> [20:51] <kpreid> i.e. tahoe backup without the builtin cleverness
> [20:52] <warner> I think it's reasonable to add backupdb support to 'tahoe cp', and to add a contents-to-filecap table to backupdb, which would avoid network IO in the case of moving files around
> [20:53] <warner> (currently if you move a file, the backupdb declares a miss, and we proceed to the uploader, which will either get a Helper hit or exercise the second-upload code. In my experience, second-upload results in duplicate shares, which is a drag, so I'd prefer to avoid it)
> [20:55] <warner> (OTOH, having the backupdb track file-contents means that we'd do an extra hash of the file for each new upload, in addition to the subsequent CHK-computation hash. OT3H, unless the files are really large, the filesystem cache should save us from actually doing the disk IO an extra time)
> [20:55] <zooko> warner: 1.6.0 was changed to search for all extant shares before immutable upload, IIRC
> [20:56] <zooko> http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/NEWS?rev=d329759bb83ad6a0#L66
> [20:56] <warner> I think it sends requests to everyone, yeah (which I still consider to be a scaling problem), but I don't believe that it waits to hear from everyone before starting the upload
> [20:56] <warner> there's always a tradeoff between stallability and thoroughness there
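
A rough sketch of the two-mode uploader warner describes at 20:43-20:45, assuming hypothetical server objects: allocate speculatively for speed, but the moment any server answers with an alreadygot share, cancel the speculative allocations, survey everyone without allocating, build a sharemap, and place only what is still missing. `allocate_buckets` echoes the storage protocol's vocabulary; `cancel_allocations`, `survey_all_shares`, and `place_missing_shares` are invented for illustration.

{{{#!python
# Hypothetical two-mode uploader: fast path for fresh uploads, thorough
# non-allocating search as soon as an existing share turns up.

def upload_shares(storage_index, servers, total_shares):
    wanted = set(range(total_shares))
    allocations = {}                  # server -> shnums we allocated

    for server in servers:
        # alreadygot: shares this server already holds for this SI
        alreadygot, allocated = server.allocate_buckets(storage_index, wanted)
        if alreadygot:
            # Evidence of a previous upload: abandon speculation.
            for s, shnums in allocations.items():
                s.cancel_allocations(storage_index, shnums)  # hypothetical
            # Search everyone thoroughly, without allocating anything,
            # then fill in only the holes (handling allocation failures
            # belongs inside place_missing_shares).
            sharemap = survey_all_shares(storage_index, servers)
            missing = wanted - set(sharemap)
            return place_missing_shares(storage_index, servers,
                                        missing, sharemap)
        allocations[server] = set(allocated)

    return allocations                # plain new-upload case
}}}

Dropping the pipelining, as warner suggests at 20:45, is what lets this be a simple sequential loop rather than a batch of overlapping requests.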
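The 20:48 handoff, where the uploader stops uploading and drives a repair instead, might look roughly like this; the `make_repairer` signature and every attribute name here are assumptions for the sketch, not the actual repairer API:

{{{#!python
# Illustrative handoff from uploader to repairer on discovering
# existing shares (all names are invented, not tahoe-lafs APIs).

def upload_or_repair(uploader, make_repairer, filehandle):
    result = uploader.start()
    if not result.found_existing_shares:
        return result
    # Shares already exist, so this is really a repair, and a cheap one,
    # since the full plaintext is available locally in filehandle.
    uploader.abandon_allocations()
    repairer = make_repairer(alreadygot=result.alreadygot,  # shnum -> servers
                             source=filehandle)
    return repairer.repair()          # wait for its results
}}}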
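The contents-to-filecap table warner proposes at 20:52 could be a small addition to a SQLite backupdb; the extra per-upload hash he mentions at 20:55 is the content digest computed below. The schema, the choice of SHA-256, and the function names are all illustrative assumptions, not the real backupdb:

{{{#!python
import hashlib
import sqlite3

# Hypothetical table mapping file contents to the filecap of a previous
# upload, so a merely-moved file never reaches the uploader at all.
SCHEMA = """
CREATE TABLE IF NOT EXISTS contents_to_filecap (
    contenthash TEXT PRIMARY KEY,   -- hex SHA-256 of the file's bytes
    filecap     TEXT NOT NULL       -- readcap returned by the upload
)
"""

def content_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(2 ** 16), b""):
            h.update(block)         # filesystem cache usually makes this cheap
    return h.hexdigest()

def lookup_or_upload(db, path, do_upload):
    """Return a filecap for path, skipping the network entirely when
    the same bytes were uploaded before (e.g. the file was moved)."""
    db.execute(SCHEMA)
    digest = content_hash(path)
    row = db.execute("SELECT filecap FROM contents_to_filecap"
                     " WHERE contenthash=?", (digest,)).fetchone()
    if row:
        return row[0]               # hit: no network IO at all
    filecap = do_upload(path)       # miss: do the real upload
    db.execute("INSERT INTO contents_to_filecap VALUES (?, ?)",
               (digest, filecap))
    db.commit()
    return filecap
}}}

With something like `lookup_or_upload(sqlite3.connect("backupdb.sqlite"), "photo.jpg", my_uploader)`, the second call is a local hit even if the file was renamed or moved in between.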
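Finally, the stallability-versus-thoroughness tradeoff at 20:56 can be made concrete: query every server about existing shares, but bound how long you wait, so one silent server cannot stall the upload. This is a self-contained sketch using `concurrent.futures` (the real code is twisted-based), and `get_buckets` here is a stand-in for whatever per-server share query is used:

{{{#!python
from concurrent.futures import ThreadPoolExecutor, wait

def survey_shares(servers, storage_index, timeout=10.0):
    """Thorough but non-stalling: ask everyone, wait a bounded time."""
    sharemap = {}                     # shnum -> [servers holding it]
    pool = ThreadPoolExecutor(max_workers=max(len(servers), 1))
    futures = {pool.submit(s.get_buckets, storage_index): s
               for s in servers}
    done, not_done = wait(futures, timeout=timeout)
    for fut in done:
        try:
            shnums = fut.result()
        except Exception:
            continue                  # a broken server counts as silent
        for shnum in shnums:
            sharemap.setdefault(shnum, []).append(futures[fut])
    pool.shutdown(wait=False)         # return without the stragglers
    return sharemap
}}}

Whatever answers within the timeout improves thoroughness; whatever is slower is deliberately sacrificed to keep the upload from stalling.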