Opened at 2013-07-23T16:46:43Z
Last modified at 2018-08-21T21:56:16Z
#2035 new defect
"tahoe backup" on the same immutable content when some shares are missing does not repair that content.
Reported by: | nejucomo | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | soon |
Component: | code-encoding | Version: | 1.10.0 |
Keywords: | usability preservation reliability servers-of-happiness repair tahoe-backup performance | Cc: | srl |
Launchpad Bug: |
Description (last modified by zooko)
From srl295 on IRC:
Interesting behavior- Had some stuff on our grid backed up with 'tahoe backup' which creates Latest, Archive etc- some of them are immutable dirs. Then, I lost a drive. So that particular storage node went down and never came back. Some directories were irrecoverable, OK fine. I create new directories and run tahoe backup again on the new directory URI. However, I'm still getting errors on deep check. - ERROR: NotEnoughSharesError(ran out of shares: complete= pending=Share(sh1-on-sslg5v) overdue= unused= need 2. Last failure: None) -- I wonder, is tahoe reusing some unrecoverable URIs since the same immutable directory was created?
Change History (25)
comment:1 Changed at 2013-07-23T16:48:29Z by srl
comment:2 Changed at 2013-07-23T16:49:05Z by nejucomo
Also from IRC, a recommended reproduction:
a good test case would probably be- upload an immutable file, make it unhealthy or unrecoverable, then later try to upload it again
comment:3 Changed at 2013-07-23T16:51:08Z by nejucomo
- Keywords preservation reliability servers-of-happiness upload repair tahoe-backup added
comment:4 follow-up: ↓ 5 Changed at 2013-07-23T17:05:34Z by nejucomo
More IRC from nejucomo (me):
Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.
comment:5 in reply to: ↑ 4 ; follow-ups: ↓ 9 ↓ 22 Changed at 2013-07-23T17:22:39Z by nejucomo
Replying to nejucomo:
More IRC from nejucomo (me):
Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.
srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.
Therefore I propose this is a bug in backupdb caching logic. If possible it should verify the health of items in the cache. If this is expensive, maybe it could be an opt-in behavior with a commandline option.
I'm going to update the keywords to reflect this new information.
comment:6 Changed at 2013-07-23T17:23:57Z by nejucomo
- Keywords usability added; upload removed
comment:7 Changed at 2013-07-23T17:26:30Z by nejucomo
I'm not certain the current keywords are accurate. I attempted to err on the side of caution and apply them liberally.
- I removed upload because I believe upload does the right thing.
- repair may not be relevant because although this is about repairing backups, it's not using any specialized repair mechanism outside of immutable-dedup upload.
- I added usability because without knowing the trick of nuking backupdb.sqlite users may believe they've successfully made a backup where some files remain unrecoverable due to the cache.
comment:8 Changed at 2013-07-23T17:55:28Z by daira
- Keywords changed from usability, preservation, reliability, servers-of-happiness, repair, tahoe-backup to usability preservation reliability servers-of-happiness repair tahoe-backup
comment:9 in reply to: ↑ 5 Changed at 2013-07-23T18:25:30Z by srl
Replying to nejucomo:
Therefore I propose this is a bug in backupdb caching logic. If possible it should verify the health of items in the cache. If this is expensive, maybe it could be an opt-in behavior with a commandline option.
backup-and-verify would be nice. I would think that backup could be efficient here: simply check whether the shares are there before re-using its cache.
Also note the trick is to nuke the database AND unlink. So that trick probably can't work with preserving the Archived items.
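A minimal sketch of the "check the shares before re-using the cache" idea follows. It is an illustration only, not Tahoe-LAFS code: `cache`, `is_healthy`, and `upload` are hypothetical stand-ins for the backupdb lookup, a grid health check, and the uploader.

```python
from typing import Callable, Dict


def backup_file(path: str,
                cache: Dict[str, str],
                is_healthy: Callable[[str], bool],
                upload: Callable[[str], str]) -> str:
    """Return a cap for `path`, reusing the cached cap only if it is still healthy."""
    cap = cache.get(path)
    if cap is not None and is_healthy(cap):
        # Cache hit, and the health check says the content is still retrievable.
        return cap
    # Cache miss, or the cached cap is unhealthy: upload again and refresh the cache.
    cap = upload(path)
    cache[path] = cap
    return cap
```

The cost is one health check per cached file, which is the latency the backupdb was introduced to avoid, so a real implementation would probably need to batch those checks or make them opt-in.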
comment:10 Changed at 2013-07-23T19:03:02Z by daira
- Keywords performance added
The backupdb is a performance hack to avoid the latency cost of asking servers whether each file exists on the grid. If the latter were fast enough (which would probably require batching requests for multiple files), then it wouldn't be needed. (tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.)
In the meantime, how about adding a --repair option to tahoe backup, which would bypass the backupdb-based conditional upload and upload/repair every file?
comment:11 follow-up: ↓ 13 Changed at 2013-07-23T19:10:29Z by daira
Hmm, it looks from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.
Perhaps we just need to rename --ignore-timestamps or document it better?
comment:12 Changed at 2013-07-23T19:13:05Z by daira
Sorry, intended to paste the code:
    if ((last_size != size
         or not use_timestamps
         or last_mtime != mtime
         or last_ctime != ctime) # the file has been changed
        or (not row2) # we somehow forgot where we put the file last time
        ):
        c.execute("DELETE FROM local_files WHERE path=?", (path,))
        self.connection.commit()
        return FileResult(self, None, False, path, mtime, ctime, size)
So when not use_timestamps, the existing db entry is deleted and the FileResult has None for the existing file URI. (Note that we still might not repair the file very well; see #1382.)
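To make the effect of that condition concrete, here is a standalone restatement of it as a pure function. The function name `entry_is_stale` and the `have_cap_row` parameter (standing in for `row2`) are made up for illustration; only the boolean logic is taken from the quoted code.

```python
def entry_is_stale(last_size, last_mtime, last_ctime,
                   size, mtime, ctime,
                   use_timestamps, have_cap_row):
    """Mirror of the quoted condition: True means the db entry is discarded."""
    changed = (last_size != size
               or not use_timestamps
               or last_mtime != mtime
               or last_ctime != ctime)
    return changed or not have_cap_row


# Identical metadata, timestamps honoured: the cached entry is kept.
print(entry_is_stale(10, 1.0, 1.0, 10, 1.0, 1.0, True, True))   # False
# Identical metadata, but use_timestamps is False: the entry is discarded anyway.
print(entry_is_stale(10, 1.0, 1.0, 10, 1.0, 1.0, False, True))  # True
```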
comment:13 in reply to: ↑ 11 ; follow-up: ↓ 15 Changed at 2013-07-23T19:36:53Z by srl
Replying to daira:
Hmm, it looks from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.
Perhaps we just need to rename --ignore-timestamps or document it better?
Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.
comment:14 Changed at 2013-07-24T14:58:59Z by zooko
- Description modified (diff)
comment:15 in reply to: ↑ 13 ; follow-up: ↓ 16 Changed at 2013-07-24T15:01:17Z by zooko
Replying to srl:
Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.
Unlink them from where?
comment:16 in reply to: ↑ 15 Changed at 2013-07-24T18:02:03Z by srl
comment:17 Changed at 2013-09-01T16:42:01Z by daira
- Milestone changed from undecided to 1.12.0
- Owner changed from daira to markberger
- Summary changed from Publishing the same immutable content when some shares are unrecoverable does not repair that content. to Uploading the same immutable content when some shares are unrecoverable does not repair that content.
Nitpick: use "uploading" for immutable files or shares, and "publishing" for (versions of) mutable files or shares.
markberger: do any of your improvements address this?
comment:18 Changed at 2013-09-01T16:42:14Z by daira
- Component changed from unknown to code-encoding
comment:19 follow-up: ↓ 20 Changed at 2013-09-01T18:23:15Z by daira
comment:20 in reply to: ↑ 19 Changed at 2013-09-01T21:55:27Z by zooko
- Summary changed from Uploading the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content.
Replying to daira:
It's unclear to me whether this is just a duplicate of other bugs (e.g. #1130 and #1124) that are being fixed in #1382, or whether it is a separate problem in tahoe backup.
I think this is a different problem from #1382. I think this problem has to do with the fact that "tahoe backup" inspects its local cache "backupdb" and decides that the file is already backed up, and then does not issue any network requests, which would allow it to find out that the file is damaged or even broken.
If that's the issue, possible solutions include:
- a "backup-and-check" or "backup-and-verify" feature, as mentioned in comment:10 and other comments,
- causing all checks and verifies to update the backupdb and record the fact that the file was detected as damaged or broken, and then the next time you run "tahoe backup" tahoe could notice this recorded fact and automatically trigger the "-and-check" or "-and-verify" behavior
Changing the Summary of this ticket to reflect what I think the issue is.
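The second option above (having checks and verifies record damage so that the next backup notices it) could look roughly like the sketch below. The schema is hypothetical and deliberately tiny; it is not the real backupdb schema, and `open_db`, `record_check`, and `must_recheck` are invented names.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a toy cache db with one row per local file (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS files "
                 "(path TEXT PRIMARY KEY, cap TEXT, damaged INTEGER DEFAULT 0)")
    return conn


def record_check(conn, cap, healthy):
    """Called after any check/verify: remember whether the cap was found damaged."""
    conn.execute("UPDATE files SET damaged=? WHERE cap=?",
                 (0 if healthy else 1, cap))
    conn.commit()


def must_recheck(conn, path):
    """True if the next backup should not trust the cached entry for `path`."""
    row = conn.execute("SELECT damaged FROM files WHERE path=?",
                       (path,)).fetchone()
    return row is None or bool(row[0])
```

With something like this, a deep-check that reports a damaged file would flip the flag, and the next "tahoe backup" run would treat that file as needing the "-and-check" behavior rather than skipping it.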
comment:21 Changed at 2013-09-01T22:26:27Z by daira
- Summary changed from "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are missing does not repair that content.
Shares can be missing; only files/directories can be unrecoverable.
comment:22 in reply to: ↑ 5 Changed at 2013-09-02T01:34:08Z by daira
- Milestone changed from 1.12.0 to soon
- Owner markberger deleted
Oh, I missed this:
srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.
So it's definitely the backupdb logic.
comment:23 Changed at 2013-11-28T01:40:30Z by amontero
I strongly +1 an "ignore-backupdb" kind of option that would ensure that all files are uploaded at backup time without any backupdb optimization, even at the added time cost. If I'm not wrong, unmodified files would produce shares identical to the ones already stored, so no bandwidth would be used.
comment:24 Changed at 2013-11-28T22:14:07Z by daira
See also #1331 (--verify option for tahoe backup).
comment:25 Changed at 2018-08-21T21:56:16Z by tlhonmey
"(tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.) "
If the random check is to be helpful, then when it encounters a file that the backupdb says should be there, but isn't, it should discard the backupdb and start over assuming that it needs to check every file.
It would also be good to be able to specify a frequency for the random checking since the size and composition of the data in question affects what the best tradeoff is between speed and thoroughness.
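As a rough sketch of both suggestions (a configurable spot-check frequency, and escalating to a full check once a spot check fails), assuming hypothetical names throughout: `plan_checks`, `check_probability`, and `is_healthy` are not part of tahoe backup, and this does not describe how its existing probabilistic check works.

```python
import random


def plan_checks(cached_caps, is_healthy, check_probability=0.1, rng=random.random):
    """Return the cached caps that need to be re-uploaded.

    Each cached cap is spot-checked with probability `check_probability`.
    On the first failed spot check, stop trusting the cache and check every cap.
    """
    for cap in cached_caps:
        if rng() < check_probability and not is_healthy(cap):
            # The cache is known to be stale: fall back to a full check.
            return [c for c in cached_caps if not is_healthy(c)]
    # No spot check failed (or none was performed): trust the cache this run.
    return []
```

A higher `check_probability` trades backup speed for a better chance of noticing lost shares on any given run; at 1.0 every cached file is checked, and at 0.0 the cache is always trusted.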
I think it was set to need 2, happy 2, total 3 on 1.9.2 when the original directory upload happened. Same settings under 1.10 when the failure and re-publish happened.