#2035 new defect

"tahoe backup" on the same immutable content when some shares are missing does not repair that content.

Reported by: nejucomo
Owned by:
Priority: normal
Milestone: soon
Component: code-encoding
Version: 1.10.0
Keywords: usability preservation reliability servers-of-happiness repair tahoe-backup performance
Cc: srl
Launchpad Bug:

Description (last modified by zooko)

From srl295 on IRC:

Interesting behavior: I had some data on our grid backed up with 'tahoe backup', which creates Latest, Archives, etc.; some of them are immutable dirs. Then I lost a drive, so that particular storage node went down and never came back. Some directories were unrecoverable, OK, fine. I created new directories and ran tahoe backup again on the new directory URI. However, I'm still getting errors on deep check: ERROR: NotEnoughSharesError(ran out of shares: complete= pending=Share(sh1-on-sslg5v) overdue= unused= need 2. Last failure: None). I wonder, is tahoe reusing some unrecoverable URIs since the same immutable directory was created?

Change History (25)

comment:1 Changed at 2013-07-23T16:48:29Z by srl

I think it was set to need 2, happy 2, total 3 on 1.9.2 when the original directory upload happened. The same settings were in place under 1.10 when the failure and re-publish happened.

comment:2 Changed at 2013-07-23T16:49:05Z by nejucomo

Also from IRC, a recommended reproduction:

A good test case would probably be: upload an immutable file, make it unhealthy or unrecoverable, then later try to upload it again.

comment:3 Changed at 2013-07-23T16:51:08Z by nejucomo

  • Keywords preservation reliability servers-of-happiness upload repair tahoe-backup added

comment:4 follow-up: Changed at 2013-07-23T17:05:34Z by nejucomo

More IRC from nejucomo (me):

Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.

comment:5 in reply to: ↑ 4 ; follow-ups: Changed at 2013-07-23T17:22:39Z by nejucomo

Replying to nejucomo:

More IRC from nejucomo (me):

Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.

srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.

Therefore I propose this is a bug in backupdb caching logic. If possible it should verify the health of items in the cache. If this is expensive, maybe it could be an opt-in behavior with a commandline option.
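To make the proposal concrete, here is a minimal sketch of an opt-in health check on cache hits. It is not the real backupdb code: `lookup_cached_cap`, the plain dict standing in for the cache, and the `is_healthy` callback are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional


def lookup_cached_cap(
    cache: Dict[str, str],
    path: str,
    is_healthy: Callable[[str], bool],
    verify: bool = False,
) -> Optional[str]:
    """Return the cached cap for path, or None if the file must be uploaded.

    Hypothetical stand-in for a backupdb lookup; `verify` models an
    opt-in command-line option.
    """
    cap = cache.get(path)
    if cap is None:
        return None  # never uploaded before
    if verify and not is_healthy(cap):
        # The cache entry points at damaged content: forget it so that
        # the caller uploads the file again.
        del cache[path]
        return None
    return cap


# Illustrative data only; these are not real caps.
cache = {"/home/a.txt": "URI:CHK:example-a", "/home/b.txt": "URI:CHK:example-b"}
damaged = {"URI:CHK:example-b"}


def healthy(cap: str) -> bool:
    return cap not in damaged
```

Without `verify`, the damaged entry for `/home/b.txt` is still returned as a hit, which is the behavior reported in this ticket. With `verify=True`, the entry is dropped and the file would be uploaded again, at the cost of one health check per cached file.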

I'm going to update the keywords to reflect this new information.

comment:6 Changed at 2013-07-23T17:23:57Z by nejucomo

  • Keywords usability added; upload removed

comment:7 Changed at 2013-07-23T17:26:30Z by nejucomo

I'm not certain the current keywords are accurate. I attempted to err on the side of caution and apply them liberally.

  • I removed upload because I believe upload does the right thing.
  • repair may not be relevant because although this is about repairing backups, it's not using any specialized repair mechanism outside of immutable-dedup upload.
  • I added usability because without knowing the trick of nuking backupdb.sqlite users may believe they've successfully made a backup where some files remain unrecoverable due to the cache.

comment:8 Changed at 2013-07-23T17:55:28Z by daira

  • Keywords changed from usability, preservation, reliability, servers-of-happiness, repair, tahoe-backup to usability preservation reliability servers-of-happiness repair tahoe-backup

comment:9 in reply to: ↑ 5 Changed at 2013-07-23T18:25:30Z by srl

Replying to nejucomo:

Therefore I propose this is a bug in backupdb caching logic. If possible it should verify the health of items in the cache. If this is expensive, maybe it could be an opt-in behavior with a commandline option.

backup-and-verify would be nice. I would think that backup could be efficient here and check whether the shares are there before re-using its cache.

Also note that the trick is to nuke the database AND unlink the backup directories, so it probably cannot preserve the Archives items.

Last edited at 2013-07-23T18:26:01Z by srl

comment:10 Changed at 2013-07-23T19:03:02Z by daira

  • Keywords performance added

The backupdb is a performance hack to avoid the latency cost of asking servers whether each file exists on the grid. If the latter were fast enough (which would probably require batching requests for multiple files), then it wouldn't be needed. (tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.)

In the meantime, how about adding a --repair option to tahoe backup, which would bypass the backupdb-based conditional upload and upload/repair every file?

Last edited at 2013-07-23T19:04:25Z by daira

comment:11 follow-up: Changed at 2013-07-23T19:10:29Z by daira

Hmm, it looks from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

comment:12 Changed at 2013-07-23T19:13:05Z by daira

Sorry, intended to paste the code:

        if ((last_size != size
             or not use_timestamps
             or last_mtime != mtime
             or last_ctime != ctime) # the file has been changed
            or (not row2) # we somehow forgot where we put the file last time
            ):
            c.execute("DELETE FROM local_files WHERE path=?", (path,))
            self.connection.commit()
            return FileResult(self, None, False, path, mtime, ctime, size)

So when not use_timestamps, the existing db entry is deleted and the FileResult has None for the existing file URI. (Note that we still might not repair the file very well; see #1382.)
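The effect can be seen by evaluating the same boolean condition on plain values. This is only a sketch that mirrors the expression pasted above; `is_cache_miss` and `has_row2` are names made up here, and nothing below touches the real database.

```python
def is_cache_miss(last, current, use_timestamps, has_row2=True):
    """Mirror the condition from the pasted check_file excerpt.

    `last` and `current` are (size, mtime, ctime) tuples; `has_row2`
    stands in for the truthiness of `row2` in the original code.
    """
    last_size, last_mtime, last_ctime = last
    size, mtime, ctime = current
    return bool(
        (last_size != size
         or not use_timestamps
         or last_mtime != mtime
         or last_ctime != ctime)
        or (not has_row2)
    )


# An unchanged file: same size, mtime and ctime as recorded last time.
unchanged = (1024, 1374598109.0, 1374598109.0)

print(is_cache_miss(unchanged, unchanged, use_timestamps=True))   # False
print(is_cache_miss(unchanged, unchanged, use_timestamps=False))  # True
```

With `use_timestamps=False` the condition is true for every file, including unchanged ones, so each lookup takes the delete-and-return-None branch.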

comment:13 in reply to: ↑ 11 ; follow-up: Changed at 2013-07-23T19:36:53Z by srl

Replying to daira:

Hmm, it looks from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only ignoring timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

comment:14 Changed at 2013-07-24T14:58:59Z by zooko

  • Description modified

comment:15 in reply to: ↑ 13 ; follow-up: Changed at 2013-07-24T15:01:17Z by zooko

Replying to srl:

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

Unlink them from where?

comment:16 in reply to: ↑ 15 Changed at 2013-07-24T18:02:03Z by srl

Replying to zooko:

Replying to srl:

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

Unlink them from where?

I unlinked the Latest and Archives directories that tahoe backup created

comment:17 Changed at 2013-09-01T16:42:01Z by daira

  • Milestone changed from undecided to 1.12.0
  • Owner changed from daira to markberger
  • Summary changed from Publishing the same immutable content when some shares are unrecoverable does not repair that content. to Uploading the same immutable content when some shares are unrecoverable does not repair that content.

Nitpick: use "uploading" for immutable files or shares, and "publishing" for (versions of) mutable files or shares.

markberger: do any of your improvements address this?

comment:18 Changed at 2013-09-01T16:42:14Z by daira

  • Component changed from unknown to code-encoding

comment:19 follow-up: Changed at 2013-09-01T18:23:15Z by daira

It's unclear to me whether this is just a duplicate of other bugs (e.g. #1130 and #1124) that are being fixed in #1382, or whether it is a separate problem in tahoe backup.

comment:20 in reply to: ↑ 19 Changed at 2013-09-01T21:55:27Z by zooko

  • Summary changed from Uploading the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content.

Replying to daira:

It's unclear to me whether this is just a duplicate of other bugs (e.g. #1130 and #1124) that are being fixed in #1382, or whether it is a separate problem in tahoe backup.

I think this is a different problem from #1382. I think this problem has to do with the fact that "tahoe backup" inspects its local cache "backupdb", decides that the file is already backed up, and then does not issue any network requests, which would allow it to find out that the file is damaged or even broken.

If that's the issue, possible solutions include:

  • a "backup-and-check" or "backup-and-verify" feature, as mentioned in comment:10 and other comments,
  • causing all checks and verifies to update the backupdb and record the fact that the file was detected as damaged or broken, and then the next time you run "tahoe backup" tahoe could notice this recorded fact and automatically trigger the "-and-check" or "-and-verify" behavior
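The second option could look roughly like the following sketch. The table `cached_files`, its `needs_check` column, and both helper functions are hypothetical; the real backupdb schema is different and has no such flag as far as this ticket shows.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cached_files ("
        " path TEXT PRIMARY KEY,"
        " filecap TEXT NOT NULL,"
        " needs_check INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_check_result(conn, filecap, healthy):
    """After any check or verify, flag or clear every path stored under filecap."""
    conn.execute(
        "UPDATE cached_files SET needs_check=? WHERE filecap=?",
        (0 if healthy else 1, filecap),
    )
    conn.commit()


def should_trust_cache(conn, path):
    """True only if path is cached and not flagged as damaged."""
    row = conn.execute(
        "SELECT needs_check FROM cached_files WHERE path=?", (path,)
    ).fetchone()
    return row is not None and row[0] == 0


conn = open_db()
conn.execute(
    "INSERT INTO cached_files (path, filecap) VALUES (?, ?)",
    ("docs/a.txt", "URI:CHK:example-a"),  # illustrative cap, not a real one
)
# A deep-check run found the file damaged, so the cache entry is flagged.
record_check_result(conn, "URI:CHK:example-a", healthy=False)
```

On the next run, a backup that consults `should_trust_cache` before reusing an entry would see the flag and fall back to the check-or-upload path for that file only.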

Changing the Summary of this ticket to reflect what I think the issue is.

comment:21 Changed at 2013-09-01T22:26:27Z by daira

  • Summary changed from "tahoe backup" on the same immutable content when some shares are unrecoverable does not repair that content. to "tahoe backup" on the same immutable content when some shares are missing does not repair that content.

Shares can be missing; only files/directories can be unrecoverable.

comment:22 in reply to: ↑ 5 Changed at 2013-09-02T01:34:08Z by daira

  • Milestone changed from 1.12.0 to soon
  • Owner markberger deleted

Oh, I missed this:

nejucomo:

srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.

So it's definitely the backupdb logic.

comment:23 Changed at 2013-11-28T01:40:30Z by amontero

I strongly +1 an "ignore-backupdb" kind of option that would ensure that all files are uploaded at backup time without any backupdb optimization, even at the added time cost. If I'm not wrong, unmodified files would produce shares identical to those already stored, and no bandwidth would be used.

comment:24 Changed at 2013-11-28T22:14:07Z by daira

See also #1331 (--verify option for tahoe backup).

comment:25 Changed at 2018-08-21T21:56:16Z by tlhonmey

"(tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.) "

If the random check is to be helpful, then when it encounters a file that the backupdb says should be there, but isn't, it should discard the backupdb and start over assuming that it needs to check every file.

It would also be good to be able to specify a frequency for the random checking since the size and composition of the data in question affects what the best tradeoff is between speed and thoroughness.
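A rough sketch of both suggestions, assuming a dict in place of the backupdb and a caller-supplied health check; `plan_backup`, `check_probability`, and `is_healthy` are hypothetical names, not existing tahoe options or functions.

```python
import random


def plan_backup(cache, is_healthy, check_probability, rng):
    """Return the set of cached paths that need to be uploaded again.

    Each cached entry is spot-checked with probability check_probability.
    If any spot check fails, the cache is no longer trusted and every
    entry is checked.
    """
    sampled = [p for p in sorted(cache) if rng.random() < check_probability]
    if all(is_healthy(cache[p]) for p in sampled):
        return set()  # no sampled entry failed: keep trusting the cache
    # At least one sampled entry is damaged: check everything.
    return {p for p in cache if not is_healthy(cache[p])}


# Illustrative data only.
cache = {"a.txt": "cap-a", "b.txt": "cap-b", "c.txt": "cap-c"}
damaged = {"cap-b", "cap-c"}


def is_healthy(cap):
    return cap not in damaged
```

With `check_probability=1.0` every entry is checked and both damaged files are returned. With `0.0` nothing is sampled and the damage goes unnoticed, which is the tradeoff between speed and thoroughness that a configurable frequency would expose.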
