#2035 new defect

Publishing the same immutable content when some shares are unrecoverable does not repair that content. — at Version 14

Reported by: nejucomo
Owned by: daira
Priority: normal
Milestone: soon
Component: code-encoding
Version: 1.10.0
Keywords: usability preservation reliability servers-of-happiness repair tahoe-backup performance
Cc: srl
Launchpad Bug:

Description (last modified by zooko)

From srl295 on IRC:

Interesting behavior: I had some stuff on our grid backed up with 'tahoe backup', which creates Latest, Archive, etc.; some of them are immutable dirs. Then I lost a drive, so that particular storage node went down and never came back. Some directories were irrecoverable, OK fine. I created new directories and ran tahoe backup again on the new directory URI. However, I'm still getting errors on deep check: ERROR: NotEnoughSharesError(ran out of shares: complete= pending=Share(sh1-on-sslg5v) overdue= unused= need 2. Last failure: None). I wonder, is tahoe reusing some unrecoverable URIs since the same immutable directory was created?

Change History (14)

comment:1 Changed at 2013-07-23T16:48:29Z by srl

I think it was set to need 2, happy 2, total 3 on 1.9.2 when the original directory upload happened. The same settings were in effect under 1.10 when the failure and re-publish happened.

comment:2 Changed at 2013-07-23T16:49:05Z by nejucomo

Also from IRC, a recommended reproduction:

a good test case would probably be: upload an immutable file, make it unhealthy or unrecoverable, then later try to upload it again
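A rough sketch of that reproduction, driving the tahoe CLI from Python. The put and check commands are real; the file name is arbitrary, and the step that makes the file unhealthy (stopping storage nodes, or deleting their share files) has to be done out of band between the two uploads:

    import subprocess

    def tahoe(*args):
        # Run a tahoe CLI command and return its stdout as a string.
        return subprocess.check_output(("tahoe",) + args).decode("utf-8").strip()

    # 1. Upload an immutable file and remember its cap.
    cap = tahoe("put", "testfile.bin")   # arbitrary local test file
    print("uploaded:", cap)

    # 2. Make the file unhealthy or unrecoverable out of band here (stop enough
    #    storage nodes, or delete their share files), then confirm:
    print(tahoe("check", cap))

    # 3. Upload the same content again; the re-upload should repair the file,
    #    and the question in this ticket is whether it actually does.
    cap2 = tahoe("put", "testfile.bin")
    print("re-uploaded (same cap expected):", cap2)
    print(tahoe("check", cap2))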

comment:3 Changed at 2013-07-23T16:51:08Z by nejucomo

  • Keywords preservation reliability servers-of-happiness upload repair tahoe-backup added

comment:4 follow-up: Changed at 2013-07-23T17:05:34Z by nejucomo

More IRC from nejucomo (me):

Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.

comment:5 in reply to: ↑ 4 ; follow-up: Changed at 2013-07-23T17:22:39Z by nejucomo

Replying to nejucomo:

More IRC from nejucomo (me):

Ah! I have a hypothesis: The backup command keeps a local cache of which file revisions have been uploaded. Then it checks that as an optimization. If you can find that cache db, try renaming it, then rerun the backup.

srl verified that removing backupdb.sqlite, deleting the backup directories, and then rerunning backup successfully stored their data into the same immutable caps.

Therefore I propose that this is a bug in the backupdb caching logic. If possible, it should verify the health of items in the cache. If that is expensive, it could perhaps be opt-in behavior behind a command-line option.
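One possible shape for that opt-in check, as a sketch only: before trusting a cached cap, ask the grid whether it still looks healthy, and treat unhealthy entries as cache misses. The names below only loosely mirror src/allmydata/scripts/backupdb.py; the verify_health flag and cap_is_healthy() helper are hypothetical, and the string match on tahoe check output is an assumption about its human-readable summary.

    import subprocess

    def cap_is_healthy(cap):
        # Hypothetical helper: ask the grid whether the cap currently looks
        # healthy, here by shelling out to `tahoe check` and string-matching
        # its summary line (an assumption about the output format).
        out = subprocess.check_output(["tahoe", "check", cap]).decode("utf-8")
        return "Healthy" in out and "Not Healthy" not in out

    def cached_filecap_for(entry, verify_health=False):
        # Simplified stand-in for the backupdb lookup: return the cached cap
        # only if the entry exists and (optionally) still checks out healthy;
        # returning None tells the caller to re-upload (and thereby repair).
        if entry is None:
            return None
        if verify_health and not cap_is_healthy(entry.filecap):
            return None
        return entry.filecap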

I'm going to update the keywords to reflect this new information.

comment:6 Changed at 2013-07-23T17:23:57Z by nejucomo

  • Keywords usability added; upload removed

comment:7 Changed at 2013-07-23T17:26:30Z by nejucomo

I'm not certain the current keywords are accurate. I attempted to err on the side of caution and apply them liberally.

  • I removed upload because I believe upload does the right thing.
  • repair may not be relevant: although this ticket is about repairing backups, it doesn't use any specialized repair mechanism beyond immutable-dedup upload.
  • I added usability because, without knowing the trick of nuking backupdb.sqlite, users may believe they've successfully made a backup even though some files remain unrecoverable because of the cache.

comment:8 Changed at 2013-07-23T17:55:28Z by daira

  • Keywords changed from usability, preservation, reliability, servers-of-happiness, repair, tahoe-backup to usability preservation reliability servers-of-happiness repair tahoe-backup

comment:9 in reply to: ↑ 5 Changed at 2013-07-23T18:25:30Z by srl

Replying to nejucomo:

Therefore I propose that this is a bug in the backupdb caching logic. If possible, it should verify the health of items in the cache. If that is expensive, it could perhaps be opt-in behavior behind a command-line option.

A backup-and-verify mode would be nice. I would think that backup could be efficient here: check whether the shares are there before re-using its cache.

Also note that the trick is to nuke the database AND unlink the bad directories, so that trick probably can't work while preserving the Archived items.

Last edited at 2013-07-23T18:26:01Z by srl

comment:10 Changed at 2013-07-23T19:03:02Z by daira

  • Keywords performance added

The backupdb is a performance hack to avoid the latency cost of asking servers whether each file exists on the grid. If the latter were fast enough (which would probably require batching requests for multiple files), then it wouldn't be needed. (tahoe backup probabilistically checks some files on each run even if they are present in the backupdb, but I don't think that particularly helps.)

In the meantime, how about adding a --repair option to tahoe backup, which would bypass the backupdb-based conditional upload and upload/repair every file?
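What that proposed --repair option might look like inside the backup loop, as a sketch; the option does not exist yet, and maybe_upload(), the backupdb methods, and upload() below are simplified stand-ins rather than the real helpers:

    def maybe_upload(path, backupdb, repair=False):
        # Simplified stand-in for tahoe backup's per-file logic. With the
        # hypothetical --repair flag set, the backupdb's conditional-upload
        # shortcut is bypassed and every file is uploaded (and so repaired).
        result = backupdb.check_file(path)
        cached_cap = result.was_uploaded()   # False if not in the backupdb
        if repair or not cached_cap:
            filecap = upload(path)           # stand-in for the real upload call
            result.did_upload(filecap)       # record the (possibly new) cap
            return filecap
        return cached_cap                    # trust the cache, skip the upload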

Last edited at 2013-07-23T19:04:25Z by daira

comment:11 follow-up: Changed at 2013-07-23T19:10:29Z by daira

Hmm, judging from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, it looks as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only the timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

comment:12 Changed at 2013-07-23T19:13:05Z by daira

Sorry, I intended to paste the code:

        if ((last_size != size
             or not use_timestamps
             or last_mtime != mtime
             or last_ctime != ctime) # the file has been changed
            or (not row2) # we somehow forgot where we put the file last time
            ):
            c.execute("DELETE FROM local_files WHERE path=?", (path,))
            self.connection.commit()
            return FileResult(self, None, False, path, mtime, ctime, size)

So when not use_timestamps, the existing db entry is deleted and the FileResult has None for the existing file URI. (Note that we still might not repair the file very well; see #1382.)
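In other words (a minimal sketch, not the real call sites): with use_timestamps disabled the condition above is always true, the cached row is purged, and the returned result carries no existing URI, so the caller falls through to a fresh upload. The method names mirror backupdb.py, but the surrounding calls are simplified:

    result = backupdb.check_file(path, use_timestamps=False)  # --ignore-timestamps
    assert not result.was_uploaded()     # no cached cap survives the DELETE above
    filecap = upload_file(path)          # hypothetical upload step
    result.did_upload(filecap)           # re-record the cap in the backupdb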

comment:13 in reply to: ↑ 11 Changed at 2013-07-23T19:36:53Z by srl

Replying to daira:

Hmm, judging from this code in the method BackupDB_v2.check_file in src/allmydata/scripts/backupdb.py, it looks as though the --ignore-timestamps option of tahoe backup causes existing db entries to be completely ignored, rather than only the timestamps.

Perhaps we just need to rename --ignore-timestamps or document it better?

Just to note, I had to both rename the db AND unlink the bad directories to get them repaired.

comment:14 Changed at 2013-07-24T14:58:59Z by zooko

  • Description modified