#1115 closed defect (fixed)
count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)
Reported by: | zooko | Owned by: | davidsarah |
---|---|---|---|
Priority: | major | Milestone: | 1.9.2 |
Component: | code-frontend-web | Version: | 1.7.0 |
Keywords: | usability easy repair servers-of-happiness upload review-needed | Cc: | |
Launchpad Bug: |
Description
I guess the web page reporting repair results is putting in the number of shares where it ought to have the number of servers.
Change History (33)
comment:1 Changed at 2010-07-10T04:51:44Z by zooko
comment:2 Changed at 2010-07-17T07:35:32Z by zooko
This would be a bugfix, and it would be a shallow patch with little danger of introducing worse bugs, so it would still be welcome for v1.7.1.
comment:3 Changed at 2010-07-18T03:33:36Z by davidsarah
- Milestone changed from 1.7.1 to 1.8β
Out of time.
comment:4 Changed at 2010-07-20T04:05:05Z by davidsarah
The bug appears to be here in immutable/filenode.py's check_and_repair method:
prr.data['servers-responding'] = list(servers_responding) prr.data['count-shares-good'] = len(sm) prr.data['count-good-share-hosts'] = len(sm)
count-good-share-hosts should not be the same as count-shares-good. It possibly should be len(servers_responding), but I'm not sure of that.
comment:5 Changed at 2010-07-20T04:31:24Z by zooko
Hm it is documented in webapi.txt:
count-good-share-hosts: the number of distinct storage servers with good shares. If this number is less than count-shares-good, then some shares are doubled up, increasing the correlation of failures. This indicates that one or more shares should be moved to an otherwise unused server, if one is available.
So it should be ... Hrm... I'm not 100% certain but I think that count-good-share-hosts should be len(reduce(set.__union__, sm.itervalues()). Clearly we need a unit test for this.
comment:6 follow-up: ↓ 7 Changed at 2010-07-20T05:08:31Z by davidsarah
Hmm, isn't that documentation misleading? Suppose that the share->server mapping is
{0: [A, B, C], 1: [D], 2: [D], 3: [D]}
for example. Then the number of good shares is 4 (0, 1, 2, 3), and the number of servers that hold good shares is 4 (A, B, C, D), but server D holds shares that should ideally be moved to other servers.
If the "if" in the comment is interpreted as a sufficient but not necessary condition for shares to be doubled up, then it is strictly speaking correct, but the wording tends to imply a necessary-and-sufficient condition.
The measure that should really be used to decide whether the shares for a file are properly distributed, is servers-of-happiness. We should add that to the repair report, and have the docs deemphasize the usefulness of count-good-share-hosts -- or perhaps even remove it, given that it was calculated incorrectly.
comment:7 in reply to: ↑ 6 Changed at 2010-07-20T15:30:31Z by zooko
- Summary changed from post-repair says 10 hosts have good shares but there are only 4 hosts to add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts)
Replying to davidsarah:
The measure that should really be used to decide whether the shares for a file are properly distributed, is servers-of-happiness. We should add that to the repair report,
+1
and have the docs deemphasize the usefulness of count-good-share-hosts -- or perhaps even remove it, given that it was calculated incorrectly.
I agree to de-emphasize the usefulness of it for deciding if you should rebalance. However, maybe we should go ahead and leave it (corrected) in the report, as someone might find that information useful for something.
comment:8 Changed at 2010-09-11T00:40:28Z by davidsarah
- Keywords test-needed added
- Milestone changed from 1.8β to 1.9.0
- Owner set to davidsarah
- Status changed from new to assigned
We have no unit test, so not for 1.8.
comment:9 Changed at 2010-09-11T00:45:48Z by david-sarah@…
In 0091205e3c56a4de:
comment:10 Changed at 2010-12-29T09:14:46Z by zooko
- Keywords servers-of-happiness added
comment:11 Changed at 2010-12-29T09:19:11Z by zooko
- Keywords upload added
comment:12 Changed at 2011-08-14T01:13:20Z by davidsarah
- Milestone changed from 1.9.0 to 1.10.0
comment:13 Changed at 2012-04-01T05:59:26Z by amiller
- Keywords review-needed added; test-needed removed
Right now there are two conflicting definitions for computing count-good-share-hosts:
- in immutable/filenode.py, which runs after a repair operation
prr.data['count-good-share-hosts'] = len(sm)
- in immutable/checker.py, which runs just when checking
d['count-good-share-hosts'] = len([s for s in servers.keys() if servers[s]])
So, I wrote a test for both of these conditions. Per comment:5 I'm providing a patch to fix the former (incorrect) one.
comment:14 Changed at 2012-04-01T22:34:02Z by davidsarah
len(reduce(set.union, sm.itervalues())) as used in the pull request is wrong for the empty map:
>>> reduce(set.union, {}.itervalues()) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: reduce() of empty sequence with no initial value
len(set(sm.itervalues())) is correct. However, the docs should make clear that a high value of count-good-share-hosts does not imply good share distribution (although a low value does imply bad distribution or insufficient shares).
comment:15 Changed at 2012-04-01T22:43:36Z by davidsarah
- Keywords review-needed removed
- Milestone changed from 1.11.0 to 1.9.2
- Owner changed from davidsarah to amiller
- Status changed from assigned to new
comment:16 Changed at 2012-04-01T23:39:10Z by amiller
- Keywords review-needed added
Good catch. I fixed the problem by adding an initial argument to reduce:
len(reduce(set.union, sm.itervalues(), set())
I modified the unit test to cover this as well.
Apparently github pull requests are not stateless so the previous link now includes my new commit, so perhaps that's not the ideal for archiving progress. Here's the commit:
https://github.com/amiller/tahoe-lafs/commit/3898753b9549f5b355a0233b835da8fc38faa5c6
comment:17 Changed at 2012-04-02T00:10:52Z by davidsarah
- Owner changed from amiller to davidsarah
- Status changed from new to assigned
comment:18 Changed at 2012-05-13T08:26:00Z by warner
- Keywords review-needed removed
- Owner changed from davidsarah to warner
- Status changed from assigned to new
Yeah, that describes what I had in mind for count-good-share-hosts, but I agree that it probably isn't as meaningful as I'd originally thought. The sharemap described in comment:6 is a great example, although I think it'd be awfully hard to get into this situation (it would take some serious contortions to get an uploader+repairer to leave shares in that state). count-good-share-hosts probably covered the situations I was able to imagine happening after a typical upload process, but if there's a better "what needs to happen to make your file healthier" value, I'm happy to emphasize that instead.
Anyways, I reviewed the patch and it looks good. I added a few cosmetic fixes (whitespace, name of the test file), and will land it in a minute.
comment:19 Changed at 2012-05-13T08:26:27Z by Brian Warner <warner@…>
- Resolution set to fixed
- Status changed from new to closed
In fcc7e6475918eab1:
comment:20 Changed at 2012-05-13T08:27:27Z by Brian Warner <warner@…>
In fcc7e6475918eab1:
comment:21 Changed at 2012-05-13T23:53:48Z by davidsarah
- Resolution fixed deleted
- Status changed from closed to reopened
Want to double-check the correctness of this.
comment:22 Changed at 2012-05-13T23:54:20Z by davidsarah
- Keywords review-needed added
- Owner changed from warner to davidsarah
- Status changed from reopened to new
comment:23 Changed at 2012-05-13T23:54:33Z by davidsarah
- Status changed from new to assigned
comment:24 Changed at 2012-07-01T17:12:28Z by davidsarah
- Summary changed from add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts) to count-good-shares is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)
comment:25 Changed at 2012-07-01T17:15:34Z by davidsarah
- Summary changed from count-good-shares is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts) to count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)
comment:26 Changed at 2012-07-01T17:17:44Z by davidsarah
Adding an entry for the happiness count has been split to #1784.
comment:27 follow-up: ↓ 29 Changed at 2012-07-01T22:05:43Z by davidsarah
The change to the computation of count-good-share-hosts at 1.9.2/src/allmydata/immutable/checker.py@5499#L765 looks correct. A minor simplification is possible:
d['count-good-share-hosts'] = len([s for s in servers.keys() if servers[s]])
could just be
d['count-good-share-hosts'] = len(servers)
since the servers dict never includes any entries that are empty sets. I don't think this simplification is needed for 1.9.2.
I spotted another problem, though: the value of needs-rebalancing is computed inconsistently between checker.py and filenode.py. In checker.py it is computed as:
# The file needs rebalancing if the set of servers that have at least # one share is less than the number of uniquely-numbered shares # available. cr.set_needs_rebalancing(d['count-good-share-hosts'] < d['count-shares-good'])
In filenode.py it is computed as
prr.set_needs_rebalancing(len(sm) >= verifycap.total_shares)
where len(sm) is equal to count-shares-good. I don't understand this latter definition at all, it looks completely wrong. The definition in checker.py is more subtly wrong because it credits servers that only have duplicated shares as contributing to existing balance. The correct definition should be something like 'iff the happiness count is less than the number of uniquely-numbered good shares available'.
I propose to change the definition in filenode.py to be consistent with checker.py in 1.9.2, and then change it to use the happiness count in 1.10, as part of #1784.
comment:28 Changed at 2012-07-01T22:28:12Z by davidsarah
The definition of needs-rebalancing in 1.9.2/docs/frontends/webapi.rst is:
needs-rebalancing: (bool) True if there are multiple shares on a single storage server, indicating a reduction in reliability that could be resolved by moving shares to new servers.
but:
- it is not true in general that multiple shares on a single storage server indicates a reduction in reliability. It depends whether those shares are not present on other servers.
- the definition given does not correspond to what either checker.py or filenode.py computes :-(
comment:29 in reply to: ↑ 27 Changed at 2012-07-01T22:57:09Z by davidsarah
- Resolution set to fixed
- Status changed from assigned to closed
Replying to davidsarah:
I propose to change the definition in filenode.py to be consistent with checker.py in 1.9.2, and then change it to use the happiness count in 1.10, as part of #1784.
Actually, sod it, I can't decide how to fix needs-rebalancing for 1.9.2, so I'll put a caveat in webapi.rst and bump it to 1.10 (as part of #1784).
comment:30 Changed at 2012-07-01T23:10:16Z by david-sarah@…
In 5532/1.9.2:
comment:31 Changed at 2012-07-16T16:33:55Z by david-sarah@…
comment:32 Changed at 2013-04-18T22:50:13Z by Daira Hopwood <david-sarah@…>
In b06f8cd8d03a6239:
comment:33 Changed at 2013-11-14T17:53:57Z by daira
The needs-rebalancing issue is now #2105.