#1115 closed defect (fixed)

count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)

Reported by:   zooko
Owned by:      davidsarah
Priority:      major
Milestone:     1.9.2
Component:     code-frontend-web
Version:       1.7.0
Keywords:      usability easy repair servers-of-happiness upload review-needed
Cc:
Launchpad Bug:

Description

I guess the web page reporting repair results is putting in the number of shares where it ought to have the number of servers.

Change History (33)

comment:1 Changed at 2010-07-10T04:51:44Z by zooko

File Check-And-Repair Results for SI=mkajm2jwe4oaegn5ganiv6guuu

Healthy : healthy
Repair successful
Post-Repair Checker Results:
Report:
Share Counts: need 3-of-10, have 10
Hosts with good shares: 10
Corrupt shares: none
Wrong Shares: 0
Good Shares (sorted in share order):
Share ID  Nickname           Node ID
0         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
1         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
2         strato             tavrk54ewt2bl2faybb55wrs3ghissvx
3         FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7
4         rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl
5         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
6         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
7         strato             tavrk54ewt2bl2faybb55wrs3ghissvx
8         FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7
9         rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl
Recoverable Versions: 1
Unrecoverable Versions: 0
Share Balancing (servers in permuted order):
Nickname           Node ID                           Share IDs
soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz  0 1 5 6
strato             tavrk54ewt2bl2faybb55wrs3ghissvx  2 7
FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7  3 8
rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl  4 9
Pre-Repair Checker Results:
Report:
Share Counts: need 3-of-10, have 8
Hosts with good shares: 4
Corrupt shares: none
Wrong Shares: 0
Good Shares (sorted in share order):
Share ID  Nickname           Node ID
1         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
2         strato             tavrk54ewt2bl2faybb55wrs3ghissvx
3         FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7
4         rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl
6         soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz
7         strato             tavrk54ewt2bl2faybb55wrs3ghissvx
8         FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7
9         rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl
Recoverable Versions: 1
Unrecoverable Versions: 0
Share Balancing (servers in permuted order):
Nickname           Node ID                           Share IDs
soultcer@skulls    sp26qyqclbwjv6lgqdxik5lfdlxrkpnz  1 6
strato             tavrk54ewt2bl2faybb55wrs3ghissvx  2 7
FreeStorm-Neptune  fp3xjndgjt2npubdl2jqqb26clanyag7  3 8
rxp_apathy         lv3fqmev464vcyi5yn7idjopbbptidxl  4 9

comment:2 Changed at 2010-07-17T07:35:32Z by zooko

This would be a bugfix, and it would be a shallow patch with little danger of introducing worse bugs, so it would still be welcome for v1.7.1.

comment:3 Changed at 2010-07-18T03:33:36Z by davidsarah

  • Milestone changed from 1.7.1 to 1.8β

Out of time.

comment:4 Changed at 2010-07-20T04:05:05Z by davidsarah

The bug appears to be here in immutable/filenode.py's check_and_repair method:

    prr.data['servers-responding'] = list(servers_responding)
    prr.data['count-shares-good'] = len(sm)
    prr.data['count-good-share-hosts'] = len(sm)

count-good-share-hosts should not be the same as count-shares-good. It possibly should be len(servers_responding), but I'm not sure of that.

comment:5 Changed at 2010-07-20T04:31:24Z by zooko

Hm it is documented in webapi.txt:

     count-good-share-hosts: the number of distinct storage servers with
                             good shares. If this number is less than
                             count-shares-good, then some shares are doubled
                             up, increasing the correlation of failures. This
                             indicates that one or more shares should be
                             moved to an otherwise unused server, if one is
                             available.

So it should be ... Hrm... I'm not 100% certain, but I think that count-good-share-hosts should be len(reduce(set.union, sm.itervalues())). Clearly we need a unit test for this.
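
As a minimal illustration (standalone Python 2, not Tahoe-LAFS code, and assuming -- as the eventual patch does -- that sm maps share numbers to sets of servers), here is how len(sm) reports 10 "hosts" for a layout like the one in comment:1, while the union of the values gives the expected 4:

    # Hypothetical sharemap mirroring comment:1: 10 shares spread over 4 servers.
    sm = {0: set(['soultcer']),  1: set(['soultcer']),  2: set(['strato']),
          3: set(['FreeStorm']), 4: set(['rxp_apathy']), 5: set(['soultcer']),
          6: set(['soultcer']),  7: set(['strato']),     8: set(['FreeStorm']),
          9: set(['rxp_apathy'])}

    print len(sm)                                  # 10 -- what the buggy code reports as "hosts"
    print len(reduce(set.union, sm.itervalues()))  # 4  -- distinct servers holding good shares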

comment:6 follow-up: Changed at 2010-07-20T05:08:31Z by davidsarah

Hmm, isn't that documentation misleading? Suppose that the share->server mapping is

{0: [A, B, C], 1: [D], 2: [D], 3: [D]}

for example. Then the number of good shares is 4 (0, 1, 2, 3), and the number of servers that hold good shares is 4 (A, B, C, D), but server D holds shares that should ideally be moved to other servers.

If the "if" in the comment is interpreted as a sufficient but not necessary condition for shares to be doubled up, then it is strictly speaking correct, but the wording tends to imply a necessary-and-sufficient condition.

The measure that should really be used to decide whether the shares for a file are properly distributed is servers-of-happiness. We should add that to the repair report, and have the docs de-emphasize the usefulness of count-good-share-hosts -- or perhaps even remove it, given that it was calculated incorrectly.
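
To make the contrast concrete, here is a rough standalone sketch (Python 2, not Tahoe's actual servers-of-happiness implementation) evaluating the example sharemap above: both count-shares-good and count-good-share-hosts come out as 4, while a happiness-style maximum matching between shares and distinct servers is only 2, which is what actually reveals the poor distribution:

    # Hypothetical code: the comment:6 sharemap, share number -> set of servers.
    sharemap = {0: set('ABC'), 1: set('D'), 2: set('D'), 3: set('D')}

    count_shares_good = len(sharemap)                                   # 4
    count_good_share_hosts = len(reduce(set.union, sharemap.values()))  # 4 (A, B, C, D)

    def happiness(sm):
        # Size of a maximum matching between shares and servers: the largest
        # number of shares that can each be credited to a *distinct* server.
        match = {}  # server -> share currently matched to it
        def augment(share, seen):
            for server in sm[share]:
                if server not in seen:
                    seen.add(server)
                    if server not in match or augment(match[server], seen):
                        match[server] = share
                        return True
            return False
        return len([1 for share in sm if augment(share, set())])

    print count_shares_good, count_good_share_hosts, happiness(sharemap)  # 4 4 2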

comment:7 in reply to: ↑ 6 Changed at 2010-07-20T15:30:31Z by zooko

  • Summary changed from post-repair says 10 hosts have good shares but there are only 4 hosts to add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts)

Replying to davidsarah:

    The measure that should really be used to decide whether the shares for a file are properly distributed is servers-of-happiness. We should add that to the repair report,

+1

    and have the docs de-emphasize the usefulness of count-good-share-hosts -- or perhaps even remove it, given that it was calculated incorrectly.

I agree with de-emphasizing its usefulness for deciding whether you should rebalance. However, maybe we should go ahead and leave it (corrected) in the report, as someone might find that information useful for something.

comment:8 Changed at 2010-09-11T00:40:28Z by davidsarah

  • Keywords test-needed added
  • Milestone changed from 1.8β to 1.9.0
  • Owner set to davidsarah
  • Status changed from new to assigned

We have no unit test, so not for 1.8.

comment:9 Changed at 2010-09-11T00:45:48Z by david-sarah@…

In 0091205e3c56a4de:

docs/frontends/webapi.txt: note that 'count-good-share-hosts' is computed incorrectly; refs #1115

comment:10 Changed at 2010-12-29T09:14:46Z by zooko

  • Keywords servers-of-happiness added

comment:11 Changed at 2010-12-29T09:19:11Z by zooko

  • Keywords upload added

comment:12 Changed at 2011-08-14T01:13:20Z by davidsarah

  • Milestone changed from 1.9.0 to 1.10.0

comment:13 Changed at 2012-04-01T05:59:26Z by amiller

  • Keywords review-needed added; test-needed removed

Right now there are two conflicting definitions for computing count-good-share-hosts:

  1. in immutable/filenode.py, which runs after a repair operation:
     prr.data['count-good-share-hosts'] = len(sm)

  2. in immutable/checker.py, which runs just when checking:
     d['count-good-share-hosts'] = len([s for s in servers.keys() if servers[s]])

So, I wrote a test for both of these conditions. Per comment:5 I'm providing a patch to fix the former (incorrect) one.

https://github.com/amiller/tahoe-lafs/pull/4/files

Last edited at 2012-04-01T06:06:52Z by amiller (previous) (diff)

comment:14 Changed at 2012-04-01T22:34:02Z by davidsarah

len(reduce(set.union, sm.itervalues())) as used in the pull request is wrong for the empty map:

>>> reduce(set.union, {}.itervalues())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: reduce() of empty sequence with no initial value

len(set(sm.itervalues())) is correct. However, the docs should make clear that a high value of count-good-share-hosts does not imply good share distribution (although a low value does imply bad distribution or insufficient shares).

Version 1, edited at 2012-04-01T22:38:05Z by davidsarah (previous) (next) (diff)

comment:15 Changed at 2012-04-01T22:43:36Z by davidsarah

  • Keywords review-needed removed
  • Milestone changed from 1.11.0 to 1.9.2
  • Owner changed from davidsarah to amiller
  • Status changed from assigned to new

comment:16 Changed at 2012-04-01T23:39:10Z by amiller

  • Keywords review-needed added

Good catch. I fixed the problem by adding an initial argument to reduce:

len(reduce(set.union, sm.itervalues(), set()))

I modified the unit test to cover this as well.
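
As a quick standalone sanity check (plain Python 2, separate from the actual unit test), the initial set() argument makes the expression well-defined for an empty sharemap as well as a populated one:

    for sm in ({}, {0: set('A'), 1: set('A'), 2: set('B')}):
        print len(reduce(set.union, sm.itervalues(), set()))  # prints 0, then 2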

Apparently GitHub pull request links are not static, so the previous link now includes my new commit; perhaps that's not ideal for archiving progress. Here's the commit:

https://github.com/amiller/tahoe-lafs/commit/3898753b9549f5b355a0233b835da8fc38faa5c6

Last edited at 2012-04-01T23:39:39Z by amiller (previous) (diff)

comment:17 Changed at 2012-04-02T00:10:52Z by davidsarah

  • Owner changed from amiller to davidsarah
  • Status changed from new to assigned

comment:18 Changed at 2012-05-13T08:26:00Z by warner

  • Keywords review-needed removed
  • Owner changed from davidsarah to warner
  • Status changed from assigned to new

Yeah, that describes what I had in mind for count-good-share-hosts, but I agree that it probably isn't as meaningful as I'd originally thought. The sharemap described in comment:6 is a great example, although I think it'd be awfully hard to get into this situation (it would take some serious contortions to get an uploader+repairer to leave shares in that state). count-good-share-hosts probably covered the situations I was able to imagine happening after a typical upload process, but if there's a better "what needs to happen to make your file healthier" value, I'm happy to emphasize that instead.

Anyways, I reviewed the patch and it looks good. I added a few cosmetic fixes (whitespace, name of the test file), and will land it in a minute.

comment:19 Changed at 2012-05-13T08:26:27Z by Brian Warner <warner@…>

  • Resolution set to fixed
  • Status changed from new to closed

In fcc7e6475918eab1:

Doc updates and cosmetic fixes for #1115 patch.

Removes the caveat from webapi.txt about count-good-share-hosts being wrong.

This series should close #1115.

comment:21 Changed at 2012-05-13T23:53:48Z by davidsarah

  • Resolution fixed deleted
  • Status changed from closed to reopened

Want to double-check the correctness of this.

comment:22 Changed at 2012-05-13T23:54:20Z by davidsarah

  • Keywords review-needed added
  • Owner changed from warner to davidsarah
  • Status changed from reopened to new

comment:23 Changed at 2012-05-13T23:54:33Z by davidsarah

  • Status changed from new to assigned

comment:24 Changed at 2012-07-01T17:12:28Z by davidsarah

  • Summary changed from add servers-of-happiness to reports (post-repair says 10 hosts have good shares but there are only 4 hosts) to count-good-shares is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)

comment:25 Changed at 2012-07-01T17:15:34Z by davidsarah

  • Summary changed from count-good-shares is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts) to count-good-share-hosts is calculated incorrectly (post-repair says 10 hosts have good shares but there are only 4 hosts)

comment:26 Changed at 2012-07-01T17:17:44Z by davidsarah

Adding an entry for the happiness count has been split to #1784.

comment:27 follow-up: Changed at 2012-07-01T22:05:43Z by davidsarah

The change to the computation of count-good-share-hosts at 1.9.2/src/allmydata/immutable/checker.py@5499#L765 looks correct. A minor simplification is possible:

d['count-good-share-hosts'] = len([s for s in servers.keys() if servers[s]])

could just be

d['count-good-share-hosts'] = len(servers)

since the servers dict never includes any entries that are empty sets. I don't think this simplification is needed for 1.9.2.

I spotted another problem, though: the value of needs-rebalancing is computed inconsistently between checker.py and filenode.py. In checker.py it is computed as:

# The file needs rebalancing if the set of servers that have at least
# one share is less than the number of uniquely-numbered shares
# available.
cr.set_needs_rebalancing(d['count-good-share-hosts'] < d['count-shares-good'])

In filenode.py it is computed as

prr.set_needs_rebalancing(len(sm) >= verifycap.total_shares)

where len(sm) is equal to count-shares-good. I don't understand this latter definition at all; it looks completely wrong. The definition in checker.py is more subtly wrong because it credits servers that hold only duplicated shares as contributing to existing balance. The correct definition should be something like 'the file needs rebalancing iff the happiness count is less than the number of uniquely-numbered good shares available'.

I propose to change the definition in filenode.py to be consistent with checker.py in 1.9.2, and then change it to use the happiness count in 1.10, as part of #1784.
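
For concreteness, here is a small hypothetical comparison (standalone Python 2, not Tahoe code; the sharemaps and share counts are made up) of how the two formulas can disagree with each other and with intuition:

    def hosts(sm):
        return len(reduce(set.union, sm.itervalues(), set()))

    def needs_rebalancing_checker(sm):
        return hosts(sm) < len(sm)          # checker.py's rule

    def needs_rebalancing_filenode(sm, total_shares):
        return len(sm) >= total_shares      # filenode.py's rule, questioned above

    # Perfectly balanced: 10 shares, one per distinct server, total_shares = 10.
    balanced = dict((n, set(['server%d' % n])) for n in range(10))
    print needs_rebalancing_checker(balanced)        # False (nothing to rebalance)
    print needs_rebalancing_filenode(balanced, 10)   # True  (flags a healthy, balanced file)

    # The comment:6 sharemap: server D hoards shares 1-3, yet neither rule flags it.
    lopsided = {0: set('ABC'), 1: set('D'), 2: set('D'), 3: set('D')}
    print needs_rebalancing_checker(lopsided)        # False, although shares should move off D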

comment:28 Changed at 2012-07-01T22:28:12Z by davidsarah

The definition of needs-rebalancing in 1.9.2/docs/frontends/webapi.rst is:

    needs-rebalancing: (bool) True if there are multiple shares on a single
                       storage server, indicating a reduction in reliability
                       that could be resolved by moving shares to new
                       servers.

but:

  • it is not true in general that multiple shares on a single storage server indicate a reduction in reliability; it depends on whether those shares are also present on other servers.
  • the definition given does not correspond to what either checker.py or filenode.py computes :-(

comment:29 in reply to: ↑ 27 Changed at 2012-07-01T22:57:09Z by davidsarah

  • Resolution set to fixed
  • Status changed from assigned to closed

Replying to davidsarah:

    I propose to change the definition in filenode.py to be consistent with checker.py in 1.9.2, and then change it to use the happiness count in 1.10, as part of #1784.

Actually, sod it, I can't decide how to fix needs-rebalancing for 1.9.2, so I'll put a caveat in webapi.rst and bump it to 1.10 (as part of #1784).

comment:30 Changed at 2012-07-01T23:10:16Z by david-sarah@…

In 5532/1.9.2:

[1.9.2 branch] Add comments and a caveat in webapi.rst indicating that the needs-rebalancing field may be computed incorrectly. refs #1115 refs #1784

comment:31 Changed at 2012-07-16T16:33:55Z by david-sarah@…

In 5886/cloud-backend:

[1.9.2 branch] Add comments and a caveat in webapi.rst indicating that the needs-rebalancing field may be computed incorrectly. refs #1115 refs #1784

comment:32 Changed at 2013-04-18T22:50:13Z by Daira Hopwood <david-sarah@…>

In b06f8cd8d03a6239:

Add comments and a caveat in webapi.rst indicating that
the needs-rebalancing field may be computed incorrectly. refs #1115, #1784, #1477

Signed-off-by: Daira Hopwood <david-sarah@…>

comment:33 Changed at 2013-11-14T17:53:57Z by daira

The needs-rebalancing issue is now #2105.

Note: See TracTickets for help on using tickets.