Opened at 2010-07-20T02:33:07Z

Last modified at 2021-03-30T18:40:19Z

#1130 new defect

Failure to achieve happiness in upload or repair

Reported by:	kmarkley86	Owned by:	kevan
Priority:	major	Milestone:	soon
Component:	code-peerselection	Version:	1.7.1
Keywords:	upload repair rebalancing availability unfinished-business servers-of-happiness	Cc:
Launchpad Bug:

Description (last modified by daira)

Prior to Tahoe-LAFS v1.7.1, the immutable uploader would sometimes raise an assertion error (#1118). We fixed that problem, and we also fixed the problem of the uploader uploading an insufficiently well-distributed set of shares while believing that it had achieved servers-of-happiness. But now the uploader gives up and doesn't upload at all, reporting that it hasn't achieved happiness, in cases where a smarter uploader could achieve it. This ticket is to make the upload succeed in such cases.

Log excerpt:

19:12:35.519 L20 []#1337 CHKUploader starting
19:12:35.519 L20 []#1338 starting upload of <allmydata.immutable.upload.EncryptAnUploadable instance at 0x20886b5a8>
19:12:35.520 L20 []#1339 creating Encoder <Encoder for unknown storage index>
19:12:35.520 L20 []#1340 file size: 106
19:12:35.520 L10 []#1341 my encoding parameters: (2, 4, 4, 106)
19:12:35.520 L20 []#1342 got encoding parameters: 2/4/4 106
19:12:35.520 L20 []#1343 now setting up codec
19:12:35.520 L20 []#1344 using storage index 5xpii
19:12:35.520 L20 []#1345 <Tahoe2PeerSelector for upload 5xpii> starting
19:12:35.633 L10 []#1346 response from peer 47cslusc: alreadygot=(), allocated=(0,)
19:12:36.590 L10 []#1347 response from peer vjqcroal: alreadygot=(0, 3), allocated=(1,)
19:12:37.119 L10 []#1348 response from peer sn4ana4b: alreadygot=(1,), allocated=(2,)
19:12:37.124 L20 []#1349 storage: allocate_buckets 5xpiivbjrybcmy4ws7xp7dxez4
19:12:37.130 L10 []#1350 response from peer yuzbctlc: alreadygot=(2,), allocated=(0,)
19:12:37.130 L25 []#1351 server selection unsuccessful for <Tahoe2PeerSelector for upload 5xpii>: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 peers, 4 queries placed some shares, 0 placed none (of which 0 placed none due to the server being full and 0 placed none due to an error)), merged={0: set(['\xc52\x11Mb\xa1\xff\x8d\xafn\x0b#s\x17\xbe\x82\x85\x93G0']), 1: set(['\xaa`(\xb8\x0b\x89\x98Y\xfb\xcc2,T\xd0\xde\xf7\xca\xbfA#', '\x93x\x06\x83\x81\xdb\x12*\xe5\xb095T\xf0\x1e\xa5\x00V+\x0f']), 2: set(['\xc52\x11Mb\xa1\xff\x8d\xafn\x0b#s\x17\xbe\x82\x85\x93G0', '\x93x\x06\x83\x81\xdb\x12*\xe5\xb095T\xf0\x1e\xa5\x00V+\x0f']), 3: set(['\xaa`(\xb8\x0b\x89\x98Y\xfb\xcc2,T\xd0\xde\xf7\xca\xbfA#'])}
19:12:37.133 L20 []#1352 web: 127.0.0.1 PUT /uri/[CENSORED].. 500 1826
19:12:37.148 L23 []#1353 storage: aborting sharefile /home/tahoe/.tahoe/storage/shares/incoming/5x/5xpiivbjrybcmy4ws7xp7dxez4/0
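
The "only 3 server(s)" figure in the failure message can be reproduced from the merged map, on the understanding that happiness is the size of a maximum matching between shares and servers. The following is a minimal sketch, not Tahoe-LAFS code: `max_matching` is an illustrative name, and A, B, C are placeholder labels for the three distinct binary server IDs that appear in the map.

```python
def max_matching(share_to_servers):
    """Return a maximum matching {server: share} using augmenting paths."""
    match = {}  # server -> share currently assigned to it

    def try_assign(share, seen):
        for server in sorted(share_to_servers[share]):
            if server in seen:
                continue
            seen.add(server)
            # Take a free server, or take a busy one if its share can move.
            if server not in match or try_assign(match[server], seen):
                match[server] = share
                return True
        return False

    for share in sorted(share_to_servers):
        try_assign(share, set())
    return match


# Share number -> servers holding it, transcribed from the merged map above.
# "A", "B", "C" are placeholder labels for the three distinct server IDs.
merged = {0: {"A"}, 1: {"B", "C"}, 2: {"A", "C"}, 3: {"B"}}

happiness = len(max_matching(merged))
print(happiness)  # 3, below the requested happiness of 4
```

Only three distinct servers appear in the map, so no matching can exceed 3. If one share were instead held by a fourth server (for example `3: {"D"}`), the same function would return a matching of size 4.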

Attachments (1)

stuff.flog.bz2 (9.8 KB) - added by kmarkley86 at 2010-07-20T02:33:43Z.: Log from flogtool

Change History (27)

Changed at 2010-07-20T02:33:43Z by kmarkley86

Attachment stuff.flog.bz2 added

Log from flogtool

comment:1 Changed at 2010-07-20T02:34:46Z by kmarkley86

Description modified (diff)

comment:2 Changed at 2010-07-20T02:44:44Z by kmarkley86

I think I had originally uploaded this file when I was configured to use encoding parameters 2/3/4. That may explain the original distribution of the shares. I assume it's legal for a client to change their parameters (as I did, to 2/4/4) and continue using the grid. In this case the share needs to be migrated, but the migration doesn't happen.

comment:3 Changed at 2010-07-20T03:04:36Z by davidsarah

Component changed from unknown to code-peerselection
Description modified (diff)
Keywords upload rebalancing added
Version changed from 1.7.0 to 1.7.1

comment:4 Changed at 2010-07-20T03:05:15Z by davidsarah

Keywords availability added

comment:5 Changed at 2010-07-20T03:12:25Z by davidsarah

Description modified (diff)

comment:6 Changed at 2010-08-12T23:34:04Z by davidsarah

Milestone changed from undecided to 1.9.0

comment:7 Changed at 2010-08-12T23:34:21Z by davidsarah

Keywords unfinished-business added

comment:8 Changed at 2010-12-29T08:45:45Z by zooko

Keywords servers-of-happiness added

This issue reinforces Brian's sense of dubiousity of servers-of-happiness: http://tahoe-lafs.org/pipermail/tahoe-dev/2010-December/005704.html . This bothers me! I want Brian to love servers of happiness and revel in its excellence. Perhaps fixing this ticket would help.

comment:9 Changed at 2010-12-29T09:10:28Z by zooko

According to David-Sarah in this tahoe-dev message, this issue is nearly the same as the one tested in test_problem_layout_ticket_1128. So anybody who wants to fix this can start by running that one unit test.

comment:10 Changed at 2010-12-29T20:06:45Z by davidsarah

Yes, #1128 had already been closed as a duplicate of this ticket. The name of the unit test should probably be changed (although I hope we fix it before the next release anyway).

comment:11 Changed at 2011-06-09T00:11:37Z by davidsarah

Keywords repair added
Summary changed from Failure to achieve happiness in upload to Failure to achieve happiness in upload or repair

Upload and repair are sufficiently similar that I think they can be covered by the same ticket for this issue. They are implemented mostly by the same code, and they both should change to take into account existing shares in the same way, probably along the lines of ticket:1212#comment:14. The difference is that when happiness is not achieved, upload should fail, while repair should still make a best effort to improve preservation of the file. But that needn't stop them from using the same improvement to the share placement algorithm.

comment:12 Changed at 2011-06-09T00:28:36Z by davidsarah

[copying the algorithm from ticket:1212#comment:14 here, with some minor refinements, for ease of reference]

This is how I think the repairer should work:

1. Let k and N be the shares-needed and total number of shares for this file, and let H be the happiness threshold read from tahoe.cfg.
2. Construct a server map for this file by asking all connected servers which shares they have. (In the case of a mutable file, construct a server map for the latest retrievable version.)
3. Construct a maximum matching M : server -> share, of size |M|, for this file (preferring to include servers that are earlier on the permuted list when there is a choice).
4. While |M| < N, and we have not tried to put shares on all connected servers:
   - pick a share not in M, and the server not in M that is next on the permuted list, wrapping around if necessary. Try to extend M by putting that share onto that server.
5. Place any remaining shares on servers that are already in the map (don't count these in |M|).
6. If the file is not retrievable, report that the repair failed completely. If k <= |M| < H, report that the file is retrievable but unhealthy. In any case report what |M| is.

The while loop should be done in parallel, with up to N - |M| outstanding requests.

Upload would work in the same way (for the general case where there may be existing shares), except that it would fail if it is not possible to achieve |M| >= H.
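
The repair procedure above can be sketched sequentially in Python. This is an illustration under stated assumptions, not the Tahoe-LAFS implementation: `maximum_matching`, `repair`, and the `put_share(server, share)` callback (returning True when a server accepts a share) are hypothetical names, the permuted order is assumed to be the order in which peers responded in the log, and the while loop runs one request at a time rather than in parallel. The server map uses the `alreadygot` sets from the ticket's log.

```python
def maximum_matching(servermap, permuted):
    """Server -> share matching that favours servers early in `permuted`."""
    owner = {}  # share -> server currently matched to it

    def augment(server, seen):
        for share in sorted(servermap.get(server, ())):
            if share in seen:
                continue
            seen.add(share)
            # Use a free share, or re-route the server that holds it.
            if share not in owner or augment(owner[share], seen):
                owner[share] = server
                return True
        return False

    for server in permuted:  # once matched, a server stays matched
        augment(server, set())
    return {server: share for share, server in owner.items()}


def repair(servermap, permuted, k, N, H, put_share):
    """Return (status, |M|) after the matching, extension and reporting steps."""
    existing = set().union(*servermap.values())
    M = maximum_matching(servermap, permuted)        # maximum matching step
    missing = [s for s in range(N) if s not in M.values()]
    untried = [srv for srv in permuted if srv not in M]

    # Extension loop: try each unmatched server once, in permuted order.
    while len(M) < N and untried:
        server = untried.pop(0)
        if put_share(server, missing[0]):
            M[server] = missing.pop(0)

    # Leftover shares go to servers already in M; |M| does not change.
    targets = list(M)
    for i, share in enumerate(missing):
        if targets:
            put_share(targets[i % len(targets)], share)

    # Reporting: fewer than k distinct existing shares means unrecoverable.
    if len(existing) < k:
        return "failed", len(M)
    return ("healthy" if len(M) >= H else "unhealthy"), len(M)


# alreadygot sets from the log; response order stands in for the permuted list.
servermap = {"47cslusc": set(), "vjqcroal": {0, 3}, "sn4ana4b": {1}, "yuzbctlc": {2}}
order = ["47cslusc", "vjqcroal", "sn4ana4b", "yuzbctlc"]
print(repair(servermap, order, k=2, N=4, H=4,
             put_share=lambda server, share: True))  # ('healthy', 4)
```

In this run the initial matching has size 3, and the single extension step puts share 3 on the one server that holds nothing, which reaches the happiness threshold of 4. An upload variant would raise an error instead of returning "unhealthy".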

[edit: numbered the steps]

Last edited at 2012-09-29T20:47:46Z by davidsarah (previous) (diff)

comment:13 Changed at 2011-10-01T04:27:53Z by zooko

The algorithm David-Sarah proposes in comment:12 sounds fine to me.

comment:14 Changed at 2011-10-13T17:05:29Z by warner

Milestone changed from 1.9.0 to 1.10.0

not making it into 1.9

comment:15 Changed at 2012-03-25T18:55:00Z by zooko

Owner changed from nobody to kevan

Kevan: would the algorithm from your master's thesis solve this ticket? Would it be compatible with, or equivalent to, the algorithm that David-Sarah proposed in comment:12?

comment:16 Changed at 2012-09-29T20:50:08Z by davidsarah

I just thought of another wrinkle: the initial servermap in step 2 may contain shares with leases that are about to expire. The repairer should attempt to renew any leases on shares that are still needed, and only then (once it knows which renew operations succeeded) decide which new or replacement shares need to be stored.

comment:17 Changed at 2013-02-15T03:50:21Z by davidsarah

The comment:12 algorithm would fix #699. Note that in the case where there are existing shares that don't contribute to the maximum matching found in step 3, those shares (which are redundant if the repair is successful) will not be deleted. However, any redundant shares would not have their leases renewed.

comment:18 Changed at 2013-06-27T17:11:31Z by daira

Description modified (diff)

Step 5 in the comment:12 algorithm isn't very specific about where the remaining shares are placed. I can think of two possibilities:

a) continue the loop in step 4, i.e. place in the order of the permuted list with wrap-around.

b) sort the servers by the number of shares they have at that point (breaking ties in some deterministic way) and place on the servers with fewest shares first.
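
The two options can be compared with a small Python sketch. The function names, the `start` offset (the position in the permuted list where the step 4 loop stopped), and tie-breaking by server name are all assumptions made for illustration. Option (b) is read here as sorting once and then placing in that order. In both cases the server list passed in is expected to contain only servers already in the map.

```python
from itertools import cycle


def place_in_permuted_order(remaining, permuted, start=0):
    """Option (a): continue along the permuted list, wrapping around."""
    rotated = permuted[start:] + permuted[:start]
    return dict(zip(remaining, cycle(rotated)))


def place_fewest_first(remaining, share_counts):
    """Option (b): servers with the fewest shares first, ties broken by name."""
    ordered = sorted(share_counts, key=lambda server: (share_counts[server], server))
    return dict(zip(remaining, cycle(ordered)))


# Hypothetical state: two leftover shares and three servers already in the map.
print(place_in_permuted_order([2, 3], ["s1", "s2", "s3"], start=2))
print(place_fewest_first([2, 3], {"s1": 2, "s2": 1, "s3": 1}))
```

With these inputs option (a) gives share 2 to s3 and share 3 to s1, while option (b) gives share 2 to s2 and share 3 to s3, which avoids the server that already holds two shares.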