#66 closed task (duplicate)

peers failure tolerance

Reported by: lvo Owned by: lvo
Priority: minor Milestone: 0.7.0
Component: documentation Version: 0.6.0
Keywords: Cc:
Launchpad Bug:

Description

- Linux ubuntu4lu 2.6.20-15-server #2 SMP Sun Apr 15 07:41:34 UTC 2007 i686 GNU/Linux
- # allmydata-tahoe --version Twisted version: 2.5.0

In my testing, I have 1 introducer and 3 clients. I uploaded a large file (450MB) via peer #3. (Btw, it only works with peer #3 because it has 1GB RAM; the others have 512MB and could not handle the large file.)

If I take down BOTH peer #1 and peer #2, I can still download the file. If I take down peer #3, the file starts to download but cannot be completed.

Is there any predictable way to know what the failure tolerance is? It would also be good for the client to know ahead of time that it cannot provide a complete file, based on the information polled from all other peers.

Thanks. Lu

Change History (8)

comment:1 Changed at 2007-06-29T18:30:02Z by zooko

  • Owner changed from somebody to lvo

Dear lvo:

This is a very good question.

If I (or Brian Warner) answer this question, will you submit a patch to the relevant docs which will make the answer apparent to the next person who comes after you, who is wondering the same thing?

Simplistically, the current failure tolerance is 3-out-of-4. If fewer than 3/4 of your servers fail, then you'll almost certainly be able to get your data back. If more than 3/4 of your servers fail, then you'll almost certainly not be able to. If exactly 3/4 of your servers fail, then it depends. :-)

Thanks,

Zooko

comment:2 Changed at 2007-07-03T19:44:25Z by lvo

Thanks Zooko. I will. Is that your answer, or is it just a simplistic answer, with a more detailed one to come from you or Brian? :-)

I have since added a 4th peer to my setup, and you are correct: if 3/4 of the servers fail and the remaining 1/4 is NOT the peer that was originally used for the upload, then the file is lost. Otherwise the file is intact.

Lu

comment:3 Changed at 2007-07-04T01:28:52Z by warner

So to be precise, we're using 25-out-of-100 encoding by default, so what matters is whether you are able to retrieve at least 25 shares. The shares are assigned to peers according to the tahoe three algorithm, which will distribute them evenly only in the limit as the number of peers you have is much larger than 100.

Imagine a clock face, with 100 marks evenly spaced around the edge: these represent the shares. Now choose a random location for each peer (these represent the permuted peerlist, in which each peerid is hashed together with the per-file storage index). Each share travels clockwise until it hits a peer. That's it. You can see that for some files, two peers will wind up very close to each other, in which case one of them will get a lot more shares than the other. If there are lots of peers, this tends to be a bit more uniform, but if you only have 3 peers, then the distribution will be a lot less uniform.
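The clock-face description above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Tahoe uses: the function names, the choice of SHA-256 as the hash, and the way a hash is turned into a position on the ring are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NUM_SHARES = 100  # total shares per file, per the 25-out-of-100 default above


def ring_position(peerid: bytes, storage_index: bytes) -> float:
    """Map a (peerid, storage index) pair to a point in [0, 1) on the ring.

    Assumption: SHA-256 stands in for whatever hash the real code uses.
    """
    digest = hashlib.sha256(peerid + storage_index).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign_shares(peerids, storage_index, num_shares=NUM_SHARES):
    """Return a Counter mapping each peerid to the number of shares it gets."""
    ring = sorted((ring_position(p, storage_index), p) for p in peerids)
    positions = [pos for pos, _ in ring]
    counts = Counter({p: 0 for p in peerids})
    for share in range(num_shares):
        mark = share / num_shares  # evenly spaced marks around the edge
        # First peer at or clockwise of the mark; wrap around past the end.
        idx = bisect_left(positions, mark) % len(ring)
        counts[ring[idx][1]] += 1
    return counts


if __name__ == "__main__":
    peers = [b"peer1", b"peer2", b"peer3"]  # hypothetical peer ids
    for name in (b"file-A", b"file-B"):     # hypothetical storage indexes
        print(name, dict(assign_shares(peers, name)))
```

Running it with a handful of peers and a few different storage indexes shows how the per-peer counts change from file to file, since each storage index produces a different permutation of the peers.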

Also note that each file you upload gets a different mapping, so if you upload a few hundred equally-sized files and then compare all the peers, you should see them all hosting about the same amount of data. But if you only upload one file, you'll see very non-uniform distribution of shares.

So in very small networks, it is not easy to predict how many (or which) peers need to be alive to provide for any individual file.

It is probably the case that the tahoe two algorithm provides more uniform allocation of shares, even in small networks. Ticket #16 is a request to explain and justify our choice of tahoe three over tahoe two: I suspect that this non-uniform allocation of shares is an argument to move back to tahoe two.

When a file is downloaded, the very first thing source:src/allmydata/download.py does is to ask around and find out who has which shares. If it cannot find enough, you get an immediate NotEnoughPeersError.
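The "ask around, then give up early" check can be pictured roughly as follows. This is a hypothetical sketch, not the contents of download.py: the exception class is a local stand-in with the same name, and `check_recoverable` and the share layout are made up for the example. The only facts taken from this ticket are that 25 distinct shares are needed and that too few shares produces an immediate error.

```python
SHARES_NEEDED = 25  # from the 25-out-of-100 default described above


class NotEnoughPeersError(Exception):
    """Local stand-in for the error named above; not imported from Tahoe."""


def check_recoverable(shares_by_peer, alive_peers, needed=SHARES_NEEDED):
    """Return the set of distinct share numbers reachable via alive_peers.

    shares_by_peer maps a peer name to the set of share numbers it holds.
    Raises NotEnoughPeersError if fewer than `needed` distinct shares exist.
    """
    reachable = set()
    for peer in alive_peers:
        reachable |= shares_by_peer.get(peer, set())
    if len(reachable) < needed:
        raise NotEnoughPeersError(
            f"found {len(reachable)} distinct shares, need {needed}"
        )
    return reachable


if __name__ == "__main__":
    # Hypothetical, deliberately lopsided layout on a three-peer mesh.
    layout = {
        "peer1": set(range(0, 12)),
        "peer2": set(range(12, 20)),
        "peer3": set(range(20, 100)),
    }
    print(len(check_recoverable(layout, ["peer3"])))  # peer3 alone suffices
    try:
        check_recoverable(layout, ["peer1", "peer2"])
    except NotEnoughPeersError as err:
        print("cannot download:", err)
```

With a lopsided layout like this one, a single peer can hold enough shares by itself while the other two together cannot, which is why the answer depends on which peers are alive rather than just how many.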

comment:4 Changed at 2007-07-04T01:31:07Z by warner

  • Component changed from code to documentation
  • Priority changed from major to minor
  • Version changed from 0.2.0 to 0.4.0

Oh, also note that when counting peers, your own host is just as valuable a peer as all the others. So if you join a mesh that already has three clients, your own machine is a fourth, and on average each client (including your own) will wind up holding 25% of the total shares for anything you upload. That means that your own machine, all by itself, should be sufficient (!!!on average!!!) to recover any files you've uploaded. But of course the non-uniformity of share distribution probably gives you a 50/50 chance of success.
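One way to get a feel for how often a single machine ends up with enough shares is a small Monte Carlo experiment. The sketch below is an assumption-laden toy: it places peers at uniformly random ring positions (standing in for the hashed permutation), uses the clock-face rule described earlier, and counts how often the first peer holds at least 25 of the 100 shares. The function names are invented for the example, and the result says nothing about the real share placement code.

```python
import random
from bisect import bisect_left


def shares_for_first_peer(num_peers, rng, num_shares=100):
    """Count the shares that land on peer 0 for one random ring layout."""
    positions = [rng.random() for _ in range(num_peers)]
    own = positions[0]          # treat peer 0 as "your own machine"
    ring = sorted(positions)
    count = 0
    for share in range(num_shares):
        # Each share travels clockwise from its mark to the next peer.
        idx = bisect_left(ring, share / num_shares) % num_peers
        if ring[idx] == own:
            count += 1
    return count


def estimate_self_sufficiency(num_peers, trials=2000, needed=25, seed=0):
    """Estimate the fraction of files for which peer 0 alone holds `needed` shares."""
    rng = random.Random(seed)
    hits = sum(
        shares_for_first_peer(num_peers, rng) >= needed for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for n in (3, 4, 10, 1000):
        print(n, "peers:", estimate_self_sufficiency(n, trials=500))
```

Trying several mesh sizes shows how quickly the "my own machine is enough" property fades as peers are added: the argument above is about a four-client mesh, and it does not carry over to a mesh with many more peers than shares.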

comment:5 Changed at 2007-07-05T00:04:53Z by lvo

I really appreciate the detailed explanation, Warner. However, I am still unclear on this point: "That means that your own machine, all by itself, should be sufficient (!!!on average!!!) to recover any files you've uploaded". If my machine is only 1 out of 1000 peers, and if the file I upload is divided into ~100 shares (which I understand to be what you refer to as segments), and the shares are distributed starting at a random location and going around the rim, how would my machine manage to hold most or all of those shares?

Thanks. Lu

comment:6 Changed at 2007-09-18T20:26:54Z by zooko

Dear Lu:

In the imminent release of Tahoe v0.6, we have fixed the distribution of shares onto a small number of peers to be more even.

comment:7 Changed at 2007-09-25T04:31:42Z by zooko

  • Resolution set to duplicate
  • Status changed from new to closed

So with v0.6 the behavior is better, but not until ticket #92 is done will the user be able to *see* what the behavior is. Merging this ticket into ticket #92.

comment:8 Changed at 2007-09-25T04:31:49Z by zooko

  • Milestone set to 0.7.0
  • Version changed from 0.4.0 to 0.6.0