[tahoe-lafs-trac-stream] [tahoe-lafs] #2106: RAIC behaviour different from RAID behaviour
tahoe-lafs
trac at tahoe-lafs.org
Thu Nov 14 21:03:27 UTC 2013
#2106: RAIC behaviour different from RAID behaviour
----------------------+------------------------
Reporter: sickness | Owner:
Type: defect | Status: new
Priority: normal | Milestone: 1.11.0
Component: code | Version: 1.10.0
Keywords: | Launchpad Bug:
----------------------+------------------------
Let's assume we have a local RAID5 set of 4 identical disks attached to a
controller inside a computer.[[BR]]
This RAID5 level guarantees that if we lose 1 of the 4 disks, we can continue
not only to read but also to write to the set, albeit in degraded mode.[[BR]]
When we replace the failed disk with a new one, the RAID takes care of
repairing the set by syncing the data in the background, and the 4th disk gets
populated again with chunks of our valuable data (not only parity, because we
know that in RAID5 parity is striped, but explaining that is beyond the scope
of this ticket).[[BR]]
starting condition:[[BR]]
DISK1[chunk1] DISK2[chunk2] DISK3[chunk3] DISK4[chunk4] [[BR]]
broken disk:[[BR]]
DISK1[chunk1] DISK2[chunk2] DISK3[chunk3] DISK4[XXXXXX][[BR]]
new disk is put in place:[[BR]]
DISK1[chunk1] DISK2[chunk2] DISK3[chunk3] DISK4[ ][[BR]]
repair rebuilds DISK4's chunk of data by reading the other 3 disks:[[BR]]
DISK1[chunk1] DISK2[chunk2] DISK3[chunk3] DISK4[chunk4][[BR]]
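Roughly, for a single stripe the rebuild is just XOR arithmetic. Here is a
purely illustrative Python sketch (not real RAID code; the chunk values are
made up) of how the lost chunk is recomputed from the surviving three:[[BR]]
{{{
# Illustrative sketch only: for one RAID5 stripe the parity chunk is the XOR
# of the data chunks, so any single lost chunk can be rebuilt from the others.
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

chunk1, chunk2, chunk3 = b"AAAA", b"BBBB", b"CCCC"   # hypothetical data chunks
chunk4 = xor_blocks(chunk1, chunk2, chunk3)          # parity chunk on DISK4

# DISK4 dies; the replacement disk rebuilds its chunk from the other three.
rebuilt_chunk4 = xor_blocks(chunk1, chunk2, chunk3)
assert rebuilt_chunk4 == chunk4

# The same works for a lost data chunk, e.g. DISK1:
rebuilt_chunk1 = xor_blocks(chunk2, chunk3, chunk4)
assert rebuilt_chunk1 == chunk1
}}}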
Now let's assume we have a tahoe-lafs RAIC set of 4 identical servers on a
LAN.[[BR]]
To mimic the RAID5 behaviour we configure it to write 4 shares for every
file, needing only any 3 of them to successfully read the file.[[BR]]
So in this way we have a RAIC that should behave like a RAID5.[[BR]]
We can lose any 1 of these 4 servers and still be able to read the data, and
repair it afterwards.[[BR]]
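For reference, a 3-of-4 encoding like this corresponds to something along these
lines in the [client] section of tahoe.cfg (shares.happy, the number of distinct
servers an upload must reach, is shown here as an assumed choice, not something
fixed by the scenario):[[BR]]
{{{
[client]
# sketch of a 3-of-4 encoding as described above
shares.needed = 3
shares.happy = 4
shares.total = 4
}}}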
But what happens if we actually lose 1 of those 4 servers and then try to
read/repair the data? Or maybe even write new data?[[BR]]
We will end up having ALL 4 shares on just 3 servers. When we rebuild the 4th
server and put it back online, even repairing will not put shares on it,
because the file will be seen as already healthy. But now what if we lose the
one server which actually holds 2 shares of the same file?[[BR]]
starting condition:[[BR]]
SERV1[share1] SERV2[share2] SERV3[share3] SERV4[share4][[BR]]
broken server:[[BR]]
SERV1[share1] SERV2[share2] SERV3[share3] SERV4[XXXXXX][[BR]]
new data is written, or a scheduled repair runs, and we get to this
situation:[[BR]]
SERV1[share1,share4] SERV2[share2] SERV3[share3] SERV4[XXXXXX][[BR]]
new server is put in place:[[BR]]
SERV1[share1,share4] SERV2[share2] SERV3[share3] SERV4[ ] [[BR]]
Now if we try to repair, the situation remains the same, because as of now the
repairer DOESN'T know that it has to actually rebalance share4 onto SERV4; it
just tells us the file is healthy.[[BR]]
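To make that distinction concrete, here is a rough sketch (purely illustrative,
not the actual checker code; the share map is the one from the diagram above)
of counting distinct shares versus counting distinct servers:[[BR]]
{{{
# Illustrative only -- not Tahoe's checker logic.
# share_map: share number -> set of servers currently holding that share.
share_map = {
    1: {"SERV1"},
    2: {"SERV2"},
    3: {"SERV3"},
    4: {"SERV1"},   # share4 landed on SERV1 during the degraded write/repair
}
k, n = 3, 4          # need any 3 of 4 shares to read

distinct_shares  = sum(1 for servers in share_map.values() if servers)
distinct_servers = len(set().union(*share_map.values()))

print(distinct_shares == n)   # True  -> every share exists, so "healthy"
print(distinct_servers)       # 3     -> but losing SERV1 alone leaves only 2 shares
}}}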
We can still read and write data, so far so good, right?[[BR]]
But what if SERV1 now suddenly breaks?[[BR]]
SERV1[XXXXXX] SERV2[share2] SERV3[share3] SERV4[ ] [[BR]]
ok we can replace it:[[BR]]
SERV1[ ] SERV2[share2] SERV3[share3] SERV4[ ] [[BR]]
OK, now we have a problem: how can we rebuild if we need 3 of the 4 shares but
have just 2, even though we previously had 4 servers and the file was listed
as "healthy" by the repairer?[[BR]]
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2106>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage