Changeset 9acf5be in trunk


Ignore:
Timestamp:
2012-05-16T23:55:09Z (13 years ago)
Author:
Brian Warner <warner@…>
Branches:
master
Children:
cc36690
Parents:
5ec2076
git-author:
Brian Warner <warner@…> (2012-05-16 23:50:43)
git-committer:
Brian Warner <warner@…> (2012-05-16 23:55:09)
Message:

immutable repairer: populate servers-responding properly

If a server did not respond to the pre-repair filecheck, but did respond
to the repair, that server was not correctly added to the
RepairResults?.dataservers-responding? list. (This resulted from a
buggy usage of DictOfSets?.union() in filenode.py).

In addition, servers to which filecheck queries were sent, but did not
respond, were incorrectly added to the servers-responding list
anyawys. (This resulted from code in the checker.py not paying attention
to the 'responded' flag).

The first bug was neatly masked by the second: it's pretty rare to have
a server suddenly start responding in the one-second window between a
filecheck and a subsequent repair, and if the server was around for the
filecheck, you'd never notice the problem. I only spotted the smelly
code while I was changing it for IServer cleanup purposes.

I added coverage to test_repairer.py for this. Trying to get that test
to fail before fixing the first bug is what led me to discover the
second bug. I also had to update test_corrupt_file_verno, since it was
incorrectly asserting that 10 servers responded, when in fact one of
them throws an error (but the second bug was causing it to be reported
anyways).

Location:
src/allmydata
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • TabularUnified src/allmydata/immutable/checker.py

    r5ec2076 r9acf5be  
    748748        corruptshare_locators = [] # (serverid, storageindex, sharenum)
    749749        incompatibleshare_locators = [] # (serverid, storageindex, sharenum)
     750        servers_responding = set() # serverid
    750751
    751752        for verified, server, corrupt, incompatible, responded in results:
     
    758759            for sharenum in incompatible:
    759760                incompatibleshare_locators.append((server_id, SI, sharenum))
     761            if responded:
     762                servers_responding.add(server_id)
    760763
    761764        d['count-shares-good'] = len(verifiedshares)
     
    781784            d['count-unrecoverable-versions'] = 1
    782785
    783         d['servers-responding'] = list(servers)
     786        d['servers-responding'] = list(servers_responding)
    784787        d['sharemap'] = verifiedshares
    785788        # no such thing as wrong shares of an immutable file
  • TabularUnified src/allmydata/immutable/filenode.py

    r5ec2076 r9acf5be  
    120120                    sm.update(ur.sharemap)
    121121                    servers_responding = set(prr.data['servers-responding'])
    122                     servers_responding.union(ur.sharemap.iterkeys())
    123                     prr.data['servers-responding'] = list(servers_responding)
     122                    for shnum, serverids in ur.sharemap.items():
     123                        servers_responding.update(serverids)
     124                    servers_responding = sorted(servers_responding)
     125                    prr.data['servers-responding'] = servers_responding
    124126                    prr.data['count-shares-good'] = len(sm)
    125127                    good_hosts = len(reduce(set.union, sm.itervalues(), set()))
  • TabularUnified src/allmydata/test/no_network.py

    r5ec2076 r9acf5be  
    6868        def _call():
    6969            if self.broken:
     70                if self.broken is not True: # a counter, not boolean
     71                    self.broken -= 1
    7072                raise IntentionalError("I was asked to break")
    7173            if self.hung_until:
     
    300302        return ss
    301303
    302     def break_server(self, serverid):
     304    def break_server(self, serverid, count=True):
    303305        # mark the given server as broken, so it will throw exceptions when
    304         # asked to hold a share or serve a share
    305         self.wrappers_by_id[serverid].broken = True
     306        # asked to hold a share or serve a share. If count= is a number,
     307        # throw that many exceptions before starting to work again.
     308        self.wrappers_by_id[serverid].broken = count
    306309
    307310    def hang_server(self, serverid):
  • TabularUnified src/allmydata/test/test_repairer.py

    r5ec2076 r9acf5be  
    171171        self.failUnless(data['count-shares-expected'] == 10, data)
    172172        self.failUnless(data['count-good-share-hosts'] == 9, data)
    173         self.failUnless(len(data['servers-responding']) == 10, data)
     173        self.failUnless(len(data['servers-responding']) == 9, data)
    174174        self.failUnless(len(data['list-corrupt-shares']) == 0, data)
    175175
     
    707707        return d
    708708
     709    def test_servers_responding(self):
     710        self.basedir = "repairer/Repairer/servers_responding"
     711        self.set_up_grid(num_clients=2)
     712        d = self.upload_and_stash()
     713        # now cause one of the servers to not respond during the pre-repair
     714        # filecheck, but then *do* respond to the post-repair filecheck
     715        def _then(ign):
     716            ss = self.g.servers_by_number[0]
     717            self.g.break_server(ss.my_nodeid, count=1)
     718            self.delete_shares_numbered(self.uri, [9])
     719            return self.c0_filenode.check_and_repair(Monitor())
     720        d.addCallback(_then)
     721        def _check(rr):
     722            # this exercises a bug in which the servers-responding list did
     723            # not include servers that responded to the Repair, but which did
     724            # not respond to the pre-repair filecheck
     725            prr = rr.get_post_repair_results()
     726            expected = set(self.g.get_all_serverids())
     727            self.failUnlessEqual(expected, set(prr.data["servers-responding"]))
     728        d.addCallback(_check)
     729        return d
    709730
    710731# XXX extend these tests to show that the checker detects which specific
Note: See TracChangeset for help on using the changeset viewer.