#1182 assigned defect

clean up and improve asymptotic complexity of Spans and DataSpans

Reported by: davidsarah Owned by: zooko
Priority: normal Milestone: soon
Component: code Version: 1.8β
Keywords: performance cleanup unfinished-business immutable download Cc:
Launchpad Bug:

Description (last modified by davidsarah)

Asympotic efficiency problems in src/allmydata/util/spans.py were partly responsible for the regression in download performance discussed in #1170. See:

We changed the downloader so that fewer spans were retained in Spans and DataSpans? objects, but this limits their potential for reuse, and we haven't checked whether this could still affect the downloader in some situations.

Here is a copy of ticket:798#comment:18 :

Reviewing cbcb728e7ea0031d:

  • the (start, length) form of the [Simple]Spans constructor is not used outside test code (and the test code can be changed to pass a [(start, length)] array). Removing this would slightly simplify the constructor and avoid a possibly error-prone overloading.
  • in the Spans class comment, ", frequently used to represent .newsrc contents" is out-of-context and not needed.
  • in the _check method of Spans, if assertions are switched on then the self._spans array is re-sorted in order to check whether it is ordered. This is unnecessary: if you add an assert length > 0, length in the loop, then the loop will be checking a condition that is stronger than the array being ordered, given that the starts and lengths are numbers. (The sorting actually takes O(n) time rather than O(n log n) on each call to _check, because Timsort will detect the ordered input, but it's still unnecessary overhead.)
  • the assert statements should include the variables they use in the exception message, e.g. assert start > prev_end, (start, prev_end).
  • "overlap(s_start, s_length, start, length) or adjacent(s_start, s_length, start, length)" is equivalent to "overlap(s_start, s_length, start-1, length+2)".
  • in the only other use of adjacent (in DataSpans.add), only the start0 < start1 case should be able to occur. Inline and simplify.
  • in the loop over enumerate(self._spans), you could exit early when s_start > start+length. At that point you know where to insert the (start, length) span without having to re-sort.
  • a Spans object behaves somewhat like a set of the elements in all of its spans, but the __contains__ and __iter__ methods are not consistent with that (instead viewing it as a set of (start, length) pairs). I realize this may allow for more convenient use of in and iteration, but it should at least be documented.
  • _check and assert_invariants do similar things; give them the same name.
  • DataSpans._dump is poorly named.
  • DataSpans.assert_invariants should check that none of the data strings are zero-length.
  • is it intentional that DataSpans.add calls self.assert_invariants() but remove (and pop, although that's much simpler) don't?
  • if s_start <= start < s_end: I find this Python construct too clever by half. Anyway, at this point s_start <= start is always true (otherwise we won't get past the first continue), so I would write this as
    assert s_start <= start, (s_start, start)
    if start < s_end:
    
  • Perhaps rename s_* to old_*.
  • DataSpans.add: if I understand correctly, case A also covers:
        OLD      OLD    OLD    OLD
    NEW       NEW      NEWW   NEEWW
    
    This isn't immediately clear from the comment. Depicting it as
        OLD
    NEW.....
    
    might help. Also, use uppercase consistently for the cases.
  • suffix_len in case D has a different meaning to suffix_len in case E. Maybe rename it to replace_len for case D.

Tests:

  • The Spans class is tested by ByteSpans, but it just stores spans of integers, not necessarily byte offsets. I would suggest s/ByteSpans/TestSpans/ and s/StringSpans/TestDataSpans/.
  • I could be wrong, but I don't think the deterministic tests are covering all cases in add and remove. I'd prefer to see those tests have full coverage rather than relying on the randomized tests to make up the gap.

Change History (6)

comment:1 Changed at 2010-08-20T00:19:46Z by zooko

trunk/misc/simulators/bench_spans.py@4700 could be useful for measuring your success on this ticket. It reads in a trace file, such as run-112-above28-flog-dump-sh8-on-nsziz.txt and exercise a DataSpans object in the way the trace file indicates.

This patch: debuggery-trace-spans.dpatch.txt was used to generate that trace file with 1.8.0c2 (which had an issue with superlinear behavior of Share._received):

Please see the docs of pyutil/benchutil.py for an explanation of the meaning of the measurements output by bench_spans.py.

comment:2 Changed at 2010-09-11T00:17:35Z by davidsarah

  • Description modified (diff)
  • Keywords cleanup added
  • Summary changed from improve asymptotic complexity of Spans and DataSpans to clean up and improve asymptotic complexity of Spans and DataSpans

#1196 is a duplicate.

comment:3 Changed at 2010-09-11T00:18:46Z by davidsarah

  • Keywords unfinished-business immutable download added

comment:4 Changed at 2011-01-04T06:57:27Z by zooko

  • Owner changed from somebody to zooko
  • Status changed from new to assigned

comment:5 Changed at 2011-08-13T23:25:55Z by davidsarah

  • Milestone changed from 1.9.0 to soon

comment:6 Changed at 2012-04-01T05:10:24Z by davidsarah

  • Priority changed from major to normal
Note: See TracTickets for help on using tickets.