wiki:WikiStart
close Warning: Can't synchronize with repository "(default)" (Unsupported version control system "darcs": Can't find an appropriate component, maybe the corresponding plugin was not enabled? ). Look in the Trac log for more information.

Version 5 (modified by zooko, 15 years ago) (diff)

edit

StringChain

Sometimes you want to accumulate data from some source while at the same time processing the data that arrived first. The naive way to do it in Python is like this:

def __init__(self):
    self.accum = '' # Will hold all unprocessed bytes

def add_data(self, some_more_data):
    # some_more_data is a string
    self.accum += some_more_data

def process_some(self, how_much):
    some = self.accum[:how_much]
    del self.accum[:how_much]

This work fine as long as the total amount of bytes accumulated and the number of separate add_data() events stay small, but it has O(N2) behavior and has bad performance if those numbers get large. Here are some benchmarks generated by running python -OOu -c 'from stringchain.bench import bench; bench.quick_bench()' as instructed by the README.txt file:

_buildup init_naive
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    890, ave rate: 58350579
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    265, ave rate: 34800398
N:  262144, time: best:    0.01,   2th-best:    0.01, ave:    0.01,   2th-worst:    0.01, worst:    0.01 (of      5), reps/s:     79, ave rate: 20745346
N:  524288, time: best:    0.05,   2th-best:    0.05, ave:    0.05,   2th-worst:    0.05, worst:    0.05 (of      5), reps/s:     20, ave rate: 10823850
_buildup init_strch
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:  25543, ave rate: 1674043282
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:  14179, ave rate: 1858538925
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   8016, ave rate: 2101513050
N:  524288, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   4108, ave rate: 2154215572
_consume init_naive_loaded
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    931, ave rate: 61037862
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    270, ave rate: 35454393
N:  262144, time: best:    0.01,   2th-best:    0.01, ave:    0.01,   2th-worst:    0.01, worst:    0.01 (of      5), reps/s:     74, ave rate: 19471963
N:  524288, time: best:    0.05,   2th-best:    0.05, ave:    0.05,   2th-worst:    0.05, worst:    0.06 (of      5), reps/s:     19, ave rate: 10146747
_consume init_strch_loaded
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   4309, ave rate: 282447500
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   2313, ave rate: 303263357
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:   1186, ave rate: 311159052
N:  524288, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    606, ave rate: 317814669
_randomy init_naive
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    479, ave rate: 31450561
N:  131072, time: best:    0.01,   2th-best:    0.01, ave:    0.01,   2th-worst:    0.01, worst:    0.01 (of      5), reps/s:    140, ave rate: 18461191
N:  262144, time: best:    0.02,   2th-best:    0.02, ave:    0.02,   2th-worst:    0.03, worst:    0.03 (of      5), reps/s:     42, ave rate: 11127714
N:  524288, time: best:    0.06,   2th-best:    0.07, ave:    0.08,   2th-worst:    0.08, worst:    0.09 (of      5), reps/s:     13, ave rate:  6906341
_randomy init_strch
N:   65536, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    973, ave rate: 63827127
N:  131072, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    495, ave rate: 64970669
N:  262144, time: best:    0.00,   2th-best:    0.00, ave:    0.00,   2th-worst:    0.00, worst:    0.00 (of      5), reps/s:    239, ave rate: 62913360
N:  524288, time: best:    0.01,   2th-best:    0.01, ave:    0.01,   2th-worst:    0.01, worst:    0.01 (of      5), reps/s:    121, ave rate: 63811569

The naive approach is slower than the StringChain library, and the bigger the dataset the slower it goes. The StringChain library is scalable (with regard to these benchmarks at least...).

Okay how do you use it? It is very simple -- see stringchain/stringchain.py and let me know if that interface doesn't fit your use case.

Starting Points

For a complete list of local wiki pages, see TitleIndex.