[tahoe-dev] big data usability question

Brian Warner warner at lothar.com
Fri Dec 9 01:03:46 UTC 2011


(slowly catching up on interesting mail from my queue)

On 11/5/11 10:06 AM, Raoul Duke wrote:
...
> when you have small data, you can study it every time you want to do
> something to it
...
> you have a f* ton of data and it takes rsync 3 hours to figure out
> what is different, let alone to start the transfer of actual content
> bits
...
> like, how do i have a meaningful, happy, rewarding, informative,
> correct, useful, etc. user experience when I try to rsync from s3
> over and back and among openstack-swift or backblaze or whatever else?

Yeah, that's a hard spot to be in. We used to be in this spot all the
time, but most of us (myself included) are too young to remember:
imagine writing a compiler and needing to make it one-pass instead of
two-pass because that second pass requires someone to rewind the tape
first. Or to take the deck of punch-cards out of the machine and reload
them.

My general thought is: everything is a process. Rather than knowing that
your backup is complete, you treat the backup as a continuous background
thing. Instead of asking questions like "how long will the backup take
to finish", you ask "given a file that's X minutes old, what is the
chance that it's present in the backup?".
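
For a rough intuition (just a back-of-the-envelope sketch in Python,
not anything the backup tool actually computes): if the crawler makes
one complete pass every pass_hours and is equally likely to be anywhere
in that pass right now, then:

    # probability that a file which appeared age_hours ago has already
    # been visited by a crawler that completes a full pass every
    # pass_hours
    def prob_backed_up(age_hours, pass_hours=24.0):
        return min(age_hours / pass_hours, 1.0)

    print(prob_backed_up(6.0))   # 24-hour pass, 6-hour-old file: 0.25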

Data loss is also a process, measured as a probability distribution of
the lifetime of any given piece of data. RAID, erasure-coding,
replication, everything Tahoe does and more, these just stretch out that
distribution, for a given amount of work (usually measured in dollars
per unit time). Instead of a mean-time-between-failure of 10 years, you
might get 100 years. The general goal is to reduce the chance of failure
within your lifetime (or career, or job expectancy :) to below some
tolerable level without spending too much money on it.
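
To make that concrete with some made-up numbers (the per-share survival
probability below is purely illustrative, and this is a sketch rather
than anything Tahoe computes): with k-of-N erasure coding, a file is
lost only when fewer than k of its N shares survive, which pushes the
loss probability far below that of a single copy:

    from math import comb

    def p_loss_single(s):
        # a single copy is lost when it doesn't survive the year
        return 1.0 - s

    def p_loss_erasure(s, k, n):
        # lost only if fewer than k of the n shares survive the year
        return sum(comb(n, i) * s**i * (1 - s)**(n - i) for i in range(k))

    s = 0.90                         # hypothetical per-share annual survival
    print(p_loss_single(s))          # 0.10
    print(p_loss_erasure(s, 3, 10))  # ~3.7e-07: same data, much longer lifetime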

So, for Big Data, you think in terms of crawlers, spiders, programs that
walk the data at a manageable rate, learn what's changed, update remote
copies, test and repair shares, etc. rsync itself is not a great tool
for this unless you know the directory structure ahead of time and only
run rsync on a small piece of it at a time (something which can complete
in less than an hour or so).
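
As a sketch of the crawler idea (the paths and the one-minute throttle
are made up, and it assumes the destination tree already exists): walk
the tree and hand rsync one small leaf directory at a time, instead of
pointing it at the whole thing:

    import os
    import subprocess
    import time

    SRC = "/data/archive"                  # hypothetical local tree
    DST = "backup@example.com:/archive"    # hypothetical remote

    for dirpath, dirnames, filenames in os.walk(SRC):
        if not filenames:
            continue
        rel = os.path.relpath(dirpath, SRC)
        dest = DST if rel == "." else DST + "/" + rel
        # copy only this directory's files; os.walk will reach the
        # subdirectories on later iterations
        subprocess.run(["rsync", "-a", "--exclude=*/",
                        dirpath + "/", dest + "/"], check=True)
        time.sleep(60)   # throttle: one small piece per minute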

It also helps a lot to make as much of the datastore immutable as
possible: that reduces the amount of crawling you have to do to find out
what's changed. Things like append-only datasets, where once one day's
worth of data has been added to a directory, you make a new directory
for the next day and never modify the old one again. Also, look for ways
to grow things at O(log(N)) instead of O(N), e.g. one directory per
year, one below that per month, day, hour, etc.
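
Something like this (hypothetical root path) captures that layout: each
day's data lands in a fresh YYYY/MM/DD directory that is never touched
again, so a crawler only has to re-examine the newest branches:

    import os
    from datetime import datetime, timezone

    def todays_dir(root="/data/archive"):
        now = datetime.now(timezone.utc)
        path = os.path.join(root, "%04d" % now.year,
                            "%02d" % now.month, "%02d" % now.day)
        os.makedirs(path, exist_ok=True)
        return path

    # new records are only ever written under todays_dir(); yesterday's
    # directory (and everything older) is immutable from then on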


hope that's useful,
 -Brian

