[volunteergrid2-l] Recommended settings

Marco Tedaldi marco.tedaldi at gmail.com
Tue Jun 28 21:37:11 PDT 2011


On 29.06.2011 03:16, Shawn Willden wrote:
> On Mon, Jun 27, 2011 at 10:27 PM, Marco Tedaldi <marco.tedaldi at gmail.com>wrote:
> 
>>> For best file reliability, you want to get your shares dispersed as
>> widely
>>> as possible.  For the VG2 grid right now, that's 10 shares.  So I think
>> you
>>> should set shares-total to 10.
>>
>> which also increases space usage and bandwith usage while uploading.
>>
> 
> Indeed it does.  If that's a concern, you can reduce the expansion by
> increasing K -- but if H (shares-happy) isn't enough larger than K then you
> may not have enough redundancy to be sure you can get your files later.  But
> if you set H == N so you have good redundancy, you may not have good write
> availability.
> 
Then it becomes a question: do I prefer a backup with substandard
availability, or no backup at all...

> In all cases, I think it makes sense to try to distribute shares to as many
> servers as possible, so set N to the number of nodes in the grid.  Then you
> can choose H and K to pick your tradeoffs between write availability, read
> availability and upload time.
> 
Ok, this sounds reasonable.
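For concreteness, these three parameters live in the [client] section of
tahoe.cfg. The values below are just Tahoe's stock 3-of-10 defaults, shown
purely as an illustration, not a recommendation for any particular grid:

```ini
# tahoe.cfg -- erasure-coding parameters (illustrative values only)
[client]
shares.needed = 3    # K: shares required to reconstruct a file
shares.happy = 7     # H: distinct servers an upload must reach to succeed
shares.total = 10    # N: total shares produced (expansion factor = N/K)
```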

> BTW, the complexity of this analysis is one of the reasons that I would like
> to see Tahoe move away from having the user specify H, K and N.  Instead,
> I'd like to see Tahoe offer the user the ability to choose read availability
> probability and then dynamically compute N and K (H would disappear) based
> on the available nodes in the grid and their estimated (or assumed)
> reliabilities.  I envision a "Tahoe won't lose my files probability" slider
> with an adjacent "expansion factor" field that changes as you move the
> slider back and forth.  I think users could make better decisions about the
> part of the tradeoff they care about.
> 
Nice idea. It would also be nice if the config options were not fixed
values but variables.
Then I could set N to "100%" (of nodes), H to "90%", and K to maybe 45% or
something similar. The estimated availability of the nodes would not enter
into this calculation, but it would make it easier to keep up with the
changing size of the grid.
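As a rough sketch of that idea (hypothetical code, not anything Tahoe
implements): scale K, H, and N from the current grid size, using the
percentages above as defaults, while keeping the ordering K <= H <= N that
Tahoe requires:

```python
def params_from_percentages(grid_size, n_pct=1.00, h_pct=0.90, k_pct=0.45):
    """Derive (k, h, n) from the current number of nodes in the grid."""
    n = max(1, round(grid_size * n_pct))
    h = max(1, round(grid_size * h_pct))
    k = max(1, round(grid_size * k_pct))
    # enforce the ordering k <= h <= n that Tahoe requires
    h = min(h, n)
    k = min(k, h)
    return k, h, n

print(params_from_percentages(10))  # (4, 9, 10) on a 10-node grid
```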

> But I haven't cared enough to actually implement anything like that :-)  I
> did care enough to do a lot of the mathematical modeling to lay the
> groundwork, but stopped there.
> 
Nice! I did not even manage to understand that stuff.
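The core of that modeling is presumably a binomial tail: if each of the N
servers holding a share is independently reachable with probability p, a
file is readable as long as at least K of them are up. A minimal sketch
(independence and a uniform p are simplifying assumptions):

```python
from math import comb

def read_availability(n, k, p):
    """P(at least k of n servers are up, each independently up with prob p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With 3-of-10 encoding and 90%-reliable servers, the file is almost
# certainly recoverable; with 10-of-10 every single server must be up.
print(read_availability(10, 3, 0.9))   # ~0.9999996
print(read_availability(10, 10, 0.9))  # ~0.35
```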

>> If
>>> you set shares-happy to be large (perhaps 10), then you sometimes might
>> not
>>> be able to write a file... but you'll maximize your chances of being able
>> to
>>> read it later.
>>>
>> Or we set files needed low enough which increases reliability but also
>> space and bandwith use.
>>
> 
> Right.  You can lower H to increase write availability at the expense of
> reducing read availability, then you can reduce K to recover read
> availability at the expense of increasing expansion.
> 
A typical tradeoff situation...
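To put a number on the expansion side of the tradeoff: the storage and
upload overhead is simply N/K, so lowering K buys read availability at a
directly computable price (hypothetical helper function):

```python
def expansion_factor(k, n):
    """Bytes stored (and uploaded, without a helper) per byte of plaintext."""
    return n / k

print(expansion_factor(3, 10))  # ~3.33x for 3-of-10
print(expansion_factor(5, 10))  # 2.0x for 5-of-10
```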

> BTW, there is one way to reduce bandwidth use:  Use a helper.  I think we
> have one or two.  This would be especially good for you since you have a
> relatively slow upstream connection.
> 
Oh, that's a nice idea. And the data would only have to cross the ocean
once (which is quite a minor issue, I think)... If only I could set up a
Tahoe node at work... the 100 Mbit connection of my computer is the
limiting factor there :-)... maybe, if I set it to use port 22...

> When your node is configured to use a helper, it does the encryption of the
> file locally and then uploads it to the helper, which does the erasure
> coding and delivers the shares to the storage nodes for you.  That way the
> expanded use of bandwidth is done by the helper, which presumably has a fast
> network connection.  This doesn't change storage consumption, obviously, but
> it does partially work around your low bandwidth.
> 
So I could set up a node that acts only as a helper at my workplace (I can
connect there via VPN anyway) and use this to distribute the data? Sounds
nice to me.
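Configuration-wise that should be two small edits, if I read the docs
correctly (the FURL below is a placeholder for the one your grid's helper
actually publishes):

```ini
# tahoe.cfg on the helper node
[helper]
enabled = true

# tahoe.cfg on the uploading client -- point uploads at the helper
[client]
helper.furl = pb://<placeholder-tubid>@helper.example.org:12345/helper
```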

>> Yes, very important!  Of course, we've chosen to set the expiration time
>> at
>>> one year (that might be revised downward in the future, but we decided to
>>> start conservatively), so you shouldn't have to worry too much about it.
>>
>> I've been wondering about that. It for sure reduces the maintenance
>> overhead for the client but might greatly increase the wasted space if
>> there is short lived data around.
>>
> 
> Yes.  If the storage nodes start getting full, one of the first things we'll
> do is ask everyone to lower their expiration timeout some.
> 
Ok... that seems a reasonable solution.
How often should I run a tahoe check on my data? I was thinking of a daily
run. Or is that a waste of bandwidth? Is a weekly check enough?
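Whatever interval one settles on, the check itself can be scheduled with
cron; `tahoe deep-check --add-lease` recursively checks everything under a
directory and renews the leases while it's at it. The weekly schedule and
the `backup:` alias below are just assumptions:

```shell
# crontab entry: deep-check the backup directory every Sunday at 03:00
0 3 * * 0  tahoe deep-check --add-lease backup:
```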

>>> what interfaces are you using anyway?
>>>
>>> The HTTP API.
>>>
>> Ok.
>> Is this recommended?
> 
> 
> The HTTP API is the recommended API for tools to use to talk to a Tahoe
> node.  Writing software that constantly spawned shells to run the tahoe
> command-line tool would be painful, and the HTTP API gives you the ability
> to use a Tahoe node on a different machine, basically for free.
> 
Nice. One thing I plan to take a look at is the Python FUSE interface.
But the first thing will be to get a simple backup running.
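A minimal sketch of talking to the HTTP API from Python, assuming a node
whose web gateway listens on the default 127.0.0.1:3456 (endpoint names as
in the webapi documentation; untested against a live node here):

```python
import urllib.request

BASE = "http://127.0.0.1:3456"

def upload(data: bytes) -> str:
    """PUT bytes as a new immutable file; the response body is its readcap."""
    req = urllib.request.Request(BASE + "/uri", data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("ascii")

def download(cap: str) -> bytes:
    """Fetch a file back by its capability string."""
    with urllib.request.urlopen(BASE + "/uri/" + cap) as resp:
        return resp.read()
```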

> 
>> I personally think that the command line tool looks
>> quite nice (when I could just wrap my head around why I need aliases.
> 
> 
> Aliases just keep you from having to remember/type long rootcap strings.
> 
Which is a good thing :-)
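Typical usage looks something like this (command names from the CLI docs;
`backup:` is an arbitrary alias name):

```shell
tahoe create-alias backup      # creates a new directory, names its rootcap
tahoe backup ~/Documents backup:Documents
tahoe ls backup:
```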

> 
>> Is
>> this a bit like different filesystems insid tahoe?)
>>
> 
> I'm not sure what you mean by this question.
> 
You already answered my question. I thought it was like creating several
independent filesystems (a bit like partitioning a hard disk).

> 
>> When I use tahoe backup what would I need to restore the data in case of
>> a failure with full data loss? Or better: What data do I need to backup
>> outside of tahoe to be able to restore my data from there?
>> Is it adviseable to use some other online storage (loke dropbox or
>> wuala) for these data?
>>
> 
> You need your rootcap -- the actual URI, not your alias.  That's it.  It's a
> very good idea to make a copy and put it somewhere safe and secure.  If you
> put it in dropbox or something, I'd encrypt it first because it's the key to
> all of your data.  Another common recommendation is to print a copy on a
> piece of paper and store it somewhere safe.
> 
So the rootcap stays the same all the time? I could print it out and put
it into a safe without needing to update it every time I do a backup?
This sounds almost too easy to be true :-)
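For reference, the alias-to-rootcap mapping is kept in the node's private
directory, typically ~/.tahoe/private/aliases; that one small file (or a
printout of it) is what needs to survive a disaster. The entry below is a
placeholder:

```ini
# ~/.tahoe/private/aliases -- one "name: rootcap" pair per line
backup: URI:DIR2:<writecap placeholder>
```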

So before asking a lot more stupid questions, I'll go dig through the
documentation...

Thanks for all the help!

Marco

PS: Sorry for the downtime yesterday... there were network issues in my
home network.


More information about the volunteergrid2-l mailing list