<div dir="ltr"><div>Hey all,<br></div><div><br></div><div>Over the years, I've been putting together various UNIX hacks in the name of security and backup and I wanted to share a few of them and possibly join forces with other like-minded individuals. A BSD/Linux user of a couple decades, I am also an Apple fan. Hopefully you will find some of the tweaks interesting and maybe there can be cause to co-hack or collaborate.</div>
<div><br></div><div>I'm sure you're all familiar with the principle of least privilege; in that spirit, I've set up a practical backup configuration that stays secure even if a third-party provider is compromised. For one, I'm stably running a TrueCrypt volume on top of an encrypted sparse bundle, and that sparse bundle is set to back up to a cheap external backup provider. I omit unneeded metadata from the backup, such as the token file, and instead pool all metadata into a separate secure archive which is then also backed up. Even if the external provider is compromised, or my provider-specific private key becomes known to them, they will only see encrypted data, and without the token they would have a hard time even brute-forcing the data to check whether a given passphrase works. I can then leverage the advantages of a given provider, and can even give friends updated band files to back up redundantly in their spare space without any worry of confidential information being shared. I realise this involves different tradeoffs and additional overhead compared with, say, Tahoe-LAFS, since I'm using an intermediate "staging" server, but my setup includes a server purposed for exactly that and the overhead doesn't affect me much.</div>
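<div><br></div><div>To make that concrete, here is a minimal sketch of the sparse-bundle side. The size, paths, and provider host name are hypothetical, and the exact metadata files worth withholding (the token file among them) will depend on your setup:</div><div><br></div><pre>
#!/bin/sh
# Create an encrypted sparse bundle on OS X (prompts for a passphrase);
# the TrueCrypt volume then lives inside the mounted bundle.
hdiutil create -size 200g -type SPARSEBUNDLE -fs HFS+J \
    -encryption AES-256 -volname SecureBackup ~/SecureBackup.sparsebundle

# Pool the bundle's metadata (token file included) into a separate encrypted archive.
tar czf - -C ~/SecureBackup.sparsebundle Info.plist Info.bckup token \
    | gpg --symmetric --cipher-algo AES256 -o ~/meta/sparsebundle-meta.tar.gz.gpg

# Push only the band files to the cheap provider; withhold the metadata.
rsync -az --delete --exclude token --exclude Info.plist --exclude Info.bckup \
    ~/SecureBackup.sparsebundle/ backup@provider.example.com:bundles/
</pre>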
<div><br></div><div>I think there are a few assumptions we can make about people's data. For one, metadata can be assumed to be much smaller (<<) than the data itself, so things like timestamps and checksums can be included in an encrypted archive meant for easy restoration. Likewise, frequently used/modified data << the majority of the data. And configuration data, private keychains, GnuPG and SSH keys, some root files, and in general the user's most important files are << in size compared with the rest. So why not ensure that the most critical data is the most widely spread geographically in case something goes wrong? It is also possible to make file backup decisions based upon derivation information for the source of acquired files, so that a video or source code file that lives at a stable web location can be considered "reproducible" and thus not as critical to back up individually. As an extension, it could be possible to actively trace and map out the derivation of files, so that a binary file is known to originate from a hierarchy of dependent source files.</div>
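<div><br></div><div>A rough sketch of acting on those assumptions (the paths are hypothetical and the checksum tool varies by system): the small critical set goes into one encrypted archive for wide geographic replication, while "reproducible" files only need a manifest:</div><div><br></div><pre>
#!/bin/sh
# Bundle the small-but-critical set: keys, keychains, dotfile configuration.
tar czf - ~/.gnupg ~/.ssh ~/Library/Keychains ~/.config \
    | gpg --symmetric --cipher-algo AES256 \
          -o ~/critical/critical-$(date +%Y%m%d).tar.gz.gpg

# For files that can be re-fetched from a stable location, record checksums
# so only this small manifest needs to be spread widely, not the data itself.
find ~/media ~/src -type f -exec shasum -a 256 {} + > ~/critical/reproducible.sha256
</pre>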
<div><br></div><div>What about a staging server that deliberately sits on the local network and performs deduplication privately for a group (family, friends, connected researchers, people behind the same intranet -- thus without exposing which files have been pooled/shared) and then, after compression and encryption, forwards the data on to be backed up? Or a very practical compromise of just storing relatively inert data encrypted on hard drives, archived in a vault in a geographically distant location? Incremental updates can then be combined, the overhead cost is very little beyond the hard drive itself, and recovery has a short turnaround of a day or two. The staging server can synchronise itself very quickly with any locally connected devices thanks to network proximity, and presumably internal bandwidth wouldn't need to be paid for; at the same time the server would be "staging" in the sense that incremental high-priority updates are pushed to the cloud and offsite during the evenings after some postprocessing (which can include further compression or incremental delta calculation if the intermediate data is available as plaintext before it is encrypted for offsite storage).</div>
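<div><br></div><div>On the staging box itself, a bare-bones version of that flow might look like the following. Host names, users, and the directory layout are placeholders; deduplication and delta calculation would slot in between the pull and the encrypted push, since the staging copy is still plaintext at that point:</div><div><br></div><pre>
#!/bin/sh
# Pull changes from machines on the LAN -- fast and presumably unmetered.
rsync -az --delete alice@laptop-a.local:Documents/ /staging/alice/Documents/
rsync -az --delete bob@laptop-b.local:Documents/ /staging/bob/Documents/

# Evening job: postprocess, encrypt, then forward the high-priority set offsite.
tar czf - -C /staging alice bob \
    | gpg --symmetric --cipher-algo AES256 \
          -o /staging/outgoing/$(date +%Y%m%d).tar.gz.gpg
rsync -az /staging/outgoing/ offsite@backup.example.com:incoming/
</pre>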
<div><br></div><div>I thought it would be interesting to implement a meta-{data backup} solution that learns what resources are available (network bandwidth and its cost, free space and reliability of nearby drives, overall budget, use cases for local systems, possible staging servers) and then uses those resources automatically in an efficient and clever manner. A lot of very useful metadata can fit within 2GB and can be redundantly backed up to various reliable, geographically diverse sites (most likely for free or at very low cost). A nearby university can offer very cheap storage for archived data that is rarely accessed. There are various backup sources and sinks, and these can be configured to achieve an optimal result overall. There is also a great deal to tweak and configure by default. What if an over-arching UI could be provided that automatically creates and maintains the needed accounts and forwards payment as required -- in effect a universal interface to backup functionality? This is just one idea, but suffice it to say that there is already a lot of unneeded complexity and diversity across backup solutions and resources; the flexibility is there, and based upon overall high-level guiding principles, the parameters can perhaps be categorised such that the flexibility is hidden under a sensible and unified interface.</div>
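<div><br></div><div>A toy sketch of the "learn the resources, then dispatch" idea -- the sink list, its fields, and the capacity check are entirely made up, and a real version would also weigh bandwidth, cost, and reliability:</div><div><br></div><pre>
#!/bin/sh
# sinks.conf, one sink per line:  name|destination|max_megabytes
#   uni-archive|archive@uni.example.org:cold/|500000
#   friend-nas|mike@friend-nas.example.net:spare/|200000
staged_mb=$(du -sm /staging/outgoing | awk '{print $1}')
while IFS='|' read -r name dest max_mb; do
    if [ "$staged_mb" -le "$max_mb" ]; then
        echo "dispatching staged data to $name"
        rsync -az /staging/outgoing/ "$dest"
    fi
done < sinks.conf
</pre>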
<div><br></div><div>Also, for an archive of a keyfile I make use of (threshold) secret sharing, which in my opinion is an under-utilised paradigm. Although it can take twice the storage, for a set of critical data like password files the overhead is insignificant in comparison with a full backup. The basic idea is E_K1(data + R), E_K2(R), where R is randomly generated like a one-time pad and + is XOR. Those two halves can then be kept separate. The statistical properties of each half are highly random and uncorrelated, and even if someone obtains both halves, they in principle need both keys in their entirety to decrypt the data. As abstract and impractical as this may sound, FreeBSD's GEOM layer offers a gshsec implementation which I use: I combine two GELI-encrypted devices under a shared secret and form a filesystem on top.</div>
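<div><br></div><div>For anyone who wants to try it, the GEOM commands look roughly like this; the device names are whatever your two providers happen to be, and each geli init prompts for its own passphrase:</div><div><br></div><pre>
#!/bin/sh
# Encrypt the two providers independently (separate passphrases).
geli init -s 4096 /dev/da1
geli init -s 4096 /dev/da2
geli attach /dev/da1    # yields /dev/da1.eli
geli attach /dev/da2    # yields /dev/da2.eli

# Combine the encrypted providers into one shared-secret device: each half
# alone is statistically random, and both are needed to reconstruct the data.
gshsec label -v secret /dev/da1.eli /dev/da2.eli

# Put a filesystem on top and mount it.
newfs /dev/shsec/secret
mount /dev/shsec/secret /mnt/secret
</pre>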
<div><br></div><div>On the hardware side, I'm a big fan of the 2TB My Passport Studio drive (what a small, elegant little drive that needs no additional power source -- it can even be a boot volume on OS X, and via FireWire it can be disconnected while a laptop sleeps) and of a series of reliable $120 refurbished netbooks, and I'm considering combining a Raspberry Pi or similar with that drive to create a local file sync and staging server that continuously backs up to multiple geographically redundant backup providers (while automatically connecting to local wired/wireless networks and presenting itself as a server for various protocols).</div>
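<div><br></div><div>If the Pi idea pans out, the core of it is little more than a periodic job like this (hosts, mount points, and schedule are placeholders; the protocol-serving part would come from whatever AFP/SMB/mDNS daemons end up installed):</div><div><br></div><pre>
#!/bin/sh
# Runs from cron on the Pi every few minutes.
# Sync from the laptop whenever it is reachable on the local network;
# a nightly job then pushes the staged changes offsite, as sketched earlier.
if ping -c 1 -q laptop.local >/dev/null 2>&1; then
    rsync -az laptop.local:Documents/ /mnt/passport/staging/Documents/
fi
</pre>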
<div><br></div><div>I hope you'll find some of this intriguing or interesting.</div><div><br></div><div>Mike</div></div>