[tahoe-dev] Running Tahoe on ARM plugs

Fri Feb 18 09:20:35 PST 2011

Forwarding to the list as Zooko thought some of this might be of
interest to the general population.

----- Forwarded message from Jack Lloyd <lloyd at randombit.net> -----

From: Jack Lloyd <lloyd at randombit.net>
To: Zooko O'Whielacronx <zooko at zooko.com>
Date: Fri, 18 Feb 2011 11:00:29 -0500
Subject: Re: want a SheevaPlug?

Hi Zooko,

It appears that this lil guy has hardware AES-128 support [1] that
reportedly more than doubles AES performance [2] (along with DES, MD5,
and SHA-1, none of which seem relevant to Tahoe), but unfortunately it
appears that the crypto processor is only accessible from kernel mode
(though I haven't read enough of the spec sheet to completely confirm
this). I assume a /dev/crypto implementation like [3] or [4] would
make it accessible to userspace, but am not sure if special per-device
support is required in the crypto mapper, or if they can use any
random in-kernel crypto driver that happens to be running. (Also as
far as I know none of the various /dev/crypto implementations for
Linux are in the mainline kernel, so the distro, or the user, would
have to patch them in - and the number of people who are going to roll
their own custom patched kernel is pretty small compared to the number
of people who might theoretically want to run Tahoe on a plug).

I was really hoping there was NEON support included, as AES, XSalsa20,
and SHA-256 could all (at least in theory) be helped by SIMD (though I
believe NEON is a significantly weaker SIMD ISA than, say, AltiVec or
SSE2/SSSE3). But in any case the SheevaPlug CPU seems to be scalar
only.

For tree hashing (or bulk hashing in general of independent streams), a
2 to 4 way interleaved SHA-256 compression function might help, even
on a scalar CPU (I've seen good wins on PowerPC for interleaving of
this form, for instance). But a SHA-256 compression function takes ~10
registers (8 chaining, 2 temps), and ARM only has 16 visible
registers, so any gains in hiding instruction latency might well be
lost to register spills. I assume a modern ARM has a larger physical
register file than this, and does register renaming (as modern x86
does), but I couldn't find anything to confirm this for any ARM, or
the Feroceon (Sheeva) core in particular. Nor do I currently know what
the instruction latencies or cache timings look like. So currently
there are too many variables to even guess whether it might work, and
as usual with these things, the easiest approach is probably to just
try it and measure.
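
As a rough illustration of the interleaving idea, here is a minimal
sketch in portable C that runs two independent SHA-256 compressions in
lockstep, so that each round of one stream can overlap with the same
round of the other. The names sha256_compress_x2 and
sha256_x2_selftest are made up for this sketch, and whether a given
compiler keeps both working states in registers on ARM (rather than
spilling) is exactly the open question, so treat it as something to
benchmark, not as a known win.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SHA-256 round constants (FIPS 180-4). */
static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
    0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
    0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
    0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
    0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
    0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define BSIG0(x) (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
#define BSIG1(x) (ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25))
#define SSIG0(x) (ROTR(x, 7) ^ ROTR(x, 18) ^ ((x) >> 3))
#define SSIG1(x) (ROTR(x, 17) ^ ROTR(x, 19) ^ ((x) >> 10))
#define CH(e, f, g) (((e) & (f)) ^ (~(e) & (g)))
#define MAJ(a, b, c) (((a) & (b)) ^ ((a) & (c)) ^ ((b) & (c)))

/* One SHA-256 round on the 8-word working state v (v[0]=a ... v[7]=h). */
#define ROUND(v, k, w) do {                                               \
    uint32_t t1 = (v)[7] + BSIG1((v)[4]) + CH((v)[4], (v)[5], (v)[6])     \
                  + (k) + (w);                                            \
    uint32_t t2 = BSIG0((v)[0]) + MAJ((v)[0], (v)[1], (v)[2]);            \
    (v)[7] = (v)[6]; (v)[6] = (v)[5]; (v)[5] = (v)[4];                    \
    (v)[4] = (v)[3] + t1;                                                 \
    (v)[3] = (v)[2]; (v)[2] = (v)[1]; (v)[1] = (v)[0];                    \
    (v)[0] = t1 + t2;                                                     \
} while (0)

/* Load a 64-byte block as 16 big-endian words. */
static void load_be(uint32_t w[16], const uint8_t b[64])
{
    int i;
    for (i = 0; i < 16; i++)
        w[i] = ((uint32_t)b[4 * i] << 24) | ((uint32_t)b[4 * i + 1] << 16) |
               ((uint32_t)b[4 * i + 2] << 8) | (uint32_t)b[4 * i + 3];
}

/* Compress block b0 into state s0 and block b1 into state s1.
 * The two streams are independent; their rounds alternate in the source. */
void sha256_compress_x2(uint32_t s0[8], const uint8_t b0[64],
                        uint32_t s1[8], const uint8_t b1[64])
{
    uint32_t w0[64], w1[64], v0[8], v1[8];
    int i;

    load_be(w0, b0);
    load_be(w1, b1);
    for (i = 16; i < 64; i++) {
        w0[i] = SSIG1(w0[i - 2]) + w0[i - 7] + SSIG0(w0[i - 15]) + w0[i - 16];
        w1[i] = SSIG1(w1[i - 2]) + w1[i - 7] + SSIG0(w1[i - 15]) + w1[i - 16];
    }
    memcpy(v0, s0, sizeof v0);
    memcpy(v1, s1, sizeof v1);
    for (i = 0; i < 64; i++) {
        ROUND(v0, K[i], w0[i]);   /* stream 0, round i */
        ROUND(v1, K[i], w1[i]);   /* stream 1, round i */
    }
    for (i = 0; i < 8; i++) {
        s0[i] += v0[i];
        s1[i] += v1[i];
    }
}

/* Check both lanes against the standard digests of "abc" and "". */
int sha256_x2_selftest(void)
{
    static const uint32_t iv[8] = {
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 };
    static const uint32_t abc[8] = {
        0xba7816bf, 0x8f01cfea, 0x414140de, 0x5dae2223,
        0xb00361a3, 0x96177a9c, 0xb410ff61, 0xf20015ad };
    static const uint32_t empty[8] = {
        0xe3b0c442, 0x98fc1c14, 0x9afbf4c8, 0x996fb924,
        0x27ae41e4, 0x649b934c, 0xa495991b, 0x7852b855 };
    uint8_t b0[64] = {0}, b1[64] = {0};
    uint32_t s0[8], s1[8];

    b0[0] = 'a'; b0[1] = 'b'; b0[2] = 'c'; b0[3] = 0x80;
    b0[63] = 24;                      /* message length in bits */
    b1[0] = 0x80;                     /* padded empty message */
    memcpy(s0, iv, sizeof s0);
    memcpy(s1, iv, sizeof s1);
    sha256_compress_x2(s0, b0, s1, b1);
    assert(memcmp(s0, abc, sizeof s0) == 0);
    assert(memcmp(s1, empty, sizeof s1) == 0);
    return 1;
}
```

Timing this against two back-to-back calls of a plain one-stream
compression function on the actual plug would show whether the
interleaving hides latency or just causes spills.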

Just writing the SHA-256 compression function in straight asm might
improve things, though that will depend on how good the compilers
are. When I first wrote a SHA-1 compression function in x86 asm, it
was ~50% faster than what GCC was producing for the corresponding C++.
With GCC 4.5, the asm is 25% slower than the C++. I have no idea how
well GCC optimizes on ARM, so it is hard to say how much room for
improvement there might be.

Probably writing some of the core integer operation loops in ARM asm
would help RSA/ECDSA - OpenSSL has some asm for this with the comment
that "The code was observed to provide +65-35% improvement [depending
on key length, less for longer keys] on ARM920T, and +115-80% on Intel
IXP425." (however Andy Polyakov is a much better asm programmer than I
am, so I would be quite surprised if I could match that improvement).
And the structure of Crypto++'s Integer algorithms is a little
difficult for me to follow, though I do see where one could plug in
inline ARM asm for multiply-adds. And ARMs with the M extension have a
32x32+32->64 multiply-add operation which, at least according to some
(undated) slides I just read [5], is not generated by compilers. The
Sheeva is an ARMv5TE, and (according to Wikipedia) "E-variants also
imply T,D,M and I [architecture extensions]." so that's one thing that
would work, at least.
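
For reference, the kind of inner loop where such a multiply-add would
apply looks roughly like the following in portable C. This is only a
sketch: the name mul_add_words and the 32-bit limb layout are
assumptions made for illustration, not Crypto++'s or OpenSSL's actual
code, and whether a compiler maps the 64-bit expression onto the long
multiply-accumulate instruction is the thing one would have to check
in the generated asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r[0..n) += a[0..n) * b, returning the carry word out of the top limb.
 * Limbs are 32-bit, least significant first.
 * a[i]*b + r[i] + carry <= (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1,
 * so the 64-bit temporary cannot overflow. */
uint32_t mul_add_words(uint32_t *r, const uint32_t *a, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* 32x32 multiply plus two 32-bit addends, 64-bit result */
        uint64_t t = (uint64_t)a[i] * b + r[i] + carry;
        r[i] = (uint32_t)t;              /* low word stays in place */
        carry = (uint32_t)(t >> 32);     /* high word moves to next limb */
    }
    return carry;
}

/* Worst-case check: all-ones operands must not lose the carry. */
int mul_add_selftest(void)
{
    uint32_t r[2] = { 0xFFFFFFFFu, 0 };
    const uint32_t a[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };
    uint32_t carry = mul_add_words(r, a, 2, 0xFFFFFFFFu);

    assert(r[0] == 0 && r[1] == 0);
    assert(carry == 0xFFFFFFFFu);
    return 1;
}
```

A hand-written version would replace the body of the loop with inline
asm and keep the same interface, which makes it easy to compare the
two under the same benchmark.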

Perhaps some of the DSP instructions could be used to speed up the FEC
encoding/decoding? I haven't investigated this area at all yet.

So the short answer is that while I'm willing to give it a shot, I don't
see many obvious/easy opportunities for huge wins, and wanted to set
expectations accordingly.

-Jack

[1] http://www.marvell.com/products/processors/embedded/kirkwood/FS_88F6180_9x_6281_OpenSource.pdf
[2] http://smorgasbord.gavagai.nl/2010/02/sheevaplug-hardware-crypto/
[3] http://www.logix.cz/michal/devel/cryptodev/
[4] http://home.gna.org/cryptodev-linux/index.html
[5] http://www.simplemachines.it/doc/arm_inst.pdf

On Fri, Feb 18, 2011 at 01:13:03AM -0700, Zooko O'Whielacronx wrote:
> http://tahoe-lafs.org/pipermail/tahoe-dev/2011-February/006134.html
> 
> Just tell me that you intend to optimize some of the crypto that we
> need for ARM, or to experiment with the performance of hash-based
> crypto on ARM and report back. No obligation on your part except for
> sincere intent. :-)
> 
> Regards,
> 
> Zooko

----- End forwarded message -----