[tahoe-dev] pycryptopp vs ARM

Brian Warner warner-tahoe at lothar.com
Wed Jun 3 04:22:23 PDT 2009

I did some digging into the mysterious pycryptopp corruption failures on
Zandr's ARM-based linkstation. The executive summary: crypto++-5.5.2 is
broken, 5.6.0 probably fixes it, the problem won't appear on x86 or most
other processors, there's a workaround but it's too ugly to contemplate.

== The problem with crypto++-5.5.2 and ARM ==

The main workhorse of AES-CTR_Mode is in strciphr.cpp,
AdditiveCipherTemplate<S>::ProcessData. This is responsible for each call to
AES.ProcessData(), which means it is effectively generating a stream cipher
that looks like:

 16 bytes of AES.encode(00000) ("00000" means 128 bits of zeros)
 16 bytes of AES.encode(00001) (127 bits of zeros and a single "one" bit)
 16 bytes of AES.encode(00002)

and XORing that stream cipher with the input data to produce the output data.
Each call to ProcessData() can involve an arbitrary amount of data, but
always works with adjacent spans of the stream, so it is responsible for
keeping track of how many bytes have been processed already: an offset from
the beginning of the stream.
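The counter-mode construction described above can be sketched in Python. This is a toy illustration only: block_encrypt here is a SHA-256-based stand-in for AES.encode(), not real AES, and the function names are my own.

```python
import hashlib

BLOCKSIZE = 16  # AES's blocksize: the counter is one 16-byte block

def block_encrypt(key, counter_block):
    # Stand-in for AES.encode(): any keyed pseudorandom function of the
    # counter works to illustrate the CTR construction (NOT real AES).
    return hashlib.sha256(key + counter_block).digest()[:BLOCKSIZE]

def ctr_keystream(key, nbytes, start_block=0):
    """Generate nbytes of keystream, starting at block start_block."""
    out = b""
    block = start_block
    while len(out) < nbytes:
        out += block_encrypt(key, block.to_bytes(BLOCKSIZE, "big"))
        block += 1
    return out[:nbytes]

def ctr_process(key, data, offset=0):
    """Encrypt (or decrypt) data sitting at `offset` in the stream."""
    skip = offset % BLOCKSIZE
    ks = ctr_keystream(key, skip + len(data), offset // BLOCKSIZE)[skip:]
    return bytes(a ^ b for a, b in zip(data, ks))
```

Because encryption is just XOR against the keystream, processing data in two adjacent calls must give the same result as one big call -- exactly the invariant the ARM bug violates.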

It remembers this offset in two pieces. The high-order piece is the number of
16-byte blocks which have been encoded, stored in "m_counterArray", as a
128-bit/16-byte number (remember that we're using AES here, so the keysize
can be either 128 bits or 256 bits, but the blocksize is always 128
bits/16 bytes).

The low-order piece is stored backwards in "m_leftOver": upon entry to
ProcessData, if we're sitting at offset=0,16,32, then m_leftOver=0, but if
we're sitting at offset=1,17,33 then m_leftOver=15, and if we're at
offset=15,31,47 then m_leftOver=1.
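That inverted bookkeeping boils down to one expression (my restatement, not crypto++ code):

```python
BLOCKSIZE = 16

def left_over(offset):
    # bytes of the current keystream block still unused in m_buffer:
    # offset 0,16,32 -> 0; offset 1,17,33 -> 15; offset 15,31,47 -> 1
    return (BLOCKSIZE - offset % BLOCKSIZE) % BLOCKSIZE
```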

The AES object has a buffer named "m_buffer" which is used to hold the
leftover stream cipher blocks between calls to ProcessData. It always holds
exactly 16 bytes, and always updates these bytes as a unit. m_buffer[0] holds
the stream byte that corresponds to offset=0,16,32 . m_buffer[1] holds the
byte that is used for offset=1,17,33.

m_leftOver then tells you which of the right-most bytes in m_buffer[] are
waiting to be used. If m_leftOver=3, then m_buffer[13..15] are waiting to be
XORed to provide offset=13..15, or offset=29..31 .

ProcessData() has three phases:
 1: use up any leftover stream data sitting in m_buffer from last time
 2: process as many full 16-byte blocks as possible
 3: process a trailing partial block, if any, putting the leftover stream in
m_buffer

Phase 1 happens if m_leftOver>0 and works by just XORing the right offset in
m_buffer with the input data, and incrementing all the pointers. It does this
one byte at a time: not as efficient as you might like, but it's never being
used for more than 15 bytes per operation.

Phase 2 has two forms: some operations can handle multiple blocks at once very
efficiently (remember that ProcessData() is pretty generic and is used by
lots of different codepaths), so it sets up the underlying operation (i.e.
AES256) to encrypt-and-XOR a couple of blocks and says Go. The operation code
that is passed in to OperateKeystream() includes a note that says whether the
input and output string pointers are aligned, but many operations (including
modes.cpp:CTR_ModePolicy::OperateKeystream) ignore this note.

If the first form couldn't be used, it falls back to the second form, which
has a while(length>=bufferByteSize) loop and explicitly writes the keystream
into m_buffer, XORs it with the input string into the output string, and
increments all the pointers. This XOR loop will run one byte at a time if its
arguments aren't aligned.

Phase 3 (which only happens if the remaining length is nonzero) writes the
keystream into m_buffer, XORs just the partial block (one byte at a time),
then sets m_leftOver so that the next time through Phase 1 will use the
previously generated keystream. A given keystream block is only generated
once.

Note that Phase 1 and Phase 3 (and the second form of Phase 2, but not the
first) all use byte-at-a-time loops. Also note that Phase 2 is used for the
bulk of the data, so it wants to be as fast as possible.
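Here is a behavioral sketch of the three phases in Python, based on my reading of the code. The _generate_block PRF is a stand-in for AES, and this models only the m_buffer/m_leftOver bookkeeping, not crypto++ itself:

```python
import hashlib

BLOCKSIZE = 16

class CtrModel:
    """Behavioral model of AdditiveCipherTemplate::ProcessData."""
    def __init__(self, key):
        self.key = key
        self.block = 0        # high-order piece: blocks generated so far
        self.m_buffer = b"\x00" * BLOCKSIZE
        self.m_leftOver = 0   # unused bytes at the tail of m_buffer

    def _generate_block(self):
        # stand-in for AES.encode(counter); any PRF works for the model
        ctr = self.block.to_bytes(BLOCKSIZE, "big")
        self.block += 1
        return hashlib.sha256(self.key + ctr).digest()[:BLOCKSIZE]

    def process(self, data):
        out = bytearray()
        i = 0
        # Phase 1: consume leftover keystream bytes, one at a time
        while self.m_leftOver > 0 and i < len(data):
            out.append(data[i] ^ self.m_buffer[BLOCKSIZE - self.m_leftOver])
            self.m_leftOver -= 1
            i += 1
        # Phase 2: as many whole 16-byte blocks as possible
        while len(data) - i >= BLOCKSIZE:
            ks = self._generate_block()
            out += bytes(a ^ b for a, b in zip(data[i:i + BLOCKSIZE], ks))
            i += BLOCKSIZE
        # Phase 3: trailing partial block; save the unused keystream
        if i < len(data):
            self.m_buffer = self._generate_block()
            n = len(data) - i
            out += bytes(a ^ b for a, b in zip(data[i:], self.m_buffer[:n]))
            self.m_leftOver = BLOCKSIZE - n
        return bytes(out)
```

In this model, splitting the input across calls at any boundary gives the same output as one call -- the property the unaligned-write bug destroys.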

== The Gun On The Mantle In Act One ==

All full-size CPUs like big fat memory reads, for efficiency: they've got 32-
or 64- bit memory busses, and read whole cache lines at a time. They prefer
aligned memory reads: x=*((int32_t*)0x0), because unaligned reads like
x=*((int32_t*)0x1) or x=*((int32_t*)0x2) actually require the processor to
perform two reads (the first of 0x0-0x3, the second of 0x4-0x7) and then
shuffle bytes around to get them into the right place. Memory writes behave
the same way.

(microcontrollers, at least those with an 8-bit bus, don't care about
alignment. In fact, many of them don't even have 16- or 32- bit operations,
so the issue never comes up).

Some processors will begrudgingly perform unaligned reads for you, but
they'll be slow and grouchy about it. Some will throw a hardware exception,
which might be caught by the kernel and emulated (meaning it'll work, but now
the kernel is grouchy too, and things are really really slow), or might be
delivered as a SIGILL to your program (which will probably kill it).

The ARM processor has an exciting quirk. In certain configurations (which
depend upon what the kernel wants to do), it uses a third mode, in which
unaligned reads cause the specific byte to be loaded correctly but the
remaining bytes get shifted around. This effectively corrupts most of the
loaded word. http://www.aleph1.co.uk/chapter-10-arm-structured-alignment-faq
has a good writeup on the issue: their basic example is (remember these are
little-endian machines):

        Address: 0  1  2  3  4  5  6  7
        Value  : 10 21 66 23 ab 5e 9c 1d

  Using *(unsigned long*)2 would give:

        on x86: 0x5eab2366
        on ARM: 0x21102366

Similar things happen if you try to do an unaligned write: instead of
modifying bytes [2..5], you'll modify bytes [0..3] with some surprisingly
rotated version of what you were writing. Byte [2] might get the right thing,
but byte[0] will be clobbered.
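That load behavior is easy to simulate. The sketch below models the classic (pre-ARMv6) LDR semantics -- fetch the aligned word containing the address, then rotate right by 8 bits per byte of misalignment -- and reproduces the example above:

```python
def load32_x86(mem, addr):
    # x86: unaligned loads just work (two bus reads plus a byte shuffle)
    return int.from_bytes(mem[addr:addr + 4], "little")

def load32_arm_legacy(mem, addr):
    # classic ARM LDR: read the *aligned* word containing addr, then
    # rotate right by 8 bits per byte of misalignment
    base = addr & ~3
    word = int.from_bytes(mem[base:base + 4], "little")
    rot = 8 * (addr & 3)
    return ((word >> rot) | (word << (32 - rot))) & 0xFFFFFFFF

mem = bytes([0x10, 0x21, 0x66, 0x23, 0xab, 0x5e, 0x9c, 0x1d])
print(hex(load32_x86(mem, 2)))         # 0x5eab2366
print(hex(load32_arm_legacy(mem, 2)))  # 0x21102366
```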

== The Gun On The Mantle In Act Two ==

The function in Phase 2 which generates the stream data bottoms out in
rijndael.cpp's Rijndael::Enc::ProcessAndXorBlock, which takes three (byte *)
pointers (plus the internal key, already prepared and turned into subkeys):
the input block (i.e. the counter), the XOR block (i.e. the plaintext), and
the output block (i.e. the ciphertext). Most of the code in rijndael.cpp is
x86 assembly code, but when that's disabled (hint: use M-x
cpp-highlight-buffer to make emacs colorize or hide code inside specific
conditionals, like !defined(CRYPTOPP_X86_ASM_AVAILABLE) ) the part that's
left is the regular C AES implementation. It shuffles a bunch of data around
internally and then does something like this:

	word32 *const obw = (word32 *)outBlock;
	const word32 *const xbw = (const word32 *)xorBlock;

	obw[0] = tbw[0] ^ xbw[0] ^ rk[0];

That means it's taking the xorBlock plaintext pointer (a byte*), pretending
it's really a 4-byte (word32*), and then reading from it one word at a time.
If xorBlock was not actually a multiple of 4 bytes, then this will be an
unaligned read. The write is doing the same thing, using outBlock and casting
it to a word32* and then writing words into it, so you can get an unaligned
write from this. This doesn't occur on x86 because the assembly version gets
used instead of the C version (I assume that either the assembly handles
misalignment explicitly, or that the x86 behaves correctly in the face of
unaligned accesses).

== The Gun Is Fired In Act Three ==

The sequence of operations that triggers a pycryptopp failure on ARM is when
an AES encryption instance is used to process two chunks of data in which the
first chunk is less than 16 bytes long and the two chunks combined are at
least 32 bytes long. The python code looks something like this:

 expected = AES(key).process(plaintext)
 e2 = AES(key)
 got = e2.process(plaintext[:9]) + e2.process(plaintext[9:9+23])
 assert expected == got # fails

In particular, on ARM we see something like:
 09,23: 0551d7974ff1d9e29c 9d883746f146a6,ffc9ce5ecaa9abd890da470e46000000 <--GOT
 09,23: 0551d7974ff1d9e29c 9d883746080343,f15abafbc9ca5ac6a9a7eeaada790a42 <--EXPECTED

(where the space is the break between the 9-byte call and the 23-byte call,
and the comma is where the 16-byte blocksize lands).

We also see failures with other length combinations: 9+24, 10+22, 10+23,
10+24, 15+17, etc.
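Those combinations fit a simple predicate (my prediction from the mechanics described in this post: the first call must leave Phase 2's output pointer word-misaligned, and the second call must still have a full block left after Phase 1; 4-byte alignment of Python's string buffers is assumed):

```python
BLOCKSIZE = 16
WORD = 4  # Python string buffers are 4-byte aligned

def should_fail(i, j):
    """Predict whether process(p[:i]) + process(p[i:i+j]) corrupts output."""
    left_over = (BLOCKSIZE - i % BLOCKSIZE) % BLOCKSIZE
    if left_over == 0:
        return False      # Phase 2 starts at the aligned buffer start
    if j - left_over < BLOCKSIZE:
        return False      # Phase 2 never runs, only byte-wise loops
    # Phase 2's output pointer is buffer + left_over
    return left_over % WORD != 0

# matches the observed failures: 9+23, 9+24, 10+22, 10+23, 10+24, 15+17
```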

Let's look specifically at 9+23. The first call to process() will skip Phase
1 (since this is the first time we've used this object), will skip Phase 2
(since we're not processing a full block), and only run Phase 3. Phase 3 will
generate the stream data for offset=0..15 and store it in m_buffer, then XOR
the first 9 bytes of that to produce the ciphertext, and store it in the
output pointer. Note that pycryptopp/Python allocates a new string for each
call to ProcessData(), and those strings are always 4-byte aligned. Phase 3
sets m_leftOver to 7, since we've only used 9 bytes out of the 16 in
m_buffer. This works fine. The 9 bytes of ciphertext (0551d7974ff1d9e29c)
returned by process() are correct. The output buffer so far:

 09,23: 0551d7974ff1d9e29c

The second call to process() wants to process 23 bytes of plaintext. It
allocates a Python string for the result, let's pretend it gets address
0x1000 (it will generally be 4-byte aligned, so it could just as easily be at
0x1004 or 0x1008 or 0x100c). Phase 1 will see that m_leftOver=7, so it
byte-wise XORs the remaining stream data from m_buffer into the output and
gets 9d883746080343. So far, so good. Phase 1 finishes by incrementing the
output pointer, in this case 0x1000+7=0x1007, and decrementing the remaining
length to be processed, 23-7=16. The output buffer now has:

 09,23: 0551d7974ff1d9e29c 9d883746080343

Phase 2 wants to work with whole blocks, and since length>=16, it gets to
run. It calls OperateKeystream() and tells it to encrypt-and-XOR data using a
ciphertext target pointer (outblock) of... 0x1007. Here is the problem. That
32-bit cast and obw[0]= dereference causes an unaligned access, and on ARM
you wind up with writes to [0x1004-0x1007] instead of [0x1007-0x100a]. This
clobbers the bytes which were already written, giving us:

 09,23: 0551d7974ff1d9e29c 9d883746080343
                                          xxxxxxxx  <- expected write
                                   f146a6,ff   <- actual write
 09,23: 0551d7974ff1d9e29c 9d883746f146a6,ff   <- resulting data

The rest of Phase 2 does more damage. I think the unalignedness of the
plaintext pointer (xorBlock) is affecting stuff too, which is why the
corrupted ciphertext looks both shifted and bit-flipped from the expected
values.

I instrumented the ProcessData() code in cryptopp-5.5.2 to confirm that the
last byte of the output buffer was correct (0x43) at the end of Phase 1 and
then clobbered (0xa6) at the end of Phase 2.
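The clobbering write can be simulated the same way as the load. The sketch below models only which bytes a legacy ARM STR touches (the aligned word containing the address); whether the stored value is also rotated varies by configuration, so the byte values here are illustrative:

```python
def store32_arm_legacy(mem, addr, value):
    # legacy ARM STR ignores the low address bits: the word lands on
    # the aligned word *containing* addr, not at addr itself
    base = addr & ~3
    mem[base:base + 4] = value.to_bytes(4, "little")

mem = bytearray(range(16))          # pretend this is 0x1000..0x100f
store32_arm_legacy(mem, 0x7, 0xAABBCCDD)
# intended: write bytes [7..10]; actual: bytes [4..7] are overwritten,
# clobbering bytes [4..6] that Phase 1 had already filled in correctly
```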

This corruption doesn't happen if Phase 1 didn't run, since then the pointers
are still equal to the original (4-byte aligned) Python string buffer, which
means that if your call to ProcessData() leaves it on a 16-byte boundary,
then you're fine.

It also doesn't happen unless Phase 2 runs, which requires that there be at
least 16 bytes left to process (so it can do a full block). And it only
happens on ARM where unaligned accesses give you this weird corruption behavior
(on other chips you might get a trap or a hard-to-spot performance problem).

Calling ProcessData() only once per AES instance won't trigger it.
Always calling ProcessData() with short strings (<16 bytes) won't trigger it,
and always calling it with multiples of 16-byte strings won't trigger it.

== Episode 4: A New Hope ==

cryptopp-5.6.0 changes the AES implementation considerably. I haven't traced
through it enough to be sure, but I suspect that they've fixed the problem,
because of a new #define CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS (which is
conservatively only set on x86, x64, and PowerPC). In addition, the vital
write in rijndael.cpp now looks like:

  Block::Put(xorBlock, outBlock)(tbw[0]^rk[0])(tbw[1]^rk[1])(tbw[2]^rk[2])(tbw[3]^rk[3]);

and is done without casting xorBlock or outBlock to word32*. I can't find
where Block::Put is defined, but I'm encouraged that this will fix the
problem. The terse release notes for 5.6.0 don't mention intentionally fixing
any problem like this.

I've got a library compiling now to test this. I'll have to let it run
overnight: compiling libcrypto++ on Zandr's linkstation takes about 3 hours.
(incidentally, it runs a lot faster if you edit the makefile to use -O0).

What this means is that libcrypto++-5.5.2 has a grave bug on ARM, which is
probably fixed in the current libcrypto++-5.6.0 . If we really wanted to,
pycryptopp could work around the bug, by keeping track of how many bytes we'd
submitted to ProcessData() and splitting up the input data (at most one split
per call to .process) such that we either tell ProcessData() to start at a
16-byte boundary, or we give it less than 16 bytes of data to work with.
Something like:

    def process(self, plaintext):
        if len(plaintext) < 16 or self.counter%16==0:
            self.counter += len(plaintext)
            return self.e.ProcessData(plaintext)
        gap = 16 - (self.counter%16)
        return self.process(plaintext[:gap]) + self.process(plaintext[gap:])

That ought to sound crazy to you.

Lenny has 5.5.2, and while we should develop a minimal (C++) demonstration
case and file a "serious" or "grave" debian/lenny bug on libcrypto++7
(specific to arm/armel), I think it's unlikely that we could convince them to
issue a security fix and upgrade stable all the way to 5.6.0 for just one
architecture. So ARM boxes that want to use pycryptopp with
--disable-embedded-cryptopp will need to upgrade to sid (which has 5.6.0).
Fortunately, pycryptopp.deb isn't going to go into Lenny anyways (since it's
already shipped), so this is less of an issue for the
get-pycryptopp-into-Debian effort, just for people who want to use pycryptopp
in the syslib form on a lenny/stable ARM box. We may want to consider
building a backported cryptopp-5.6.0 for these folks.

The embedded-cryptopp form should work (assuming my hopes for Block::Put are
justified), because Zooko recently upgraded the embedded copy to 5.6.0 .

We should also look at crypto++'s test suite and make sure there's a case
which uses a single encryption object for two ProcessData operations (with
various lengths), so this case is exercised right from the source. I've
attached the python equivalent below.


#! /usr/bin/python

from pycryptopp.cipher.aes import AES
from binascii import a2b_hex, b2a_hex

key = a2b_hex("54b0789aeeddb6b3351e6d7b8d79d8d489582376ab1b322dd3362ccbfdb82f7a")
assert len(key) == 32
plaintext = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
assert len(plaintext) == 52
expected_ciphertext = AES(key=key).process(plaintext)
#print b2a_hex(expected_ciphertext[:16]), b2a_hex(expected_ciphertext[16:32]), b2a_hex(expected_ciphertext[32:48])
expected_s = "0551d7974ff1d9e29c9d883746080343"+"f15abafbc9ca5ac6a9a7eeaada790a42"+"73a45de485a97ecd52d79fe702a07a67"+"9b8d73ad"
assert b2a_hex(expected_ciphertext) == expected_s

errors = []
for i in range(1, 25):
    for j in range(1, 25):
        p = AES(key=key)
        a = p.process(plaintext[:i])
        assert len(a) == i
        b = p.process(plaintext[i:i+j])
        assert len(b) == j
        c = a+b
        if c != expected_ciphertext[:i+j]:
            errors.append((i, j, a, b,
                           expected_ciphertext[:i],
                           expected_ciphertext[i:i+j]))
for (i,j,a,b,ea,eb) in errors:
    print "%02d,%02d: %s %s <--GOT" % (i, j, b2a_hex(a), b2a_hex(b))
    print "%02d,%02d: %s %s <--EXPECTED" % (i, j, b2a_hex(ea), b2a_hex(eb))
if not errors:
    print "no errors detected"
