Opened at 2010-07-06T05:21:58Z
Closed at 2012-02-17T06:59:38Z
#44 closed defect (fixed)
test what happens if another Python module loads Crypto++ dynamic link library
Reported by: | zooko | Owned by: | zooko |
---|---|---|---|
Priority: | major | Milestone: | 0.6.0 |
Version: | 0.5.19 | Keywords: | |
Cc: | Launchpad Bug: |
Description
In theory, it will cause crashes or possibly exploitable holes on some platforms.
Change History (25)
comment:1 Changed at 2010-07-06T05:24:49Z by zooko
comment:2 Changed at 2010-08-31T12:21:10Z by zooko
Hopefully Crypto++ v5.6.1 fixed this. We should write this test and then see if upgrading to Crypto++ v5.6.1 makes it stop crashing.
comment:3 Changed at 2010-09-01T03:58:44Z by zooko
Hm, so as usual actually sitting down and trying to write a unit test makes me realize that I don't really understand.
Along the way of attempting to understand, I re-read this thread:
http://groups.google.com/group/cryptopp-users/browse_thread/thread/eb815f228db50380
and then I found this nice page about visibility control in GCC:
http://www.nedprod.com/programs/gccvisibility.html
Now I think the way to test this is to make another .so aside from _pycryptopp.so which also links to libcryptopp.so and then load _pycryptopp.so and then load that other .so.
comment:4 Changed at 2010-09-04T20:34:25Z by zooko
Okay, I've added a test of this. setup.py will build another .so, named _testdoubleload.so, if you pass the --test-double-load argument on the command-line. I updated the buildbot to build with --test-double-load and then to try this command-line:
python -c 'import pycryptopp; import _testdoubleload'
On many systems this causes a memory management error. It is typically reported as a double free detected by glibc. Here are the details reported by with valgrind on an Ubuntu 10.04 amd64 box:
Invalid read of size 4 at 0x75BA116: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib/libstdc++.so.6.0.13) by 0x5AC9261: exit (exit.c:78) by 0x5AAEC53: (below main) (libc-start.c:258) Address 0x7ab1210 is 16 bytes inside a block of size 28 free'd at 0x4C27DCF: operator delete(void*) (vg_replace_malloc.c:387) by 0x75BA128: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib/libstdc++.so.6.0.13) by 0x5AC9261: exit (exit.c:78) by 0x5AAEC53: (below main) (libc-start.c:258)
Now if I force use_RTLD_GLOBAL to False on line 22 of pycryptopp/__init__.py then the error goes away. Next I ran this experiment (with use_RTLD_GLOBAL at its default setting) on all of the buildslaves. The results were that, among unixy platforms, CentOS5, debian-lenny-c7-i386, Mac 10.6/amd64, OpenSolaris/amd64, OpenBSD-4.6/amd64, and NetBSD5/i386 showed no problems, but Ubuntu 10.04/amd64 showed the problem.
(See the buildbot around this time: http://tahoe-lafs.org/buildbot-pycryptopp/waterfall?show_events=true&last_time=1283631265 .)
Next I'm going to push a patch to pycryptopp trunk which disables the RTLD_GLOBAL hack and see if that breaks anything.
comment:5 Changed at 2010-09-04T21:45:25Z by zooko
Okay I disabled RTLD_GLOBAL in [20100904201728-92b7f-1d838bc41934d7375fe2e611470a4e4c56185c49] and the buildbots look perfectly happy, e.g. http://tahoe-lafs.org/buildbot-pycryptopp/waterfall?show_events=true&last_time=1283634414
Of the unixy platforms, it appears that all of the ones that were working back when he had RTLD_GLOBAL are still working, and the one that was failing -- Ubuntu 10.04/amd64 -- is now working.
Does this mean that the problem that we were originally trying to fix when we introduced the RTLD_GLOBAL hack back in [20090527235925-92b7f-c7b71c236b9ca5ed4083327793cdea27c3201b32], problems with RTTI crossing shared-library boundaries, are fixed? Perhaps we no longer throw C++ exceptions in one shared-library (libcryptopp.so) and attempt to catch them in another (pycryptopp.so)? Hm a quick grep tells me that we are attempting to catch exceptions, in pycryptopp, which were thrown by libcryptopp, for example:
Does this mean we don't have test coverage of this behavior? No, look, there are some right there in test_aes.py. So does this mean that the C++ compiler (g++) has gotten smarter about making cross-shared-lib RTTI work since a year ago?
Hm. I'm not sure what's going on. I guess as long as I don't see evidence of a current problem (especially evidence in the form of a red unit test on our buildbot), I'll leave the RTLD_GLOBAL hack turned off for now.
comment:6 Changed at 2010-09-07T16:27:59Z by zooko
According to gcc developers, using RTTI across shared library boundaries is unsupported unless you set RTLD_GLOBAL:
http://gcc.gnu.org/ml/gcc-help/2010-08/msg00293.html
and according to Jeffrey Walton's letter to the Crypto++ mailing list:
http://groups.google.com/group/cryptopp-users/browse_thread/thread/7eae009a4e02e726
The version of Crypto++ which we have bundled in pycryptopp (which pretty much corresponds to Crypto++ v5.6.0), should exhibit bugs when we don't have RTLD_GLOBAL set.
However, this is the opposite of what our unit tests currently show! Currently everything works if we leave RTLD_GLOBAL unset, but things fail (we get a double-free when cleaning up after loading two different .so's each of which links to libcryptopp.so) if we set RTLD_GLOBAL. So I think this means our unit tests are advanced enough to demonstrate the bugs that the gcc developers and Jeffrey Walton are trying to explain to us.
Could someone help me write a unit test which fails due to these issues, when we leave RTLD_GLOBAL unset?
Here are some notes that I wrote to Jeffrey Walton and Wei Dai in email just now:
ere it is giving a double-free error (which could in theory even
become an exploitable security hole I guess) with RTLD_GLOBAL set:
Here is the same test after the following patch unset RTLD_GLOBAL:
Here is the same test, after unsetting RTLD_GLOBAL, when linked with
Ubuntu's provided system libcryptopp.so instead of
compiled-by-our-setup-script libcryptopp.so:
comment:7 Changed at 2011-01-02T15:58:36Z by zooko
- Resolution set to fixed
- Status changed from new to closed
I think this issue is fixed by [20100904201728-92b7f-1d838bc41934d7375fe2e611470a4e4c56185c49].
comment:8 Changed at 2011-12-28T07:55:01Z by zooko
I've now figured out (thanks in part to nejucomo) that this patch from the upstream Crypto++ project is probably Wei Dai's solution to this problem -- in particular to the problem of double-free when multiple libraries (Python extension modules) are loaded that link to the Crypto++ shared library.
What I need to do next is, first undo the patch [20100904201728-92b7f-1d838bc41934d7375fe2e611470a4e4c56185c49] the patch that was intended to fix this before, and see if this reproduces the double-free problem and/or the missed-exception problem on our various buildslaves.
Next, try extending _testdoubleloadmodule.cpp to actually use some symbol from the Crypto++ dynamic library, in order to force it to be actually linked and loaded, and see if that causes the problems to appear on more buildslaves.
Next, try applying Wei Dai's patch from the upstream Crypto++ project and see if that cures the double-free problem on any buildslaves that are exhibiting it. Along the way, of course, see if we can reproduce and/or cure the missed-exception problem (presumably unsetting RTLD_GLOBAL will (still) cure the missed-exception problem).
We need lots of good green buildslaves to be ready before we try this! Currently quite a few buildslaves are not green, mostly due to the still in-progress transition from darcs to git.
comment:9 Changed at 2011-12-28T07:57:28Z by zooko
- Resolution fixed deleted
- Status changed from closed to reopened
comment:10 Changed at 2011-12-28T08:08:25Z by zooko
comment:11 Changed at 2012-01-09T18:23:37Z by zooko
- Milestone set to 0.6.0
comment:12 Changed at 2012-01-12T00:07:48Z by zooko
I suspect these valgrind warnings might be relevant:
https://tahoe-lafs.org/buildbot-pycryptopp/builders/Ruben%20Fedora/builds/17
==14846== Conditional jump or move depends on uninitialised value(s) ==14846== at 0x40172A6: index (in /lib64/ld-2.15.so) ==14846== by 0x4007727: expand_dynamic_string_token (in /lib64/ld-2.15.so) ==14846== by 0x4007FA6: _dl_map_object (in /lib64/ld-2.15.so) ==14846== by 0x400186D: map_doit (in /lib64/ld-2.15.so) ==14846== by 0x400E6C5: _dl_catch_error (in /lib64/ld-2.15.so) ==14846== by 0x4000EFE: do_preload (in /lib64/ld-2.15.so) ==14846== by 0x400458D: dl_main (in /lib64/ld-2.15.so) ==14846== by 0x40147C3: _dl_sysdep_start (in /lib64/ld-2.15.so) ==14846== by 0x4005151: _dl_start (in /lib64/ld-2.15.so) ==14846== by 0x4001697: ??? (in /lib64/ld-2.15.so) ==14846== by 0x3: ??? ==14846== by 0x7FF00061E: ???
==14846== Conditional jump or move depends on uninitialised value(s) ==14846== at 0x40172AB: index (in /lib64/ld-2.15.so) ==14846== by 0x4007727: expand_dynamic_string_token (in /lib64/ld-2.15.so) ==14846== by 0x4007FA6: _dl_map_object (in /lib64/ld-2.15.so) ==14846== by 0x400186D: map_doit (in /lib64/ld-2.15.so) ==14846== by 0x400E6C5: _dl_catch_error (in /lib64/ld-2.15.so) ==14846== by 0x4000EFE: do_preload (in /lib64/ld-2.15.so) ==14846== by 0x400458D: dl_main (in /lib64/ld-2.15.so) ==14846== by 0x40147C3: _dl_sysdep_start (in /lib64/ld-2.15.so) ==14846== by 0x4005151: _dl_start (in /lib64/ld-2.15.so) ==14846== by 0x4001697: ??? (in /lib64/ld-2.15.so) ==14846== by 0x3: ??? ==14846== by 0x7FF00061E: ???
comment:13 Changed at 2012-01-18T04:45:39Z by zooko
Okay, the programme from comment:8 starts with undoing [20100904201728-92b7f-1d838bc41934d7375fe2e611470a4e4c56185c49], but it turns out that in [459114f2aaf043274aa851c367ddfddabdb6f2e3/git] we already removed all trace of the old way of doing things, which was to set the RTLD_GLOBAL flag. No buildbot-performed tests or reports from the field have said that there was any problem with failure to identify two occurrences of the same symbol, from two different compilation modules, as the same, such as the original problem of "exception thrown from one module can't be caught in the other because the type of the exception class never compares equal". Therefore, I'm going to skip this step and assume that there is no need to set RTLD_GLOBAL anymore and proceed with the next step.
comment:14 Changed at 2012-01-25T13:21:35Z by zooko
Well, this is pretty disappointing. I applied http://bazaar.launchpad.net/~zooko/cryptopp/trunk/revision/466 as https://github.com/zooko/pycryptopp/commit/95c78ce5f08b03be70e5d69e5eb09d7acc1ce420, and it didn't change anything in terms of test results! In particular, the two systems that had valgrind warnings with the old version have the exact same warnings with the new version. Here are the test results from two systems in the new build with Wei Dai's patch applied: https://tahoe-lafs.org/buildbot-pycryptopp/builders/Ruben%20Fedora/builds/27 and https://tahoe-lafs.org/buildbot-pycryptopp/builders/Ruben%20Fedora/builds/27.
I had thought that those valgrind warnings were due to the problem of handling global symbols and singletons. In fact, now that I think about it, what exactly did I think was happening here? I guess on Linux the module loading flags default to RTLD_LOCAL instead of RTLD_GLOBAL, which I think should be causing a failure of cross-module typeinfo comparison (e.g. a failure to catch an exception of a certain type, raised in a different C++ module), but there isn't such a failure demonstrated by the tests.
In contrast if the flag was set to RTLD_GLOBAL then I would have expected exception-catching to work, but multiple libraries which dynamically link to the Crypto++ dynamic link library would each think they were solely responsible for managing the global symbols in that library, including constructing and destructing singletons, which should lead to some problems, most importantly a double-free problem. I don't see that either, on the buildbots, although possibly that's because none of the systems is using RTLD_GLOBAL now.
Okay, now what? The pressing questions are:
- what these valgrind warnings mean (pasted into this comment below),
- why Wei Dai wrote that patch,
- why we no longer get a problem where exceptions raised from one module can't be caught (by their type) in another module,
- if the current version of pycryptopp triggers a double-free error (or another critical problem) when it and another library (e.g. another Python module) are loaded into the same process and linking to the same Crypto++ DLL
Help!
==30887== Conditional jump or move depends on uninitialised value(s) ==30887== at 0x4017A16: index (in /lib64/ld-2.15.so) ==30887== by 0x4007927: expand_dynamic_string_token (in /lib64/ld-2.15.so) ==30887== by 0x40081BC: _dl_map_object (in /lib64/ld-2.15.so) ==30887== by 0x400171D: map_doit (in /lib64/ld-2.15.so) ==30887== by 0x400EB65: _dl_catch_error (in /lib64/ld-2.15.so) ==30887== by 0x4000EE3: do_preload (in /lib64/ld-2.15.so) ==30887== by 0x4003EB2: dl_main (in /lib64/ld-2.15.so) ==30887== by 0x4014E4A: _dl_sysdep_start (in /lib64/ld-2.15.so) ==30887== by 0x4004E71: _dl_start (in /lib64/ld-2.15.so) ==30887== by 0x4001537: ??? (in /lib64/ld-2.15.so) ==30887== by 0x1: ??? ==30887== by 0x7FF000E96: ???
comment:15 Changed at 2012-02-10T16:50:41Z by zooko
Jeffrey Walton had some suggestions: http://groups.google.com/group/cryptopp-users/browse_thread/thread/44ad84eb354a570c
comment:16 Changed at 2012-02-10T17:44:54Z by zooko
- Owner set to zooko
- Status changed from reopened to new
comment:17 Changed at 2012-02-10T17:45:01Z by zooko
- Status changed from new to assigned
comment:18 Changed at 2012-02-13T12:34:29Z by zooko
The valgrind warnings have gone away. None of the patches we committed between when it was giving warnings and when it wasn't would seem to explain it. I found a build that had the warnings and that didn't involve any experimental patches not destined for trunk, and hit the "rebuild" button on that. Buildbot rebuilt the exact same source, based on the git hash. This time there were no valgrind warnings! Hooray for the rebuild button! Comparing the outputs from show-tool-versions for the builds with and without valgrind warnings:
buildmaster: buildmaster: no such file or directory buildslave: Buildslave version: 0.8.5 Twisted version: 11.1.0 cl: cl: no such file or directory -g++: g++ (GCC) 4.7.0 20120119 (Red Hat 4.7.0-0.8) +g++: g++ (GCC) 4.7.0 20120126 (Red Hat 4.7.0-0.10) cryptest: 5.6.1 darcs: 2.7.98.3 (beta 3) 7za: 7za: no such file or directory flappclient: Foolscap version: 0.6.3 Twisted version: 11.1.0 -valgrind: valgrind-3.6.1 +valgrind: valgrind-3.7.0 as: GNU assembler version 2.22 Copyright 2011 Free Software Foundation, Inc. This program is free software; you may redistribute it under the terms of the GNU General Public License version 3 or later. This program has absolutely no warranty. This assembler was configured for a target of `x86_64-redhat-linux'. setuptools: [distribute 0.6.24 (/usr/lib/python2.7/site-packages)]
Either Fedora updating to a newer version of the (not yet released) gcc-4.7.0 or Fedora upgrading valgrind from 3.6.1 to 3.7.0 could explain the warnings going away. In either case, we can consider them to be not a problem.
comment:19 Changed at 2012-02-13T20:36:39Z by davidsarah
You could try to reproduce the warnings with g++ (GCC) 4.7.0 20120119 (Red Hat 4.7.0-0.8) and/or valgrind-3.6.1, to confirm that's what was fixed (rather than some other subtle environmental breakage hiding the warnings).
comment:20 Changed at 2012-02-13T20:43:34Z by zooko
Well, that's not very convenient since we don't have direct ssh-style access to a Fedora box. We could always arrange such access, e.g. a short-term virtual machine with Rawhide installed (?), but I'm inclined to assume that this isn't a problem.
comment:21 Changed at 2012-02-15T21:44:37Z by zooko
Hm, now the warning is back in a subsequent run:
Let's see, here was the rebuild-the-exact-same-version (95c78ce5f08b03be70e5d69e5eb09d7acc1ce420) that I did earlier and it came out valgrind-clean:
https://tahoe-lafs.org/buildbot-pycryptopp/builders/Ruben%20Fedora%20syslib/builds/41
And here is the rebuild of the exact same version again:
https://tahoe-lafs.org/buildbot-pycryptopp/builders/Ruben%20Fedora%20syslib/builds/48
And the same valgrind warnings are back.
comment:22 Changed at 2012-02-16T00:30:01Z by davidsarah
Possibly http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=583856 ?
(Always google the error message!)
Edit: the bug is claimed to have been fixed by updating the suppressions in valgrind 3.6.1. However, if some of the warnings occur nondeterministically, it's plausible that the fix didn't catch all of them.
comment:23 Changed at 2012-02-16T00:41:38Z by davidsarah
I suggest closing this ticket and opening another one for the valgrind warnings, since there's no evidence that the latter have anything to do with double-loading (and we do have tests for cross-module exception catching as pointed out in comment:5). We should also file a debian bug.
comment:24 Changed at 2012-02-17T06:04:43Z by zooko
Created #81 to be about the valgrind warnings.
comment:25 Changed at 2012-02-17T06:59:38Z by zooko
- Resolution set to fixed
- Status changed from assigned to closed
Okay, so the whole thing about the valgrind warnings starting in comment:12 was a red herring -- those warnings were almost certainly unrelated to the subject of this ticket.
I also just opened #82 to be about the last remaining valgrind warnings, which appear to be unrelated to this issue.
As for the actual subject bug(s) of this ticket, I think they can be marked as "presumed dead":
- Raising exceptions from one shared library (libcryptopp.so) and catching it in another (_pycryptopp.so) is tested and works (see comment:5)
- Setting RTLD_GLOBAL cured the double-free errors on all platforms where they occurred, setting it back to RTLD_LOCAL later re-introduced them and setting it back to RTLD_GLOBAL again cured them again
- I applied Wei Dai's patch in [95c78ce5f08b03be70e5d69e5eb09d7acc1ce420] and while it didn't cause anything to improve on the buildbot, it didn't cause anything to degrade either (and I inspected the patch and didn't see any bugs in it)
- Valgrind isn't telling us about any potential problems that are related to this ticket
- We changed it so the double-load-tester is actually loading the Crypto++ dynamic-link library necessarily (by making it use a symbol provided by that library, per comment:8), in [7b821f79032fcba0e956813d1be93d19eef7d103]
So, I'm going to mark this ticket as "fixed".
http://tahoe-lafs.org/pipermail/tahoe-dev/2010-July/004604.html