#1418 closed defect (invalid)

"cannot convert float NaN to integer" in next_power_of_k, during upload via helper

Reported by: rycee Owned by: rycee
Priority: major Milestone: undecided
Component: code Version: 1.8.2
Keywords: helper upload heisenbug floatingpoint debian pycryptopp Cc: zooko
Launchpad Bug:

Description

While performing a backup just a few minutes ago, I got the following exception:

Traceback (most recent call last):
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/support/bin/tahoe", line 9, in <module>
    load_entry_point('allmydata-tahoe==1.8.2', 'console_scripts', 'tahoe')()
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/runner.py", line 113, in run
    rc = runner(sys.argv[1:], install_node_control=install_node_control)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/runner.py", line 99, in runner
    rc = cli.dispatch[command](so)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/cli.py", line 540, in backup
    rc = tahoe_backup.backup(options)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
    return bu.run()
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 118, in run
    new_backup_dircap = self.process(options.from_dir)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 188, in process
    childcap = self.process(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 194, in process
    childcap, metadata = self.upload(childpath)
  File "/home/rycee/bin/allmydata-tahoe-1.8.2/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
    raise HTTPError("Error during file PUT", resp)
allmydata.scripts.common_http.HTTPError: Error during file PUT: 500 Internal Server Error

The remote exception was:

Traceback (most recent call last):
Failure: foolscap.tokens.RemoteException: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/foolscap-0.6.1-py2.6.egg/foolscap/call.py", line 674, in _done
    self.request.complete(res)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/foolscap-0.6.1-py2.6.egg/foolscap/call.py", line 60, in complete
    self.deferred.callback(res)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 361, in callback
    self._startRunCallbacks(result)
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 455, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/rycee/allmydata-tahoe-1.8.2/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 924, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: cannot convert float NaN to integer

I have previously performed successful backups with this precise setup; no configuration has been changed and no servers restarted recently. The storage node has ample available space.

I tried running the backup several times and the error occurs each time. However, when running a second time, the file on which the first run failed is reported as uploaded, and the second run fails on the next file. Running a check on the reported cap gives:

$ tahoe check --verify --raw URI:CHK:2pqpbe...
{
 "results": {
  "needs-rebalancing": false, 
  "count-shares-expected": 3, 
  "healthy": false, 
  "count-unrecoverable-versions": 1, 
  "count-shares-needed": 1, 
  "sharemap": {}, 
  "count-recoverable-versions": 0, 
  "servers-responding": [
   "5yea4my3w3frgp524lgthrb7rdd6frtr", 
   "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
   "bzyf23mghgxycnr34pdkqdmybnevf4ks"
  ], 
  "count-good-share-hosts": 0, 
  "count-wrong-shares": 0, 
  "count-shares-good": 0, 
  "count-corrupt-shares": 3, 
  "list-corrupt-shares": [
   [
    "bzyf23mghgxycnr34pdkqdmybnevf4ks", 
    "barr7kgra3pst6icbtnmxrggca", 
    0
   ], 
   [
    "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
    "barr7kgra3pst6icbtnmxrggca", 
    1
   ], 
   [
    "5yea4my3w3frgp524lgthrb7rdd6frtr", 
    "barr7kgra3pst6icbtnmxrggca", 
    2
   ]
  ], 
  "recoverable": false
 }, 
 "storage-index": "barr7kgra3pst6icbtnmxrggca", 
 "summary": "Not Healthy: 0 shares (enc 1-of-3)"
}

Without using --verify

$ tahoe check --raw URI:CHK:2pqp...
{
 "results": {
  "needs-rebalancing": false, 
  "count-shares-expected": 3, 
  "healthy": true, 
  "count-unrecoverable-versions": 0, 
  "count-shares-needed": 1, 
  "sharemap": {
   "0": [
    "bzyf23mghgxycnr34pdkqdmybnevf4ks"
   ], 
   "1": [
    "44g5kkgwulzrrrntdzci7jtt5rgt6nuo"
   ], 
   "2": [
    "5yea4my3w3frgp524lgthrb7rdd6frtr"
   ]
  }, 
  "count-recoverable-versions": 1, 
  "servers-responding": [
   "5yea4my3w3frgp524lgthrb7rdd6frtr", 
   "44g5kkgwulzrrrntdzci7jtt5rgt6nuo", 
   "bzyf23mghgxycnr34pdkqdmybnevf4ks"
  ], 
  "count-good-share-hosts": 3, 
  "count-wrong-shares": 0, 
  "count-shares-good": 3, 
  "count-corrupt-shares": 0, 
  "list-corrupt-shares": [], 
  "recoverable": true
 }, 
 "storage-index": "barr7kgra3pst6icbtnmxrggca", 
 "summary": "Healthy"
}

Both client and server nodes are running 1.8.2 on Debian stable. Sorry for the big pastes; if there is any other information you would like, please let me know.

Attachments (8)

preconditions-about-ticket1418.diff (3.6 KB) - added by zooko at 2011-06-21T17:54:17Z.
tahoe.cfg (391 bytes) - added by rycee at 2011-06-22T09:01:52Z.
The tahoe.cfg of my storage/helper node (the non-helper nodes just have enabled=false in [helper])
tahoe-client.cfg (340 bytes) - added by rycee at 2011-06-22T09:04:06Z.
My client config with helper enabled.
diagnostics-about-NaN-from-npok.diff (1.1 KB) - added by zooko at 2011-06-22T18:26:00Z.
1.8.2.diff (16.8 KB) - added by rycee at 2011-06-23T08:58:49Z.
diff -ur between build throwing NaN exception and pristine build
show-tool-versions-output.txt (10.8 KB) - added by rycee at 2011-06-23T20:00:36Z.
Output from show-tool-versions
valgrind-output.txt (82.3 KB) - added by rycee at 2011-06-23T20:01:05Z.
Output from valgrind.
pycryptopp-dynlink-test-failure.txt (56.7 KB) - added by rycee at 2012-03-01T20:57:17Z.
Output from pycryptopp test when built with --disable-embedded-cryptopp


Change History (63)

comment:1 Changed at 2011-06-14T11:16:29Z by rycee

I'm currently running a fsck on the storage disks. Maybe a disk corruption would have this effect?

No, the fsck runs came back clean (I run two storage nodes on the computer).

Last edited at 2011-06-14T11:50:57Z by rycee

comment:2 Changed at 2011-06-14T21:31:14Z by davidsarah

It seems as though mathutil.next_power_of_k(num_segments, 2) was probably called with num_segments = NaN. (The next power of 2 is also undefined for negative numbers and infinity, but those give different exception messages.)
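
For illustration, here is a minimal sketch (not the exact allmydata.util.mathutil code) showing how a NaN num_segments produces exactly this error message, while a negative or infinite value would fail differently:

import math

# Simplified sketch of next_power_of_k, for illustration only -- not the
# exact upstream implementation.
def next_power_of_k(n, k):
    if n == 0:
        x = 0
    else:
        x = int(math.log(n, k) + 0.5)   # int() of a NaN raises ValueError
    if k**x < n:
        return k**(x+1)
    return k**x

print(next_power_of_k(30, 2))       # 32, the expected result
next_power_of_k(float('nan'), 2)    # ValueError: cannot convert float NaN to integer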

comment:3 Changed at 2011-06-22T02:38:12Z by zooko

  • Keywords response-needed added
  • Owner nobody deleted

Dear rycee: thank you for the good bug report! I scanned through the code shown by the stack trace and didn't see the problem. Along the way I added preconditions which should identify the problem more precisely. Could you please apply attachment:preconditions-about-ticket1418.diff to your 1.8.2 source tree and reproduce the problem? Thank you very much!

I wonder if it could stem from a configuration change you've made in tahoe.cfg, perhaps to the "max segment size" setting. If so, this patch will tell us that this is the problem.

comment:4 Changed at 2011-06-22T02:38:35Z by zooko

  • Owner set to rycee

comment:5 follow-up: Changed at 2011-06-22T08:55:30Z by rycee

With zooko's patch I get pretty much the same thing as earlier (do I need to give python some special flags to enable the preconditions?):

[snip]
--- <exception caught here> ---
  File "/home/rycee/allmydata-tahoe-1.8.2.foo/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 926, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: cannot convert float NaN to integer

Please note that upload.py:926 now appears in the stack, compared to upload.py:924 in the initial stack, so the patch was really applied! I checked this many times to convince myself that I'm not going insane :-)

After a bit of further investigation it seems like the problem only occurs when I'm using a helper. When uploading directly from my client node it works perfectly fine. When I realized this, I also realized that it is possible that, since upgrading the helper node to 1.8.2, I have only used the client without the helper! I did successfully use the helper when it was running 1.8.1, though. I don't know if this helps you narrow down the problem.

Disabling the use of helper and uploading a random file gives:

$ dd if=/dev/urandom of=rnd0 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256792 s, 4.1 MB/s
$ tahoe cp rnd0 priv:
Success: files copied
$ tahoe check priv:rnd0
Summary: Healthy
 storage index: fpeoa7vqrcrjny354zmru3gqda
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd0
Summary: Healthy
 storage index: fpeoa7vqrcrjny354zmru3gqda
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0

which is good. However, when I enable the use of a helper and restart the client node I get:

$ dd if=/dev/urandom of=rnd1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256289 s, 4.1 MB/s
$ tahoe cp rnd1 priv:
Success: files copied
$ tahoe check priv:rnd1
Summary: Healthy
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd1
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI rzlouseev5hbrte62facntsj4i, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI rzlouseev5hbrte62facntsj4i, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI rzlouseev5hbrte62facntsj4i, shnum 1

which is bad.

Changed at 2011-06-22T09:01:52Z by rycee

The tahoe.cfg of my storage/helper node (the non-helper nodes just have enabled=false in [helper])

Changed at 2011-06-22T09:04:06Z by rycee

My client config with helper enabled.

comment:6 Changed at 2011-06-22T09:29:49Z by davidsarah

  • Component changed from unknown to code
  • Keywords helper upload added; response-needed removed
  • Owner changed from rycee to zooko
  • Summary changed from "cannot convert float NaN to integer" in next_power_of_k to "cannot convert float NaN to integer" in next_power_of_k, during upload via helper

comment:7 in reply to: ↑ 5 Changed at 2011-06-22T17:45:54Z by zooko

Replying to rycee:

With zooko's patch I get pretty much the same thing as earlier (do I need to give python some special flags to enable the preconditions?)

No, the precondition() function runs the tests regardless of any flags. (There is a flag in Python to disable assert, but precondition() doesn't use assert or respect that flag.)
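
For illustration, a minimal sketch of that distinction (this is not the actual Tahoe precondition() implementation):

# assert statements are stripped when Python runs with -O; a precondition-style
# check is an ordinary function call, so it always runs.
def precondition(expr, *details):
    if not expr:
        raise AssertionError("precondition failed: %r" % (details,))

def demo(n):
    assert n >= 0                  # skipped entirely under "python -O"
    precondition(n >= 0, "n", n)   # always evaluated
    return n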

Oh, I guess my patch was written under the mistaken assumption that this NaN was arising because one of the inputs to math.log() was not an int or long. I'll need to think harder to provide a more useful diagnostics patch, probably starting by re-reading davidsarah's comment:2.

However, when I enable the use of a helper and restart the client node I get:

$ dd if=/dev/urandom of=rnd1 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.256289 s, 4.1 MB/s
$ tahoe cp rnd1 priv:
Success: files copied
$ tahoe check priv:rnd1
Summary: Healthy
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify priv:rnd1
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: rzlouseev5hbrte62facntsj4i
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI rzlouseev5hbrte62facntsj4i, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI rzlouseev5hbrte62facntsj4i, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI rzlouseev5hbrte62facntsj4i, shnum 1

which is bad.

Wait a minute, are you saying that tahoe cp reports Success: files copied even though the helper encountered this uncaught exception!? That seems unlikely. Which process encounters the uncaught exception and in which step of the transcript above?

comment:8 Changed at 2011-06-22T18:26:40Z by zooko

  • Owner changed from zooko to rycee

Okay, this patch will definitely give actually useful diagnostics! attachment:diagnostics-about-NaN-from-npok.diff Please try it out.

comment:9 Changed at 2011-06-22T18:28:53Z by zooko

Oh, by the way, are you applying these diagnostics patches to the helper source code, too? And restarting the helper? The code in the stack trace would be running in the helper, not in the client, I believe.

comment:10 Changed at 2011-06-23T08:56:45Z by rycee

Yes, I have applied the patches you've given on the helper node, not the client node. With the new patch I did indeed get some output, but being a Python novice I feel more confused, not less. The stack trace says:

  File "/home/rycee/allmydata-tahoe-1.8.2-bug1418/support/lib/python2.6/site-packages/Twisted-10.2.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 926, in locate_all_shareholders
    num_segments, n, k, desired)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py", line 225, in get_shareholders
    None)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 88, in make_write_bucket_proxy
    num_share_hashes, uri_extension_size_max, nodeid)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 49, in next_power_of_k
    return next_power_of_k_math(n, k)
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 35, in next_power_of_k_math
    x = int(math.log(n, k) + 0.5)
exceptions.ValueError: ('cannot convert float NaN to integer', 30L, 2, 32)
]'>

and in the node's twistd.log I found

2011-06-23 08:19:10+0200 [Negotiation,1,46.10.48.88] XXX n: 30 :: <type 'long'>, k: 2 :: <type 'int'>, next_power_of_k_alt: 32

In the python REPL on the same computer I get

$ python
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.log(30L, 2)
4.9068905956085187
>>> int(math.log(30L, 2) + 0.5)
5
>>> 2**5
32

This, together with the Success: files copied, makes me quite confused. Feels like the NaN error is a decoy put under our noses while the real problem slips quietly into the night.

I also tried creating a completely pristine 1.8.2 build on my helper node and now it also fails in the same way as the cp did, i.e., it claims success, and the CHK caps reported by a verbose backup are OK when running check but fail when running check --verify. For example, backup says

/home/rycee/photos.new/2011/05/22/IMGP4679.JPG -> URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153

and checking gives

$ tahoe check URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153
Summary: Healthy
 storage index: 7sxgsu3edgy43at77tuskouuay
 good-shares: 3 (encoding is 1-of-3)
 wrong-shares: 0
$ tahoe check --verify URI:CHK:k3fpasihz7g7ogsrmbywdfgdy4:jlcyer5z43nuuvdm72qyqlw2eq5uyjubpxey25gfdizidmcdlnrq:1:3:3877153
Summary: Not Healthy: 0 shares (enc 1-of-3)
 storage index: 7sxgsu3edgy43at77tuskouuay
 good-shares: 0 (encoding is 1-of-3)
 wrong-shares: 0
 corrupt shares:
  server bzyf23mghgxycnr34pdkqdmybnevf4ks, SI 7sxgsu3edgy43at77tuskouuay, shnum 2
  server 44g5kkgwulzrrrntdzci7jtt5rgt6nuo, SI 7sxgsu3edgy43at77tuskouuay, shnum 0
  server 5yea4my3w3frgp524lgthrb7rdd6frtr, SI 7sxgsu3edgy43at77tuskouuay, shnum 1

I will attach the result of a diff -ur between the version giving the exception (including zooko's patches from this bug) and the pristine version. Note that the version giving the exception contains edits of mine that change minimum_cycle_time in crawler.py and expirer.py, but I have had those changes without trouble for a long time (many versions ago).

Oh, finally. Since next_power_of_k_alt returns a sensible result, I tried making next_power_of_k return that value in my build that produces exceptions. Running a backup then proceeds in the same way as the pristine version, i.e., the client reports success but check --verify fails.

Changed at 2011-06-23T08:58:49Z by rycee

diff -ur between build throwing NaN exception and pristine build

comment:11 Changed at 2011-06-23T13:11:00Z by zooko

Yes, this is getting weird. It *looks* like next_power_of_k_math raised this exception when its inputs were n=30, k=2, but when you tried it yourself in the REPL, the same calls to math.log() on the same computer worked. It could be a failure in your CPU's floating point support, but I would have expected that to be sporadic or permanent, rather than to fail every time under tahoe and work every time under the REPL! I wonder if executing tahoe is somehow changing the floating point mode of your CPU...

Maybe there's a bug in next_power_of_k_math(). Could you please try something like:

HACK zompu:~/playground/tahoe-lafs/trunk$ PYTHONPATH=src python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from allmydata.util import mathutil
>>> mathutil.next_power_of_k(30, 2)
32
>>> mathutil.next_power_of_k_math(30, 2)
32
>>> mathutil.next_power_of_k_alt(30, 2)
32

A next step on this for me is to try to understand why the upload is succeeding but the verify is then failing. I guess this error is happening during write. If the error is causing it to get an incorrect value for effective_segments:

  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/layout.py", line 108, in __init__
    effective_segments = mathutil.next_power_of_k(num_segments,2)

Then it might write the data out incorrectly. If the file you are uploading was previously uploaded then the deduplication will prevent your gateway from uploading a new copy. This would explain why changing it to use next_power_of_k_alt() and then uploading and verifying previously written files got the same failure-to-verify. Oh wait, did you upload a new random file instead of a previously uploaded file, when you did the experiment that showed next_power_of_k_alt had the same problem? Gah! If you already tried that, then perhaps there are two bugs here -- the NaN exception and a different bug that is corrupting files on write.

Anyway it wouldn't make sense for this NaN exception to result in an incorrect value of effective_segments, when we can clearly see that what results is an uncaught exception!

Here are a couple of ideas you could try:

  1. Run next_power_of_k() (the version that uses _math and then uses _alt if _math raised an exception) in a tight loop, possibly with multiple processes doing it, and leave those running and reporting if they get any exceptions (a rough loop sketch follows below this list).
  2. Edit the code to use next_power_of_k_alt exclusively and mv your entire storage directory aside, or create an entirely separate storage server and introducer for testing, and upload and then verify a random file. (If you haven't already done this.)
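
A rough sketch of the loop in idea 1, assuming the (patched) Tahoe source tree is on PYTHONPATH; it only exercises next_power_of_k, so extend it as needed:

import itertools, traceback
from allmydata.util import mathutil

for i in itertools.count():
    try:
        result = mathutil.next_power_of_k(30, 2)
        if result != 32:
            print("wrong answer %r on iteration %d" % (result, i))
            break
    except Exception:
        print("exception on iteration %d" % i)
        traceback.print_exc()
        break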

comment:12 Changed at 2011-06-23T13:22:47Z by zooko

Two more ideas:

  1. Run trial --until-failure allmydata.test and just leave it running.
  2. memtest86 :-)

comment:13 Changed at 2011-06-23T19:36:08Z by rycee

I don't have physical access to the machine, but I ran memtester for quite some time and it didn't show anything. Neither did some stress testing of math.log. But running the trial thingy was good: I noticed that allmydata.test.test_cli.Backup.test_backup failed with an exception on the x = int(math.log(n, k) + 0.5) line. And I found something that seems rather curious to me: putting print math.log(10) first in next_power_of_k and running test_backup gives

...
  File "/home/rycee/allmydata-tahoe-1.8.2/src/allmydata/util/mathutil.py", line 32, in next_power_of_k
    print math.log(10)
exceptions.ValueError: math domain error

Yeah, that's right. Running math.log(10) gives an exception, and since it's only run once I can't imagine it's from the CPU being overloaded or anything. Further:

$ python
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.log(10)
2.3025850929940459

If I change math.log(10) to math.log10(10) it still gives the error. But – and hang on to your hat – if I use math.sqrt(10) it works: not only does the sqrt work, but the math.log(n, k) a few lines down does as well! With next_power_of_k defined as

def next_power_of_k(n, k):
    hmmmm = math.sqrt(10)
    if n == 0:
        x = 0
    else:
        x = int(math.log(n, k) + 0.5)
    if k**x < n:
        return k**(x+1)
    else:
        return k**x

the unit test passes!

Now, while this elegant fix removes the exception during backup, the uploaded files unfortunately still fail check --verify.

Anyway, obviously something strange happens with some floating point operations and like you said, some floating point mode is probably being altered when running tahoe but not otherwise.

I also tried adding a print of effective_segments right after its assignment. It turns out that this variable is set to 1 when I try doing a cp of a 1kB file and to 128 with a 10MB file. I don't know if these values are OK, but regardless, it appears that next_power_of_k is OK to call sometimes but not others.

comment:14 Changed at 2011-06-23T19:42:49Z by zooko

Could you please run this single unit test that fails -- allmydata.test.test_cli.Backup.test_backup -- with the math.log(10) in place, under valgrind?

http://valgrind.org

Valgrind is very good. You'll need to use the Python suppressions file with it:

http://svn.python.org/projects/python/trunk/Misc/valgrind-python.supp

Another idea suggested by Riastradh on IRC is to turn on floating point exceptions: http://docs.python.org/library/fpectl.html

comment:15 Changed at 2011-06-23T19:46:12Z by zooko

Could you please attach the output of:

$ python misc/build_helpers/show-tool-versions.py 

and

$ ./bin/tahoe --version

comment:16 Changed at 2011-06-23T19:59:57Z by rycee

I'll attach the show-tool-versions output as a file (it's a bit messy).

$ ./bin/tahoe --version
allmydata-tahoe: 1.8.2,
foolscap: 0.6.1,
pycryptopp: 0.5.17,
zfec: 1.4.10,
Twisted: 10.2.0,
Nevow: 0.10.0,
zope.interface: unknown,
python: 2.6.6,
platform: Linux-debian_6.0.1-i686-32bit_ELF,
pyOpenSSL: 0.10,
simplejson: 2.1.6,
pycrypto: 2.1.0,
pyasn1: unknown,
mock: 0.6.0,
sqlite3: 2.4.1 [sqlite 3.7.3],
setuptools: 0.6c16dev3

Changed at 2011-06-23T20:00:36Z by rycee

Output from show-tool-versions

Changed at 2011-06-23T20:01:05Z by rycee

Output from valgrind.

comment:17 Changed at 2011-06-23T20:12:03Z by rycee

About the fpectl library: I tried adding fpectl.turnon_sigfpe() to the beginning of the test_backup method, which gave me

$ bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup
allmydata.test.test_cli
  Backup
    test_backup ... Fatal Python error: Unprotected floating point exception

which seems … interesting.

I don't know how to associate this with a source location, though, so I don't know whether it corresponds to the math.log(10) or some other piece of code.

comment:18 Changed at 2011-06-23T20:24:21Z by zooko

Sigh. The Valgrind Python suppressions file that you used is apparently not working. Perhaps it was written for a 64-bit architecture and has a bunch of "8"s in it which need to be changed to "4" to make it suppress all of those false positives in attachment:valgrind-output.txt?

All of those "Conditional jump or move depends on uninitialised value(s)" and "Use of uninitialised value of size 4" messages coming from PyObject_Free or PyObject_Realloc are false positives. You can generate new suppression entries to put into a suppressions file by executing

valgrind --gen-suppressions=yes bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup

You might also want to report a bug to the distributor of your valgrind-python.supp file. If you do, please let me know the URL to the ticket.

comment:19 Changed at 2011-06-23T20:34:00Z by zooko

Hm, judging from http://docs.python.org/library/fpectl.html, any native code (in the CPython interpreter or in an extension module) which does floating point math is supposed to wrap the calls to floating point functions in macros named PyFPE_START_PROTECT and PyFPE_END_PROTECT in order to trap the exception and raise a nice Python exception. Apparently this hasn't happened!

Hopefully you can get a run under valgrind without all the noise from false positives. Maybe whatever problem ends in the floating point exception starts with a memory mis-usage that valgrind can report.

Jason Riedy on identi.ca wrote: "If the 30L somehow is pushed on a stack as an integer, the resulting corruption likely would produce a NaN." -- http://identi.ca/notice/77048991

comment:20 Changed at 2011-06-23T20:55:41Z by rycee

As suggested by Riastradh in #tahoe-lafs, I ran it under strace. Using strace -f python bin/tahoe debug trial allmydata.test.test_cli.Backup.test_backup I got a 3.3MB file which I can attach if requested. The last 20 lines are

[pid  3482] gettimeofday({1308861861, 391151}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391248}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391487}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391770}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 391924}, NULL) = 0
[pid  3482] gettimeofday({1308861861, 392020}, NULL) = 0
[pid  3482] --- SIGFPE (Floating point exception) @ 0 (0) ---
[pid  3482] rt_sigaction(SIGFPE, {0xb77028d0, [], 0}, {0xb77028d0, [], 0}, 8) = 0
[pid  3482] write(2, "Fatal Python error: Unprotected "..., 57) = 57
[pid  3482] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid  3482] tgkill(3482, 3482, SIGABRT) = 0
[pid  3482] --- SIGABRT (Aborted) @ 0 (0) ---
Process 3481 resumed
Process 3482 detached
[pid  3487] +++ killed by SIGABRT +++
<... waitpid resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGABRT}], 0) = 3482
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigaction(SIGINT, {SIG_DFL, [], 0}, {0x8110720, [], 0}, 8) = 0
exit_group(-6)                          = ?

comment:21 Changed at 2011-06-23T21:29:16Z by rycee

After some help I also managed to catch the SIGFPE in gdb. Since I don't have debug symbols in my python (simply the version found in Debian Squeeze), I don't know how useful it is. Anyway, I pasted the stack trace to http://hpaste.org/48228

comment:22 follow-up: Changed at 2011-06-24T06:48:28Z by rycee

Riastradh suggested that the cause could be a cosmic ray flipping a bit in memory, especially after seeing strange output from info float in gdb. I tried flushing the file cache, rebooting, reinstalling python 2.6, uninstalling python 2.5 (in case it was somehow interfering), rebuilding tahoe, and building tahoe 1.8.1. I still get an exception from x = int(math.log(n, k) + 0.5). Note, I also get an exception from my newly built 1.8.1. I have previously used 1.8.1 reliably as a helper, so I suppose the problem might be caused by some newer version of a library pulled in during python setup.py build?

comment:23 in reply to: ↑ 22 Changed at 2011-06-24T13:14:13Z by zooko

Replying to rycee:

I have previously used 1.8.1 reliably as a helper so I suppose the problem might be caused due to some newer version of a library pulled in during python setup.py build?

Yes, Tahoe-LAFS itself is written in pure Python. I suspect this problem is due to memory corruption or some other incorrect C/C++ code, so it is probably due to one of the dependencies. I really want to track down which one it is, so please try to continue maintaining a "paper trail" of what you changed at each step of this. Thanks!

To see which dependencies have native code in them, look at this table: http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-lafs-dep-eggs/README.html Every column there is a library that uses native (C/C++) code.

If I were you, my next step would be to clear the false positives out of valgrind (as I described in comment:18). valgrind will either identify the problem, or if it gives a clean report then we can be confident that there is no branching on uninitialized memory, writing to memory out of bounds, or any of the other memory errors that valgrind checks for.

If you're not interested in doing that then say so and let's think of some other approaches to track down this bug.

comment:24 Changed at 2011-06-24T13:20:01Z by zooko

Oh, I have an idea. Turn on tracing -- http://www.doughellmann.com/PyMOTW/trace/ -- then you'll know what line of source code triggered the fp exception. (Caveat: the Python process might buffer the output and then die due to the fp exception before it flushes the line showing the Python code which came immediately before the exception. So turn on the PYTHONUNBUFFERED=1 env var or pass -u to python.)
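
A rough sketch of that idea using the stdlib trace module (the wrapper script name is arbitrary; twisted.scripts.trial.run is trial's usual entry point):

# Run the failing trial test with line tracing so the last executed line
# before the SIGFPE is visible. Invoke with unbuffered output, e.g.
# PYTHONUNBUFFERED=1 python -u trace_the_test.py
import sys, trace
from twisted.scripts.trial import run

sys.argv = ['trial', 'allmydata.test.test_cli.Backup.test_backup']
tracer = trace.Trace(count=False, trace=True)   # print each line as it executes
tracer.runfunc(run)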

comment:25 Changed at 2011-06-24T15:49:39Z by rycee

Turns out Debian stable has a python2.6-dbg package. I installed that and caught the SIGFPE when running the allmydata.test.test_upload.EncodingParameters.test_upper_limit_on_readonly_queries test (it also fails and runs quicker than the test_backup test). Anyway, I get

Program received signal SIGFPE, Arithmetic exception.
0x0807de7c in int_float (v=0xa65a168) at ../Objects/intobject.c:945
945	../Objects/intobject.c: No such file or directory.
	in ../Objects/intobject.c
(gdb) info float
  R7: Valid   0x400e9c42000000000000 +40002                     
  R6: Valid   0x400e9c42000000000000 +40002                     
  R5: Zero    0x00000000000000000000 +0                         
  R4: Valid   0x40029000000000000000 +9                         
  R3: Zero    0x00000000000000000000 +0                         
  R2: Zero    0x00000000000000000000 +0                         
  R1: Special 0xffff6edb1408f6ce3bbe Unsupported
=>R0: Special 0xffffe649a0c99e4d8714 QNaN

Status Word:         0xc2e1   IE             PE   ES   SF      C1    C3
                       TOP: 0
Control Word:        0x1372      DM       UM PM
                       PC: Extended Precision (64-bits)
                       RC: Round to nearest
Tag Word:            0x045a
Instruction Pointer: 0x73:0x0807de79
Operand Pointer:     0x7b:0x0a65a170
Opcode:              0xdb40
(gdb) where
#0  0x0807de7c in int_float (v=0xa65a168) at ../Objects/intobject.c:945
#1  0x08062c78 in PyNumber_Float (o=40000) at ../Objects/abstract.c:1749
#2  0x0807d277 in float_new (type=0x822cf20, args=(40000,), kwds=0x0)
    at ../Objects/floatobject.c:1657
#3  0x080accbd in type_call (type=0x822cf20, args=(40000,), kwds=0x0)
    at ../Objects/typeobject.c:726
#4  0x0806232a in PyObject_Call (func=<type at remote 0x822cf20>, arg=(40000,), 
    kw=0x0) at ../Objects/abstract.c:2492
#5  0x080e016b in do_call (f=
    Frame 0xac1249c, for file /home/rycee/allmydata-tahoe-1.8.2/src/allmydata/immutable/upload.py, line 1331, in _got (params=(3, 1, 10, 40002), k=3, happy=1, n=10, segsize=40002, f=<cStringIO.StringI at remote 0xa5c83f8>, enckey_hasher=<_SHA256d_Hasher(_digest=None, h=<_sha256.SHA256 at remote 0xb73bd700>, truncate_to=16) at remote 0x9c5f4cc>, BLOCKSIZE=65536, bytes_read=40000, data='datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata...(truncated), throwflag=0)
    at ../Python/ceval.c:3968
Quit
(gdb) print v
$1 = (PyIntObject *) 0xa65a168
(gdb) print *v
$2 = {ob_refcnt = 2, ob_type = 0x822da20, ob_ival = 40000}

The failing line in intobject.c is return PyFloat_FromDouble((double)(v -> ob_ival)); so presumably it is the cast to double that fails. But since ob_ival = 40000 I can't see any reason why it would. The failing line in tahoe is self._status.set_progress(0, float(bytes_read)/self._size) in immutable/upload.py.

If I have time, then I'll try the valgrind thingy but at the moment I'm pretty much ready to chalk it all up to hexed hardware.

comment:26 Changed at 2011-06-24T16:43:05Z by zooko

It can't be hexed hardware, because it is reproducible (at least mostly?) and it doesn't happen with any other executables, nor with other Python experiments. Say, here's an idea. My best guess is that one of the native-code dependencies (hopefully not zfec or pycryptopp -- the ones I'm responsible for!) is doing something bad. You could test this by running some complicated floating point math, then importing each native code dependency (see comment:23) and then running the floating point math again.
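
A rough sketch of that experiment (the module list is a guess based on the dependency table and the tahoe --version output in comment:16; adjust it to what is actually installed):

# Check a couple of floating point results, import each native-code
# dependency in turn, and re-check after each import.
import math

def fp_ok():
    try:
        return int(math.log(30, 2) + 0.5) == 5 and abs(math.log(10) - 2.302585) < 1e-5
    except ValueError:
        return False

print("before imports: %s" % fp_ok())
for name in ['zfec', 'pycryptopp', 'OpenSSL', 'Crypto', 'simplejson', 'sqlite3']:
    try:
        __import__(name)
    except ImportError:
        print("%s: not installed" % name)
        continue
    print("after importing %s: %s" % (name, fp_ok()))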

Also the tracing approach that I suggested in comment:24 is quite likely to give us useful information. If you're not comfortable with all the Python coding suggested on http://www.doughellmann.com/PyMOTW/trace/ then I can whip up a script that'll do it for you. :-)

comment:27 follow-up: Changed at 2011-06-25T06:46:19Z by zooko

Chad Austin had a suggestion on twitter:

Did you try asserting on the x87 and SSE control words periodically? I put that assertion after every test in our test runner.

https://twitter.com/#!/chadaustin/status/84508454357696514

comment:28 in reply to: ↑ 27 Changed at 2011-06-25T10:41:53Z by zooko

I asked Chad Austin on twitter how to detect the x87 and SSE control words. He replied:

On Windows with Python 2.6, this seems to work:

from ctypes import *
x87_control_word = c_uint()
sse_control_word = c_uint()
print 'success?', cdll.msvcr90.__control87_2(0, 0, pointer(x87_control_word), pointer(sse_control_word))
print 'x87', hex(x87_control_word.value)
print 'sse', hex(sse_control_word.value)

The meanings of the control registers: http://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz/

comment:29 Changed at 2011-08-16T00:31:11Z by davidsarah

Kludgy but more cross-platform way to get the FP environment (assuming sizeof(fenv_t) <= 100):

from ctypes import cdll, create_string_buffer
from ctypes.util import find_library
libm = cdll.LoadLibrary(find_library('m'))
b = create_string_buffer(100)
assert libm.fegetenv(b) == 0
b.raw

which on my system gives

'\x7f\x03\xff\xff\x00\x00\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\xa0\x1f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

To decode the structure, see http://linux.die.net/include/bits/fenv.h. Note that we probably only need to detect when it changes during a unit test. For that, I would try hacking the run method of TestCase in twisted.trial.unittest so that it calls libm.fegetenv(b) before and after running the test code (the result.startTest(self) line) and prints the values of b.raw if they are different.
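
A minimal sketch of that idea, reusing the fegetenv() snippet above (where exactly to hook it into TestCase.run is left out):

from ctypes import cdll, create_string_buffer
from ctypes.util import find_library

libm = cdll.LoadLibrary(find_library('m'))

def fpenv():
    # Snapshot the FP environment; assumes fenv_t fits in 100 bytes, as above.
    buf = create_string_buffer(100)
    assert libm.fegetenv(buf) == 0
    return buf.raw

def check_fpenv(f, *args, **kwargs):
    # Run f and report if the FP environment changed across the call.
    before = fpenv()
    try:
        return f(*args, **kwargs)
    finally:
        after = fpenv()
        if after != before:
            print("FP environment changed:")
            print(repr(before))
            print(repr(after))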

comment:30 Changed at 2011-08-26T21:37:07Z by davidsarah

  • Keywords heisenbug floatingpoint added

comment:31 Changed at 2012-02-28T16:37:35Z by zooko

There was a bug in the Linux kernel that would cause floating point state to be wrong "on 32-bit linux on modern CPUs":

http://article.gmane.org/gmane.linux.kernel/1258780

comment:16 says that rycee's machine was in 32-bit mode. rycee's show-tool-versions output says that the kernel was 2.6.32-5 (from Debian). I wasn't able to tell from the kernel patch comments when this bug was introduced into the kernel.

Anyway, since rycee appears to have lost interest in debugging this, I suggest we close this ticket as unreproducible until rycee or someone else who can reproduce it reappears.

comment:32 Changed at 2012-02-28T16:37:49Z by zooko

  • Cc zooko added
  • Owner changed from rycee to zooko
  • Status changed from new to assigned

comment:33 follow-up: Changed at 2012-02-28T17:23:20Z by rycee

Yeah, sorry about not being active on this bug. I've been happy with using my tahoe grid without helper so I haven't felt very motivated to dig around.

zooko: The Linux bug you linked seems unrelated, since the CPU on this computer is an old Pentium III-M at 1.2 GHz and the bug was for newer, presumably 64-bit, CPUs.

But I have interesting news: I just tried downloading and building Tahoe 1.9.1. When I ran allmydata.test.test_cli.Backup.test_backup it was successful! I thought it might be related to some glibc update or something, so I also downloaded both 1.8.2 and 1.8.3, built them, and ran the same test. They both failed in the same way as described in my previous comments. So, I assume that some change between 1.8.3 and 1.9.0 must have fixed my problem.

Just to double-check that 1.9.1 actually fixes my problem, I enabled the helper functionality and tried to upload through it; indeed, it was successful.

comment:34 in reply to: ↑ 33 Changed at 2012-02-28T17:30:39Z by zooko

  • Owner changed from zooko to rycee
  • Status changed from assigned to new

rycee: oh boy! You've managed to reproducibly isolate a way to make the bug happen or not happen! Now here's a question: can you run git bisect and get it to automatically run the unit tests in between each version change so that it finds exactly what patch makes the difference? Or do the equivalent manually -- try different versions and narrow down exactly what causes the problem?
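
For what it's worth, a rough sketch of a helper that git bisect run could invoke (the script name, and whether a rebuild is needed at each step, are assumptions):

# Hypothetical bisect_check.py for "git bisect run python bisect_check.py":
# rebuild, run the failing unit test, and exit 0 on success so bisect can
# classify the revision. Exit code 125 tells git bisect to skip revisions
# that do not build.
import subprocess, sys

if subprocess.call(['python', 'setup.py', 'build']) != 0:
    sys.exit(125)
rc = subprocess.call(['bin/tahoe', 'debug', 'trial',
                      'allmydata.test.test_cli.Backup.test_backup'])
sys.exit(0 if rc == 0 else 1)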

Honestly, I'm still skeptical that anything we do in our 100% Python code could cause this bug, but perhaps something about the sequence of instructions that we're sending through the CPU accidentally avoids or accidentally triggers it...

Also I am still willing to help you use some diagnostic tools like valgrind or the fp env from comment:29...

comment:35 Changed at 2012-02-28T17:31:52Z by zooko

Also you might want to try applying that one patchset from linux to change the way it manages floating point state during context switches and see if that makes the problem go away. I don't know for sure that the linux bug doesn't apply on an old Pentium III -- it was just something from, IIRC, gregkh's announcement that mentioned "newer CPUs".

comment:36 Changed at 2012-02-28T18:49:58Z by rycee

Ok, I gave bisect a try and it told me that the issue was fixed in commit 095efbf7a5f8ee375a81136f87ba3943ec995fc6. This change seems to make Tahoe depend on pycryptopp version 0.5.20 or later on my machine; before this change it was satisfied with version 0.5.14 or later.

On my Debian Squeeze installation, the python-pycryptopp package is version 0.5.17-1+b1, satisfying the dependency before davidsarah's change but not after. Finding this out, I of course tried uninstalling the python-pycryptopp Debian package and rebuilding a pristine Tahoe 1.8.2 source tree. The failing test now happily passes!

About the kernel patch: I'm reluctant to try this since I have no physical access to the machine and rely on it staying up.

Last edited at 2012-02-28T18:51:13Z by rycee

comment:37 Changed at 2012-02-28T18:59:33Z by davidsarah

Aha. It seems quite likely that one of the bugs fixed between pycryptopp 0.5.17-1+b1 and 0.5.20 could have been responsible. I doubt it has anything to do with the kernel bug.

Last edited at 2012-02-28T19:00:38Z by davidsarah

comment:38 Changed at 2012-02-28T19:06:56Z by rycee

I guess we can raise the lower bound to pycryptopp version 0.5.17-1+b1 since that's what my Debian box was using when the test was failing. I don't know whether it is worth trying to determine the problem in pycryptopp…

Anyway, thank you David-Sarah for making the commit 095efbf7a5f8ee375a81136f87ba3943ec995fc6 that got us closer to the underlying cause :-)

comment:39 Changed at 2012-02-28T22:15:58Z by davidsarah

Note that it's possible that the Debian cryptopp was broken or miscompiled or included a bad Debian-specific patch, rather than this being a fix between upstream pycryptopp 0.5.17 and 0.5.20.

(Can you tell I don't really trust the Debian packaging policies or practices? We'd have a lot fewer problems if they would just follow upstream rather than picking out individual patches without really understanding them.)

Last edited at 2012-02-28T23:25:51Z by davidsarah

comment:40 Changed at 2012-02-28T22:18:10Z by davidsarah

Oh, also note that the Debian pycryptopp links against a non-embedded Crypto++, so we actually have no idea what version of Crypto++ that is (unless rycee can tell what it was).

comment:41 Changed at 2012-02-28T23:30:03Z by davidsarah

  • Keywords debian pycryptopp added

I considered closing this ticket as fixed, since it seems relatively certain from the bisection that this is a fixed pycryptopp or Crypto++ issue. However, since we don't know whether Debian added a bad patch to their pycryptopp or Crypto++, I'll leave it open until we can determine that.

comment:42 follow-up: Changed at 2012-02-28T23:37:00Z by zooko

Yes, it is also possible that there was a bug in upstream pycryptopp itself. rycee: can you experiment with swapping in the old Debian version of pycryptopp versus the old upstream version of pycryptopp, maybe?

comment:43 in reply to: ↑ 42 Changed at 2012-02-29T01:31:20Z by davidsarah

Replying to zooko:

Yes, it is also possible that there was a bug in upstream pycryptopp itself.

One reason I thought that less likely is that I reviewed all the patches between 0.5.17 and 0.5.20 and didn't see any fixes that could have caused this bug.

rycee: can you experiment with swapping in the old Debian version of pycryptopp versus the old upstream version of pycryptopp, maybe?

Yes, that's a good test.

comment:44 follow-up: Changed at 2012-02-29T07:19:39Z by rycee

I added 0.5.17 as an upper bound in _auto_deps.py and running setup.py build now prints

…
Searching for pycryptopp>=0.5.14,<=0.5.17
Reading http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-dep-sdists/
Reading http://tahoe-lafs.org/source/tahoe-lafs/deps/tahoe-lafs-dep-eggs/
Reading http://pypi.python.org/simple/pycryptopp/
Reading http://allmydata.org/source/pycryptopp
Reading http://allmydata.org/trac/pycryptopp
Reading http://tahoe-lafs.org/trac/pycryptopp
Reading http://allmydata.org/source/zfec
Best match: pycryptopp 0.5.17
Downloading http://pypi.python.org/packages/source/p/pycryptopp/pycryptopp-0.5.17.tar.gz#md5=b3d19e7203531f8bd241ae58062f99e4
Processing pycryptopp-0.5.17.tar.gz
Running pycryptopp-0.5.17/setup.py -q bdist_egg --dist-dir /tmp/easy_install-SlyAet/pycryptopp-0.5.17/egg-dist-tmp-2wdZ7B
…

and it goes on to spend quite some time building the Crypto++ library. After finishing, this build successfully runs the previously failing test even though it's Tahoe version 1.8.2. If I then replace the _pycryptopp.so file inside this Tahoe tree with the corresponding file from the Debian package, the error comes back. The Debian version of pycryptopp is dynamically linked against Crypto++ version 5.6.0-6.

So, finally, leaving Debian's _pycryptopp.so in place, I tried downloading and building a dynamic library from upstream crypto++ 5.6.0 and 5.6.1 and substituting Debian's copy. Curiously the test failed with 5.6.0 and worked with 5.6.1…

comment:45 in reply to: ↑ 44 Changed at 2012-03-01T04:09:25Z by zooko

Okay, this is very good bug isolation. I didn't quite follow the last bit though -- the bug is in *upstream* Crypto++ 5.6.0 and not *upstream* Crypto++ 5.6.1? Or is it in Debian Crypto++ 5.6.0 and not Debian Crypto++ 5.6.1?

By the way, we have a "quick startup self-test" built into pycryptopp, which is an attempt to catch the most egregious of these sorts of failures at run-time. I wonder if it would be possible to make that quick startup self-test more rigorous so that it would catch this one. I guess I'll be better able to tell once we really isolate this bug.
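
For illustration, a sketch of the kind of floating point sanity check such a startup self-test could include (this is not pycryptopp's actual self-test, just the general idea):

import math

def _fp_selftest():
    # Fail loudly at import time if basic floating point math is broken.
    if int(math.log(30, 2) + 0.5) != 5:
        raise AssertionError("floating point state looks corrupted (log2)")
    if abs(math.log(10) - 2.302585092994046) > 1e-12:
        raise AssertionError("floating point state looks corrupted (ln)")

_fp_selftest()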

comment:46 Changed at 2012-03-01T06:16:03Z by rycee

Sorry about the vagueness. I used the upstream cryptopp560.zip and cryptopp561.zip downloaded from http://www.cryptopp.com/. The 5.6.0 makefile didn't have a target for a shared library, so I copied that build target from 5.6.1; this is the only change I made to them. Building the cryptest.exe program and running it with ./cryptest.exe v didn't show any problems. Similarly, there were no errors reported for ./cryptest.exe tv TestVectors/all.txt.

comment:47 Changed at 2012-03-01T06:57:17Z by zooko

Hm, well okay so we're pretty sure there is a bug in Crypto++ 5.6.0 that is fixed in Crypto++ 5.6.1, right? Maybe we could let it go at that, since it is apparently fixed. The steps to reproduce are:

  1. Get and build Crypto++ 5.6.0
  2. Build pycryptopp with the --disable-embedded-cryptopp option so that it will link against the .so generated in step 1 instead of using its own bundled copy of Crypto++ source code
  3. Run python setup.py test -s allmydata.test.test_cli.Backup.test_backup

On your machine this 100% reproducibly triggers the problem, and swapping in Crypto++ 5.6.1 for 5.6.0 makes it stop happening?

Okay, one more thing: how do I know I don't have the same bug in the copy of Crypto++ that is bundled inside pycryptopp? I don't think we've upgraded the embedded copy of Crypto++ to a newer version of Crypto++ since this patch [91f454b47f66636fffde68e8a8bf4364bed7516e]:

commit 91f454b47f66636fffde68e8a8bf4364bed7516e
Author: weidai <weidai>
Date:   Sat Feb 2 00:40:27 2008 -0800

That was four years ago, before Crypto++ 5.6.0 was released. Since then we've just been cherry-picking patches from upstream Crypto++. So if this bug was present in the version that we copied, and if we didn't realize that we needed to cherry-pick the patch that fixes it, then it could be present in the embedded copy of Crypto++ inside pycryptopp.

Could you please replace step 2 above with a build that omits the --disable-embedded-cryptopp flag, to force it to use its embedded version of Crypto++? I think you've probably done that already in this process, but if you could do it again now just to be sure, that would make me feel better.

comment:48 follow-up: Changed at 2012-03-01T20:55:57Z by rycee

Ok, I downloaded pycryptopp version 0.5.29 and did two builds with it: once with --disable-embedded-cryptopp and once without. I then ran python setup.py test on both. For the build that used the embedded version of Crypto++ this finished without any errors. For the one using the dynamically linked Crypto++ (v 5.6.0) it ended in failure, and I'll make an attachment containing the output of the failed run.

Note, when I used --disable-embedded-cryptopp it was built against the Debian version of Crypto++ 5.6.0 since I don't know how to convince setup.py to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use LD_LIBRARY_PATH to force it to load the upstream Crypto++ 5.6.0. Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

I'm not entirely sure how to use a custom pycryptopp in a Tahoe build, so I didn't try to use my two pycryptopp builds in Tahoe, but I suppose the pycryptopp test is telling enough. Also, when I run the Tahoe build now (having uninstalled the Debian pycryptopp) it downloads and builds pycryptopp using the embedded Crypto++; and as previously seen those builds are doing just fine.

Anyway, I think it's safe to say that the Crypto++ embedded in pycryptopp doesn't exhibit the problem I encountered.

Changed at 2012-03-01T20:57:17Z by rycee

Output from pycryptopp test when built with --disable-embedded-cryptopp

comment:49 in reply to: ↑ 48 ; follow-up: Changed at 2012-03-02T00:18:09Z by davidsarah

Replying to rycee:

Note, when I used --disable-embedded-cryptopp it was built against the Debian version of Crypto++ 5.6.0 since I don't know how to convince setup.py to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use LD_LIBRARY_PATH to force it to load the upstream Crypto++ 5.6.0.

This is almost sufficient to convince me that this is a fixed bug in Crypto++ (despite it not being clear which change fixed it). However, we haven't excluded the possibility that the problem interacts with the dynamic linking, i.e. that it fails in non-embedded mode.

Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.

comment:50 Changed at 2012-03-02T05:25:30Z by zooko

I agree to close this ticket as "invalid".

About the undefined symbol error, it sounds like rycee compiled it against the Crypto++ 5.6.0 headers provided by Debian, and then used LD_LIBRARY_PATH to force it to load the Crypto++ 5.6.1 dynamic link library. That's not a safe thing to do and nobody, or at least not us, can do anything about that giving undefined symbol errors. Am I right, or am I confused?

comment:51 in reply to: ↑ 49 ; follow-up: Changed at 2012-03-02T10:13:45Z by rycee

Replying to davidsarah:

Replying to rycee:

Note, when I used --disable-embedded-cryptopp it was built against the Debian version of Crypto++ 5.6.0 since I don't know how to convince setup.py to look for headers and libraries in a non-standard directory. It still fails in the same way, though, if I use LD_LIBRARY_PATH to force it to load the upstream Crypto++ 5.6.0.

This is almost sufficient to convince me that this is a fixed bug in Crypto++ (despite it not being clear which change fixed it). However, we haven't excluded the possibility that the problem interacts with the dynamic linking, i.e. that it fails in non-embedded mode.

Ok, so I tried a new way. I unpacked pycryptopp twice. In one of the trees I replaced the embeddedcryptopp directory with a pristine upstream Crypto++ 5.6.0 and in the other I replaced embeddedcryptopp with a pristine upstream Crypto++ 5.6.1. I then did a python setup.py build followed by python setup.py test in each of the two trees. The pycryptopp statically built with 5.6.0 failed the test (in the same way as the previously attached test log) while the build using Crypto++ 5.6.1 ran all tests successfully.

I'm a bit curious, however, as to why the cryptest.exe program from Crypto++ 5.6.0 didn't fail; maybe the failing test from pycryptopp could be ported to a test in cryptest.exe?

Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.

It's like zooko said. Since I don't know how to make setup.py use a custom header directory, it built against the standard Debian Crypto++ 5.6.0 headers.

However, curiously enough, when I used Debian's pycryptopp – which is dynamically linked against Debian's Crypto++ – I was able to change out the .so file from 5.6.0 to 5.6.1 without problem. Doing this is how I found out that 5.6.1 worked. I suppose it has something to do with the patches and how the Debian pycryptopp and Crypto++ packages are built.

comment:52 in reply to: ↑ 51 Changed at 2012-03-03T01:19:28Z by davidsarah

Replying to rycee:

Replying to davidsarah:

Replying to rycee:

[...]

I'm a bit curious, however, as to why the cryptest.exe program from Crypto++ 5.6.0 didn't fail; maybe the failing test from pycryptopp could be ported to a test in cryptest.exe?

The bug seems to affect the floating point state, and it's quite plausible that cryptest.exe doesn't test for that. A regression test would indeed be nice -- if you're sufficiently motivated then I suggest asking on the Crypto++ mailing list.

Using the same technique to force it to use Crypto++ 5.6.1 results in an undefined symbol error.

It sounds like we should open a new (pycryptopp) ticket for that, since at some point Debian or Ubuntu will upgrade to 5.6.1 and then it may fail with the same error.

It's like zooko said. Since I don't know how to make setup.py use a custom header directory, it built against the standard Debian Crypto++ 5.6.0 headers.

Yes, as Zooko said this is an invalid test.

However, curiously enough, when I used Debian's pycryptopp – which is dynamically linked against Debian's Crypto++ – I was able to change out the .so file from 5.6.0 to 5.6.1 without problem. Doing this is how I found out that 5.6.1 worked. I suppose it has something to do with the patches and how the Debian pycryptopp and Crypto++ packages are built.

Ah, so if that worked then it is not the dynamic linking. Okay, let's stop worrying about this one.

comment:53 Changed at 2012-03-05T22:48:47Z by zooko

Okay, so we think this is a bug that was introduced into Crypto++ after we stopped updating pycryptopp's embedded copy of Crypto++ to track upstream Crypto++, and was then released in Crypto++ v5.6.0, and then was fixed in Crypto++ v5.6.1. It would be nice if we knew more precisely what bug was introduced and what patch fixed it, or if we at least had a reliable regression test that would tell us if we later introduce this bug into pycryptopp, but I'm not sure how to go about that and it might not be worth the effort.

So, shall we close this ticket now?

comment:54 Changed at 2012-03-06T09:03:11Z by rycee

This bug should certainly be closed as invalid or something similar. I may explore Crypto++ a bit further as time and interest permits but will raise such a discussion in a more suitable forum.

Thanks for your help in tracking this down. It's been a tricky issue and I'm happy it resolved itself so pleasantly.

comment:55 Changed at 2012-03-06T11:55:58Z by zooko

  • Resolution set to invalid
  • Status changed from new to closed

Okay, thanks a lot for your help!
