#1976 assigned defect

SFTP+SSHFS hangs for second concurrent operation

Reported by: luckyredhot
Owned by: daira
Priority: normal
Milestone: undecided
Component: code-frontend-ftp-sftp
Version: 1.10.0
Keywords: sftp sshfs hang reliability
Cc:
Launchpad Bug:

Description (last modified by daira)

I am using the Tahoe-LAFS SFTP frontend with SSHFS on Ubuntu 12.04. If I run a second operation (simply "ls" or "du") while a first write is still in progress, the second one sometimes hangs completely. It does not even stop when sent SIGKILL, so I have to kill the parent bash session.

Tahoe-LAFS versions 1.9.2 and 1.10.0 are both affected.

SSHFS mount options:

sshfs -p 8022 -o uid=33 -o gid=33 -o nonempty -o allow_other -o idmap=user tahoe@127.0.0.1:/ /mnt/tahoe

If this is an SFTP issue, it should be fixed. If it is an SSHFS issue, then we probably have to find another client or a workaround (perhaps two sshfs mounts, one for writing and one for reading).

Any help is appreciated :) Please also suggest commands I can run when the issue occurs to gather debug information.

Thanks!

Attachments (3)

tahoe_version (376 bytes) - added by luckyredhot at 2013-05-23T16:36:21Z.
tahoe --version
Tahoe-LAFS_SSHFS_Debug_001.log (2.5 KB) - added by luckyredhot at 2013-06-11T09:22:22Z.
incident-2013-06-11--12-22-26Z-oqcgkpa.flog.bz2 (30.4 KB) - added by luckyredhot at 2013-06-14T15:06:42Z.
incident file


Change History (13)

Changed at 2013-05-23T16:36:21Z by luckyredhot

tahoe --version

comment:1 Changed at 2013-05-23T16:44:58Z by daira

  • Description modified (diff)
  • Keywords sftp hang reliability added; ftps removed
  • Owner set to daira
  • Status changed from new to assigned
  • Summary changed from FTPS+SSHFS hangs for second operation to SFTP+SSHFS hangs for second concurrent operation

To get debugging output from sshfs, restart it in the foreground with options:

-o debug,sshfs_debug,loglevel=debug

To get debugging output from the gateway, see the Realtime Logging section of docs/logging.rst.
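
As a rough sketch, the debug options above could be combined with the mount command from the description as follows. The helper name `sshfs_debug_cmd` is purely illustrative (it is not part of sshfs or Tahoe-LAFS), and the port, remote and mount point are simply the values reported in this ticket. The helper only prints the command so that it can be reviewed first.

```shell
#!/bin/sh
# Mount options copied from the ticket description.
BASE_OPTS='-o uid=33 -o gid=33 -o nonempty -o allow_other -o idmap=user'
# Debug options suggested in this comment.
DEBUG_OPTS='-o debug,sshfs_debug,loglevel=debug'

# Hypothetical helper (not part of sshfs or Tahoe-LAFS): print the full
# foreground command line so it can be reviewed before running it.
sshfs_debug_cmd() {
    port=${1:-8022}                 # SFTP port from the ticket
    remote=${2:-tahoe@127.0.0.1:/}  # remote from the ticket
    mountpoint=${3:-/mnt/tahoe}     # mount point from the ticket
    # -f keeps sshfs in the foreground so debug output stays on the terminal.
    printf 'sshfs -f -p %s %s %s %s %s\n' \
        "$port" "$BASE_OPTS" "$DEBUG_OPTS" "$remote" "$mountpoint"
}

# Print the command with the ticket's default values.
sshfs_debug_cmd
```

The existing mount would need to be released first (for example with `fusermount -u /mnt/tahoe`) before running the printed command in a spare terminal.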

comment:2 Changed at 2013-06-11T09:20:34Z by luckyredhot

OK, I've caught the issue. It happens when:

  1. One write operation is in progress (I am constantly copying files to a grid folder).
  2. A second operation tries to get a listing or attributes. It usually does not happen on the first attempt, but running "ls" repeatedly causes all operations to freeze for a long time. In my case there were only 5 files in the folder, but the "ls" operation took 40 (!) seconds. With hundreds of files it would take forever.

See the attached logs. I issued ls before the [80576] LSTAT entry.
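
The two steps above can be sketched as a small script that copies files in the background while timing repeated listings. This is only an assumed reproduction harness, not something taken from the attached logs: the function names `time_listing` and `reproduce`, the one-second pause, and the default of 10 rounds are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: time a single "ls -l" of a directory, in whole seconds,
# and print a UTC timestamp so the result can be matched against other logs.
time_listing() {
    dir=$1
    start=$(date +%s)
    ls -l "$dir" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    printf '%s elapsed=%ss exit=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$((end - start))" "$status"
}

# Hypothetical driver: step 1 (a write in progress) runs in the background,
# step 2 (repeated listings) runs in the foreground.
reproduce() {
    src=$1              # local directory to copy (assumption)
    mnt=$2              # sshfs mount point, e.g. /mnt/tahoe
    rounds=${3:-10}     # number of listings to time (assumption)
    cp -r "$src" "$mnt"/ &
    writer=$!
    i=0
    while [ "$i" -lt "$rounds" ]; do
        time_listing "$mnt"
        sleep 1
        i=$((i + 1))
    done
    wait "$writer"
}
```

A call might look like `reproduce ~/testfiles /mnt/tahoe 20`. Since a hung "ls" reportedly ignores SIGKILL, it is safer to run this from a shell session that can be discarded.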

Changed at 2013-06-11T09:22:22Z by luckyredhot

comment:3 Changed at 2013-06-12T05:39:12Z by zooko

Thanks for the bug report, luckyredhot! Is there any incident report file generated by the LAFS gateway when this happens? If not, could you force it to generate one? See wiki:HowToReportABug for instructions.

Changed at 2013-06-14T15:06:42Z by luckyredhot

incident file

comment:4 Changed at 2013-06-14T15:07:12Z by luckyredhot

The incident file has been attached. I hope it will be helpful.

comment:5 Changed at 2013-06-21T07:40:15Z by luckyredhot

Daira, thanks for yesterday's analysis. What next steps can we take? Should I try to reproduce the issue once more to get additional logs? Or do you think upgrading to 1.10 might also help? (AFAIK SFTP wasn't modified there since 1.9.2.)

comment:6 Changed at 2013-06-21T09:32:28Z by daira

SFTP was actually modified in 1.10 to improve error handling; I doubt it affects this bug, but it may help slightly in debugging. I'm going to try to reproduce the problem myself, but please feel free to attach another log, since the file incident-2013-06-11--12-22-26Z-oqcgkpa.flog.bz2 seems to be corrupted in some way.

It's unfortunate that the sshfs debug log doesn't include timestamps that could be correlated with the foolscap log.
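
One possible way around the missing timestamps, sketched here under assumptions, is to pipe the sshfs debug output through a filter that prefixes each line with the current UTC time. The function name `stamp_lines` is hypothetical, and the timestamps have only one-second resolution and record when each line was read, not when sshfs produced it.

```shell
#!/bin/sh
# Hypothetical filter: prefix every line read from stdin with a UTC timestamp.
stamp_lines() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}

# Assumed usage (the mount arguments are those from the ticket description):
#   sshfs -f -p 8022 -o debug,sshfs_debug,loglevel=debug \
#       tahoe@127.0.0.1:/ /mnt/tahoe 2>&1 | stamp_lines > sshfs-debug.log
```

Redirecting stderr into the pipe with `2>&1` means the filter sees the debug output whichever stream it is written to.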

comment:7 follow-up: Changed at 2013-06-25T06:44:35Z by luckyredhot

What do you think about performing a partial grid upgrade (for example, upgrading 2 of the existing 5 nodes to 1.10) and trying to reproduce the issue on both the 1.9.2 and 1.10 nodes of the same grid? Does that sound reasonable?

comment:8 in reply to: ↑ 7 Changed at 2013-06-25T15:16:33Z by zooko

Replying to luckyredhot:

What do you think about performing a partial grid upgrade (for example, upgrading 2 of the existing 5 nodes to 1.10) and trying to reproduce the issue on both the 1.9.2 and 1.10 nodes of the same grid? Does that sound reasonable?

Dear Oleksandr:

I would assume that the storage servers have nothing to do with this bug. However, since I don't understand this bug, maybe my assumption is bad.

However, I suspect you'd get better debugging information for your effort if you tried different versions of Tahoe-LAFS for the gateway rather than for the servers.

comment:9 Changed at 2013-06-26T00:23:26Z by daira

I agree with zooko that this is unlikely to be related to the storage server versions.

comment:10 Changed at 2014-12-02T19:50:10Z by warner

  • Component changed from code-frontend to code-frontend-ftp-sftp