#1201 assigned defect

'rm' via sftp+sshfs may hang if previous operations on the file are "stuck"

Reported by: bj0 Owned by: davidsarah
Priority: major Milestone: soon
Component: code-frontend-ftp-sftp Version: 1.8β
Keywords: hang sftp sshfs reliability Cc:
Launchpad Bug:

Description (last modified by warner)

I was trying to figure out what part of the bonnie++ benchmark was failing on tahoe (via sshfs), so I mounted the sshfs interface and ran:

> bonnie -d ./mnt -s 100m  -n 0  -r 50m
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...
^C
Can't read data from ITC pipe.
Can't read from pipe, expected 32, got -1.

It hung for a while after "start 'em", so I checked the flog and there was no activity, so I hit Ctr-C, which gave me those 2 errors and then the interface hung.

I had to kill -9 sshfs to get everything to free up, then I re-mounted the sshfs interface, did a:

> ls ./mnt
Bonnie.3520

> rm ./mnt/Bonnie.3520

And it hung on the 'rm'.

I had flogtool running for the 'rm', and the output was (below).

After restarting the tahoe client process and then reconnecting via sshfs, I was able to delete the file.

17:48:43.537 L20 []#12940179 SUCCESS ".getAttrs('/.', followLinks=0)" {'permissions': 16895}
17:48:47.416 L20 []#12940180 <SFTPUserHandler #3>(bjp): .removeFile('/Bonnie.3520')
17:48:47.416 L10 []#12940181 <SFTPUserHandler #3>(bjp): CONVERT '/Bonnie.3520'
17:48:47.416 L10 []#12940182 <SFTPUserHandler #3>(bjp):  PATH [u'Bonnie.3520']
17:48:47.416 L20 []#12940183 <SFTPUserHandler #3>(bjp): ._abandon_any_heisenfiles('/Bonnie.3520', 'URI:DIR2:unzmpoxlhyvhhhe37tenmmk5km:nfifb3ehimdhep3qbkxfimheznfgfrgc6hen5piwim3kda5zykca/Bonnie.3520')
17:48:47.416 L10 []#12940184 <SFTPUserHandler #3>(bjp): files = [<GeneralSFTPFile #7>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>] in "._abandon_any_heisenfiles('/Bonnie.3520', 'URI:DIR2:unzmpoxlhyvhhhe37tenmmk5km:nfifb3ehimdhep3qbkxfimheznfgfrgc6hen5piwim3kda5zykca/Bonnie.3520')"
17:48:47.416 L20 []#12940185 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940186 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940187 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940188 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940189 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940190 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940191 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940192 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940193 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940194 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940195 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.416 L20 []#12940196 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940197 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940198 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940199 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940200 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940201 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940202 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940203 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940204 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940205 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940206 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940207 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940208 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940209 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940210 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940211 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940212 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940213 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940214 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940215 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940216 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940217 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940218 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940219 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940220 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940221 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940222 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940223 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940224 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940225 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940226 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940227 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940228 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940229 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon()
17:48:47.417 L20 []#12940230 <GeneralSFTPFile #7>(/Bonnie.3520): .sync()


Attachments (2)

stra2 (5.2 MB) - added by bj0 at 2010-09-19T03:30:41Z.
strace of the bonnie hang
stra4 (7.6 KB) - added by bj0 at 2010-09-19T03:38:35Z.
strace of 'rm' hang

Change History (7)

comment:1 Changed at 2010-09-19T02:28:04Z by davidsarah

  • Owner changed from nobody to davidsarah
  • Status changed from new to assigned

The most likely explanation is that the .sync on <GeneralSFTPFile #7>(/Bonnie.3520) is hanging because some previous operation on that file object failed to complete. By design, operations that would be synchronous in a POSIX filesystem (like unlink/delete) wait for previous operations on the same file, which is necessary to avoid race conditions. (Otherwise, create-followed-by-delete could be reordered to delete-followed-by-create, which would leave the file in place.)

When an SFTP client connection is closed, we still may have operations that were performed on that connection that need to complete, so we don't just drop open handles to a file at that point. However, there's no timeout for how long an operation takes, so operations that never complete can result in a file becoming "stuck" and undeletable, until the gateway is restarted. This seems to be much more likely to happen if the sshfs process is killed (even though the server should be able to tolerate that). I'll have to have a good think about how to resolve this; it isn't obvious how to do it without risking integrity.

(The calls to .abandon mark the file handle as being immediately closeable, but in this case it isn't closed before the .sync, so that doesn't help.)

It's a bit strange that there are so many copies of the same file object in the heisenfile dicts (shown after 'files =' in line 6 of the log). There should normally be no more than two per open handle to a file.

comment:2 Changed at 2010-09-19T02:35:26Z by davidsarah

  • Component changed from unknown to code-frontend
  • Milestone changed from undecided to soon
  • Summary changed from 'rm' via sftp+sshfs to 'rm' via sftp+sshfs may hang if previous operations on the file are "stuck"

comment:3 Changed at 2010-09-19T02:48:34Z by davidsarah

Ah, it's all coming back to me now. I'd originally intended to close any open files on an SFTP connection when the connection was dropped (as it is when the sshfs process is killed), but then realized that is incorrect and might lead to data loss. So that explains why the comment at src/allmydata/frontends/sftpd.py@4545#L837 (which claims that this bug can't happen!) is incorrect.

Changed at 2010-09-19T03:30:41Z by bj0

strace of the bonnie hang

Changed at 2010-09-19T03:38:35Z by bj0

strace of 'rm' hang

comment:4 Changed at 2010-09-19T03:40:24Z by bj0

Ok I posted the straces, after i 'kill -9' sshfs, the rest of the traces come out:

bonnie:

read(4, 0x7fff57d767d0, 32)             = ? ERESTARTSYS (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
write(2, "Can't read data from ITC pipe.\n", 31) = 31
write(2, "Can't read from pipe, expected 3"..., 43) = 43
unlink("./Bonnie.9228")                 = -1 ENOTCONN (Transport endpoint is not connected)
munmap(0x7f49d3f63000, 8392704)         = 0
close(4)                                = 0
close(7)                                = 0
close(6)                                = 0
close(5)                                = 0
exit_group(1)                           = ?

and rm:

unlinkat(AT_FDCWD, "mnt/Bonnie.9228", 0) = -1 ECONNABORTED (Software caused connection abort)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=613, ...}) = 0
mmap(NULL, 613, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f578fbc7000
close(3)                                = 0
write(2, "rm: ", 4)                     = 4
write(2, "cannot remove `mnt/Bonnie.9228'", 31) = 31
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, ": Software caused connection abo"..., 34) = 34
write(2, "\n", 1)                       = 1
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(1)                           = ?

comment:5 Changed at 2014-12-02T19:44:17Z by warner

  • Component changed from code-frontend to code-frontend-ftp-sftp
  • Description modified (diff)
Note: See TracTickets for help on using tickets.