Opened at 2010-09-19T01:54:04Z
Last modified at 2014-12-02T19:44:17Z
#1201 assigned defect
'rm' via sftp+sshfs may hang if previous operations on the file are "stuck"
Reported by: | bj0 | Owned by: | davidsarah |
---|---|---|---|
Priority: | major | Milestone: | soon |
Component: | code-frontend-ftp-sftp | Version: | 1.8β |
Keywords: | hang sftp sshfs reliability | Cc: | |
Launchpad Bug: |
Description (last modified by warner)
I was trying to figure out what part of the bonnie++ benchmark was failing on tahoe (via sshfs), so I mounted the sshfs interface and ran:
> bonnie -d ./mnt -s 100m -n 0 -r 50m Writing a byte at a time...done Writing intelligently...done Rewriting...done Reading a byte at a time...done Reading intelligently...done start 'em... ^C Can't read data from ITC pipe. Can't read from pipe, expected 32, got -1.
It hung for a while after "start 'em", so I checked the flog and there was no activity, so I hit Ctr-C, which gave me those 2 errors and then the interface hung.
I had to kill -9 sshfs to get everything to free up, then I re-mounted the sshfs interface, did a:
> ls ./mnt Bonnie.3520 > rm ./mnt/Bonnie.3520
And it hung on the 'rm'.
I had flogtool running for the 'rm', and the output was (below).
After restarting the tahoe client process and then reconnecting via sshfs, I was able to delete the file.
17:48:43.537 L20 []#12940179 SUCCESS ".getAttrs('/.', followLinks=0)" {'permissions': 16895} 17:48:47.416 L20 []#12940180 <SFTPUserHandler #3>(bjp): .removeFile('/Bonnie.3520') 17:48:47.416 L10 []#12940181 <SFTPUserHandler #3>(bjp): CONVERT '/Bonnie.3520' 17:48:47.416 L10 []#12940182 <SFTPUserHandler #3>(bjp): PATH [u'Bonnie.3520'] 17:48:47.416 L20 []#12940183 <SFTPUserHandler #3>(bjp): ._abandon_any_heisenfiles('/Bonnie.3520', 'URI:DIR2:unzmpoxlhyvhhhe37tenmmk5km:nfifb3ehimdhep3qbkxfimheznfgfrgc6hen5piwim3kda5zykca/Bonnie.3520') 17:48:47.416 L10 []#12940184 <SFTPUserHandler #3>(bjp): files = [<GeneralSFTPFile #7>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>, <GeneralSFTPFile #7>, <GeneralSFTPFile #8>, <GeneralSFTPFile #9>, <GeneralSFTPFile #10>, <GeneralSFTPFile #11>] in "._abandon_any_heisenfiles('/Bonnie.3520', 'URI:DIR2:unzmpoxlhyvhhhe37tenmmk5km:nfifb3ehimdhep3qbkxfimheznfgfrgc6hen5piwim3kda5zykca/Bonnie.3520')" 17:48:47.416 L20 []#12940185 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940186 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940187 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940188 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940189 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940190 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940191 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940192 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940193 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940194 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940195 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.416 L20 []#12940196 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940197 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940198 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940199 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940200 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940201 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940202 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940203 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940204 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940205 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940206 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940207 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940208 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940209 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940210 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940211 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940212 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940213 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940214 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940215 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940216 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940217 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940218 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940219 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940220 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940221 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940222 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940223 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940224 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940225 <GeneralSFTPFile #7>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940226 <GeneralSFTPFile #8>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940227 <GeneralSFTPFile #9>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940228 <GeneralSFTPFile #10>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940229 <GeneralSFTPFile #11>(/Bonnie.3520): .abandon() 17:48:47.417 L20 []#12940230 <GeneralSFTPFile #7>(/Bonnie.3520): .sync()
Attachments (2)
Change History (7)
comment:1 Changed at 2010-09-19T02:28:04Z by davidsarah
- Owner changed from nobody to davidsarah
- Status changed from new to assigned
comment:2 Changed at 2010-09-19T02:35:26Z by davidsarah
- Component changed from unknown to code-frontend
- Milestone changed from undecided to soon
- Summary changed from 'rm' via sftp+sshfs to 'rm' via sftp+sshfs may hang if previous operations on the file are "stuck"
comment:3 Changed at 2010-09-19T02:48:34Z by davidsarah
Ah, it's all coming back to me now. I'd originally intended to close any open files on an SFTP connection when the connection was dropped (as it is when the sshfs process is killed), but then realized that is incorrect and might lead to data loss. So that explains why the comment at src/allmydata/frontends/sftpd.py@4545#L837 (which claims that this bug can't happen!) is incorrect.
comment:4 Changed at 2010-09-19T03:40:24Z by bj0
Ok I posted the straces, after i 'kill -9' sshfs, the rest of the traces come out:
bonnie:
read(4, 0x7fff57d767d0, 32) = ? ERESTARTSYS (To be restarted) --- SIGINT (Interrupt) @ 0 (0) --- rt_sigreturn(0x2) = -1 EINTR (Interrupted system call) write(2, "Can't read data from ITC pipe.\n", 31) = 31 write(2, "Can't read from pipe, expected 3"..., 43) = 43 unlink("./Bonnie.9228") = -1 ENOTCONN (Transport endpoint is not connected) munmap(0x7f49d3f63000, 8392704) = 0 close(4) = 0 close(7) = 0 close(6) = 0 close(5) = 0 exit_group(1) = ?
and rm:
unlinkat(AT_FDCWD, "mnt/Bonnie.9228", 0) = -1 ECONNABORTED (Software caused connection abort) open("/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=613, ...}) = 0 mmap(NULL, 613, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f578fbc7000 close(3) = 0 write(2, "rm: ", 4) = 4 write(2, "cannot remove `mnt/Bonnie.9228'", 31) = 31 open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) open("/usr/share/locale-langpack/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory) write(2, ": Software caused connection abo"..., 34) = 34 write(2, "\n", 1) = 1 close(0) = 0 close(1) = 0 close(2) = 0 exit_group(1) = ?
comment:5 Changed at 2014-12-02T19:44:17Z by warner
- Component changed from code-frontend to code-frontend-ftp-sftp
- Description modified (diff)
The most likely explanation is that the .sync on <GeneralSFTPFile #7>(/Bonnie.3520) is hanging because some previous operation on that file object failed to complete. By design, operations that would be synchronous in a POSIX filesystem (like unlink/delete) wait for previous operations on the same file, which is necessary to avoid race conditions. (Otherwise, create-followed-by-delete could be reordered to delete-followed-by-create, which would leave the file in place.)
When an SFTP client connection is closed, we still may have operations that were performed on that connection that need to complete, so we don't just drop open handles to a file at that point. However, there's no timeout for how long an operation takes, so operations that never complete can result in a file becoming "stuck" and undeletable, until the gateway is restarted. This seems to be much more likely to happen if the sshfs process is killed (even though the server should be able to tolerate that). I'll have to have a good think about how to resolve this; it isn't obvious how to do it without risking integrity.
(The calls to .abandon mark the file handle as being immediately closeable, but in this case it isn't closed before the .sync, so that doesn't help.)
It's a bit strange that there are so many copies of the same file object in the heisenfile dicts (shown after 'files =' in line 6 of the log). There should normally be no more than two per open handle to a file.