[tahoe-lafs-trac-stream] [tahoe-lafs] #1937: back up the content of a file even if the content changes without changing mtime
tahoe-lafs
trac at tahoe-lafs.org
Wed Mar 27 18:31:06 UTC 2013
#1937: back up the content of a file even if the content changes without changing
mtime
-------------------------------------------------+-------------------------
 Reporter:  zooko                                |  Owner:
     Type:  defect                               |  Status:  new
 Priority:  normal                               |  Milestone:  undecided
Component:  code                                 |  Version:  1.9.2
 Keywords:  tahoe-backup reliability preservation|  Launchpad Bug:
-------------------------------------------------+-------------------------
From [//pipermail/tahoe-dev/2008-September/000809.html].
If an application writes to a file twice in quick succession, then the
operating system may give that file the same {{{mtime}}} value both times.
{{{mtime}}} granularity varies between OSes and filesystems, and is often
coarser than you would wish:
¹ http://www.infosec.jmu.edu/documents/jmu-infosec-tr-2009-002.pdf
² http://msdn.microsoft.com/en-us/library/windows/desktop/ms724290%28v=vs.85%29.aspx
 * Linux/ext3 - 1 sec [¹]
 * Linux/ext4 - 1 nanosec [¹]; actually 1 millisec (observed by my
   experiment just now on Linux 3.2, ext4)
 * FreeBSD/UFS - 1 sec [¹]
 * Mac - 1 sec [¹]
 * Windows/FAT - 2 sec, no timezone; when DST changes it is off by one
   hour until the next reboot [¹]
 * Windows/NTFS - 100 nanosec [¹]; possibly actually 1.6 microsec [²]?
 * Windows/* - {{{mtime}}} isn't necessarily updated until the filehandle
   is closed [¹, ²]
Note that FAT is the standard filesystem for removable media (isn't it?),
so it is actually very common.
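
To check the granularity on a particular machine, something like the following rough sketch could be used. It rewrites a scratch file many times and reports the smallest gap between the distinct {{{mtime}}} values it sees. The function names, the number of writes, and the pause are arbitrary choices for illustration, and the result is only an upper bound on the real granularity, since it depends on how quickly the writes happen.

```python
import os
import tempfile
import time


def observed_mtime_steps(directory, writes=200, pause=0.0005):
    """Rewrite a scratch file repeatedly and collect the distinct mtimes seen.

    Returns a sorted list of st_mtime_ns values (nanoseconds since the epoch).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    seen = set()
    try:
        for i in range(writes):
            with open(path, "w") as f:
                f.write(str(i))
            seen.add(os.stat(path).st_mtime_ns)
            time.sleep(pause)
    finally:
        os.remove(path)
    return sorted(seen)


def smallest_step_ns(stamps):
    """Smallest gap between consecutive distinct timestamps, or None."""
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return min(gaps) if gaps else None


if __name__ == "__main__":
    stamps = observed_mtime_steps(tempfile.gettempdir())
    print("distinct mtimes:", len(stamps))
    print("smallest observed step (ns):", smallest_step_ns(stamps))
```

Pass the directory of the filesystem you care about (for example a mounted FAT stick) instead of the temporary directory.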
Now the problem is, what happens if
1. an application writes some data, `D1` into a file, and the timestamp
gets updated to `T1`, and then
2. {{{tahoe backup}}} reads `D1`, and then
3. the app writes some new data, `D2`, and the timestamp doesn't get
updated because steps 2 and 3 happened within the filesystem's
granularity?
What happens is that {{{tahoe backup}}} has saved `D1`, but from then on
it will never save `D2`: because the timestamp is still `T1`, it falsely
believes it has already saved the current contents. If this were to
happen in practice, the effect for the user would be that when they go to
read the file from Tahoe-LAFS, they find the previous version of its
contents (`D1`) and not the most recent version (`D2`). This unfortunate
user would probably not have any way to figure out what happened, and
would justly blame Tahoe-LAFS for being unreliable.
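
The three steps can be reproduced with a toy model. The sketch below is not Tahoe-LAFS code: `MtimeOnlyBackup` is a hypothetical stand-in that, like any timestamp-based change detector, skips a file whose {{{mtime}}} matches the value recorded on the previous run. To make the race deterministic, it uses `os.utime` to put the timestamp back to `T1` after writing `D2`, simulating a write that lands within the filesystem's granularity.

```python
import os
import tempfile


class MtimeOnlyBackup:
    """Toy backup that re-reads a file only when its mtime has changed.

    Hypothetical illustration; not Tahoe-LAFS code.
    """

    def __init__(self):
        self.last_mtime = {}  # path -> st_mtime_ns recorded at last backup
        self.saved = {}       # path -> content saved at last backup

    def run(self, path):
        mtime = os.stat(path).st_mtime_ns
        if self.last_mtime.get(path) == mtime:
            return False  # timestamp unchanged: assume content unchanged
        with open(path, "rb") as f:
            self.saved[path] = f.read()
        self.last_mtime[path] = mtime
        return True


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        backup = MtimeOnlyBackup()

        # Step 1: the app writes D1; the timestamp becomes T1.
        with open(path, "wb") as f:
            f.write(b"D1")
        t1 = os.stat(path)

        # Step 2: the backup reads D1.
        first = backup.run(path)

        # Step 3: the app writes D2. Force the timestamp back to T1 to
        # simulate a write that falls within the mtime granularity.
        with open(path, "wb") as f:
            f.write(b"D2")
        os.utime(path, ns=(t1.st_atime_ns, t1.st_mtime_ns))

        # This and every later run skips the file, so D2 is never saved.
        second = backup.run(path)
        return first, second, backup.saved[path]
    finally:
        os.remove(path)
```

Calling `demo()` shows the first run saving the file, the second run skipping it, and the saved content still being `D1`.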
The same problem can happen if the timestamp of a file gets reset to an
earlier value, such as with the {{{touch -t}}} unix command, or by the
system clock getting moved. (The system clock getting moved happens
surprisingly often in the wild.)
A user can avoid this problem by passing {{{--ignore-timestamps}}} to
{{{tahoe backup}}}, which will cause that run of {{{tahoe backup}}} to
reupload every file. That is very expensive in terms of time, disk, and
CPU usage (even if the files get deduplicated by the servers).
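
For comparison, a change check that does not rely on timestamps would have to look at the content itself. The following is a minimal sketch of that idea, assuming a digest of each file was recorded at the previous backup. It is not an existing {{{tahoe backup}}} option, and `file_digest` and `needs_upload` are hypothetical names. It still costs a full local read and hash of every file, but would avoid re-uploading files whose content is unchanged.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file's content, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_upload(path, previous_digest):
    """True if the content differs from what was recorded at the last backup."""
    return file_digest(path) != previous_digest
```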
--
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1937>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage