[Bugs] [Bug 1138970] file corruption during concurrent read/write

Mon Oct 13 11:44:09 UTC 2014

https://bugzilla.redhat.com/show_bug.cgi?id=1138970

--- Comment #11 from Raghavendra G <rgowdapp at redhat.com> ---
The culprit seems to be stale stat information cached in
stat-prefetch/md-cache. Consider following sequence of events:

1. stat is cached with size=10 in md-cache.
2. write of size 10 bytes happens, and stat is flushed.
3. write is _NOT_ flushed from write-behind and _IS CACHED_.
4. An operation in its reply gives out stat information (with size=10, since
write was not flushed to brick) and this information is cached in md-cache.
5. Now, any fstat calls from applications would still get a stale stat with
size=10. This is because stat is served from md-cache and fstat call never
reaches write-behind and hence write is not flushed.

The evidence to this seems to be from following logs:
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.357652] I [trace.c:267:trace_writev_cbk] 0-ptop-trace:
2966: (op_ret=10, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0,
mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00],
mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]}, *postbuf =
{gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0,
gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00],
ctime=[Jan 01 05:30:00]})
[2014-10-13 10:28:23.357781] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2964: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357870] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2967:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.357891] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357931] I [trace.c:2287:trace_readv] 0-ptop-trace: 2968:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc, size=4096offset=0
flags=08000)
[2014-10-13 10:28:23.358140] I [trace.c:228:trace_readv_cbk] 0-ptop-trace:
2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=130
buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0,
gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00],
ctime=[Jan 01 05:30:00]
[2014-10-13 10:28:23.359398] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2969:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.359489] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)

As can be seen two writes are issued to same offset 120 above. Also see that
stat returns size as 120 even though a write of 10 bytes at offset 120 is
successful and reply is returned to application. For sake of brevity I am
pasting just write traces below at around offset 120:

[2014-10-13 10:28:23.356296] I [trace.c:2320:trace_writev] 0-ptop-trace: 2963:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=110
flags=08401)
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.365178] I [trace.c:2320:trace_writev] 0-ptop-trace: 2974:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=130
flags=08401)

However, there is one hole in this hypothesis. Since we pass the same flags to
underlying fs in storage/posix during open, file would be opened in O_APPEND
mode on bricks. From man 2 pwrite, about interaction of O_APPEND and pwrite,

<man 2 pwrite>
BUGS        
       POSIX requires that opening a file with the O_APPEND flag should have
       no effect on the location at which pwrite() writes data.  However, on
       Linux, if a file is opened with O_APPEND, pwrite() appends data to
       the end of the file, regardless of the value of offset.
</man 2 pwrite>

Also logs show that last write of 10 bytes happened at offset 970. But the file
size happens to 990 bytes as shown below.

[root at booradley 1138970]# ls -l /home/export/ptop/file 
-rw-r--r--. 2 root root 990 Oct 13 15:58 /home/export/ptop/file

So, whatever offset being passed to storage/posix is ignored by pwrite when
O_APPEND is present. This behaviour is also inline with the observation in
comment 8 that file on backend is no different when this issue is reported than
the case where no issue was reported.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
Unsubscribe from this bug https://bugzilla.redhat.com/token.cgi?t=16VQKl0b1w&a=cc_unsubscribe