[Bugs] [Bug 1153892] New: file corruption during concurrent read/write

bugzilla at redhat.com bugzilla at redhat.com
Fri Oct 17 05:14:44 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1153892

            Bug ID: 1153892
           Summary: file corruption during concurrent read/write
           Product: Red Hat Storage
           Version: 2.1
         Component: glusterfs
     Sub Component: core
          Severity: urgent
          Assignee: vbellur at redhat.com
          Reporter: nsathyan at redhat.com
        QA Contact: sdharane at redhat.com
                CC: bugs at gluster.org, gluster-bugs at redhat.com,
                    mike at luminatewireless.com, pkarampu at redhat.com,
                    rgowdapp at redhat.com, srangana at redhat.com
        Depends On: 1138970



+++ This bug was initially created as a clone of Bug #1138970 +++

Description of problem:

Using the Fuse client on a distributed, replicated volume results in
permanently corrupt data if the file is opened simultaneously for reading and
writing and the O_APPEND flag is set. This is a single reader, single writer
scenario.

The issue is that the Fuse client is sending the wrong offsets to the remote
glusterfsd daemons. I verified this with tcpdump and by stracing the target
glusterfsd processes.

I am confident this is a Fuse-only issue because the problem is not
reproducible when the volume is mounted via NFS.

There is a thread on gluster-users about this issue:

http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018652.html
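
The attached reproducer is not part of this mail. For context, a minimal
sketch of the scenario it exercises (illustrative names and record format,
not the attached script itself): a single writer appends fixed-size records
with O_APPEND while a single reader re-reads the file and checks that the
record ids are sequential.

    #!/usr/bin/env python
    # Illustrative sketch only (not the attached reproducer): one writer
    # appends fixed-size records with O_APPEND while one reader verifies
    # that the record ids it reads back are sequential.
    import os
    import struct
    import sys
    import threading

    RECORD_FMT = '>Q'                    # hypothetical record layout: one id
    RECORD_SIZE = struct.calcsize(RECORD_FMT)
    NUM_RECORDS = 100

    def writer(path):
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            for rec_id in range(NUM_RECORDS):
                os.write(fd, struct.pack(RECORD_FMT, rec_id))
        finally:
            os.close(fd)

    def reader(path):
        # Keep re-reading the growing file and verify the ids are 0, 1, 2, ...
        while True:
            with open(path, 'rb') as f:
                data = f.read()
            nrecs = len(data) // RECORD_SIZE
            for i in range(nrecs):
                (rec_id,) = struct.unpack_from(RECORD_FMT, data, i * RECORD_SIZE)
                if rec_id != i:
                    print('FAILED: expected record id %d retrieved record id %d'
                          % (i, rec_id))
                    os._exit(1)
            if nrecs == NUM_RECORDS:
                return

    def main():
        path = sys.argv[1]
        open(path, 'w').close()          # start from an empty file
        t = threading.Thread(target=writer, args=(path,))
        t.start()
        reader(path)
        t.join()

    if __name__ == '__main__':
        main()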

Version-Release number of selected component (if applicable):
3.5.1, 3.5.2

How reproducible:

It is not deterministic (the record number varies), but it occurs 100% of the
time with the attached python script.

Steps to Reproduce:
1. python2.7 gluster-fuse-client-interleaved-read-write-corruption.py
/path/to/file/mounted/on/gluster

Actual results:
If you see output like "FAILED: expected record id 21 retrieved record id 22"
you have reproduced the issue. The record ids vary from run to run.

Expected results:
The program should exit with no output.

Additional info:

--- Additional comment from  on 2014-09-06 19:00:00 EDT ---

This script has a tiny syntax bug.

--- Additional comment from  on 2014-09-06 19:01:29 EDT ---

Fixes a syntax bug in the first version I uploaded.

--- Additional comment from Pranith Kumar K on 2014-09-07 09:20:52 EDT ---

Raghavendra,
    I observed that disabling write-behind prevents the issue. Do you mind
taking a look?

Pranith

--- Additional comment from Shyamsundar on 2014-09-08 10:12:51 EDT ---

@Pranith, was the test to disable write-behind (as part of comment #3) done
against the 3.5.2 code or master?

Also, could this patch on master, http://review.gluster.org/#/c/6835/, help
with this bug?

--- Additional comment from Pranith Kumar K on 2014-09-08 11:08:16 EDT ---

Shyam,
     3.5.2 has that patch.

Pranith

--- Additional comment from  on 2014-09-12 00:56:42 EDT ---

Has there been any more progress on this? This is still a major concern for our
deployment.

Is there some way I can better isolate which layer is corrupting the offsets?

--- Additional comment from  on 2014-09-12 13:23:43 EDT ---

I have confirmed that, with write-behind disabled, the test succeeds when the
volume is mounted with the fuse client.

--- Additional comment from Raghavendra G on 2014-10-13 02:54:34 EDT ---

It seems there is no permanent corruption (though a reader running
concurrently with the writer gets a wrong sequence number):

[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# cp -fv /mnt/glusterfs/file .
`/mnt/glusterfs/file' -> `./file'
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 58 retrieved record id 59
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file
[root@booradley 1138970]# >/mnt/glusterfs/file; ./gluster-fuse-client-interleaved-read-write-corruption.py /mnt/glusterfs/file
FAILED: expected record id 39 retrieved record id 40
[root@booradley 1138970]# diff /mnt/glusterfs/file ./file

As can be seen above, I am comparing the files from two kinds of runs:
1. a run where the script reported an error
2. a run where the script didn't report an error

diff didn't report any difference between the two files. This suggests the
corruption is transient: the reader gets corrupted data only while running
concurrently with the writer.

--- Additional comment from Raghavendra G on 2014-10-13 03:15:31 EDT ---

The issue is reproducible only when BOTH write-behind and stat-prefetch are
turned on. If any one of them is off, we don't hit this issue.

--- Additional comment from Raghavendra G on 2014-10-13 03:23:38 EDT ---

Also, presence/absence of read-cache translators (read-ahead, io-cache,
quick-read) has no impact on the reproducibility of the issue.

--- Additional comment from Raghavendra G on 2014-10-13 07:44:09 EDT ---

The culprit seems to be stale stat information cached in
stat-prefetch/md-cache. Consider the following sequence of events:

1. A stat with size=10 is cached in md-cache.
2. A write of 10 bytes happens, and the cached stat is flushed.
3. The write is _NOT_ flushed from write-behind and _IS CACHED_ there.
4. An operation returns stat information in its reply (still with size=10,
since the write was not flushed to the brick), and this information is cached
in md-cache.
5. Now any fstat call from the application still gets a stale stat with
size=10, because the stat is served from md-cache; the fstat call never
reaches write-behind, and hence the write is not flushed.
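
To make the sequence above concrete, here is a toy model (hypothetical class
and method names; this is not GlusterFS code): if the append offset is derived
from an fstat that md-cache answers from a stale cache while write-behind is
still holding the previous write, two consecutive appends end up targeting the
same offset.

    # Toy model of the sequence above (illustrative only, not GlusterFS code).
    class Brick(object):
        def __init__(self):
            self.size = 0
        def writev(self, offset, nbytes):
            self.size = max(self.size, offset + nbytes)
        def fstat(self):
            return self.size

    class WriteBehind(object):
        """Acknowledges writes immediately, flushes them to the brick later."""
        def __init__(self, brick):
            self.brick = brick
            self.pending = []
        def writev(self, offset, nbytes):
            self.pending.append((offset, nbytes))  # acked before brick sees it
        def fstat(self):
            return self.brick.fstat()              # pending writes not counted
        def flush(self):
            for offset, nbytes in self.pending:
                self.brick.writev(offset, nbytes)
            self.pending = []

    class MdCache(object):
        """Serves fstat from its cache; the call never reaches write-behind."""
        def __init__(self, below):
            self.below = below
            self.cached = None
        def fstat(self):
            if self.cached is None:
                self.cached = self.below.fstat()
            return self.cached
        def writev(self, offset, nbytes):
            self.below.writev(offset, nbytes)
            self.cached = self.below.fstat()       # re-primed with a stale size

    brick = Brick()
    brick.size = 120                  # file already has 120 bytes (cf. traces)
    stack = MdCache(WriteBehind(brick))

    for _ in range(2):
        offset = stack.fstat()        # append offset computed from cached size
        stack.writev(offset, 10)      # both iterations write at offset 120

    print(stack.below.pending)        # [(120, 10), (120, 10)]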

The evidence for this comes from the following logs:
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.357652] I [trace.c:267:trace_writev_cbk] 0-ptop-trace:
2966: (op_ret=10, *prebuf = {gfid=00000000-0000-0000-0000-000000000000 ino=0,
mode=0, nlink=0, uid=0, gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00],
mtime=[Jan 01 05:30:00], ctime=[Jan 01 05:30:00]}, *postbuf =
{gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0,
gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00],
ctime=[Jan 01 05:30:00]})
[2014-10-13 10:28:23.357781] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2964: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357870] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2967:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.357891] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2967: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.357931] I [trace.c:2287:trace_readv] 0-ptop-trace: 2968:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc, size=4096offset=0
flags=08000)
[2014-10-13 10:28:23.358140] I [trace.c:228:trace_readv_cbk] 0-ptop-trace:
2968: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=130
buf=gfid=00000000-0000-0000-0000-000000000000 ino=0, mode=0, nlink=0, uid=0,
gid=0, size=0, blocks=0, atime=[Jan 01 05:30:00], mtime=[Jan 01 05:30:00],
ctime=[Jan 01 05:30:00]
[2014-10-13 10:28:23.359398] I [trace.c:2848:trace_fstat] 0-ptop-trace: 2969:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1ecc
[2014-10-13 10:28:23.359489] I [trace.c:1225:trace_fstat_cbk] 0-ptop-trace:
2969: gfid=496dbe26-3103-4a33-84d4-4facb08c7623 op_ret=0
buf=gfid=496dbe26-3103-4a33-84d4-4facb08c7623 ino=9571362711184111139,
mode=100644, nlink=1, uid=0, gid=0, size=120, blocks=1, atime=[Oct 13
15:58:23], mtime=[Oct 13 15:58:23], ctime=[Oct 13 15:58:23]
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)

As can be seen above, two writes are issued to the same offset 120. Also note
that fstat returns a size of 120 even though a write of 10 bytes at offset 120
has already succeeded and its reply has been returned to the application. For
the sake of brevity, I am pasting below only the write traces around offset
120:

[2014-10-13 10:28:23.356296] I [trace.c:2320:trace_writev] 0-ptop-trace: 2963:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=110
flags=08401)
[2014-10-13 10:28:23.357618] I [trace.c:2320:trace_writev] 0-ptop-trace: 2966:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.363943] I [trace.c:2320:trace_writev] 0-ptop-trace: 2971:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=120
flags=08401)
[2014-10-13 10:28:23.365178] I [trace.c:2320:trace_writev] 0-ptop-trace: 2974:
gfid=496dbe26-3103-4a33-84d4-4facb08c7623 fd=0x18c1f38, count=1,  offset=130
flags=08401)

However, there is one hole in this hypothesis. Since we pass the same flags to
the underlying filesystem in storage/posix during open, the file would be
opened in O_APPEND mode on the bricks. From man 2 pwrite, on the interaction
of O_APPEND and pwrite:

<man 2 pwrite>
BUGS        
       POSIX requires that opening a file with the O_APPEND flag should have
       no effect on the location at which pwrite() writes data.  However, on
       Linux, if a file is opened with O_APPEND, pwrite() appends data to
       the end of the file, regardless of the value of offset.
</man 2 pwrite>
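
This behaviour is easy to confirm in isolation. A small standalone sketch
(using os.pwrite, which needs Python 3.3+; unrelated to the attached
reproducer):

    # Demonstrates the quoted man-page note: on Linux, pwrite() to a file
    # descriptor opened with O_APPEND appends, ignoring the given offset.
    import os
    import tempfile

    tmp_fd, path = tempfile.mkstemp()
    os.close(tmp_fd)

    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    os.write(fd, b'0123456789')        # file is now 10 bytes
    os.pwrite(fd, b'XXXXXXXXXX', 0)    # ask for offset 0 ...
    os.close(fd)

    print(os.stat(path).st_size)       # 20 on Linux: the data was appended
    with open(path, 'rb') as f:
        print(f.read())                # b'0123456789XXXXXXXXXX'
    os.unlink(path)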

Also, the logs show that the last write of 10 bytes happened at offset 970,
but the file size turns out to be 990 bytes, as shown below.

[root@booradley 1138970]# ls -l /home/export/ptop/file
-rw-r--r--. 2 root root 990 Oct 13 15:58 /home/export/ptop/file

So, whatever offset is passed to storage/posix is ignored by pwrite when
O_APPEND is present. This behaviour is also in line with the observation in
comment 8 that the file on the backend is no different in a run where the
issue was reported than in a run where it was not.

--- Additional comment from Raghavendra G on 2014-10-13 07:45:03 EDT ---


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1138970
[Bug 1138970] file corruption during concurrent read/write