[Bugs] [Bug 1179169] New: tar on a gluster directory gives message "file changed as we read it" even though no updates to file in progress

bugzilla at redhat.com bugzilla at redhat.com
Tue Jan 6 10:40:11 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1179169

            Bug ID: 1179169
           Summary: tar on a gluster directory gives message "file changed
                    as we read it" even though no updates to file in
                    progress
           Product: GlusterFS
           Version: mainline
         Component: replicate
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
                CC: bugs at gluster.org, byarlaga at redhat.com,
                    gluster-bugs at redhat.com, kdhananj at redhat.com,
                    mpillai at redhat.com, nsathyan at redhat.com,
                    pkarampu at redhat.com, rcyriac at redhat.com,
                    spandura at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1104618



+++ This bug was initially created as a clone of Bug #1104618 +++

Description of problem:
A script that includes a tar command on a gluster volume directory is failing
with the error "file changed as we read it". There is no active I/O to the
gluster volume when the tar command is issued. All previous writes were sync
writes. Seeing this fairly consistently on a replica 2 volume; not yet seen on
a distribute volume.

Version-Release number of selected component (if applicable):
Client and Servers are running glusterfs*-3.6.0.11-1.el6rhs.x86_64

How reproducible:
Quite consistently. Some runs with only a few small files did not show the
error message. But have been seeing it every time I have runs with the steps
below.

Steps to Reproduce:
1. Create a replicate or distribute-replicate volume and native-mount it on a
single client (a sketch of the commands follows these steps).
2. mkdir /mnt/glustervol/dir_1 (where /mnt/glustervol is the mount point for
gluster volume)
3. for ii in `seq 5`; do 
dd bs=1024k count=1024 if=/dev/zero of=/mnt/glustervol/dir_1/f_${ii} conv=fsync
done
4. tar -cf /home/tar_compare.tar /mnt/glustervol/dir_1 (/home is a separate
local file system)
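
For reference, step 1 might look like the following. This is a minimal
sketch; the host names, brick paths and volume name (server1, server2,
/bricks/b1, testvol) are placeholders, not taken from the report:

# Assumed 2-node, replica 2 setup; substitute real hosts and brick paths.
gluster volume create testvol replica 2 server1:/bricks/b1 server2:/bricks/b1
gluster volume start testvol
# Native (fuse) mount on the client:
mount -t glusterfs server1:/testvol /mnt/glustervol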

Actual results:
tar: Removing leading `/' from member names
tar: /mnt/glustervol/dir_1/f_1: file changed as we read it

Expected results:
In this test scenario, tar should not give the "file changed as we read it"
message.

Additional info:


--- Additional comment from Krutika Dhananjay on 2014-11-18 02:00:22 EST ---

Was able to recreate the bug and debug it. I used the same steps as those
described above on a plain replicate volume, except that I created 3 files
(as opposed to 5). The following are my observations:

Running the tar program under strace shows that tar does an fstat() before
and after archiving a given file:

fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072,
st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:54:23,
st_mtime=2014/11/18-11:54:46, st_ctime=2014/11/18-11:54:49}) = 0               
              <============= fstat #1
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 9216) =
9216
write(3</home/compare.tar>,
"mnt/dir_1/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
...
...
...
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072,
st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:56:39,
st_mtime=2014/11/18-11:54:47, st_ctime=2014/11/18-11:54:50}) = 0               
             <======= fstat #2
write(2</dev/pts/1>, "tar: ", 5)        = 5
write(2</dev/pts/1>, "/mnt/dir_1/f_1: file changed as "..., 42) = 42
write(2</dev/pts/1>, "\n", 1)           = 1
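
A trace with fd-to-path annotations like the one above can be captured with
strace's -y option; the exact invocation below is an assumption, not taken
from the report:

# -y resolves fd numbers to paths; -o writes the trace to a file
strace -f -y -e trace=open,fstat,read,write -o /tmp/tar.strace \
    tar -cf /home/compare.tar /mnt/dir_1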

In the case of /dir_1/f_1, tar found the ctime of the file to differ before
and after the archival, causing it to conclude that f_1's attributes had
changed while it was being processed.
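
tar's check can be approximated from the shell; this is a minimal sketch of
the same before/after comparison (GNU stat's %Z prints ctime as seconds since
the epoch; tar actually uses fstat() on the open fd, so this is only an
approximation):

f=/mnt/glustervol/dir_1/f_1
before=$(stat -c %Z "$f")   # ctime before reading
cat "$f" > /dev/null        # read the file, as tar would
after=$(stat -c %Z "$f")    # ctime after reading
[ "$before" = "$after" ] || echo "$f: file changed as we read it"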

Initial suspicion was on client-side caching xlators like io-cache and
stat-prefetch. Disabling the two translators still caused the bug to be hit,
so the possibility of client-side caching causing this bug was eliminated.
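
Disabling those two translators amounts to something like the following; the
volume name is a placeholder:

# turn off the client-side caching xlators, then remount the client
gluster volume set testvol performance.io-cache off
gluster volume set testvol performance.stat-prefetch off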

The next suspicion was on AFR: the read child of "/dir_1/f_1" could have
changed between the first and the second fstat() (every file has a designated
read child in AFR, on which AFR performs read-only FOPs like getxattr(),
readv(), stat(), fstat(), readdirp() etc.). Adding more logs and trying again
revealed that the read child of "/dir_1/f_1" never changed between the two
fstat()s. The possibility of a bug in AFR's read-child selection was thus
eliminated.

Of the 3 files under /dir_1, the file-changed-as-we-read-it error occurred only
with f_1.
From the added logs, the read children of /dir_1, /dir_1/f_1, /dir_1/f_2 and
/dir_1/f_3 were found to be the following:

/dir_1          :  0
/dir_1/f_1      :  1
/dir_1/f_2      :  0
/dir_1/f_3      :  0

Therefore, on /dir_1, readdirp() was wound to child-0 of AFR, and the directory
entries and their attributes from brick-0 were cached by the fuse kernel.

Also, the first of the two fstat()s never reached user space (the glusterfs
fuse mount), and was possibly served by the fuse kernel using the attributes
cached from the readdirp() on brick-0.

Then, when the second fstat() was called on /dir_1/f_1 (and this one did make
it into user space), the attributes were served from the actual read child of
f_1, which is child-1.
When the tar program compared the output of the two fstat()s, it found the
ctime to have changed and threw this error.
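
This RCA can be sanity-checked on the bricks themselves: ctime is assigned
independently by each brick's local file system, so the two replicas of f_1
can legitimately carry different ctimes. A hedged check, with the backend
brick paths being assumptions:

# run on each server; the brick path is a placeholder
stat -c 'ctime=%Z %n' /bricks/b1/dir_1/f_1
# differing values across the two bricks would be consistent with the RCA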

I tried the same experiment after mounting the volume with the
--use-readdirp=no option, and this warning was not seen on the mount. The
logs also indicated that an explicit fstat() was passed down to user space
before tar-ing f_1.
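
For reference, a mount with readdirp disabled would look something like this;
the server and volume names are placeholders:

glusterfs --volfile-server=server1 --volfile-id=testvol \
    --use-readdirp=no /mnt/glustervol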

Will try the same experiment with a plain distribute volume, a 2x2 dist-rep
volume and an nfs mount, and will update the bug if there is anything more to
it.

--- Additional comment from Manoj Pillai on 2014-11-18 14:04:26 EST ---

(In reply to Krutika Dhananjay from comment #2)

Nice Analysis!

> 
> I tried the same experiment, by mounting the volume with --use-readdirp=no
> option, and this warning was not seen on the mount. Also the logs indicated
> that an explicit fstat() was passed down to the user space before tar-ing
> f_1.
> 
> Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> on nfs mount and update the bug if there is anything more to it.

I think this is worse than a warning because tar is returning an exit code that
is interpreted as an error.

Is setting cluster.read-subvolume-index also expected to make the problem go
away?
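
For context, pinning the read child would be done with something like the
following; the volume name and the index are placeholders:

# always read from subvolume 0 (brick-0) when it has a good copy
gluster volume set testvol cluster.read-subvolume-index 0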

--- Additional comment from Krutika Dhananjay on 2014-11-19 02:02:41 EST ---

(In reply to Manoj Pillai from comment #3)
> (In reply to Krutika Dhananjay from comment #2)
> 
> Nice Analysis!
> 
> > 
> > I tried the same experiment, by mounting the volume with --use-readdirp=no
> > option, and this warning was not seen on the mount. Also the logs indicated
> > that an explicit fstat() was passed down to the user space before tar-ing
> > f_1.
> > 
> > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> > on nfs mount and update the bug if there is anything more to it.
> 
> I think this is worse than a warning because tar is returning an exit code
> that is interpreted as an error.
> 
> Is setting cluster.read-subvolume-index also expected to make the problem go
> away?

Interesting. That never crossed my mind! I am guessing it will work fine ONLY
for this case, where we don't bring down any child of AFR during the course of
this test. However, let me dig into the code a bit more and I'll post whatever
I find here.

I am actually now able to recreate this bug even with the --use-readdirp=no
option. It happened 2 out of 3 times. There is definitely more to this bug
than what I said in comment #2.

--- Additional comment from Krutika Dhananjay on 2014-11-20 02:13:56 EST ---


> 
> I am actually now able to recreate this bug even with --use-readdirp=no
> option. It happened 2/3 times. There is definitely more to this bug than
> whatever I said in comment #2.

So I hand-wrote the volfile for the fuse client to load the trace xlator
between md-cache and quick-read, and tried this again on a plain replicate
volume mounted with use-readdirp=no:

<volfile-snippet>

...
...
volume rep-quick-read
    type performance/quick-read
    subvolumes rep-io-cache
end-volume

volume rep-trace
    type debug/trace
    subvolumes rep-quick-read
    option log-file on
end-volume

volume rep-md-cache
    type performance/md-cache
    subvolumes rep-trace
end-volume
...
...
</volfile-snippet>
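
A hand-edited volfile like the one above can be mounted by pointing the
glusterfs client directly at the file instead of a volfile server; the path
below is a placeholder:

glusterfs -f /root/testvol-trace.vol --use-readdirp=no /mnt/glustervol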

What I found was that even with readdirp disabled, the trace xlator logged
occurrences of readdirp():

[2014-11-20 07:00:00.402564] I [trace.c:2530:trace_opendir] 0-rep-trace: 6940:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 path=/dir_1 fd=0x9cdb5c
[2014-11-20 07:00:00.462940] I [trace.c:848:trace_opendir_cbk] 0-rep-trace:
6940: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=117,
fd=0x9cdb5c
[2014-11-20 07:00:00.463454] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6941:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=0
dict=0x7fd164e6edf8
[2014-11-20 07:00:00.464761] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace:
6941 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=5, op_errno=0
[2014-11-20 07:00:00.468683] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6942:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=512
dict=0x7fd164e6e998
[2014-11-20 07:00:00.474459] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace:
6942 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=0
[2014-11-20 07:00:00.477221] I [md-cache.c:907:mdc_stat]
(-->/usr/local/lib/glusterfs/3.6.0.32/xlator/meta.so(meta_stat+0x55)
[0x7fd156603569] (-->/usr/local/lib/libglusterfs.so.0(default_stat+0x97)
[0x7fd166678e28]
(-->/usr/local/lib/glusterfs/3.6.0.32/xlator/debug/io-stats.so(io_stats_stat+0x2a1)
[0x7fd156824d1c]))) 0-MDC_STAT: 552c56e2-87e4-4334-bf50-f9df4e473631: serving
up cached data

md-cache converts readdir()s to readdirp()s internally in order to cache the
entries' metadata, unless the option performance.force-readdirp is set to
"off" (thanks to Raghavendra G for this piece of information).
What came after the readdir() from the fuse kernel was a stat() (see logs
above), which md-cache served from the attributes it had cached during the
(converted) readdirp(). So the overall RCA posted in comment #2 is still
valid.

I tried this again, with --use-readdirp=no and force-readdirp set to off. This
time the bug was not hit.
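
Put together, the combination that avoided the bug in this run amounts to the
following; the volume and server names are placeholders:

gluster volume set testvol performance.force-readdirp off
glusterfs --volfile-server=server1 --volfile-id=testvol \
    --use-readdirp=no /mnt/glustervol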

--- Additional comment from Krutika Dhananjay on 2014-11-20 07:41:19 EST ---

(In reply to Manoj Pillai from comment #3)
> (In reply to Krutika Dhananjay from comment #2)
> 
> Nice Analysis!
> 
> > 
> > I tried the same experiment, by mounting the volume with --use-readdirp=no
> > option, and this warning was not seen on the mount. Also the logs indicated
> > that an explicit fstat() was passed down to the user space before tar-ing
> > f_1.
> > 
> > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> > on nfs mount and update the bug if there is anything more to it.
> 
> I think this is worse than a warning because tar is returning an exit code
> that is interpreted as an error.
> 
> Is setting cluster.read-subvolume-index also expected to make the problem go
> away?

The problem with cluster.read-subvolume-index configured to a single
subvolume is that the chosen subvolume remains the read child only as long as
it has good copies of the files/dirs; in the event that it is down, AFR will
necessarily have to pick its other subvol(s).
Also, I got to know from Pranith that in the event of a split-brain, the read
child will unconditionally be chosen as the source during self-heal (when
ideally the operation should be failing with EIO), and that can lead to loss
of data.
