[Bugs] [Bug 1186119] New: tar on a gluster directory gives message "file changed as we read it" even though no updates to file in progress

Tue Jan 27 06:28:04 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1186119

            Bug ID: 1186119
           Summary: tar on a gluster directory gives message "file changed
                    as we read it" even though no updates to file in
                    progress
           Product: GlusterFS
           Version: 3.6.1
         Component: replicate
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
                CC: bugs at gluster.org, byarlaga at redhat.com,
                    gluster-bugs at redhat.com, kdhananj at redhat.com,
                    mpillai at redhat.com, nsathyan at redhat.com,
                    pkarampu at redhat.com, rcyriac at redhat.com,
                    spandura at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1179169

+++ This bug was initially created as a clone of Bug #1179169 +++

+++ This bug was initially created as a clone of Bug #1104618 +++

Description of problem:
A script that includes a tar command on a gluster volume directory is failing
with the error "file changed as we read it". There is no active I/O to the
gluster volume when the tar command is issued. All previous writes were sync
writes. Seeing this fairly consistently on a replica 2 volume; not yet seen on
a distribute volume.

Version-Release number of selected component (if applicable):
Client and Servers are running glusterfs*-3.6.0.11-1.el6rhs.x86_64

How reproducible:
Quite consistently. Some runs with only a few small files did not show the
error message. But have been seeing it every time I have runs with the steps
below.

Steps to Reproduce:
1. create a replicate or distribute-replicate volume. Native mount on a single
client.
2. mkdir /mnt/glustervol/dir_1 (where /mnt/glustervol is the mount point for
gluster volume)
3. for ii in `seq 5`; do 
dd bs=1024k count=1024 if=/dev/zero of=/mnt/glustervol/dir_1/f_${ii} conv=fsync
done
4. tar -cf /home/tar_compare.tar /mnt/glustervol/dir_1 (/home is a separate
local file system)

Actual results:
tar: Removing leading `/' from member names
tar: /mnt/glustervol/dir_1/f_1: file changed as we read it

Expected results:
In this test scenario, tar should not give the "file changed as we read it"
message.

Additional info:

--- Additional comment from Krutika Dhananjay on 2014-11-18 02:00:22 EST ---

Was able to recreate the bug and debug it. I used the same steps as the one
described above on a plain replicate except that i created 3 files (as opposed
to 5) and the following are my observations:

I attached the tar program to strace and the output tells me that tar does an
fstat() before and after archiving a given file:

fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072,
st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:54:23,
st_mtime=2014/11/18-11:54:46, st_ctime=2014/11/18-11:54:49}) = 0               
              <============= fstat #1
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 9216) =
9216
write(3</home/compare.tar>,
"mnt/dir_1/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
...
...
...
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
read(5</mnt/dir_1/f_1>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
write(3</home/compare.tar>,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) =
10240
fstat(5</mnt/dir_1/f_1>, {st_dev=makedev(0, 35), st_ino=12126758327912018215,
st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=131072,
st_blocks=524288, st_size=268435456, st_atime=2014/11/18-11:56:39,
st_mtime=2014/11/18-11:54:47, st_ctime=2014/11/18-11:54:50}) = 0               
             <======= fstat #2
write(2</dev/pts/1>, "tar: ", 5)        = 5
write(2</dev/pts/1>, "/mnt/dir_1/f_1: file changed as "..., 42) = 42
write(2</dev/pts/1>, "\n", 1)           = 1

In case of /dir_1/f_1, the ctime of the file before and after the archival was
found by tar to be different, causing it to think f_1's attributes changed
while it was being processed.

Initial suspicion was on client-side caching xlators like io-cache, and
stat-prefetch. Disabling the two translators still caused the bug to be hit.
Hence the possibility of client side caching leading up to this bug was
eliminated.

Initial suspicion was on afr, that the read child of "/dir_1/f_1" in AFR could
have possibly changed between the first and the second fstat (every file has a
designated read_child in AFR on which AFR performs read-only FOPS like
getxattr(), readv(), stat(), fstat(), readdirp() etc). Adding more logs and
trying again revealed that the read child of "/dir_1/f_1" never changed between
the two fstat()s. The possibility of a bug in AFRs's read_child selection was
eliminated.

Of the 3 files under /dir_1, the file-changed-as-we-read-it error occurred only
with f_1.
>From the added logs, the read children of /dir_1, /dir_1/f_1, /dir_1/f_2 and
/dir_1/f_3 were found to be the following:

/dir_1          :  0
/dir_1/f_1      :  1
/dir_1/f_2      :  0
/dir_1/f-3      :  0

Therefore, on /dir_1, readdirp() was wound to child-0 of AFR, and the directory
entries and their attributes from brick-0 were cached by the fuse kernel.

Also, the first of the two fstat()s was never called on user space (glusterfs
fuse mount), and was possibly served up by fuse kernel using the cached values
from readdirp() on brick-0.

Then at the end, when the second fstat() was called (which made it into user
space this time) on /dir_1/f_1, the attributes were being served up from the
actual read child of f_1, which is child-1.
When the tar program compared the output of the two fstat()s, it found the
ctime to have changed and threw this error.

I tried the same experiment, by mounting the volume with --user-readdirp=no
option, and this warning was not seen on the mount. Also the logs indicated
that an explicit fstat() was passed down to the user space before tar-ing f_1.

Will try the same experiment with plain distribute, 2x2 dis-rep volume, and on
nfs mount and update the bug if there is anything more to it.

--- Additional comment from Manoj Pillai on 2014-11-18 14:04:26 EST ---

(In reply to Krutika Dhananjay from comment #2)

Nice Analysis!

> 
> I tried the same experiment, by mounting the volume with --user-readdirp=no
> option, and this warning was not seen on the mount. Also the logs indicated
> that an explicit fstat() was passed down to the user space before tar-ing
> f_1.
> 
> Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> on nfs mount and update the bug if there is anything more to it.

I think this is worse than a warning because tar is returning an exit code that
is interpreted as an error.

Is setting cluster.read-subvolume-index also expected to make the problem go
away?

--- Additional comment from Krutika Dhananjay on 2014-11-19 02:02:41 EST ---

(In reply to Manoj Pillai from comment #3)
> (In reply to Krutika Dhananjay from comment #2)
> 
> Nice Analysis!
> 
> > 
> > I tried the same experiment, by mounting the volume with --user-readdirp=no
> > option, and this warning was not seen on the mount. Also the logs indicated
> > that an explicit fstat() was passed down to the user space before tar-ing
> > f_1.
> > 
> > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> > on nfs mount and update the bug if there is anything more to it.
> 
> I think this is worse than a warning because tar is returning an exit code
> that is interpreted as an error.
> 
> Is setting cluster.read-subvolume-index also expected to make the problem go
> away?

Interesting. That never crossed my mind! I am guessing it will work fine ONLY
for this case, where we don't bring down any child of AFR during the course of
this test. However, let me dig into the code a bit more and I'll post whatever
I find here.

I am actually now able to recreate this bug even with --use-readdirp=no option.
It happened 2/3 times. There is definitely more to this bug than whatever I
said in comment #2.

--- Additional comment from Krutika Dhananjay on 2014-11-20 02:13:56 EST ---

> 
> I am actually now able to recreate this bug even with --use-readdirp=no
> option. It happened 2/3 times. There is definitely more to this bug than
> whatever I said in comment #2.

So I hand-wrote the volfile for the fuse client to load trace xlator between
md-cache and quick-read and tried this again on a plain replicate volume
mounted with use-readdirp=no

<volfile-snippet>

...
...
volume rep-quick-read
    type performance/quick-read
    subvolumes rep-io-cache
end-volume

volume rep-trace
    type debug/trace
    subvolumes rep-quick-read
    option log-file on
end-volume

volume rep-md-cache
    type performance/md-cache
    subvolumes rep-trace
end-volume
...
...
</volfile-snippet>

What I found was that even with readdirp disabled, trace xlator logged
occurrences of readdirp():

[2014-11-20 07:00:00.402564] I [trace.c:2530:trace_opendir] 0-rep-trace: 6940:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 path=/dir_1 fd=0x9cdb5c
[2014-11-20 07:00:00.462940] I [trace.c:848:trace_opendir_cbk] 0-rep-trace:
6940: gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=117,
fd=0x9cdb5c
[2014-11-20 07:00:00.463454] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6941:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=0
dict=0x7fd164e6edf8
[2014-11-20 07:00:00.464761] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace:
6941 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=5, op_errno=0
[2014-11-20 07:00:00.468683] I [trace.c:2562:trace_readdirp] 0-rep-trace: 6942:
gfid=552c56e2-87e4-4334-bf50-f9df4e473631 fd=0x9cdb5c, size=4096, offset=512
dict=0x7fd164e6e998
[2014-11-20 07:00:00.474459] I [trace.c:319:trace_readdirp_cbk] 0-rep-trace:
6942 : gfid=552c56e2-87e4-4334-bf50-f9df4e473631 op_ret=0, op_errno=0
[2014-11-20 07:00:00.477221] I [md-cache.c:907:mdc_stat]
(-->/usr/local/lib/glusterfs/3.6.0.32/xlator/meta.so(meta_stat+0x55)
[0x7fd156603569] (-->/usr/local/lib/libglusterfs.so.0(default_stat+0x97)
[0x7fd166678e28]
(-->/usr/local/lib/glusterfs/3.6.0.32/xlator/debug/io-stats.so(io_stats_stat+0x2a1)
[0x7fd156824d1c]))) 0-MDC_STAT: 552c56e2-87e4-4334-bf50-f9df4e473631: serving
up cached data

md-cache converts readdir()s to readdirp()s internally to cache the entries'
metadata unless the option performance.force-readdirp is set to "off" (thanks
to Raghavendra G for this piece of information).
And what came after the readdir() from fuse kernel was a stat() (see logs
above) which was served up by md-cache based on the attributes it had cached
from previous readdir(), in which case the overall RCA posted in comment #2 is
still valid.

I tried this again, with --use-readdirp=no and force-readdirp set to off. This
time the bug was not hit.

--- Additional comment from Krutika Dhananjay on 2014-11-20 07:41:19 EST ---

(In reply to Manoj Pillai from comment #3)
> (In reply to Krutika Dhananjay from comment #2)
> 
> Nice Analysis!
> 
> > 
> > I tried the same experiment, by mounting the volume with --user-readdirp=no
> > option, and this warning was not seen on the mount. Also the logs indicated
> > that an explicit fstat() was passed down to the user space before tar-ing
> > f_1.
> > 
> > Will try the same experiment with plain distribute, 2x2 dis-rep volume, and
> > on nfs mount and update the bug if there is anything more to it.
> 
> I think this is worse than a warning because tar is returning an exit code
> that is interpreted as an error.
> 
> Is setting cluster.read-subvolume-index also expected to make the problem go
> away?

The problem with read-subvolume-index configured to one single sub-volume is
that the chosen subvolume will be the read child only as long as it has good
copies of the files/dirs. And in the event that it is down, AFR will
necessarily have to pick its other subvol(s).
Also, I got to know from Pranith that in the event of a split brain, the read
child will unconditionally be chosen as the source during self-heal (when
ideally it should be failing with EIO). And that can lead to loss of data.

--- Additional comment from Anand Avati on 2015-01-08 07:26:56 EST ---

REVIEW: http://review.gluster.org/9418 (cluster/afr: Fix perceived change in
file due to difference in [mc]time across replicas) posted (#1) for review on
master by Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-16 04:23:27 EST ---

REVIEW: http://review.gluster.org/9457 (cluster/dht: Fix incorrect updates to
parent times) posted (#1) for review on master by Krutika Dhananjay
(kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-19 06:12:33 EST ---

COMMIT: http://review.gluster.org/9457 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit b45f623a7a8e14ca09a10c6a04c4c5f81e3a62e2
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Fri Jan 16 14:26:45 2015 +0530

    cluster/dht: Fix incorrect updates to parent times

    In directory write FOPs, as far as updates to timestamps associated
    with parent by DHT is concerned, there are three possibilities:
    a) time (in sec) gotten from child of DHT < time (in sec) in inode ctx
    b) time (in sec) gotten from child of DHT = time (in sec) in inode ctx
    c) time (in sec) gotten from child of DHT > time (in sec) in inode ctx

    In case (c), for time in nsecs, it is the value returned by DHT's child
    that must be selected. But what DHT_UPDATE_TIME ends up doing is to choose
    the maximum of (time in nsec gotten from DHT's child, time in nsec in inode
ctx).

    Change-Id: I535a600b9f89b8d9b6714a73476e63ce60e169a8
    BUG: 1179169
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: http://review.gluster.org/9457
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
    Tested-by: Raghavendra G <rgowdapp at redhat.com>

--- Additional comment from Anand Avati on 2015-01-20 05:13:27 EST ---

REVIEW: http://review.gluster.org/9465 (cluster/dht: In MKDIR(), aggregate
iatts from non-hashed subvols too) posted (#1) for review on master by Krutika
Dhananjay (kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-22 05:06:19 EST ---

REVIEW: http://review.gluster.org/9476 (performance/md-cache: Initialise
local->loc before winding nameless lookup) posted (#1) for review on master by
Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-22 05:06:38 EST ---

REVIEW: http://review.gluster.org/9477 (cluster/afr: When parent and entry read
subvols are different, set entry->inode to NULL) posted (#1) for review on
master by Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-23 01:55:13 EST ---

REVIEW: http://review.gluster.org/9476 (performance/md-cache: Initialise
local->loc before winding nameless lookup) posted (#2) for review on master by
Krutika Dhananjay (kdhananj at redhat.com)

--- Additional comment from Anand Avati on 2015-01-24 00:37:48 EST ---

COMMIT: http://review.gluster.org/9476 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit 304cfeea0106ee7a9e1786637bab403f72c96155
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Thu Jan 22 12:57:10 2015 +0530

    performance/md-cache: Initialise local->loc before winding nameless lookup

    That way, in the cbk, the latest values are updated in the cache.

    Change-Id: Ia149e352e4763e8f5b910a3f4cb64d2dda4534b1
    BUG: 1179169
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: http://review.gluster.org/9476
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
    Tested-by: Raghavendra G <rgowdapp at redhat.com>

--- Additional comment from Anand Avati on 2015-01-24 00:43:18 EST ---

COMMIT: http://review.gluster.org/9465 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit 6c10f6db7c53e96881b29ca70dfc08b1ffb231b3
Author: Krutika Dhananjay <kdhananj at redhat.com>
Date:   Tue Jan 20 14:58:34 2015 +0530

    cluster/dht: In MKDIR(), aggregate iatts from non-hashed subvols too

    PROBLEM:

    Gathering iatts from ONLY the hashed subvol during MKDIR and unwinding
    them can cause md-cache to cache and serve these values for a while to
    the application.
    And then, at a later point of time, when a LOOKUP on either the dir or
    its parent gathers attributes from all subvolumes of dht and things are
    evened out as part of DHT_UPDATE_TIME, the application could be getting
    a different set of [cm]times (i.e., one of the non-hashed subvolumes'
    times could be selected by virtue of having the highest values),
    causing it to think the directory underwent modification even when it
    might not have.

    The effect of this bug becomes apparent in programs like tar, which rely
    on the ctime of the files before and after archiving a file to
    ascertain that the file remained unchanged during this time.

    FIX:

    Aggregate iatts from ALL sub-volumes of DHT during MKDIR.

    Change-Id: I04c4ca3e3b9552772e2b089be680f8afeb72089e
    BUG: 1179169
    Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
    Reviewed-on: http://review.gluster.org/9465
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
    Tested-by: Raghavendra G <rgowdapp at redhat.com>

--- Additional comment from Anand Avati on 2015-01-24 03:16:12 EST ---

REVIEW: http://review.gluster.org/9477 (cluster/afr: When parent and entry read
subvols are different, set entry->inode to NULL) posted (#2) for review on
master by Krutika Dhananjay (kdhananj at redhat.com)

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1179169
[Bug 1179169] tar on a gluster directory gives message "file changed as we
read it" even though no updates to file in progress
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.