[Gluster-devel] upstream: Symbolic link not getting healed

Thu Dec 19 02:28:22 UTC 2013

Hi,
    I used the following test to find the bad commit:
#!/bin/bash

. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

# stat every path under the mount so that lookups trigger client-side self-heal
function trigger_mount_self_heal {
        find "$M0" | xargs stat
}

cleanup;

TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}{0,1}
TEST $CLI volume set $V0 cluster.background-self-heal-count 0
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --use-readdirp=no --attribute-timeout=0 --entry-timeout=0
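TEST touch $M0/a
# Take brick 0 down, then create the symlink so that it exists only on brick 1.
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST ln -s $M0/a $M0/s
TEST ! stat $B0/${V0}0/s
TEST stat $B0/${V0}1/s
# Bring brick 0 back and wait until the self-heal daemon sees it.
TEST $CLI volume start $V0 force
EXPECT_WITHIN 20 "Y" glustershd_up_status
EXPECT_WITHIN 20 "1" afr_child_up_status_in_shd $V0 0
# Trigger heal from both the self-heal daemon and the mount.
TEST $CLI volume heal $V0 full
TEST trigger_mount_self_heal
# After healing, the symlink should be present on both bricks.
TEST stat $B0/${V0}0/s
TEST stat $B0/${V0}1/s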
cleanup

According to git bisect run, the commit which introduced this problem is:

837422858c2e4ab447879a4141361fd382645406
commit 837422858c2e4ab447879a4141361fd382645406
Author: Anand Avati <avati at redhat.com>
Date:   Thu Nov 21 06:48:17 2013 -0800

    core: fix errno for non-existent GFID

    When clients refer to a GFID which does not exist, the errno to
    be returned is ESTALE (and not ENOENT). Even though ENOENT might
    look "proper" most of the time, as the application eventually expects
    ENOENT even if a parent directory does not exist, not returning
    ESTALE results in resolvers (FUSE and GFAPI) to not retry resolution
    in uncached mode. This can result in spurious ENOENTs during
    concurrent path modification operations.

    Change-Id: I7a06ea6d6a191739f2e9c6e333a1969615e05936
    BUG: 1032894
    Signed-off-by: Anand Avati <avati at redhat.com>
    Reviewed-on: http://review.gluster.org/6322
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
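
The bisect above can be reproduced along these lines. This is a minimal sketch, not the exact procedure used: the build command, the test path tests/bugs/symlink-heal.t, and the BUILD_CMD/TEST_CMD overrides are assumptions made for illustration. The only git behaviour relied on is that "git bisect run" treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad.

```shell
#!/bin/bash
# Sketch of a per-commit check for "git bisect run".
# BUILD_CMD and TEST_CMD are hypothetical overrides; the defaults below
# (a plain "make install" and a test saved as tests/bugs/symlink-heal.t)
# are assumptions, not paths taken from this thread.
bisect_step() {
        # A commit that does not build is neither good nor bad:
        # exit code 125 asks git bisect to skip it.
        ${BUILD_CMD:-make install} >/dev/null 2>&1 || return 125
        # A failing test marks the commit as bad (exit code 1).
        ${TEST_CMD:-prove -v tests/bugs/symlink-heal.t} >/dev/null 2>&1 || return 1
        return 0
}
```

To use it, end the script with a call such as "bisect_step; exit $?", mark the endpoints with "git bisect start BAD_REV GOOD_REV", and then run "git bisect run" with the path to the script.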

Affected branches: master, 3.5, 3.4

I will be working with Venkatesh to get a fix for this onto all of these branches.
Good catch, Venkatesh! Thanks a lot for the simple test case to re-create the issue :-).

Vijay,
     Do you think we need this patch for 3.4 as well? Did it get enough baking time? The change seems delicate, in the sense that every place which expects ENOENT needs to be carefully examined. If we miss even one place, we have a potential bug.

Example:

In 3.4:

pk at pranithk-laptop - ~/workspace/gerrit-repo/xlators/cluster/afr/src ((detached from v3.4.0))
07:53:47 :) ⚡ git show 837422858c2e4ab447879a4141361fd382645406 --stat | grep afr <<--- The commit which introduced this change touches only one afr file
 xlators/cluster/afr/src/afr-self-heald.c      | 2 +-

Whereas quite a few files in afr handle ENOENT:

pk at pranithk-laptop - ~/workspace/gerrit-repo/xlators/cluster/afr/src ((detached from v3.4.0))
07:53:51 :) ⚡ git grep -l ENOENT 
afr-common.c
afr-self-heal-common.c
afr-self-heal-entry.c
afr-self-heald.c
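
One way to narrow down the audit described above is to list the files that mention ENOENT but never mention ESTALE; those are candidates for a check that the errno change may have missed. A rough sketch, assuming plain text matching is good enough for a first pass (the helper name files_missing_estale is hypothetical, and a match says nothing about whether the code is actually wrong):

```shell
#!/bin/bash
# Hypothetical helper: print files under a directory that contain the
# string ENOENT but do not contain the string ESTALE anywhere.
files_missing_estale() {
        local dir=$1
        # grep -rl lists every file under $dir with at least one ENOENT match.
        grep -rl 'ENOENT' "$dir" | while IFS= read -r f; do
                # Keep only the files that never mention ESTALE.
                grep -q 'ESTALE' "$f" || printf '%s\n' "$f"
        done | sort
}
```

For example, "files_missing_estale xlators/cluster/afr/src" run from the top of the source tree. Each reported file still needs to be read by hand to decide whether ESTALE should be handled there.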

Pranith

----- Original Message -----
> From: "Venkatesh Somyajulu" <vsomyaju at redhat.com>
> To: gluster-devel at nongnu.org
> Sent: Tuesday, December 17, 2013 4:14:57 PM
> Subject: [Gluster-devel] upstream: Symbolic link not getting healed
> 
> Hi,
> 
> On the upstream master branch, I found that a symbolic link is not getting
> healed.
> 
> How I reproduced:
> -----------------
> 1. Created a replicate volume with 2 bricks in the replica.
> 2. Created a file from the mount point.
> 3. Killed one of the bricks of the replica.
> 4. Created a symbolic link to that file from the mount point.
> 5. Brought the killed brick back up.
> 
> Tried to heal using both a) the mount process and b) the self-heal daemon.
> 
> a) When the self-heal daemon is off:
> 6. Ran ls at the mount point.
>    Observation: instead of the link being healed, ls printed this output:
>                ls: cannot read symbolic link Link: No such file or directory
>                 File  Link
> 
> b) When the self-heal daemon is on:
> 6. "gluster volume heal volumename full" fails to heal and the output
> includes:
>    [2013-12-17 10:21:49.863960] I
>    [afr-self-heal-entry.c:1502:afr_sh_entry_impunge_readlink_sink_cbk]
>    0-volume1-replicate-0: readlink of /Link on volume1-client-1 failed
>    (Stale file handle)
> 
> 
> Still root-causing the issue. It seems the ESTALE error needs to be handled
> properly.
> 
> Regards,
> Venkatesh