[Bugs] [Bug 1600941] [geo-rep]: geo-replication scheduler is failing due to unsuccessful umount

bugzilla at redhat.com bugzilla at redhat.com
Fri Jul 13 12:45:19 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1600941



--- Comment #2 from Kotresh HR <khiremat at redhat.com> ---
Analysis:

Thanks to Csaba (csaba at redhat.com) for the detailed analysis and help. The
following is the analysis done by Csaba.

=================================================================================
When you run the schedule script and it ends up in a mess, there are three
bogus ex-mountpoints:

[root at dhcp42-119 ~]# rmdir -v /tmp/g* 2>&1 | grep -B1 busy
rmdir: removing directory, ‘/tmp/georepsetup_qKu_20’
rmdir: failed to remove ‘/tmp/georepsetup_qKu_20’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-1rWuvM’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-1rWuvM’: Device or resource busy
rmdir: removing directory, ‘/tmp/gsyncd-aux-mount-SWi0_m’
rmdir: failed to remove ‘/tmp/gsyncd-aux-mount-SWi0_m’: Device or resource busy

Let me tell you about two kernel inspection features.

Each mount has a “device id” associated with it (unique at any given moment).
For any file, the stat(3) result carries the device id of the mount it’s
within, in the st_dev field; stat(1) shows this info as “Device” (see also
sys_stat(3p)).

You can see mounts with associated device ids in /proc/self/mountinfo.
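
For example, here is a minimal sketch of mapping a file to its mount record
(the path is arbitrary; the major/minor split below is the simple form, which
holds for small device numbers such as the 0:NN fuse ids we deal with here):

```shell
# Sketch: map a file to its mount's record in /proc/self/mountinfo.
f=/proc/self/status                          # any file on the mount of interest
dev=$(stat -c '%d' "$f")                     # st_dev as a decimal number
maj=$(( dev >> 8 )); min=$(( dev & 0xff ))   # simple split; fine for small minors
grep -w "$maj:$min" /proc/self/mountinfo     # the "maj:min" field is the device id
```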

Also, fuse comes with a fusectl virtual fs (commonly mounted on
/sys/fs/fuse/connections/) that collects metadata and some control hooks for
the existing fuse mounts, in directories named after the device ids of the
respective mounts.

So the following command lists those fuse connections which outlived their
mounts, i.e. they are listed in fusectl but not in mountinfo:

 for c in `ls /sys/fs/fuse/connections/`; do
     grep -q 0:$c /proc/self/mountinfo || echo BAD $c
 done

In this situation you’ll get:

[root at dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do grep -q 0:$c
/proc/self/mountinfo || echo BAD $c; done
BAD 41
BAD 42
BAD 43

(Actually these are not necessarily bad: this discrepancy is also expected when
a lazy umount occurs and the fs survives it because a process is still using
it. But in this case, AFAIK, none of them was lazily unmounted; and even if
I’m wrong about that, after a lazy unmount the directory should not be busy.)
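
As an aside, the lazy-unmount behaviour mentioned above can be illustrated
like this (needs root; the paths are made up for the sake of the example):

```shell
# Illustration only: a lazy unmount detaches the mountpoint immediately,
# so the directory is not busy afterwards, even while an open fd keeps
# the filesystem itself alive until the last user goes away.
mkdir /tmp/lazy-demo
mount -t tmpfs tmpfs /tmp/lazy-demo
sleep 60 > /tmp/lazy-demo/keepalive &   # an open fd on the fs
umount -l /tmp/lazy-demo                # lazy: detach now, clean up later
rmdir /tmp/lazy-demo                    # succeeds; the dir is free again
```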

Let’s see which fuse servers are behind these entries, by checking the
processes that are using /dev/fuse:

[root at dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps
ww | grep /tmp
13769 ?        Ssl    0:00 /usr/sbin/glusterfs --volfile-server localhost
--volfile-id master -l
/var/log/glusterfs/geo-replication/schedule_georep.mount.log
/tmp/georepsetup_qKu_20
13863 ?        Ssl    0:00 /usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick0%2Fb1.gluster.log
--volfile-server=localhost --volfile-id=master --client-pid=-1
/tmp/gsyncd-aux-mount-SWi0_m
13864 ?        Ssl    0:01 /usr/sbin/glusterfs --aux-gfid-mount --acl
--log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.42.88%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fb4.gluster.log
--volfile-server=localhost --volfile-id=master --client-pid=-1
/tmp/gsyncd-aux-mount-1rWuvM

(We can grep for /tmp because we know the bogus mounts are in /tmp.)

Continuing the investigation, we can see they are still sitting in a read on
/dev/fuse:

[root at dhcp42-119 ~]# lsof /dev/fuse | awk '{print $2}' | grep -v PID | xargs ps
ww | grep /tmp | \
awk '{print $1}' | while read l; do head /proc/$l/task/*/stack; done | grep -B1
fuse

----------------------------------------------------------------
==> /proc/13769/task/13846/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13863/task/13884/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--
==> /proc/13864/task/13888/stack <==
[<ffffffffc087d6a4>] fuse_dev_do_read.isra.18+0x274/0x870 [fuse]
[<ffffffffc087df7d>] fuse_dev_read+0x7d/0xa0 [fuse]
--------------------------------------------------------------------

When a fuse mount is decommissioned, the fuse server’s read call to
/dev/fuse should return with ENODEV. That apparently didn’t happen here.
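
(For completeness: the fusectl “control hooks” mentioned earlier include an
abort file per connection. Writing to it force-disconnects the connection,
which unblocks the server’s pending read on /dev/fuse. A sketch, with 41
standing in for one of the orphaned device ids above; needs root:)

```shell
# Sketch: force-disconnect a stuck fuse connection via its fusectl
# abort hook; the fuse server's blocked read on /dev/fuse then returns.
echo 1 > /sys/fs/fuse/connections/41/abort
```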

[root at dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $1" "$2" "$NF }' |
uniq
python 13800 /tmp/gsyncd-aux-mount-SWi0_m
python 13802 /tmp/gsyncd-aux-mount-1rWuvM

Curiously, there is no reference to one of the bogus dirs, the
/tmp/georepsetup* one.

Let’s see who the culprits are:

[root at dhcp42-119 ~]# lsof | grep '0,4[123]' | awk '{print $2}' | uniq | xargs
ps ww
  PID TTY      STAT   TIME COMMAND
13800 ?        Sl     0:09 python
/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1
--path=/bricks/brick1/b4  -c
/var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf
--iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f
10.70.42.88::slave -N -p  --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d
--feedback-fd 15 --local-path /bricks/brick0/b1 --local-node 10.70.42.119
--local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id
.%2Fbricks%2Fbrick0%2Fb1 --rpc-fd 11,9,8,13 --subvol-num 1 --resource-remote
ssh://root@10.70.42.88:gluster://localhost:slave
13802 ?        Sl     0:06 python
/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/bricks/brick0/b1
--path=/bricks/brick1/b4  -c
/var/lib/glusterd/geo-replication/master_10.70.42.88_slave/gsyncd.conf
--iprefix=/var :master --glusterd-uuid=161b39fa-8c35-4cf0-b837-1af3d5b87b6f
10.70.42.88::slave -N -p  --slave-id 7b190f51-42bf-4378-b6d2-e76b422c448d
--feedback-fd 15 --local-path /bricks/brick1/b4 --local-node 10.70.42.119
--local-node-id 161b39fa-8c35-4cf0-b837-1af3d5b87b6f --local-id
.%2Fbricks%2Fbrick1%2Fb4 --rpc-fd 11,9,8,12 --subvol-num 2 --resource-remote
ssh://root@10.70.42.88:gluster://localhost:slave

That is, the gsyncd workers which have a --resource-remote option:

[root at dhcp42-119 ~]# pgrep -f resource-remote
13800
13802

I found that all this badness goes away when geo-rep is stopped, so as a more
targeted action we can just try killing the above processes.

[root at dhcp42-119 ~]# pkill -f resource-remote

[root at dhcp42-119 ~]# for c in `ls /sys/fs/fuse/connections/`; do  grep -q 0:$c
/proc/self/mountinfo || echo BAD $c; done
[root at dhcp42-119 ~]#

All the badness is gone!
================================================================================
