[Bugs] [Bug 1452527] New: Shared volume doesn't get mounted on a few nodes after rebooting all nodes in the cluster.

bugzilla at redhat.com
Fri May 19 07:11:07 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1452527

            Bug ID: 1452527
           Summary: Shared volume doesn't get mounted on a few nodes after
                    rebooting all nodes in the cluster.
           Product: GlusterFS
           Version: mainline
         Component: scripts
          Keywords: Triaged
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: jthottan at redhat.com
                CC: amukherj at redhat.com, asengupt at redhat.com,
                    bugs at gluster.org, jthottan at redhat.com,
                    kkeithle at redhat.com, mzywusko at redhat.com,
                    ndevos at redhat.com, nlevinki at redhat.com,
                    rhs-bugs at redhat.com, skoduri at redhat.com,
                    sraj at redhat.com, storage-qa-internal at redhat.com,
                    vbellur at redhat.com
        Depends On: 1335090
            Blocks: 1451981



+++ This bug was initially created as a clone of Bug #1335090 +++

Description of problem:

The shared volume doesn't get mounted on one (sometimes two) of the nodes after
rebooting all nodes in the cluster, resulting in a missing symlink
(/var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs).
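
For a quick check on each node, the mount and the symlink can be verified as
below (a minimal sketch; the mount point and symlink follow this report):

# On each node, verify the shared-storage mount and the nfs-ganesha symlink:
mountpoint /run/gluster/shared_storage
ls -ld /var/lib/nfs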

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node nfs-ganesha cluster.
2. Make sure the shared volume is created and mounted on all the nodes of the
cluster, and that the symlink is created, as shown below.

[root at dhcp42-20 ~]# gluster volume status gluster_shared_storage
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp42-239.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2293 
Brick dhcp43-175.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2281 
Brick dhcp42-20.lab.eng.blr.redhat.com:/var
/lib/glusterd/ss_brick                      49155     0          Y       2266 
Self-heal Daemon on localhost               N/A       N/A        Y       2257 
Self-heal Daemon on dhcp42-239.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2287 
Self-heal Daemon on dhcp43-175.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2253 
Self-heal Daemon on dhcp42-196.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2258 

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Filesystem                                                 1K-blocks    Used Available Use% Mounted on
dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage    27740928 1697152  26043776   7% /run/gluster/shared_storage
dhcp42-239.lab.eng.blr.redhat.com:/gluster_shared_storage   27740928 1697152  26043776   7% /run/gluster/shared_storage
dhcp43-175.lab.eng.blr.redhat.com:/gluster_shared_storage   27740928 1697152  26043776   7% /run/gluster/shared_storage
dhcp42-196.lab.eng.blr.redhat.com:/gluster_shared_storage   27740928 1697152  26043776   7% /run/gluster/shared_storage

[root at dhcp42-20 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 80 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-20.lab.eng.blr.redhat.com/nfs

[root at dhcp42-239 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs

[root at dhcp43-175 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp43-175.lab.eng.blr.redhat.com/nfs

[root at dhcp42-196 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:19 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-196.lab.eng.blr.redhat.com/nfs

3. Reboot all the nodes of the cluster.
4. Observe that on 2 of the 4 nodes the shared storage is not mounted (most of
the time it fails to mount on at least one node).
5. As a result, the symlink from /var/lib/nfs doesn't get created on these 2
nodes.
6. Both of these nodes have the entry in /etc/fstab, and manually mounting the
shared storage on them works (see the sketch below).
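
For reference, a sketch of how the shared volume is enabled and what the
/etc/fstab entry and the manual mount look like (the server name is taken from
this cluster; the actual entry on each node may name a different server):

# Enable the shared volume cluster-wide; a hook script creates
# gluster_shared_storage and mounts it on all nodes:
gluster volume set all cluster.enable-shared-storage enable

# /etc/fstab entry on the affected nodes (sketch):
dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage /run/gluster/shared_storage glusterfs defaults 0 0

# Manual mount that works once the volume is being served:
mount -t glusterfs dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage /run/gluster/shared_storage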


Actual results:

The shared volume doesn't get mounted on a few nodes after rebooting all nodes
in the cluster.

Expected results:

The shared volume should get mounted on all the nodes after a reboot.

Additional info:


--- Additional comment from Soumya Koduri on 2016-05-11 07:38:20 EDT ---

I see the below errors in the node4 logs:

[2016-05-11 15:56:04.984079] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk]
0-gluster_shared_storage-client-1: failed to get the port number for remote
subvolume. Please run 'gluster volume status' on server to see if brick process
is running.
[2016-05-11 15:56:04.984357] I [MSGID: 114018]
[client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-1:
disconnected from gluster_shared_storage-client-1. Client process will keep
trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:04.984374] W [MSGID: 108001] [afr-common.c:4210:afr_notify]
0-gluster_shared_storage-replicate-0: Client-quorum is not met
[2016-05-11 15:56:05.291773] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk]
0-gluster_shared_storage-client-2: failed to get the port number for remote
subvolume. Please run 'gluster volume status' on server to see if brick process
is running.
[2016-05-11 15:56:05.292104] I [MSGID: 114018]
[client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-2:
disconnected from gluster_shared_storage-client-2. Client process will keep
trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:05.292165] E [MSGID: 108006] [afr-common.c:4152:afr_notify]
0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline
until atleast one of them comes back up.
[2016-05-11 15:56:05.295895] I [fuse-bridge.c:5166:fuse_graph_setup] 0-fuse:
switched to graph 0
[2016-05-11 15:56:05.296679] I [fuse-bridge.c:4077:fuse_init] 0-glusterfs-fuse:
FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-05-11 15:56:05.296828] I [MSGID: 108006]
[afr-common.c:4261:afr_local_init] 0-gluster_shared_storage-replicate-0: no
subvolumes up
[2016-05-11 15:56:05.297606] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.298786] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.298818] W [fuse-bridge.c:766:fuse_attr_cbk]
0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-05-11 15:56:05.305894] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.307751] I [fuse-bridge.c:5007:fuse_thread_proc] 0-fuse:
unmounting /run/gluster/shared_storage

Since this seems to be an issue with the gluster_shared_storage mount being
lost, I am adjusting the components accordingly and requesting Avra to take a
look.

--- Additional comment from Avra Sengupta on 2016-05-13 01:30:09 EDT ---

This is expected behaviour. We need to understand that the shared volume itself
is hosted on these nodes, and all nodes mount it through one particular node.
When all nodes are down, the shared storage volume is essentially down as well.
When the nodes come back up, none of them will be able to connect to the shared
storage until the node whose entry is mentioned in /etc/fstab is up and serving
the volume. That node itself will never connect to the shared storage on
reboot, because by the time its /etc/fstab entry is replayed, the volume is not
yet being served.
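
A possible boot-time workaround (a sketch only, not the eventual fix): keep
retrying the mount after boot until the volume is being served, e.g. from a
script run late in startup. The mount point follows this report; the retry
count and interval are arbitrary.

#!/bin/bash
# Retry mounting the shared volume until glusterd starts serving it.
MNT=/run/gluster/shared_storage
for i in $(seq 1 20); do
    mountpoint -q "$MNT" && exit 0      # already mounted
    mount "$MNT" && exit 0              # uses the existing /etc/fstab entry
    sleep 5                             # wait for glusterd/bricks to come up
done
echo "giving up: $MNT is not mounted" >&2
exit 1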


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1335090
[Bug 1335090] Shared volume doesn't get mounted on a few nodes after
rebooting all nodes in the cluster.
https://bugzilla.redhat.com/show_bug.cgi?id=1451981
[Bug 1451981] [GSS] NFS-ganesha is not getting started properly after node
reboot.