[Bugs] [Bug 1452527] New: Shared volume doesn't get mounted on few nodes after rebooting all nodes in cluster.
bugzilla at redhat.com
Fri May 19 07:11:07 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1452527
Bug ID: 1452527
Summary: Shared volume doesn't get mounted on few nodes after
rebooting all nodes in cluster.
Product: GlusterFS
Version: mainline
Component: scripts
Keywords: Triaged
Severity: high
Assignee: bugs at gluster.org
Reporter: jthottan at redhat.com
CC: amukherj at redhat.com, asengupt at redhat.com,
bugs at gluster.org, jthottan at redhat.com,
kkeithle at redhat.com, mzywusko at redhat.com,
ndevos at redhat.com, nlevinki at redhat.com,
rhs-bugs at redhat.com, skoduri at redhat.com,
sraj at redhat.com, storage-qa-internal at redhat.com,
vbellur at redhat.com
Depends On: 1335090
Blocks: 1451981
+++ This bug was initially created as a clone of Bug #1335090 +++
Description of problem:
The shared volume doesn't get mounted on one (sometimes two) of the nodes after rebooting all
nodes in the cluster, resulting in a missing symlink (/var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs).
Version-Release number of selected component (if applicable):
mainline
How reproducible:
Always
Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Make sure the shared volume is created and mounted on all the nodes of the
cluster, and that the symlink is created, as shown below.
[root at dhcp42-20 ~]# gluster volume status gluster_shared_storage
Status of volume: gluster_shared_storage
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick dhcp42-239.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick 49155 0 Y 2293
Brick dhcp43-175.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick 49155 0 Y 2281
Brick dhcp42-20.lab.eng.blr.redhat.com:/var
/lib/glusterd/ss_brick 49155 0 Y 2266
Self-heal Daemon on localhost N/A N/A Y 2257
Self-heal Daemon on dhcp42-239.lab.eng.blr.
redhat.com N/A N/A Y 2287
Self-heal Daemon on dhcp43-175.lab.eng.blr.
redhat.com N/A N/A Y 2253
Self-heal Daemon on dhcp42-196.lab.eng.blr.
redhat.com N/A N/A Y 2258
Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks
dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage 27740928 1697152 26043776 7% /run/gluster/shared_storage
dhcp42-239.lab.eng.blr.redhat.com:/gluster_shared_storage 27740928 1697152 26043776 7% /run/gluster/shared_storage
dhcp43-175.lab.eng.blr.redhat.com:/gluster_shared_storage 27740928 1697152 26043776 7% /run/gluster/shared_storage
dhcp42-196.lab.eng.blr.redhat.com:/gluster_shared_storage 27740928 1697152 26043776 7% /run/gluster/shared_storage
[root at dhcp42-20 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 80 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-20.lab.eng.blr.redhat.com/nfs
[root at dhcp42-239 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs
[root at dhcp43-175 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp43-175.lab.eng.blr.redhat.com/nfs
[root at dhcp42-196 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:19 /var/lib/nfs ->
/var/run/gluster/shared_storage/nfs-ganesha/dhcp42-196.lab.eng.blr.redhat.com/nfs
3. Reboot all the nodes of the cluster.
4. Observe that on 2 of the 4 nodes the shared storage is not mounted (most of
the time it fails to get mounted on one node).
5. As a result, the /var/lib/nfs symlink does not get created on these 2 nodes.
6. Both of these nodes have the shared-storage entry in /etc/fstab, and manually
mounting the shared storage on them works (see the sketch after this list).
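For illustration only, not taken from this report: the /etc/fstab entry referred to
in step 6 typically looks something like the line below, and the manual mount uses
the same server and mount point. The hostname shown is just one of the cluster
nodes, and the exact mount options may differ on a real cluster.
dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage /var/run/gluster/shared_storage/ glusterfs defaults 0 0
[root at dhcp42-20 ~]# mount -t glusterfs dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage /var/run/gluster/shared_storage
[root at dhcp42-20 ~]# ls -ld /var/lib/nfs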
Actual results:
Shared volume doesn't get mounted on a few nodes after rebooting all nodes in
the cluster.
Expected results:
The shared volume should get mounted on all the nodes after a reboot.
Additional info:
--- Additional comment from Soumya Koduri on 2016-05-11 07:38:20 EDT ---
I see the following errors in the node4 logs:
[2016-05-11 15:56:04.984079] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk]
0-gluster_shared_storage-client-1: failed to get the port number for remote
subvolume. Please run 'gluster volume status' on server to see if brick process
is running.
[2016-05-11 15:56:04.984357] I [MSGID: 114018]
[client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-1:
disconnected from gluster_shared_storage-client-1. Client process will keep
trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:04.984374] W [MSGID: 108001] [afr-common.c:4210:afr_notify]
0-gluster_shared_storage-replicate-0: Client-quorum is not met
[2016-05-11 15:56:05.291773] E [MSGID: 114058]
[client-handshake.c:1524:client_query_portmap_cbk]
0-gluster_shared_storage-client-2: failed to get the port number for remote
subvolume. Please run 'gluster volume status' on server to see if brick process
is running.
[2016-05-11 15:56:05.292104] I [MSGID: 114018]
[client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-2:
disconnected from gluster_shared_storage-client-2. Client process will keep
trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:05.292165] E [MSGID: 108006] [afr-common.c:4152:afr_notify]
0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline
until atleast one of them comes back up.
[2016-05-11 15:56:05.295895] I [fuse-bridge.c:5166:fuse_graph_setup] 0-fuse:
switched to graph 0
[2016-05-11 15:56:05.296679] I [fuse-bridge.c:4077:fuse_init] 0-glusterfs-fuse:
FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-05-11 15:56:05.296828] I [MSGID: 108006]
[afr-common.c:4261:afr_local_init] 0-gluster_shared_storage-replicate-0: no
subvolumes up
[2016-05-11 15:56:05.297606] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.298786] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.298818] W [fuse-bridge.c:766:fuse_attr_cbk]
0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-05-11 15:56:05.305894] E [dht-helper.c:1602:dht_inode_ctx_time_update]
(-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca)
[0x7fbcaef8ad6a]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379)
[0x7fbcaecf51d9]
-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210)
[0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode
[Invalid argument]
[2016-05-11 15:56:05.307751] I [fuse-bridge.c:5007:fuse_thread_proc] 0-fuse:
unmounting /run/gluster/shared_storage
Since this seems to be an issue with the gluster_shared_storage mount being lost,
I am adjusting the components accordingly and requesting Avra to take a look.
--- Additional comment from Avra Sengupta on 2016-05-13 01:30:09 EDT ---
This is expected behaviour. The shared volume itself is hosted on these nodes,
and every node mounts it via one particular node, the one named in its /etc/fstab
entry. When all the nodes are down, the shared storage volume is effectively down
as well. When the nodes come back up, none of them can connect to the shared
storage until the node whose entry is mentioned in /etc/fstab is up and serving
the volume. That node itself will never connect to the shared storage on reboot,
because by the time its /etc/fstab entry is replayed, the volume is not yet being
served.
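One possible way to work around this ordering problem, sketched here purely as an
illustration (it is not the fix tracked by this bug), is to retry the fstab mount
after boot until glusterd is serving the volume, for example from a small script
started by a boot-time service. The mount point, retry count and delay below are
placeholders.
#!/bin/bash
# Hypothetical retry-mount sketch: keep retrying the existing /etc/fstab
# entry for the shared storage until the volume is being served.
MNT=/var/run/gluster/shared_storage
for i in $(seq 1 20); do
    mountpoint -q "$MNT" && exit 0   # already mounted, nothing to do
    mount "$MNT" && exit 0           # uses the /etc/fstab entry for this path
    sleep 5                          # give glusterd and the bricks time to come up
done
echo "shared storage still not mounted at $MNT" >&2
exit 1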
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1335090
[Bug 1335090] Shared volume doesn't get mounted on few nodes after
rebooting all nodes in cluster.
https://bugzilla.redhat.com/show_bug.cgi?id=1451981
[Bug 1451981] [GSS] NFS-ganesha is not getting started properly after node
reboot.