[Bugs] [Bug 1222409] New: nfs-ganesha: HA failover happens but I/O does not move ahead when volume has two mounts and I/O going on both mounts
bugzilla at redhat.com
Mon May 18 07:26:02 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1222409
Bug ID: 1222409
Summary: nfs-ganesha: HA failover happens but I/O does not
move ahead when volume has two mounts and I/O going on
both mounts
Product: GlusterFS
Version: 3.7.0
Component: ganesha-nfs
Severity: high
Assignee: bugs at gluster.org
Reporter: saujain at redhat.com
Description of problem:
The problem is that I/O does not move ahead even though, according to the "pcs status"
output, the nfs-ganesha service has failed over to another node.
The same volume (say vol2) was mounted twice on a client using vers=3, with each
mount done through a virtual IP. iozone was then started on both mount points, and
on one of the servers the nfs-ganesha process was stopped. After the cluster's grace
period expires the I/O should have resumed, but that has not happened in this case.
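For reference, a minimal sketch of the client-side commands behind this setup (the
virtual IPs, mount points and iozone file names are placeholders, since the actual
VIPs are not listed in this report; NFSv3 mounts and iozone's automatic mode are
assumed):

# mount the same volume (vol2) twice, each mount through a different
# virtual IP of the nfs-ganesha HA cluster (VIPs are placeholders)
mount -t nfs -o vers=3 <VIP-1>:/vol2 /mnt/vol2-vip1
mount -t nfs -o vers=3 <VIP-2>:/vol2 /mnt/vol2-vip2

# run iozone on both mount points in parallel
iozone -a -f /mnt/vol2-vip1/iozone.tmp &
iozone -a -f /mnt/vol2-vip2/iozone.tmp &

# on one of the servers, stop nfs-ganesha to trigger the failover
service nfs-ganesha stop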
[root at nfs1 ~]# gluster volume status
Status of volume: gluster_shared_storage
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.148:/rhs/brick1/d1r1-share 49156 0 Y 3549
Brick 10.70.37.77:/rhs/brick1/d1r2-share 49155 0 Y 3329
Brick 10.70.37.76:/rhs/brick1/d2r1-share 49155 0 Y 3081
Brick 10.70.37.69:/rhs/brick1/d2r2-share 49155 0 Y 3346
Brick 10.70.37.148:/rhs/brick1/d3r1-share 49157 0 Y 3566
Brick 10.70.37.77:/rhs/brick1/d3r2-share 49156 0 Y 3346
Brick 10.70.37.76:/rhs/brick1/d4r1-share 49156 0 Y 3098
Brick 10.70.37.69:/rhs/brick1/d4r2-share 49156 0 Y 3363
Brick 10.70.37.148:/rhs/brick1/d5r1-share 49158 0 Y 3583
Brick 10.70.37.77:/rhs/brick1/d5r2-share 49157 0 Y 3363
Brick 10.70.37.76:/rhs/brick1/d6r1-share 49157 0 Y 3115
Brick 10.70.37.69:/rhs/brick1/d6r2-share 49157 0 Y 3380
Self-heal Daemon on localhost N/A N/A Y 28128
Self-heal Daemon on 10.70.37.69 N/A N/A Y 30533
Self-heal Daemon on 10.70.37.77 N/A N/A Y 16037
Self-heal Daemon on 10.70.37.76 N/A N/A Y 6128
Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks
Status of volume: vol2
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.148:/rhs/brick1/d1r1 49153 0 Y 28060
Brick 10.70.37.77:/rhs/brick1/d1r2 49152 0 Y 15975
Brick 10.70.37.76:/rhs/brick1/d2r1 49152 0 Y 6068
Brick 10.70.37.69:/rhs/brick1/d2r2 49152 0 Y 30472
Brick 10.70.37.148:/rhs/brick1/d3r1 49154 0 Y 28077
Brick 10.70.37.77:/rhs/brick1/d3r2 49153 0 Y 15992
Brick 10.70.37.76:/rhs/brick1/d4r1 49153 0 Y 6085
Brick 10.70.37.69:/rhs/brick1/d4r2 49153 0 Y 30489
Brick 10.70.37.148:/rhs/brick1/d5r1 49155 0 Y 28094
Brick 10.70.37.77:/rhs/brick1/d5r2 49154 0 Y 16009
Brick 10.70.37.76:/rhs/brick1/d6r1 49154 0 Y 6102
Brick 10.70.37.69:/rhs/brick1/d6r2 49154 0 Y 30506
Self-heal Daemon on localhost N/A N/A Y 28128
Self-heal Daemon on 10.70.37.69 N/A N/A Y 30533
Self-heal Daemon on 10.70.37.77 N/A N/A Y 16037
Self-heal Daemon on 10.70.37.76 N/A N/A Y 6128
Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks
Status of nfs-ganesha on all four nodes (note that no ganesha.nfsd process is left on nfs1, where it was stopped):
nfs1
====
root 3790 1 0 May13 ? 00:00:09 /usr/sbin/glusterfs
--volfile-server=nfs1 --volfile-id=/gluster_shared_storage
/var/run/gluster/shared_storage
---
nfs2
====
root 3300 1 0 May13 ? 00:00:09 /usr/sbin/glusterfs
--volfile-server=nfs1 --volfile-id=/gluster_shared_storage
/var/run/gluster/shared_storage
root 11003 1 0 May15 ? 00:01:08 /usr/bin/ganesha.nfsd -L
/var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p
/var/run/ganesha.nfsd.pid
---
nfs3
====
root 3577 1 0 May13 ? 00:00:08 /usr/sbin/glusterfs
--volfile-server=nfs1 --volfile-id=/gluster_shared_storage
/var/run/gluster/shared_storage
root 4195 1 0 May15 ? 00:01:08 /usr/bin/ganesha.nfsd -L
/var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p
/var/run/ganesha.nfsd.pid
---
nfs4
====
root 14760 1 0 May15 ? 00:00:04 /usr/sbin/glusterfs
--volfile-server=nfs1 --volfile-id=/gluster_shared_storage
/var/run/gluster/shared_storage
root 23970 1 0 May15 ? 00:02:17 /usr/bin/ganesha.nfsd -L
/var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p
/var/run/ganesha.nfsd.pid
pcs status; this clearly shows that the nfs-ganesha instance that was running
on nfs1 has failed over to nfs4:
[root at nfs1 ~]# pcs status
Cluster name: ganesha-ha-360
Last updated: Mon May 18 12:54:12 2015
Last change: Fri May 15 19:25:20 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
17 Resources configured
Online: [ nfs1 nfs2 nfs3 nfs4 ]
Full list of resources:
Clone Set: nfs-mon-clone [nfs-mon]
Started: [ nfs1 nfs2 nfs3 nfs4 ]
Clone Set: nfs-grace-clone [nfs-grace]
Started: [ nfs1 nfs2 nfs3 nfs4 ]
nfs1-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs4
nfs1-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs4
nfs2-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs2
nfs2-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs2
nfs3-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
nfs3-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
nfs4-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs4
nfs4-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs4
nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1
Version-Release number of selected component (if applicable):
glusterfs-3.7.0beta2-0.0.el6.x86_64
nfs-ganesha-2.2.0-0.el6.x86_64
How reproducible:
Happens on the first attempt itself
Steps to Reproduce:
1. create a volume of type 6x2 and start it
2. start nfs-ganesha on all the nodes under consideration, after completing the
pre-requisites
3. mount the volume on a client on two mount points, using different virtual
IPs
4. on one server, execute "service nfs-ganesha stop"
5. wait for the grace period to finish and for I/O to resume (a rough command
sketch of these steps follows this list)
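A rough command sketch of the steps above (hostnames and brick paths are
placeholders; the exact ganesha CLI depends on the installed 3.7 build, and the HA
pre-requisites, shared storage and ganesha-ha.conf, are assumed to already be in
place):

# 1. create a 6x2 distribute-replicate volume and start it
gluster volume create vol2 replica 2 \
    nfs1:/rhs/brick1/d1r1 nfs2:/rhs/brick1/d1r2 \
    nfs3:/rhs/brick1/d2r1 nfs4:/rhs/brick1/d2r2 \
    ...                     # 12 bricks in total
gluster volume start vol2

# 2. bring up nfs-ganesha on the cluster nodes and export vol2
#    (e.g. "gluster nfs-ganesha enable" and
#     "gluster volume set vol2 ganesha.enable on" on 3.7)

# 3. on the client, mount vol2 on two mount points via two different
#    virtual IPs (see the mount sketch in the description above)

# 4. on one server, stop the service to force a failover
service nfs-ganesha stop

# 5. watch the cluster state and wait for the grace period to expire
pcs status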
Actual results:
After step 4, I/O does not resume as expected.
Expected results:
I/O should resume once the grace period is over.
Additional info: