[Bugs] [Bug 1273267] New: nfs-ganesha: nfs-ganesha server process segfaults and post failover the I/O doesn't resume

bugzilla at redhat.com bugzilla at redhat.com
Tue Oct 20 06:06:40 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1273267

            Bug ID: 1273267
           Summary: nfs-ganesha: nfs-ganesha server process segfaults and
                    post failover the I/O doesn't resume
           Product: GlusterFS
           Version: 3.7.5
         Component: ganesha-nfs
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: saujain at redhat.com



Created attachment 1084605
  --> https://bugzilla.redhat.com/attachment.cgi?id=1084605&action=edit
vm1 messages

Description of problem:
I created a tiered volume, exported it over nfs-ganesha, mounted it with vers=4
and started I/O (the ltp test suite). The tests hang: the nfs-ganesha server
process segfaulted, a failover happens, but the I/O still does not resume.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3-0.rc6.el7.centos.x86_64
glusterfs-3.7.5-1.el7.x86_64

How reproducible:
Segfault seen on the first attempt.

Steps to Reproduce:
1. Create a volume of type dist-rep with tiering enabled
2. Export the volume over nfs-ganesha and mount it with vers=4
3. Execute the fs-sanity test suite (a rough command sketch of steps 1 and 2 follows)
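
For reference, a minimal sketch of steps 1 and 2, assuming four nodes vm1-vm4,
a volume named vol3 (as in the logs below), hypothetical brick paths, the
GlusterFS 3.7 CLI, and an nfs-ganesha HA cluster already set up via
ganesha-ha.conf; adjust hostnames, brick paths and the VIP to the actual setup:

# 1. Create a 2x2 dist-rep volume and attach a replicated hot tier
gluster volume create vol3 replica 2 \
    vm1:/bricks/cold1 vm2:/bricks/cold1 vm3:/bricks/cold1 vm4:/bricks/cold1
gluster volume start vol3
gluster volume attach-tier vol3 replica 2 vm1:/bricks/hot1 vm2:/bricks/hot1

# 2. Export the volume through nfs-ganesha and mount it with NFSv4
#    (<VIP> is a placeholder for one of the cluster virtual IPs)
gluster volume set vol3 ganesha.enable on
mount -t nfs -o vers=4 <VIP>:/vol3 /mnt/vol3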

Actual results:
While the ltp test suite is executing, the nfs-ganesha process hits a segfault,
as seen in /var/log/messages:
Oct 20 05:21:20 vm1 kernel: ganesha.nfsd[9750]: segfault at 0 ip
00000000004b0ede sp 00007f59122a0ae0 error 4 in ganesha.nfsd[400000+1df000]
Oct 20 05:21:21 vm1 systemd: nfs-ganesha.service: main process exited,
code=killed, status=11/SEGV
Oct 20 05:21:21 vm1 systemd: Unit nfs-ganesha.service entered failed state.
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Additional logging available in
/var/log/pacemaker.log
Oct 20 05:21:31 vm1 cibadmin[21227]: notice: Invoked: /usr/sbin/cibadmin
--replace -o configuration -V --xml-pipe
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_monitor_0: not
running (node=vm1, call=119, rc=7, cib-update=142, confirmed=true)
Oct 20 05:21:31 vm1 crmd[19954]: notice: Operation vm1-dead_ip-1_start_0: ok
(node=vm1, call=120, rc=0, cib-update=143, confirmed=true)
Oct 20 05:21:38 vm1 IPaddr(vm1-cluster_ip-1)[21296]: INFO: IP status = ok,
IP_CIP=
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-cluster_ip-1_stop_0: ok
(node=vm1, call=123, rc=0, cib-update=145, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_stop_0: ok
(node=vm1, call=125, rc=0, cib-update=146, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation vm1-trigger_ip-1_stop_0: ok
(node=vm1, call=127, rc=0, cib-update=147, confirmed=true)
Oct 20 05:21:38 vm1 crmd[19954]: notice: Operation nfs-grace_start_0: ok
(node=vm1, call=128, rc=0, cib-update=148, confirmed=true)
Oct 20 05:21:48 vm1 logger: warning: pcs resource create vm1-dead_ip-1
ocf:heartbeat:Dummy failed

The failover itself does happen, as per the pcs status below:
 vm1-cluster_ip-1    (ocf::heartbeat:IPaddr):    Started vm4
 vm1-trigger_ip-1    (ocf::heartbeat:Dummy):    Started vm4
 vm2-cluster_ip-1    (ocf::heartbeat:IPaddr):    Started vm2
 vm2-trigger_ip-1    (ocf::heartbeat:Dummy):    Started vm2
 vm3-cluster_ip-1    (ocf::heartbeat:IPaddr):    Started vm3
 vm3-trigger_ip-1    (ocf::heartbeat:Dummy):    Started vm3
 vm4-cluster_ip-1    (ocf::heartbeat:IPaddr):    Started vm4
 vm4-trigger_ip-1    (ocf::heartbeat:Dummy):    Started vm4
 vm1-dead_ip-1    (ocf::heartbeat:Dummy):    Started vm1


But even after the failover, the I/O does not resume; the nfs-ganesha log on
the failed-over node shows the following errors:

19/10/2015 23:49:57 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat]
glusterfs_create_export :FSAL :EVENT :Volume vol3 exported at : '/'
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f781c0109c0
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:16 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-12]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f7814026c30
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:21 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f781000b1d0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-8]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-14]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f78180352c0
20/10/2015 05:22:24 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-16]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f8020bd0
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-10]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f784c037f50
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:25 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-4]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
file_close :FSAL :CRIT :Error : close returns with Transport endpoint is not
connected
20/10/2015 05:22:26 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-9]
cache_inode_close :INODE :CRIT :FSAL_close failed, returning
37(CACHE_INODE_SERVERFAULT) for entry 0x7f77f803ee00
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-1]
cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 05:22:28 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[work-3]
cache_inode_lookup_impl :INODE :EVENT :FSAL returned STALE from a lookup.
20/10/2015 06:09:51 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat]
dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending
heartbeat
20/10/2015 06:11:06 : epoch 56215bb5 : vm4 : ganesha.nfsd-16329[dbus_heartbeat]
dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending
heartbeat


Expected results:
Even if nfs-ganesha has segfaulted, the failover should allow the I/O to resume.
The segfault itself also needs to be fixed.

Additional info:
The coredump for the segfault was not found. I will run the test again and see
if the crash can be reproduced (a coredump-capture sketch follows).
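
In case it helps the re-run, a minimal sketch for making sure the next segfault
leaves a core file on these CentOS 7 nodes; the drop-in path, core_pattern and
crash directory below are assumptions, not the current configuration of the
test machines:

# Let the nfs-ganesha service dump unlimited-size cores
mkdir -p /etc/systemd/system/nfs-ganesha.service.d
printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/nfs-ganesha.service.d/core.conf
systemctl daemon-reload && systemctl restart nfs-ganesha

# Write cores to a known location, named with process name, pid and timestamp
mkdir -p /var/crash
echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern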


