[Gluster-users] [SOLVED] [Nfs-ganesha-support] volume start: gv01: failed: Quorum not met. Volume operation not allowed.

TomK tomkcpr at mdevsys.com
Tue May 22 05:43:42 UTC 2018


Hey All,

It appears I've solved this one, and NFS mounts now work on all my
clients.  No issues since applying the fix a few hours back.

RESOLUTION

SELinux was to blame for the trouble; auditd merely recorded the
denials.  I noticed the following AVC messages in the audit logs on two
of the three NFS servers (nfs01, nfs02, nfs03):

type=AVC msg=audit(1526965320.850:4094): avc:  denied  { write } for 
pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 
scontext=system_u:system_r:ganesha_t:s0 
tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4094): arch=c000003e syscall=2 
success=no exit=-13 a0=7f23b0003150 a1=2 a2=180 a3=2 items=0 ppid=1 
pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 
fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" 
exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4094): 
proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54
type=AVC msg=audit(1526965320.850:4095): avc:  denied  { unlink } for 
pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 
scontext=system_u:system_r:ganesha_t:s0 
tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4095): arch=c000003e syscall=87 
success=no exit=-13 a0=7f23b0004100 a1=7f23b0000050 a2=7f23b0004100 a3=5 
items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 
fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 
comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" 
subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4095): 
proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54

The fix was to adjust the SELinux policy using audit2allow.
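
For anyone who needs the exact steps, the usual audit2allow workflow is
roughly the following (ganesha_krb5 is just the module name I'm using in
this example; review the generated .te file before loading anything):

# Pull the ganesha.nfsd AVC denials out of the audit log and build a
# local policy module from them:
ausearch -m avc -c ganesha.nfsd | audit2allow -M ganesha_krb5
# Inspect ganesha_krb5.te, then install the compiled module on each
# affected NFS server:
semodule -i ganesha_krb5.pp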

All the errors below, including the ones in the links further down, were
due to that.

It turns out that whenever it worked, the client had hit the only
working server, nfs03.  Whenever it didn't work, it was hitting one of
the non-working servers, so mounts succeeded some of the time and
failed the rest.  It initially looked like an HAProxy / Keepalived
problem as well, since I couldn't mount using the VIP but could using
a host directly, but that wasn't the case either.
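
To illustrate the sort of test that threw me off (the VIP hostname and
the /mnt/test mount point below are placeholders, not the real names
from my setup):

# Mounting through the HAProxy / Keepalived VIP failed whenever the
# request landed on nfs01 or nfs02:
mount -t nfs4 nfs-vip.nix.my.dom:/n /mnt/test
# Mounting a known-good backend directly always worked, which is what
# made it look like a VIP problem at first:
mount -t nfs4 nfs03.nix.my.dom:/n /mnt/test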

I also added a third brick, on nfs03, to the GlusterFS volume to see
whether the backend filesystem was to blame, since GlusterFS recommends
a minimum of three bricks for replication, but that had no effect.
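
For reference, the brick was added along these lines (the brick path
mirrors the existing ones shown in the volume info further down; the
peer probe is only needed if nfs03 isn't already in the trusted pool):

gluster peer probe nfs03
gluster volume add-brick gv01 replica 3 nfs03:/bricks/0/gv01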

In case anyone runs into this, I've added notes here as well:

http://microdevsys.com/wp/kernel-nfs-nfs4_discover_server_trunking-unhandled-error-512-exiting-with-error-eio-and-mount-hangs/

http://microdevsys.com/wp/nfs-reply-xid-3844308326-reply-err-20-auth-rejected-credentials-client-should-begin-new-session/

The errors thrown included:

NFS reply xid 3844308326 reply ERR 20: Auth Rejected Credentials (client 
should begin new session)

kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting 
with error EIO and mount hangs

+ the kernel exception below.

-- 
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


May 21 23:53:13 psql01 kernel: CPU: 3 PID: 2273 Comm: mount.nfs Tainted: 
G             L ------------   3.10.0-693.21.1.el7.x86_64 #1
.
.
.
May 21 23:53:13 psql01 kernel: task: ffff880136335ee0 ti: 
ffff8801376b0000 task.ti: ffff8801376b0000
May 21 23:53:13 psql01 kernel: RIP: 0010:[<ffffffff816b6545>] 
[<ffffffff816b6545>] _raw_spin_unlock_irqrestore+0x15/0x20
May 21 23:53:13 psql01 kernel: RSP: 0018:ffff8801376b3a60  EFLAGS: 00000206
May 21 23:53:13 psql01 kernel: RAX: ffffffffc05ab078 RBX: 
ffff880036973928 RCX: dead000000000200
May 21 23:53:13 psql01 kernel: RDX: ffffffffc05ab078 RSI: 
0000000000000206 RDI: 0000000000000206
May 21 23:53:13 psql01 kernel: RBP: ffff8801376b3a60 R08: 
ffff8801376b3ab8 R09: ffff880137de1200
May 21 23:53:13 psql01 kernel: R10: ffff880036973928 R11: 
0000000000000000 R12: ffff880036973928
May 21 23:53:13 psql01 kernel: R13: ffff8801376b3a58 R14: 
ffff88013fd98a40 R15: ffff8801376b3a58
May 21 23:53:13 psql01 kernel: FS:  00007fab48f07880(0000) 
GS:ffff88013fd80000(0000) knlGS:0000000000000000
May 21 23:53:13 psql01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
May 21 23:53:13 psql01 kernel: CR2: 00007f99793d93cc CR3: 
000000013761e000 CR4: 00000000000007e0
May 21 23:53:13 psql01 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
May 21 23:53:13 psql01 kernel: DR3: 0000000000000000 DR6: 
00000000ffff0ff0 DR7: 0000000000000400
May 21 23:53:13 psql01 kernel: Call Trace:
May 21 23:53:13 psql01 kernel: [<ffffffff810b4d86>] finish_wait+0x56/0x70
May 21 23:53:13 psql01 kernel: [<ffffffffc0580361>] 
nfs_wait_client_init_complete+0xa1/0xe0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff810b4fc0>] ? 
wake_up_atomic_t+0x30/0x30
May 21 23:53:13 psql01 kernel: [<ffffffffc0581e9b>] 
nfs_get_client+0x22b/0x470 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc05eafd8>] 
nfs4_set_client+0x98/0x130 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05ec77e>] 
nfs4_create_server+0x13e/0x3b0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e391e>] 
nfs4_remote_mount+0x2e/0x60 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? 
__alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] 
vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3846>] 
nfs_do_root_mount+0x86/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3c44>] 
nfs4_try_mount+0x44/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05826d7>] ? 
get_nfs_version+0x27/0x90 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058ec9b>] 
nfs_fs_mount+0x4cb/0xda0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058fbe0>] ? 
nfs_clone_super+0x140/0x140 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058daa0>] ? 
param_set_portnr+0x70/0x70 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? 
__alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] 
vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffff81229263>] do_mount+0x233/0xaf0
May 21 23:53:13 psql01 kernel: [<ffffffff81229ea6>] SyS_mount+0x96/0xf0
May 21 23:53:13 psql01 kernel: [<ffffffff816c0715>] 
system_call_fastpath+0x1c/0x21
May 21 23:53:13 psql01 kernel: [<ffffffff816c0661>] ? 
system_call_after_swapgs+0xae/0x146




On 5/7/2018 10:28 PM, TomK wrote:
> On 4/11/2018 11:54 AM, Alex K wrote:
> 
> Hey Guys,
> 
> Returning to this topic, after disabling the quorum (via 'gluster
> volume set', as shown just below):
> 
> cluster.quorum-type: none
> cluster.server-quorum-type: none
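> 
> Roughly, those two options were set like this:
> 
> gluster volume set gv01 cluster.quorum-type none
> gluster volume set gv01 cluster.server-quorum-type none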
> 
> I've run into a number of Gluster errors (see below).
> 
> I'm using Gluster as the backend for my NFS storage.  I have Gluster
> running on two nodes, nfs01 and nfs02, and the volume is mounted on /n
> on each host.  The path /n is in turn shared out by NFS Ganesha.  It's
> a two-node setup with quorum disabled, as noted below:
> 
> [root at nfs02 ganesha]# mount|grep gv01
> nfs02:/gv01 on /n type fuse.glusterfs 
> (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072) 
> 
> 
> [root at nfs01 glusterfs]# mount|grep gv01
> nfs01:/gv01 on /n type fuse.glusterfs 
> (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072) 
> 
> 
> Gluster always reports as working no matter when I type the below two 
> commands:
> 
> [root at nfs01 glusterfs]# gluster volume info
> 
> Volume Name: gv01
> Type: Replicate
> Volume ID: e5ccc75e-5192-45ac-b410-a34ebd777666
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: nfs01:/bricks/0/gv01
> Brick2: nfs02:/bricks/0/gv01
> Options Reconfigured:
> cluster.server-quorum-type: none
> cluster.quorum-type: none
> server.event-threads: 8
> client.event-threads: 8
> performance.readdir-ahead: on
> performance.write-behind-window-size: 8MB
> performance.io-thread-count: 16
> performance.cache-size: 1GB
> nfs.trusted-sync: on
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> [root at nfs01 glusterfs]# gluster status
> unrecognized word: status (position 0)
> [root at nfs01 glusterfs]# gluster volume status
> Status of volume: gv01
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick nfs01:/bricks/0/gv01                  49152     0          Y       1422
> Brick nfs02:/bricks/0/gv01                  49152     0          Y       1422
> Self-heal Daemon on localhost               N/A       N/A        Y       1248
> Self-heal Daemon on nfs02.nix.my.dom        N/A       N/A        Y       1251
> 
> Task Status of Volume gv01
> ------------------------------------------------------------------------------ 
> 
> There are no active volume tasks
> 
> [root at nfs01 glusterfs]#
> 
> [root at nfs01 glusterfs]# rpm -aq|grep -Ei gluster
> glusterfs-3.13.2-2.el7.x86_64
> glusterfs-devel-3.13.2-2.el7.x86_64
> glusterfs-fuse-3.13.2-2.el7.x86_64
> glusterfs-api-devel-3.13.2-2.el7.x86_64
> centos-release-gluster313-1.0-1.el7.centos.noarch
> python2-gluster-3.13.2-2.el7.x86_64
> glusterfs-client-xlators-3.13.2-2.el7.x86_64
> glusterfs-server-3.13.2-2.el7.x86_64
> libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64
> glusterfs-cli-3.13.2-2.el7.x86_64
> centos-release-gluster312-1.0-1.el7.centos.noarch
> python2-glusterfs-api-1.1-1.el7.noarch
> glusterfs-libs-3.13.2-2.el7.x86_64
> glusterfs-extra-xlators-3.13.2-2.el7.x86_64
> glusterfs-api-3.13.2-2.el7.x86_64
> [root at nfs01 glusterfs]#
> 
> The short of it is that everything works, and mounts on the guests
> work, as long as I don't try to write to the NFS share from my
> clients.  As soon as I write to the share, everything comes apart like
> this:
> 
> -sh-4.2$ pwd
> /n/my.dom/tom
> -sh-4.2$ ls -altri
> total 6258
> 11715278280495367299 -rw-------. 1 tom at my.dom tom at my.dom     231 Feb 17 20:15 .bashrc
> 10937819299152577443 -rw-------. 1 tom at my.dom tom at my.dom     193 Feb 17 20:15 .bash_profile
> 10823746994379198104 -rw-------. 1 tom at my.dom tom at my.dom      18 Feb 17 20:15 .bash_logout
> 10718721668898812166 drwxr-xr-x. 3 root        root           4096 Mar  5 02:46 ..
> 12008425472191154054 drwx------. 2 tom at my.dom tom at my.dom    4096 Mar 18 03:07 .ssh
> 13763048923429182948 -rw-rw-r--. 1 tom at my.dom tom at my.dom 6359568 Mar 25 22:38 opennebula-cores.tar.gz
> 11674701370106210511 -rw-rw-r--. 1 tom at my.dom tom at my.dom       4 Apr  9 23:25 meh.txt
>  9326637590629964475 -rw-r--r--. 1 tom at my.dom tom at my.dom   24970 May  1 01:30 nfs-trace-working.dat.gz
>  9337343577229627320 -rw-------. 1 tom at my.dom tom at my.dom    3734 May  1 23:38 .bash_history
> 11438151930727967183 drwx------. 3 tom at my.dom tom at my.dom    4096 May  1 23:58 .
>  9865389421596220499 -rw-r--r--. 1 tom at my.dom tom at my.dom    4096 May  1 23:58 .meh.txt.swp
> -sh-4.2$ touch test.txt
> -sh-4.2$ vi test.txt
> -sh-4.2$ ls -altri
> ls: cannot open directory .: Permission denied
> -sh-4.2$ ls -altri
> ls: cannot open directory .: Permission denied
> -sh-4.2$ ls -altri
> 
> This is followed by a slew of other errors in apps using the gluster 
> volume.  These errors include:
> 
> 02/05/2018 23:10:52 : epoch 5aea7bd5 : nfs02.nix.my.dom : 
> ganesha.nfsd-5891[svc_12] nfs_rpc_process_request :DISP :INFO :Could not 
> authenticate request... rejecting with AUTH_STAT=RPCSEC_GSS_CREDPROBLEM
> 
> 
> ==> ganesha-gfapi.log <==
> [2018-05-03 04:32:18.009245] I [MSGID: 114021] [client.c:2369:notify] 
> 0-gv01-client-0: current graph is no longer active, destroying rpc_client
> [2018-05-03 04:32:18.009338] I [MSGID: 114021] [client.c:2369:notify] 
> 0-gv01-client-1: current graph is no longer active, destroying rpc_client
> [2018-05-03 04:32:18.009499] I [MSGID: 114018] 
> [client.c:2285:client_rpc_notify] 0-gv01-client-0: disconnected from 
> gv01-client-0. Client process will keep trying to connect to glusterd 
> until brick's port is available
> [2018-05-03 04:32:18.009557] I [MSGID: 114018] 
> [client.c:2285:client_rpc_notify] 0-gv01-client-1: disconnected from 
> gv01-client-1. Client process will keep trying to connect to glusterd 
> until brick's port is available
> [2018-05-03 04:32:18.009610] E [MSGID: 108006] 
> [afr-common.c:5164:__afr_handle_child_down_event] 0-gv01-replicate-0: 
> All subvolumes are down. Going offline until atleast one of them comes 
> back up.
> 
> 
> [2018-05-01 22:43:06.412067] E [MSGID: 114058] 
> [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-1: 
> failed to get the port number for remote subvolume. Please run 'gluster 
> volume status' on server to see if brick process is running.
> [2018-05-01 22:43:55.554833] E [socket.c:2374:socket_connect_finish] 
> 0-gv01-client-0: connection to 192.168.0.131:49152 failed (Connection 
> refused); disconnecting socket
> 
> 
> So I'm wondering: is this due to the two-node Gluster setup, as it
> seems to be, and what do I really need to do here?  Should I go with
> the recommended three-node setup, which would include a proper quorum,
> to avoid this?  Or is there more to it, and it doesn't really matter
> that I have a two-node Gluster cluster without a quorum, with the
> cause lying somewhere else entirely?
> 
> Again, any time I check the Gluster volumes, everything checks out.
> The results of both 'gluster volume info' and 'gluster volume status'
> are always as pasted above, fully working.
> 
> I'm also using FreeIPA (as the Kerberos KDC) with this solution.
> 



