[Bugs] [Bug 1706842] New: Hard Failover with Samba and Glusterfs fails

bugzilla at redhat.com
Mon May 6 11:39:59 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1706842

            Bug ID: 1706842
           Summary: Hard Failover with Samba and Glusterfs fails
           Product: GlusterFS
           Version: 5
            Status: NEW
         Component: gluster-smb
          Assignee: bugs at gluster.org
          Reporter: david.spisla at iternity.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1564378
  --> https://bugzilla.redhat.com/attachment.cgi?id=1564378&action=edit
Backtrace of the SMBD and GLUSTER communication

Description of problem:

I have this setup: a 4-node GlusterFS v5.5 cluster, using Samba/CTDB v4.8 to
access the volumes via the vfs-glusterfs plugin (each node has a VIP).
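
For reference, the shares are exported through Samba's glusterfs VFS module.
A minimal sketch of such a share definition (volume name, paths and log
settings are placeholders, not my exact config) looks roughly like this:

    [archive1]
        path = /
        read only = no
        kernel share modes = no
        vfs objects = glusterfs
        glusterfs:volume = archive1
        glusterfs:logfile = /var/log/samba/glusterfs-archive1.log
        glusterfs:loglevel = 7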

I was testing this failover scenario:

1. Start writing 940 GB of small files (64K-100K) from a Win10 client to node1
2. During the write process I hard-shut down node1 (where the client is
connected via VIP) by turning off the power

My expectation is that the write process stops and after a while the Win10
client offers me a Retry, so I can continue the write on a different node
(which now has the VIP of node1). I observed exactly this in the past (with
Gluster v3.12), but now the system shows a strange behaviour:

The Win10 client does nothing and the Explorer freezes; in the backend CTDB
cannot perform the failover and throws errors. The glusterd on node2 and node3
logs these messages:

[2019-04-16 14:47:31.828323] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x24349) [0x7f1a62fcb349]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x2d950) [0x7f1a62fd4950]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe0359) [0x7f1a63087359]
) 0-management: Lock for vol archive1 not held
[2019-04-16 14:47:31.828350] W [MSGID: 106117]
[glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
released for archive1
[2019-04-16 14:47:31.828369] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x24349) [0x7f1a62fcb349]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x2d950) [0x7f1a62fd4950]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe0359) [0x7f1a63087359]
) 0-management: Lock for vol archive2 not held
[2019-04-16 14:47:31.828376] W [MSGID: 106117]
[glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
released for archive2
[2019-04-16 14:47:31.828412] W [glusterd-locks.c:795:glusterd_mgmt_v3_unlock]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x24349) [0x7f1a62fcb349]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x2d950) [0x7f1a62fd4950]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe0359) [0x7f1a63087359]
) 0-management: Lock for vol gluster_shared_storage not held
[2019-04-16 14:47:31.828423] W [MSGID: 106117]
[glusterd-handler.c:6451:__glusterd_peer_rpc_notify] 0-management: Lock not
released for gluster_shared_storage

In my opinion Samba/CTDB cannot perform the failover correctly and continue
the write process because glusterfs did not release the lock. But this is not
clear to me.

Additional info:
I made a network trace on the Windows machine.
There it is visible that the client tries a TreeConnect several times.
This Tree Connect is the connection to a share. Samba answers each attempt with
NT_STATUS_UNSUCCESSFUL, which is unfortunately not a very meaningful message.

Similarly, I "caught" the smbd in the debugger and was able to pull a backtrace
while hangs in the futex-call we found in / proc / <pid> / stack. The backtrace
smbd-gluster-bt.txt (attached) shows that the smbd hangs in the gluster module.
You can see in Frame 9 that Samba is hanging in the TCON
(smbd_smb2_tree_connect). In frame 2 the function appears
glfs_init () whose call you can find in source3 / modules / vfs_glusterfs.c,
line 342 (in samba master). Then comes another frame in the gluster-lib and
then immediately the pthread_condwait call, which ends up in the kernel in a
futex call (see / proc / <pid> / stack).
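
Roughly, the gluster bring-up that vfs_glusterfs performs at that point boils
down to the following libgfapi sequence (a simplified, self-contained sketch,
not the exact Samba code; volume and server names are placeholders).
glfs_init() is the call that blocks:

    /* Simplified sketch of the libgfapi bring-up done around tree connect;
     * not the exact Samba code. Build with: gcc demo.c -lgfapi */
    #include <stdio.h>
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        glfs_t *fs = glfs_new("archive1");          /* placeholder volume */
        if (fs == NULL) {
            perror("glfs_new");
            return 1;
        }

        /* point the handle at a glusterd serving the volfile */
        glfs_set_volfile_server(fs, "tcp", "node1", 24007);

        /* glfs_init() fetches the volfile and waits on a condition variable
         * (the pthread_cond_wait / futex seen in /proc/<pid>/stack) until
         * the client graph is up or initialisation fails -- the point where
         * smbd hangs. */
        if (glfs_init(fs) != 0) {
            perror("glfs_init");
            glfs_fini(fs);
            return 1;
        }

        printf("volume initialised\n");
        glfs_fini(fs);
        return 0;
    }

In smbd this sequence runs in the context of the tree connect, which matches
frame 9 (smbd_smb2_tree_connect) in the attached backtrace.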

In essence: Samba is waiting for gluster, apparently for roughly 3 seconds.
Gluster then returns an error and the client tries again, apparently for about
8 minutes.
