[Bugs] [Bug 1717824] New: Fencing: Added tcmu-runner ALUA feature support, but after one node is rebooted glfs_file_lock() gets stuck

bugzilla at redhat.com bugzilla at redhat.com
Thu Jun 6 09:35:29 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1717824

            Bug ID: 1717824
           Summary: Fencing: Added tcmu-runner ALUA feature support,
                    but after one node is rebooted glfs_file_lock()
                    gets stuck
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: locks
          Assignee: bugs at gluster.org
          Reporter: xiubli at redhat.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:

GlusterFS now supports the fencing feature. With this we can support the ALUA
feature in LIO/TCMU.

The fencing doc:
https://review.gluster.org/#/c/glusterfs-specs/+/21925/6/accepted/fencing.md

The fencing test example:
https://review.gluster.org/#/c/glusterfs/+/21496/12/tests/basic/fencing/fence-basic.c

The LIO/tcmu-runner PR adding the ALUA support:
https://github.com/open-iscsi/tcmu-runner/pull/554
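
For reference, this is roughly how the fencing lock is taken through gfapi in
the fence-basic.c test linked above (only a minimal sketch: the volume, host
and file names are placeholders, and the exact xattr value and flags should be
checked against that test):

====
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

/* Minimal sketch: open a file on a mandatory-lock enabled volume and take a
 * mandatory write lock through gfapi, similar to what fence-basic.c and the
 * tcmu-runner glfs handler do. "repvol" and "node1" are placeholders. */
int main(void)
{
    glfs_t *fs = glfs_new("repvol");
    glfs_set_volfile_server(fs, "tcp", "node1", 24007);
    if (glfs_init(fs) < 0)
        return 1;

    glfs_fd_t *fd = glfs_creat(fs, "/block3", O_RDWR, 0644);
    if (!fd)
        return 1;

    /* Ask the locks xlator to enforce mandatory locking on this fd
     * (xattr name taken from the fencing spec/test linked above). */
    glfs_fsetxattr(fd, "trusted.glusterfs.enforce-mandatory-lock", "set",
                   strlen("set"), 0);

    struct flock lock = {0};
    lock.l_type   = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start  = 0;
    lock.l_len    = 0;   /* whole file */

    /* This is the call that never returns in the failure described below. */
    if (glfs_file_lock(fd, F_SETLKW, &lock, GLFS_LK_MANDATORY) < 0)
        perror("glfs_file_lock");

    glfs_close(fd);
    glfs_fini(fs);
    return 0;
}
====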

But currently, when testing tcmu-runner with the above PR by shutting down one
of the HA nodes and starting it again after 2~3 minutes, we can see on all the
HA nodes that glfs_file_lock() gets stuck. The following is from
/var/log/tcmu-runner.log:

====
2019-06-06 13:50:15.755 1316 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3:
lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:50:15.757 1316 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3:
lock call done. lock state 1
2019-06-06 13:50:55.845 1316 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4:
lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:50:55.847 1316 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block4:
lock call done. lock state 1
2019-06-06 13:57:50.102 1315 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3:
lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:57:50.103 1315 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3:
lock call done. lock state 1
2019-06-06 13:57:50.121 1315 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4:
lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 13:57:50.132 1315 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block4:
lock call done. lock state 1
2019-06-06 14:09:03.654 1328 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block3:
lock call state 2 retries 0. tag 65535 reopen 0
2019-06-06 14:09:03.662 1328 [DEBUG] tcmu_acquire_dev_lock:440 glfs/block3:
lock call done. lock state 1
2019-06-06 14:09:06.700 1328 [DEBUG] tcmu_acquire_dev_lock:388 glfs/block4:
lock call state 2 retries 0. tag 65535 reopen 0
====

The lock operation never returns.
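
While the acquire call is stuck, one way to check from a separate gfapi client
whether the lock is actually still held is to try a non-blocking F_SETLK
instead of F_SETLKW (again just a sketch, reusing the placeholder setup from
the block above):

====
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>

/* Sketch: non-blocking probe from a separate gfapi client while the
 * F_SETLKW call above is stuck. F_SETLK returns immediately with EAGAIN
 * (or EACCES) if a conflicting lock is still held, instead of blocking. */
static void probe_lock(glfs_fd_t *fd)
{
    struct flock lk = {0};
    lk.l_type   = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 0;   /* whole file */

    if (glfs_file_lock(fd, F_SETLK, &lk, GLFS_LK_MANDATORY) < 0) {
        if (errno == EAGAIN || errno == EACCES)
            printf("conflicting lock still held by another client\n");
        else
            perror("glfs_file_lock(F_SETLK)");
        return;
    }

    /* Lock was free: release it again so we don't fence the real owner. */
    lk.l_type = F_UNLCK;
    glfs_file_lock(fd, F_SETLK, &lk, GLFS_LK_MANDATORY);
}
====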

I am using the following GlusterFS packages, which I built myself:

# rpm -qa|grep glusterfs
glusterfs-extra-xlators-7dev-0.0.el7.x86_64
glusterfs-api-devel-7dev-0.0.el7.x86_64
glusterfs-7dev-0.0.el7.x86_64
glusterfs-server-7dev-0.0.el7.x86_64
glusterfs-cloudsync-plugins-7dev-0.0.el7.x86_64
glusterfs-resource-agents-7dev-0.0.el7.noarch
glusterfs-api-7dev-0.0.el7.x86_64
glusterfs-devel-7dev-0.0.el7.x86_64
glusterfs-regression-tests-7dev-0.0.el7.x86_64
glusterfs-gnfs-7dev-0.0.el7.x86_64
glusterfs-client-xlators-7dev-0.0.el7.x86_64
glusterfs-geo-replication-7dev-0.0.el7.x86_64
glusterfs-debuginfo-7dev-0.0.el7.x86_64
glusterfs-fuse-7dev-0.0.el7.x86_64
glusterfs-events-7dev-0.0.el7.x86_64
glusterfs-libs-7dev-0.0.el7.x86_64
glusterfs-cli-7dev-0.0.el7.x86_64
glusterfs-rdma-7dev-0.0.el7.x86_64


How reproducible:
30%.

Steps to Reproduce:
1. Create one replicated volume (HA >= 2) with mandatory locking enabled
2. Create one gluster-block target
3. Log in and run fio on the client node
4. Shut down one of the HA nodes, wait 2~3 minutes, and then start it again

Actual results:
The fio job never recovers and the read/write bandwidth stays at 0 KB/s, and we
can see tons of logs like the following in the /var/log/tcmu-runner.log file:

2019-06-06 15:01:06.641 1328 [DEBUG] alua_implicit_transition:561 glfs/block4:
Lock acquisition operation is already in process.
2019-06-06 15:01:06.648 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353
glfs/block4: 28 0 0 3 1f 80 0 0 8 0 
2019-06-06 15:01:06.648 1328 [DEBUG] alua_implicit_transition:561 glfs/block4:
Lock acquisition operation is already in process.
2019-06-06 15:01:06.655 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353
glfs/block4: 28 0 0 3 1f 80 0 0 8 0 
2019-06-06 15:01:06.655 1328 [DEBUG] alua_implicit_transition:561 glfs/block4:
Lock acquisition operation is already in process.
2019-06-06 15:01:06.661 1328 [DEBUG_SCSI_CMD] tcmu_cdb_print_info:353
glfs/block4: 28 0 0 3 1f 80 0 0 8 0 
2019-06-06 15:01:06.662 1328 [DEBUG] alua_implicit_transition:561 glfs/block4:
Lock acquisition operation is already in process.


Expected results:
fio should recover even before the shut-down node comes back up.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

