[Bugs] [Bug 1444128] New: [BrickMultiplex] gluster command not responding and .snaps directory is not visible after executing snapshot related command

bugzilla at redhat.com bugzilla at redhat.com
Thu Apr 20 15:42:02 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1444128

            Bug ID: 1444128
           Summary: [BrickMultiplex] gluster command not responding and
                    .snaps directory is not visible after executing
                    snapshot related command
           Product: GlusterFS
           Version: 3.10
         Component: glusterd
          Assignee: bugs at gluster.org
          Reporter: amukherj at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com, vbellur at redhat.com,
                    vdas at redhat.com
        Depends On: 1443896
            Blocks: 1443123



+++ This bug was initially created as a clone of Bug #1443896 +++

+++ This bug was initially created as a clone of Bug #1443123 +++

Description of problem:
On an existing gluster cluster with a USS-enabled volume and a snapshot created
and activated, I enabled brick multiplexing. After that I executed "gluster
snapshot status", which timed out.
Subsequent "gluster volume status" and "gluster peer status" commands also
returned timeout errors (happening across all 4 nodes).
Other commands such as "umount" and "df -Th" have also become slow.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-22.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have a setup ready with snapshots activated
2. Enable brick-multiplex
3. Run gluster snapshot status
4. Run gluster volume status
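
The steps above, as a hedged shell sketch (requires a live multi-node Gluster
cluster; the volume name "testvol" and snapshot name "snap1" are assumptions,
not taken from the report):

```shell
# Step 1 (assumed pre-existing): volume "testvol" with USS enabled
# and an activated snapshot.
gluster snapshot create snap1 testvol no-timestamp
gluster snapshot activate snap1

# Step 2: enable brick multiplexing cluster-wide.
gluster volume set all cluster.brick-multiplex enable

# Steps 3-4: per the report, these time out after step 2.
gluster snapshot status
gluster volume status
```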

Actual results:
Error : Request timed out

Expected results:
Should display the status

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-04-18
10:02:11 EDT ---

This bug is automatically being proposed for the current release of Red Hat
Gluster Storage 3 under active development, by setting the release flag
'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Vivek Das on 2017-04-18 10:29:37 EDT ---

Logs : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1443123

--- Additional comment from Atin Mukherjee on 2017-04-18 10:45:42 EDT ---

I think you are hitting BZ 1441946. Can you check if you are seeing a stale
glusterd lock entry in the glusterd log file? If so, that confirms this issue
is the same.

--- Additional comment from Atin Mukherjee on 2017-04-18 23:56:31 EDT ---

(In reply to Atin Mukherjee from comment #3)
> I think you are hitting BZ 1441946. Can you check if you are seeing a stale
> glusterd lock entry in the glusterd log file? If so, that confirms this issue
> is the same.

Doesn't look like the same issue. On one of the nodes,
dhcp43-155.lab.eng.blr.redhat.com, the backtrace of glusterd shows the process
is hung:

Thread 8 (Thread 0x7fe84aff6700 (LWP 1561)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe8539802e5 in gf_timer_proc () from /lib64/libglusterfs.so.0
#4  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe84a7f5700 (LWP 1562)):
#0  0x00007fe8527da101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fe853e69ebb in glusterfs_sigwaiter ()
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fe849ff4700 (LWP 1563)):
#0  0x00007fe8520de66d in nanosleep () from /lib64/libc.so.6
#1  0x00007fe8520de504 in sleep () from /lib64/libc.so.6
#2  0x00007fe85399982d in pool_sweeper () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe8497f3700 (LWP 1564)):
#0  0x00007fe8527d91bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fe8527d4d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fe8527d4c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe853980805 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#4  0x00007fe8539765a3 in gf_log_disable_suppression_before_exit () from
/lib64/libglusterfs.so.0
#5  0x00007fe85397c8e5 in gf_print_trace () from /lib64/libglusterfs.so.0
#6  <signal handler called>
#7  0x00007fe8539807a6 in gf_timer_call_cancel () from /lib64/libglusterfs.so.0
#8  0x00007fe8484c2ac3 in glusterd_volume_start_glusterfs () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#9  0x00007fe8484c54cf in glusterd_brick_start () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#10 0x00007fe8484c5a4d in glusterd_restart_bricks () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#11 0x00007fe8484d8806 in glusterd_spawn_daemons () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#12 0x00007fe8539a9362 in synctask_wrap () from /lib64/libglusterfs.so.0
#13 0x00007fe852066cf0 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7fe848ff2700 (LWP 1565)):
#0  0x00007fe8527d6a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007fe8539ab898 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac6e0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe843f54700 (LWP 1819)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007fe84854bc43 in hooks_worker () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#2  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe843753700 (LWP 1820)):
#0  0x00007fe8527d66d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x00007fe8539a945b in __synclock_lock () from /lib64/libglusterfs.so.0
#2  0x00007fe8539ac996 in synclock_lock () from /lib64/libglusterfs.so.0
#3  0x00007fe848497c2d in glusterd_big_locked_notify () from
/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so
#4  0x00007fe85373cb84 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#5  0x00007fe8537389f3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#6  0x00007fe8459031e7 in socket_connect_finish () from
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#7  0x00007fe845907848 in socket_event_handler () from
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#8  0x00007fe8539cce50 in event_dispatch_epoll_worker () from
/lib64/libglusterfs.so.0
#9  0x00007fe8527d2dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fe85211773d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe853e4b780 (LWP 1559)):
#0  0x00007fe8527d3ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fe8539cd2e0 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00007fe853e66d95 in main ()

--- Additional comment from Atin Mukherjee on 2017-04-19 07:14:18 EDT ---

The above hang was caused by a node reboot. The following is the gluster
volume info output, for reference.

[root at dhcp43-99 ~]# gluster v info

Volume Name: benbhai
Type: Distributed-Replicate
Volume ID: 3b0cb05f-629e-435b-998c-a0f5870e888a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick2/benbhai_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick1/benbhai_brick3
Options Reconfigured:
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: ctdb
Type: Replicate
Volume ID: 195b67be-d9af-4b2a-9c6c-b17a088a1921
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.99:/bricks/brick4/ctdb
Brick2: 10.70.43.155:/bricks/brick4/ctdb
Brick3: 10.70.42.240:/bricks/brick4/ctdb
Brick4: 10.70.43.101:/bricks/brick4/ctdb
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: fashion
Type: Distributed-Replicate
Volume ID: 3022ed91-5646-4c4e-a173-ad84cfb2556a
Status: Started
Snapshot Count: 3
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick3/fashion_brick2
Brick4: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick2/fashion_brick3
Options Reconfigured:
performance.parallel-readdir: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
performance.cache-samba-metadata: on
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs.log-level: DEFAULT
cluster.brick-multiplex: enable

Volume Name: samba-arbitor
Type: Distributed-Replicate
Volume ID: dfb5e983-642c-4abd-8bb4-57356e31f982
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: dhcp43-99.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick0
Brick2: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick1
Brick3: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick2
(arbiter)
Brick4: dhcp42-240.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick2
Brick5: dhcp43-101.lab.eng.blr.redhat.com:/bricks/brick0/samba-arbitor_brick3
Brick6: dhcp43-155.lab.eng.blr.redhat.com:/bricks/brick1/samba-arbitor_brick1
(arbiter)
Options Reconfigured:
storage.batch-fsync-delay-usec: 0
performance.stat-prefetch: on
server.allow-insecure: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
cluster.brick-multiplex: enable

Volume Name: tmpvol
Type: Distribute
Volume ID: 255e41a5-c9a2-466c-9f56-4f126957147e
Status: Stopped
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: dhcp43-155.lab.eng.blr.redhat.com:/bricks/tmp
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

--- Additional comment from Atin Mukherjee on 2017-04-19 08:49:02 EDT ---

Looks similar to BZ 1421721

--- Additional comment from Worker Ant on 2017-04-20 04:29:28 EDT ---

REVIEW: https://review.gluster.org/17088 (glusterd: set conn->reconnect to null
on timer cancellation) posted (#1) for review on master by Atin Mukherjee
(amukherj at redhat.com)

--- Additional comment from Worker Ant on 2017-04-20 11:14:20 EDT ---

COMMIT: https://review.gluster.org/17088 committed in master by Jeff Darcy
(jeff at pl.atyp.us) 
------
commit 98dc1f08c114adea1f4133c12dff0d4c3d75b30d
Author: Atin Mukherjee <amukherj at redhat.com>
Date:   Thu Apr 20 13:57:27 2017 +0530

    glusterd: set conn->reconnect to null on timer cancellation

    Change-Id: Ic48e6652f431daeb0db027660f6c9de16d893f08
    BUG: 1443896
    Signed-off-by: Atin Mukherjee <amukherj at redhat.com>
    Reviewed-on: https://review.gluster.org/17088
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1443123
[Bug 1443123] [BrickMultiplex] gluster command not responding and .snaps
directory is not visible after executing snapshot related command
https://bugzilla.redhat.com/show_bug.cgi?id=1443896
[Bug 1443896] [BrickMultiplex] gluster command not responding and .snaps
directory is not visible after executing snapshot related command