[Bugs] [Bug 1353561] New: Multiple bricks could crash after TCP port probing

bugzilla at redhat.com bugzilla at redhat.com
Thu Jul 7 13:03:02 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1353561

            Bug ID: 1353561
           Summary: Multiple bricks could crash after TCP port probing
           Product: GlusterFS
           Version: 3.7.12
         Component: core
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: oleksandr at natalenko.name
                CC: bugs at gluster.org



Created attachment 1177299
  --> https://bugzilla.redhat.com/attachment.cgi?id=1177299&action=edit
"thread apply all backtrace" output for 0% CPU usage

Description of problem:

Given a distributed-replicated volume (we did not test other layouts), multiple
brick processes can crash under load while their TCP ports are being probed.

Version-Release number of selected component (if applicable):

CentOS 7.2, GlusterFS 3.7.12 with the following patches applied:

===
Jiffin Tony Thottan (1):
      gfapi : check the value "iovec" in glfs_io_async_cbk only for read

Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Raghavendra Talur (1):
      gfapi: update count when glfs_buf_copy is used

Ravishankar N (1):
      afr:Don't wind reads for files in metadata split-brain

Soumya Koduri (1):
      gfapi/handleops: Avoid using glfd during create
===

How reproducible:

Reliably (see below).

Steps to Reproduce:

All the actions below were performed on one node. The other node in the replica
was not used (except for maintaining the replica itself), and the bricks there
did not crash.

1. create a distributed-replicated (or, we suspect, any other) volume and start
it;
2. mount the volume on some client via FUSE;
3. find out which TCP ports are used by the volume's bricks on the host where
the crash is to be triggered (see the sketch after this list);
4. start nmap'ing those ports in a loop: "while true; do nmap -Pn -p49163-49167
127.0.0.1; done";
5. start generating some workload on the volume (we wrote lots of zero files,
stat'ed them and removed them in parallel; see the sketch after this list);
6. ...wait...
7. observe one or multiple bricks crash on the node where the TCP-probed bricks
are running.
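
For reference, steps 3 and 5 look roughly like the sketch below. The volume
name "testvol", the mount point "/mnt/testvol" and the file size are
placeholders, not the exact commands we used:

===
# Step 3: list brick TCP ports (volume name "testvol" is a placeholder);
# read the ports for this node's bricks from the "TCP Port" column.
gluster volume status testvol

# Step 5: rough approximation of the workload; we ran several such loops
# in parallel. Paths and sizes are placeholders.
while true; do
    for i in $(seq 1 1000); do
        dd if=/dev/zero of=/mnt/testvol/file.$$.$i bs=4k count=1 2>/dev/null
        stat /mnt/testvol/file.$$.$i > /dev/null
        rm -f /mnt/testvol/file.$$.$i
    done
done
===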

Actual results:

Two variants:

1. a brick could crash and generate a core file;
2. a brick could hang, consuming 0% or 100% of CPU time.

Expected results:

Bricks should not crash, of course :).

Additional info:

If a brick crashes and generates a core file, gdb gives us the following stack trace:

===
#0  0x00007fc6cb66ebd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fc6cc80755d in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:256
#2  0x00007fc6cc8076f7 in gf_log_disable_suppression_before_exit (ctx=0x7fc6cdc51010) at logging.c:428
#3  0x00007fc6cc829775 in gf_print_trace (signum=11, ctx=0x7fc6cdc51010) at common-utils.c:579
#4  <signal handler called>
#5  0x00007fc6cc82d149 in __inode_ctx_free (inode=inode@entry=0x7fc6b4050c74) at inode.c:336
#6  0x00007fc6cc82e1c7 in __inode_destroy (inode=0x7fc6b4050c74) at inode.c:358
#7  inode_table_prune (table=table@entry=0x7fc6b80cea00) at inode.c:1540
#8  0x00007fc6cc82e454 in inode_unref (inode=0x7fc6b4050c74) at inode.c:529
#9  0x00007fc6cc841354 in fd_destroy (bound=_gf_true, fd=0x7fc6b80cee20) at fd.c:537
#10 fd_unref (fd=0x7fc6b80cee20) at fd.c:573
#11 0x00007fc6b7ddd397 in server3_3_releasedir (req=0x7fc6b6058190) at server-rpc-fops.c:4072
#12 0x00007fc6cc5cc6ab in rpcsvc_handle_rpc_call (svc=0x7fc6b8030080, trans=trans@entry=0x7fc6b80e9f10, msg=0x7fc6b80f44f0) at rpcsvc.c:705
#13 0x00007fc6cc5cc87b in rpcsvc_notify (trans=0x7fc6b80e9f10, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:799
#14 0x00007fc6cc5ce7c3 in rpc_transport_notify (this=this@entry=0x7fc6b80e9f10, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fc6b80f44f0) at rpc-transport.c:546
#15 0x00007fc6c14959b4 in socket_event_poll_in (this=this@entry=0x7fc6b80e9f10) at socket.c:2353
#16 0x00007fc6c14985f4 in socket_event_handler (fd=fd@entry=15, idx=idx@entry=6, data=0x7fc6b80e9f10, poll_in=1, poll_out=0, poll_err=0) at socket.c:2466
#17 0x00007fc6cc872e6a in event_dispatch_epoll_handler (event=0x7fc6bf7bae80, event_pool=0x7fc6cdc70290) at event-epoll.c:575
#18 event_dispatch_epoll_worker (data=0x7fc6cdcbcd70) at event-epoll.c:678
#19 0x00007fc6cb66cdc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fc6cafb1ced in clone () from /lib64/libc.so.6
===

Additionally, we attach the corresponding core file for the stack trace above.
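
The stack trace above can be regenerated from that core roughly like this; the
brick binary path and core file name are placeholders:

===
# Paths are placeholders; point gdb at the actual brick binary and core file.
gdb /usr/sbin/glusterfsd /path/to/core -batch -ex "bt"
===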

If a brick hangs consuming 0% of CPU time, we attached to the brick process
using gdb and collected the stack traces of all threads (see the attached
"all_threads_stacktrace.log.xz" file).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
