[Bugs] [Bug 1353529] New: Multiple bricks could crash after invoking status command
bugzilla at redhat.com
Thu Jul 7 11:41:46 UTC 2016
https://bugzilla.redhat.com/show_bug.cgi?id=1353529
Bug ID: 1353529
Summary: Multiple bricks could crash after invoking status
command
Product: GlusterFS
Version: 3.7.12
Component: core
Severity: medium
Assignee: bugs at gluster.org
Reporter: oleksandr at natalenko.name
CC: bugs at gluster.org
Created attachment 1177253
--> https://bugzilla.redhat.com/attachment.cgi?id=1177253&action=edit
"thread apply all backtrace" output for 100% CPU usage
Description of problem:
On a distributed-replicated volume (we did not test other layouts), multiple
brick processes can crash under load while the "volume status clients" command
is being run and the brick ports are being probed.
Version-Release number of selected component (if applicable):
CentOS 7.2, GlusterFS 3.7.12 with the following patches:
===
Jiffin Tony Thottan (1):
      gfapi : check the value "iovec" in glfs_io_async_cbk only for read

Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Raghavendra Talur (1):
      gfapi: update count when glfs_buf_copy is used

Ravishankar N (1):
      afr:Don't wind reads for files in metadata split-brain

Soumya Koduri (1):
      gfapi/handleops: Avoid using glfd during create
===
How reproducible:
Reliably (see below).
Steps to Reproduce:
All the actions below were performed on one node. The other node in the replica
was not used (except to maintain the replica itself), and the bricks there did
not crash.
1. create a distributed-replicated (or, we suspect, any other) volume and start
it;
2. mount the volume on some client via FUSE;
3. find out which TCP ports are used by the volume on the host where the crash
is to be triggered;
4. start nmap'ing those ports in a loop: "while true; do nmap -Pn -p49163-49167
127.0.0.1; done";
5. start invoking the status command in a loop: "while true; do sudo gluster
volume status test; sudo gluster volume status test clients; done";
6. start generating some workload on the volume (we wrote lots of zero files
and ran stat on them in parallel);
7. ...wait...
8. observe one or more bricks crash on the node where the status command is
run.
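Steps 4 to 6 can be combined into one script. The following is only a rough
sketch, not the exact script we ran: the volume name, port range and address
are the ones quoted in the steps, while MOUNT and NFILES are hypothetical
placeholders, and the workload (create empty files, then stat them one by one)
is a simplified guess at "zero files" that does not stat in parallel.

```shell
#!/bin/sh
# Sketch of reproduction steps 4-6. VOLUME, PORTS and HOST mirror the values
# quoted in the steps; MOUNT and NFILES are hypothetical placeholders.
VOLUME=${VOLUME:-test}
PORTS=${PORTS:-49163-49167}
HOST=${HOST:-127.0.0.1}
MOUNT=${MOUNT:-/mnt/test}     # hypothetical FUSE mount point (step 2)
NFILES=${NFILES:-100000}      # hypothetical workload size

# Step 4: probe the brick ports in a loop.
scan_loop() {
    while true; do
        nmap -Pn -p"$PORTS" "$HOST" >/dev/null
    done
}

# Step 5: query volume status and client status in a loop.
status_loop() {
    while true; do
        sudo gluster volume status "$VOLUME"
        sudo gluster volume status "$VOLUME" clients
    done >/dev/null
}

# Step 6 (assumed workload): create empty files and stat each one.
workload() {
    i=0
    while [ "$i" -lt "$NFILES" ]; do
        : > "$MOUNT/zero.$i" && stat "$MOUNT/zero.$i" >/dev/null
        i=$((i + 1))
    done
}

# Start everything only when explicitly asked to: "sh repro.sh run".
if [ "${1:-}" = "run" ]; then
    # Stop the background loops when the script is interrupted.
    trap 'trap - TERM; kill 0' INT TERM
    scan_loop &
    status_loop &
    workload
    wait    # step 7: keep the loops running until interrupted
fi
```

Without the "run" argument the script only defines the functions, so it can be
sourced and the loops started individually in separate terminals.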
Actual results:
Two variants:
1. the brick crashes and generates a core file;
2. the brick hangs, consuming 100% of CPU time.
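For the second variant, backtraces of all threads can be captured from the
live brick process with gdb in batch mode. The helper below is a hypothetical
sketch: the function name and the default output file name are illustrative,
and the PID is assumed to be that of the hung brick process (it is listed in
the "gluster volume status" output).

```shell
# dump_threads PID [OUTFILE]
# Hypothetical helper: attach gdb to a running process, write the backtrace
# of every thread to OUTFILE, then detach.
dump_threads() {
    pid=${1:-}
    out=${2:-all_threads_stacktrace.log}
    case $pid in
        ''|*[!0-9]*)
            # Reject a missing or non-numeric PID before calling gdb.
            echo "usage: dump_threads PID [OUTFILE]" >&2
            return 2
            ;;
    esac
    gdb -p "$pid" -batch -ex "thread apply all backtrace" > "$out" 2>&1
}
```

Attaching usually requires root, and the process is paused while gdb collects
the backtraces.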
Expected results:
Do not crash, of course :).
Additional info:
If a brick crashes and generates a core file, gdb gives the following stack
trace:
===
#0  0x00007fefa9f1cda1 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007fefab7d8465 in str_to_data (value=value@entry=0x66726574737562eb
    <Address 0x66726574737562eb out of bounds>) at dict.c:904
#2  0x00007fefab7d9e16 in dict_set_str (this=this@entry=0x7fefababa048,
    key=key@entry=0x7fef8c225280 "client2896.hostname",
    str=str@entry=0x66726574737562eb <Address 0x66726574737562eb out of bounds>)
    at dict.c:2224
#3  0x00007fef96d2e244 in server_priv_to_dict (this=<optimized out>,
    dict=0x7fefababa048) at server.c:262
#4  0x00007fefabcc311a in glusterfs_handle_brick_status (req=0x7fefad9942dc)
    at glusterfsd-mgmt.c:890
#5  0x00007fefab82d4a2 in synctask_wrap (old_task=<optimized out>)
    at syncop.c:380
#6  0x00007fefa9edd110 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
===
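The out-of-bounds pointer passed to dict_set_str (0x66726574737562eb) may be
worth a closer look. Read as little-endian bytes, as stored in memory on
x86_64, it consists almost entirely of printable ASCII, which often indicates
that the memory holding the pointer was overwritten with string data or was
freed and reused. This is only a guess, not a confirmed root cause. The small
helper below (decode_ptr is a hypothetical name) shows the decoding:

```shell
# decode_ptr HEX
# Print the bytes of a pointer value in little-endian (memory) order,
# replacing every non-printable byte with '.'.
decode_ptr() {
    hex=${1#0x}
    # Pad to an even number of hex digits so the value splits into bytes.
    if [ $(( ${#hex} % 2 )) -eq 1 ]; then
        hex=0$hex
    fi
    out=
    while [ -n "$hex" ]; do
        rest=${hex%??}           # everything except the last byte
        byte=${hex#"$rest"}      # the last byte, i.e. the lowest address first
        hex=$rest
        dec=$(printf '%d' "0x$byte")
        if [ "$dec" -ge 32 ] && [ "$dec" -le 126 ]; then
            # Printable ASCII: emit the character via its octal escape.
            out=$out$(printf "\\$(printf '%03o' "$dec")")
        else
            out=$out.
        fi
    done
    printf '%s\n' "$out"
}

decode_ptr 0x66726574737562eb    # prints ".busterf"
```

Seven of the eight bytes are lowercase letters, which is very unlikely for a
valid heap address.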
Additionally, we attach two compressed cores for the stacktrace above.
When a brick hung consuming 100% of CPU time, we attached to the brick process
with gdb and collected stack traces of all threads (see the attached
"all_threads_stacktrace.log.xz" file).
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.