[Bugs] [Bug 1442535] New: [GSS] Multiple bricks WILL crash after TCP port probing

bugzilla at redhat.com bugzilla at redhat.com
Sat Apr 15 05:52:58 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1442535

            Bug ID: 1442535
           Summary: [GSS] Multiple bricks WILL crash after TCP port
                    probing
           Product: Red Hat Gluster Storage
           Version: 3.2
         Component: rpc
          Keywords: Triaged
          Severity: high
          Assignee: mchangir at redhat.com
          Reporter: olim at redhat.com
        QA Contact: rhinduja at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    kaushal at redhat.com, khiremat at redhat.com,
                    ndevos at redhat.com, oleksandr at natalenko.name,
                    pkarampu at redhat.com, rcyriac at redhat.com,
                    rgowdapp at redhat.com, rhs-bugs at redhat.com,
                    skoduri at redhat.com, skyler.vock at eschat.com
        Depends On: 1438966
             Group: redhat



+++ This bug was initially created as a clone of Bug #1438966 +++

+++ This bug was initially created as a clone of Bug #1353561 +++

Description of problem:

Given a replicated volume (we didn't test other layouts), multiple brick
processes can crash under load while their TCP ports are being probed.

Version-Release number of selected component (if applicable):

CentOS 6.8, GlusterFS 3.9.0

Options Reconfigured:
performance.readdir-ahead: on
network.ping-timeout: 5
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.server-quorum-ratio: 51
nfs.disable: on
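
For reference, options like these are applied per volume with the usual
gluster CLI; "myvol" below is a placeholder volume name:

===
gluster volume set myvol network.ping-timeout 5
gluster volume set myvol cluster.quorum-type auto
# cluster.server-quorum-ratio is cluster-wide, hence set on "all":
gluster volume set all cluster.server-quorum-ratio 51
===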

How reproducible:

Reliable (see below).

Steps to Reproduce:

1. create a replicated (or, we suspect, any other) volume and start it;
2. mount the volume on some client via FUSE;
3. find out which TCP ports are used by the volume on one of the hosts where
the crash is to be triggered;
4. start nmap'ing those ports in a loop: "while true; do nmap -Pn -p49150-49160
127.0.0.1; done";
5. start generating some workload on the volume (we wrote lots of zero-filled
files, stat'ed them and removed them in parallel; see the sketch below);
6. wait (the timeframe is very random; for us it was about an hour);
7. observe one or more bricks crash on the node where the TCP-probed bricks
are running.
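
A minimal sketch of the workload generator from step 5 (the mountpoint, file
counts and sizes are illustrative, not our exact script):

===
#!/bin/bash
# create, stat and remove zero-filled files in parallel;
# "/mnt/gv0" is a placeholder for the FUSE mountpoint of the volume
MNT=/mnt/gv0
worker() {
    local dir="$MNT/load.$1"
    mkdir -p "$dir"
    while true; do
        for i in $(seq 1 100); do
            dd if=/dev/zero of="$dir/file.$i" bs=4k count=16 2>/dev/null
        done
        stat "$dir"/file.* > /dev/null
        rm -f "$dir"/file.*
    done
}
for w in 1 2 3 4; do worker "$w" & done
wait
===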

Actual results:

Two variants:

1. the brick crashes and generates a core file;
2. the brick hangs, consuming 0% or 100% of CPU time.

Expected results:

Do not crash, of course :).

Additional info:

If the brick crashes and generates a core file, gdb gives us the following
stacktrace:

RECENT (04-04-17)
===
#0  __pthread_mutex_lock (mutex=0x340) at pthread_mutex_lock.c:50
#1  0x00007fe5ea84bb4e in gf_log_set_log_buf_size (buf_size=0) at logging.c:252
#2  0x00007fe5ea84bcef in gf_log_disable_suppression_before_exit (ctx=0xc73010)
at logging.c:424
#3  0x00007fe5ea852a65 in gf_print_trace (signum=11, ctx=0xc73010) at
common-utils.c:715
#4  <signal handler called>
#5  0x00007fe5ea855d5d in __inode_ctx_free (inode=0x7fe5bcd057ac) at
inode.c:331
#6  0x00007fe5ea8577a5 in __inode_destroy (table=<value optimized out>) at
inode.c:353
#7  inode_table_prune (table=<value optimized out>) at inode.c:1540
#8  0x00007fe5ea857dfc in inode_unref (inode=0x7fe5bcdd4fc8) at inode.c:524
#9  0x00007fe5ea844eb2 in loc_wipe (loc=0x7fe5e8f083c0) at xlator.c:695
#10 0x00007fe5ea8bae72 in args_wipe (args=0x7fe5e8f083c0) at
default-args.c:1579
#11 0x00007fe5ea862781 in call_stub_wipe_args (stub=0x7fe5e8f08370) at
call-stub.c:2466
#12 call_stub_destroy (stub=0x7fe5e8f08370) at call-stub.c:2482
#13 0x00007fe5dd72551b in iot_worker (data=0x7fe5d8069f20) at io-threads.c:222
#14 0x00000032ba007aa1 in start_thread (arg=0x7fe5a7fff700) at
pthread_create.c:301
#15 0x00000032b9ce8aad in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:115



PREVIOUS BUG:

===
#0  0x00007fc6cb66ebd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fc6cc80755d in gf_log_set_log_buf_size (buf_size=buf_size at entry=0)
at logging.c:256
#2  0x00007fc6cc8076f7 in gf_log_disable_suppression_before_exit
(ctx=0x7fc6cdc51010) at logging.c:428
#3  0x00007fc6cc829775 in gf_print_trace (signum=11, ctx=0x7fc6cdc51010) at
common-utils.c:579
#4  <signal handler called>
#5  0x00007fc6cc82d149 in __inode_ctx_free (inode=inode at entry=0x7fc6b4050c74)
at inode.c:336
#6  0x00007fc6cc82e1c7 in __inode_destroy (inode=0x7fc6b4050c74) at inode.c:358
#7  inode_table_prune (table=table at entry=0x7fc6b80cea00) at inode.c:1540
#8  0x00007fc6cc82e454 in inode_unref (inode=0x7fc6b4050c74) at inode.c:529
#9  0x00007fc6cc841354 in fd_destroy (bound=_gf_true, fd=0x7fc6b80cee20) at
fd.c:537
#10 fd_unref (fd=0x7fc6b80cee20) at fd.c:573
#11 0x00007fc6b7ddd397 in server3_3_releasedir (req=0x7fc6b6058190) at
server-rpc-fops.c:4072
#12 0x00007fc6cc5cc6ab in rpcsvc_handle_rpc_call (svc=0x7fc6b8030080,
trans=trans at entry=0x7fc6b80e9f10, msg=0x7fc6b80f44f0) at rpcsvc.c:705
#13 0x00007fc6cc5cc87b in rpcsvc_notify (trans=0x7fc6b80e9f10,
mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at
rpcsvc.c:799
#14 0x00007fc6cc5ce7c3 in rpc_transport_notify (this=this at entry=0x7fc6b80e9f10,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7fc6b80f44f0)
at rpc-transport.c:546
#15 0x00007fc6c14959b4 in socket_event_poll_in (this=this at entry=0x7fc6b80e9f10)
at socket.c:2353
#16 0x00007fc6c14985f4 in socket_event_handler (fd=fd at entry=15,
idx=idx at entry=6, data=0x7fc6b80e9f10, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2466
#17 0x00007fc6cc872e6a in event_dispatch_epoll_handler (event=0x7fc6bf7bae80,
event_pool=0x7fc6cdc70290) at event-epoll.c:575
#18 event_dispatch_epoll_worker (data=0x7fc6cdcbcd70) at event-epoll.c:678
#19 0x00007fc6cb66cdc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fc6cafb1ced in clone () from /lib64/libc.so.6
===

Additionally, we attach the corresponding core file for the stacktrace above.

For the case where the brick hangs consuming 0% of CPU time, we attached to
the brick process using gdb and captured stacktraces of all threads (see the
attached "all_threads_stacktrace.log.xz" file).

--- Additional comment from Oleksandr Natalenko on 2016-07-07 09:03 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-07-08 05:08:45 EDT ---

Reverting http://review.gluster.org/#/c/13658 and
http://review.gluster.org/#/c/14739 as suggested by Soumya Koduri did not help
— I've got exactly the same crash with the same stacktrace.

--- Additional comment from Oleksandr Natalenko on 2016-07-19 03:40:46 EDT ---

Here is another bunch of cores I got during extra tests.

=== core.27675 ===
#0  0x00007fb97754abd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fb9786e36fd in gf_log_set_log_buf_size (buf_size=buf_size at entry=0)
at logging.c:256
#2  0x00007fb9786e3897 in gf_log_disable_suppression_before_exit
(ctx=0x7fb97a3f1010) at logging.c:428
#3  0x00007fb978705915 in gf_print_trace (signum=11, ctx=0x7fb97a3f1010) at
common-utils.c:579
#4  <signal handler called>
#5  0x00007fb978709449 in __inode_ctx_free (inode=inode at entry=0x7fb933ed16d8)
at inode.c:336
#6  0x00007fb97870a4c7 in __inode_destroy (inode=0x7fb933ed16d8) at inode.c:358
#7  inode_table_prune (table=table at entry=0x7fb9640ce160) at inode.c:1540
#8  0x00007fb97870a754 in inode_unref (inode=0x7fb933ed16d8) at inode.c:529
#9  0x00007fb9786de152 in loc_wipe (loc=loc at entry=0x7fb975c91e14) at
xlator.c:695
#10 0x00007fb97871c794 in call_stub_wipe_args (stub=0x7fb975c91dd4) at
call-stub.c:2511
#11 call_stub_destroy (stub=0x7fb975c91dd4) at call-stub.c:2550
#12 0x00007fb968eb4363 in iot_worker (data=0x7fb964055bf0) at io-threads.c:215
#13 0x00007fb977548dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fb976e8dced in clone () from /lib64/libc.so.6
===

=== core.8362 ===
#0  0x00007f8613718bdd in __gf_free (free_ptr=0x7f86000ebbd0) at mem-pool.c:313
#1  0x00007f85fef6ffa5 in io_stats_release (this=0x7f860001dc70,
fd=0x7f85f8003d14) at io-stats.c:2540
#2  0x00007f86137166c7 in fd_destroy (bound=_gf_true, fd=0x7f85f8003d14) at
fd.c:520
#3  fd_unref (fd=0x7f85f8003d14) at fd.c:573
#4  0x00007f85fed3f227 in server3_3_release (req=0x7f85fdfba06c) at
server-rpc-fops.c:4030
#5  0x00007f86134a16ab in rpcsvc_handle_rpc_call (svc=0x7f860002f8b0,
trans=trans at entry=0x7f86000d9d80, msg=0x7f85f8038120) at rpcsvc.c:705
#6  0x00007f86134a187b in rpcsvc_notify (trans=0x7f86000d9d80,
mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at
rpcsvc.c:799
#7  0x00007f86134a37c3 in rpc_transport_notify (this=this at entry=0x7f86000d9d80,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f85f8038120)
at rpc-transport.c:546
#8  0x00007f860836a654 in socket_event_poll_in (this=this at entry=0x7f86000d9d80)
at socket.c:2245
#9  0x00007f860836d294 in socket_event_handler (fd=fd at entry=20,
idx=idx at entry=9, data=0x7f86000d9d80, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2358
#10 0x00007f86137481ca in event_dispatch_epoll_handler (event=0x7f85fed21e80,
event_pool=0x7f8614d02290) at event-epoll.c:575
#11 event_dispatch_epoll_worker (data=0x7f86000205d0) at event-epoll.c:678
#12 0x00007f8612541dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f8611e86ced in clone () from /lib64/libc.so.6
===

=== core.8420 ===
#0  0x00007f1404043fa8 in ?? ()
#1  0x00007f141f83c7c3 in rpc_transport_notify (this=this at entry=0x7f14040434f0,
event=event at entry=RPC_TRANSPORT_DISCONNECT, data=data at entry=0x7f14040434f0) at
rpc-transport.c:546
#2  0x00007f1414706242 in socket_event_poll_err (this=0x7f14040434f0) at
socket.c:1158
#3  socket_event_handler (fd=fd at entry=25, idx=idx at entry=10,
data=0x7f14040434f0, poll_in=1, poll_out=0, poll_err=<optimized out>) at
socket.c:2365
#4  0x00007f141fae11ca in event_dispatch_epoll_handler (event=0x7f140b150e80,
event_pool=0x7f1421eb6290) at event-epoll.c:575
#5  event_dispatch_epoll_worker (data=0x7f140c0205d0) at event-epoll.c:678
#6  0x00007f141e8dadc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f141e21fced in clone () from /lib64/libc.so.6
===

=== core.8439 ===
#0  0x00007f3a7b79bc10 in list_del_init (old=0x7f3a74027e18) at
../../../../libglusterfs/src/list.h:88
#1  server_rpc_notify (rpc=<optimized out>, xl=0x7f3a7c01f1e0, event=<optimized
out>, data=0x7f3a74027360) at server.c:542
#2  0x00007f3a8ffec4bf in rpcsvc_handle_disconnect (svc=0x7f3a7c02f8b0,
trans=trans at entry=0x7f3a74027360) at rpcsvc.c:756
#3  0x00007f3a8ffee830 in rpcsvc_notify (trans=0x7f3a74027360,
mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at
rpcsvc.c:794
#4  0x00007f3a8fff07c3 in rpc_transport_notify (this=this at entry=0x7f3a74027360,
event=event at entry=RPC_TRANSPORT_DISCONNECT, data=data at entry=0x7f3a74027360) at
rpc-transport.c:546
#5  0x00007f3a84eba242 in socket_event_poll_err (this=0x7f3a74027360) at
socket.c:1158
#6  socket_event_handler (fd=fd at entry=27, idx=idx at entry=10,
data=0x7f3a74027360, poll_in=1, poll_out=0, poll_err=<optimized out>) at
socket.c:2365
#7  0x00007f3a902951ca in event_dispatch_epoll_handler (event=0x7f3a831dce80,
event_pool=0x7f3a90eb6290) at event-epoll.c:575
#8  event_dispatch_epoll_worker (data=0x7f3a90f02d70) at event-epoll.c:678
#9  0x00007f3a8f08edc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3a8e9d3ced in clone () from /lib64/libc.so.6
===

Attaching xz-compressed core files to this ticket.

Please note that these cores were obtained running v3.7.13 with the following
patches applied:

===
Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files
.../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Mohit Agrawal (1):
      rpc/socket.c: Modify approach to cleanup threads of socket_poller in
socket_spawn.

N Balachandran (1):
      rpc/socket: pthread resources are not cleaned up

Niels de Vos (1):
      rpc: invalid argument when function setsockopt sets option
TCP_USER_TIMEOUT

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks
===

Also, please note that core.27675 is the coredump of a brick of a volume with
the following options:

===
nfs.disable: off
network.ping-timeout: 10
storage.linux-aio: on
server.event-threads: 4
performance.write-behind: on
performance.write-behind-window-size: 4194304
performance.stat-prefetch: on
performance.read-ahead: on
performance.quick-read: on
performance.io-thread-count: 2
performance.flush-behind: on
performance.client-io-threads: off
performance.cache-size: 33554432
performance.cache-max-file-size: 1048576
network.inode-lru-limit: 4096
features.cache-invalidation: on
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off
cluster.readdir-optimize: on
cluster.lookup-optimize: on
client.event-threads: 2
performance.readdir-ahead: on
===

The rest of the cores were obtained from bricks of a volume with default
options.

--- Additional comment from Oleksandr Natalenko on 2016-07-19 03:41 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-07-19 03:42 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-07-19 03:43 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-07-19 03:43 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-04 08:32:34 EDT ---

A new bunch of cores, obtained after re-testing with 3.7.14 with the following
patches applied:

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc : build_prog_details should iterate program list inside critical
section

Jiffin Tony Thottan (2):
      gfapi : Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash : append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

core.25072:

===
#0  0x00007f0e0d0875f7 in raise () from /lib64/libc.so.6
#1  0x00007f0e0d088ce8 in abort () from /lib64/libc.so.6
#2  0x00007f0e0d0c7327 in __libc_message () from /lib64/libc.so.6
#3  0x00007f0e0d0cf053 in _int_free () from /lib64/libc.so.6
#4  0x00007f0e0e993b29 in dict_destroy (this=0x7f0e0ec76e18) at dict.c:565
#5  0x00007f0e0e9d7974 in call_stub_wipe_args (stub=0x7f0e0bf4b06c) at
call-stub.c:2533
#6  call_stub_destroy (stub=0x7f0e0bf4b06c) at call-stub.c:2550
#7  0x00007f0dfb166363 in iot_worker (data=0x7f0dfc055de0) at io-threads.c:215
#8  0x00007f0e0d803dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0e0d148ced in clone () from /lib64/libc.so.6
===

core.25078:

===
#0  0x00007f336bbe55f7 in raise () from /lib64/libc.so.6
#1  0x00007f336bbe6ce8 in abort () from /lib64/libc.so.6
#2  0x00007f336bc25327 in __libc_message () from /lib64/libc.so.6
#3  0x00007f336bc2d053 in _int_free () from /lib64/libc.so.6
#4  0x00007f336219080f in fini (this=0x7f334400d7d0) at socket.c:4215
#5  0x00007f336d2c35cd in rpc_transport_destroy
(this=this at entry=0x7f334400d7d0) at rpc-transport.c:471
#6  0x00007f336d2c3767 in rpc_transport_unref (this=this at entry=0x7f334400d7d0)
at rpc-transport.c:529
#7  0x00007f336218d24a in socket_event_handler (fd=fd at entry=20,
idx=idx at entry=8, data=0x7f334400d7d0, poll_in=1, poll_out=0,
poll_err=<optimized out>) at socket.c:2366
#8  0x00007f336d56832a in event_dispatch_epoll_handler (event=0x7f335cbf4e80,
event_pool=0x7f336dfb4300) at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7f3358020870) at event-epoll.c:678
#10 0x00007f336c361dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f336bca6ced in clone () from /lib64/libc.so.6
===

core.6484 (similar to the ones introduced previously):

===
#0  0x00007fcec4c5fbd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fcec5df885d in gf_log_set_log_buf_size (buf_size=buf_size at entry=0)
at logging.c:256
#2  0x00007fcec5df89f7 in gf_log_disable_suppression_before_exit
(ctx=0x7fcec7b60010) at logging.c:428
#3  0x00007fcec5e1aa75 in gf_print_trace (signum=11, ctx=0x7fcec7b60010) at
common-utils.c:579
#4  <signal handler called>
#5  0x00007fcec5e1e5a9 in __inode_ctx_free (inode=inode at entry=0x7fce8d659ac4)
at inode.c:336
#6  0x00007fcec5e1f627 in __inode_destroy (inode=0x7fce8d659ac4) at inode.c:358
#7  inode_table_prune (table=table at entry=0x7fceb40d06a0) at inode.c:1540
#8  0x00007fcec5e1f8b4 in inode_unref (inode=0x7fce8d659ac4) at inode.c:529
#9  0x00007fcec5e327b4 in fd_destroy (bound=_gf_true, fd=0x7fceb40d0b2c) at
fd.c:537
#10 fd_unref (fd=0x7fceb40d0b2c) at fd.c:573
#11 0x00007fceb139e397 in server3_3_releasedir (req=0x7fceb8010190) at
server-rpc-fops.c:4072
#12 0x00007fcec5bbd6db in rpcsvc_handle_rpc_call (svc=0x7fceb4030130,
trans=trans at entry=0x7fceb40cda10, msg=0x7fcea4009bb0) at rpcsvc.c:705
#13 0x00007fcec5bbd8ab in rpcsvc_notify (trans=0x7fceb40cda10,
mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at
rpcsvc.c:799
#14 0x00007fcec5bbf7f3 in rpc_transport_notify (this=this at entry=0x7fceb40cda10,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7fcea4009bb0)
at rpc-transport.c:546
#15 0x00007fcebaa86654 in socket_event_poll_in (this=this at entry=0x7fceb40cda10)
at socket.c:2245
#16 0x00007fcebaa89294 in socket_event_handler (fd=fd at entry=10,
idx=idx at entry=5, data=0x7fceb40cda10, poll_in=1, poll_out=0, poll_err=0) at
socket.c:2358
#17 0x00007fcec5e6432a in event_dispatch_epoll_handler (event=0x7fceb037ee80,
event_pool=0x7fcec7b7f300) at event-epoll.c:575
#18 event_dispatch_epoll_worker (data=0x7fceb4002ac0) at event-epoll.c:678
#19 0x00007fcec4c5ddc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fcec45a2ced in clone () from /lib64/libc.so.6
===

Attaching xz-compressed core files as well.

--- Additional comment from Oleksandr Natalenko on 2016-08-04 08:33 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-04 08:33 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-04 08:34 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-04 09:08:30 EDT ---

Tried http://review.gluster.org/#/c/15087/ as suggested by Soumya, but got the
same crash.

core.18159:

===
#0  0x00007ff5478c8bd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ff548a6185d in gf_log_set_log_buf_size (buf_size=buf_size at entry=0)
at logging.c:256  
#2  0x00007ff548a619f7 in gf_log_disable_suppression_before_exit
(ctx=0x7ff5494f8010) at logging.c:428  
#3  0x00007ff548a83a75 in gf_print_trace (signum=11, ctx=0x7ff5494f8010) at
common-utils.c:579
#4  <signal handler called>
#5  0x00007ff548a875a9 in __inode_ctx_free (inode=inode at entry=0x7ff51027f8f0)
at inode.c:336
#6  0x00007ff548a8864f in __inode_destroy (inode=0x7ff51027f8f0) at inode.c:358
#7  inode_table_prune (table=table at entry=0x7ff5340cf8b0) at inode.c:1548
#8  0x00007ff548a88944 in inode_unref (inode=0x7ff51027f8f0) at inode.c:529
#9  0x00007ff548a5c2b2 in loc_wipe (loc=loc at entry=0x7ff546010cc8) at
xlator.c:695
#10 0x00007ff548a9a994 in call_stub_wipe_args (stub=0x7ff546010c88) at
call-stub.c:2511
#11 call_stub_destroy (stub=0x7ff546010c88) at call-stub.c:2550
#12 0x00007ff539232363 in iot_worker (data=0x7ff534055c20) at io-threads.c:215
#13 0x00007ff5478c6dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007ff54720bced in clone () from /lib64/libc.so.6
===

--- Additional comment from Oleksandr Natalenko on 2016-08-04 09:09 EDT ---



--- Additional comment from Kaushal on 2017-03-08 05:50:07 EST ---

This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been
performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug
against the newer release.

--- Additional comment from Niels de Vos on 2017-04-13 11:28:59 EDT ---

I've been running a test on glusterfs-3.10.1-1.el7.x86_64 for over half a day
now, and have not noticed any problem yet. This is what is running (and will be
running for a longer period):

- single brick volume
- mounted with fuse, and recursively cp'ing and rm'ing /usr in a loop
- running 'nmap -p49152 127.0.0.1' in a loop
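
Roughly, the harness looks like this ("/mnt/test" is a placeholder for the
fuse mountpoint):

===
# copy/remove loop on the fuse mount
while true; do
    cp -a /usr /mnt/test/usr-copy
    rm -rf /mnt/test/usr-copy
done &

# port probe loop against the single brick
while true; do nmap -p49152 127.0.0.1; done
===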

Is this bug still reproducible with the current version of Gluster for you?

--- Additional comment from Oleksandr Natalenko on 2017-04-13 12:42:55 EDT ---

Niels,

I assume this is the same bug I've already described in 1353561. Backtrace
looks the same.

My setup there was distributed-replicated, 2 replicas, 5 bricks on each
replica.

It's sad that it was closed without proper investigation :(. Unfortunately, I
no longer have access to my setup.

--- Additional comment from Skyler Vock on 2017-04-13 13:33:24 EDT ---

(In reply to Niels de Vos from comment #1)
> I've been running a test on glusterfs-3.10.1-1.el7.x86_64 for over half a
> day now, and have not noticed any problem yet. This is what is running (and
> will be running for a longer period):
> 
> - single brick volume
> - mounted with fuse, and recursively cp'ing and rm'ing /usr in a loop
> - running 'nmap -p49152 127.0.0.1' in a loop
> 
> Is this bug still reproducible with the current version of Gluster for you?

Niels,

This is a difficult bug to replicate. We see the outage in production anywhere
between once a week and once every 3 weeks. For us, this outage occurs ONLY
during a

'find <path> -depth -type f -name "*" -mtime +31 -delete' 

command in a backup cron. To speed up the test process, we have a script that
repeatedly builds the environment, runs the backup cron, and then destroys the
environment. While this is running on a gluster client, we are repeatedly
executing

'while true; do nmap -Pn -sT -p49150-49160 <ip>; done' 

on a node outside of the gluster architecture, without glusterfs installed.
This is similar to what a polling system such as OpenNMS does.
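
A sketch of the two loops side by side (the backup path and <server-ip> are
placeholders, matching the commands quoted above):

===
# on the gluster client, after the script has (re)built the environment:
find /path/to/backup -depth -type f -name "*" -mtime +31 -delete

# on the node outside of the gluster architecture:
while true; do nmap -Pn -sT -p49150-49160 <server-ip>; done
===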

- one volume
- replicated
- two clients
- three servers, one brick per server
- mounted with fuse

Our test takes anywhere between an hour and 24 hours to reproduce the issue.
We can consistently reproduce it, and we continue to see this bug in
production.

Let me know if you have other questions.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1438966
[Bug 1438966] Multiple bricks WILL crash after TCP port probing