[Bugs] [Bug 1511669] New: bitrot daemon crashes when paused

bugzilla at redhat.com
Thu Nov 9 20:11:34 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1511669

            Bug ID: 1511669
           Summary: bitrot daemon crashes when paused
           Product: GlusterFS
           Version: 3.10
         Component: bitrot
          Assignee: bugs at gluster.org
          Reporter: rabhat at redhat.com
                CC: bugs at gluster.org
      Docs Contact: bugs at gluster.org



Description of problem:

The bitrot daemon sometimes crashes when it is paused.

Gluster version: 3.10.1

1) The bug was hit by a community user.

The issue is described here:
http://lists.gluster.org/pipermail/gluster-users/2017-September/032359.html

2) I was able to reproduce the issue myself with the following steps:

A) Created a gluster volume.
B) Started and mounted it.
C) Created some data.
D) Enabled the bitrot daemon (note: this can also be done before step C).
E) After some data was created, started on-demand scrubbing.
F) While scrubbing was in progress, paused the scrubber.

NOTE: I had to repeat steps E and F (i.e. resuming and pausing the scrubber)
multiple times to hit the crash once.
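
The steps above can be sketched as a list of CLI invocations. This is only an illustration, not the exact commands used for this report: the volume name, brick path, mount point, and the number of pause/resume cycles are hypothetical, and the helper `repro_commands` is made up for this example. It only builds the command lines and does not run them.

```python
def repro_commands(volume, brick, mount_point, server="localhost", cycles=3):
    """Return the CLI invocations for steps A-F as argv lists.

    Nothing is executed here. All argument values are placeholders
    chosen by the caller, not values taken from the bug report.
    """
    bitrot = ["gluster", "volume", "bitrot", volume]
    cmds = [
        ["gluster", "volume", "create", volume, brick],                    # step A
        ["gluster", "volume", "start", volume],                            # step B
        ["mount", "-t", "glusterfs", f"{server}:/{volume}", mount_point],  # step B
        # Step C (creating data) depends on the workload and is omitted.
        bitrot + ["enable"],                                               # step D
        bitrot + ["scrub", "ondemand"],                                    # step E
    ]
    # Assumption: the repeated part is modelled as pause followed by resume.
    for _ in range(cycles):
        cmds.append(bitrot + ["scrub", "pause"])                           # step F
        cmds.append(bitrot + ["scrub", "resume"])
    return cmds


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in repro_commands("testvol", "localhost:/bricks/b1", "/mnt/testvol"):
        print(" ".join(argv))
```

Since the crash is intermittent, a real run would likely need many cycles and a scrub that is still in progress when the pause is issued.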

Below is the backtrace of the crash. It is not exactly the same as the
backtrace the community user got, but I believe both crashes were caused by
memory corruption.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/scrub -p /var/lib/g'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f098f07e80f in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
[Current thread is 1 (Thread 0x7f09872ca700 (LWP 8231))]
Missing separate debuginfos, use: dnf debuginfo-install
glibc-2.24-4.fc25.x86_64 keyutils-libs-1.5.9-8.fc24.x86_64
krb5-libs-1.14.4-4.fc25.x86_64 libcap-2.25-2.fc25.x86_64
libcom_err-1.43.3-1.fc25.x86_64 libgcc-6.3.1-1.fc25.x86_64
libselinux-2.5-13.fc25.x86_64 libuuid-2.28.2-1.fc25.x86_64
nss-mdns-0.10-17.fc24.x86_64 openssl-libs-1.0.2k-1.fc25.x86_64
pcre-8.40-1.fc25.x86_64 sssd-client-1.14.2-2.fc25.x86_64
systemd-libs-231-12.fc25.x86_64 zlib-1.2.8-10.fc24.x86_64
(gdb) bt
#0  0x00007f098f07e80f in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#1  0x00007f0990279c52 in synctask_wake (task=0x7f097affc890) at ../../../libglusterfs/src/syncop.c:354
#2  0x00007f099027b7ea in syncop_lookup_cbk (frame=0x7f0970000c90, cookie=0x7f097affc020, this=0x7f097c00f500, op_ret=-1, op_errno=107, inode=0x7f097c134360, iatt=0x7f09872c9a70, xdata=0x0, parent=0x7f09872c9a00) at ../../../libglusterfs/src/syncop.c:1211
#3  0x00007f0981bb71b6 in client3_3_lookup_cbk (req=0x7f09700043d0, iov=0x7f09872c9cc0, count=1, myframe=0x7f0970003dd0) at ../../../../../xlators/protocol/client/src/client-rpc-fops.c:2935
#4  0x00007f098fff6f3a in call_bail (data=0x7f097c0268b0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:204
#5  0x00007f09902425cd in gf_timer_proc (data=0xe404e0) at ../../../libglusterfs/src/timer.c:155
#6  0x00007f098f07c6ca in start_thread () from /lib64/libpthread.so.0
#7  0x00007f098e955f7f in clone () from /lib64/libc.so.6
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f09872ca700 (LWP 8231) 0x00007f098f07e80f in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
  2    Thread 0x7f097affd700 (LWP 5337) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3    Thread 0x7f099070a780 (LWP 8230) 0x00007f098f07d96d in pthread_join () from /lib64/libpthread.so.0
  4    Thread 0x7f09862c8700 (LWP 8233) 0x00007f098e91a81d in nanosleep () from /lib64/libc.so.6
  5    Thread 0x7f097b7fe700 (LWP 5336) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6    Thread 0x7f0986ac9700 (LWP 8232) 0x00007f098f0861c6 in sigwait () from /lib64/libpthread.so.0
  7    Thread 0x7f097bfff700 (LWP 5335) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  8    Thread 0x7f09852c6700 (LWP 8235) 0x00007f098f082809 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  9    Thread 0x7f0984ac5700 (LWP 8236) 0x00007f098e94bd73 in select () from /lib64/libc.so.6
  10   Thread 0x7f0980c36700 (LWP 8238) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11   Thread 0x7f09789a4700 (LWP 8254) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  12   Thread 0x7f0979ffb700 (LWP 8250) 0x00007f098e956573 in epoll_wait () from /lib64/libc.so.6
  13   Thread 0x7f0985ac7700 (LWP 8234) 0x00007f098f082809 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14   Thread 0x7f09825ea700 (LWP 8237) 0x00007f098f08538d in __lll_lock_wait () from /lib64/libpthread.so.0
  15   Thread 0x7f097a7fc700 (LWP 8247) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  16   Thread 0x7f09792dd700 (LWP 8253) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  17   Thread 0x7f095bfff700 (LWP 8255) 0x00007f098f082460 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Since the scrubber was running under valgrind, I was able to get the following
report of possible memory corruption:

==16894== Invalid write of size 8
==16894==    at 0x4E98658: __gf_mem_invalidate (mem-pool.c:278)
==16894==    by 0x4E9899D: __gf_free (mem-pool.c:334)
==16894==    by 0x4EACD42: synctask_destroy (syncop.c:391)
==16894==    by 0x4EACDC9: synctask_done (syncop.c:409)
==16894==    by 0x4EAD6FC: synctask_switchto (syncop.c:668)
==16894==    by 0x4EAD7B2: syncenv_processor (syncop.c:699)
==16894==    by 0x60C06C9: start_thread (in /usr/lib64/libpthread-2.24.so)
==16894==    by 0x683FF7E: clone (in /usr/lib64/libc-2.24.so)
==16894==  Address 0x17e3db71 is 2,078,081 bytes inside a block of size 2,097,224 alloc'd
==16894==    at 0x4C2FA50: calloc (vg_replace_malloc.c:711)
==16894==    by 0x4E98101: __gf_calloc (mem-pool.c:117)
==16894==    by 0x4EAD1BA: synctask_create (syncop.c:497)
==16894==    by 0x4EAD3FC: synctask_new1 (syncop.c:571)
==16894==    by 0x4EAD46C: synctask_new (syncop.c:586)
==16894==    by 0x514E1D8: rpcsvc_handle_rpc_call (rpcsvc.c:690)
==16894==    by 0x514E54C: rpcsvc_notify (rpcsvc.c:789)
==16894==    by 0x5153B7B: rpc_transport_notify (rpc-transport.c:538)
==16894==    by 0x11491F8C: socket_event_poll_in (socket.c:2268)
==16894==    by 0x114924E2: socket_event_handler (socket.c:2398)
==16894==    by 0x4ED1880: event_dispatch_epoll_handler (event-epoll.c:572)
==16894==    by 0x4ED1C5F: event_dispatch_epoll_worker (event-epoll.c:675)
==16894== 
==16894== Invalid write of size 8
==16894==    at 0x4E98660: __gf_mem_invalidate (mem-pool.c:278)
==16894==    by 0x4E9899D: __gf_free (mem-pool.c:334)
==16894==    by 0x4EACD42: synctask_destroy (syncop.c:391)
==16894==    by 0x4EACDC9: synctask_done (syncop.c:409)
==16894==    by 0x4EAD6FC: synctask_switchto (syncop.c:668)
==16894==    by 0x4EAD7B2: syncenv_processor (syncop.c:699)
==16894==    by 0x60C06C9: start_thread (in /usr/lib64/libpthread-2.24.so)
==16894==    by 0x683FF7E: clone (in /usr/lib64/libc-2.24.so)
==16894==  Address 0x17e3db79 is 2,078,089 bytes inside a block of size 2,097,224 alloc'd
==16894==    at 0x4C2FA50: calloc (vg_replace_malloc.c:711)
==16894==    by 0x4E98101: __gf_calloc (mem-pool.c:117)
==16894==    by 0x4EAD1BA: synctask_create (syncop.c:497)
==16894==    by 0x4EAD3FC: synctask_new1 (syncop.c:571)
==16894==    by 0x4EAD46C: synctask_new (syncop.c:586)
==16894==    by 0x514E1D8: rpcsvc_handle_rpc_call (rpcsvc.c:690)
==16894==    by 0x514E54C: rpcsvc_notify (rpcsvc.c:789)
==16894==    by 0x5153B7B: rpc_transport_notify (rpc-transport.c:538)
==16894==    by 0x11491F8C: socket_event_poll_in (socket.c:2268)
==16894==    by 0x114924E2: socket_event_handler (socket.c:2398)
==16894==    by 0x4ED1880: event_dispatch_epoll_handler (event-epoll.c:572)
==16894==    by 0x4ED1C5F: event_dispatch_epoll_worker (event-epoll.c:675)
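
A quick arithmetic check on the two records above, as a sketch only. The addresses, offsets, and block size are copied from the valgrind output. The comparison with 2 MiB is an assumption made here for illustration; the report does not say what the block holds.

```python
# Values copied from the two "Invalid write of size 8" records.
BLOCK_SIZE = 2_097_224
WRITES = [
    (0x17E3DB71, 2_078_081),  # (address, offset inside the block)
    (0x17E3DB79, 2_078_089),
]

# Both records should point into the same allocation: address minus offset
# gives the start of the block, which must be identical for the two writes.
bases = {address - offset for address, offset in WRITES}

# Distance between the two written addresses.
gap = WRITES[1][0] - WRITES[0][0]

# Number of bytes from each written address to the end of the block.
bytes_to_end = [BLOCK_SIZE - offset for _, offset in WRITES]

# Assumption for illustration: compare the block size with exactly 2 MiB.
extra_over_2mib = BLOCK_SIZE - 2 * 1024 * 1024

print(len(bases), gap, bytes_to_end, extra_over_2mib)
```

The two writes are adjacent 8-byte stores into the same block, which was allocated in synctask_create and written to from synctask_destroy via __gf_free.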