[Bugs] [Bug 1705351] New: glusterfsd crash after days of running

bugzilla at redhat.com
Thu May 2 07:10:11 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1705351

            Bug ID: 1705351
           Summary: glusterfsd crash after days of running
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: HDFS
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: waza123 at inbox.lv
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



One of the bricks just crashed glusterfsd and it can't be started again.
What can I do to start it again?
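
For reference, the usual way to bring a single offline brick process back up, assuming the brick's backing filesystem is still intact (which has not been verified here), is a forced volume start:

# gluster volume start hadoop_volume force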

gdb output from the crash dump:




Program terminated with signal SIGSEGV, Segmentation fault.
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
239             local = upcall_local_init (frame, this, NULL, NULL, fd->inode, NULL);
[Current thread is 1 (Thread 0x7feb0031e700 (LWP 12319))]
(gdb) bt
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
#1  0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
#2  0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
#3  0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
#4  0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
#5  0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) bt full
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
        op_errno = -1
        local = 0x0
        __FUNCTION__ = "up_lk"
#1  0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
        _new = 0x7fea88193f30
        old_THIS = 0x7feb3401e060
        tmp_cbk = 0x7feb3e1bafa0 <default_lk_cbk>
        __FUNCTION__ = "default_lk_resume"
#2  0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
        old_THIS = 0x7feb3401e060
        __FUNCTION__ = "call_resume"
#3  0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
        conf = 0x7feb34058480
        this = <optimized out>
        stub = 0x7feb0d174bf0
        sleep_till = {tv_sec = 1556637893, tv_nsec = 0}
        ret = <optimized out>
        pri = 1
        bye = _gf_false
        __FUNCTION__ = "iot_worker"
#4  0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb0031e700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647297312512, 5756482990956014801, 0, 140648089937359, 140647297313216, 140648166818944, -5749651260269466415, -5749590536105693999}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
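
The crash itself looks like a plain NULL pointer dereference: up_lk() is entered with fd=0x0, and upcall.c:239 reads fd->inode without checking fd first. Below is only a minimal, self-contained sketch of that failure mode and of the kind of guard that would avoid it; the types and names are simplified stand-ins, not the actual GlusterFS code:

/* Sketch of the failure mode seen in frame #0: a NULL fd whose
 * ->inode member is read unconditionally.  Stand-in types only. */
#include <errno.h>
#include <stdio.h>

typedef struct inode { int dummy; } inode_t;
typedef struct fd    { inode_t *inode; } fd_t;

/* Unguarded: dereferencing fd->inode with fd == NULL is the SIGSEGV
 * reported at upcall.c:239. */
static inode_t *
lk_unguarded (fd_t *fd)
{
        return fd->inode;
}

/* Guarded: validate fd before touching fd->inode and fail with EINVAL
 * instead of crashing the brick process. */
static inode_t *
lk_guarded (fd_t *fd, int *op_errno)
{
        if (fd == NULL) {
                *op_errno = EINVAL;
                return NULL;
        }
        return fd->inode;
}

int
main (void)
{
        int op_errno = 0;

        if (lk_guarded (NULL, &op_errno) == NULL)
                printf ("guarded path: refused NULL fd, op_errno=%d\n", op_errno);

        /* lk_unguarded (NULL) would segfault here, exactly like frame #0. */
        (void) lk_unguarded;
        return 0;
}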





# config

# gluster volume info

Volume Name: hadoop_volume
Type: Disperse
Volume ID: f13b43b0-ff9e-429b-81ed-15c92cdd1181
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: hdd1:/hadoop
Brick2: hdd2:/hadoop
Brick3: hdd3:/hadoop
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
server.statedump-path: /tmp
performance.client-io-threads: on
server.event-threads: 16
client.event-threads: 16
cluster.lookup-optimize: on
performance.parallel-readdir: on
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 500000
features.lock-heal: on
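
Worth noting in the context of the crash: the faulting frame is in the upcall translator, and upcall notifications on the brick are driven by the features.cache-invalidation option above. Whether disabling it would actually avoid the crash is only an assumption, but as a temporary workaround it could be turned off with:

# gluster volume set hadoop_volume features.cache-invalidation off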



# status

# gluster volume status
Status of volume: hadoop_volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hdd1:/hadoop                          49152     0          Y       5085
Brick hdd2:/hadoop                          49152     0          Y       4044
Self-heal Daemon on localhost               N/A       N/A        Y       2383
Self-heal Daemon on serv3                   N/A       N/A        Y       2423
Self-heal Daemon on serv2                   N/A       N/A        Y       3429
Self-heal Daemon on hdd2                    N/A       N/A        Y       4035
Self-heal Daemon on hdd1                    N/A       N/A        Y       5076

Task Status of Volume hadoop_volume
------------------------------------------------------------------------------
There are no active volume tasks
