[Bugs] [Bug 1456331] New: [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume
bugzilla at redhat.com
Mon May 29 05:39:04 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1456331
Bug ID: 1456331
Summary: [Bitrot]: Brick process crash observed while trying to
recover a bad file in disperse volume
Product: GlusterFS
Version: 3.11
Component: bitrot
Severity: high
Assignee: bugs at gluster.org
Reporter: khiremat at redhat.com
CC: amukherj at redhat.com, bugs at gluster.org,
rhs-bugs at redhat.com, sanandpa at redhat.com,
storage-qa-internal at redhat.com
Depends On: 1454317
Blocks: 1451280
Docs Contact: bugs at gluster.org
+++ This bug was initially created as a clone of Bug #1454317 +++
+++ This bug was initially created as a clone of Bug #1451280 +++
Description of problem:
=======================
Had a 6-node cluster running the 3.8.4-23 build. Created a 1 x (4+2) EC volume
and mounted it via FUSE. Created two files, 'test1' and 'test2', and corrupted
both. The scrubber detected both files as corrupted. Updated the build to
3.8.4-25 and restarted glusterd. Followed the steps for recovering the files as
mentioned in the admin guide. 'test2' recovered successfully, but 'test1'
failed with 'Input/output error' on the mountpoint. Volume status showed 2
brick processes down.
Version-Release number of selected component (if applicable):
===========================================================
How reproducible:
=================
1:1
Additional info:
================
[root at dhcp47-121 ~]# gluster peer status
Number of Peers: 5
Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)
Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)
Hostname: dhcp47-115.lab.eng.blr.redhat.com
Uuid: f828fdfa-e08f-4d12-85d8-2121cafcf9d0
State: Peer in Cluster (Connected)
Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Connected)
Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
[root at dhcp47-121 ~]#
[root at dhcp47-121 ~]#
[root at dhcp47-121 ~]# gluster v status disp2
Status of volume: disp2
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.47.121:/bricks/brick8/disp2_0 49154 0 Y 5552
Brick 10.70.47.113:/bricks/brick8/disp2_1 N/A N/A N N/A
Brick 10.70.47.114:/bricks/brick8/disp2_2 49154 0 Y 30916
Brick 10.70.47.115:/bricks/brick8/disp2_3 49154 0 Y 23469
Brick 10.70.47.116:/bricks/brick8/disp2_4 49153 0 Y 27754
Brick 10.70.47.117:/bricks/brick8/disp2_5 N/A N/A N N/A
Self-heal Daemon on localhost N/A N/A Y 5497
Bitrot Daemon on localhost N/A N/A Y 5515
Scrubber Daemon on localhost N/A N/A Y 5525
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com   N/A   N/A   Y   5893
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com      N/A   N/A   Y   5911
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A   N/A   Y   5921
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com   N/A   N/A   Y   30858
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com      N/A   N/A   Y   30876
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A   N/A   Y   30886
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com   N/A   N/A   Y   27708
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A   Y   27726
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A   N/A   Y   27736
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com   N/A   N/A   Y   9684
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com      N/A   N/A   Y   9702
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A   N/A   Y   9712
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com   N/A   N/A   Y   23411
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com      N/A   N/A   Y   23429
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A   N/A   Y   23439
Task Status of Volume disp2
------------------------------------------------------------------------------
There are no active volume tasks
[root at dhcp47-121 ~]#
[root at dhcp47-121 ~]# gluster v info disp2
Volume Name: disp2
Type: Disperse
Volume ID: d7b0d170-f0e0-4e26-9369-f0a52dc92d38
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.121:/bricks/brick8/disp2_0
Brick2: 10.70.47.113:/bricks/brick8/disp2_1
Brick3: 10.70.47.114:/bricks/brick8/disp2_2
Brick4: 10.70.47.115:/bricks/brick8/disp2_3
Brick5: 10.70.47.116:/bricks/brick8/disp2_4
Brick6: 10.70.47.117:/bricks/brick8/disp2_5
Options Reconfigured:
performance.stat-prefetch: off
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly
cluster.brick-multiplex: disable
[root at dhcp47-121 ~]#
[root at dhcp47-121 ~]# gluster v bitrot disp2 scrub status
Volume name : disp2
State of scrub: Active (In Progress)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:12
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-114.lab.eng.blr.redhat.com
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:12
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-116.lab.eng.blr.redhat.com
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:14
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-113.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 08:35:24
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:11
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-117.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 08:35:23
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
[root at dhcp47-121 ~]# gluster v heal disp2 info
Brick 10.70.47.121:/bricks/brick8/disp2_0
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1
Brick 10.70.47.113:/bricks/brick8/disp2_1
Status: Transport endpoint is not connected
Number of entries: -
Brick 10.70.47.114:/bricks/brick8/disp2_2
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1
Brick 10.70.47.115:/bricks/brick8/disp2_3
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1
Brick 10.70.47.116:/bricks/brick8/disp2_4
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1
Brick 10.70.47.117:/bricks/brick8/disp2_5
Status: Transport endpoint is not connected
Number of entries: -
[root at dhcp47-121 ~]#
[2017-05-16 08:54:10.160132] E [MSGID: 115070]
[server-rpc-fops.c:1474:server_open_cbk] 0-disp2-server: 4619: OPEN
/d1/d2/d3/d4/test2 (3673eecb-e5b5-4014-9bc6-a2fc007f08cb) ==> (Input/output
error) [Input/output error]
pending frames:
frame : type(0) op(29)
frame : type(0) op(11)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-05-16 08:55:01
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0e805201b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f0e80529bd4]
/lib64/libc.so.6(+0x35250)[0x7f0e7ec02250]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xadf4)[0x7f0e7174cdf4]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xde56)[0x7f0e7174fe56]
/usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so(+0x5815)[0x7f0e71535815]
/usr/lib64/glusterfs/3.8.4/xlator/features/locks.so(+0x6dc8)[0x7f0e71312dc8]
/usr/lib64/glusterfs/3.8.4/xlator/features/worm.so(+0x7e59)[0x7f0e71106e59]
/usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so(+0x4478)[0x7f0e70efb478]
/usr/lib64/glusterfs/3.8.4/xlator/features/leases.so(+0x50b4)[0x7f0e70ce70b4]
/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so(+0xf143)[0x7f0e70ad7143]
/lib64/libglusterfs.so.0(default_open_resume+0x1c9)[0x7f0e805b1269]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f0e80542b25]
/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so(+0x4957)[0x7f0e708c1957]
/lib64/libpthread.so.0(+0x7dc5)[0x7f0e7f37fdc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0e7ecc473d]
BT:
Program terminated with signal 11, Segmentation fault.
#0 list_add_tail (head=0x7f0e28001908, new=0x18) at
../../../../../libglusterfs/src/list.h:40
40 new->next = head;
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64
libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64
libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64
libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64
sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 list_add_tail (head=0x7f0e28001908, new=0x18) at
../../../../../libglusterfs/src/list.h:40
#1 br_stub_add_fd_to_inode (this=this at entry=0x7f0e6c012440,
fd=fd at entry=0x7f0e6c0a5050, ctx=ctx at entry=0x0) at bit-rot-stub.c:2398
#2 0x00007f0e7174fe56 in br_stub_open (frame=0x7f0e28000ca0,
this=0x7f0e6c012440, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at bit-rot-stub.c:2352
#3 0x00007f0e71535815 in posix_acl_open (frame=0x7f0e280014b0,
this=0x7f0e6c013d70, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at posix-acl.c:1129
#4 0x00007f0e71312dc8 in pl_open (frame=frame at entry=0x7f0e28000ac0,
this=this at entry=0x7f0e6c015320, loc=loc at entry=0x7f0e6c0ccf90,
flags=flags at entry=2, fd=fd at entry=0x7f0e6c0a5050,
xdata=xdata at entry=0x0) at posix.c:1698
#5 0x00007f0e71106e59 in worm_open (frame=0x7f0e28000ac0, this=<optimized
out>, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at worm.c:43
#6 0x00007f0e70efb478 in ro_open (frame=0x7f0e28001740, this=0x7f0e6c018130,
loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at
read-only-common.c:341
#7 0x00007f0e70ce70b4 in leases_open (frame=0x7f0e28001b50,
this=0x7f0e6c019880, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at leases.c:75
#8 0x00007f0e70ad7143 in up_open (frame=0x7f0e28002250, this=0x7f0e6c01af20,
loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at upcall.c:75
#9 0x00007f0e805b1269 in default_open_resume (frame=0x7f0e6c002020,
this=0x7f0e6c01c690, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at defaults.c:1726
#10 0x00007f0e80542b25 in call_resume (stub=0x7f0e6c0ccf40) at call-stub.c:2508
#11 0x00007f0e708c1957 in iot_worker (data=0x7f0e6c0550e0) at io-threads.c:220
#12 0x00007f0e7f37fdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0e7ecc473d in clone () from /lib64/libc.so.6
(gdb)
--- Additional comment from Worker Ant on 2017-05-22 09:00:24 EDT ---
REVIEW: https://review.gluster.org/17357 (features/bitrot: Fix glusterfsd
crash) posted (#1) for review on master by Kotresh HR (khiremat at redhat.com)
--- Additional comment from Worker Ant on 2017-05-28 23:52:14 EDT ---
COMMIT: https://review.gluster.org/17357 committed in master by Atin Mukherjee
(amukherj at redhat.com)
------
commit 6908e962f6293d38f0ee65c088247a66f2832e4a
Author: Kotresh HR <khiremat at redhat.com>
Date: Mon May 22 08:47:07 2017 -0400
features/bitrot: Fix glusterfsd crash
With object versioning being optional, it can happen that the bitrot
stub context is not always set. When it is not found, it is
initialized, but the newly initialized context was not assigned for
use in the calling function, which led to a brick crash. This patch
fixes that.
Change-Id: I0dab6435cdfe16a8c7f6a31ffec1a370822597a8
BUG: 1454317
Signed-off-by: Kotresh HR <khiremat at redhat.com>
Reviewed-on: https://review.gluster.org/17357
Smoke: Gluster Build System <jenkins at build.gluster.org>
NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
Reviewed-by: Raghavendra Bhat <raghavendra at redhat.com>
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1451280
[Bug 1451280] [Bitrot]: Brick process crash observed while trying to
recover a bad file in disperse volume
https://bugzilla.redhat.com/show_bug.cgi?id=1454317
[Bug 1454317] [Bitrot]: Brick process crash observed while trying to
recover a bad file in disperse volume