[Bugs] [Bug 1456331] New: [Bitrot]: Brick process crash observed while trying to recover a bad file in disperse volume

bugzilla at redhat.com
Mon May 29 05:39:04 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1456331

            Bug ID: 1456331
           Summary: [Bitrot]: Brick process crash observed while trying to
                    recover a bad file in disperse volume
           Product: GlusterFS
           Version: 3.11
         Component: bitrot
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    rhs-bugs at redhat.com, sanandpa at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1454317
            Blocks: 1451280
      Docs Contact: bugs at gluster.org



+++ This bug was initially created as a clone of Bug #1454317 +++

+++ This bug was initially created as a clone of Bug #1451280 +++

Description of problem:
=======================
Had a 6-node cluster running the 3.8.4-23 build. Created a 1 x (4+2) EC
volume and mounted it via FUSE. Created two files, 'test1' and 'test2', and
corrupted both. The scrubber detected both files as corrupted. Updated the
build to 3.8.4-25 and restarted glusterd. Followed the steps for recovering a
bad file as given in the admin guide. 'test2' recovered successfully, but
'test1' failed with 'Input/output error' on the mountpoint. Volume status
showed 2 brick processes down.


Version-Release number of selected component (if applicable):
===========================================================



How reproducible:
=================
1:1


Additional info:
================

[root@dhcp47-121 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)

Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)

Hostname: dhcp47-115.lab.eng.blr.redhat.com
Uuid: f828fdfa-e08f-4d12-85d8-2121cafcf9d0
State: Peer in Cluster (Connected)

Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Connected)

Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
[root@dhcp47-121 ~]#
[root@dhcp47-121 ~]# gluster v status disp2
Status of volume: disp2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.121:/bricks/brick8/disp2_0   49154     0          Y       5552 
Brick 10.70.47.113:/bricks/brick8/disp2_1   N/A       N/A        N       N/A  
Brick 10.70.47.114:/bricks/brick8/disp2_2   49154     0          Y       30916
Brick 10.70.47.115:/bricks/brick8/disp2_3   49154     0          Y       23469
Brick 10.70.47.116:/bricks/brick8/disp2_4   49153     0          Y       27754
Brick 10.70.47.117:/bricks/brick8/disp2_5   N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       5497 
Bitrot Daemon on localhost                  N/A       N/A        Y       5515 
Scrubber Daemon on localhost                N/A       N/A        Y       5525 
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com   N/A   N/A   Y   5893
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com      N/A   N/A   Y   5911
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A   N/A   Y   5921
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com   N/A   N/A   Y   30858
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com      N/A   N/A   Y   30876
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A   N/A   Y   30886
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com   N/A   N/A   Y   27708
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A   Y   27726
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A   N/A   Y   27736
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com   N/A   N/A   Y   9684
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com      N/A   N/A   Y   9702
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A   N/A   Y   9712
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com   N/A   N/A   Y   23411
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com      N/A   N/A   Y   23429
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A   N/A   Y   23439

Task Status of Volume disp2
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-121 ~]#
[root@dhcp47-121 ~]# gluster v info disp2

Volume Name: disp2
Type: Disperse
Volume ID: d7b0d170-f0e0-4e26-9369-f0a52dc92d38
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.121:/bricks/brick8/disp2_0
Brick2: 10.70.47.113:/bricks/brick8/disp2_1
Brick3: 10.70.47.114:/bricks/brick8/disp2_2
Brick4: 10.70.47.115:/bricks/brick8/disp2_3
Brick5: 10.70.47.116:/bricks/brick8/disp2_4
Brick6: 10.70.47.117:/bricks/brick8/disp2_5
Options Reconfigured:
performance.stat-prefetch: off
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly
cluster.brick-multiplex: disable
[root@dhcp47-121 ~]#
[root@dhcp47-121 ~]# gluster v bitrot disp2 scrub status

Volume name : disp2

State of scrub: Active (In Progress)

Scrub impact: lazy

Scrub frequency: hourly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: localhost

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-114.lab.eng.blr.redhat.com

Number of Scrubbed files: 1

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:12

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-116.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:14

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-113.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:24

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 2

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 09:35:11

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0


=========================================================

Node: dhcp47-117.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2017-05-16 08:35:23

Duration of last scrub (D:H:M:S): 0:0:0:7

Error count: 0

=========================================================

[root@dhcp47-121 ~]# gluster v heal disp2 info
Brick 10.70.47.121:/bricks/brick8/disp2_0
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.113:/bricks/brick8/disp2_1
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.114:/bricks/brick8/disp2_2
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.115:/bricks/brick8/disp2_3
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.116:/bricks/brick8/disp2_4
/d1/d2/d3/d4/test2 
Status: Connected
Number of entries: 1

Brick 10.70.47.117:/bricks/brick8/disp2_5
Status: Transport endpoint is not connected
Number of entries: -

[root@dhcp47-121 ~]#


[2017-05-16 08:54:10.160132] E [MSGID: 115070]
[server-rpc-fops.c:1474:server_open_cbk] 0-disp2-server: 4619: OPEN
/d1/d2/d3/d4/test2 (3673eecb-e5b5-4014-9bc6-a2fc007f08cb) ==> (Input/output
error) [Input/output error]
pending frames:
frame : type(0) op(29)
frame : type(0) op(11)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-05-16 08:55:01
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0e805201b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f0e80529bd4]
/lib64/libc.so.6(+0x35250)[0x7f0e7ec02250]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xadf4)[0x7f0e7174cdf4]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xde56)[0x7f0e7174fe56]
/usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so(+0x5815)[0x7f0e71535815]
/usr/lib64/glusterfs/3.8.4/xlator/features/locks.so(+0x6dc8)[0x7f0e71312dc8]
/usr/lib64/glusterfs/3.8.4/xlator/features/worm.so(+0x7e59)[0x7f0e71106e59]
/usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so(+0x4478)[0x7f0e70efb478]
/usr/lib64/glusterfs/3.8.4/xlator/features/leases.so(+0x50b4)[0x7f0e70ce70b4]
/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so(+0xf143)[0x7f0e70ad7143]
/lib64/libglusterfs.so.0(default_open_resume+0x1c9)[0x7f0e805b1269]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f0e80542b25]
/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so(+0x4957)[0x7f0e708c1957]
/lib64/libpthread.so.0(+0x7dc5)[0x7f0e7f37fdc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0e7ecc473d]

BT:

Program terminated with signal 11, Segmentation fault.
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at
../../../../../libglusterfs/src/list.h:40
40        new->next = head;
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64
libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64
libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64
libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64
sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at
../../../../../libglusterfs/src/list.h:40
#1  br_stub_add_fd_to_inode (this=this@entry=0x7f0e6c012440,
fd=fd@entry=0x7f0e6c0a5050, ctx=ctx@entry=0x0) at bit-rot-stub.c:2398
#2  0x00007f0e7174fe56 in br_stub_open (frame=0x7f0e28000ca0,
this=0x7f0e6c012440, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at bit-rot-stub.c:2352
#3  0x00007f0e71535815 in posix_acl_open (frame=0x7f0e280014b0,
this=0x7f0e6c013d70, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at posix-acl.c:1129
#4  0x00007f0e71312dc8 in pl_open (frame=frame@entry=0x7f0e28000ac0,
this=this@entry=0x7f0e6c015320, loc=loc@entry=0x7f0e6c0ccf90,
flags=flags@entry=2, fd=fd@entry=0x7f0e6c0a5050,
    xdata=xdata@entry=0x0) at posix.c:1698
#5  0x00007f0e71106e59 in worm_open (frame=0x7f0e28000ac0, this=<optimized
out>, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at worm.c:43
#6  0x00007f0e70efb478 in ro_open (frame=0x7f0e28001740, this=0x7f0e6c018130,
loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at
read-only-common.c:341
#7  0x00007f0e70ce70b4 in leases_open (frame=0x7f0e28001b50,
this=0x7f0e6c019880, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at leases.c:75
#8  0x00007f0e70ad7143 in up_open (frame=0x7f0e28002250, this=0x7f0e6c01af20,
loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at upcall.c:75
#9  0x00007f0e805b1269 in default_open_resume (frame=0x7f0e6c002020,
this=0x7f0e6c01c690, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0)
at defaults.c:1726
#10 0x00007f0e80542b25 in call_resume (stub=0x7f0e6c0ccf40) at call-stub.c:2508
#11 0x00007f0e708c1957 in iot_worker (data=0x7f0e6c0550e0) at io-threads.c:220
#12 0x00007f0e7f37fdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0e7ecc473d in clone () from /lib64/libc.so.6
(gdb)
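
Analysis: the fault address in frame #0 ('new=0x18') lines up with frame #1,
where 'ctx=ctx@entry=0x0' is passed into br_stub_add_fd_to_inode(). Taking
'&ctx->fd_list' off a NULL context pointer yields the member's offset within
the struct instead of a real address, and list_add_tail()'s 'new->next = head'
then writes through that near-NULL pointer. Below is a minimal compilable
sketch of that failure mode; the struct name and layout are illustrative
assumptions, not the real br_stub_inode_ctx_t definition:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-in for the bitrot-stub inode context. Any layout
 * that places a list head at offset 0x18 reproduces the observed fault
 * address; the real structure may differ. */
struct inode_ctx {
        int              need_writeback;   /* offset 0x00 */
        unsigned long    currentversion;   /* offset 0x08 after padding */
        int              info_sign;        /* offset 0x10 */
        struct list_head fd_list;          /* offset 0x18 after padding */
};

int
main (void)
{
        struct inode_ctx *ctx = NULL;  /* the context that was never assigned */

        /* With ctx == NULL, &ctx->fd_list evaluates to the member offset,
         * i.e. the 0x18 seen as 'new' in frame #0 of the backtrace.
         * list_add_tail() would then execute 'new->next = head' against
         * this address and take SIGSEGV, exactly as in the brick crash. */
        printf ("&ctx->fd_list = %p, offsetof = 0x%zx\n",
                (void *) &ctx->fd_list,
                offsetof (struct inode_ctx, fd_list));
        return 0;
}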

--- Additional comment from Worker Ant on 2017-05-22 09:00:24 EDT ---

REVIEW: https://review.gluster.org/17357 (features/bitrot: Fix glusterfsd
crash) posted (#1) for review on master by Kotresh HR (khiremat at redhat.com)

--- Additional comment from Worker Ant on 2017-05-28 23:52:14 EDT ---

COMMIT: https://review.gluster.org/17357 committed in master by Atin Mukherjee
(amukherj at redhat.com) 
------
commit 6908e962f6293d38f0ee65c088247a66f2832e4a
Author: Kotresh HR <khiremat at redhat.com>
Date:   Mon May 22 08:47:07 2017 -0400

    features/bitrot: Fix glusterfsd crash

    With object versioning being optional, it can
    happen that the bitrot stub context is not
    always set. When it is not found, it is
    initialized, but the new context was not being
    assigned to the local variable used later in
    the function. This was leading to a brick
    crash. Fixed the same.

    Change-Id: I0dab6435cdfe16a8c7f6a31ffec1a370822597a8
    BUG: 1454317
    Signed-off-by: Kotresh HR <khiremat at redhat.com>
    Reviewed-on: https://review.gluster.org/17357
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra Bhat <raghavendra at redhat.com>
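
For readers outside the codebase, the change described above amounts to the
get-or-create pattern sketched below: when the lookup misses and a context is
created on demand, the create path must also hand the new context back to the
caller's local pointer. This is a minimal compilable sketch under assumed
names (ctx_lookup, ctx_init, get_or_init_ctx and the struct layouts are
hypothetical, not the actual bit-rot-stub API from review 17357):

#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

struct inode_ctx { struct list_head fd_list; };

struct inode { struct inode_ctx *stored_ctx; };  /* simplified per-inode slot */

/* Hypothetical helpers standing in for the xlator's inode-context calls. */
static struct inode_ctx *
ctx_lookup (struct inode *inode)
{
        return inode->stored_ctx;
}

static struct inode_ctx *
ctx_init (struct inode *inode)
{
        struct inode_ctx *ctx = calloc (1, sizeof (*ctx));
        if (ctx) {
                ctx->fd_list.next = ctx->fd_list.prev = &ctx->fd_list;
                inode->stored_ctx = ctx;
        }
        return ctx;
}

static int
get_or_init_ctx (struct inode *inode, struct inode_ctx **ctx_out)
{
        struct inode_ctx *ctx = ctx_lookup (inode);

        if (!ctx) {
                /* With versioning optional, a first open may find no
                 * context; create it on demand. */
                ctx = ctx_init (inode);
                if (!ctx)
                        return -1;
        }

        /* The crash happened because an assignment like this was missing
         * on the init path, leaving the caller's pointer NULL. */
        *ctx_out = ctx;
        return 0;
}

int
main (void)
{
        struct inode inode = { 0 };
        struct inode_ctx *ctx = NULL;
        int ret = get_or_init_ctx (&inode, &ctx);

        /* ctx now points at the stored context instead of staying NULL. */
        free (inode.stored_ctx);
        return (ret == 0 && ctx != NULL) ? 0 : 1;
}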


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1451280
[Bug 1451280] [Bitrot]: Brick process crash observed while trying to
recover a bad file in disperse volume
https://bugzilla.redhat.com/show_bug.cgi?id=1454317
[Bug 1454317] [Bitrot]: Brick process crash observed while trying to
recover a bad file in disperse volume