[Gluster-users] heal failure after bricks go down

Eli V eliventer at gmail.com
Wed Aug 3 13:33:12 UTC 2022


Here's the sequence of events that ended with 2 bricks down and a heal
failure. What should I do about the heal failure, and should I deal with
it before or after replacing the bad disk? First, the gluster 10.2 volume info:

Volume Name: glust-distr-rep
Type: Distributed-Replicate
Volume ID: fe0ea6f6-2d1b-4b5c-8af5-0c11ea546270
Status: Started
Snapshot Count: 0
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: md1cfsd01:/bricks/b0/br
Brick2: md1cfsd02:/bricks/b0/br
Brick3: md1cfsd03:/bricks/b0/br
Brick4: md1cfsd01:/bricks/b3/br
Brick5: md1cfsd02:/bricks/b3/br
Brick6: md1cfsd03:/bricks/b3/br
Brick7: md1cfsd01:/bricks/b1/br
Brick8: md1cfsd02:/bricks/b1/br
Brick9: md1cfsd03:/bricks/b1/br
Brick10: md1cfsd01:/bricks/b4/br
Brick11: md1cfsd02:/bricks/b4/br
Brick12: md1cfsd03:/bricks/b4/br
Brick13: md1cfsd01:/bricks/b2/br
Brick14: md1cfsd02:/bricks/b2/br
Brick15: md1cfsd03:/bricks/b2/br
Brick16: md1cfsd01:/bricks/b5/br
Brick17: md1cfsd02:/bricks/b5/br
Brick18: md1cfsd03:/bricks/b5/br
Options Reconfigured:
performance.md-cache-statfs: on
cluster.server-quorum-type: server
cluster.min-free-disk: 15
storage.batch-fsync-delay-usec: 0
user.smb: enable
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

The fun started with a brick (d02:b5) crashing:

[2022-08-02 18:59:29.417147 +0000] W
[rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of
rpc-request failed
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 7
time of crash:
2022-08-02 18:59:29 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.2
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7fefb20f7a54]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7fefb20fffc0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bd60)[0x7fefb1ecdd60]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x5a)[0x7fefb211c7aa]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9a)[0x7fefb209e4fa]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xaf4b)[0x7fefac1fff4b]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xb964)[0x7fefac200964]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x34)[0x7fefb20eb244]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x1ab)[0x7fefb217cf2b]
...

Then, a few hours later, a read error on a different brick (b2) on the same host:

[2022-08-02 22:04:17.808970 +0000] E [MSGID: 113040]
[posix-inode-fd-ops.c:1758:posix_readv] 0-glust-distr-rep-posix: read
failed on gfid=16b51498-966e-4546-b561-24b0062f4324,
fd=0x7ff9f00d6b08, offset=663314432 size=16384, buf=0x7ff9fc0f7000
[Input/output error]
[2022-08-02 22:04:17.809057 +0000] E [MSGID: 115068]
[server-rpc-fops_v2.c:1369:server4_readv_cbk]
0-glust-distr-rep-server: READ info [{frame=1334746}, {READV_fd_no=4},
{uuid_utoa=16b51498-966e-4546-b561-24b0062f4324},
{client=CTX_ID:6d7535af-769c-4223-aad0-79acffa836ed-GRAPH_ID:0-PID:1414-HOST:r4-16-PC_NAME:glust-distr-rep-client-13-RECON_NO:-1},
{error-xlator=glust-distr-rep-posix}, {errno=5}, {error=Input/output
error}]

This looks like a real hardware error:
[Tue Aug  2 18:03:48 2022] megaraid_sas 0000:03:00.0: 6293
(712778647s/0x0002/FATAL) - Unrecoverable medium error during recovery
on PD 04(e0x20/s4) at 1d267163
[Tue Aug  2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Aug  2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 CDB: Read(10) 28
00 1d 26 70 78 00 01 00 00
[Tue Aug  2 18:03:49 2022] blk_update_request: I/O error, dev sdd,
sector 489058424 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0


This morning, noticing both b2 & b5 were offline, I stopped and started
glusterd via systemctl to restart the bricks.
All bricks are now up:
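Roughly what I ran on the affected node (md1cfsd02); as I understand it, glusterd respawns any brick processes that aren't running when it starts, and the `force` variant noted below would be an alternative that doesn't touch glusterd itself:

```shell
# Restart the management daemon; on start it respawns missing brick processes.
systemctl stop glusterd
systemctl start glusterd

# Alternative: start only the bricks that are down, leaving glusterd running.
# gluster volume start glust-distr-rep force
```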
Status of volume: glust-distr-rep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick md1cfsd01:/bricks/b0/br               55386     0          Y       2047
Brick md1cfsd02:/bricks/b0/br               59983     0          Y       3036416
Brick md1cfsd03:/bricks/b0/br               58028     0          Y       2014
Brick md1cfsd01:/bricks/b3/br               59454     0          Y       2041
Brick md1cfsd02:/bricks/b3/br               52352     0          Y       3036421
Brick md1cfsd03:/bricks/b3/br               56786     0          Y       2017
Brick md1cfsd01:/bricks/b1/br               59885     0          Y       2040
Brick md1cfsd02:/bricks/b1/br               55148     0          Y       3036434
Brick md1cfsd03:/bricks/b1/br               52422     0          Y       2068
Brick md1cfsd01:/bricks/b4/br               56378     0          Y       2099
Brick md1cfsd02:/bricks/b4/br               60152     0          Y       3036470
Brick md1cfsd03:/bricks/b4/br               50448     0          Y       2490448
Brick md1cfsd01:/bricks/b2/br               49455     0          Y       2097
Brick md1cfsd02:/bricks/b2/br               53717     0          Y       3036498
Brick md1cfsd03:/bricks/b2/br               51838     0          Y       2124
Brick md1cfsd01:/bricks/b5/br               51002     0          Y       2104
Brick md1cfsd02:/bricks/b5/br               57204     0          Y       3036523
Brick md1cfsd03:/bricks/b5/br               56817     0          Y       2123
Self-heal Daemon on localhost               N/A       N/A        Y       3036660
Self-heal Daemon on md1cfsd03               N/A       N/A        Y       2627
Self-heal Daemon on md1cfsd01               N/A       N/A        Y       2623

Then I manually triggered a heal, which healed thousands of files but
is now stuck on the last 47 according to heal info summary.
glfsheal-glust-distr-rep.log has a bunch of entries like this:
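For reference, the heal was triggered and checked with roughly these commands:

```shell
# Kick off an index heal across the volume:
gluster volume heal glust-distr-rep

# Check how many entries remain pending per brick:
gluster volume heal glust-distr-rep info summary

# List the individual pending entries (paths or bare GFIDs):
gluster volume heal glust-distr-rep info
```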

[2022-08-03 13:08:41.169387 +0000] W [MSGID: 114031]
[client-rpc-fops_v2.c:2618:client4_0_lookup_cbk]
0-glust-distr-rep-client-16: remote operation failed.
[{path=<gfid:24977f2f-5fbe-44f2-91bd-605eda824aff>},
{gfid=24977f2f-5fbe-44f2-91bd-605eda824aff}, {errno=2}, {error=No such
file or directory}]
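For the disk replacement itself, I'm assuming the standard replace-brick flow applies once the new disk is in; the new brick path below is hypothetical, and I'd welcome corrections on whether to do this before or after sorting out the stuck heal entries:

```shell
# Hypothetical sketch: after swapping the failed disk (sdd) and creating a
# fresh filesystem for it, replace the bad brick and let self-heal
# repopulate it from its replica pair.
gluster volume replace-brick glust-distr-rep \
    md1cfsd02:/bricks/b2/br md1cfsd02:/bricks/b2-new/br \
    commit force

# Then trigger a full heal and watch progress:
gluster volume heal glust-distr-rep full
gluster volume heal glust-distr-rep info summary
```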
