[Gluster-users] heal failure after bricks go down
Strahil Nikolov
hunter86_bg at yahoo.com
Thu Aug 4 18:31:48 UTC 2022
Hi,
Have you checked the status of the GFIDs? I usually use method 2 from https://docs.gluster.org/en/main/Troubleshooting/gfid-to-path/ to identify the file on the brick. Then you can use getfattr to check the status of the files on the bricks.
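A minimal sketch of that, using the brick and GFID from the read error quoted below as an example. The find/getfattr steps are only printed here, since they have to run on the brick host itself:

```shell
#!/bin/bash
# Resolve a GFID (e.g. from heal info or the brick log) to its path
# on a brick. Brick and GFID are taken from the thread below.
BRICK=/bricks/b2/br
GFID=16b51498-966e-4546-b561-24b0062f4324

# Every file has a hardlink under .glusterfs/<aa>/<bb>/<full-gfid>,
# where <aa> and <bb> are the first two hex byte pairs of the GFID.
LINK="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$LINK"

# On the brick host: find the real path sharing that inode, then dump
# its xattrs (trusted.afr pending counters etc.) to judge heal state.
echo "find $BRICK -samefile $LINK -not -path '*/.glusterfs/*'"
echo "getfattr -d -m . -e hex <path-found-above>"
```

Comparing the trusted.afr xattrs across the replica pair is what tells you which copy the heal considers good.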
As you have 3 hosts, you can always implement an arbiter for each brick and mitigate the risk for split brains.
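A sketch of what that conversion could look like. The arb0..arb8 brick paths are hypothetical (create and name them to suit), and the final command is echoed rather than executed:

```shell
#!/bin/bash
# Convert the existing 9 x 2 volume to replica 3 arbiter 1: one
# arbiter brick per replica pair, 9 in total. The pairs in this
# volume cycle d01+d02, d03+d01, d02+d03, so each arbiter goes on
# the one host NOT already holding that pair's data bricks.
VOL=glust-distr-rep
ARBS=""
for i in $(seq 0 8); do
  host=$((3 - i % 3))   # d03, d02, d01, repeating
  ARBS="$ARBS md1cfsd0$host:/bricks/arb$i/br"
done
CMD="gluster volume add-brick $VOL replica 3 arbiter 1$ARBS"
# Echoed so this prints instead of running; drop the echo on a real
# cluster after creating the arbiter directories on each host.
echo "$CMD"
```

The arbiter bricks store only metadata, so they need little space, and they give each pair a quorum tie-breaker.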
Best Regards,
Strahil Nikolov
On Wed, Aug 3, 2022 at 16:33, Eli V <eliventer at gmail.com> wrote:
Sequence of events which ended up with 2 bricks down and a heal
failure. What should I do about the heal failure, and should I do it
before or after replacing the bad disk? First, the gluster 10.2 volume info:
Volume Name: glust-distr-rep
Type: Distributed-Replicate
Volume ID: fe0ea6f6-2d1b-4b5c-8af5-0c11ea546270
Status: Started
Snapshot Count: 0
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: md1cfsd01:/bricks/b0/br
Brick2: md1cfsd02:/bricks/b0/br
Brick3: md1cfsd03:/bricks/b0/br
Brick4: md1cfsd01:/bricks/b3/br
Brick5: md1cfsd02:/bricks/b3/br
Brick6: md1cfsd03:/bricks/b3/br
Brick7: md1cfsd01:/bricks/b1/br
Brick8: md1cfsd02:/bricks/b1/br
Brick9: md1cfsd03:/bricks/b1/br
Brick10: md1cfsd01:/bricks/b4/br
Brick11: md1cfsd02:/bricks/b4/br
Brick12: md1cfsd03:/bricks/b4/br
Brick13: md1cfsd01:/bricks/b2/br
Brick14: md1cfsd02:/bricks/b2/br
Brick15: md1cfsd03:/bricks/b2/br
Brick16: md1cfsd01:/bricks/b5/br
Brick17: md1cfsd02:/bricks/b5/br
Brick18: md1cfsd03:/bricks/b5/br
Options Reconfigured:
performance.md-cache-statfs: on
cluster.server-quorum-type: server
cluster.min-free-disk: 15
storage.batch-fsync-delay-usec: 0
user.smb: enable
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
The fun started with a brick (d02:b5) crashing:
[2022-08-02 18:59:29.417147 +0000] W
[rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of
rpc-request failed
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 7
time of crash:
2022-08-02 18:59:29 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.2
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7fefb20f7a54]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7fefb20fffc0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bd60)[0x7fefb1ecdd60]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x5a)[0x7fefb211c7aa]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9a)[0x7fefb209e4fa]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xaf4b)[0x7fefac1fff4b]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xb964)[0x7fefac200964]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x34)[0x7fefb20eb244]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x1ab)[0x7fefb217cf2b]
...
Then a few hours later a read error on a different brick (b2) on the same host:
[2022-08-02 22:04:17.808970 +0000] E [MSGID: 113040]
[posix-inode-fd-ops.c:1758:posix_readv] 0-glust-distr-rep-posix: read
failed on gfid=16b51498-966e-4546-b561-24b0062f4324,
fd=0x7ff9f00d6b08, offset=663314432 size=16384, buf=0x7ff9fc0f7000
[Input/output error]
[2022-08-02 22:04:17.809057 +0000] E [MSGID: 115068]
[server-rpc-fops_v2.c:1369:server4_readv_cbk]
0-glust-distr-rep-server: READ info [{frame=1334746}, {READV_fd_no=4},
{uuid_utoa=16b51498-966e-4546-b561-24b0062f4324},
{client=CTX_ID:6d7535af-769c-4223-aad0-79acffa836ed-GRAPH_ID:0-PID:1414-HOST:r4-16-PC_NAME:glust-distr-rep-client-13-RECON_NO:-1},
{error-xlator=glust-distr-rep-posix}, {errno=5}, {error=Input/output
error}]
This looks like a real hardware error:
[Tue Aug 2 18:03:48 2022] megaraid_sas 0000:03:00.0: 6293
(712778647s/0x0002/FATAL) - Unrecoverable medium error during recovery
on PD 04(e0x20/s4) at 1d267163
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 CDB: Read(10) 28
00 1d 26 70 78 00 01 00 00
[Tue Aug 2 18:03:49 2022] blk_update_request: I/O error, dev sdd,
sector 489058424 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
This morning, noticing both b2 & b5 were offline, I stopped and
started glusterd via systemctl to restart the bricks.
All bricks are now up:
Status of volume: glust-distr-rep
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick md1cfsd01:/bricks/b0/br 55386 0 Y 2047
Brick md1cfsd02:/bricks/b0/br 59983 0 Y 3036416
Brick md1cfsd03:/bricks/b0/br 58028 0 Y 2014
Brick md1cfsd01:/bricks/b3/br 59454 0 Y 2041
Brick md1cfsd02:/bricks/b3/br 52352 0 Y 3036421
Brick md1cfsd03:/bricks/b3/br 56786 0 Y 2017
Brick md1cfsd01:/bricks/b1/br 59885 0 Y 2040
Brick md1cfsd02:/bricks/b1/br 55148 0 Y 3036434
Brick md1cfsd03:/bricks/b1/br 52422 0 Y 2068
Brick md1cfsd01:/bricks/b4/br 56378 0 Y 2099
Brick md1cfsd02:/bricks/b4/br 60152 0 Y 3036470
Brick md1cfsd03:/bricks/b4/br 50448 0 Y 2490448
Brick md1cfsd01:/bricks/b2/br 49455 0 Y 2097
Brick md1cfsd02:/bricks/b2/br 53717 0 Y 3036498
Brick md1cfsd03:/bricks/b2/br 51838 0 Y 2124
Brick md1cfsd01:/bricks/b5/br 51002 0 Y 2104
Brick md1cfsd02:/bricks/b5/br 57204 0 Y 3036523
Brick md1cfsd03:/bricks/b5/br 56817 0 Y 2123
Self-heal Daemon on localhost N/A N/A Y 3036660
Self-heal Daemon on md1cfsd03 N/A N/A Y 2627
Self-heal Daemon on md1cfsd01 N/A N/A Y 2623
Then I manually triggered a heal, which healed thousands of files but
is now stuck on the last 47 according to heal info summary.
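For reference, the trigger-and-check cycle was roughly the following (volume name from the info output above; commands echoed here rather than executed):

```shell
#!/bin/bash
# Heal workflow used above, sketched against the volume in question.
VOL=glust-distr-rep
# Echoed so this prints instead of running against a live cluster:
echo "gluster volume heal $VOL"               # trigger an index heal
echo "gluster volume heal $VOL info summary"  # pending-entry counts per brick
echo "gluster volume heal $VOL info"          # list the stuck gfids/paths
```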
glfsheal-glust-distr-rep.log has a bunch of entries like so:
[2022-08-03 13:08:41.169387 +0000] W [MSGID: 114031]
[client-rpc-fops_v2.c:2618:client4_0_lookup_cbk]
0-glust-distr-rep-client-16: remote operation failed.
[{path=<gfid:24977f2f-5fbe-44f2-91bd-605eda824aff>},
{gfid=24977f2f-5fbe-44f2-91bd-605eda824aff}, {errno=2}, {error=No such
file or directory}]