[Gluster-users] heal failure after bricks go down
Strahil Nikolov
hunter86_bg at yahoo.com
Thu Aug 4 18:31:48 UTC 2022
Hi,
Have you checked the status of the GFIDs? I usually use method 2 from https://docs.gluster.org/en/main/Troubleshooting/gfid-to-path/ to identify the file on the brick. Then you can use getfattr to check the status of the files on the bricks.
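A minimal sketch of that, using the brick and GFID from the read error quoted below as an example. The find/getfattr steps are only printed here, since they have to run on the brick host itself:

```shell
#!/bin/bash
# Resolve a GFID (e.g. from heal info or the brick log) to its path
# on a brick. Brick and GFID are taken from the thread below.
BRICK=/bricks/b2/br
GFID=16b51498-966e-4546-b561-24b0062f4324

# Every file has a hardlink under .glusterfs/<aa>/<bb>/<full-gfid>,
# where <aa> and <bb> are the first two hex byte pairs of the GFID.
LINK="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$LINK"

# On the brick host: find the real path sharing that inode, then dump
# its xattrs (trusted.afr pending counters etc.) to judge heal state.
echo "find $BRICK -samefile $LINK -not -path '*/.glusterfs/*'"
echo "getfattr -d -m . -e hex <path-found-above>"
```

Comparing the trusted.afr xattrs across the replica pair is what tells you which copy the heal considers good.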
As you have 3 hosts, you can always implement an arbiter for each brick and mitigate the risk for split brains.
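A sketch of what that conversion could look like. The arb0..arb8 brick paths are hypothetical (create and name them to suit), and the final command is echoed rather than executed:

```shell
#!/bin/bash
# Convert the existing 9 x 2 volume to replica 3 arbiter 1: one
# arbiter brick per replica pair, 9 in total. The pairs in this
# volume cycle d01+d02, d03+d01, d02+d03, so each arbiter goes on
# the one host NOT already holding that pair's data bricks.
VOL=glust-distr-rep
ARBS=""
for i in $(seq 0 8); do
  host=$((3 - i % 3))   # d03, d02, d01, repeating
  ARBS="$ARBS md1cfsd0$host:/bricks/arb$i/br"
done
CMD="gluster volume add-brick $VOL replica 3 arbiter 1$ARBS"
# Echoed so this prints instead of running; drop the echo on a real
# cluster after creating the arbiter directories on each host.
echo "$CMD"
```

The arbiter bricks store only metadata, so they need little space, and they give each pair a quorum tie-breaker.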
Best Regards,
Strahil Nikolov
On Wed, Aug 3, 2022 at 16:33, Eli V <eliventer at gmail.com> wrote:
Sequence of events which ended up with 2 bricks down and a heal
failure. What should I do about the heal failure, and should I do it
before or after replacing the bad disk? First, the gluster 10.2 volume info:
Volume Name: glust-distr-rep
Type: Distributed-Replicate
Volume ID: fe0ea6f6-2d1b-4b5c-8af5-0c11ea546270
Status: Started
Snapshot Count: 0
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: md1cfsd01:/bricks/b0/br
Brick2: md1cfsd02:/bricks/b0/br
Brick3: md1cfsd03:/bricks/b0/br
Brick4: md1cfsd01:/bricks/b3/br
Brick5: md1cfsd02:/bricks/b3/br
Brick6: md1cfsd03:/bricks/b3/br
Brick7: md1cfsd01:/bricks/b1/br
Brick8: md1cfsd02:/bricks/b1/br
Brick9: md1cfsd03:/bricks/b1/br
Brick10: md1cfsd01:/bricks/b4/br
Brick11: md1cfsd02:/bricks/b4/br
Brick12: md1cfsd03:/bricks/b4/br
Brick13: md1cfsd01:/bricks/b2/br
Brick14: md1cfsd02:/bricks/b2/br
Brick15: md1cfsd03:/bricks/b2/br
Brick16: md1cfsd01:/bricks/b5/br
Brick17: md1cfsd02:/bricks/b5/br
Brick18: md1cfsd03:/bricks/b5/br
Options Reconfigured:
performance.md-cache-statfs: on
cluster.server-quorum-type: server
cluster.min-free-disk: 15
storage.batch-fsync-delay-usec: 0
user.smb: enable
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
The fun started with a brick (d02:b5) crashing:
[2022-08-02 18:59:29.417147 +0000] W
[rpcsvc.c:1323:rpcsvc_callback_submit] 0-rpcsvc: transmission of
rpc-request failed
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.org/glusterfs.git
signal received: 7
time of crash:
2022-08-02 18:59:29 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 10.2
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x28a54)[0x7fefb20f7a54]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x700)[0x7fefb20fffc0]
/lib/x86_64-linux-gnu/libc.so.6(+0x3bd60)[0x7fefb1ecdd60]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0x5a)[0x7fefb211c7aa]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_unref+0x9a)[0x7fefb209e4fa]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xaf4b)[0x7fefac1fff4b]
/usr/lib/x86_64-linux-gnu/glusterfs/10.2/xlator/protocol/server.so(+0xb964)[0x7fefac200964]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(xlator_notify+0x34)[0x7fefb20eb244]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_notify+0x1ab)[0x7fefb217cf2b]
...
Then a few hours later a read error on a different brick (b2) on the same host:
[2022-08-02 22:04:17.808970 +0000] E [MSGID: 113040]
[posix-inode-fd-ops.c:1758:posix_readv] 0-glust-distr-rep-posix: read
failed on gfid=16b51498-966e-4546-b561-24b0062f4324,
fd=0x7ff9f00d6b08, offset=663314432 size=16384, buf=0x7ff9fc0f7000
[Input/output error]
[2022-08-02 22:04:17.809057 +0000] E [MSGID: 115068]
[server-rpc-fops_v2.c:1369:server4_readv_cbk]
0-glust-distr-rep-server: READ info [{frame=1334746}, {READV_fd_no=4},
{uuid_utoa=16b51498-966e-4546-b561-24b0062f4324},
{client=CTX_ID:6d7535af-769c-4223-aad0-79acffa836ed-GRAPH_ID:0-PID:1414-HOST:r4-16-PC_NAME:glust-distr-rep-client-13-RECON_NO:-1},
{error-xlator=glust-distr-rep-posix}, {errno=5}, {error=Input/output
error}]
This looks like a real hardware error:
[Tue Aug 2 18:03:48 2022] megaraid_sas 0000:03:00.0: 6293
(712778647s/0x0002/FATAL) - Unrecoverable medium error during recovery
on PD 04(e0x20/s4) at 1d267163
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=3s
[Tue Aug 2 18:03:49 2022] sd 0:2:3:0: [sdd] tag#435 CDB: Read(10) 28
00 1d 26 70 78 00 01 00 00
[Tue Aug 2 18:03:49 2022] blk_update_request: I/O error, dev sdd,
sector 489058424 op 0x0:(READ) flags 0x80700 phys_seg 9 prio class 0
This morning, noticing both b2 & b5 were offline, I stopped and
started glusterd via systemctl to restart the bricks.
All bricks are now up:
Status of volume: glust-distr-rep
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick md1cfsd01:/bricks/b0/br 55386 0 Y 2047
Brick md1cfsd02:/bricks/b0/br 59983 0 Y 3036416
Brick md1cfsd03:/bricks/b0/br 58028 0 Y 2014
Brick md1cfsd01:/bricks/b3/br 59454 0 Y 2041
Brick md1cfsd02:/bricks/b3/br 52352 0 Y 3036421
Brick md1cfsd03:/bricks/b3/br 56786 0 Y 2017
Brick md1cfsd01:/bricks/b1/br 59885 0 Y 2040
Brick md1cfsd02:/bricks/b1/br 55148 0 Y 3036434
Brick md1cfsd03:/bricks/b1/br 52422 0 Y 2068
Brick md1cfsd01:/bricks/b4/br 56378 0 Y 2099
Brick md1cfsd02:/bricks/b4/br 60152 0 Y 3036470
Brick md1cfsd03:/bricks/b4/br 50448 0 Y 2490448
Brick md1cfsd01:/bricks/b2/br 49455 0 Y 2097
Brick md1cfsd02:/bricks/b2/br 53717 0 Y 3036498
Brick md1cfsd03:/bricks/b2/br 51838 0 Y 2124
Brick md1cfsd01:/bricks/b5/br 51002 0 Y 2104
Brick md1cfsd02:/bricks/b5/br 57204 0 Y 3036523
Brick md1cfsd03:/bricks/b5/br 56817 0 Y 2123
Self-heal Daemon on localhost N/A N/A Y 3036660
Self-heal Daemon on md1cfsd03 N/A N/A Y 2627
Self-heal Daemon on md1cfsd01 N/A N/A Y 2623
Then I manually triggered a heal, which healed thousands of files but
is now stuck on the last 47 according to heal info summary.
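For reference, the trigger-and-check cycle was roughly the following (volume name from the info output above; commands echoed here rather than executed):

```shell
#!/bin/bash
# Heal workflow used above, sketched against the volume in question.
VOL=glust-distr-rep
# Echoed so this prints instead of running against a live cluster:
echo "gluster volume heal $VOL"               # trigger an index heal
echo "gluster volume heal $VOL info summary"  # pending-entry counts per brick
echo "gluster volume heal $VOL info"          # list the stuck gfids/paths
```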
glfsheal-glust-distr-rep.log has a bunch of entries like so:
[2022-08-03 13:08:41.169387 +0000] W [MSGID: 114031]
[client-rpc-fops_v2.c:2618:client4_0_lookup_cbk]
0-glust-distr-rep-client-16: remote operation failed.
[{path=<gfid:24977f2f-5fbe-44f2-91bd-605eda824aff>},
{gfid=24977f2f-5fbe-44f2-91bd-605eda824aff}, {errno=2}, {error=No such
file or directory}]