[Gluster-users] Gluster-users Digest, Vol 162, Issue 1

Chetan More cmore at redhat.com
Thu Oct 7 11:25:40 UTC 2021


Hello Marcus,
Please provide the information below so we can understand the problem:
1. getfattr -d -m . -e hex <fileName>              --> for any 2 of the failing files, from all 3 nodes
2. getfattr -d -m . -e hex <parentDirectoryName>   --> from all 3 nodes

This will confirm whether it is a GFID split brain case.
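As a sketch, the checks above could be scripted like this. The hostnames and brick path are taken from the volume info quoted below; FILE is a placeholder for a path relative to the brick root, and the script only prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch: print the getfattr commands to run on each node for one suspect
# file. Hostnames and brick path come from the volume info in this thread;
# FILE is a placeholder for a path relative to the brick root.
BRICK=/urd-gds/gds-admin
FILE=suspect/file
for node in urd-gds-000 urd-gds-001 urd-gds-002; do
    # trusted.gfid must be identical on all three bricks; a mismatch on the
    # file (with matching parent directory xattrs) points to GFID split brain.
    echo "ssh $node getfattr -d -m . -e hex $BRICK/$FILE"
    echo "ssh $node getfattr -d -m . -e hex $BRICK/$(dirname "$FILE")"
done
```

Comparing the trusted.gfid lines from the three outputs side by side is usually enough to spot the mismatched brick.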

Thanks & Regards,
Chetan.


> Message: 3
> Date: Wed, 6 Oct 2021 10:58:24 +0200
> From: Marcus Pedersén <marcus.pedersen at slu.se>
> To: <gluster-users at gluster.org>
> Subject: [Gluster-users] Gluster heal problem
> Message-ID: <20211006085824.GA6893 at slu.se>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all,
> I have a problem with heal, I have 995 files that fails with heal.
>
> Gluster version: 9.3
> OS: Debian Bullseye
>
> My setup is a replicate with an arbiter:
> Volume Name: gds-admin
> Type: Replicate
> Volume ID: f1f112f4-8cee-4c04-8ea5-c7d895c8c8d6
> Status: Started
> Snapshot Count: 8
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: urd-gds-001:/urd-gds/gds-admin
> Brick2: urd-gds-002:/urd-gds/gds-admin
> Brick3: urd-gds-000:/urd-gds/gds-admin (arbiter)
> Options Reconfigured:
> storage.build-pgfid: off
> performance.client-io-threads: off
> nfs.disable: on
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> features.barrier: disable
>
> Gluster volume status:
> Status of volume: gds-admin
> Gluster process                             TCP Port  RDMA Port  Online
> Pid
>
> ------------------------------------------------------------------------------
> Brick urd-gds-001:/urd-gds/gds-admin        49155     0          Y
>  6964
> Brick urd-gds-002:/urd-gds/gds-admin        49155     0          Y
>  4270
> Brick urd-gds-000:/urd-gds/gds-admin        49152     0          Y
>  1175
> Self-heal Daemon on localhost               N/A       N/A        Y
>  7031
> Self-heal Daemon on urd-gds-002             N/A       N/A        Y
>  4281
> Self-heal Daemon on urd-gds-000             N/A       N/A        Y
>  1230
>
> Task Status of Volume gds-admin
>
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Gluster pool list:
> UUID                            Hostname        State
> 8823d0d9-5d02-4f47-86e9-        urd-gds-000     Connected
> 73139305-08f5-42c2-92b6-        urd-gds-002     Connected
> d612a705-8493-474e-9fdc-        localhost       Connected
>
>
>
>
> info summary says:
> Brick urd-gds-001:/urd-gds/gds-admin
> Status: Connected
> Total Number of entries: 995
> Number of entries in heal pending: 995
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick urd-gds-002:/urd-gds/gds-admin
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
> Brick urd-gds-000:/urd-gds/gds-admin
> Status: Connected
> Total Number of entries: 995
> Number of entries in heal pending: 995
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
>
>
> Statistics says (on both node urd-gds-000 and urd-gds-001):
> Starting time of crawl: Tue Oct  5 14:25:08 2021
>
> Ending time of crawl: Tue Oct  5 14:25:25 2021
>
> Type of crawl: INDEX
> No. of entries healed: 0
> No. of entries in split-brain: 0
> No. of heal failed entries: 995
>
>
> To me it seems as if node urd-gds-002 has old versions of the files.
> I tried 2 files that had filenames, and both urd-gds-000 and urd-gds-001
> have the same gfid and the same timestamp for the file.
> Node urd-gds-002 has a different gfid and an older timestamp.
> The client could not access the file.
> I manually removed the file and gfid file from urd-gds-002 and these files
> were healed.
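For reference, the "gfid file" removed here is the backend hardlink under the brick's .glusterfs directory, whose path is derived from the first two byte pairs of the GFID. A minimal sketch of that mapping (gfid_backend_path is a hypothetical helper; the brick path and GFID are the ones from this thread):

```shell
#!/bin/sh
# Sketch: given a brick path and a GFID, compute the backend hardlink path
# under .glusterfs. The first two hex pairs of the GFID form the two
# intermediate directory levels.
gfid_backend_path() {
    brick=$1
    gfid=$2
    printf '%s/.glusterfs/%s/%s/%s\n' \
        "$brick" \
        "$(printf '%s' "$gfid" | cut -c1-2)" \
        "$(printf '%s' "$gfid" | cut -c3-4)" \
        "$gfid"
}
p=$(gfid_backend_path /urd-gds/gds-admin 4e203eb1-795e-433a-9403-753ba56575fd)
echo "$p"
```

Both the regular file on the brick and this hardlink must be removed from the bad brick for the subsequent heal to recreate the entry cleanly.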
>
> I have a long list of files with just gfids (995).
> I tried to get the file path with (example):
> getfattr -n trusted.glusterfs.pathinfo -e text
> /mnt/gds-admin/.gfid/4e203eb1-795e-433a-9403-753ba56575fd
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/gds-admin/.gfid/4e203eb1-795e-433a-9403-753ba56575fd
> trusted.glusterfs.pathinfo="(<REPLICATE:gds-admin-replicate-0>
> <POSIX(/urd-gds/gds-admin):urd-gds-000:/urd-gds/gds-admin/.glusterfs/30/70/3070276f-1096-44c8-b9e9-62625620aba3/04>
> <POSIX(/urd-gds/gds-admin):urd-gds-001:/urd-gds/gds-admin/.glusterfs/30/70/3070276f-1096-44c8-b9e9-62625620aba3/04>)"
>
> This tells me that the file exists on node urd-gds-000 and urd-gds-001.
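The .gfid lookup above relies on the volume being mounted with the aux-gfid-mount option (mount -t glusterfs -o aux-gfid-mount ...). A sketch that builds the pathinfo queries for a list of GFIDs; the mount point and the single example GFID are taken from this message, and a real run would loop over all 995 GFIDs from heal info:

```shell
#!/bin/sh
# Sketch: print the getfattr pathinfo query for each GFID reported by
# `gluster volume heal gds-admin info`. Requires an aux-gfid mount.
MNT=/mnt/gds-admin
for gfid in 4e203eb1-795e-433a-9403-753ba56575fd; do
    target="$MNT/.gfid/$gfid"
    echo "getfattr -n trusted.glusterfs.pathinfo -e text $target"
done
```

Feeding the real GFID list into the loop would translate the whole heal-pending set into brick paths in one pass.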
>
> I have been looking through the glustershd.log and I see the similar error
> over and over again on urd-gds-000 and urd-gds-001:
> [2021-10-05 12:46:01.095509 +0000] I [MSGID: 108026]
> [afr-self-heal-entry.c:1052:afr_selfheal_entry_do] 0-gds-admin-replicate-0:
> performing entry selfheal on d0d8b20e-c9df-4b8b-ac2e-24697fdf9201
> [2021-10-05 12:46:01.802920 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:46:01.803538 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:46:01.803612 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:46:01.908395 +0000] I [MSGID: 108026]
> [afr-self-heal-entry.c:1052:afr_selfheal_entry_do] 0-gds-admin-replicate-0:
> performing entry selfheal on 0e309af2-2538-440a-8fd0-392620e83d05
> [2021-10-05 12:46:01.914909 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:46:01.915225 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:46:01.915230 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
>
> On urd-gds-002 I get same error over and over again:
> [2021-10-05 12:34:38.013434 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:34:38.013576 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:34:38.013948 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:44:39.011771 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:44:39.011825 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:44:39.012306 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:54:40.017676 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-1: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:54:40.018240 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-2: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-10-05 12:54:40.021305 +0000] E [MSGID: 114031]
> [client-rpc-fops_v2.c:211:client4_0_mkdir_cbk] 0-gds-admin-client-0: remote
> operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
>
> The number of entries seems to gradually decrease; overnight it has been
> reduced from 995 to 972.
>
> If I do an ls from the client side in some directories, some of the
> file names show up in info summary
> and then disappear after a while.
>
> I would really appreciate some help on how to resolve this issue!
>
> Many thanks!
>
> Best regards
> Marcus
>
> ---
> E-mailing SLU will result in SLU processing your personal data. For more
> information on how this is done, click here <
> https://www.slu.se/en/about-slu/contact-slu/personal-data/>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> ------------------------------
>
> End of Gluster-users Digest, Vol 162, Issue 1
> *********************************************
>
>

