[Gluster-users] Files not healing & missing their extended attributes - Help!
Gambit15
dougti+gluster at gmail.com
Thu Jul 5 00:39:50 UTC 2018
Hi Karthik,
Many thanks for the response!
On 4 July 2018 at 05:26, Karthik Subrahmanya <ksubrahm at redhat.com> wrote:
> Hi,
>
> From the logs you have pasted it looks like those files are in GFID
> split-brain.
> They should have the GFIDs assigned on both the data bricks but they will
> be different.
>
> Can you please paste the getfattr output of those files and their parent
> from all the bricks again?
>
The files don't have any of Gluster's trusted.* attributes set; however, I did
manage to find their corresponding entries under .glusterfs:
==================================
[root at v0 .glusterfs]# getfattr -m . -d -e hex
/gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.engine-client-2=0x0000000000000000000024ea
trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root at v0 .glusterfs]# getfattr -m . -d -e hex
/gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
getfattr: Removing leading '/' from absolute path names
# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000

# file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
[root at v0 .glusterfs]# ls -l
/gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee
==================================
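(For what it's worth, if I'm reading the AFR changelog attribute correctly, it's three 32-bit big-endian counters for data/metadata/entry operations, so the parent directory's trusted.afr.engine-client-2=0x0000000000000000000024ea would just be pending *entry* heals against the arbiter brick, which would fit with it having been offline:

==================================
# Rough decoding - my assumption is the usual data|metadata|entry layout of
# trusted.afr.<volume>-client-<N>, so only the entry counter is non-zero.
printf '%d\n' 0x24ea    # -> 9450 pending entry operations against engine-client-2
==================================
)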
Again, here are the relevant client log entries:
[2018-07-03 19:09:29.245089] W [MSGID: 108008]
[afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for
<gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata
5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and
b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
[2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk]
0-glusterfs-fuse: 10430040: LOOKUP()
/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1
(Input/output error)
[2018-07-03 19:09:30.619000] W [MSGID: 108008]
[afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
0-engine-replicate-0: GFID mismatch for
<gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace
8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and
e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
[2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk]
0-glusterfs-fuse: 10430656: LOOKUP()
/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[root at v0 .glusterfs]# find . -type f | grep -E
"5e95ba8c-2f12-49bf-be2d-b4baf210d366|8e86902a-c31c-4990-b0c5-0318807edb8f|b9cd7613-3b96-415d-a549-1dc788a4f94d|e5899a4c-dc5d-487e-84b0-9bbc73133c25"
[root at v0 .glusterfs]#
==================================
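(One thing I'm not certain about: those two entries are symlinks on the brick, and if I'm not mistaken the .glusterfs gfid-link of a symlink is itself a symlink rather than a regular file, so the -type f filter above could miss it. I'll re-run the search without the type restriction to be sure:

==================================
# Same search, but including symlinks; "<SAME FOUR GFIDS>" stands for the
# alternation used in the grep above.
find . \( -type f -o -type l \) | grep -E "<SAME FOUR GFIDS>"
==================================
)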
> Which version of gluster you are using?
>
3.8.5
An upgrade is on the books; however, I had to roll back my last attempt because
3.12 wouldn't interoperate with 3.8 and I was unable to do a live rolling
upgrade. Once I've got this GFID mess sorted out I'll give a full upgrade
another go, as I've already had to fail this cluster's services over to another
cluster.
> If you are using a version higher than or equal to 3.12, GFID split-brains
> can be resolved using the methods (except method 4)
> explained in the "Resolution of split-brain using gluster CLI" section in
> [1].
> Also note that for gfid split-brain resolution using CLI you have to pass
> the name of the file as argument and not the GFID.
>
> If it is lower than 3.12 (Please consider upgrading them since they are
> EOL) you have to resolve it manually as explained in [2]
>
> [1] https://docs.gluster.org/en/latest/Troubleshooting/
> resolving-splitbrain/
> [2] https://docs.gluster.org/en/latest/Troubleshooting/
> resolving-splitbrain/#dir-split-brain
>
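(For future reference, once we're on >= 3.12 I'm assuming the CLI route from [1] would look something like the commands below, picking one brick's copy as the source - please correct me if I've misread the docs:

==================================
# Hypothetical only - not applicable to our current 3.8.5 setup.
# Assumption: the copies on s1 (engine-client-1) are the ones to keep.
gluster volume heal engine split-brain source-brick s1:/gluster/engine/brick \
    /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
gluster volume heal engine split-brain source-brick s1:/gluster/engine/brick \
    /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
==================================
)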
"The user needs to remove either file '1' on brick-a or the file '1' on
brick-b to resolve the split-brain. In addition, the corresponding
gfid-link file also needs to be removed."
Okay, so as you can see above, the files don't have a trusted.gfid
attribute, & on the brick I didn't find any files in .glusterfs with the
same name as the GFIDs reported in the client log. I did, however, find the
symlinked files in a .glusterfs directory under the parent directory's GFID:
[root at v0 .glusterfs]# ls -l
/gluster/engine/brick/.glusterfs/db/9a/db9afb92-d2bc-49ed-8e34-dcd437ba7be2/
total 0
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.lockspace ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/2502aff4-6c67-4643-b681-99f2c87e793d/03919182-6be2-4cbc-aea2-b9d68422a800
lrwxrwxrwx. 2 vdsm kvm 132 Jun 30 14:55 hosted-engine.metadata ->
/var/run/vdsm/storage/98495dbc-a29c-4893-b6a0-0aa70860d0c9/99510501-6bdc-485a-98e8-c2f82ff8d519/71fa7e6c-cdfb-4da8-9164-2404b518d0ee
So if I delete those two symlinks & the files they point to on one of the
two bricks, will that resolve the split-brain?
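To spell out what I have in mind, here's a rough sketch based on my reading of [2] - assumptions: the copies on s1 are the ones to keep, the ones on s0 get discarded, and $MOUNT is just a placeholder for wherever the volume is FUSE-mounted:

==================================
# Sketch only - please sanity-check it before I run anything.
DIR=/gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent

# On s0 only: remove the stale brick-side entries (just the entries on the
# brick; I don't think their /var/run targets on the hypervisor come into
# it). As they have no trusted.gfid, I assume there's no .glusterfs gfid
# link to clean up, but I'd re-run the find above first to be sure.
rm -f "$DIR/hosted-engine.lockspace" "$DIR/hosted-engine.metadata"

# Then, from any node: trigger a heal and look the names up through the
# FUSE mount so they're recreated from the surviving copies on s1.
gluster volume heal engine
stat "$MOUNT/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace"
stat "$MOUNT/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata"
==================================

Is that the right idea?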
> Thanks & Regards,
> Karthik
>
> On Wed, Jul 4, 2018 at 1:59 AM Gambit15 <dougti+gluster at gmail.com> wrote:
>
>> On 1 July 2018 at 22:37, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>>
>>> The only problem at the moment is that arbiter brick offline. You should
>>> only bother about completion of maintenance of arbiter brick ASAP.
>>> Bring this brick UP, start FULL heal or index heal and the volume will
>>> be in healthy state.
>>>
>>
>> Doesn't the arbiter only resolve split-brain situations? None of the
>> files that have been marked for healing are marked as in split-brain.
>>
>> The arbiter has now been brought back up, however the problem continues.
>>
>> I've found the following information in the client log:
>>
>> [2018-07-03 19:09:29.245089] W [MSGID: 108008]
>> [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
>> 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-
>> dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366
>> on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on
>> engine-client-0
>> [2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk]
>> 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-
>> 0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
>> [2018-07-03 19:09:30.619000] W [MSGID: 108008]
>> [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check]
>> 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-
>> dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f
>> on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on
>> engine-client-0
>> [2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk]
>> 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-
>> 0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
>>
>> As you can see from the logs I posted previously, neither of those two
>> files, on either of the two servers, has any of Gluster's extended
>> attributes set.
>>
>> The arbiter doesn't have any record of the files in question, as they
>> were created after it went offline.
>>
>> How do I fix this? Is it possible to locate the correct GFIDs somewhere &
>> redefine them on the files manually?
>>
>> Cheers,
>> Doug
>>
>> ------------------------------
>>> *From: *"Gambit15" <dougti+gluster at gmail.com>
>>> *To: *"Ashish Pandey" <aspandey at redhat.com>
>>> *Cc: *"gluster-users" <gluster-users at gluster.org>
>>> *Sent: *Monday, July 2, 2018 1:45:01 AM
>>> *Subject: *Re: [Gluster-users] Files not healing & missing their
>>> extended attributes - Help!
>>>
>>>
>>> Hi Ashish,
>>>
>>> The output is below. It's a replica 2+1 (arbiter) volume. The arbiter is
>>> offline for maintenance at the moment; however, quorum is met & no files are
>>> reported as being in split-brain (it hosts VMs, so files aren't accessed
>>> concurrently).
>>>
>>> ======================
>>> [root at v0 glusterfs]# gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: 279737d3-3e5a-4ee9-8d4a-97edcca42427
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: s0:/gluster/engine/brick
>>> Brick2: s1:/gluster/engine/brick
>>> Brick3: s2:/gluster/engine/arbiter (arbiter)
>>> Options Reconfigured:
>>> nfs.disable: on
>>> performance.readdir-ahead: on
>>> transport.address-family: inet
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.low-prio-threads: 32
>>>
>>> ======================
>>>
>>> [root at v0 glusterfs]# gluster volume heal engine info
>>> Brick s0:/gluster/engine/brick
>>> /__DIRECT_IO_TEST__
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9
>>> <LIST TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s1:/gluster/engine/brick
>>> <SAME AS ABOVE - TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s2:/gluster/engine/arbiter
>>> Status: Transport endpoint is not connected
>>> Number of entries: -
>>>
>>> ======================
>>> === PEER V0 ===
>>>
>>> [root at v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024e8
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>
>>> [root at v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>> === PEER V1 ===
>>>
>>> [root at v1 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024ec
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>
>>> ======================
>>>
>>> cmd_history.log-20180701:
>>>
>>> [2018-07-01 03:11:38.461175] : volume heal engine full : SUCCESS
>>> [2018-07-01 03:11:51.151891] : volume heal data full : SUCCESS
>>>
>>> glustershd.log-20180701:
>>> <LOGS FROM 06/01 TRUNCATED>
>>> [2018-07-01 07:15:04.779122] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate]
>>> 0-glusterfsd: Fetching the volume file from server...
>>>
>>> glustershd.log:
>>> [2018-07-01 07:15:04.779693] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk]
>>> 0-glusterfs: No change in volfile, continuing
>>>
>>> That's the *only* message in glustershd.log today.
>>>
>>> ======================
>>>
>>> [root at v0 glusterfs]# gluster volume status engine
>>> Status of volume: engine
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick s0:/gluster/engine/brick              49154     0          Y       2816
>>> Brick s1:/gluster/engine/brick              49154     0          Y       3995
>>> Self-heal Daemon on localhost               N/A       N/A        Y       2919
>>> Self-heal Daemon on s1                      N/A       N/A        Y       4013
>>>
>>> Task Status of Volume engine
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> ======================
>>>
>>> Okay, so actually only the directory ha_agent is listed for healing (not
>>> its contents), & that does have attributes set.
>>>
>>> Many thanks for the reply!
>>>
>>>
>>> On 1 July 2018 at 15:34, Ashish Pandey <aspandey at redhat.com> wrote:
>>>
>>>> You have not even talked about the volume type and configuration, and
>>>> this issue would require a lot of other information to fix it.
>>>>
>>>> 1 - What is the type of the volume and its config?
>>>> 2 - Provide the gluster v <volname> info output
>>>> 3 - Heal info output
>>>> 4 - getxattr output of one of the files which needs healing, from all
>>>> the bricks.
>>>> 5 - What led to the file needing healing?
>>>> 6 - gluster v <volname> status
>>>> 7 - glustershd.log output just after you run full heal or index heal
>>>>
>>>> ----
>>>> Ashish
>>>>
>>>> ------------------------------
>>>> *From: *"Gambit15" <dougti+gluster at gmail.com>
>>>> *To: *"gluster-users" <gluster-users at gluster.org>
>>>> *Sent: *Sunday, July 1, 2018 11:50:16 PM
>>>> *Subject: *[Gluster-users] Files not healing & missing their
>>>> extended attributes - Help!
>>>>
>>>>
>>>> Hi Guys,
>>>> I had to restart our datacenter yesterday, but since doing so a number
>>>> of the files on my gluster share have been stuck, marked as healing. After
>>>> seeing no signs of progress, I manually set off a full heal last night, but
>>>> after 24hrs nothing's happened.
>>>>
>>>> The gluster logs all look normal, and there're no messages about failed
>>>> connections or heal processes kicking off.
>>>>
>>>> I checked the listed files' extended attributes on their bricks today,
>>>> and they only show the selinux attribute. There's none of the trusted.*
>>>> attributes I'd expect.
>>>> The healthy files on the bricks do have their extended attributes
>>>> though.
>>>>
>>>> I'm guessing that perhaps the files somehow lost their attributes, and
>>>> gluster is no longer able to work out what to do with them? It hasn't logged
>>>> any errors, warnings, or anything else out of the ordinary though, so I've no
>>>> idea what the problem is or how to resolve it.
>>>>
>>>> I've got 16 hours to get this sorted before the start of work, Monday.
>>>> Help!
>>>>