[Gluster-users] Files not healing & missing their extended attributes - Help!

Gambit15 dougti+gluster at gmail.com
Wed Jul 4 23:50:50 UTC 2018


On 3 July 2018 at 23:37, Vlad Kopylov <vladkopy at gmail.com> wrote:

> It might be too late, but a simple solution that always works for such
> cases is rebuilding .glusterfs:
>
> kill it and query the attributes for all files again; that will recreate
> .glusterfs on all the bricks
>
> something like mentioned here
> https://lists.gluster.org/pipermail/gluster-users/2018-January/033352.html
>

Is my problem with .glusterfs though? I'd be super cautious removing the
entire directory unless I'm sure that's the solution...
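
For reference, my reading of the procedure in that link boils down to
something like the sketch below. I haven't run it, the mount point
/mnt/engine is just a placeholder, and I'd only attempt it per brick with
the brick process stopped:

  # Move .glusterfs out of the way on the affected brick (brick process stopped)
  mv /gluster/engine/brick/.glusterfs /gluster/engine/brick/.glusterfs.bak

  # Bring the brick back, then force a lookup and xattr query on every file
  # from a client mount so the .glusterfs entries get recreated
  find /mnt/engine -exec stat {} \; > /dev/null
  find /mnt/engine -exec getfattr -d -m . -e hex {} \; > /dev/null

  # Then kick off a full heal
  gluster volume heal engine full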

Cheers,


> On Tue, Jul 3, 2018 at 4:27 PM, Gambit15 <dougti+gluster at gmail.com> wrote:
>
>> On 1 July 2018 at 22:37, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>>
>>> The only problem at the moment is that the arbiter brick is offline. You
>>> should focus on completing the maintenance of the arbiter brick ASAP.
>>> Bring this brick UP, start a FULL heal or an index heal, and the volume
>>> will return to a healthy state.
>>>
>>
>> Doesn't the arbiter only resolve split-brain situations? None of the
>> files that have been marked for healing are marked as in split-brain.
>>
>> The arbiter has now been brought back up, however the problem continues.
>>
>> I've found the following information in the client log:
>>
>> [2018-07-03 19:09:29.245089] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.metadata 5e95ba8c-2f12-49bf-be2d-b4baf210d366 on engine-client-1 and b9cd7613-3b96-415d-a549-1dc788a4f94d on engine-client-0
>> [2018-07-03 19:09:29.245585] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430040: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata => -1 (Input/output error)
>> [2018-07-03 19:09:30.619000] W [MSGID: 108008] [afr-self-heal-name.c:354:afr_selfheal_name_gfid_mismatch_check] 0-engine-replicate-0: GFID mismatch for <gfid:db9afb92-d2bc-49ed-8e34-dcd437ba7be2>/hosted-engine.lockspace 8e86902a-c31c-4990-b0c5-0318807edb8f on engine-client-1 and e5899a4c-dc5d-487e-84b0-9bbc73133c25 on engine-client-0
>> [2018-07-03 19:09:30.619360] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 10430656: LOOKUP() /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
>>
>> As you can see from the logs I posted previously, neither of those two
>> files, on either of the two servers, has any of Gluster's extended
>> attributes set.
>>
>> The arbiter doesn't have any record of the files in question, as they
>> were created after it went offline.
>>
>> How do I fix this? Is it possible to locate the correct gfids somewhere &
>> redefine them on the files manually?
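>>
>> To make the question concrete, this is roughly what I have in mind, using
>> the GFIDs from the log above for hosted-engine.metadata. It's only a
>> sketch; I haven't run any of it, and /mnt/engine below is just a
>> placeholder for the client mount point:
>>
>>   # Keep the copy on s1 (engine-client-1) and remove the mismatching copy
>>   # plus its GFID hard link from s0 (engine-client-0)
>>   rm /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>   rm /gluster/engine/brick/.glusterfs/b9/cd/b9cd7613-3b96-415d-a549-1dc788a4f94d
>>
>>   # Or, where a copy has lost trusted.gfid entirely, set it by hand to
>>   # match the surviving copy (the GFID in hex, without dashes)
>>   setfattr -n trusted.gfid -v 0x5e95ba8c2f1249bfbe2db4baf210d366 \
>>       /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>
>>   # Then trigger a lookup from a client mount and let the heal run
>>   stat /mnt/engine/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>   gluster volume heal engine
>>
>> Does that sound like the right general shape, or am I way off?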
>>
>> Cheers,
>>  Doug
>>
>> ------------------------------
>>> *From: *"Gambit15" <dougti+gluster at gmail.com>
>>> *To: *"Ashish Pandey" <aspandey at redhat.com>
>>> *Cc: *"gluster-users" <gluster-users at gluster.org>
>>> *Sent: *Monday, July 2, 2018 1:45:01 AM
>>> *Subject: *Re: [Gluster-users] Files not healing & missing their
>>> extended attributes - Help!
>>>
>>>
>>> Hi Ashish,
>>>
>>> The output is below. It's a rep 2+1 volume. The arbiter is offline for
>>> maintenance at the moment, however quorum is met & no files are reported as
>>> in split-brain (it hosts VMs, so files aren't accessed concurrently).
>>>
>>> ======================
>>> [root@v0 glusterfs]# gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: 279737d3-3e5a-4ee9-8d4a-97edcca42427
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: s0:/gluster/engine/brick
>>> Brick2: s1:/gluster/engine/brick
>>> Brick3: s2:/gluster/engine/arbiter (arbiter)
>>> Options Reconfigured:
>>> nfs.disable: on
>>> performance.readdir-ahead: on
>>> transport.address-family: inet
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: off
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.low-prio-threads: 32
>>>
>>> ======================
>>>
>>> [root@v0 glusterfs]# gluster volume heal engine info
>>> Brick s0:/gluster/engine/brick
>>> /__DIRECT_IO_TEST__
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> /98495dbc-a29c-4893-b6a0-0aa70860d0c9
>>> <LIST TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s1:/gluster/engine/brick
>>> <SAME AS ABOVE - TRUNCATED FOR BREVITY>
>>> Status: Connected
>>> Number of entries: 34
>>>
>>> Brick s2:/gluster/engine/arbiter
>>> Status: Transport endpoint is not connected
>>> Number of entries: -
>>>
>>> ======================
>>> === PEER V0 ===
>>>
>>> [root@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024e8
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>>
>>> [root@v0 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/*
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.lockspace
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent/hosted-engine.metadata
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6675736566735f743a733000
>>>
>>>
>>> === PEER V1 ===
>>>
>>> [root@v1 glusterfs]# getfattr -m . -d -e hex /gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: gluster/engine/brick/98495dbc-a29c-4893-b6a0-0aa70860d0c9/ha_agent
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.afr.engine-client-2=0x0000000000000000000024ec
>>> trusted.gfid=0xdb9afb92d2bc49ed8e34dcd437ba7be2
>>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
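>>>
>>> (If I'm reading the AFR changelog xattrs correctly, each value is three
>>> 4-byte counters: data, metadata and entry operations pending heal. So
>>> trusted.afr.engine-client-2=0x0000000000000000000024e8 breaks down as
>>> 0x00000000 data | 0x00000000 metadata | 0x000024e8 entry, i.e. 9448 entry
>>> operations pending towards the arbiter (engine-client-2), which would fit
>>> the arbiter having been offline while these files were created.)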
>>>
>>> ======================
>>>
>>> cmd_history.log-20180701:
>>>
>>> [2018-07-01 03:11:38.461175]  : volume heal engine full : SUCCESS
>>> [2018-07-01 03:11:51.151891]  : volume heal data full : SUCCESS
>>>
>>> glustershd.log-20180701:
>>> <LOGS FROM 06/01 TRUNCATED>
>>> [2018-07-01 07:15:04.779122] I [MSGID: 100011] [glusterfsd.c:1396:reincarnate] 0-glusterfsd: Fetching the volume file from server...
>>>
>>> glustershd.log:
>>> [2018-07-01 07:15:04.779693] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>
>>> That's the *only* message in glustershd.log today.
>>>
>>> ======================
>>>
>>> [root@v0 glusterfs]# gluster volume status engine
>>> Status of volume: engine
>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick s0:/gluster/engine/brick              49154     0          Y       2816
>>> Brick s1:/gluster/engine/brick              49154     0          Y       3995
>>> Self-heal Daemon on localhost               N/A       N/A        Y       2919
>>> Self-heal Daemon on s1                      N/A       N/A        Y       4013
>>>
>>> Task Status of Volume engine
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
>>>
>>> ======================
>>>
>>> Okay, so actually only the directory ha_agent is listed for healing (not
>>> its contents), & that does have attributes set.
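>>>
>>> (One way to cross-check what each brick still considers pending seems to
>>> be listing the heal index directly. A sketch, using the same brick path
>>> as above:
>>>
>>>   ls /gluster/engine/brick/.glusterfs/indices/xattrop/
>>>
>>> Apart from the xattrop-<uuid> base file, every GFID-named entry in there
>>> should be something the self-heal daemon still thinks needs healing.)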
>>>
>>> Many thanks for the reply!
>>>
>>>
>>> On 1 July 2018 at 15:34, Ashish Pandey <aspandey at redhat.com> wrote:
>>>
>>>> You have not mentioned the volume type or configuration, and this issue
>>>> will require a lot of other information to fix.
>>>>
>>>> 1 - What is the type of volume and its configuration?
>>>> 2 - Provide the gluster v <volname> info output
>>>> 3 - Heal info output
>>>> 4 - getfattr output of one of the files that needs healing, from all the
>>>> bricks (see the example command below)
>>>> 5 - What led to the files needing healing?
>>>> 6 - gluster v <volname> status
>>>> 7 - glustershd.log output just after you run a full heal or index heal
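>>>>
>>>> For point 4, something along these lines on each brick should do (the
>>>> path is just an example, use a file from your own heal info list):
>>>>
>>>>   getfattr -d -m . -e hex /path/on/brick/to/file-that-needs-heal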
>>>>
>>>> ----
>>>> Ashish
>>>>
>>>> ------------------------------
>>>> *From: *"Gambit15" <dougti+gluster at gmail.com>
>>>> *To: *"gluster-users" <gluster-users at gluster.org>
>>>> *Sent: *Sunday, July 1, 2018 11:50:16 PM
>>>> *Subject: *[Gluster-users] Files not healing & missing their
>>>> extended        attributes - Help!
>>>>
>>>>
>>>> Hi Guys,
>>>>  I had to restart our datacenter yesterday, but since doing so a number
>>>> of the files on my gluster share have been stuck, marked as healing. After
>>>> no signs of progress, I manually set off a full heal last night, but after
>>>> 24hrs, nothing's happened.
>>>>
>>>> The gluster logs all look normal, and there are no messages about failed
>>>> connections or heal processes kicking off.
>>>>
>>>> I checked the listed files' extended attributes on their bricks today,
>>>> and they only show the selinux attribute. There are none of the trusted.*
>>>> attributes I'd expect.
>>>> The healthy files on the bricks do have their extended attributes
>>>> though.
>>>>
>>>> I'm guessing that the files have somehow lost their attributes and
>>>> gluster is no longer able to work out what to do with them? It hasn't
>>>> logged any errors, warnings, or anything else out of the ordinary though,
>>>> so I've no idea what the problem is or how to resolve it.
>>>>
>>>> I've got 16 hours to get this sorted before the start of work, Monday.
>>>> Help!
>>>>
>>>>
>>>
>>>
>>
>
>