[Gluster-users] [Stale file handle] in shard volume

Olaf Buitelaar olaf.buitelaar at gmail.com
Wed Jan 2 10:42:58 UTC 2019


Dear All,

The bash script I'm planning to run can be found here:
https://gist.github.com/olafbuitelaar/ff6fe9d4ab39696d9ad6ca689cc89986
It would be nice to receive some feedback from the community before I
actually run the clean-up of all stale file handles.
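
To make the selection criteria from my previous message (quoted below) easier
to review, here is a minimal dry-run sketch. It is not the gist itself, only
an illustration under assumptions: the function names are made up for this
example, it must run as root on a brick (trusted.* xattrs are not readable
otherwise), and it only lists candidates without moving or deleting anything.

```shell
#!/bin/sh
# Dry-run sketch: list .shard entries that look like stale DHT link files.
# Nothing is moved or deleted here.

# Print files under BRICK/.shard whose mode is exactly 1000 (---------T)
# and whose size is 0 bytes.
find_sticky_empty() {
    find "$1/.shard" -type f -perm 1000 -size 0
}

# Succeed if FILE carries the trusted.glusterfs.dht.linkto xattr.
# Reading trusted.* attributes needs root on the brick.
has_linkto() {
    getfattr -n trusted.glusterfs.dht.linkto --only-values -- "$1" >/dev/null 2>&1
}

# Print only the candidates that satisfy all three criteria.
list_candidates() {
    find_sticky_empty "$1" | while IFS= read -r f; do
        if has_linkto "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_candidates with a brick path (for example
/data0/gfs/bricks/brick1/ovirt-backbone-2) and redirecting the output to a
file gives a list that can be inspected before anything is touched.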

Thanks Olaf

On Sun, 30 Dec 2018 at 20:56, Olaf Buitelaar <olaf.buitelaar at gmail.com> wrote:

> Dear All,
>
> Until now, a select group of VMs still seems to produce new stale files
> and gets paused because of this.
> I have not updated Gluster recently; however, I did change the op-version
> from 31200 to 31202 about a week before this issue arose.
> Looking at the .shard directory, I have 100,000+ files sharing the same
> characteristics as the stale files found so far:
> they all have the sticky bit set (file permissions ---------T), are
> 0 KB in size, and have the trusted.glusterfs.dht.linkto attribute.
> These files range from long ago (the beginning of the year) until now, which
> makes me suspect this was lying dormant for some time and somehow
> surfaced recently.
> The other sub-volumes also contain 0 KB files in the .shard
> directory, but those have neither the sticky bit nor the linkto attribute.
>
> Does anybody else experience this issue? Could this be a bug or an
> environmental issue?
>
> I also wonder whether there is any tool or gluster command to clean up all
> stale file handles.
> Otherwise I'm planning to write a simple bash script which iterates over
> the .shard dir, checks each file against the above-mentioned criteria, and
> (re)moves the file and the corresponding .glusterfs file.
> If other criteria are needed to identify a stale file handle, I
> would like to hear about them.
> That is, of course, if this is a viable and safe operation to do.
>
> Thanks Olaf
>
>
>
> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar <olaf.buitelaar at gmail.com> wrote:
>
>> Dear All,
>>
>> I figured it out; it appears to be the exact same issue as described
>> here:
>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>> Another subvolume also had the shard file, only there the copies were all
>> 0 bytes and had the dht.linkto attribute.
>>
>> For reference:
>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>
>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>
>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>
>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>
>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>
>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>
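The hex value of trusted.glusterfs.dht.linkto above is just an ASCII string
with a trailing NUL byte. A small sketch for decoding it in plain sh follows;
the function name decode_xattr_hex is made up for this example.

```shell
# Decode a hex-encoded xattr value (as printed by "getfattr -e hex")
# into text. A leading "0x" is stripped and NUL bytes are skipped.
decode_xattr_hex() {
    hex=${1#0x}
    while [ "${#hex}" -ge 2 ]; do
        rest=${hex#??}            # everything after the first two digits
        byte=${hex%"$rest"}       # the first two hex digits
        hex=$rest
        # Skip the NUL terminator that the xattr value carries.
        [ "$byte" = "00" ] && continue
        # Convert the byte to octal and let printf %b emit the character.
        printf '%b' "\\0$(printf '%03o' "0x$byte")"
    done
    printf '\n'
}
```

For the value shown above this prints ovirt-backbone-2-replicate-1, i.e. the
link file points at the replicate-1 subvolume.
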
>> [root@lease-04 ovirt-backbone-2]# stat .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>   File: ‘.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d’
>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>> Device: fd01h/64769d    Inode: 1918631406  Links: 2
>> Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
>> Context: system_u:object_r:etc_runtime_t:s0
>> Access: 2018-12-17 21:43:36.405735296 +0000
>> Modify: 2018-12-17 21:43:36.405735296 +0000
>> Change: 2018-12-17 21:43:36.405735296 +0000
>>  Birth: -
>>
>> Removing the shard file and the .glusterfs file from each node resolved the
>> issue.
>>
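The .glusterfs entry that belongs to a shard can be derived from its
trusted.gfid value, as the two getfattr/stat pairs in this thread show
(0x298147e4... maps to .glusterfs/29/81/298147e4-...). A minimal sketch of
that mapping, with a function name invented for this example:

```shell
# Turn a trusted.gfid hex value (with or without "0x") into the
# brick-relative .glusterfs path: two directory levels taken from the
# first two bytes, then the GFID in 8-4-4-4-12 form.
gfid_to_backend_path() {
    hex=${1#0x}
    printf '%s\n' "$hex" | sed -E \
        's|^(..)(..)(....)(....)(....)(....)(............)$|.glusterfs/\1/\2/\1\2\3-\4-\5-\6-\7|'
}
```

Any script that removes a shard file would need this path as well, since the
.glusterfs entry is a hard link to the same inode.
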
>> I also found this thread:
>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html
>> Maybe its author suffers from the same issue.
>>
>> Best Olaf
>>
>>
>> On Wed, 19 Dec 2018 at 21:56, Olaf Buitelaar <olaf.buitelaar at gmail.com> wrote:
>>
>>> Dear All,
>>>
>>> It appears I have a stale file handle in one of the volumes, affecting 2
>>> files. These files are qemu images (1 raw and 1 qcow2).
>>> I'll just focus on 1 file, since the situation on the other seems to be
>>> the same.
>>>
>>> The VM gets paused more or less directly after being booted, with this error:
>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard:
>>> Lookup on shard 51500 failed. Base file gfid =
>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]
>>>
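The shard file to inspect on the bricks follows directly from this message:
.shard/<base gfid>.<shard number>. A small sketch that extracts it from the
log, assuming each entry is on a single line in the real log file (it is only
wrapped here by the mail client); the function name is made up for this
example.

```shell
# Read shard-translator log lines on stdin and print the brick-relative
# shard path for each "Lookup on shard N failed" entry.
shard_path_from_log() {
    sed -n -E 's|.*Lookup on shard ([0-9]+) failed\. Base file gfid = ([0-9a-f-]+).*|.shard/\2.\1|p'
}
```

Piping the mount log through shard_path_from_log and then through sort -u
would give the distinct set of affected shards.
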
>>> Investigating the shard:
>>>
>>> #on the arbiter node:
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-05 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>> Device: fd01h/64769d    Inode: 537277306   Links: 2
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-17 21:43:36.361984810 +0000
>>> Modify: 2018-12-17 21:43:36.361984810 +0000
>>> Change: 2018-12-18 20:55:29.908647417 +0000
>>>  Birth: -
>>>
>>> [root@lease-05 ovirt-backbone-2]# find . -inum 537277306
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> #on the data nodes:
>>>
>>> [root@lease-08 ~]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-08 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>> Device: fd03h/64771d    Inode: 12893624759  Links: 3
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-18 18:52:38.070776585 +0000
>>> Modify: 2018-12-17 21:43:36.388054443 +0000
>>> Change: 2018-12-18 21:01:47.810506528 +0000
>>>  Birth: -
>>>
>>> [root@lease-08 ovirt-backbone-2]# find . -inum 12893624759
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> ========================
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> getfattr: Removing leading '/' from absolute path names
>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>
>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>
>>> [root@lease-11 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>> Device: fd03h/64771d    Inode: 12956094809  Links: 3
>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Context: system_u:object_r:etc_runtime_t:s0
>>> Access: 2018-12-18 20:11:53.595208449 +0000
>>> Modify: 2018-12-17 21:43:36.391580259 +0000
>>> Change: 2018-12-18 19:19:25.888055392 +0000
>>>  Birth: -
>>>
>>> [root@lease-11 ovirt-backbone-2]# find . -inum 12956094809
>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>
>>> ================
>>>
>>> I don't really see any inconsistencies, except for the dates in the stat
>>> output. However, those only differ since I tried moving the file out of
>>> the volumes to force a heal, which does happen on the data nodes, but not
>>> on the arbiter node. Before that they were also the same.
>>> I've also compared the file
>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and the
>>> copies are exactly the same.
>>>
>>> Things I've further tried:
>>> - gluster v heal ovirt-backbone-2 full => gluster v heal
>>> ovirt-backbone-2 info reports 0 entries on all nodes
>>>
>>> - stopped glusterd and glusterfsd on each node, paused for around 40 sec,
>>> and started them again, 1 node at a time, waiting for the heal to
>>> complete before moving on to the next node
>>>
>>> - forced a heal by stopping glusterd on a node and performing these steps:
>>> mkdir /mnt/ovirt-backbone-2/trigger
>>> rmdir /mnt/ovirt-backbone-2/trigger
>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/
>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/
>>> start glusterd
>>>
>>> - gluster volume rebalance ovirt-backbone-2 start => success
>>>
>>> What's further interesting is that, according to the mount log, the volume
>>> is in split-brain:
>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008]
>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output
>>> error]
>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014]
>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed:
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk]
>>> 0-glusterfs-fuse: 428090: FSTAT()
>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008]
>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output
>>> error]
>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014]
>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed:
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk]
>>> 0-glusterfs-fuse: 428091: FSTAT()
>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006]
>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no
>>> subvolumes up
>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008]
>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>> 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid
>>> 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output
>>> error]
>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006]
>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no
>>> subvolumes up
>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006]
>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no
>>> subvolumes up
>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063]
>>> [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found
>>> anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid =
>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0
>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005]
>>> [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht:
>>> Directory selfheal failed: 2 subvolumes down.Not fixing. path =
>>> /b1c2c949-aef4-4aec-999b-b179efeef732, gfid =
>>> 8c8598ce-1a52-418e-a7b4-435fee34bae8
>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006]
>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no
>>> subvolumes up
>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006]
>>> [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no
>>> subvolumes up
>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008]
>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output
>>> error]
>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014]
>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed:
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk]
>>> 0-glusterfs-fuse: 428096: FSTAT()
>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008]
>>> [afr-read-txn.c:90:afr_read_txn_refresh_done]
>>> 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output
>>> error]
>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014]
>>> [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed:
>>> 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk]
>>> 0-glusterfs-fuse: 428097: FSTAT()
>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>
>>> # Note: I'm able to see /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids:
>>> [root@lease-11 ovirt-backbone-2]# stat /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids
>>>   File: ‘/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids’
>>>   Size: 1048576         Blocks: 2048       IO Block: 131072 regular file
>>> Device: 41h/65d Inode: 10492258721813610344  Links: 1
>>> Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
>>> Context: system_u:object_r:fusefs_t:s0
>>> Access: 2018-12-19 20:07:39.917573869 +0000
>>> Modify: 2018-12-19 20:07:39.928573917 +0000
>>> Change: 2018-12-19 20:07:39.929573921 +0000
>>>  Birth: -
>>>
>>> However, gluster v heal ovirt-backbone-2 info split-brain
>>> reports no entries.
>>>
>>> I've also tried mounting the qemu image, and this works fine; I'm able
>>> to see all contents:
>>>  losetup /dev/loop0 /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>  kpartx -a /dev/loop0
>>>  vgscan
>>>  vgchange -ay slave-data
>>>  mkdir /mnt/slv01
>>>  mount /dev/mapper/slave--data-lvol0 /mnt/slv01/
>>>
>>> Possible causes for this issue:
>>> 1. The machine "lease-11" suffered from a faulty RAM module (ECC), which
>>> halted the machine and caused an invalid state. (This machine also hosts
>>> other volumes, with similar configurations, which report no issues.)
>>> 2. After the RAM module was replaced, the VM using the backing qemu
>>> image was restored from a backup (the backup was file-based, within the VM,
>>> in a different directory), because some files were corrupted. The
>>> backup/recovery obviously causes extra I/O, possibly introducing race
>>> conditions? The machine did run for about 12h without issues, and in total
>>> for about 36h.
>>> 3. Since only the client (maybe only gfapi?) reports errors, something
>>> is broken there?
>>>
>>> The volume info:
>>> root@lease-06 ~# gluster v info ovirt-backbone-2
>>>
>>> Volume Name: ovirt-backbone-2
>>> Type: Distributed-Replicate
>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 3 x (2 + 1) = 9
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>> Options Reconfigured:
>>> nfs.disable: on
>>> transport.address-family: inet
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.low-prio-threads: 32
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> cluster.data-self-heal-algorithm: full
>>> cluster.locking-scheme: granular
>>> cluster.shd-max-threads: 8
>>> cluster.shd-wait-qlength: 10000
>>> features.shard: on
>>> user.cifs: off
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> features.shard-block-size: 64MB
>>> performance.write-behind-window-size: 512MB
>>> performance.cache-size: 384MB
>>> cluster.brick-multiplex: on
>>>
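With "3 x (2 + 1) = 9" the nine bricks form three replica sets of three
bricks each. Assuming (this is an assumption, not something the output above
states) that the subvolume names ovirt-backbone-2-replicate-N follow the
brick order and start at 0, the set a brick belongs to can be computed as
below; the function name is made up for this example.

```shell
# Map a 1-based brick number to its 0-based replica-set index, given the
# number of bricks per set (3 here: 2 data bricks + 1 arbiter).
replica_index_for_brick() {
    echo $(( ($1 - 1) / $2 ))
}
```

Under that assumption Brick4 to Brick6 make up replicate-1, the subvolume
that the dht.linkto attribute on the stale files refers to.
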
>>> The volume status:
>>> root@lease-06 ~# gluster v status ovirt-backbone-2
>>> Status of volume: ovirt-backbone-2
>>> Gluster process                                             TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2    49152     0          Y       7727
>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2    49152     0          Y       12620
>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2    49152     0          Y       8794
>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2   49161     0          Y       22333
>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2  49152     0          Y       15030
>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2    49166     0          Y       24592
>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2   49153     0          Y       20148
>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2  49154     0          Y       15413
>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2    49152     0          Y       43120
>>> Self-heal Daemon on localhost                               N/A       N/A        Y       44587
>>> Self-heal Daemon on 10.201.0.2                              N/A       N/A        Y       8401
>>> Self-heal Daemon on 10.201.0.5                              N/A       N/A        Y       11038
>>> Self-heal Daemon on 10.201.0.8                              N/A       N/A        Y       9513
>>> Self-heal Daemon on 10.32.9.4                               N/A       N/A        Y       23736
>>> Self-heal Daemon on 10.32.9.20                              N/A       N/A        Y       2738
>>> Self-heal Daemon on 10.32.9.3                               N/A       N/A        Y       25598
>>> Self-heal Daemon on 10.32.9.5                               N/A       N/A        Y       511
>>> Self-heal Daemon on 10.32.9.9                               N/A       N/A        Y       23357
>>> Self-heal Daemon on 10.32.9.8                               N/A       N/A        Y       15225
>>> Self-heal Daemon on 10.32.9.7                               N/A       N/A        Y       25781
>>> Self-heal Daemon on 10.32.9.21                              N/A       N/A        Y       5034
>>>
>>> Task Status of Volume ovirt-backbone-2
>>> ------------------------------------------------------------------------------
>>> Task                 : Rebalance
>>> ID                   : 6dfbac43-0125-4568-9ac3-a2c453faaa3d
>>> Status               : completed
>>>
>>> The gluster version is 3.12.15 and cluster.op-version=31202.
>>>
>>> ========================
>>>
>>> It would be nice to know whether it's possible to mark the files as not
>>> stale, or whether I should investigate other things.
>>> Or should we consider this volume lost?
>>> Also, checking the code at
>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c
>>> it seems the functions have shifted quite a bit (line 1724 vs. 2243), so
>>> maybe it's fixed in a later version?
>>> Any thoughts are welcome.
>>>
>>> Thanks Olaf