[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

Raghavendra Gowdappa rgowdapp at redhat.com
Mon Mar 26 07:37:21 UTC 2018


Ian,

Do you have a reproducer for this bug? If not a specific one, a general
outline of what operations were done on the file will help.

regards,
Raghavendra

On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

>
>
> On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
>> The gfid mismatch here is between the shard and its "link-to" file, the
>> creation of which happens at a layer below that of shard translator on the
>> stack.
>>
>> Adding DHT devs to take a look.
>>
>
> Thanks Krutika. I assume shard doesn't do any dentry operations like
> rename, link or unlink on the path of the file (as opposed to the gfid-handle
> based path) internally while managing shards. Can you confirm? If it does
> such operations, which fops does it do?
>
> @Ian,
>
> I can suggest the following way to fix the problem:
> 1. Since one of the files listed is a DHT linkto file, I am assuming there is
> only one shard of the file. If not, please list out the gfids of the other
> shards and don't proceed with the healing procedure.
> 2. If the gfids of all shards happen to be the same and only the linkto file
> has a different gfid, please proceed to step 3. Otherwise abort the healing
> procedure.
> 3. If cluster.lookup-optimize is set to true, abort the healing procedure.
> 4. Delete the linkto file - the file with permissions ---------T and the
> xattr trusted.glusterfs.dht.linkto - and do a lookup on the file from the
> mount point after turning off readdirplus [1]. A rough sketch of these steps
> as shell commands follows below.
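>
> A minimal sketch of those steps, using the volume name, node and shard from
> this thread. The mount point /mnt/heal and the use-readdirp mount option are
> illustrative assumptions and may need adapting to your setup:
>
> # on each storage node: list every copy of the shard across the local bricks
> find /gluster/*/brick/.shard -name '87137cac-49eb-492a-8f33-8e33470d8cb7.*'
>
> # compare trusted.gfid on every copy; all data copies must agree before continuing
> getfattr -n trusted.gfid -e hex /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
> getfattr -n trusted.gfid -e hex /gluster/brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>
> # abort if lookup-optimize is enabled on the volume
> gluster volume get ovirt-350-zone1 cluster.lookup-optimize
>
> # delete the linkto copy (mode ---------T, carrying trusted.glusterfs.dht.linkto)
> # on the brick that holds it (and on its replica bricks, if it exists there too),
> # then trigger a fresh lookup on the affected file from a client mounted with
> # readdirplus disabled
> rm /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
> mount -t glusterfs -o use-readdirp=no 10.0.6.100:/ovirt-350-zone1 /mnt/heal
> stat /mnt/heal/<path-to-the-affected-VM-image>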
>
> As to the reasons for how we ended up in this situation, can you explain
> what the I/O pattern on this file is - for example, are there lots of entry
> operations like rename, link, unlink etc. on the file? There have been known
> races in rename/lookup-heal-creating-linkto where the linkto and data file
> end up with different gfids. [2] fixes some of these cases.
>
> [1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
> [2] https://review.gluster.org/#/c/19547/
>
> regards,
> Raghavendra
>
>>
>>
>> -Krutika
>>
>> On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalliday at ndevix.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> We are having a rather interesting problem with one of our VM storage
>>> systems. The GlusterFS client is throwing errors relating to GFID
>>> mismatches. We traced this down to multiple copies of the same shard being
>>> present on the gluster nodes, each with a different gfid.
>>>
>>> Hypervisor gluster mount log:
>>>
>>> [2018-03-25 18:54:19.261733] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard:
>>> Lookup on shard 7 failed. Base file gfid = 87137cac-49eb-492a-8f33-8e33470d8cb7
>>> [Stale file handle]
>>> The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk]
>>> 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid
>>> different on data file on ovirt-zone1-replicate-3, gfid local =
>>> 00000000-0000-0000-0000-000000000000, gfid node =
>>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between
>>> [2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
>>> [2018-03-25 18:54:19.264349] W [MSGID: 109009]
>>> [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
>>> /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on
>>> subvolume ovirt-zone1-replicate-3, gfid local =
>>> fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node =
>>> 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
>>>
>>>
>>> On the storage nodes, we found this:
>>>
>>> [root at n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>
>>> [root at n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> ---------T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> [root at n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> -rw-rw----. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>
>>> [root at n1 gluster]# getfattr -d -m . -e hex ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> # file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
>>> trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
>>>
>>> [root at n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> # file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>> trusted.afr.dirty=0x000000000000000000000000
>>> trusted.bit-rot.version=0x020000000000000059914190000ce672
>>> trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
>>>
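>>> For anyone wanting to repeat the check, a rough sketch of how duplicate
>>> shard copies can be spotted on a storage node (the glob assumes the
>>> /gluster/brickN/brick layout from the volume info below):
>>>
>>> # list shard names present on more than one local brick, then dump the
>>> # gfid/linkto xattrs of every copy for comparison
>>> find /gluster/*/brick/.shard -type f -printf '%f\n' 2>/dev/null | sort | uniq -d |
>>> while read -r shard; do
>>>     getfattr -d -m . -e hex /gluster/*/brick/.shard/"$shard" 2>/dev/null
>>> done
>>>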
>>>
>>> I'm wondering how these duplicates got created in the first place. Does
>>> anyone have any insight on how to fix this?
>>>
>>> Storage nodes:
>>> [root at n1 gluster]# gluster --version
>>> glusterfs 4.0.0
>>>
>>> [root at n1 gluster]# gluster volume info
>>>
>>> Volume Name: ovirt-350-zone1
>>> Type: Distributed-Replicate
>>> Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 7 x (2 + 1) = 21
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 10.0.6.100:/gluster/brick1/brick
>>> Brick2: 10.0.6.101:/gluster/brick1/brick
>>> Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
>>> Brick4: 10.0.6.100:/gluster/brick2/brick
>>> Brick5: 10.0.6.101:/gluster/brick2/brick
>>> Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
>>> Brick7: 10.0.6.100:/gluster/brick3/brick
>>> Brick8: 10.0.6.101:/gluster/brick3/brick
>>> Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
>>> Brick10: 10.0.6.100:/gluster/brick4/brick
>>> Brick11: 10.0.6.101:/gluster/brick4/brick
>>> Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
>>> Brick13: 10.0.6.100:/gluster/brick5/brick
>>> Brick14: 10.0.6.101:/gluster/brick5/brick
>>> Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
>>> Brick16: 10.0.6.100:/gluster/brick6/brick
>>> Brick17: 10.0.6.101:/gluster/brick6/brick
>>> Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
>>> Brick19: 10.0.6.100:/gluster/brick7/brick
>>> Brick20: 10.0.6.101:/gluster/brick7/brick
>>> Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
>>> Options Reconfigured:
>>> cluster.min-free-disk: 50GB
>>> performance.strict-write-ordering: off
>>> performance.strict-o-direct: off
>>> nfs.disable: off
>>> performance.readdir-ahead: on
>>> transport.address-family: inet
>>> performance.cache-size: 1GB
>>> features.shard: on
>>> features.shard-block-size: 5GB
>>> server.event-threads: 8
>>> server.outstanding-rpc-limit: 128
>>> storage.owner-uid: 36
>>> storage.owner-gid: 36
>>> performance.quick-read: off
>>> performance.read-ahead: off
>>> performance.io-cache: off
>>> performance.stat-prefetch: on
>>> cluster.eager-lock: enable
>>> network.remote-dio: enable
>>> cluster.quorum-type: auto
>>> cluster.server-quorum-type: server
>>> cluster.data-self-heal-algorithm: full
>>> performance.flush-behind: off
>>> performance.write-behind-window-size: 8MB
>>> client.event-threads: 8
>>> server.allow-insecure: on
>>>
>>>
>>> Client version:
>>> [root at kvm573 ~]# gluster --version
>>> glusterfs 3.12.5
>>>
>>>
>>> Thanks!
>>>
>>> - Ian
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>