[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids
Ian Halliday
ihalliday at ndevix.com
Mon Mar 26 08:39:52 UTC 2018
Raghavendra,
The issue typically appears during heavy write operations to the VM
image. It's most noticeable during filesystem creation on a virtual
machine image. I'll get some specific data while executing that process
and will get back to you soon.
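Roughly, the workload that triggers it looks like the following (a sketch
only - the image path, size and filesystem here are hypothetical, not our
exact setup):

# create a new disk image on the gluster-backed storage domain
qemu-img create -f raw /path/to/gluster/mount/test-disk.img 20G
# attach the disk to a guest, then create a filesystem on it from inside
# the guest; this generates the heavy sequential writes mentioned above
mkfs.xfs /dev/vdb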
thanks
-- Ian
------ Original Message ------
From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
To: "Krutika Dhananjay" <kdhananj at redhat.com>
Cc: "Ian Halliday" <ihalliday at ndevix.com>; "gluster-user"
<gluster-users at gluster.org>; "Nithya Balachandran" <nbalacha at redhat.com>
Sent: 3/26/2018 2:37:21 AM
Subject: Re: [Gluster-users] Sharding problem - multiple shard copies
with mismatching gfids
>Ian,
>
>Do you have a reproducer for this bug? If not a specific one, a general
>outline of what operations were done on the file will help.
>
>regards,
>Raghavendra
>
>On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa
><rgowdapp at redhat.com> wrote:
>>
>>
>>On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay
>><kdhananj at redhat.com> wrote:
>>>The gfid mismatch here is between the shard and its "link-to" file,
>>>the creation of which happens at a layer below the shard translator
>>>on the stack.
>>>
>>>Adding DHT devs to take a look.
>>
>>Thanks Krutika. I assume shard doesn't do any dentry operations like
>>rename, link or unlink on the path of the file (not the gfid-handle-based
>>path) internally while managing shards. Can you confirm? If it does do
>>such operations, which fops does it do?
>>
>>@Ian,
>>
>>I can suggest the following way to fix the problem (see the sketch after
>>these steps):
>>1. Since one of the files listed is a DHT linkto file, I am assuming there
>>is only one shard of the file. If not, please list out the gfids of the
>>other shards and don't proceed with the healing procedure.
>>2. If the gfids of all shards happen to be the same and only the linkto
>>file has a different gfid, please proceed to step 3. Otherwise abort the
>>healing procedure.
>>3. If cluster.lookup-optimize is set to true, abort the healing procedure.
>>4. Delete the linkto file - the file with permissions ---------T and the
>>xattr trusted.glusterfs.dht.linkto - and do a lookup on the file from the
>>mount point after turning off readdirplus [1].
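>>A rough sketch of those steps, using the shard and brick paths from your
>>listing below (the client mount path is a placeholder and the readdirplus
>>knobs are the ones described in [1], so please adapt before running
>>anything):
>>
>># steps 1-2: compare trusted.gfid of every copy of the shard across the
>># bricks on each storage node; only the linkto file may have a differing gfid
>>getfattr -n trusted.gfid -e hex \
>>    /gluster/brick*/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>
>># step 3: confirm lookup-optimize is not enabled on the volume
>>gluster volume get ovirt-350-zone1 cluster.lookup-optimize
>>
>># step 4: remove the linkto file (mode ---------T, carrying
>># trusted.glusterfs.dht.linkto) directly on the brick that holds it...
>>rm /gluster/brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>># ...then, from a client mount with readdirplus turned off as per [1],
>># look the affected file up again so DHT can recreate a consistent layout
>>stat /path/to/client/mount/affected-image   # hypothetical mount path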
>>
>>As to how we ended up in this situation, can you tell me what the I/O
>>pattern on this file is - for example, are there lots of entry operations
>>like rename, link and unlink on the file? There have been known races in
>>rename and in lookup-heal creating linkto files, where the linkto and data
>>file end up with different gfids; [2] fixes some of these cases.
>>
>>[1]
>>http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
>>[2] https://review.gluster.org/#/c/19547/
>>
>>regards,
>>Raghavendra
>>>
>>>-Krutika
>>>
>>>On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalliday at ndevix.com>
>>>wrote:
>>>>Hello all,
>>>>
>>>>We are having a rather interesting problem with one of our VM
>>>>storage systems. The GlusterFS client is throwing errors about
>>>>GFID mismatches. We traced this down to multiple copies of the same
>>>>shard being present on the gluster nodes, with different gfids.
>>>>
>>>>Hypervisor gluster mount log:
>>>>
>>>>[2018-03-25 18:54:19.261733] E [MSGID: 133010]
>>>>[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard:
>>>>Lookup on shard 7 failed. Base file gfid =
>>>>87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
>>>>The message "W [MSGID: 109009]
>>>>[dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht:
>>>>/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on
>>>>data file on ovirt-zone1-replicate-3, gfid local =
>>>>00000000-0000-0000-0000-000000000000, gfid node =
>>>>57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between
>>>>[2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
>>>>[2018-03-25 18:54:19.264349] W [MSGID: 109009]
>>>>[dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht:
>>>>/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on
>>>>subvolume ovirt-zone1-replicate-3, gfid local =
>>>>fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node =
>>>>57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56
>>>>
>>>>
>>>>On the storage nodes, we found this:
>>>>
>>>>[root at n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>
>>>>[root at n1 gluster]# ls -lh
>>>>./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>---------T. 2 root root 0 Mar 25 13:55
>>>>./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>[root at n1 gluster]# ls -lh
>>>>./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>-rw-rw----. 2 root root 3.8G Mar 25 13:55
>>>>./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>
>>>>[root at n1 gluster]# getfattr -d -m . -e hex
>>>>./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>># file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>>trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
>>>>trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
>>>>
>>>>[root at n1 gluster]# getfattr -d -m . -e hex
>>>>./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>># file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>>trusted.afr.dirty=0x000000000000000000000000
>>>>trusted.bit-rot.version=0x020000000000000059914190000ce672
>>>>trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
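>>>>
>>>>For reference, the hex xattr values decode back to what the client log
>>>>reports (a quick sanity check, assuming xxd is available on the node):
>>>>the linkto file's trusted.gfid is the "gfid local"
>>>>fdf0813b-718a-4616-a51b-6999ebba9ec3, the data file's trusted.gfid is
>>>>the "gfid node" 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56, and the linkto
>>>>xattr names the replica subvolume that holds the data copy:
>>>>
>>>># decode trusted.glusterfs.dht.linkto from the brick2 copy above
>>>>echo 6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300 | xxd -r -p
>>>># -> ovirt-350-zone1-replicate-3 (plus a trailing NUL)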
>>>>
>>>>
>>>>I'm wondering how they got created in the first place. Does anyone
>>>>have any insight on how to fix this?
>>>>
>>>>Storage nodes:
>>>>[root at n1 gluster]# gluster --version
>>>>glusterfs 4.0.0
>>>>
>>>>[root at n1 gluster]# gluster volume info
>>>>
>>>>Volume Name: ovirt-350-zone1
>>>>Type: Distributed-Replicate
>>>>Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
>>>>Status: Started
>>>>Snapshot Count: 0
>>>>Number of Bricks: 7 x (2 + 1) = 21
>>>>Transport-type: tcp
>>>>Bricks:
>>>>Brick1: 10.0.6.100:/gluster/brick1/brick
>>>>Brick2: 10.0.6.101:/gluster/brick1/brick
>>>>Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
>>>>Brick4: 10.0.6.100:/gluster/brick2/brick
>>>>Brick5: 10.0.6.101:/gluster/brick2/brick
>>>>Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
>>>>Brick7: 10.0.6.100:/gluster/brick3/brick
>>>>Brick8: 10.0.6.101:/gluster/brick3/brick
>>>>Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
>>>>Brick10: 10.0.6.100:/gluster/brick4/brick
>>>>Brick11: 10.0.6.101:/gluster/brick4/brick
>>>>Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
>>>>Brick13: 10.0.6.100:/gluster/brick5/brick
>>>>Brick14: 10.0.6.101:/gluster/brick5/brick
>>>>Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
>>>>Brick16: 10.0.6.100:/gluster/brick6/brick
>>>>Brick17: 10.0.6.101:/gluster/brick6/brick
>>>>Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
>>>>Brick19: 10.0.6.100:/gluster/brick7/brick
>>>>Brick20: 10.0.6.101:/gluster/brick7/brick
>>>>Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
>>>>Options Reconfigured:
>>>>cluster.min-free-disk: 50GB
>>>>performance.strict-write-ordering: off
>>>>performance.strict-o-direct: off
>>>>nfs.disable: off
>>>>performance.readdir-ahead: on
>>>>transport.address-family: inet
>>>>performance.cache-size: 1GB
>>>>features.shard: on
>>>>features.shard-block-size: 5GB
>>>>server.event-threads: 8
>>>>server.outstanding-rpc-limit: 128
>>>>storage.owner-uid: 36
>>>>storage.owner-gid: 36
>>>>performance.quick-read: off
>>>>performance.read-ahead: off
>>>>performance.io-cache: off
>>>>performance.stat-prefetch: on
>>>>cluster.eager-lock: enable
>>>>network.remote-dio: enable
>>>>cluster.quorum-type: auto
>>>>cluster.server-quorum-type: server
>>>>cluster.data-self-heal-algorithm: full
>>>>performance.flush-behind: off
>>>>performance.write-behind-window-size: 8MB
>>>>client.event-threads: 8
>>>>server.allow-insecure: on
>>>>
>>>>
>>>>Client version:
>>>>[root at kvm573 ~]# gluster --version
>>>>glusterfs 3.12.5
>>>>
>>>>
>>>>Thanks!
>>>>
>>>>- Ian
>>>>
>>>>
>>>>_______________________________________________
>>>>Gluster-users mailing list
>>>>Gluster-users at gluster.org
>>>>http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>