[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

Ian Halliday ihalliday at ndevix.com
Mon Mar 26 08:39:52 UTC 2018


The issue typically appears during heavy write operations to the VM 
image. It's most noticeable during the filesystem creation process on a 
virtual machine image. I'll get some specific data while executing that 
process and will get back to you soon.


>Do you have a reproducer for this bug? If not a specific one, a general 
>outline of what operations were done on the file will help.
>On Mon, Mar 26, 2018 at 12:55 PM, Raghavendra Gowdappa 
><rgowdapp at redhat.com> wrote:
>>On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay 
>><kdhananj at redhat.com> wrote:
>>>The gfid mismatch here is between the shard and its "link-to" file, 
>>>the creation of which happens at a layer below that of shard 
>>>translator on the stack.
>>>Adding DHT devs to take a look.
>>Thanks Krutika. I assume shard doesn't do any dentry operations like 
>>rename, link, unlink on the path of the file (not the gfid handle based 
>>path) internally while managing shards. Can you confirm? If it does 
>>these operations, which fops does it use?
>>I can suggest the following way to fix the problem:
>>1. Since one of the files listed is a DHT linkto file, I am assuming 
>>there is only one shard of the file. If not, please list out the gfids 
>>of the other shards and don't proceed with the healing procedure.
>>2. If the gfids of all shards happen to be the same and only the linkto 
>>has a different gfid, please proceed to step 3. Otherwise, abort the 
>>healing procedure.
>>3. If cluster.lookup-optimize is set to true, abort the healing 
>>procedure.
>>4. Delete the linkto file (the file with permissions -------T and 
>>xattr trusted.dht.linkto) and do a lookup on the file from the mount 
>>point after turning off readdirplus [1].
>>As to how we ended up in this situation: can you explain what the I/O 
>>pattern on this file is? For example, are there lots of entry 
>>operations like rename, link, unlink etc. on the file? There have been 
>>known races in rename/lookup-heal-creating-linkto where the linkto and 
>>data file have different gfids. [2] fixes some of these cases.
>>[2] https://review.gluster.org/#/c/19547/ 
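
The healing steps quoted above amount to a small decision procedure, which 
can be sketched as below. This is only an illustration of the checks, not a 
Gluster tool: the names ShardCopy, is_linkto and plan_heal are hypothetical, 
the brick labels and gfid strings are placeholders, and the sketch never 
touches a brick. It assumes the per-brick mode, xattr names and gfid have 
already been collected by hand (for example with ls and getfattr, as in the 
original report).

```python
import stat
from dataclasses import dataclass

# xattr named in the thread as the marker of a DHT linkto file.
LINKTO_XATTR = "trusted.dht.linkto"


@dataclass(frozen=True)
class ShardCopy:
    """One on-brick copy of a shard (hypothetical record, filled in by hand)."""
    brick: str              # free-form label for the brick holding this copy
    gfid: str               # gfid of this copy, as a string
    mode: int               # st_mode of the file on the brick
    xattr_names: frozenset  # names of the xattrs present on the file


def is_linkto(copy):
    """True if the copy looks like a linkto file: permission bits are
    exactly ---------T (sticky bit only) and trusted.dht.linkto is set."""
    perms = stat.S_IMODE(copy.mode)
    return perms == stat.S_ISVTX and LINKTO_XATTR in copy.xattr_names


def plan_heal(copies, lookup_optimize):
    """Return ("delete-linkto", [bricks]) or ("abort", reason).

    Mirrors the order of the quoted steps: check the gfids first,
    then cluster.lookup-optimize, and only then suggest deletion.
    """
    linkto = [c for c in copies if is_linkto(c)]
    data = [c for c in copies if not is_linkto(c)]
    if not linkto or not data:
        return ("abort", "expected at least one linkto and one data copy")

    # All data copies must agree on a single gfid.
    data_gfids = {c.gfid for c in data}
    if len(data_gfids) != 1:
        return ("abort", "data copies disagree on gfid")

    # Only linkto copies whose gfid differs from the data gfid are stale.
    stale = sorted(c.brick for c in linkto if c.gfid not in data_gfids)
    if not stale:
        return ("abort", "linkto gfid already matches the data file")

    if lookup_optimize:
        return ("abort", "cluster.lookup-optimize is enabled")

    return ("delete-linkto", stale)
```

The function only reports which linkto copies would be candidates for 
deletion. Removing them, and the follow-up lookup from the mount point, 
remain manual steps as described in the quoted message.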
>>>On Mon, Mar 26, 2018 at 1:09 AM, Ian Halliday <ihalliday at ndevix.com> 
>>>wrote:
>>>>Hello all,
>>>>We are having a rather interesting problem with one of our VM 
>>>>storage systems. The GlusterFS client is throwing errors relating to 
>>>>GFID mismatches. We traced this down to multiple shards being 
>>>>present on the gluster nodes, with different gfids.
>>>>Hypervisor gluster mount log:
>>>>[2018-03-25 18:54:19.261733] E [MSGID: 133010] 
>>>>[shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard: 
>>>>Lookup on shard 7 failed. Base file gfid = 
>>>>87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
>>>>The message "W [MSGID: 109009] 
>>>>[dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht: 
>>>>/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on 
>>>>data file on ovirt-zone1-replicate-3, gfid local = 
>>>>00000000-0000-0000-0000-000000000000, gfid node = 
>>>>57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between 
>>>>[2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
>>>>[2018-03-25 18:54:19.264349] W [MSGID: 109009] 
>>>>[dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht: 
>>>>/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on 
>>>>subvolume ovirt-zone1-replicate-3, gfid local = 
>>>>fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node = 
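
When many such warnings accumulate, it can help to pull the path, subvolume 
and the two gfids out of each MSGID 109009 message. The following is a rough 
sketch based only on the two message shapes shown above ("gfid different on 
data file on ..." and "gfid differs on subvolume ..."). The names 
GfidMismatch and parse_gfid_mismatches are hypothetical, and the pattern 
assumes paths without spaces. Messages with a missing gfid are skipped.

```python
import re
from typing import List, NamedTuple


class GfidMismatch(NamedTuple):
    path: str
    subvolume: str
    gfid_local: str
    gfid_node: str


# A gfid is written as 36 characters of hex digits and dashes.
_GFID = r"[0-9a-f-]{36}"

# Matches both message shapes seen in the log excerpt above.
_PATTERN = re.compile(
    r"(?P<path>/\S+): gfid "
    r"(?:different on data file on|differs on subvolume) "
    r"(?P<subvol>\S+), "
    rf"gfid local = (?P<local>{_GFID}), "
    rf"gfid node = (?P<node>{_GFID})"
)


def parse_gfid_mismatches(log_text: str) -> List[GfidMismatch]:
    """Extract gfid-mismatch records from client log text.

    Whitespace (including line wraps) is collapsed first, so messages
    that were wrapped across several lines still match.
    """
    flat = " ".join(log_text.split())
    return [
        GfidMismatch(m["path"], m["subvol"], m["local"], m["node"])
        for m in _PATTERN.finditer(flat)
    ]
```

Grouping the results by path then shows at a glance which shards report 
more than one gfid and on which subvolume.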
>>>>On the storage nodes, we found this:
>>>>[root at n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>[root at n1 gluster]# ls -lh 
>>>>---------T. 2 root root 0 Mar 25 13:55 
>>>>[root at n1 gluster]# ls -lh 
>>>>-rw-rw----. 2 root root 3.8G Mar 25 13:55 
>>>>[root at n1 gluster]# getfattr -d -m . -e hex 
>>>># file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>[root at n1 gluster]# getfattr -d -m . -e hex 
>>>># file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
>>>>I'm wondering how they got created in the first place, and if anyone 
>>>>has any insight on how to fix it?
>>>>Storage nodes:
>>>>[root at n1 gluster]# gluster --version
>>>>glusterfs 4.0.0
>>>>[root at n1 gluster]# gluster volume info
>>>>Volume Name: ovirt-350-zone1
>>>>Type: Distributed-Replicate
>>>>Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
>>>>Status: Started
>>>>Snapshot Count: 0
>>>>Number of Bricks: 7 x (2 + 1) = 21
>>>>Transport-type: tcp
>>>>Brick3: (arbiter)
>>>>Brick6: (arbiter)
>>>>Brick9: (arbiter)
>>>>Brick12: (arbiter)
>>>>Brick15: (arbiter)
>>>>Brick18: (arbiter)
>>>>Brick21: (arbiter)
>>>>Options Reconfigured:
>>>>cluster.min-free-disk: 50GB
>>>>performance.strict-write-ordering: off
>>>>performance.strict-o-direct: off
>>>>nfs.disable: off
>>>>performance.readdir-ahead: on
>>>>transport.address-family: inet
>>>>performance.cache-size: 1GB
>>>>features.shard: on
>>>>features.shard-block-size: 5GB
>>>>server.event-threads: 8
>>>>server.outstanding-rpc-limit: 128
>>>>storage.owner-uid: 36
>>>>storage.owner-gid: 36
>>>>performance.quick-read: off
>>>>performance.read-ahead: off
>>>>performance.io-cache: off
>>>>performance.stat-prefetch: on
>>>>cluster.eager-lock: enable
>>>>network.remote-dio: enable
>>>>cluster.quorum-type: auto
>>>>cluster.server-quorum-type: server
>>>>cluster.data-self-heal-algorithm: full
>>>>performance.flush-behind: off
>>>>performance.write-behind-window-size: 8MB
>>>>client.event-threads: 8
>>>>server.allow-insecure: on
>>>>Client version:
>>>>[root at kvm573 ~]# gluster --version
>>>>glusterfs 3.12.5
>>>>- Ian
