[Gluster-devel] Possible race condition bug with tiered volume

Dustin Black dblack at redhat.com
Tue Oct 18 23:09:29 UTC 2016


Dang. I always think I get all the detail and inevitably leave out
something important. :-/

I'm mobile and don't have the exact version in front of me, but this is
recent if not latest RHGS on RHEL 7.2.


On Oct 18, 2016 7:04 PM, "Dan Lambright" <dlambrig at redhat.com> wrote:

> Dustin,
>
> What level code ? I often run smallfile on upstream code with tiered
> volumes and have not seen this.
>
> Sure, one of us will get back to you.
>
> Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), which
> overwhelms the boost in transfer speeds you get for small files. A
> presentation at the Berlin gluster summit evaluated this. The expectation
> is that md-cache will go a long way toward helping with that before too long.
>
> Dan
>
>
>
> ----- Original Message -----
> > From: "Dustin Black" <dblack at redhat.com>
> > To: gluster-devel at gluster.org
> > Cc: "Annette Clewett" <aclewett at redhat.com>
> > Sent: Tuesday, October 18, 2016 4:30:04 PM
> > Subject: [Gluster-devel] Possible race condition bug with tiered volume
> >
> > I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.
> >
> > # gluster vol info 1nvme-distrep3x2
> > Volume Name: 1nvme-distrep3x2
> > Type: Tier
> > Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> > Status: Started
> > Number of Bricks: 12
> > Transport-type: tcp
> > Hot Tier :
> > Hot Tier Type : Distributed-Replicate
> > Number of Bricks: 3 x 2 = 6
> > Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> > Cold Tier:
> > Cold Tier Type : Distributed-Replicate
> > Number of Bricks: 3 x 2 = 6
> > Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> > Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> > Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> > Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> > Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> > Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> > Options Reconfigured:
> > cluster.tier-mode: cache
> > features.ctr-enabled: on
> > performance.readdir-ahead: on
> >
> >
> > I am attempting to run the 'smallfile' benchmark tool on this volume. The
> > 'smallfile' tool creates a starting gate directory and files in a shared
> > filesystem location. The first run (write) works as expected.
> >
> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> > /rhgs/client/1nvme-distrep3x2 --host-set
> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> >
> > For the second run (read), I believe that smallfile first attempts to 'rm
> > -rf' the "network-sync-dir" path, which fails with ENOTEMPTY, causing the
> > run to fail.
> >
> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> > /rhgs/client/1nvme-distrep3x2 --host-set
> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> > ...
> > Traceback (most recent call last):
> > File "/root/bin/smallfile_cli.py", line 280, in <module>
> > run_workload()
> > File "/root/bin/smallfile_cli.py", line 270, in run_workload
> > return run_multi_host_workload(params)
> > File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
> > sync_files.create_top_dirs(master_invoke, True)
> > File "/root/bin/sync_files.py", line 27, in create_top_dirs
> > shutil.rmtree(master_invoke.network_dir)
> > File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
> > onerror(os.rmdir, path, sys.exc_info())
> > File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
> > os.rmdir(path)
> > OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
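
Not part of the original report, but for anyone hitting the same failure: if the stray brick entry is transient (i.e., it disappears once the tier settles), a minimal, hypothetical workaround is to retry the removal a few times before giving up. A sketch, assuming only that the ENOTEMPTY is transient (`rmtree_with_retry` is an illustrative helper, not part of smallfile):

```python
import errno
import shutil
import time

def rmtree_with_retry(path, attempts=5, delay=1.0):
    """Remove a directory tree, retrying on transient ENOTEMPTY.

    Hypothetical workaround sketch, not part of smallfile. Retrying
    only helps if the stray brick entry eventually goes away on its
    own; a persistent stale entry will still raise after the last
    attempt.
    """
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError as e:
            # Re-raise anything that is not ENOTEMPTY, and give up
            # after the final attempt.
            if e.errno != errno.ENOTEMPTY or i == attempts - 1:
                raise
            time.sleep(delay)
```

This could be dropped in where `sync_files.create_top_dirs` currently calls `shutil.rmtree` directly.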
> >
> >
> > From the client perspective, the directory is clearly empty.
> >
> > # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
> > . ..
> >
> >
> > And a quick search on the bricks shows that the hot tier on the last
> > replica pair is the offender.
> >
> > # for i in {0..5}; do ssh n$i "hostname; ls
> > /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls
> > /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> > rhosd0
> > 0
> > 0
> > rhosd1
> > 0
> > 0
> > rhosd2
> > 0
> > 0
> > rhosd3
> > 0
> > 0
> > rhosd4
> > 0
> > 1
> > rhosd5
> > 0
> > 1
> >
> >
> > (For the record, multiple runs of this reproducer show that it is
> > consistently the hot tier that is to blame, but it is not always the same
> > replica pair.)
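
A side note on the brick check above: plain `ls` hides dotfiles, so a GlusterFS-internal entry could be missed even where the count reads 0. A more thorough per-directory check (illustrative helper, not from the original report) counts everything under the path with `find`:

```shell
#!/bin/sh
# check_dir_empty: report every entry (including dotfiles) that would
# make rmdir fail with ENOTEMPTY. Illustrative helper, not from the
# original report; run it against each brick path instead of plain `ls`.
check_dir_empty() {
    dir="$1"
    count=$(find "$dir" -mindepth 1 | wc -l)
    if [ "$count" -eq 0 ]; then
        echo "$dir: empty"
    else
        echo "$dir: $count leftover entries"
        # Show exactly what is blocking the rmdir.
        find "$dir" -mindepth 1 -ls
    fi
}
```

Running this over ssh on each node, in place of the `ls | wc -l` loop, would also reveal the name of the stale entry on the offending hot-tier bricks.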
> >
> >
> > Can someone try recreating this scenario to see if the problem is
> > consistent?
> > Please reach out if you need me to provide any further details.
> >
> >
> > Dustin Black, RHCA
> > Senior Architect, Software-Defined Storage
> > Red Hat, Inc.
> > (o) +1.212.510.4138 (m) +1.215.821.7423
> > dustin at redhat.com
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
>