[Gluster-devel] Possible race condition bug with tiered volume

Dan Lambright dlambrig at redhat.com
Thu Oct 20 20:19:59 UTC 2016


Dustin,

Your Python code looks fine to me... I've been in the Ceph C++ weeds lately; I kinda miss Python ;)

If I run back-to-back smallfile "create" operations, then on the second run I consistently see:

0.00% of requested files processed, minimum is  70.00
at least one thread encountered error, test may be incomplete

Is this what you get? We can follow up off the mailing list.

Dan

glusterfs 3.7.15 built on Oct 20 2016, with two clients running smallfile against a tiered volume (RAM disk as the hot tier, JBOD cold disks; volume info copied below) on Fedora 23.

./smallfile_cli.py  --top /mnt/p66p67 --host-set gprfc066,gprfc067 --threads 8 --files 5000 --file-size 64 --record-size 64 --fsync N --operation read

volume - 

Status: Started
Number of Bricks: 28
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: gprfs020:/home/ram 
Brick2: gprfs019:/home/ram 
Brick3: gprfs018:/home/ram 
Brick4: gprfs017:/home/ram 
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick5: gprfs017:/t0
Brick6: gprfs018:/t0
Brick7: gprfs019:/t0
Brick8: gprfs020:/t0
Brick9: gprfs017:/t1
Brick10: gprfs018:/t1
Brick11: gprfs019:/t1
Brick12: gprfs020:/t1
Brick13: gprfs017:/t2
Brick14: gprfs018:/t2
Brick15: gprfs019:/t2
Brick16: gprfs020:/t2
Brick17: gprfs017:/t3
Brick18: gprfs018:/t3
Brick19: gprfs019:/t3
Brick20: gprfs020:/t3
Brick21: gprfs017:/t4
Brick22: gprfs018:/t4
Brick23: gprfs019:/t4
Brick24: gprfs020:/t4
Brick25: gprfs017:/t5
Brick26: gprfs018:/t5
Brick27: gprfs019:/t5
Brick28: gprfs020:/t5
Options Reconfigured:
cluster.tier-mode: cache   
features.ctr-enabled: on   
performance.readdir-ahead: on


----- Original Message -----
> From: "Dustin Black" <dblack at redhat.com>
> To: "Dan Lambright" <dlambrig at redhat.com>
> Cc: "Milind Changire" <mchangir at redhat.com>, "Annette Clewett" <aclewett at redhat.com>, gluster-devel at gluster.org
> Sent: Wednesday, October 19, 2016 3:23:04 PM
> Subject: Re: [Gluster-devel] Possible race condition bug with tiered volume
> 
> # gluster --version
> glusterfs 3.7.9 built on Jun 10 2016 06:32:42
> 
> 
> Try not to make fun of my Python, but I was able to make a small
> modification to the sync_files.py script from smallfile and at least
> enable my team to move on with testing. It's terribly hacky and ugly, but
> it works around the problem, which I am pretty convinced is a Gluster bug
> at this point.
> 
> 
> # diff bin/sync_files.py.orig bin/sync_files.py
> 6a7,8
> > import errno
> > import binascii
> 27c29,40
> <         shutil.rmtree(master_invoke.network_dir)
> ---
> >         try:
> >             shutil.rmtree(master_invoke.network_dir)
> >         except OSError as e:
> >             err = e.errno
> >             if err != errno.EEXIST:
> >                 # workaround for possible bug in Gluster
> >                 if err != errno.ENOTEMPTY:
> >                     raise e
> >                 else:
> >                     print('saw ENOTEMPTY on stonewall, moving shared directory')
> >                     ext = str(binascii.b2a_hex(os.urandom(15)))
> >                     shutil.move(master_invoke.network_dir, master_invoke.network_dir + ext)
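> 
> For anyone skimming past the diff, the change boils down to roughly the
> following standalone sketch of the same workaround (the function name and
> comments here are mine, purely illustrative):
> 
> import errno
> import os
> import shutil
> import binascii
> 
> def remove_shared_dir(shared_dir):
>     # Remove the shared network-sync directory as smallfile normally would.
>     try:
>         shutil.rmtree(shared_dir)
>     except OSError as e:
>         if e.errno == errno.ENOTEMPTY:
>             # Apparent Gluster bug: the directory looks empty from the
>             # client but the rmdir still fails, so move it aside under a
>             # random suffix instead of aborting the run.
>             print('saw ENOTEMPTY on stonewall, moving shared directory')
>             ext = str(binascii.b2a_hex(os.urandom(15)))  # Python 2, per the traceback
>             shutil.move(shared_dir, shared_dir + ext)
>         elif e.errno != errno.EEXIST:
>             # EEXIST is tolerated, as in the patch; anything else is a real error.
>             raise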
> 
> 
> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138  (m) +1.215.821.7423
> dustin at redhat.com
> 
> 
> On Tue, Oct 18, 2016 at 7:09 PM, Dustin Black <dblack at redhat.com> wrote:
> 
> > Dang. I always think I get all the detail and inevitably leave out
> > something important. :-/
> >
> > I'm mobile and don't have the exact version in front of me, but this is
> > recent if not latest RHGS on RHEL 7.2.
> >
> >
> > On Oct 18, 2016 7:04 PM, "Dan Lambright" <dlambrig at redhat.com> wrote:
> >
> >> Dustin,
> >>
> >> What code level? I often run smallfile on upstream code with tiered
> >> volumes and have not seen this.
> >>
> >> Sure, one of us will get back to you.
> >>
> >> Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), which
> >> overwhelms the boost in transfer speed you get for small files. A
> >> presentation at the Berlin Gluster summit evaluated this. The expectation
> >> is that md-cache will go a long way towards helping with that before too long.
> >>
> >> Dan
> >>
> >>
> >>
> >> ----- Original Message -----
> >> > From: "Dustin Black" <dblack at redhat.com>
> >> > To: gluster-devel at gluster.org
> >> > Cc: "Annette Clewett" <aclewett at redhat.com>
> >> > Sent: Tuesday, October 18, 2016 4:30:04 PM
> >> > Subject: [Gluster-devel] Possible race condition bug with tiered volume
> >> >
> >> > I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.
> >> >
> >> > # gluster vol info 1nvme-distrep3x2
> >> > Volume Name: 1nvme-distrep3x2
> >> > Type: Tier
> >> > Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> >> > Status: Started
> >> > Number of Bricks: 12
> >> > Transport-type: tcp
> >> > Hot Tier :
> >> > Hot Tier Type : Distributed-Replicate
> >> > Number of Bricks: 3 x 2 = 6
> >> > Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> >> > Cold Tier:
> >> > Cold Tier Type : Distributed-Replicate
> >> > Number of Bricks: 3 x 2 = 6
> >> > Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> >> > Options Reconfigured:
> >> > cluster.tier-mode: cache
> >> > features.ctr-enabled: on
> >> > performance.readdir-ahead: on
> >> >
> >> >
> >> > I am attempting to run the 'smallfile' benchmark tool on this volume. The
> >> > 'smallfile' tool creates a starting gate directory and files in a shared
> >> > filesystem location. The first run (write) works as expected.
> >> >
> >> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> >> > /rhgs/client/1nvme-distrep3x2 --host-set
> >> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> >> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> >> >
> >> > For the second run (read), I believe that smallfile attempts first to
> >> > 'rm -rf' the "network-sync-dir" path, which fails with ENOTEMPTY, causing
> >> > the run to fail.
> >> >
> >> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> >> > /rhgs/client/1nvme-distrep3x2 --host-set
> >> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> >> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> >> > ...
> >> > Traceback (most recent call last):
> >> >   File "/root/bin/smallfile_cli.py", line 280, in <module>
> >> >     run_workload()
> >> >   File "/root/bin/smallfile_cli.py", line 270, in run_workload
> >> >     return run_multi_host_workload(params)
> >> >   File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
> >> >     sync_files.create_top_dirs(master_invoke, True)
> >> >   File "/root/bin/sync_files.py", line 27, in create_top_dirs
> >> >     shutil.rmtree(master_invoke.network_dir)
> >> >   File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
> >> >     onerror(os.rmdir, path, sys.exc_info())
> >> >   File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
> >> >     os.rmdir(path)
> >> > OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
> >> >
> >> >
> >> > From the client perspective, the directory is clearly empty.
> >> >
> >> > # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
> >> > . ..
> >> >
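> >> > The failing call in the traceback is ultimately os.rmdir, so the mismatch
> >> > should be reproducible without smallfile at all (just a sketch; same fuse
> >> > mount path assumed):
> >> >
> >> > import errno, os
> >> >
> >> > path = '/rhgs/client/1nvme-distrep3x2/smf1'
> >> > print(os.listdir(path))   # client reports no entries at all
> >> > try:
> >> >     os.rmdir(path)        # ...yet the rmdir may still be rejected
> >> > except OSError as e:
> >> >     if e.errno == errno.ENOTEMPTY:
> >> >         print('rmdir still fails with ENOTEMPTY on an apparently empty dir')
> >> >     else:
> >> >         raise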
> >> >
> >> > And a quick search on the bricks shows that the hot tier on the last
> >> > replica pair is the offender.
> >> >
> >> > # for i in {0..5}; do ssh n$i "hostname; ls /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> >> > rhosd0
> >> > 0
> >> > 0
> >> > rhosd1
> >> > 0
> >> > 0
> >> > rhosd2
> >> > 0
> >> > 0
> >> > rhosd3
> >> > 0
> >> > 0
> >> > rhosd4
> >> > 0
> >> > 1
> >> > rhosd5
> >> > 0
> >> > 1
> >> >
> >> >
> >> > (For the record, multiple runs of this reproducer show that it is
> >> > consistently the hot tier that is to blame, but it is not always the
> >> > same replica pair.)
> >> >
> >> >
> >> > Can someone try recreating this scenario to see if the problem is
> >> > consistent?
> >> > Please reach out if you need me to provide any further details.
> >> >
> >> >
> >> > Dustin Black, RHCA
> >> > Senior Architect, Software-Defined Storage
> >> > Red Hat, Inc.
> >> > (o) +1.212.510.4138 (m) +1.215.821.7423
> >> > dustin at redhat.com
> >> >
> >> > _______________________________________________
> >> > Gluster-devel mailing list
> >> > Gluster-devel at gluster.org
> >> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >>
> >
> 

