[Bugs] [Bug 1379528] New: Poor smallfile read performance on Arbiter volume compared to Replica 3 volume

bugzilla at redhat.com
Tue Sep 27 04:31:46 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1379528

            Bug ID: 1379528
           Summary: Poor smallfile read performance on Arbiter volume
                    compared to Replica 3 volume
           Product: GlusterFS
           Version: 3.9
         Component: arbiter
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org, mpillai at redhat.com,
                    pkarampu at redhat.com, psuriset at redhat.com,
                    ravishankar at redhat.com, rcyriac at redhat.com,
                    rsussman at redhat.com, shberry at redhat.com
        Depends On: 1377193, 1378684, 1378867



+++ This bug was initially created as a clone of Bug #1378684 +++

+++ This bug was initially created as a clone of Bug #1377193 +++

Description of problem:

The expectation was that smallfile read performance on an Arbiter volume would
match Replica 3 smallfile read performance.
The observation is that Arbiter volume read performance is 30% of Replica 3
read performance.

Version-Release number of selected component (if applicable):

glusterfs-cli-3.8.2-1.el7.x86_64
glusterfs-3.8.2-1.el7.x86_64
glusterfs-api-3.8.2-1.el7.x86_64
glusterfs-libs-3.8.2-1.el7.x86_64
glusterfs-fuse-3.8.2-1.el7.x86_64
glusterfs-client-xlators-3.8.2-1.el7.x86_64
glusterfs-server-3.8.2-1.el7.x86_64


How reproducible:

Every time.

gluster v info (Replica 3 volume)

Volume Name: rep3
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b/g
Brick2: 172.17.40.14:/bricks/b/g
Brick3: 172.17.40.15:/bricks/b/g
Brick4: 172.17.40.16:/bricks/b/g
Brick5: 172.17.40.22:/bricks/b/g
Brick6: 172.17.40.24:/bricks/b/g
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
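
(For reference, a minimal shell sketch of how a 2 x 3 volume with the above
options would typically be created; brick hosts/paths are taken from the
listing above, and it assumes the bricks are already formatted and mounted and
the peers probed:)

gluster volume create rep3 replica 3 \
    172.17.40.13:/bricks/b/g 172.17.40.14:/bricks/b/g 172.17.40.15:/bricks/b/g \
    172.17.40.16:/bricks/b/g 172.17.40.22:/bricks/b/g 172.17.40.24:/bricks/b/g
gluster volume start rep3
gluster volume set rep3 server.event-threads 4
gluster volume set rep3 client.event-threads 4
gluster volume set rep3 cluster.lookup-optimize on
gluster volume set rep3 performance.readdir-ahead on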

gluster v info (Arbiter Volume)

Volume Name: arb
Type: Distributed-Replicate
Volume ID: e7a5d84d-31da-40a8-85d0-2b94b95c3b28
Status: Started
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 172.17.40.13:/bricks/b01/g
Brick2: 172.17.40.14:/bricks/b01/g
Brick3: 172.17.40.15:/bricks/b02/g (arbiter)
Brick4: 172.17.40.15:/bricks/b01/g
Brick5: 172.17.40.16:/bricks/b01/g
Brick6: 172.17.40.22:/bricks/b02/g (arbiter)
Brick7: 172.17.40.22:/bricks/b01/g
Brick8: 172.17.40.24:/bricks/b01/g
Brick9: 172.17.40.13:/bricks/b02/g (arbiter)
Options Reconfigured:
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
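
(Likewise, a minimal sketch of how the equivalent 3 x (2 + 1) arbiter volume is
typically created; with "replica 3 arbiter 1", every third brick of each
replica set becomes the arbiter, matching the brick order shown above. The same
volume-set tuning as for rep3 applies:)

gluster volume create arb replica 3 arbiter 1 \
    172.17.40.13:/bricks/b01/g 172.17.40.14:/bricks/b01/g 172.17.40.15:/bricks/b02/g \
    172.17.40.15:/bricks/b01/g 172.17.40.16:/bricks/b01/g 172.17.40.22:/bricks/b02/g \
    172.17.40.22:/bricks/b01/g 172.17.40.24:/bricks/b01/g 172.17.40.13:/bricks/b02/g
gluster volume start arb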


Steps to Reproduce:

For both Replica 3 volume and Arbiter Volume, do the following

1. Creation of files: drop caches on the server and client side, then create
the smallfile data set with:
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --fsync Y --operation create

2. Reading of files: drop caches again on the server and client side, then read
the files back with (see the shell sketch after this list):
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile --threads 4 --file-size 256 --files 6554 --record-size 32 --operation read

3. Compare the read performance of the Replica 3 and Arbiter volumes.
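
(For reference, a minimal shell sketch of steps 1 and 2 as run from the client;
it assumes the volume is FUSE-mounted at /mnt/glusterfs and that "drop cache"
means flushing the Linux page/dentry/inode caches on every server and client:)

# on each server and on the client: flush page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches

# step 1: create the small files
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile \
    --threads 4 --file-size 256 --files 6554 --record-size 32 \
    --fsync Y --operation create

# drop caches again on all nodes, then step 2: read the files back
sync
echo 3 > /proc/sys/vm/drop_caches
/root/smallfile/smallfile_cli.py --top /mnt/glusterfs --host-set clientfile \
    --threads 4 --file-size 256 --files 6554 --record-size 32 --operation read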

Actual results:

Arbiter read performance is 30% of Replica 3 read performance for the smallfile
workload.

Expected results:

Smallfile read performance of the Arbiter volume and the Replica 3 volume
should ideally be the same.

--Shekhar

--- Additional comment from Ravishankar N on 2016-09-19 03:31:50 EDT ---

Note to self: workload used: https://github.com/bengland2/smallfile

--- Additional comment from Shekhar Berry on 2016-09-19 04:07:56 EDT ---

Smallfile Performance numbers:

Create Performance for 256KiB file size
---------------------------------------

Replica 2 Volume : 407 files/sec/server
Arbiter Volume   : 317 files/sec/server
Replica 3 Volume : 306 files/sec/server

Read Performance for 256KiB file size
-------------------------------------

Replica 2 Volume : 380 files/sec/server
Arbiter Volume   : 132 files/sec/server
Replica 3 Volume : 329 files/sec/server

--Shekhar

--- Additional comment from Ravishankar N on 2016-09-22 05:55:55 EDT ---

I was able to get similar results in my testing, where the 'files/sec' was
almost half for a 1x(2+1) setup compared to a 1x3 setup at a 256KB file size. A
summary of the cumulative brick profile info from one such run is given below
for some FOPs:

Replica 3 vol
-------------
No. of calls:
            Brick1    Brick2    Brick3
Lookup      28,544    28,545    28,552
Read        17,695    17,507    17,228
FSTAT       17,714    17,535    17,247
Inodelk          8         8         8


Arbiter vol
-----------
No. of calls:
            Brick1    Brick2    Arbiter brick
Lookup      56,241    56,246    56,245
Read        34,920    17,508         -
FSTAT       34,995    17,533         -
Inodelk     52,442    52,442    52,442
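
(For reference, per-brick FOP counts like the above are typically collected
with gluster's volume profile facility; a minimal sketch, using the arbiter
volume "arb" as an example:)

gluster volume profile arb start
# ... run the smallfile read workload ...
gluster volume profile arb info cumulative   # cumulative per-brick FOP stats
gluster volume profile arb stop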


I see that the sum total of reads across all bricks is similar for both the
replica and arbiter setups. In the arbiter volume, no reads are served from the
arbiter brick, so the read load is spread between the first two bricks;
likewise for FSTAT.

But the problem seems to be in the number of lookups: for the arbiter volume it
is roughly double that of replica-3, and I'm guessing this is what is slowing
things down. I also see a lot of Inodelks on the arbiter volume, which is
unexpected because the I/O was a read-only workload. I need to figure out why
these two things are happening.

--- Additional comment from Worker Ant on 2016-09-23 01:37:13 EDT ---

REVIEW: http://review.gluster.org/15548 (afr: Ignore GF_CONTENT_KEY in metadata
heal check) posted (#1) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-09-25 08:04:33 EDT ---

REVIEW: http://review.gluster.org/15548 (afr: Ignore GF_CONTENT_KEY in metadata
heal check) posted (#2) for review on master by Ravishankar N
(ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-09-26 01:04:19 EDT ---

REVIEW: http://review.gluster.org/15548 (afr: Ignore gluster internal (virtual)
xattrs in metadata heal check) posted (#3) for review on master by Ravishankar
N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-09-26 01:12:04 EDT ---

REVIEW: http://review.gluster.org/15548 (afr: Ignore gluster internal (virtual)
xattrs in metadata heal check) posted (#4) for review on master by Ravishankar
N (ravishankar at redhat.com)

--- Additional comment from Worker Ant on 2016-09-26 09:25:45 EDT ---

COMMIT: http://review.gluster.org/15548 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 5afc6aba906a21aee19c2f1baaa7d9eb762ae0ac
Author: Ravishankar N <ravishankar at redhat.com>
Date:   Fri Sep 23 10:47:03 2016 +0530

    afr: Ignore gluster internal (virtual) xattrs in metadata heal check

    Problem:
    In arbiter configuration, posix-xlator in the arbiter brick always sets
    the GF_CONTENT_KEY in the response dict with a value 0. If the file size on
    the data bricks is more than quick-read's max-file-size (64kb default),
    those bricks don't set the key. Because of this difference in the no. of
    dict elements, afr triggers metadata heal in lookup code path, in turn
    leading to extra lookups+inodelks.

    Fix:
    Changed afr dict comparison logic to ignore all virtual xattrs and the
    on-disk ones that we should not be healing.

    Also removed is_virtual_xattr() function. The original callers to this
    function (upcall) don't seem to need it anymore.

    Change-Id: I05730bdd39d8fb0b9a49a5fc9c0bb01f0d3bb308
    BUG: 1378684
    Signed-off-by: Ravishankar N <ravishankar at redhat.com>
    Reviewed-on: http://review.gluster.org/15548
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1377193
[Bug 1377193] Poor smallfile read performance on Arbiter volume compared to
Replica 3 volume
https://bugzilla.redhat.com/show_bug.cgi?id=1378684
[Bug 1378684] Poor smallfile read performance on Arbiter volume compared to
Replica 3 volume
https://bugzilla.redhat.com/show_bug.cgi?id=1378867
[Bug 1378867] Poor smallfile read performance on Arbiter volume compared to
Replica 3 volume