[Bugs] [Bug 1220064] New: Gluster small-file creates do not scale with brick count

bugzilla at redhat.com bugzilla at redhat.com
Sat May 9 18:14:13 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1220064

            Bug ID: 1220064
           Summary: Gluster small-file creates do not scale with brick
                    count
           Product: GlusterFS
           Version: 3.7.0
         Component: distribute
          Severity: high
          Priority: urgent
          Assignee: bugs at gluster.org
          Reporter: srangana at redhat.com
                CC: bengland at redhat.com, bugs at gluster.org,
                    gluster-bugs at redhat.com, ira at redhat.com,
                    nsathyan at redhat.com, perfbz at redhat.com,
                    srangana at redhat.com, storage-qa-internal at redhat.com,
                    vagarwal at redhat.com
        Depends On: 1156637, 1219637
            Blocks: 1202842



+++ This bug was initially created as a clone of Bug #1219637 +++

+++ This bug was initially created as a clone of Bug #1156637 +++

Description of problem:

Gluster small-file creates have negative scalability with brick count.  This
prevents Gluster from achieving reasonable small-file create performance with:

a) JBOD (just a bunch of disks) configurations, and
b) high server counts

How reproducible:

Every time.

Steps to Reproduce:
1.  Create a Gluster volume with 2, 4, 8, 16, 32, ... bricks (easy to do with JBOD)
2.  Run smallfile or a similar workload from all clients (e.g. over a glusterfs mount); see the sketch after this list
3.  Measure throughput per brick
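
As a rough sketch of steps 1-2 (hostnames, brick paths, and smallfile flags
here are illustrative, not the ones from the actual test setup):

  # create a distributed volume with one brick per JBOD disk, then start it
  gluster volume create testvol \
      server1:/bricks/b1 server1:/bricks/b2 \
      server2:/bricks/b1 server2:/bricks/b2
  gluster volume start testvol

  # on each client: mount the volume and run the smallfile create workload
  mount -t glusterfs server1:/testvol /mnt/testvol
  python smallfile_cli.py --operation create --threads 8 \
      --files 10000 --file-size 4 --top /mnt/testvol

  # repeat with 8, 16, 32, ... bricks and compare files/sec per brick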

Note: Dan Lambright and I were able to take this testing out to 84 bricks using
virtual machines, with a single disk drive as the brick for each VM and 3 GB of
RAM + 2 CPU cores per VM.  We made sure replicas were on different physical
machines.  See the article below for details.

Actual results:

At some point throughput levels off and starts to decline as the brick count is
increased.  However, with the Gluster volume parameter cluster.lookup-unhashed
set to off instead of the default value of on, throughput continues to
increase, though perhaps not linearly.

A dangerous workaround is "gluster v set your-volume cluster.lookup-unhashed
off", but if you do this you may lose data.

Expected results:

Throughput should scale linearly with brick count, assuming the number of
bricks per server is small.

Additional info:

https://mojo.redhat.com/people/bengland/blog/2014/04/30/gluster-scalability-test-results-using-virtual-machine-servers

For people outside Red Hat, the data is available upon request.

Gluster volume profile output shows that without this tuning, the LOOKUP FOP
starts to dominate the call count and eventually the %-latency as well.  For
example, with just 2 servers and 6 RAID6 bricks/server in a 1-replica volume,
we get something like this:

Interval 2 Stats:
   Block Size:              65536b+ 
 No. of Reads:                    0 
No. of Writes:                 4876 
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us           4881      FORGET
      0.00       0.00 us       0.00 us       0.00 us           4876     RELEASE
      0.08      46.11 us      18.00 us     208.00 us            160      STATFS
      0.44      37.75 us      14.00 us     536.00 us           1081        STAT
      1.54      29.12 us       6.00 us    1070.00 us           4876       FLUSH
      8.44     160.01 us      80.00 us     935.00 us           4876       WRITE
     14.74     279.62 us     126.00 us    2729.00 us           4877      CREATE
     74.76     100.29 us      33.00 us    2708.00 us          68948      LOOKUP

    Duration: 10 seconds
   Data Read: 0 bytes
Data Written: 319553536 bytes
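
(The interval stats above come from the volume profile facility, roughly as
follows; the volume name is a placeholder.  Each "info" call reports stats
for the interval since the previous call.)

  gluster volume profile your-volume start
  gluster volume profile your-volume info
  gluster volume profile your-volume stop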

The number of LOOKUP FOPs (68948) is approximately 14 times the number of
CREATE FOPs (4877), which makes sense because there are 12 DHT subvolumes and
DHT checks all of them for the existence of a file with that name before it
issues the CREATE.  However, this shouldn't be necessary if the DHT layout
hasn't changed since volume creation or the last rebalance.

Jeff Darcy has written a patch at https://review.gluster.org/#/c/7702/ that
tries to make "cluster.lookup-unhashed=auto" a safe default, so that we don't
have to do exhaustive per-file LOOKUPs on every brick unless the layout
changes; if the layout does change, we can get back to a good state by doing a
rebalance (did I capture the behavior correctly?).
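
If that patch lands, usage would presumably look something like the sketch
below (hedged, since the exact semantics of "auto" are what the patch under
review defines; the volume name is a placeholder):

  # let DHT skip the broadcast LOOKUP while the layout is known to be stable
  gluster volume set your-volume cluster.lookup-unhashed auto

  # after adding or removing bricks, rebalance to get back to the fast path
  gluster volume rebalance your-volume start
  gluster volume rebalance your-volume status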

--- Additional comment from Anand Avati on 2015-05-07 16:46:18 EDT ---

REVIEW: http://review.gluster.org/7702 (dht: make lookup-unhashed=auto do
something actually useful) posted (#8) for review on master by Shyamsundar
Ranganathan (srangana at redhat.com)

--- Additional comment from Anand Avati on 2015-05-08 15:16:32 EDT ---

REVIEW: http://review.gluster.org/7702 (dht: make lookup-unhashed=auto do
something actually useful) posted (#9) for review on master by Shyamsundar
Ranganathan (srangana at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1156637
[Bug 1156637] Gluster small-file creates do not scale with brick count
https://bugzilla.redhat.com/show_bug.cgi?id=1202842
[Bug 1202842] [TRACKER] RHGS 3.1 Tracker BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1219637
[Bug 1219637] Gluster small-file creates do not scale with brick count