[Bugs] [Bug 1349953] New: thread CPU saturation limiting throughput on write workloads

bugzilla at redhat.com bugzilla at redhat.com
Fri Jun 24 15:59:04 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1349953

            Bug ID: 1349953
           Summary: thread CPU saturation limiting throughput on write
                    workloads
           Product: GlusterFS
           Version: 3.8.0
         Component: fuse
          Assignee: bugs at gluster.org
          Reporter: mpillai at redhat.com
                CC: bugs at gluster.org



Description of problem:

On a distributed iozone benchmark test involving sequential writes to
large files, we are seeing poor write throughput when there are multiple
threads per client. Per-thread stats on the clients show a single glusterfs
thread at 100% CPU utilization, while overall CPU utilization on the clients
is low.

Version-Release number of selected component (if applicable):
glusterfs*-3.8.0-1.el7.x86_64 (on both clients and servers)
RHEL 7.1 (clients)
RHEL 7.2 (servers)

How reproducible:
consistently

Steps to Reproduce:
The h/w setup involves 6 servers and 6 clients on a 10GbE network. Each server
has 12 hard disks, for a total of 72 drives. A single 12x(4+2) EC volume is
created and fuse-mounted on the 6 clients. Iozone is run in distributed mode
from the clients, as below (in this case, with 24 threads total, i.e. 4
threads per client):
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 20g -r 64k -t 24
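For reference, the client file passed via -+m lists one client per line as "hostname workdir path-to-iozone"; exactly how the -t threads are spread across the listed clients can depend on the iozone version. A hypothetical 6-client file (hostnames and paths are placeholders, not the actual ${IOZONE_CONF}):

```shell
# Hypothetical -+m client file: hostname, working directory (the fuse
# mount point on that client), and the path to the iozone binary there.
cat > iozone_clients.conf <<'EOF'
client1 /mnt/ecvol /usr/bin/iozone
client2 /mnt/ecvol /usr/bin/iozone
client3 /mnt/ecvol /usr/bin/iozone
client4 /mnt/ecvol /usr/bin/iozone
client5 /mnt/ecvol /usr/bin/iozone
client6 /mnt/ecvol /usr/bin/iozone
EOF
```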

For comparison, results were also obtained with a 3x2 dist-rep volume. In this
case, the disks on each server are aggregated into a 12-disk RAID-6 device on
which the gluster brick is created.
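For reference, the two layouts described above can be created roughly as follows (a sketch only; volume names, hostnames, and brick paths are hypothetical, and the EC brick list is abbreviated):

```shell
# 12x(4+2) dispersed volume: 72 bricks, one per disk (12 per server).
# gluster groups each set of 6 bricks into one (4+2) disperse subvolume.
gluster volume create ecvol disperse-data 4 redundancy 2 \
    server{1..6}:/bricks/d1 \
    server{1..6}:/bricks/d2
    # ...continuing through server{1..6}:/bricks/d12

# 3x2 distributed-replicated volume: one brick per server, each on a
# 12-disk RAID-6 device; consecutive bricks form the replica pairs.
gluster volume create repvol replica 2 \
    server{1..6}:/bricks/raid6
```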

Actual results:
Throughput for 12x(4+2) dist-disperse volume with each brick on a single disk:
     throughput for 24 initial writers  =  738076.08 kB/sec

Throughput for 3x2 dist-replicated volume with bricks on 12-disk RAID-6:
     throughput for 24 initial writers  = 1817252.84 kB/sec

Expected results:

1. EC should exceed replica-2 performance on this workload:

EC needs to write out fewer bytes than replica-2 does in this setup. With a
4+2 configuration, EC writes out 6/4 = 1.5x the bytes written by the
application. Replica-2, on the other hand, needs to write out
1.2 (12-disk RAID-6: 12/10) * 2 (replica-2) = 2.4x the bytes actually
written.
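The amplification arithmetic above can be checked directly (a sketch of the reasoning; the 12-disk RAID-6 is taken as 10 data + 2 parity disks):

```shell
# EC 4+2: every 4 data fragments are stored alongside 2 parity fragments.
awk 'BEGIN { printf "EC write amplification: %.1fx\n", (4 + 2) / 4 }'
# Replica-2 on 12-disk RAID-6 (10 data + 2 parity): RAID factor times 2 copies.
awk 'BEGIN { printf "Replica-2 on RAID-6: %.1fx\n", (12 / 10) * 2 }'
```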

For this large-file workload, EC should therefore be capable of achieving
higher throughput than replica-2, but it is not. On some other
write-intensive large-file benchmarks, we have seen EC exceed
replica-2+RAID-6 by a significant margin, so we need to determine why that is
not happening in this case.

2. Write throughput for both EC and replica-2 is much less than what the h/w
setup is capable of.


Additional info:

Output of "top -bH -d 10" on the clients looks like the following:

top - 09:12:10 up 191 days,  7:52,  0 users,  load average: 0.56, 0.26, 0.51
Threads: 289 total,   1 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.9 us,  5.3 sy,  0.0 ni, 83.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65728904 total, 58975920 free,   929824 used,  5823160 buff/cache
KiB Swap: 32964604 total, 32964148 free,      456 used. 64008760 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
21160 root      20   0  892360 233600   3492 S 99.8  0.4   0:33.07 glusterfs
21155 root      20   0  892360 233600   3492 S 22.7  0.4   0:07.88 glusterfs
21156 root      20   0  892360 233600   3492 S 22.5  0.4   0:07.98 glusterfs
21154 root      20   0  892360 233600   3492 S 22.2  0.4   0:08.29 glusterfs
21157 root      20   0  892360 233600   3492 S 21.8  0.4   0:08.02 glusterfs
21167 root      20   0   53752  19484    816 S  2.9  0.0   0:00.96 iozone
21188 root      20   0   53752  18528    816 S  2.8  0.0   0:00.95 iozone
21202 root      20   0   53752  19484    816 S  2.6  0.0   0:00.84 iozone
[...]

One of the glusterfs threads is at almost 100% CPU utilization for the
duration of the test. This is seen with both EC and replica-2, but the
results suggest that EC performance takes the bigger hit.
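A quick way to pull the saturated thread's TID out of such output (a sketch; the awk field positions assume the column layout shown above, and the TID can then be handed to tools like perf or gdb to see what that thread is doing):

```shell
# Flag glusterfs threads above 90% CPU in a `top -bH` snapshot.
# Sample data is inlined here; in practice pipe `top -bH -n 1` instead.
top_snapshot='21160 root 20 0 892360 233600 3492 S 99.8 0.4 0:33.07 glusterfs
21155 root 20 0 892360 233600 3492 S 22.7 0.4 0:07.88 glusterfs'
# Field 9 is %CPU and field 12 is COMMAND in this layout.
echo "$top_snapshot" | awk '$12 == "glusterfs" && $9 > 90 { print "TID " $1 " at " $9 "% CPU" }'
```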

Volume options that have been changed (for all runs):
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

