[Bugs] [Bug 1361678] New: thread CPU saturation limiting throughput on write workloads

bugzilla at redhat.com
Fri Jul 29 17:40:42 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1361678

            Bug ID: 1361678
           Summary: thread CPU saturation limiting throughput on write
                    workloads
           Product: GlusterFS
           Version: mainline
         Component: fuse
          Keywords: Performance
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: bugs at gluster.org, dblack at redhat.com,
                    mpillai at redhat.com, pkarampu at redhat.com,
                    rtalur at redhat.com, xhernandez at datalab.es
        Depends On: 1349953



+++ This bug was initially created as a clone of Bug #1349953 +++

Description of problem:

On a distributed iozone benchmark test involving sequential writes to large
files, we are seeing poor write throughput when there are multiple threads per
client. Stats on the clients show a single glusterfs thread at 100% CPU
utilization, while overall CPU utilization on the clients is low.

Version-Release number of selected component (if applicable):
glusterfs*-3.8.0-1.el7.x86_64 (on both clients and servers)
RHEL 7.1 (clients)
RHEL 7.2 (servers)

How reproducible:
consistently

Steps to Reproduce:
The h/w setup involves 6 servers and 6 clients on a 10GbE network. Each server
has 12 hard disks, for a total of 72 drives. A single 12x(4+2) EC volume is
created and fuse mounted on the 6 clients. Iozone is run in distributed mode
from the clients, as below (in this case, with 4 threads per client):
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 20g -r 64k -t 24

For comparison, results were also obtained with a 3x2 dist-rep volume. In this
case, the disks on each server are aggregated into a 12-disk RAID-6 device on
which the gluster brick is created.

Actual results:
Throughput for 12x(4+2) dist-disperse volume with each brick on a single disk:
     throughput for 24 initial writers  =  738076.08 kB/sec

Throughput for 3x2 dist-replicated volume with bricks on 12-disk RAID-6:
     throughput for 24 initial writers  = 1817252.84 kB/sec

Expected results:

1. EC should exceed replica-2 performance on this workload:

EC needs to write out fewer bytes than replica-2: with a 4+2 layout, EC writes
out 1.5x the number of bytes written by the application. Replica-2 on bricks
backed by 12-disk RAID-6, on the other hand, writes out 1.2 (RAID-6) * 2
(replica-2) = 2.4x the number of bytes actually written.
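
(As a quick sanity check of those factors, assuming the 4+2 disperse layout
above and that the 12-disk RAID-6 is effectively 10 data + 2 parity disks:)

# EC 4+2: 6 fragments written for every 4 fragments of application data
echo "scale=2; 6 / 4" | bc            # 1.50
# replica-2 on 12-disk RAID-6: 12/10 RAID overhead, times 2 replicas
echo "scale=2; (12 / 10) * 2" | bc    # 2.40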

For this large-file workload, EC should be capable of achieving higher
throughput than replica-2, but it is not. For some other write-intensive
large-file benchmarks, we have seen EC exceed replica-2+RAID-6 by a significant
margin, so we need to understand why that is not happening in this case.

2. Write throughput for both EC and replica-2 is much less than what the h/w
setup is capable of.


Additional info:

Output of "top -bH -d 10" on the clients shows output like below:

<body>
top - 09:12:10 up 191 days,  7:52,  0 users,  load average: 0.56, 0.26, 0.51
Threads: 289 total,   1 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.9 us,  5.3 sy,  0.0 ni, 83.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65728904 total, 58975920 free,   929824 used,  5823160 buff/cache
KiB Swap: 32964604 total, 32964148 free,      456 used. 64008760 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
21160 root      20   0  892360 233600   3492 S 99.8  0.4   0:33.07 glusterfs
21155 root      20   0  892360 233600   3492 S 22.7  0.4   0:07.88 glusterfs
21156 root      20   0  892360 233600   3492 S 22.5  0.4   0:07.98 glusterfs
21154 root      20   0  892360 233600   3492 S 22.2  0.4   0:08.29 glusterfs
21157 root      20   0  892360 233600   3492 S 21.8  0.4   0:08.02 glusterfs
21167 root      20   0   53752  19484    816 S  2.9  0.0   0:00.96 iozone
21188 root      20   0   53752  18528    816 S  2.8  0.0   0:00.95 iozone
21202 root      20   0   53752  19484    816 S  2.6  0.0   0:00.84 iozone
[...]
</body>

One of the glusterfs threads is at almost 100% CPU utilization for the duration
of the test. This is seen with both EC and replica-2, but the results indicate a
bigger performance hit for EC.

Volume options that have been changed (for all runs):
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
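
(For reference, options like these are typically set per volume along the
following lines, with <volname> as a placeholder:)

gluster volume set <volname> cluster.lookup-optimize on
gluster volume set <volname> server.event-threads 4
gluster volume set <volname> client.event-threads 4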

--- Additional comment from Manoj Pillai on 2016-06-24 14:15:43 EDT ---


Simpler reproduction with a single client:

Created the 12x(4+2) volume and fuse mounted on the client.

Run iozone single client write test with 8 threads/files:
iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

Output:
throughput for  8 initial writers  =  179084.75 kB/sec

Output from: top -bH -d 10:

top - 13:55:21 up 191 days, 12:36,  6 users,  load average: 0.92, 0.32, 0.55
Threads: 292 total,   3 running, 289 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.3 us,  4.9 sy,  0.0 ni, 84.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65728904 total, 54453352 free,   861572 used, 10413980 buff/cache
KiB Swap: 32964604 total, 32963420 free,     1184 used. 64082548 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 8839 root      20   0  892360 239612   3572 R 99.8  0.4   0:53.34 glusterfs
 8836 root      20   0  892360 239612   3572 S 19.6  0.4   0:10.37 glusterfs
 8835 root      20   0  892360 239612   3572 S 19.0  0.4   0:10.18 glusterfs
 8834 root      20   0  892360 239612   3572 S 18.8  0.4   0:10.17 glusterfs
 8833 root      20   0  892360 239612   3572 R 18.6  0.4   0:10.49 glusterfs
 8853 root      20   0   47432  18796    168 S  2.2  0.0   0:01.25 iozone
 8855 root      20   0   47432  18800    168 S  2.2  0.0   0:00.99 iozone
[...]

--- Additional comment from Pranith Kumar K on 2016-06-27 02:29:07 EDT ---

Thanks for filing this bug, Manoj. I think we need to test these cases with
"gluster volume set <volname> performance.client-io-threads on" so that the
fuse thread won't do the encoding in the EC volume.

--- Additional comment from Manoj Pillai on 2016-06-27 11:08:24 EDT ---

(In reply to Manoj Pillai from comment #1)
> Simpler reproduction with a single client:
> 
> Created the 12x(4+2) volume and fuse mounted on the client.
> 
> Run iozone single client write test with 8 threads/files:
> iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz
> 
> Output:
> throughput for  8 initial writers  =  179084.75 kB/sec
> 

Same test with additional option: performance.client-io-threads on:
 throughput for  8 initial writers  =  773070.43 kB/sec

That's more like it!

--- Additional comment from Manoj Pillai on 2016-06-27 13:34:34 EDT ---


I'll look at some more results, including replica-2 runs, before closing this
bug.

--- Additional comment from Manoj Pillai on 2016-06-28 23:32:53 EDT ---

Results for distributed iozone test, with 6 clients. Command line:
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 10g -r 64k -t 48

Volume same as in comment #0.
6 servers, 12x(4+2).
Volume options set:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4

iozone output:
Children see throughput for 48 initial writers  = 1092648.42 kB/sec

Now, adding option: performance.client-io-threads on:
iozone output:
Children see throughput for 48 initial writers  = 3746335.84 kB/sec

So, enabling client-io-threads gave us a ~3.5x bump in throughput here.

top output (on a client) without client-io-threads is given in comment #0.
Here's what it looks like after.

<body>
top - 12:07:08 up 11:37,  5 users,  load average: 1.66, 0.40, 0.17
Threads: 314 total,   7 running, 307 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.1 us, 15.9 sy,  0.0 ni, 54.5 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0 st
KiB Mem : 65728904 total, 45210228 free,   973552 used, 19545124 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64288252 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
22074 root      20   0 1168900 239896   3708 R 88.0  0.4   0:21.89 glusterfs
22078 root      20   0 1168900 239896   3708 R 88.0  0.4   0:21.68 glusterfs
22076 root      20   0 1168900 239896   3708 S 87.9  0.4   0:21.62 glusterfs
22077 root      20   0 1168900 239896   3708 R 87.9  0.4   0:21.68 glusterfs
22081 root      20   0 1168900 239896   3708 R 75.5  0.4   0:17.93 glusterfs
22075 root      20   0 1168900 239896   3708 S 19.1  0.4   0:04.56 glusterfs
22273 root      20   0 1168900 239896   3708 R 18.9  0.4   0:04.48 glusterfs
22272 root      20   0 1168900 239896   3708 R 18.8  0.4   0:04.56 glusterfs
22274 root      20   0 1168900 239896   3708 S 18.7  0.4   0:04.48 glusterfs
22199 root      20   0   53752  19484    816 S  6.2  0.0   0:01.63 iozone
22186 root      20   0   53752  18536    820 S  6.1  0.0   0:01.52 iozone
[...]
</body>

The single hot thread is gone. Overall CPU utilization has gone up quite a bit
(idle is down to ~54%), and that may be a concern for some deployments.

Network stats (below is from a client) for the run with client-io-threads
enabled show that the 10gE link is now being pushed to near its limit, which is
what we want:
12:07:07 PM     IFACE   rxpck/s   txpck/s    rxkB/s     txkB/s   rxcmp/s   txcmp/s  rxmcst/s
12:07:17 PM      p3p1  76644.80 150502.30  15037.46 1197410.19      0.00      0.00      0.00
12:07:17 PM      p3p2      0.00      0.00      0.00       0.00      0.00      0.00      0.00
12:07:17 PM        lo      0.00      0.00      0.00       0.00      0.00      0.00      0.00
12:07:17 PM       em3      0.00      0.00      0.00       0.00      0.00      0.00      0.00
12:07:17 PM       em1      6.80      0.00      0.45       0.00      0.00      0.00      3.00
12:07:17 PM       em2      0.00      0.00      0.00       0.00      0.00      0.00      0.00
12:07:17 PM       em4      0.00      0.00      0.00       0.00      0.00      0.00      0.00

--- Additional comment from Pranith Kumar K on 2016-06-28 23:45:04 EDT ---

(In reply to Manoj Pillai from comment #5)
> Results for distributed iozone test, with 6 clients. Command line:
> iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 10g -r 64k -t 48
> 
> Volume same as in comment #0.
> 6 servers, 12x(4+2).
> Volume options set:
> cluster.lookup-optimize: on
> server.event-threads: 4
> client.event-threads: 4
> 
> iozone output:
> Children see throughput for 48 initial writers  = 1092648.42 kB/sec
> 
> Now, adding option: performance.client-io-threads on:
> iozone output:
> Children see throughput for 48 initial writers  = 3746335.84 kB/sec
> 
> So, enabling client-io-threads gave us ~3.5x bump in throughput here. 
> 
> top output (on a client) without client-io-threads is given in comment #0.
> Here's what it looks like after.
> 
> <body>
> top - 12:07:08 up 11:37,  5 users,  load average: 1.66, 0.40, 0.17
> Threads: 314 total,   7 running, 307 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 27.1 us, 15.9 sy,  0.0 ni, 54.5 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0
> st
> KiB Mem : 65728904 total, 45210228 free,   973552 used, 19545124 buff/cache
> KiB Swap: 32964604 total, 32964604 free,        0 used. 64288252 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 22074 root      20   0 1168900 239896   3708 R 88.0  0.4   0:21.89 glusterfs
> 22078 root      20   0 1168900 239896   3708 R 88.0  0.4   0:21.68 glusterfs
> 22076 root      20   0 1168900 239896   3708 S 87.9  0.4   0:21.62 glusterfs
> 22077 root      20   0 1168900 239896   3708 R 87.9  0.4   0:21.68 glusterfs
> 22081 root      20   0 1168900 239896   3708 R 75.5  0.4   0:17.93 glusterfs
> 22075 root      20   0 1168900 239896   3708 S 19.1  0.4   0:04.56 glusterfs
> 22273 root      20   0 1168900 239896   3708 R 18.9  0.4   0:04.48 glusterfs
> 22272 root      20   0 1168900 239896   3708 R 18.8  0.4   0:04.56 glusterfs
> 22274 root      20   0 1168900 239896   3708 S 18.7  0.4   0:04.48 glusterfs
> 22199 root      20   0   53752  19484    816 S  6.2  0.0   0:01.63 iozone
> 22186 root      20   0   53752  18536    820 S  6.1  0.0   0:01.52 iozone
> [...]
> </body>
> 
> The single hot thread is gone. Overall CPU utilization has gone up quite a
> bit (idle is down to ~54%), and that may be a concern for some deployments.

CPU utilization will go up because more threads are doing encoding now. Are you
suggesting that we should have a way to throttle that? I have also CCed Xavi on
the bug to get his inputs.

> 
> Network stats (below is from a client) for the run with client-io-threads
> enabled show that the 10gE link is now being pushed to near its limit, which
> is what we want:
> 12:07:07 PM     IFACE   rxpck/s   txpck/s    rxkB/s     txkB/s   rxcmp/s   txcmp/s  rxmcst/s
> 12:07:17 PM      p3p1  76644.80 150502.30  15037.46 1197410.19      0.00      0.00      0.00
> 12:07:17 PM      p3p2      0.00      0.00      0.00       0.00      0.00      0.00      0.00
> 12:07:17 PM        lo      0.00      0.00      0.00       0.00      0.00      0.00      0.00
> 12:07:17 PM       em3      0.00      0.00      0.00       0.00      0.00      0.00      0.00
> 12:07:17 PM       em1      6.80      0.00      0.45       0.00      0.00      0.00      3.00
> 12:07:17 PM       em2      0.00      0.00      0.00       0.00      0.00      0.00      0.00
> 12:07:17 PM       em4      0.00      0.00      0.00       0.00      0.00      0.00      0.00

--- Additional comment from Manoj Pillai on 2016-06-29 02:23:15 EDT ---

(In reply to Pranith Kumar K from comment #6)

> > 
> > The single hot thread is gone. Overall CPU utilization has gone up quite a
> > bit (idle is down to ~54%), and that may be a concern for some deployments.
> 
> CPU utilization will go up because more threads are doing encoding now. Are
> you suggesting that we should have a way to throttle that? I have also CCed
> Xavi on the bug to get his inputs.
> 

Right, CPU utilization is expected to go up as you scale to higher throughput.
I'm just saying that client-side CPU utilization can be a concern depending on
how CPU-hungry the applications are -- client-side CPU is primarily meant for
them.

Do we need an ability to cap CPU utilization from within gluster (rather than
using something like cgroups)? Take a look at a run with client event-threads
set to 2, instead of 4 like in the runs above:

iozone output:
Children see throughput for 48 initial writers  = 2778952.81 kB/sec
[lower throughput compared to 3.7 GB/s with client event-threads=4]

top output:
<body>
top - 00:26:56 up 1 day, 23:57,  5 users,  load average: 1.38, 0.45, 0.38
Threads: 312 total,   3 running, 309 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.8 us,  9.8 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 65728904 total, 40787784 free,   985512 used, 23955608 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64218968 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 6003 root      20   0  954872 240892   3708 R 99.9  0.4   0:47.30 glusterfs
 6005 root      20   0  954872 240892   3708 R 99.9  0.4   0:47.21 glusterfs
 6008 root      20   0  954872 240892   3708 S 48.5  0.4   0:22.06 glusterfs
 6199 root      20   0  954872 240892   3708 S 13.5  0.4   0:06.77 glusterfs
 6200 root      20   0  954872 240892   3708 S 13.5  0.4   0:06.41 glusterfs
 6004 root      20   0  954872 240892   3708 S 13.4  0.4   0:07.30 glusterfs
 6126 root      20   0   53752  19488    816 S  5.8  0.0   0:02.61 iozone
[...]
</body>
[lower overall CPU utilization as well, compared to event-threads=4]

The 2 top threads by CPU utilization (99.9% each) seem to be the epoll threads,
based on pstack output. The third is the fuse thread.

In comment #0, IIRC, the epoll threads were not doing most of the encoding.
With client-io-threads on, the epoll threads seem to be doing most of the
encoding and using most of the CPU. Is that true and expected? And by varying
the number of client event-threads, I seem to be able to limit CPU utilization
at the expense of throughput.

--- Additional comment from Pranith Kumar K on 2016-06-29 02:35:04 EDT ---

(In reply to Manoj Pillai from comment #7)
> (In reply to Pranith Kumar K from comment #6)
> 
> > > 
> > > The single hot thread is gone. Overall CPU utilization has gone up quite a
> > > bit (idle is down to ~54%), and that may be a concern for some deployments.
> > 
> > CPU utilization will go up because more threads are doing encoding now. Are
> > you suggesting that we should have a way to throttle that? I have also CCed
> > Xavi on the bug to get his inputs.
> > 
> 
> Right, CPU utilization is expected to go up as you scale to higher
> throughput. I'm just saying that client-side CPU utilization can be a
> concern depending on how CPU-hungry the applications are -- client-side CPU
> is primarily meant for them.
> 
> Do we need an ability to cap CPU utilization from within gluster (rather
> than using something like cgroups)? Take a look at a run with client
> event-threads set to 2, instead of 4 like in the runs above:
> 
> iozone output:
> Children see throughput for 48 initial writers  = 2778952.81 kB/sec
> [lower throughput compared to 3.7 GB/s with client event-threads=4]
> 
> top output:
> <body>
> top - 00:26:56 up 1 day, 23:57,  5 users,  load average: 1.38, 0.45, 0.38
> Threads: 312 total,   3 running, 309 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 15.8 us,  9.8 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0
> st
> KiB Mem : 65728904 total, 40787784 free,   985512 used, 23955608 buff/cache
> KiB Swap: 32964604 total, 32964604 free,        0 used. 64218968 avail Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
>  6003 root      20   0  954872 240892   3708 R 99.9  0.4   0:47.30 glusterfs
>  6005 root      20   0  954872 240892   3708 R 99.9  0.4   0:47.21 glusterfs
>  6008 root      20   0  954872 240892   3708 S 48.5  0.4   0:22.06 glusterfs
>  6199 root      20   0  954872 240892   3708 S 13.5  0.4   0:06.77 glusterfs
>  6200 root      20   0  954872 240892   3708 S 13.5  0.4   0:06.41 glusterfs
>  6004 root      20   0  954872 240892   3708 S 13.4  0.4   0:07.30 glusterfs
>  6126 root      20   0   53752  19488    816 S  5.8  0.0   0:02.61 iozone
> [...]
> </body>
> [lower overall CPU utilization as well, compared to event-threads=4]
> 
> The 2 top threads by CPU utilization (99.9% each) seem to be the epoll
> threads, based on pstack output. The third is the fuse thread.
> 
> In comment #0, IIRC, the epoll threads were not doing most of the encoding.
> With client-io-threads on, the epoll threads seem to be doing most of the
> encoding and using most of the CPU. Is that true and expected? And by
> varying the number of client event-threads, I seem to be able to limit CPU
> utilization at the expense of throughput.

There is a way to find out which thread is doing what. I believe Raghavendra
Talur showed that to me once. Let me CC him.

It is not clear why the epoll threads would do most of the encoding if all we
are doing is writes; I would expect that to be the io-threads. The first write
on a file, which has to acquire the lock, will be encoded by the epoll thread,
after which the encoding burden should shift to the io-threads.

--- Additional comment from Xavier Hernandez on 2016-06-29 02:54:11 EDT ---

I'm not sure what exactly is happening here. Enabling client-io-threads should
not move the encoding load to the epoll threads for most of the requests. The
decoding load, however, should be handled mainly by the epoll threads.

In your test you are doing writes, right? In that case the epoll threads should
only be handling the answers, with no encoding/decoding.

2.78 GB/s in records of 64k means a total of ~43,421 requests per second. A 4+2
EC volume multiplies this by 6, for a total of ~260,526 requests of 16k per
second. On average, each client is receiving ~43,421 answers per second
(assuming a balanced load). Could it be that the communications layer cannot
handle that many answers?

Having 4 epoll threads means each thread handles ~10,855 requests/s, while 2
threads need to handle ~21,711 requests/s. Not sure if this is too many or not.
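
(A quick way to re-derive those figures with shell and bc; integer division
truncates, so the last value comes out one below the rounded number above:)

# 64 KB application writes per second at ~2.78 GB/s aggregate
echo "2778952.81 / 64" | bc        # ~43421
# each write becomes 6 fragment writes of 16 KB on a 4+2 volume
echo "43421 * 6" | bc              # ~260526
# answers received per client (6 clients), then per epoll thread
echo "260526 / 6" | bc             # ~43421
echo "43421 / 4" | bc              # ~10855 with 4 event-threads
echo "43421 / 2" | bc              # ~21710 with 2 event-threads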

--- Additional comment from Raghavendra Talur on 2016-06-29 03:39:16 EDT ---

Performing a pstack on the PID should give enough of a backtrace for us to
determine which thread is causing the CPU overload. Can we have the pstack
output, please?
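
(One way to capture that, roughly, with <glusterfs-client-pid> being the PID of
the glusterfs fuse-mount process on the client:)

top -bH -d 10 -p <glusterfs-client-pid>   # note the TID of the ~100% CPU thread
pstack <glusterfs-client-pid>             # dump backtraces of all threads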

--- Additional comment from Manoj Pillai on 2016-06-29 09:33:28 EDT ---

Back to single client runs for simplicity.

iozone command:
./iozone -i 0 -w -+n -c -C -e -s 10g -r 64k -t 8 -F /mnt/glustervol/f{1..8}.ioz

gluster vol info output:
[...]
Options Reconfigured:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4
performance.client-io-threads: on
transport.address-family: inet
performance.readdir-ahead: on

top output:
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
12160 root      20   0 1301004 249764   3868 R 90.4  0.4   1:57.51 glusterfs
12162 root      20   0 1301004 249764   3868 R 90.4  0.4   1:58.38 glusterfs
12158 root      20   0 1301004 249764   3868 R 90.2  0.4   1:58.66 glusterfs
12161 root      20   0 1301004 249764   3868 R 90.2  0.4   1:58.13 glusterfs
12165 root      20   0 1301004 249764   3868 S 72.9  0.4   1:36.63 glusterfs
12178 root      20   0 1301004 249764   3868 S 15.3  0.4   0:19.57 glusterfs
12159 root      20   0 1301004 249764   3868 S 14.3  0.4   0:19.14 glusterfs
12177 root      20   0 1301004 249764   3868 S 14.3  0.4   0:19.39 glusterfs
12175 root      20   0 1301004 249764   3868 S 14.1  0.4   0:19.01 glusterfs
12179 root      20   0 1301004 249764   3868 S 14.1  0.4   0:19.43 glusterfs
12246 root      20   0   47432  18796    168 S  6.8  0.0   0:01.55 iozone

pstack 12160:
Thread 1 (process 12160):
#0  0x00007f675ac9ed89 in gf8_muladd_04 () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#1  0x00007f675acb4d77 in ec_method_encode () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#2  0x00007f675ac9954d in ec_wind_writev () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#3  0x00007f675ac7dc78 in ec_dispatch_mask () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#4  0x00007f675ac9bbde in ec_manager_writev () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#5  0x00007f675ac7d50b in __ec_manager () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#6  0x00007f675ac7d6e8 in ec_resume () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#7  0x00007f675ac7f6df in ec_lock_resume_shared () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#8  0x00007f675ac80b83 in ec_lock_reuse () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#9  0x00007f675ac9bd08 in ec_manager_writev () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#10 0x00007f675ac7d50b in __ec_manager () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#11 0x00007f675ac7d6e8 in ec_resume () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#12 0x00007f675ac7d80f in ec_complete () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#13 0x00007f675ac9917c in ec_inode_write_cbk () from
/usr/lib64/glusterfs/3.8.0/xlator/cluster/disperse.so
#14 0x00007f675aeef442 in client3_3_writev_cbk () from
/usr/lib64/glusterfs/3.8.0/xlator/protocol/client.so
#15 0x00007f6768b26b30 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#16 0x00007f6768b26def in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#17 0x00007f6768b22923 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#18 0x00007f675d3ec9d4 in socket_event_poll_in () from
/usr/lib64/glusterfs/3.8.0/rpc-transport/socket.so
#19 0x00007f675d3ef614 in socket_event_handler () from
/usr/lib64/glusterfs/3.8.0/rpc-transport/socket.so
#20 0x00007f6768db4590 in event_dispatch_epoll_worker () from
/lib64/libglusterfs.so.0
#21 0x00007f6767bbddf5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f67675041ad in clone () from /lib64/libc.so.6

--- Additional comment from Pranith Kumar K on 2016-06-29 10:03:47 EDT ---

Thanks for this stack trace, Manoj. I think I understand what is happening now.
While a write is in progress, any further writes that arrive are put in a queue
and picked up as soon as the previous write completes. Because the epoll thread
resumes the queued write, it also does the encoding. So yes, you are correct: in
this case the encoding is happening in the epoll thread. Do you think this is
bad? Is it becoming a bottleneck?

Pranith

--- Additional comment from Manoj Pillai on 2016-06-29 13:20:16 EDT ---

(In reply to Pranith Kumar K from comment #12)

> So yes, you are correct, in this case it is happening in epoll-thread. Do
> you think this is bad. Is it becoming a bottle-neck?
> 

Nothing bad, I think; we just need to make sure that client.event-threads is
also tuned to a good number (>= 4), in addition to having client-io-threads on.
We are seeing that both are important for good EC write performance.
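
(In practice, on a setup like this, that client-side tuning boils down to
something like the following, with <volname> as a placeholder:)

gluster volume set <volname> performance.client-io-threads on
gluster volume set <volname> client.event-threads 4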

--- Additional comment from Xavier Hernandez on 2016-06-30 02:33:44 EDT ---

Thanks, Manoj, for the testing. I was wrong: as Pranith says, writes need to be
serialized in EC, which causes the queued writes to be resumed inside the epoll
threads.

In this case I'm not sure why enabling client-io-threads improves performance
so much. Maybe it allows fuse to process requests faster, causing the epoll
threads to saturate later...

In any case I think it would be interesting to move the execution of all these
operations to other threads, maybe using a thread pool or synctasks.

Another parameter I tweak that gives a small extra boost in my tests is
rpc.outstanding-rpc-limit. The default is 64; using 128, 256 or 512 gives some
more performance (probably at the cost of more CPU usage).

--- Additional comment from Pranith Kumar K on 2016-06-30 09:29:37 EDT ---

(In reply to Xavier Hernandez from comment #14)
> Thanks Manoj for the testing. I was wrong. As Pranith says, writes need to
> be serialized in EC causing the queued writes to be resumed inside the epoll
> threads.
> 
> In this case I'm not sure why enabling client-io-threads improves
> performance so much. Maybe it allows fuse to process requests faster,
> causing the saturation of epoll-threads later...
> 
> In any case I think it would be interesting to move the execution of all
> these operations to other threads. Maybe using a thread pool or synctasks.
> 
> Another parameter that I tweak that gives a small extra boost in my tests is
> rpc.outstanding-rpc-limit. Default is 64. Using 128, 256 or 512 gives some
> more performance (probably at the cost of more CPU usage)
Xavi,
     In the latest releases (even in 3.7.x) the outstanding-rpc-limit is
infinity, i.e. value '0', for servers; it is enabled only for NFS servers.
These are the patches:
http://review.gluster.org/10267
http://review.gluster.org/10268

So I would say don't enable anything?

Pranith

--- Additional comment from Xavier Hernandez on 2016-07-01 02:21:22 EDT ---

I didn't know about this change...

Maybe I did the tests with 3.6.x; I'm not sure. In any case, I agree that it
will be useless with the latest versions.

--- Additional comment from Manoj Pillai on 2016-07-08 04:03:41 EDT ---


Results with 3x2 volume (6 servers) with bricks on 12-disk RAID-6:

Benchmark: distributed iozone (write test) on 6 clients:
iozone -+m ${IOZONE_CONF} -i 0 -w -+n -c -C -e -s 10g -r 64k -t 48

With standard tuning options:
cluster.lookup-optimize: on
server.event-threads: 4
client.event-threads: 4

iozone output:
throughput for 48 initial writers  = 2052985.17 kB/sec

iozone output after adding option performance.client-io-threads on:
throughput for 48 initial writers  = 3180628.89 kB/sec

So we see a big bump with client-io-threads in the replica-2 case as well; we
really want this option on irrespective of volume type.

Without client-io-threads, we see the following in top output on the clients:
<quote>
top - 02:49:26 up 21:08,  0 users,  load average: 0.24, 0.47, 0.47
Threads: 296 total,   4 running, 292 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.5 us,  9.2 sy,  0.0 ni, 82.6 id,  0.0 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem : 65728904 total, 59729480 free,   758744 used,  5240680 buff/cache
KiB Swap: 32964604 total, 32964604 free,        0 used. 64684248 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 5156 root      20   0  736788  62396   3512 R 99.7  0.1   0:10.38 glusterfs
 5152 root      20   0  736788  62396   3512 R 20.3  0.1   0:02.10 glusterfs
 5150 root      20   0  736788  62396   3512 S 20.2  0.1   0:02.16 glusterfs
 5153 root      20   0  736788  62396   3512 S 19.9  0.1   0:02.08 glusterfs
 5151 root      20   0  736788  62396   3512 S 19.4  0.1   0:02.03 glusterfs
 5198 root      20   0   53752  19484    816 S  4.4  0.0   0:00.45 iozone
 5184 root      20   0   53752  19488    816 S  4.3  0.0   0:00.44 iozone
[...]
</quote>

pstack shows that the hot thread is the fuse thread:
<quote>
[...]
#37 0x00007eff9f18c3c0 in meta_writev () from
/usr/lib64/glusterfs/3.8.0/xlator/meta.so
#38 0x00007effa90aea55 in fuse_write_resume () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#39 0x00007effa90a55d5 in fuse_resolve_all () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#40 0x00007effa90a52e8 in fuse_resolve () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#41 0x00007effa90a561e in fuse_resolve_all () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#42 0x00007effa90a4903 in fuse_resolve_continue () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#43 0x00007effa90a5266 in fuse_resolve () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#44 0x00007effa90a55fe in fuse_resolve_all () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#45 0x00007effa90a5640 in fuse_resolve_and_resume () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#46 0x00007effa90bb7be in fuse_thread_proc () from
/usr/lib64/glusterfs/3.8.0/xlator/mount/fuse.so
#47 0x00007effb1659df5 in start_thread () from /lib64/libpthread.so.0
#48 0x00007effb0fa01ad in clone () from /lib64/libc.so.6
</quote>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1349953
[Bug 1349953] thread CPU saturation limiting throughput on write workloads
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

