[Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench
xhernandez at redhat.com
Fri Dec 21 08:18:24 UTC 2018
I've done some tracing of the latency that network layer introduces in
gluster. I've made the analysis as part of the pgbench performance issue
(in particulat the initialization and scaling phase), so I decided to look
at READV for this particular workload, but I think the results can be
extrapolated to other operations that also have small latency (cached data
from FS for example).
Note that measuring latencies introduces some latency. It consists in a
call to clock_get_time() for each probe point, so the real latency will be
a bit lower, but still proportional to these numbers.
Raw results (times in microseconds):
Min Avg Max From To
1.205 3.554 50.371 client_readv ->
0.828 2.472 44.811 client_submit_request ->
0.792 5.347 267.897 rpc_clnt_submit ->
4.337 11.229 242.354 socket_submit_outgoing_msg -> msg sent
25.447 99.988 49648.239 msg sent ->
__socket_proto_state_machine **** Time taken by kernel + network + brick
9.258 21.838 544.534 __socket_proto_state_machine ->
0.437 1.176 92.899 rpc_clnt_handle_reply ->
0.359 0.750 31.189 client4_0_readv_cbk -> UNWIND
Min Avg Max From To
5.145 11.891 326.291 __socket_proto_state_machine ->
1.943 4.036 160.085 rpcsvc_handle_rpc_call -> req queued
0.361 8.989 33658.985 req queued ->
0.125 0.591 47.348 rpcsvc_request_handler ->
2.064 5.373 643.653 server4_0_readv ->
14.610 33.766 641.871 server4_readv_resume ->
0.144 0.349 23.488 server4_readv_cbk ->
1.018 2.302 39.741 server_submit_reply ->
0.631 1.608 42.477 rpcsvc_submit_generic ->
6.013 22.009 48756.083 socket_submit_outgoing_msg -> msg sent
The total number of read requests is ~850000.
If we look only at the averages, we can see that the latency of the READV
on posix is very low, ~33 us (the brick is a NVME, so this is expected).
The total latency on the brick side is 90.914 us (sum of all avg
latencies). This seems consistent with the latency the client is seeing
between the time the message is sent and the time the answer arrives.
There's a 9 us gap that probably can be attributed to kernel processing
(I'm using loopback device, so no real network latencies here).
This means that operations suffer from a ~70 us delay. This is not
important when the operation takes in the order of milliseconds, like reads
and writes on spinning disks, but for very fast operations (bricks on NVME
and most of the metadata operations when the information is already cached
on the brick) this means a factor of x3 times slower. So when we have huge
amounts of operations with a very small latency, the overhead is huge.
If we also add the latencies on the client side, we are talking of about
146 us per read request, while the brick is able to serve them in 33 us.
This is 4.4 times slower.
It's interesting to see the amount of latency that sending a message
introduces: In the client side we have 11 us (it's a very small request
containing only some values, like offset and size). On the brick it also
takes 11 us to read it, and 22 us to send the answer. It's higher probably
because it also sends the 8KB block read from disk. On the client side we
can see that it also takes ~22 us to read the answer. It could be
attributed to system calls needed to send the data through the socket. In
this case we could try to send/recv all data with a single call. Anyway,
the minimum latency is smaller which makes me think that not all latency is
caused by system calls. We can also see that times on brick are very
unstable: the time needed to process requests queued to another thread
seems quite high sometimes. It's also high the maximum time that it could
take to send the answer.
For the particular pgbench case, the number of requests is huge. With a
scaling factor of just 100, it sends hundreds of thousands of requests. As
an example, this run sent:
- 520154 FSTAT (24.25 us avg latency on brick)
- 847900 READV (31.30 us avg latency on brick)
- 708850 WRITEV (64.39 us avg latency on brick)
Considering a total overhead of 113 us per request, we have 235 seconds of
overhead only on the network layer. Considering that the whole test took
476 seconds, this represents ~50% of the total time.
- We should never have a volume with caching-related xlators disabled.
The price we pay for it is too high. We need to make them work consistently
and aggressively to avoid as many requests as we can.
- We need to analyze client/server xlators deeper to see if we can avoid
some delays. However optimizing something that is already at the
microsecond level can be very hard.
- We need to determine what causes the fluctuations in brick side and
- This scenario is very similar to a smallfile/metadata workload, so
this is probably one important cause of its bad performance.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Gluster-devel