[Gluster-users] Poor performance on a server-class system vs. desktop

Fri Nov 27 08:40:49 UTC 2020

Top posting, as my observations are general: they don't speak specifically
to the problem at hand, but to where the bottlenecks are and what our ideas
are to improve them.

Thanks Dmitry for a good thread :-)

I will break this into a long answer, but first the short answer to the
question.

Does a single-threaded user app benefit much from more RAM/CPU? - *NO.*
So, how is distributed storage performance measured? - By running as many
threads (and different client mounts) as possible, to saturate the network
on the servers.
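
As a rough illustration of that measurement approach, here is a minimal
Python sketch (not fio, and not part of any Gluster tooling) that runs
several writer threads against one directory and reports the aggregate rate
of synced 4k writes. The function name run_writers, the file names and the
default block size are assumptions made for this example only.

```python
import os
import tempfile
import threading
import time


def run_writers(directory, num_threads, writes_per_thread, block_size=4096):
    """Run num_threads writers in parallel; return aggregate writes per second."""
    payload = b"\0" * block_size

    def worker(index):
        # One file per thread so the writers do not share a file offset.
        path = os.path.join(directory, f"writer-{index}.dat")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            for _ in range(writes_per_thread):
                os.write(fd, payload)
                os.fsync(fd)  # force each write through the whole storage path
        finally:
            os.close(fd)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return (num_threads * writes_per_thread) / elapsed


if __name__ == "__main__":
    # A temporary directory stands in for a mounted volume in this sketch.
    with tempfile.TemporaryDirectory() as target:
        for n in (1, 2, 4):
            print(n, "threads:", round(run_writers(target, n, 25)), "writes/s")
```

To exercise a Gluster volume, pass a directory on the client mount instead
of the temporary directory, and raise the thread count (and the number of
client mounts) until the rate stops growing. os.write() and os.fsync()
release Python's GIL while they block, so the threads can overlap their I/O.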

Now for a longer look at performance:

First of all, when we compare the performance of local storage vs. network
storage vs. distributed storage, multiple things need to be considered:

Local storage (let's say NVMe/SSD): User app -> kernel (i.e., a syscall) ->
drive access. (This is one way; the call returns along the same path.)

Network storage (say NFS): User app -> kernel (NFS client through syscall)
-> network call -> server process (nfsd) -> kernel (syscall on the storage
machine) -> drive access. (The reverse path also needs to be traversed to
complete the call.)

Distributed storage (say GlusterFS): User app -> kernel (syscall to fuse)
-> glusterfs client (callback from fuse) -> network call -> glusterfsd ->
kernel (syscall) -> drive access. (Reverse path for completing the call.)
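
Because a synchronous call has to travel the whole path and back before the
next one can start, a single stream's IOPS is simply the inverse of the
round-trip latency. The small sketch below works that arithmetic through for
the ~600 IOPS figure reported in the quoted message further down; the
function names are made up for this example, and the aggregate figure is an
idealized upper bound that assumes independent streams with no contention.

```python
def per_op_latency_ms(iops):
    """Latency of one synchronous operation, given single-stream IOPS."""
    return 1000.0 / iops


def ideal_aggregate_iops(per_op_ms, streams):
    """Upper bound with `streams` requests in flight and no contention."""
    return streams * 1000.0 / per_op_ms


if __name__ == "__main__":
    latency = per_op_latency_ms(600)
    print(f"{latency:.2f} ms per synced 4k write")        # 1.67 ms
    print(f"{ideal_aggregate_iops(latency, 16):.0f} IOPS")  # 9600 IOPS
```

So roughly 1.67 ms is spent per synced write across all layers, and a faster
CPU or more RAM helps a single stream only as far as it shortens that one
round trip. More streams raise the total only until some shared resource
(network, disk, or locks) saturates.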

Historically, disk and network were the slowest parts here, so the 'kernel'
part was almost non-existent as a bottleneck. Gluster did well with
aggregation and showed a linear performance improvement as long as this was
true, i.e., as long as network and disk made up a significant share of the
bottleneck in your storage stack. The linear scale-out holds even today with
NVMe and faster networks, but the gap between individual local storage
performance and glusterfs performance has grown, mainly because of the extra
layers a request traverses.

What we are observing now with 100Gbps networks and NVMe drives is that most
of the bottlenecks in the network layer and disk are going away, and the
bottleneck becomes visible in the way we do certain operations inside
glusterfs. Of late, we are noticing that the bottlenecks are in the number
of system calls we make as part of a single call the user makes. For
example, if you enable all the features of gluster, a single open call
translates into 10s of calls on the disk (stat(), several getxattr()s,
open()), which adds delay. Also, for a process which utilizes many CPU
cores, there is a penalty whenever synchronization happens (and being a
distributed, multi-threaded, multi-client architecture, glusterfs uses
multiple locks).
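
To get a feel for that system call fan-out, the following Linux-only Python
sketch times a bare open() against an open() preceded by a stat() and a
handful of getxattr() calls. It is only a local approximation of the
pattern: the attribute names under user.example.* are hypothetical
placeholders, not the attributes glusterfs actually reads, and the count of
ten is an arbitrary choice.

```python
import os
import tempfile
import time


def open_only(path):
    """One open() (plus the matching close())."""
    fd = os.open(path, os.O_RDONLY)
    os.close(fd)


def open_with_metadata(path, xattr_names):
    """stat() + one getxattr() per name + open(), loosely mimicking the fan-out."""
    os.stat(path)
    for name in xattr_names:
        try:
            os.getxattr(path, name)
        except OSError:
            pass  # a missing or unsupported attribute still costs a syscall
    fd = os.open(path, os.O_RDONLY)
    os.close(fd)


def mean_microseconds(func, repeat, *args):
    """Average wall-clock time of func(*args) in microseconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    return (time.perf_counter() - start) / repeat * 1e6


if __name__ == "__main__":
    # Hypothetical attribute names, used only to generate getxattr() calls.
    names = [f"user.example.{i}" for i in range(10)]
    with tempfile.NamedTemporaryFile() as tmp:
        print("open only:       ", mean_microseconds(open_only, 2000, tmp.name))
        print("stat+xattrs+open:", mean_microseconds(
            open_with_metadata, 2000, tmp.name, names))
```

On a local filesystem each extra call costs very little, but the overhead
is paid on every operation, which matters once disk and network are no
longer the dominant cost.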

We are working towards a unified caching translator, which would reduce
access to disk and therefore the number of system calls made to it. We are
also aware that the network layer is a bottleneck (with XDR formatting and
the way we process RPC packets). But taking up network layer optimizations
(and also using RDMA effectively) is a larger task. We are looking for
volunteers to pick up this network enhancement task, which would help a lot.

Now, coming back to the subject: with more CPUs, the same test shows a
smaller performance gain because lock contention accounts for a larger share
of the bottleneck than on your laptop. Can you restrict the number of cores
glusterfsd uses to 4 and rerun the same test?
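
One possible way to apply such a restriction to already-running brick
processes on Linux is sketched below in Python. The helper names are
invented for this example, and the choice of CPUs 0-3 is an assumption (any
four cores would do). Affinity is a per-thread attribute on Linux, so the
sketch walks every thread of the process rather than only its main thread.

```python
import os


def find_pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm equals name (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def pin_process(pid, cpus):
    """Restrict every existing thread of pid to the given set of CPUs."""
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            os.sched_setaffinity(int(tid), cpus)
        except ProcessLookupError:
            continue  # the thread exited in the meantime


if __name__ == "__main__":
    # Changing another user's process requires root privileges.
    for pid in find_pids_by_name("glusterfsd"):
        pin_process(pid, {0, 1, 2, 3})
        print(pid, sorted(os.sched_getaffinity(pid)))
```

Threads created after the call inherit the affinity of the thread that
creates them, so pinning all existing threads is normally enough for the
duration of a test run.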

Regards,
Amar

On Fri, Nov 27, 2020 at 11:23 AM Dmitry Antipov <dmantipov at yandex.ru> wrote:

> On 11/26/20 8:14 PM, Gionatan Danti wrote:
>
> > So I think you simply are CPU limited. I remember doing some tests with
> > loopback RAM disks and finding that Gluster used 100% CPU (ie: full load
> > on an entire core) when doing 4K random writes. Side note: using
> > synchronized (ie: fsync) 4k writes, I only get ~600 IOPs even when
> > running both bricks on the same machine and backing them with RAM disks
> > (in other words, with no network or disk bottleneck).
>
> Thanks, it seems you're right. Running local replica 3 volume on 3x1Gb
> ramdisks, I'm seeing:
>
> top - 08:44:35 up 1 day, 11:51,  1 user,  load average: 2.34, 1.94, 1.00
> Tasks: 237 total,   2 running, 235 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 38.7 us, 29.4 sy,  0.0 ni, 23.6 id,  0.0 wa,  0.4 hi,  7.9 si,  0.0 st
> MiB Mem :  15889.8 total,   1085.7 free,   1986.3 used,  12817.8 buff/cache
> MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12307.3 avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> 63651 root      20   0  664124  41676   9600 R 166.7   0.3   0:24.20 fio
> 63282 root      20   0 1235336  21484   8768 S 120.4   0.1   2:43.73 glusterfsd
> 63298 root      20   0 1235368  20512   8856 S 120.0   0.1   2:42.43 glusterfsd
> 63314 root      20   0 1236392  21396   8684 S 119.8   0.1   2:41.94 glusterfsd
>
> So, 32-core server-class system with a lot of RAM can't perform much
> faster for an individual I/O client - it just scales better if there are
> a lot of clients, right?
>
> Dmitry

-- 
--
https://kadalu.io
Container Storage made easy!