<div dir="ltr"><div dir="ltr">Hi Dmitry,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 26, 2020 at 10:44 AM Dmitry Antipov &lt;<a href="mailto:dmantipov@yandex.ru">dmantipov@yandex.ru</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">BTW, did someone try to profile the brick process? I do, and got this<br>
for the default replica 3 volume (&#39;perf record -F 2500 -g -p [PID]&#39;):<br>
<br>
+    3.29%     0.02%  glfs_epoll001    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    3.17%     0.01%  glfs_epoll001    [kernel.kallsyms]      [k] do_syscall_64<br>
+    3.17%     0.02%  glfs_epoll000    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    3.06%     0.02%  glfs_epoll000    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.75%     0.01%  glfs_iotwr00f    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.74%     0.01%  glfs_iotwr00b    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.74%     0.01%  glfs_iotwr001    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.73%     0.00%  glfs_iotwr003    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.72%     0.00%  glfs_iotwr000    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.72%     0.01%  glfs_iotwr00c    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.70%     0.01%  glfs_iotwr003    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.69%     0.00%  glfs_iotwr001    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.69%     0.01%  glfs_iotwr008    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.68%     0.00%  glfs_iotwr00b    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.68%     0.00%  glfs_iotwr00c    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.68%     0.00%  glfs_iotwr00f    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.68%     0.01%  glfs_iotwr000    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.67%     0.00%  glfs_iotwr00a    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.65%     0.00%  glfs_iotwr008    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.64%     0.00%  glfs_iotwr00e    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.64%     0.01%  glfs_iotwr00d    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.63%     0.01%  glfs_iotwr00a    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.63%     0.01%  glfs_iotwr007    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.63%     0.00%  glfs_iotwr005    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.63%     0.01%  glfs_iotwr006    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.63%     0.00%  glfs_iotwr009    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.61%     0.01%  glfs_iotwr004    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.61%     0.01%  glfs_iotwr00e    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.60%     0.00%  glfs_iotwr006    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.59%     0.00%  glfs_iotwr005    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.59%     0.00%  glfs_iotwr00d    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.58%     0.00%  glfs_iotwr002    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.58%     0.01%  glfs_iotwr007    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.58%     0.00%  glfs_iotwr004    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.57%     0.00%  glfs_iotwr009    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.54%     0.00%  glfs_iotwr002    [kernel.kallsyms]      [k] do_syscall_64<br>
+    1.65%     0.00%  glfs_epoll000    [unknown]              [k] 0x0000000000000001<br>
+    1.65%     0.00%  glfs_epoll001    [unknown]              [k] 0x0000000000000001<br>
+    1.48%     0.01%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    1.44%     0.08%  glfs_rpcrqhnd    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] pthread_cond_wait@@GLIBC_2.3.2<br>
+    1.40%     0.01%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] do_syscall_64<br>
+    1.36%     0.01%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] __x64_sys_futex<br>
+    1.35%     0.03%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] do_futex<br>
+    1.34%     0.01%  glfs_iotwr00a    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.32%     0.00%  glfs_iotwr00a    [kernel.kallsyms]      [k] __x64_sys_pwrite64<br>
+    1.32%     0.00%  glfs_iotwr001    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.31%     0.01%  glfs_iotwr002    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.31%     0.00%  glfs_iotwr00b    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.31%     0.01%  glfs_iotwr00a    [kernel.kallsyms]      [k] vfs_write<br>
+    1.30%     0.00%  glfs_iotwr001    [kernel.kallsyms]      [k] __x64_sys_pwrite64<br>
+    1.30%     0.00%  glfs_iotwr008    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.30%     0.00%  glfs_iotwr00a    [kernel.kallsyms]      [k] new_sync_write<br>
+    1.30%     0.00%  glfs_iotwr00c    <a href="http://libpthread-2.32.so" rel="noreferrer" target="_blank">libpthread-2.32.so</a>     [.] __libc_pwrite64<br>
+    1.29%     0.00%  glfs_iotwr00a    [kernel.kallsyms]      [k] xfs_file_write_iter<br>
+    1.29%     0.01%  glfs_iotwr00a    [kernel.kallsyms]      [k] xfs_file_dio_aio_write<br>
<br>
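[For reference, a profile like the one above can be captured and summarized roughly as follows; the process name and report options here are illustrative, not necessarily the exact ones used:]<br>
<br>
# locate the brick process and record a call-graph profile while the workload runs<br>
pgrep -af glusterfsd<br>
perf record -F 2500 -g -p [PID]<br>
# summarize the samples per thread, DSO and symbol<br>
perf report --sort comm,dso,symbol<br>
<br>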
And on replica 3 with storage.linux-aio enabled:<br>
<br>
+   11.76%     0.05%  glfs_posixaio    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+   11.42%     0.01%  glfs_posixaio    [kernel.kallsyms]      [k] do_syscall_64<br>
+    8.81%     0.00%  glfs_posixaio    [unknown]              [k] 0x00000000baadf00d<br>
+    8.81%     0.00%  glfs_posixaio    [unknown]              [k] 0x0000000000000004<br>
+    8.74%     0.06%  glfs_posixaio    <a href="http://libc-2.32.so" rel="noreferrer" target="_blank">libc-2.32.so</a>           [.] __GI___writev<br>
+    8.33%     0.02%  glfs_posixaio    [kernel.kallsyms]      [k] do_writev<br>
+    8.23%     0.03%  glfs_posixaio    [kernel.kallsyms]      [k] vfs_writev<br>
+    8.12%     0.05%  glfs_posixaio    [kernel.kallsyms]      [k] do_iter_write<br>
+    8.02%     0.05%  glfs_posixaio    [kernel.kallsyms]      [k] do_iter_readv_writev<br>
+    7.96%     0.04%  glfs_posixaio    [kernel.kallsyms]      [k] sock_write_iter<br>
+    7.92%     0.01%  glfs_posixaio    [kernel.kallsyms]      [k] sock_sendmsg<br>
+    7.86%     0.01%  glfs_posixaio    [kernel.kallsyms]      [k] tcp_sendmsg<br>
+    7.28%     0.15%  glfs_posixaio    [kernel.kallsyms]      [k] tcp_sendmsg_locked<br>
+    6.49%     0.01%  glfs_posixaio    [kernel.kallsyms]      [k] __tcp_push_pending_frames<br>
+    6.48%     0.10%  glfs_posixaio    [kernel.kallsyms]      [k] tcp_write_xmit<br>
+    6.31%     0.02%  glfs_posixaio    [unknown]              [k] 0000000000000000<br>
+    6.05%     0.13%  glfs_posixaio    [kernel.kallsyms]      [k] __tcp_transmit_skb<br>
+    5.71%     0.06%  glfs_posixaio    [kernel.kallsyms]      [k] __ip_queue_xmit<br>
+    4.15%     0.03%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    4.07%     0.08%  glfs_posixaio    [kernel.kallsyms]      [k] ip_finish_output2<br>
+    3.75%     0.02%  glfs_posixaio    [kernel.kallsyms]      [k] asm_call_sysvec_on_stack<br>
+    3.75%     0.01%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] do_syscall_64<br>
+    3.70%     0.03%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] __x64_sys_futex<br>
+    3.68%     0.06%  glfs_posixaio    [kernel.kallsyms]      [k] __local_bh_enable_ip<br>
+    3.67%     0.07%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] do_futex<br>
+    3.62%     0.05%  glfs_posixaio    [kernel.kallsyms]      [k] do_softirq<br>
+    3.61%     0.01%  glfs_posixaio    [kernel.kallsyms]      [k] do_softirq_own_stack<br>
+    3.59%     0.06%  glfs_posixaio    [kernel.kallsyms]      [k] __softirqentry_text_start<br>
+    3.44%     0.06%  glfs_posixaio    [kernel.kallsyms]      [k] net_rx_action<br>
+    3.34%     0.04%  glfs_posixaio    [kernel.kallsyms]      [k] process_backlog<br>
+    3.28%     0.02%  glfs_posixaio    [kernel.kallsyms]      [k] __netif_receive_skb_one_core<br>
+    3.08%     0.02%  glfs_epoll000    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    3.02%     0.03%  glfs_epoll001    [kernel.kallsyms]      [k] entry_SYSCALL_64_after_hwframe<br>
+    2.97%     0.01%  glfs_epoll000    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.89%     0.01%  glfs_epoll001    [kernel.kallsyms]      [k] do_syscall_64<br>
+    2.73%     0.08%  glfs_posixaio    [kernel.kallsyms]      [k] nf_hook_slow<br>
+    2.25%     0.04%  glfs_posixaio    <a href="http://libc-2.32.so" rel="noreferrer" target="_blank">libc-2.32.so</a>           [.] fgetxattr<br>
+    2.16%     0.14%  glfs_rpcrqhnd    [kernel.kallsyms]      [k] futex_wake<br>
<br>
According to these tables, the brick process is just a thin wrapper for the system calls<br>
and the kernel network subsystem behind them.</blockquote><div><br></div><div>Mostly. However, there&#39;s one issue that isn&#39;t so obvious in the perf capture but that we have identified in other setups: when the system calls are processed very fast (as should be the case when NVMe is used), the io-threads&#39; thread pool is constantly processing the request queue. This queue is currently synchronized with a mutex, and the small latency per request makes the contention on that mutex quite high. This means that the thread pool tends to be serialized by the lock, which kills most of the parallelism and also causes a lot of additional system calls (increased CPU utilization and higher latencies).</div><div><br></div><div>For now, the only way I know of to minimize this effect is to reduce the number of threads in the io-threads pool. It&#39;s hard to tell what a good number would be; it depends on many things. But you can run some tests with different values to find the best one (after changing the number of threads, it&#39;s better to restart the volume).</div><div><br></div><div>Reducing the number of threads reduces the CPU power that Gluster can use, but it also reduces the contention, so it&#39;s expected (though not guaranteed) that at some point, even with fewer threads, performance could be a bit better.</div><div><br></div><div>Regards,<br></div><div><br></div><div>Xavi</div><div><br></div>
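<div>[A minimal sketch of the tuning suggested above, assuming a volume named VOLNAME; the value 8 is only a starting point to experiment with, and the io-threads pool size is controlled by the performance.io-thread-count option (default 16):]</div><div><br></div><div># reduce the io-threads pool size and restart the volume (VOLNAME and 8 are placeholders)<br>gluster volume set VOLNAME performance.io-thread-count 8<br>gluster volume stop VOLNAME<br>gluster volume start VOLNAME<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">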
<br>
For anyone who may be interested, the following replica 3 volume options:<br>
<br>
performance.io-cache-pass-through: on<br>
performance.iot-pass-through: on<br>
performance.md-cache-pass-through: on<br>
performance.nl-cache-pass-through: on<br>
performance.open-behind-pass-through: on<br>
performance.read-ahead-pass-through: on<br>
performance.readdir-ahead-pass-through: on<br>
performance.strict-o-direct: on<br>
features.ctime: off<br>
features.selinux: off<br>
performance.write-behind: off<br>
performance.open-behind: off<br>
performance.quick-read: off<br>
storage.linux-aio: on<br>
storage.fips-mode-rchecksum: off<br>
<br>
are likely to improve the I/O performance of GFAPI clients (fio with the gfapi and gfapi_async<br>
engines, qemu -drive file=gluster://XXX, etc.) by ~20%. But beware that they may hurt the I/O<br>
performance of FUSE clients.<br>
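<br>
[These settings are applied per volume with the gluster CLI; a minimal sketch, assuming a volume named VOLNAME and showing only a couple of the options listed above:]<br>
<br>
# apply the options to the volume (repeat for each setting in the list)<br>
gluster volume set VOLNAME performance.io-cache-pass-through on<br>
gluster volume set VOLNAME storage.linux-aio on<br>
# verify the resulting configuration<br>
gluster volume info VOLNAME<br>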
<br>
Dmitry<br>
</blockquote></div></div>