<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
A jumbo Ethernet frame can carry 9000 bytes. The Ethernet framing
overhead is at least 38 bytes, and the minimum TCP/IP header size is
40 bytes, so the two together are 78 bytes, or about 0.87% of the
jumbo frame. Gluster's RPC also adds a few bytes (not sure how many
and don't have time to test at the moment, but for the sake of
argument we'll just say 20 bytes); all together, that's still roughly
99% efficient. But if you write 20 bytes to a file (an extreme
example), you get your 20 bytes + RPC header + TCP/IP header +
Ethernet framing: 98 bytes of headers for 20 bytes of data. With
headers making up about 83% of the frame, that packet is only about
17% efficient. And that's per replica, so with replica 3 that's three
individual frames, each carrying 98 bytes of headers, to write the
same 20 bytes of data. Those go out to the three servers, which must
each respond. So you pay a network round trip + a tiny bit of latency
for stacking the three frames in the kernel + disk write latency.
That's a lot of overhead, and no networked storage can ever be as
fast as writing to a local disk for that pattern.<br>
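To put numbers on that, here is a quick back-of-the-envelope sketch using the component header sizes cited above (38 bytes of Ethernet framing, 40 bytes of TCP/IP, and the 20-byte RPC figure, which is a placeholder rather than a measurement):

```python
# Per-write framing overhead, from the header sizes cited above:
# 38 bytes Ethernet framing + 40 bytes minimum TCP/IP headers
# + a nominal 20 bytes for Gluster's RPC (a guess, not a measurement).
HEADERS = 38 + 40 + 20  # 98 bytes of overhead per write

def wire_efficiency(payload_bytes: int) -> float:
    """Fraction of bytes on the wire that are actual file data."""
    return payload_bytes / (payload_bytes + HEADERS)

print(f"{wire_efficiency(20):.1%}")    # a 20-byte write: roughly 17% data
print(f"{wire_efficiency(8902):.1%}")  # filling a 9000-byte jumbo frame: ~99%
```

The same 98 bytes of overhead is negligible on a full jumbo frame but dominates a tiny write.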
<br>
The question, however, is: does it need to be? Do you care whether a
single thread is slower in a clustered environment than it would be
on a local RAID stack? With good clustered engineering your workload
is handled by multiple threads spread over a cluster of workers.
Overall, you have more threads than you could run on a single
machine, which lets you service a greater total workload than you
could without a cluster. I refer to that as comparing apples to
orchards (<a moz-do-not-send="true"
href="https://joejulian.name/post/dont-get-stuck-micro-engineering-for-scale/">1</a>).<br>
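To illustrate apples versus orchards: if every small write is bound by a round trip, a single thread's throughput is just block size divided by round-trip time, while aggregate throughput scales with the number of concurrent writers. A toy model (the 100 µs per-write cost below is an assumed figure for illustration, not a Gluster measurement):

```python
RTT_SECONDS = 100e-6  # assumed end-to-end cost per replicated write (illustrative)

def throughput_mb_s(block_size: int, threads: int = 1) -> float:
    """Latency-bound model: each thread completes one block per round trip."""
    return threads * block_size / RTT_SECONDS / 1e6

print(f"{throughput_mb_s(4096):.0f} MB/s")      # one thread, 4 KB writes: ~41 MB/s
print(f"{throughput_mb_s(4096, 16):.0f} MB/s")  # sixteen threads: ~655 MB/s
```

In practice the aggregate tops out at network or disk bandwidth, but that is the point: the cluster's capacity shows up when many such threads run at once, not in any single one.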
<br>
<div class="moz-cite-prefix">On 04/13/18 10:58, Anastasia Belyaeva
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAL_rV+HYA1tWBPuSm8zc3x_rJjPeSgjpHCeRLPHNmTn5u9cqew@mail.gmail.com">
<div dir="ltr">Thanks a lot for your reply!
<div><br>
</div>
<div>You guessed it right though - mailing lists, various
blogs, documentation, videos, and even source code at this
point. Changing some of the options does make performance
slightly better, but nothing particularly groundbreaking.<br>
</div>
<div><br>
</div>
<div>So, if I understand you correctly, no one has yet managed
to get acceptable performance (relative to underlying hardware
capabilities) with smaller block sizes? Is there an
explanation for this?</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">2018-04-13 1:57 GMT+03:00 Vlad Kopylov
<span dir="ltr"><<a href="mailto:vladkopy@gmail.com"
target="_blank" moz-do-not-send="true">vladkopy@gmail.com</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>
<div>Guess you went through user lists and tried
something like this already <a
href="http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html"
target="_blank" moz-do-not-send="true">http://lists.gluster.org/<wbr>pipermail/gluster-users/2018-<wbr>April/033811.html</a><br>
</div>
I have the same exact setup, and below is as far as it got
after months of trial and error.<br>
</div>
We all have more or less the same setup and the same issue -
you can find posts like yours on a daily basis.<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Apr 11, 2018 at 3:03 PM,
Anastasia Belyaeva <span dir="ltr"><<a
href="mailto:anastasia.blv@gmail.com"
target="_blank" moz-do-not-send="true">anastasia.blv@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">Hello everybody!
<div><br>
</div>
<div>I have 3 gluster servers (<b>gluster 3.12.6,
Centos 7.2</b>; those are actually virtual
machines located on 3 separate physical
XenServer7.1 servers) </div>
<div><br>
</div>
<div>They are all connected via an InfiniBand network.
Iperf3 shows around <b>23 Gbit/s network
bandwidth </b>between each pair of them.</div>
<div><br>
</div>
<div>Each server has 3 HDDs put into a <b>stripe*3
thin pool (LVM2) </b>with a logical volume
created on top of it, formatted with <b>xfs</b>.
Gluster top reports the following throughput:</div>
<div><br>
</div>
<div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@fsnode2
~ $ gluster volume top r3vol write-perf bs 4096
count 524288 list-cnt 0<br>
Brick: fsnode2.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Throughput <b>631.82 MBps </b>time 3.3989 secs<br>
Brick: fsnode6.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Throughput <b>566.96 MBps </b>time 3.7877 secs<br>
Brick: fsnode4.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Throughput <b>546.65 MBps </b>time 3.9285 secs</blockquote>
</div>
<div><br>
</div>
<div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@fsnode2
~ $ gluster volume top r2vol write-perf bs 4096
count 524288 list-cnt 0<br>
Brick: fsnode2.ibnet:/data/glusterfs/<wbr>r2vol/brick1/brick<br>
Throughput <b>539.60 MBps </b>time 3.9798 secs<br>
Brick: fsnode4.ibnet:/data/glusterfs/<wbr>r2vol/brick1/brick<br>
Throughput <b>580.07 MBps </b>time 3.7021 secs</blockquote>
</div>
<div><br>
</div>
<div>And two <b>pure replicated ('replica 2' and
'replica 3')</b> volumes. *The 'replica 2'
volume is for testing purpose only.</div>
<div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Volume
Name: r2vol<br>
Type: Replicate<br>
Volume ID: 4748d0c0-6bef-40d5-b1ec-d30e10<wbr>cfddd9<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 1 x 2 = 2<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: fsnode2.ibnet:/data/glusterfs/<wbr>r2vol/brick1/brick<br>
Brick2: fsnode4.ibnet:/data/glusterfs/<wbr>r2vol/brick1/brick<br>
Options Reconfigured:<br>
nfs.disable: on<br>
</blockquote>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Volume
Name: r3vol<br>
Type: Replicate<br>
Volume ID: b0f64c28-57e1-4b9d-946b-26ed6b<wbr>499f29<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 1 x 3 = 3<br>
Transport-type: tcp<br>
Bricks:<br>
Brick1: fsnode2.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Brick2: fsnode4.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Brick3: fsnode6.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
Options Reconfigured:<br>
nfs.disable: on</blockquote>
</div>
<div><br>
</div>
<div><br>
</div>
<div><b>The client </b>is also gluster 3.12.6, a CentOS
7.3 virtual machine, using a <b>FUSE mount</b> </div>
<div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@centos7u3-nogdesktop2
~ $ mount |grep gluster<br>
gluster-host.ibnet:/r2vol on /mnt/gluster/r2
type fuse.glusterfs
(rw,relatime,user_id=0,group_i<wbr>d=0,default_permissions,allow_<wbr>other,max_read=131072)<br>
gluster-host.ibnet:/r3vol on /mnt/gluster/r3
type fuse.glusterfs
(rw,relatime,user_id=0,group_i<wbr>d=0,default_permissions,allow_<wbr>other,max_read=131072)</blockquote>
</div>
<div><br>
</div>
<div><br>
</div>
<div><b>The problem </b>is that there is a
significant performance loss with smaller block
sizes. For example: </div>
<div><br>
</div>
<div><u>4K block size</u></div>
<div>[replica 3 volume]</div>
<div>
<div>root@centos7u3-nogdesktop2 ~ $ dd
if=/dev/zero of=/mnt/gluster/r3/file$RANDOM
bs=4096 count=262144</div>
<div>262144+0 records in</div>
<div>262144+0 records out</div>
<div>1073741824 bytes (1.1 GB) copied, 11.2207 s,
<b>95.7 MB/s</b></div>
</div>
<div><br>
</div>
<div>[replica 2 volume]<br>
</div>
<div>
<div>root@centos7u3-nogdesktop2 ~ $ dd
if=/dev/zero of=/mnt/gluster/r2/file$RANDOM
bs=4096 count=262144</div>
<div>262144+0 records in</div>
<div>262144+0 records out</div>
<div>1073741824 bytes (1.1 GB) copied, 12.0149 s,
<b>89.4 MB/s</b></div>
</div>
<div><b><br>
</b></div>
<div><u>512K block size</u><b><br>
</b></div>
<div>[replica 3 volume]<u><br>
</u></div>
<div>
<div>root@centos7u3-nogdesktop2 ~ $ dd
if=/dev/zero of=/mnt/gluster/r3/file$RANDOM
bs=512K count=2048</div>
<div>2048+0 records in</div>
<div>2048+0 records out</div>
<div>1073741824 bytes (1.1 GB) copied, 5.27207 s,
<b>204 MB/s</b></div>
</div>
<div><br>
</div>
<div>[replica 2 volume]<br>
</div>
<div>
<div>root@centos7u3-nogdesktop2 ~ $ dd
if=/dev/zero of=/mnt/gluster/r2/file$RANDOM
bs=512K count=2048</div>
<div>2048+0 records in</div>
<div>2048+0 records out</div>
<div>1073741824 bytes (1.1 GB) copied, 4.22321 s,
<b>254 MB/s</b></div>
</div>
<div><b><br>
</b></div>
<div>With a bigger block size it's still not where I
expect it to be, but at least it starts to make
some sense.</div>
<div><br>
</div>
<div>I've been trying to solve this for a very long
time with no luck. </div>
<div>I've already tried both kernel tuning
(different 'tuned' profiles and the ones
recommended in the "Linux Kernel Tuning" section)
and tweaking gluster volume options, including
write-behind/flush-behind/writ<wbr>e-behind-window-size.</div>
<div>The latter, to my surprise, didn't make any
difference. At first I thought it was a buffering
issue, but it turns out writes are buffered, just
not very efficiently (at least that's what it
looks like in the <b>gluster profile
output</b>).</div>
<div><br>
</div>
<div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@fsnode2
~ $ gluster volume profile r3vol info clear<br>
...<br>
Cleared stats.</blockquote>
<div><br>
</div>
<div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@centos7u3-nogdesktop2
~ $ dd if=/dev/zero
of=/mnt/gluster/r3/file$RANDOM bs=4096
count=262144<br>
262144+0 records in<br>
262144+0 records out<br>
1073741824 bytes (1.1 GB) copied, 10.9743 s,
97.8 MB/s</blockquote>
</div>
<div> </div>
<div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">root@fsnode2
~ $ gluster volume profile r3vol info<br>
Brick: fsnode2.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
------------------------------<wbr>-------------------------<br>
Cumulative Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 18.00 us 18.00 us
18.00 us 1 STATFS<br>
0.00 20.50 us 11.00 us
30.00 us 2 FLUSH<br>
0.00 22.50 us 17.00 us
28.00 us 2 FINODELK<br>
0.01 76.50 us 65.00 us
88.00 us 2 FXATTROP<br>
0.01 177.00 us 177.00 us
177.00 us 1 CREATE<br>
0.02 56.14 us 23.00 us
128.00 us 7 LOOKUP<br>
0.02 259.00 us 20.00 us
498.00 us 2 ENTRYLK<br>
99.94 59.23 us 17.00 us
10914.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes<br>
Interval 0 Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 18.00 us 18.00 us
18.00 us 1 STATFS<br>
0.00 20.50 us 11.00 us
30.00 us 2 FLUSH<br>
0.00 22.50 us 17.00 us
28.00 us 2 FINODELK<br>
0.01 76.50 us 65.00 us
88.00 us 2 FXATTROP<br>
0.01 177.00 us 177.00 us
177.00 us 1 CREATE<br>
0.02 56.14 us 23.00 us
128.00 us 7 LOOKUP<br>
0.02 259.00 us 20.00 us
498.00 us 2 ENTRYLK<br>
99.94 59.23 us 17.00 us
10914.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes<br>
Brick: fsnode6.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
------------------------------<wbr>-------------------------<br>
Cumulative Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 33.00 us 33.00 us
33.00 us 1 STATFS<br>
0.00 22.50 us 13.00 us
32.00 us 2 ENTRYLK<br>
0.00 32.00 us 26.00 us
38.00 us 2 FLUSH<br>
0.01 47.50 us 16.00 us
79.00 us 2 FINODELK<br>
0.01 157.00 us 157.00 us
157.00 us 1 CREATE<br>
0.01 92.00 us 70.00 us
114.00 us 2 FXATTROP<br>
0.03 72.57 us 39.00 us
121.00 us 7 LOOKUP<br>
99.94 47.97 us 15.00 us
1598.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes<br>
Interval 0 Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 33.00 us 33.00 us
33.00 us 1 STATFS<br>
0.00 22.50 us 13.00 us
32.00 us 2 ENTRYLK<br>
0.00 32.00 us 26.00 us
38.00 us 2 FLUSH<br>
0.01 47.50 us 16.00 us
79.00 us 2 FINODELK<br>
0.01 157.00 us 157.00 us
157.00 us 1 CREATE<br>
0.01 92.00 us 70.00 us
114.00 us 2 FXATTROP<br>
0.03 72.57 us 39.00 us
121.00 us 7 LOOKUP<br>
99.94 47.97 us 15.00 us
1598.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes<br>
Brick: fsnode4.ibnet:/data/glusterfs/<wbr>r3vol/brick1/brick<br>
------------------------------<wbr>-------------------------<br>
Cumulative Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 58.00 us 58.00 us
58.00 us 1 STATFS<br>
0.00 38.00 us 38.00 us
38.00 us 2 ENTRYLK<br>
0.01 59.00 us 32.00 us
86.00 us 2 FLUSH<br>
0.01 81.00 us 33.00 us
129.00 us 2 FINODELK<br>
0.01 91.50 us 73.00 us
110.00 us 2 FXATTROP<br>
0.01 239.00 us 239.00 us
239.00 us 1 CREATE<br>
0.04 103.14 us 63.00 us
210.00 us 7 LOOKUP<br>
99.92 52.99 us 16.00 us
11289.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes<br>
Interval 0 Stats:<br>
Block Size: 4096b+
8192b+ 16384b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 1576
4173 19605<br>
Block Size: 32768b+
65536b+ 131072b+<br>
No. of Reads: 0
0 0<br>
No. of Writes: 7777
1847 657<br>
%-latency Avg-latency Min-Latency
Max-Latency No. of calls Fop<br>
--------- ----------- -----------
----------- ------------ ----<br>
0.00 0.00 us 0.00 us
0.00 us 1 RELEASE<br>
0.00 58.00 us 58.00 us
58.00 us 1 STATFS<br>
0.00 38.00 us 38.00 us
38.00 us 2 ENTRYLK<br>
0.01 59.00 us 32.00 us
86.00 us 2 FLUSH<br>
0.01 81.00 us 33.00 us
129.00 us 2 FINODELK<br>
0.01 91.50 us 73.00 us
110.00 us 2 FXATTROP<br>
0.01 239.00 us 239.00 us
239.00 us 1 CREATE<br>
0.04 103.14 us 63.00 us
210.00 us 7 LOOKUP<br>
99.92 52.99 us 16.00 us
11289.00 us 35635 WRITE<br>
Duration: 38 seconds<br>
Data Read: 0 bytes<br>
Data Written: 1073741824 bytes</blockquote>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>At this point I've officially run out of ideas
about where to look next. So any help, suggestions,
or pointers are highly appreciated! </div>
<span class="m_6873069489282419939HOEnZb"><font
color="#888888">
<div><br>
</div>
<div>
<div>-- <br>
<div
class="m_6873069489282419939m_-168076092818363674gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div style="font-size:12.8px">Best
regards,</div>
<div style="font-size:12.8px">Anastasia
Belyaeva</div>
</div>
<div><br>
</div>
<div><br>
</div>
<div>
<div style="font-size:12.8px"><br>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</font></span></div>
<br>
______________________________<wbr>_________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org"
target="_blank" moz-do-not-send="true">Gluster-users@gluster.org</a><br>
<a
href="http://lists.gluster.org/mailman/listinfo/gluster-users"
rel="noreferrer" target="_blank"
moz-do-not-send="true">http://lists.gluster.org/mailm<wbr>an/listinfo/gluster-users</a><br>
</blockquote>
</div>
<br>
</div>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div style="font-size:12.8px">Best regards,</div>
<div style="font-size:12.8px">Anastasia Belyaeva</div>
</div>
<div><span style="font-size:12.8px"><br>
</span></div>
<div><span style="font-size:12.8px">С уважением,</span><br>
</div>
<div>Анастасия Беляева<br>
</div>
<div><br>
</div>
<div>
<div style="font-size:12.8px"><br>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Gluster-users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Gluster-users@gluster.org">Gluster-users@gluster.org</a>
<a class="moz-txt-link-freetext" href="http://lists.gluster.org/mailman/listinfo/gluster-users">http://lists.gluster.org/mailman/listinfo/gluster-users</a></pre>
</blockquote>
<br>
</body>
</html>