[Gluster-users] Unreasonably poor performance of replicated volumes

Joe Julian joe at julianfamily.org
Sat Apr 14 16:19:04 UTC 2018


A jumbo ethernet frame can be 9000 bytes. The ethernet frame header is
at least 38 bytes, and the minimum TCP/IP header size is 40 bytes;
combined, that's 78 bytes, or about 0.87% of the jumbo frame. Gluster's
RPC also adds a few bytes (not sure how many and don't have time to test
at the moment but for the sake of argument we'll just say 20 bytes) but,
all together, it's about 99% efficient. If you write 20 bytes to a file
(for an extreme example) then you'll have your 20 bytes+RPC
header+TCP/IP header+ethernet header: 98 bytes of headers for 20 bytes
of data, 118 bytes on the wire. Headers being about 83% of the frame
means that your packet is only about 17% efficient. That's per replica,
so if you have a replica 3 that's three individual frames with 98 bytes
of headers each to write the same 20 bytes of data. Those go out to the
three servers and the client waits for their responses. So you have a
network round trip + a tiny bit of latency for stacking the three frames
in the kernel + disk write latency. That's a lot of overhead, and no
networked storage can ever be as fast as writing to a local disk.
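
As a rough sketch of that arithmetic in Python: the header sizes are the ones quoted above, the 20-byte RPC header is the same guess rather than a measured value, and the function names are made up for illustration.

```python
# Header sizes from the paragraph above. RPC_HEADER is a guess, not a
# measured value for Gluster's RPC.
ETHERNET_OVERHEAD = 38   # bytes, minimum ethernet framing overhead
TCP_IP_HEADER = 40       # bytes, minimum TCP/IP header
RPC_HEADER = 20          # bytes, assumed for the sake of argument
HEADERS = ETHERNET_OVERHEAD + TCP_IP_HEADER + RPC_HEADER


def efficiency(payload_bytes: int) -> float:
    """Fraction of the bytes on the wire that are file data."""
    return payload_bytes / (payload_bytes + HEADERS)


def wire_bytes(payload_bytes: int, replicas: int) -> int:
    """Bytes the client sends when every replica gets its own frame."""
    return replicas * (payload_bytes + HEADERS)


if __name__ == "__main__":
    # 20 bytes is the extreme example, 4096 is the dd block size from the
    # original report, and the last value fills a 9000-byte jumbo frame.
    for payload in (20, 4096, 9000 - HEADERS):
        print(f"{payload:5d} B payload: {efficiency(payload):6.1%} efficient, "
              f"{wire_bytes(payload, 3)} B sent for replica 3")
```

This only models bytes on the wire. It says nothing about the round trip and disk latency, which matter more for small writes.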

The question, however, is does it need to be? Do you care if a single 
thread is slower in a clustered environment than it would be on a local 
RAID stack? In good clustered engineering your workload will be handled
by multiple threads over a cluster of workers. Overall, you have more 
threads than you could have on a single machine. This allows servicing a 
greater overall workload than you could without a cluster. I refer to 
that as comparing apples to orchards (1 
<https://joejulian.name/post/dont-get-stuck-micro-engineering-for-scale/>).
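
To compare like with like, measure aggregate throughput across several concurrent writers rather than a single dd stream. Below is a minimal sketch, assuming Python 3 on the client. The function names and writer counts are illustrative, and the temporary directory stands in for a Gluster mount such as /mnt/gluster/r3. Threads on one client only approximate a real cluster of workers on separate machines.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_file(path: str, block_size: int, count: int) -> int:
    """Write `count` zero-filled blocks of `block_size` bytes; return bytes written."""
    block = bytes(block_size)
    written = 0
    # buffering=0 issues one write() call per block, like dd with bs=block_size.
    with open(path, "wb", buffering=0) as f:
        for _ in range(count):
            written += f.write(block)
        os.fsync(f.fileno())
    return written


def aggregate_throughput(directory: str, writers: int,
                         block_size: int, count: int) -> float:
    """Run `writers` concurrent writers; return combined MB/s (10**6 bytes)."""
    paths = [os.path.join(directory, f"writer{i}.dat") for i in range(writers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=writers) as pool:
        total = sum(pool.map(lambda p: write_file(p, block_size, count), paths))
    elapsed = time.perf_counter() - start
    for p in paths:
        os.remove(p)  # clean up the test files
    return total / elapsed / 1e6


if __name__ == "__main__":
    # Replace the temporary directory with the mount point under test.
    with tempfile.TemporaryDirectory() as d:
        for n in (1, 2, 4):
            rate = aggregate_throughput(d, n, 4096, 2560)
            print(f"{n} writer(s): {rate:.1f} MB/s aggregate")
```

If the aggregate figure keeps climbing as writers are added, the single-stream number is a latency limit rather than a capacity limit.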

On 04/13/18 10:58, Anastasia Belyaeva wrote:
> Thanks a lot for your reply!
>
> You guessed it right though - mailing lists, various blogs,
> documentation, videos and even source code at this point. Changing
> some of the options does make performance slightly better, but
> nothing particularly groundbreaking.
>
> So, if I understand you correctly, no one has yet managed to get 
> acceptable performance (relative to underlying hardware capabilities) 
> with smaller block sizes? Is there an explanation for this?
>
>
> 2018-04-13 1:57 GMT+03:00 Vlad Kopylov <vladkopy at gmail.com>:
>
>     Guess you went through user lists and tried something like this
>     already
>     http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html
>     I have the exact same setup and below is as far as it went after
>     months of trial and error.
>     We all have a somewhat similar setup and the same issue with this -
>     you can find the same post as yours on a daily basis.
>
>     On Wed, Apr 11, 2018 at 3:03 PM, Anastasia Belyaeva
>     <anastasia.blv at gmail.com> wrote:
>
>         Hello everybody!
>
>         I have 3 gluster servers (*gluster 3.12.6, Centos 7.2*; those
>         are actually virtual machines located on 3 separate physical
>         XenServer7.1 servers)
>
>         They are all connected via an InfiniBand network. Iperf3 shows
>         around *23 Gbit/s network bandwidth* between each pair of them.
>
>         Each server has 3 HDDs put into a *stripe* 3 thin pool (LVM2)
>         with a logical volume created on top of it, formatted with
>         *xfs*. Gluster top reports the following throughput:
>
>             root at fsnode2 ~ $ gluster volume top r3vol write-perf bs 4096 count 524288 list-cnt 0
>             Brick: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Throughput *631.82 MBps* time 3.3989 secs
>             Brick: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Throughput *566.96 MBps* time 3.7877 secs
>             Brick: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Throughput *546.65 MBps* time 3.9285 secs
>
>
>             root at fsnode2 ~ $ gluster volume top r2vol write-perf bs 4096 count 524288 list-cnt 0
>             Brick: fsnode2.ibnet:/data/glusterfs/r2vol/brick1/brick
>             Throughput *539.60 MBps* time 3.9798 secs
>             Brick: fsnode4.ibnet:/data/glusterfs/r2vol/brick1/brick
>             Throughput *580.07 MBps* time 3.7021 secs
>
>
>         And two *pure replicated ('replica 2' and 'replica 3')*
>         volumes. The 'replica 2' volume is for testing purposes only.
>
>             Volume Name: r2vol
>             Type: Replicate
>             Volume ID: 4748d0c0-6bef-40d5-b1ec-d30e10cfddd9
>             Status: Started
>             Snapshot Count: 0
>             Number of Bricks: 1 x 2 = 2
>             Transport-type: tcp
>             Bricks:
>             Brick1: fsnode2.ibnet:/data/glusterfs/r2vol/brick1/brick
>             Brick2: fsnode4.ibnet:/data/glusterfs/r2vol/brick1/brick
>             Options Reconfigured:
>             nfs.disable: on
>
>             Volume Name: r3vol
>             Type: Replicate
>             Volume ID: b0f64c28-57e1-4b9d-946b-26ed6b499f29
>             Status: Started
>             Snapshot Count: 0
>             Number of Bricks: 1 x 3 = 3
>             Transport-type: tcp
>             Bricks:
>             Brick1: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Brick2: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Brick3: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
>             Options Reconfigured:
>             nfs.disable: on
>
>
>
>         The *client* is also gluster 3.12.6, a Centos 7.3 virtual
>         machine, using a *FUSE mount*
>
>             root at centos7u3-nogdesktop2 ~ $ mount |grep gluster
>             gluster-host.ibnet:/r2vol on /mnt/gluster/r2 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
>             gluster-host.ibnet:/r3vol on /mnt/gluster/r3 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
>
>
>
>         *The problem* is that there is a significant performance loss
>         with smaller block sizes. For example:
>
>         _4K block size_
>         [replica 3 volume]
>         root at centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=4096 count=262144
>         262144+0 records in
>         262144+0 records out
>         1073741824 bytes (1.1 GB) copied, 11.2207 s, *95.7 MB/s*
>
>         [replica 2 volume]
>         root at centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r2/file$RANDOM bs=4096 count=262144
>         262144+0 records in
>         262144+0 records out
>         1073741824 bytes (1.1 GB) copied, 12.0149 s, *89.4 MB/s*
>
>         _512K block size_
>         [replica 3 volume]
>         root at centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=512K count=2048
>         2048+0 records in
>         2048+0 records out
>         1073741824 bytes (1.1 GB) copied, 5.27207 s, *204 MB/s*
>
>         [replica 2 volume]
>         root at centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r2/file$RANDOM bs=512K count=2048
>         2048+0 records in
>         2048+0 records out
>         1073741824 bytes (1.1 GB) copied, 4.22321 s, *254 MB/s*
>
>         With a bigger block size it's still not where I expect it to be,
>         but at least it starts to make some sense.
>
>         I've been trying to solve this for a very long time with no luck.
>         I've already tried both kernel tuning (different 'tuned'
>         profiles and the ones recommended in the "Linux Kernel Tuning"
>         section) and tweaking gluster volume options, including
>         write-behind/flush-behind/write-behind-window-size.
>         The latter, to my surprise, didn't make any difference. At
>         first I thought it was a buffering issue, but it turns out
>         it does buffer writes, just not very efficiently (at least
>         that's what it looks like in the *gluster profile output*):
>
>             root at fsnode2 ~ $ gluster volume profile r3vol info clear
>             ...
>             Cleared stats.
>
>
>             root at centos7u3-nogdesktop2 ~ $ dd if=/dev/zero of=/mnt/gluster/r3/file$RANDOM bs=4096 count=262144
>             262144+0 records in
>             262144+0 records out
>             1073741824 bytes (1.1 GB) copied, 10.9743 s, 97.8 MB/s
>
>             root at fsnode2 ~ $ gluster volume profile r3vol info
>             Brick: fsnode2.ibnet:/data/glusterfs/r3vol/brick1/brick
>             -------------------------------------------------------
>             Cumulative Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      18.00 us      18.00 us      18.00 us              1      STATFS
>                   0.00      20.50 us      11.00 us      30.00 us              2       FLUSH
>                   0.00      22.50 us      17.00 us      28.00 us              2    FINODELK
>                   0.01      76.50 us      65.00 us      88.00 us              2    FXATTROP
>                   0.01     177.00 us     177.00 us     177.00 us              1      CREATE
>                   0.02      56.14 us      23.00 us     128.00 us              7      LOOKUP
>                   0.02     259.00 us      20.00 us     498.00 us              2     ENTRYLK
>                  99.94      59.23 us      17.00 us   10914.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>             Interval 0 Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      18.00 us      18.00 us      18.00 us              1      STATFS
>                   0.00      20.50 us      11.00 us      30.00 us              2       FLUSH
>                   0.00      22.50 us      17.00 us      28.00 us              2    FINODELK
>                   0.01      76.50 us      65.00 us      88.00 us              2    FXATTROP
>                   0.01     177.00 us     177.00 us     177.00 us              1      CREATE
>                   0.02      56.14 us      23.00 us     128.00 us              7      LOOKUP
>                   0.02     259.00 us      20.00 us     498.00 us              2     ENTRYLK
>                  99.94      59.23 us      17.00 us   10914.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>             Brick: fsnode6.ibnet:/data/glusterfs/r3vol/brick1/brick
>             -------------------------------------------------------
>             Cumulative Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      33.00 us      33.00 us      33.00 us              1      STATFS
>                   0.00      22.50 us      13.00 us      32.00 us              2     ENTRYLK
>                   0.00      32.00 us      26.00 us      38.00 us              2       FLUSH
>                   0.01      47.50 us      16.00 us      79.00 us              2    FINODELK
>                   0.01     157.00 us     157.00 us     157.00 us              1      CREATE
>                   0.01      92.00 us      70.00 us     114.00 us              2    FXATTROP
>                   0.03      72.57 us      39.00 us     121.00 us              7      LOOKUP
>                  99.94      47.97 us      15.00 us    1598.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>             Interval 0 Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      33.00 us      33.00 us      33.00 us              1      STATFS
>                   0.00      22.50 us      13.00 us      32.00 us              2     ENTRYLK
>                   0.00      32.00 us      26.00 us      38.00 us              2       FLUSH
>                   0.01      47.50 us      16.00 us      79.00 us              2    FINODELK
>                   0.01     157.00 us     157.00 us     157.00 us              1      CREATE
>                   0.01      92.00 us      70.00 us     114.00 us              2    FXATTROP
>                   0.03      72.57 us      39.00 us     121.00 us              7      LOOKUP
>                  99.94      47.97 us      15.00 us    1598.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>             Brick: fsnode4.ibnet:/data/glusterfs/r3vol/brick1/brick
>             -------------------------------------------------------
>             Cumulative Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      58.00 us      58.00 us      58.00 us              1      STATFS
>                   0.00      38.00 us      38.00 us      38.00 us              2     ENTRYLK
>                   0.01      59.00 us      32.00 us      86.00 us              2       FLUSH
>                   0.01      81.00 us      33.00 us     129.00 us              2    FINODELK
>                   0.01      91.50 us      73.00 us     110.00 us              2    FXATTROP
>                   0.01     239.00 us     239.00 us     239.00 us              1      CREATE
>                   0.04     103.14 us      63.00 us     210.00 us              7      LOOKUP
>                  99.92      52.99 us      16.00 us   11289.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>             Interval 0 Stats:
>                Block Size:     4096b+     8192b+    16384b+    32768b+    65536b+   131072b+
>              No. of Reads:          0          0          0          0          0          0
>             No. of Writes:       1576       4173      19605       7777       1847        657
>              %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
>              ---------   -----------   -----------   -----------   ------------        ----
>                   0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
>                   0.00      58.00 us      58.00 us      58.00 us              1      STATFS
>                   0.00      38.00 us      38.00 us      38.00 us              2     ENTRYLK
>                   0.01      59.00 us      32.00 us      86.00 us              2       FLUSH
>                   0.01      81.00 us      33.00 us     129.00 us              2    FINODELK
>                   0.01      91.50 us      73.00 us     110.00 us              2    FXATTROP
>                   0.01     239.00 us     239.00 us     239.00 us              1      CREATE
>                   0.04     103.14 us      63.00 us     210.00 us              7      LOOKUP
>                  99.92      52.99 us      16.00 us   11289.00 us          35635       WRITE
>                 Duration: 38 seconds
>                Data Read: 0 bytes
>             Data Written: 1073741824 bytes
>
>
>
>         At this point I've officially run out of ideas about where to
>         look next. So any help, suggestions or pointers are highly
>         appreciated!
>
>         -- 
>         Best regards,
>         Anastasia Belyaeva
>
>
>
>
>
>
>         _______________________________________________
>         Gluster-users mailing list
>         Gluster-users at gluster.org
>         http://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
>
>
> -- 
> Best regards,
> Anastasia Belyaeva
