[Gluster-users] GlusterFS 3.5.3 - untar: very poor performance

Mathieu Chateau mathieu.chateau at lotp.fr
Sat Jun 20 09:11:38 UTC 2015


I am afraid I am not experienced enough to be much more useful.

My guess is that, since the client writes synchronously to all nodes (to
keep the data coherent), it only goes as fast as the slowest brick.

Small files are also often slow because the TCP window doesn't have time to
grow.
That's why I gave you some kernel tuning to help the TCP window get bigger
faster.
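
You can check what is currently in effect on the client and the servers with
something like:

sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
sysctl net.ipv4.tcp_congestion_control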

Are you using the latest version (3.7.1)?


Regards,
Mathieu CHATEAU
http://www.lotp.fr

2015-06-20 11:01 GMT+02:00 Geoffrey Letessier <geoffrey.letessier at cnrs.fr>:

> Hello Mathieu,
>
> Thanks for replying.
>
> Previously I never noticed such a problem: I get around 1 GB/s for one big
> file, but… the situation with a « big » set of small files was never
> amazing, though not as bad as today.
>
> The problem seems to depend exclusively on the size of each file.
> "Proof":
> [root at node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 2.09139 s, 501 MB/s
> [root at node056 tmp]# time split -b 1000000 -a 12 masterfile  # 1MB per file
>
> real 0m42.841s
> user 0m0.004s
> sys 0m1.416s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root at node056 tmp]# time split -b 5000000 -a 12 masterfile  # 5MB per file
>
> real 0m17.801s
> user 0m0.008s
> sys 0m1.396s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root at node056 tmp]# time split -b 10000000 -a 12 masterfile  # 10MB per file
>
> real 0m9.686s
> user 0m0.008s
> sys 0m1.451s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root at node056 tmp]# time split -b 20000000 -a 12 masterfile  # 20MB per file
>
> real 0m9.717s
> user 0m0.003s
> sys 0m1.399s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root at node056 tmp]# time split -b 1000000 -a 12 masterfile  # 1MB per file (again)
>
> real 0m40.283s
> user 0m0.007s
> sys 0m1.390s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
>
> The larger the generated files, the better the performance (both IO
> throughput and running time)… and the ifstat output is consistent on both
> the client/node and server side.
>
> a new test:
> [root at node056 tmp]# dd if=/dev/zero of=masterfile bs=1M count=10000
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 23.0044 s, 456 MB/s
> [root at node056 tmp]# rm -f xaaaaaaaaa* && sync
> [root at node056 tmp]# time split -b 10000000 -a 12 masterfile  # 10MB per file
>
> real 1m43.216s
> user 0m0.038s
> sys 0m13.407s
>
>
> So the per-file performance is the same (despite 10x more files).
>
> So I don't understand why, to get the best performance, I need to create
> files with a size of 10MB or more.
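>
> A small loop like the following (just a sketch, reusing the same masterfile
> and the same cleanup pattern as above) reproduces the whole series in one go:
> for bs in 1000000 5000000 10000000 20000000; do
>     echo "== split block size: $bs =="
>     time split -b $bs -a 12 masterfile
>     rm -f xaaaaaaaaa* && sync
> done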
>
> Here are my volume reconfigured options:
> performance.cache-max-file-size: 64MB
> performance.read-ahead: on
> performance.write-behind: on
> features.quota-deem-statfs: on
> performance.stat-prefetch: on
> performance.flush-behind: on
> features.default-soft-limit: 90%
> features.quota: on
> diagnostics.brick-log-level: CRITICAL
> auth.allow: localhost,127.0.0.1,10.*
> nfs.disable: on
> performance.cache-size: 1GB
> performance.write-behind-window-size: 4MB
> performance.quick-read: on
> performance.io-cache: on
> performance.io-thread-count: 64
> nfs.enable-ino32: off
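>
> For reference, options like these are applied and checked per volume with
> the gluster CLI, e.g. (a sketch, assuming the volume name vol_home used in
> the mount command below):
> gluster volume set vol_home performance.cache-size 1GB
> gluster volume info vol_home    # changed options appear under "Options Reconfigured"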
>
> It's not a local cache issue because:
> 1- caching is disabled in my mount command: mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /home
> 2- I also ran my tests while playing with /proc/sys/vm/drop_caches
> 3- I see the same ifstat output on both the client and server side, which is
> consistent with the computed bandwidth (file sizes / time, taking
> replication into account).
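>
> For point 2, dropping the caches amounts to something like:
> sync && echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes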
>
> I don't think it's an InfiniBand network issue, but here are my
> [non-default] settings:
> connected mode with MTU set to 65520
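>
> Both can be verified on the IPoIB interface (assuming it is ib0, as in the
> ifstat output below), e.g.:
> cat /sys/class/net/ib0/mode    # should report "connected"
> ip link show ib0               # MTU should show 65520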
>
> Do you share my feeling? If so, do you have any other ideas?
>
> Thanks again, and thanks in advance,
> Geoffrey
> -----------------------------------------------
> Geoffrey Letessier
>
> IT manager & systems engineer
> CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at cnrs.fr
>
> On 20 June 2015 at 09:12, Mathieu Chateau <mathieu.chateau at lotp.fr>
> wrote:
>
> Hello,
>
> for the replicated volume, is this a new issue or did you just not notice
> it before? Same baseline as before?
>
> I also have slowness with small files/many files.
>
> For now I could only tune things up with:
>
> On gluster level:
> gluster volume set myvolume performance.io-thread-count 16
> gluster volume set myvolume performance.cache-size 1GB
> gluster volume set myvolume nfs.disable on
> gluster volume set myvolume readdir-ahead enable
> gluster volume set myvolume read-ahead disable
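>
> You can double-check what actually got applied with:
> gluster volume info myvolume    # changed options are listed under "Options Reconfigured"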
>
> At the network level (client and server) (I don't have InfiniBand):
> sysctl -w vm.swappiness=0
> sysctl -w net.core.rmem_max=67108864
> sysctl -w net.core.wmem_max=67108864
> # increase Linux autotuning TCP buffer limit to 32MB
> sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
> sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
> # increase the length of the processor input queue
> sysctl -w net.core.netdev_max_backlog=30000
> # recommended default congestion control is htcp
> sysctl -w net.ipv4.tcp_congestion_control=htcp
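>
> To keep these across reboots, the same keys can go into a sysctl file (just
> a sketch; the file name is arbitrary and assumes /etc/sysctl.d is read at boot):
> cat > /etc/sysctl.d/90-gluster-net.conf <<'EOF'
> vm.swappiness = 0
> net.core.rmem_max = 67108864
> net.core.wmem_max = 67108864
> net.ipv4.tcp_rmem = 4096 87380 33554432
> net.ipv4.tcp_wmem = 4096 65536 33554432
> net.core.netdev_max_backlog = 30000
> net.ipv4.tcp_congestion_control = htcp
> EOF
> sysctl -p /etc/sysctl.d/90-gluster-net.conf    # apply now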
>
> But it's still really slow, even if better than before.
>
> Regards,
> Mathieu CHATEAU
> http://www.lotp.fr
>
> 2015-06-20 2:34 GMT+02:00 Geoffrey Letessier <geoffrey.letessier at cnrs.fr>:
>
>> Re,
>>
>> For comparison, here is the output of the same script run on a
>> distributed-only volume (2 of the 4 servers described previously, 2 bricks
>> each):
>> #######################################################
>> ################  UNTAR time consumed  ################
>> #######################################################
>>
>>
>> real 1m44.698s
>> user 0m8.891s
>> sys 0m8.353s
>>
>> #######################################################
>> #################  DU time consumed  ##################
>> #######################################################
>>
>> 554M linux-4.1-rc6
>>
>> real 0m21.062s
>> user 0m0.100s
>> sys 0m1.040s
>>
>> #######################################################
>> #################  FIND time consumed  ################
>> #######################################################
>>
>> 52663
>>
>> real 0m21.325s
>> user 0m0.104s
>> sys 0m1.054s
>>
>> #######################################################
>> #################  GREP time consumed  ################
>> #######################################################
>>
>> 7952
>>
>> real 0m43.618s
>> user 0m0.922s
>> sys 0m3.626s
>>
>> #######################################################
>> #################  TAR time consumed  #################
>> #######################################################
>>
>>
>> real 0m50.577s
>> user 0m29.745s
>> sys 0m4.086s
>>
>> #######################################################
>> #################  RM time consumed  ##################
>> #######################################################
>>
>>
>> real 0m41.133s
>> user 0m0.171s
>> sys 0m2.522s
>>
>> The performance is amazingly different!
>>
>> Geoffrey
>> -----------------------------------------------
>> Geoffrey Letessier
>>
>> IT manager & systems engineer
>> CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
>> Institut de Biologie Physico-Chimique
>> 13, rue Pierre et Marie Curie - 75005 Paris
>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at cnrs.fr
>>
>> On 20 June 2015 at 02:12, Geoffrey Letessier <geoffrey.letessier at cnrs.fr>
>> wrote:
>>
>> Dear all,
>>
>> I just noticed that IO operations on the main volume of my HPC cluster
>> have become impressively poor.
>>
>> Doing some file operations on a compressed Linux kernel source archive,
>> the untar operation can take more than half an hour for this file (roughly
>> 80MB, with 52,000 files inside), as you can read below:
>> #######################################################
>> ################  UNTAR time consumed  ################
>> #######################################################
>>
>>
>> real 32m42.967s
>> user 0m11.783s
>> sys 0m15.050s
>>
>> #######################################################
>> #################  DU time consumed  ##################
>> #######################################################
>>
>> 557M linux-4.1-rc6
>>
>> real 0m25.060s
>> user 0m0.068s
>> sys 0m0.344s
>>
>> #######################################################
>> #################  FIND time consumed  ################
>> #######################################################
>>
>> 52663
>>
>> real 0m25.687s
>> user 0m0.084s
>> sys 0m0.387s
>>
>> #######################################################
>> #################  GREP time consumed  ################
>> #######################################################
>>
>> 7952
>>
>> real 2m15.890s
>> user 0m0.887s
>> sys 0m2.777s
>>
>> #######################################################
>> #################  TAR time consumed  #################
>> #######################################################
>>
>>
>> real 1m5.551s
>> user 0m26.536s
>> sys 0m2.609s
>>
>> #######################################################
>> #################  RM time consumed  ##################
>> #######################################################
>>
>>
>> real 2m51.485s
>> user 0m0.167s
>> sys 0m1.663s
>>
>> For information, this volume is a distributed-replicated one, composed of
>> 4 servers with 2 bricks each. Each brick is a 12-drive RAID6 vdisk with
>> good native performance (around 1.2 GB/s).
>>
>> In comparison, when I use dd to generate a 100GB file on the same volume,
>> my write throughput is around 1 GB/s (client side) and 500 MB/s (server
>> side) because of replication:
>> Client side:
>> [root at node056 ~]# ifstat -i ib0
>>        ib0
>>  KB/s in  KB/s out
>>  3251.45  1.09e+06
>>  3139.80  1.05e+06
>>  3185.29  1.06e+06
>>  3293.84  1.09e+06
>> ...
>>
>> Server side:
>> [root at lucifer ~]# ifstat -i ib0
>>        ib0
>>  KB/s in  KB/s out
>> 561818.1   1746.42
>> 560020.3   1737.92
>> 526337.1   1648.20
>> 513972.7   1613.69
>> ...
>>
>> DD command:
>> [root at node056 ~]# dd if=/dev/zero of=/home/root/test.dd bs=1M count=100000
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 202.99 s, 517 MB/s
>>
>> So this issue doesn't seem to come from the network (which is InfiniBand
>> in this case).
>>
>> You can find in attachments a set of files:
>> - mybench.sh: the bench script
>> - benches.txt: output of my "bench"
>> - profile.txt: gluster volume profile during the "bench"
>> - vol_status.txt: gluster volume status
>> - vol_info.txt: gluster volume info
>>
>> Can someone help me fix it? It's very critical because this volume is
>> on an HPC cluster in production.
>>
>> Thanks in advance,
>> Geoffrey
>> -----------------------------------------------
>> Geoffrey Letessier
>>
>> IT manager & systems engineer
>> CNRS - UPR 9080 - Laboratoire de Biochimie Théorique
>> Institut de Biologie Physico-Chimique
>> 13, rue Pierre et Marie Curie - 75005 Paris
>> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier at cnrs.fr
>>  <benches.txt>
>> <mybench.sh>
>> <profile.txt>
>> <vol_info.txt>
>> <vol_status.txt>
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>

