[Gluster-users] glusterfs under high load failing?

Mon Oct 13 16:19:41 UTC 2014

Hi Roman,
      Do you think we can run this test again? This time, could you
run 'gluster volume profile <volname> start' first and then repeat the
same test? Afterwards, please provide the output of
'gluster volume profile <volname> info' along with the logs.
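
For reference, the sequence could be sketched roughly as below. This is only an illustration: the volume name and the test file path are placeholders (assumptions, not values from this thread), and the `run` helper is a hypothetical wrapper that prints the commands instead of executing them unless DRY_RUN=0 is set. The dd parameters are the ones described in the original report.

```shell
#!/bin/sh
# Sketch of a profiled re-run of the dd test.
# VOLNAME and TESTFILE are hypothetical placeholders; adjust to your setup.
VOLNAME="${VOLNAME:-myvolume}"
TESTFILE="${TESTFILE:-/mnt/glusterfs/test.img}"
# Dry-run by default: commands are only printed. Set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Start collecting per-brick statistics for the volume.
run gluster volume profile "$VOLNAME" start
# 2. Repeat the same workload (900 x 1G blocks of zeros, as in the report).
run dd if=/dev/zero of="$TESTFILE" bs=1G count=900
# 3. Dump the collected statistics; this is the output to send to the list.
run gluster volume profile "$VOLNAME" info
# 4. Stop profiling once the output has been saved.
run gluster volume profile "$VOLNAME" stop
```

When running it for real, the output of the 'info' step can be redirected to a file and attached together with the client and brick logs.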

Pranith
On 10/13/2014 09:45 PM, Roman wrote:
> Sure !
>
> root@stor1:~# gluster volume info
>
> Volume Name: HA-2TB-TT-Proxmox-cluster
> Type: Replicate
> Volume ID: 66e38bde-c5fa-4ce2-be6e-6b2adeaa16c2
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB
> Brick2: stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB
> Options Reconfigured:
> nfs.disable: 0
> network.ping-timeout: 10
>
> Volume Name: HA-WIN-TT-1T
> Type: Replicate
> Volume ID: 2937ac01-4cba-44a8-8ff8-0161b67f8ee4
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: stor1:/exports/NFS-WIN/1T
> Brick2: stor2:/exports/NFS-WIN/1T
> Options Reconfigured:
> nfs.disable: 1
> network.ping-timeout: 10
>
>
>
> 2014-10-13 19:09 GMT+03:00 Pranith Kumar Karampuri
> <pkarampu at redhat.com>:
>
>     Could you give your 'gluster volume info' output?
>
>     Pranith
>
>     On 10/13/2014 09:36 PM, Roman wrote:
>>     Hi,
>>
>>     I've got this kind of setup (servers run replica)
>>
>>
>>     @ 10G backend
>>     gluster storage1
>>     gluster storage2
>>     gluster client1
>>
>>     @1g backend
>>     other gluster clients
>>
>>     The servers have HW RAID5 with SAS disks.
>>
>>     So today I decided to create a 900GB file for an iSCSI target,
>>     to be located on a separate glusterfs volume, using dd (just
>>     a dummy file filled with zeros, bs=1G count 900).
>>     First of all, the process took quite a lot of time; the
>>     write speed was 130 MB/sec (the client port was 2 Gbps, the
>>     server ports were running @ 1 Gbps).
>>     Then it reported something like "endpoint is not connected" and
>>     all of my VMs on the other volume started to give me IO errors.
>>     Server load was around 4,6 (12 cores in total).
>>
>>     Maybe it was due to the timeout of 2 secs, so I've made it a bit
>>     higher, 10 sec.
>>
>>     Also, during the dd image creation, the VMs very often reported
>>     that their disks were slow, like:
>>
>>     WARNINGs: Read IO Wait time is -0.02 (outside range [0:1]).
>>
>>     Is 130 MB/sec the maximum bandwidth for all of the volumes in
>>     total? Then why would we need 10G backends?
>>
>>     HW RAID local speed is 300 MB/sec, so it should not be an issue.
>>     Any ideas or advice?
>>
>>
>>     Maybe someone has an optimized sysctl.conf for a 10G backend?
>>
>>     Mine is pretty simple, the kind that can be found by googling.
>>
>>
>>     Just to mention: those VMs were connected using a separate 1 Gbps
>>     interface, which means they should not be affected by the client
>>     with the 10G backend.
>>
>>
>>     The logs are pretty useless; they just say this during the outage:
>>
>>
>>     [2014-10-13 12:09:18.392910] W
>>     [client-handshake.c:276:client_ping_cbk]
>>     0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired
>>
>>     [2014-10-13 12:10:08.389708] C
>>     [client-handshake.c:127:rpc_client_ping_timer_expired]
>>     0-HA-2TB-TT-Proxmox-cluster-client-0: server 10.250.0.1:49159
>>     has not responded in the last 2 seconds, disconnecting.
>>
>>     [2014-10-13 12:10:08.390312] W
>>     [client-handshake.c:276:client_ping_cbk]
>>     0-HA-2TB-TT-Proxmox-cluster-client-0: timer must have expired
>>
>>     So I decided to set the timeout a bit higher.
>>
>>     So it seems to me that under high load GlusterFS is not usable?
>>     130 MB/s is not that much to cause timeouts or to make
>>     the system so slow that the VMs start misbehaving.
>>
>>     Of course, after the disconnection the healing process started,
>>     but as the VMs had lost connection to both servers, it was pretty
>>     useless; they could not run anymore. And BTW, when you load the
>>     server with such a huge job (dd of 900GB), the healing process goes
>>     soooooo slow :)
>>
>>
>>
>>     -- 
>>     Best regards,
>>     Roman.
>>
>>
>>     _______________________________________________
>>     Gluster-users mailing list
>>     Gluster-users at gluster.org
>>     http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
>
>
>
> -- 
> Best regards,
> Roman.
