[Gluster-users] glusterfs under high load failing?
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Oct 15 15:24:37 UTC 2014
On 10/14/2014 01:20 AM, Roman wrote:
> OK, done.
> This time there were no disconnects, at least all of the VMs are working,
> but I got some mails from a VM about IO writes again.
>
> WARNINGs: Read IO Wait time is 1.45 (outside range [0:1]).
This warning says 'Read IO wait', yet not a single READ
operation came to gluster. Wondering why that is :-/. Any clue?
There is at least one write which took 3 seconds according to the stats,
and at least one synchronization operation (FINODELK) took 23 seconds. Could
you give the logs of this run, for the mount, glustershd, and the bricks?
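In case it helps, a minimal sketch of gathering those, assuming the default
/var/log/glusterfs locations (the exact file names follow your mount point and
brick path, so adjust as needed):
# on the client: the fuse mount log is named after the mount point
cp /var/log/glusterfs/srv-nfs-HA-WIN-TT-1T.log /tmp/
# on each storage server: self-heal daemon log plus the brick log
cp /var/log/glusterfs/glustershd.log /var/log/glusterfs/bricks/exports-NFS-WIN-1T.log /tmp/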
Pranith
>
> here is the output
>
> root at stor1:~# gluster volume profile HA-WIN-TT-1T info
> Brick: stor1:/exports/NFS-WIN/1T
> --------------------------------
> Cumulative Stats:
> Block Size: 131072b+ 262144b+
> No. of Reads: 0 0
> No. of Writes: 7372798 1
> %-latency     Avg-latency   Min-Latency     Max-Latency   No. of calls   Fop
> ---------     -----------   -----------     -----------   ------------   ----
>      0.00         0.00 us       0.00 us         0.00 us             25   RELEASE
>      0.00         0.00 us       0.00 us         0.00 us             16   RELEASEDIR
>      0.00        64.00 us      52.00 us        76.00 us              2   ENTRYLK
>      0.00        73.50 us      51.00 us        96.00 us              2   FLUSH
>      0.00        68.43 us      30.00 us       135.00 us              7   STATFS
>      0.00        54.31 us      44.00 us       109.00 us             16   OPENDIR
>      0.00        50.75 us      16.00 us        74.00 us             24   FSTAT
>      0.00        47.77 us      19.00 us       119.00 us             26   GETXATTR
>      0.00        59.21 us      21.00 us        89.00 us             24   OPEN
>      0.00        59.39 us      22.00 us       296.00 us             28   READDIR
>      0.00      4972.00 us    4972.00 us      4972.00 us              1   CREATE
>      0.00        97.42 us      19.00 us       184.00 us             62   LOOKUP
>      0.00        89.49 us      20.00 us       656.00 us            324   FXATTROP
>      3.91   1255944.81 us     127.00 us   23397532.00 us            189   FSYNC
>      7.40   3406275.50 us      17.00 us   23398013.00 us            132   INODELK
>     34.96     94598.02 us       8.00 us   23398705.00 us          22445   FINODELK
>     53.73       442.66 us      79.00 us    3116494.00 us        7372799   WRITE
>
> Duration: 7813 seconds
> Data Read: 0 bytes
> Data Written: 966367641600 bytes
>
> Interval 0 Stats:
> Block Size: 131072b+ 262144b+
> No. of Reads: 0 0
> No. of Writes: 7372798 1
> %-latency     Avg-latency   Min-Latency     Max-Latency   No. of calls   Fop
> ---------     -----------   -----------     -----------   ------------   ----
>      0.00         0.00 us       0.00 us         0.00 us             25   RELEASE
>      0.00         0.00 us       0.00 us         0.00 us             16   RELEASEDIR
>      0.00        64.00 us      52.00 us        76.00 us              2   ENTRYLK
>      0.00        73.50 us      51.00 us        96.00 us              2   FLUSH
>      0.00        68.43 us      30.00 us       135.00 us              7   STATFS
>      0.00        54.31 us      44.00 us       109.00 us             16   OPENDIR
>      0.00        50.75 us      16.00 us        74.00 us             24   FSTAT
>      0.00        47.77 us      19.00 us       119.00 us             26   GETXATTR
>      0.00        59.21 us      21.00 us        89.00 us             24   OPEN
>      0.00        59.39 us      22.00 us       296.00 us             28   READDIR
>      0.00      4972.00 us    4972.00 us      4972.00 us              1   CREATE
>      0.00        97.42 us      19.00 us       184.00 us             62   LOOKUP
>      0.00        89.49 us      20.00 us       656.00 us            324   FXATTROP
>      3.91   1255944.81 us     127.00 us   23397532.00 us            189   FSYNC
>      7.40   3406275.50 us      17.00 us   23398013.00 us            132   INODELK
>     34.96     94598.02 us       8.00 us   23398705.00 us          22445   FINODELK
>     53.73       442.66 us      79.00 us    3116494.00 us        7372799   WRITE
>
> Duration: 7813 seconds
> Data Read: 0 bytes
> Data Written: 966367641600 bytes
>
> Brick: stor2:/exports/NFS-WIN/1T
> --------------------------------
> Cumulative Stats:
> Block Size: 131072b+ 262144b+
> No. of Reads: 0 0
> No. of Writes: 7372798 1
> %-latency     Avg-latency   Min-Latency     Max-Latency   No. of calls   Fop
> ---------     -----------   -----------     -----------   ------------   ----
>      0.00         0.00 us       0.00 us         0.00 us             25   RELEASE
>      0.00         0.00 us       0.00 us         0.00 us             16   RELEASEDIR
>      0.00        61.50 us      46.00 us        77.00 us              2   ENTRYLK
>      0.00        82.00 us      67.00 us        97.00 us              2   FLUSH
>      0.00       265.00 us     265.00 us       265.00 us              1   CREATE
>      0.00        57.43 us      30.00 us        85.00 us              7   STATFS
>      0.00        61.12 us      37.00 us       107.00 us             16   OPENDIR
>      0.00        44.04 us      12.00 us        86.00 us             24   FSTAT
>      0.00        41.42 us      24.00 us        96.00 us             26   GETXATTR
>      0.00        45.93 us      24.00 us       133.00 us             28   READDIR
>      0.00        57.17 us      25.00 us       147.00 us             24   OPEN
>      0.00       145.28 us      31.00 us       288.00 us             32   READDIRP
>      0.00        39.50 us      10.00 us       152.00 us            132   INODELK
>      0.00       330.97 us      20.00 us     14280.00 us             62   LOOKUP
>      0.00        79.06 us      19.00 us       851.00 us            430   FXATTROP
>      0.02        29.32 us       7.00 us     28154.00 us          22568   FINODELK
>      7.80   1313096.68 us     125.00 us   23281862.00 us            189   FSYNC
>     92.18       397.92 us      76.00 us    1838343.00 us        7372799   WRITE
>
> Duration: 7811 seconds
> Data Read: 0 bytes
> Data Written: 966367641600 bytes
>
> Interval 0 Stats:
> Block Size: 131072b+ 262144b+
> No. of Reads: 0 0
> No. of Writes: 7372798 1
> %-latency     Avg-latency   Min-Latency     Max-Latency   No. of calls   Fop
> ---------     -----------   -----------     -----------   ------------   ----
>      0.00         0.00 us       0.00 us         0.00 us             25   RELEASE
>      0.00         0.00 us       0.00 us         0.00 us             16   RELEASEDIR
>      0.00        61.50 us      46.00 us        77.00 us              2   ENTRYLK
>      0.00        82.00 us      67.00 us        97.00 us              2   FLUSH
>      0.00       265.00 us     265.00 us       265.00 us              1   CREATE
>      0.00        57.43 us      30.00 us        85.00 us              7   STATFS
>      0.00        61.12 us      37.00 us       107.00 us             16   OPENDIR
>      0.00        44.04 us      12.00 us        86.00 us             24   FSTAT
>      0.00        41.42 us      24.00 us        96.00 us             26   GETXATTR
>      0.00        45.93 us      24.00 us       133.00 us             28   READDIR
>      0.00        57.17 us      25.00 us       147.00 us             24   OPEN
>      0.00       145.28 us      31.00 us       288.00 us             32   READDIRP
>      0.00        39.50 us      10.00 us       152.00 us            132   INODELK
>      0.00       330.97 us      20.00 us     14280.00 us             62   LOOKUP
>      0.00        79.06 us      19.00 us       851.00 us            430   FXATTROP
>      0.02        29.32 us       7.00 us     28154.00 us          22568   FINODELK
>      7.80   1313096.68 us     125.00 us   23281862.00 us            189   FSYNC
>     92.18       397.92 us      76.00 us    1838343.00 us        7372799   WRITE
>
> Duration: 7811 seconds
> Data Read: 0 bytes
> Data Written: 966367641600 bytes
>
> Does this make anything clearer?
>
> 2014-10-13 20:40 GMT+03:00 Roman <romeo.r at gmail.com>:
>
> I think I may know what the issue was. There was an iscsitarget
> service running that was exporting this generated block device, so
> maybe my colleague's Windows server picked it up and mounted it :) I'll
> see if it happens again.
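> (A quick way to double-check that theory, assuming the stock Debian
> iscsitarget/IET package: see whether the target is still exported and
> whether any initiator is logged in:
> cat /proc/net/iet/volume     # exported targets/LUNs
> cat /proc/net/iet/session    # connected initiators, if any)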
>
> 2014-10-13 20:27 GMT+03:00 Roman <romeo.r at gmail.com>:
>
> So may I restart the volume and start the test, or do you need
> something else from this issue?
>
> 2014-10-13 19:49 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>
>
> On 10/13/2014 10:03 PM, Roman wrote:
>> Hmm,
>> seems like another strange issue? I've seen this before; I had
>> to restart the volume to get my free space back.
>> root at glstor-cli:/srv/nfs/HA-WIN-TT-1T# ls -l
>> total 943718400
>> -rw-r--r-- 1 root root 966367641600 Oct 13 16:55 disk
>> root at glstor-cli:/srv/nfs/HA-WIN-TT-1T# rm disk
>> root at glstor-cli:/srv/nfs/HA-WIN-TT-1T# df -h
>> Filesystem Size Used Avail Use% Mounted on
>> rootfs 282G 1.1G 266G 1% /
>> udev 10M 0 10M 0% /dev
>> tmpfs 1.4G 228K 1.4G 1% /run
>> /dev/disk/by-uuid/c62ee3c0-c0e5-44af-b0cd-7cb3fbcc0fba
>> 282G 1.1G 266G 1% /
>> tmpfs 5.0M 0 5.0M 0% /run/lock
>> tmpfs 5.2G 0 5.2G 0% /run/shm
>> stor1:HA-WIN-TT-1T 1008G 901G 57G 95%
>> /srv/nfs/HA-WIN-TT-1T
>>
>> No file, but the used size is still 901G.
>> Both servers show the same.
>> Do I really have to restart the volume to fix that?
> IMO this can happen if there is an fd leak; the open-fd count is the
> only variable that changes with a volume restart. How do
> you re-create the bug?
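> A rough way to check for that on the brick servers before restarting
> (just a sketch, assuming default paths):
> gluster volume statedump HA-WIN-TT-1T    # fd/inode tables land under /var/run/gluster
> lsof -nP +L1 | grep NFS-WIN              # files that are unlinked but still held open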
>
> Pranith
>
>>
>> 2014-10-13 19:30 GMT+03:00 Roman <romeo.r at gmail.com>:
>>
>> Sure.
>> I'll let it run for the night.
>>
>> 2014-10-13 19:19 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>
>> Hi Roman,
>> Do you think we can run this test again?
>> This time, could you enable profiling with 'gluster volume
>> profile <volname> start', do the same test,
>> and then provide the output of 'gluster volume profile
>> <volname> info' and the logs after the test?
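>> For this volume that would be roughly:
>> gluster volume profile HA-WIN-TT-1T start
>> # ... repeat the dd test ...
>> gluster volume profile HA-WIN-TT-1T info
>> gluster volume profile HA-WIN-TT-1T stop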
>>
>> Pranith
>>
>> On 10/13/2014 09:45 PM, Roman wrote:
>>> Sure !
>>>
>>> root at stor1:~# gluster volume info
>>>
>>> Volume Name: HA-2TB-TT-Proxmox-cluster
>>> Type: Replicate
>>> Volume ID: 66e38bde-c5fa-4ce2-be6e-6b2adeaa16c2
>>> Status: Started
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: stor1:/exports/HA-2TB-TT-Proxmox-cluster/2TB
>>> Brick2: stor2:/exports/HA-2TB-TT-Proxmox-cluster/2TB
>>> Options Reconfigured:
>>> nfs.disable: 0
>>> network.ping-timeout: 10
>>>
>>> Volume Name: HA-WIN-TT-1T
>>> Type: Replicate
>>> Volume ID: 2937ac01-4cba-44a8-8ff8-0161b67f8ee4
>>> Status: Started
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: stor1:/exports/NFS-WIN/1T
>>> Brick2: stor2:/exports/NFS-WIN/1T
>>> Options Reconfigured:
>>> nfs.disable: 1
>>> network.ping-timeout: 10
>>>
>>>
>>>
>>> 2014-10-13 19:09 GMT+03:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
>>>
>>> Could you give your 'gluster volume info'
>>> output?
>>>
>>> Pranith
>>>
>>> On 10/13/2014 09:36 PM, Roman wrote:
>>>> Hi,
>>>>
>>>> I've got this kind of setup (servers run
>>>> replica)
>>>>
>>>>
>>>> @ 10G backend
>>>> gluster storage1
>>>> gluster storage2
>>>> gluster client1
>>>>
>>>> @1g backend
>>>> other gluster clients
>>>>
>>>> The servers have HW RAID5 with SAS disks.
>>>>
>>>> So today I decided to create a 900GB
>>>> file for an iSCSI target, located on a
>>>> separate glusterfs volume, using dd (just
>>>> a dummy file filled with zeros, bs=1G
>>>> count=900).
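>>>> (something like: dd if=/dev/zero of=/srv/nfs/HA-WIN-TT-1T/disk bs=1G count=900)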
>>>> First of all, the process took
>>>> quite a lot of time; the writing speed was
>>>> 130 MB/sec (the client port was 2 Gbps, the servers'
>>>> ports were running @ 1 Gbps).
>>>> Then it reported something like "endpoint
>>>> is not connected" and all of my VMs on the
>>>> other volume started to give me IO errors.
>>>> Server load was around 4.6 (12 cores total).
>>>>
>>>> Maybe it was due to the timeout of 2 secs, so
>>>> I've made it a bit higher, 10 sec.
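>>>> (i.e. roughly: gluster volume set <volname> network.ping-timeout 10, on both volumes)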
>>>>
>>>> Also, during the dd image creation, the VMs
>>>> very often reported to me that their disks were
>>>> slow, like:
>>>>
>>>> WARNINGs: Read IO Wait time is -0.02
>>>> (outside range [0:1]).
>>>>
>>>> Is 130 MB/sec the maximum bandwidth for
>>>> all of the volumes in total? Is that why we
>>>> would need 10G backends?
>>>>
>>>> The HW RAID local speed is 300 MB/sec, so it
>>>> should not be an issue. Any ideas or maybe
>>>> any advice?
>>>>
>>>>
>>>> Maybe someone has an optimized sysctl.conf for
>>>> a 10G backend?
>>>>
>>>> Mine is pretty simple, the kind that can be found
>>>> by googling.
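>>>> (for reference, I mean the usual generic 10GbE tuning along these lines, nothing special:
>>>> net.core.rmem_max = 16777216
>>>> net.core.wmem_max = 16777216
>>>> net.ipv4.tcp_rmem = 4096 87380 16777216
>>>> net.ipv4.tcp_wmem = 4096 65536 16777216
>>>> net.core.netdev_max_backlog = 30000)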
>>>>
>>>>
>>>> Just to mention: those VMs were connected
>>>> using a separate 1 Gbps interface, which
>>>> means they should not be affected by the
>>>> client with the 10G backend.
>>>>
>>>>
>>>> The logs are pretty useless; they just say
>>>> this during the outage:
>>>>
>>>>
>>>> [2014-10-13 12:09:18.392910] W
>>>> [client-handshake.c:276:client_ping_cbk]
>>>> 0-HA-2TB-TT-Proxmox-cluster-client-0: timer
>>>> must have expired
>>>>
>>>> [2014-10-13 12:10:08.389708] C
>>>> [client-handshake.c:127:rpc_client_ping_timer_expired]
>>>> 0-HA-2TB-TT-Proxmox-cluster-client-0:
>>>> server 10.250.0.1:49159 has not responded
>>>> in the last 2 seconds, disconnecting.
>>>>
>>>> [2014-10-13 12:10:08.390312] W
>>>> [client-handshake.c:276:client_ping_cbk]
>>>> 0-HA-2TB-TT-Proxmox-cluster-client-0: timer
>>>> must have expired
>>>>
>>>> So I decided to set the timeout a bit higher.
>>>>
>>>> So it seems to me that under high load
>>>> GlusterFS is not usable? 130 MB/s is not
>>>> that much to be getting timeouts or
>>>> making the system so slow that the VMs
>>>> suffer.
>>>>
>>>> Of course, after the disconnection, the healing
>>>> process started, but as the VMs lost
>>>> connection to both servers, it was
>>>> pretty useless; they could not run anymore.
>>>> And BTW, when you load the server with such a
>>>> huge job (a dd of 900GB), the healing process
>>>> goes soooooo slow :)
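>>>> (heal progress can be watched with: gluster volume heal <volname> info)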
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Roman.
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Roman.
>>
>>
>>
>>
>> --
>> Best regards,
>> Roman.
>>
>>
>>
>>
>> --
>> Best regards,
>> Roman.
>
>
>
>
> --
> Best regards,
> Roman.
>
>
>
>
> --
> Best regards,
> Roman.
>
>
>
>
> --
> Best regards,
> Roman.