[Gluster-devel] Fw: Re: Corvid gluster testing

David F. Robinson david.robinson at corvidtec.com
Thu Aug 7 02:11:59 UTC 2014


Just to clarify a little, there are two cases where I was evaluating 
performance.

1) The first case, which Pranith was working on, involved 20 nodes with 
4 processors per node, for a total of 80 processors.  Each processor 
does its own independent I/O.  These files are roughly 100-200MB each 
and there are several hundred of them.  When I mounted the gluster 
system using fuse, it took 1.5 hours to do the I/O.  When I mounted the 
same system using NFS, it took 30 minutes.  Note that in order to get 
the gluster-mounted file system down to 1.5 hours, I had to get rid of 
the replicated volume (this was done during troubleshooting with Pranith 
to rule out other possible issues).  The timing was significantly worse 
(3+ hours) when I was using a replicated pair.
2) The second case was the output of a larger single file (roughly 
2.5TB).  For this case, the gluster-mounted filesystem takes 60 seconds 
(although I got that down to 52 seconds with some gluster parameter 
tuning).  The NFS mount takes 38 seconds.  I sent the results of this 
case to the developer list first, as it is much easier to test (roughly 
50 seconds versus what could be 3+ hours).
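
For reference, the kind of settings that are typically adjusted for 
large sequential writes look like the examples below.  These are 
illustrative only; they are not necessarily the parameters used in the 
runs above, and the option names/values should be checked against your 
gluster version:

     # hypothetical tuning examples, not the actual settings from these runs
     gluster volume set homegfs performance.write-behind-window-size 4MB
     gluster volume set homegfs performance.io-thread-count 32
     gluster volume set homegfs performance.cache-size 256MB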

I am headed out of town for a few days and will not be able to do 
additional testing until Monday.  For the second case, I will turn off 
cluster.eager-lock and send the results to the email list.  If there is 
any other testing that you would like to see for the first case, let me 
know and I will be happy to perform the tests and send in the results...
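
A minimal sketch of what I plan to run for that test (using the homegfs 
volume shown in the thread below, and the setting Avati suggested) would 
be roughly:

     gluster volume set homegfs cluster.eager-lock off
     gluster volume profile homegfs start
     # ... rerun the single-file output case ...
     gluster volume profile homegfs info

and I will include the resulting profile output along with the timings.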

Sorry for the confusion...

David


------ Original Message ------
From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
To: "Anand Avati" <avati at gluster.org>
Cc: "David F. Robinson" <david.robinson at corvidtec.com>; "Gluster Devel" 
<gluster-devel at gluster.org>
Sent: 8/6/2014 9:51:11 PM
Subject: Re: [Gluster-devel] Fw: Re: Corvid gluster testing

>
>On 08/07/2014 07:18 AM, Anand Avati wrote:
>>It would be worth checking the perf numbers without -o acl (in case it 
>>was enabled, as seen in the other gid thread). Client side -o acl 
>>mount option can have a negative impact on performance because of the 
>>increased number of up-calls from FUSE for access().
>Actually it is all write-intensive.
>Here are the numbers they gave me from earlier runs:
> %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
> ---------   -----------   -----------   -----------   ------------        ----
>      0.00       0.00 us       0.00 us       0.00 us             99      FORGET
>      0.00       0.00 us       0.00 us       0.00 us           1093     RELEASE
>      0.00       0.00 us       0.00 us       0.00 us            468  RELEASEDIR
>      0.00      60.00 us      26.00 us     107.00 us              4     SETATTR
>      0.00      91.56 us      42.00 us     157.00 us             27      UNLINK
>      0.00      20.75 us      12.00 us      55.00 us            132    GETXATTR
>      0.00      19.03 us       9.00 us      95.00 us            152    READLINK
>      0.00      43.19 us      12.00 us     106.00 us             83        OPEN
>      0.00      18.37 us       8.00 us      92.00 us            257      STATFS
>      0.00      32.42 us      11.00 us     118.00 us            322     OPENDIR
>      0.00      36.09 us       5.00 us     109.00 us            359       FSTAT
>      0.00      51.14 us      37.00 us     183.00 us            663      RENAME
>      0.00      33.32 us       6.00 us     123.00 us           1451        STAT
>      0.00     821.79 us      21.00 us   22678.00 us             84        READ
>      0.00      34.88 us       3.00 us     139.00 us           2326       FLUSH
>      0.01     789.33 us      72.00 us   64054.00 us            347      CREATE
>      0.01    1144.63 us      43.00 us  280735.00 us            337   FTRUNCATE
>      0.01      47.82 us      16.00 us   19817.00 us          16513      LOOKUP
>      0.02     604.85 us      11.00 us    1233.00 us           1423    READDIRP
>     99.95      17.51 us       6.00 us  212701.00 us      300715967       WRITE
>
>     Duration: 5390 seconds
>    Data Read: 1495257497 bytes
>Data Written: 166546887668 bytes
>
>Pranith
>>
>>Thanks
>>
>>
>>On Wed, Aug 6, 2014 at 6:26 PM, Pranith Kumar Karampuri 
>><pkarampu at redhat.com> wrote:
>>>
>>>On 08/07/2014 06:48 AM, Anand Avati wrote:
>>>>
>>>>
>>>>
>>>>On Wed, Aug 6, 2014 at 6:05 PM, Pranith Kumar Karampuri 
>>>><pkarampu at redhat.com> wrote:
>>>>>We checked this performance with plain distribute as well and on 
>>>>>nfs it gave 25 minutes where as on nfs it gave around 90 minutes 
>>>>>after disabling throttling in both situations.
>>>>
>>>>This sentence is very confusing. Can you please state it more 
>>>>clearly?
>>>sorry :-D.
>>>We checked this performance on plain distribute volume by disabling 
>>>throttling.
>>>On nfs the run took 25 minutes.
>>>On fuse the run took 90 minutes.
>>>
>>>Pranith
>>>
>>>>
>>>>Thanks
>>>>
>>>>
>>>>>I was wondering if any of you guys know what could contribute to 
>>>>>this difference.
>>>>>
>>>>>Pranith
>>>>>
>>>>>On 08/07/2014 01:33 AM, Anand Avati wrote:
>>>>>>Seems like heavy FINODELK contention. As a diagnostic step, can 
>>>>>>you try disabling eager-locking and check the write performance 
>>>>>>again (gluster volume set $name cluster.eager-lock off)?
>>>>>>
>>>>>>
>>>>>>On Tue, Aug 5, 2014 at 11:44 AM, David F. Robinson 
>>>>>><david.robinson at corvidtec.com> wrote:
>>>>>>>Forgot to attach profile info in previous email.  Attached...
>>>>>>>
>>>>>>>David
>>>>>>>
>>>>>>>
>>>>>>>------ Original Message ------
>>>>>>>From: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>>>>To: gluster-devel at gluster.org
>>>>>>>Sent: 8/5/2014 2:41:34 PM
>>>>>>>Subject: Fw: Re: Corvid gluster testing
>>>>>>>
>>>>>>>>I have been testing some of the fixes that Pranith incorporated 
>>>>>>>>into the 3.5.2-beta to see how they performed for moderate 
>>>>>>>>levels of i/o. All of the stability issues that I had seen in 
>>>>>>>>previous versions seem to have been fixed in 3.5.2; however, 
>>>>>>>>there still seem to be some significant performance issues.  
>>>>>>>>Pranith suggested that I send this to the gluster-devel email 
>>>>>>>>list, so here goes:
>>>>>>>>
>>>>>>>>I am running an MPI job that saves a restart file to the gluster 
>>>>>>>>file system.  When I use the following in my fstab to mount the 
>>>>>>>>gluster volume, the i/o time for the 2.5GB file is roughly 
>>>>>>>>45-seconds.
>>>>>>>>
>>>>>>>>     gfsib01a.corvidtec.com:/homegfs /homegfs glusterfs 
>>>>>>>>transport=tcp,_netdev 0 0
>>>>>>>>When I switch this to use the NFS protocol (see below), the i/o 
>>>>>>>>time is 2.5-seconds.
>>>>>>>>
>>>>>>>>   gfsib01a.corvidtec.com:/homegfs /homegfs nfs 
>>>>>>>>vers=3,intr,bg,rsize=32768,wsize=32768 0 0
>>>>>>>>
>>>>>>>>The read-times for gluster are 10-20% faster than NFS, but the 
>>>>>>>>write times are almost 20x slower.
>>>>>>>>
>>>>>>>>I am running SL 6.4 and glusterfs-3.5.2-0.1.beta1.el6.x86_64...
>>>>>>>>
>>>>>>>>[root at gfs01a glusterfs]# gluster volume info homegfs
>>>>>>>>Volume Name: homegfs
>>>>>>>>Type: Distributed-Replicate
>>>>>>>>Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>>>>>>>Status: Started
>>>>>>>>Number of Bricks: 2 x 2 = 4
>>>>>>>>Transport-type: tcp
>>>>>>>>Bricks:
>>>>>>>>Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>>>>>>>Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>>>>>>>Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>>>>>>>Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>>>>>>>
>>>>>>>>David
>>>>>>>>
>>>>>>>>------ Forwarded Message ------
>>>>>>>>From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>>>>>>To: "David Robinson" <david.robinson at corvidtec.com>
>>>>>>>>Cc: "Young Thomas" <tom.young at corvidtec.com>
>>>>>>>>Sent: 8/5/2014 2:25:38 AM
>>>>>>>>Subject: Re: Corvid gluster testing
>>>>>>>>
>>>>>>>>gluster-devel at gluster.org is the email address for the mailing list. 
>>>>>>>>We should probably start with the initial run numbers and the 
>>>>>>>>comparison for the glusterfs and nfs mounts.  Maybe something 
>>>>>>>>like
>>>>>>>>
>>>>>>>>glusterfs mount: 90 minutes
>>>>>>>>nfs mount: 25 minutes
>>>>>>>>
>>>>>>>>And profile outputs, volume config, number of mounts, hardware 
>>>>>>>>configuration should be a good start.
>>>>>>>>
>>>>>>>>Pranith
>>>>>>>>
>>>>>>>>On 08/05/2014 09:28 AM, David Robinson wrote:
>>>>>>>>>Thanks pranith
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>===============================
>>>>>>>>>David F. Robinson, Ph.D.
>>>>>>>>>President - Corvid Technologies
>>>>>>>>>704.799.6944 x101 [office]
>>>>>>>>>704.252.1310 [cell]
>>>>>>>>>704.799.7974 [fax]
>>>>>>>>>David.Robinson at corvidtec.com
>>>>>>>>>http://www.corvidtechnologies.com/
>>>>>>>>>
>>>>>>>>>>On Aug 4, 2014, at 11:22 PM, Pranith Kumar Karampuri 
>>>>>>>>>><pkarampu at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>On 08/05/2014 08:33 AM, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>
>>>>>>>>>>>On 08/05/2014 08:29 AM, David F. Robinson wrote:
>>>>>>>>>>>>>>On 08/05/2014 12:51 AM, David F. Robinson wrote:
>>>>>>>>>>>>>>No. I don't want to use nfs. It eliminates most of the 
>>>>>>>>>>>>>>benefits of why I want to use gluster. Failover redundancy 
>>>>>>>>>>>>>>of the pair, load balancing, etc.
>>>>>>>>>>>>>What is the meaning of 'Failover redundancy of the pair, 
>>>>>>>>>>>>>load balancing'?  Could you elaborate more? 
>>>>>>>>>>>>>smb/nfs/glusterfs are just access protocols that gluster 
>>>>>>>>>>>>>supports; the functionality is almost the same.
>>>>>>>>>>>>Here is my understanding. Please correct me where I am 
>>>>>>>>>>>>wrong.
>>>>>>>>>>>>
>>>>>>>>>>>>With gluster, if I am doing a write and one of the 
>>>>>>>>>>>>replicated pairs goes down, there is no interruption to the 
>>>>>>>>>>>>I/o. The failover is handled by gluster and the fuse client. 
>>>>>>>>>>>>This isn't done if I use an nfs mount unless the component 
>>>>>>>>>>>>of the pair that goes down isn't the one I used for the 
>>>>>>>>>>>>mount.
>>>>>>>>>>>>
>>>>>>>>>>>>With nfs, I will have to mount one of the bricks. So, if I 
>>>>>>>>>>>>have gfs01a, gfs01b, gfs02a, gfs02b, gfs03a, gfs03b, etc and 
>>>>>>>>>>>>my fstab mounts gfs01a, it is my understanding that all of 
>>>>>>>>>>>>my I/o will go through gfs01a which then gets distributed to 
>>>>>>>>>>>>all of the other bricks. Gfs01a throughput becomes a 
>>>>>>>>>>>>bottleneck. Where if I do a gluster mount using fuse, the 
>>>>>>>>>>>>load balancing is handled at the client side , not the 
>>>>>>>>>>>>server side. If I have 1000-nodes accessing 20-gluster 
>>>>>>>>>>>>bricks, I need the load balancing aspect. I cannot have all 
>>>>>>>>>>>>traffic going through the network interface on a single 
>>>>>>>>>>>>brick.
>>>>>>>>>>>>
>>>>>>>>>>>>If I am wrong with the above assumptions, I guess my 
>>>>>>>>>>>>question is why would one ever use the gluster mount instead 
>>>>>>>>>>>>of nfs and/or samba?
>>>>>>>>>>>>
>>>>>>>>>>>>Tom: feel free to chime in if I have missed anything.
>>>>>>>>>>>I see your point now.  Yes, the gluster server where you did 
>>>>>>>>>>>the mount is kind of a bottleneck.
>>>>>>>>>>Now that we've established the problem is in the 
>>>>>>>>>>clients/protocols, you should send out a detailed mail on 
>>>>>>>>>>gluster-devel and see if anyone can help you with the 
>>>>>>>>>>performance xlators that can improve it a bit more.  My area of 
>>>>>>>>>>expertise is more on replication; I am sub-maintainer for the 
>>>>>>>>>>replication and locks components.  I also know the connection 
>>>>>>>>>>management/io-threads related issues which lead to hangs, as I 
>>>>>>>>>>worked on them before.  Performance xlators are a black box to 
>>>>>>>>>>me.
>>>>>>>>>>
>>>>>>>>>>Performance xlators are enabled only on the fuse gluster stack. 
>>>>>>>>>>On nfs server mounts we disable all the performance xlators 
>>>>>>>>>>except write-behind, as the nfs client does a lot of things to 
>>>>>>>>>>improve performance itself.  I suggest you guys follow up more 
>>>>>>>>>>on gluster-devel.
>>>>>>>>>>
>>>>>>>>>>Appreciate all the help you did for improving the product :-). 
>>>>>>>>>>Thanks a ton!
>>>>>>>>>>Pranith
>>>>>>>>>>>Pranith
>>>>>>>>>>>>David (Sent from mobile)
>>>>>>>>>>>>
>>>>>>>>>>>>===============================
>>>>>>>>>>>>David F. Robinson, Ph.D.
>>>>>>>>>>>>President - Corvid Technologies
>>>>>>>>>>>>704.799.6944 x101 [office]
>>>>>>>>>>>>704.252.1310 [cell]
>>>>>>>>>>>>704.799.7974 [fax]
>>>>>>>>>>>>David.Robinson at corvidtec.com
>>>>>>>>>>>>http://www.corvidtechnologies.com/
>>>>>>>>
>>>>>>>
>>>>>>>_______________________________________________
>>>>>>>Gluster-devel mailing list
>>>>>>>Gluster-devel at gluster.org
>>>>>>>http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>_______________________________________________
>>>>>>Gluster-devel mailing list
>>>>>>Gluster-devel at gluster.org
>>>>>>http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>
>>>
>>
>