[Gluster-users] Gluster 3.6.3 performance.cache-size not working as expected in some cases

Raghavendra Bhat rabhat at redhat.com
Wed Sep 2 12:56:25 UTC 2015

On 09/02/2015 12:45 PM, Raghavendra Bhat wrote:
> Hi Christian,
> I have been working on it for the past couple of days but have not 
> been able to recreate the issue. I will keep trying to reproduce it 
> and get back to you in a day or two.
> Regards,
> Raghavendra Bhat

Hi Christian,

As per our tests (mine and Raghavendra G's, who is in CC), the data was 
being served from cache; in fact, it was being served from the kernel 
cache itself.  So we dropped the kernel's page cache (i.e. 
echo 1 > /proc/sys/vm/drop_caches) to check whether the read requests 
would then reach glusterfs.  We found that the read calls did come to 
glusterfs, and glusterfs served them from its own cache (i.e. the 
io-cache xlator).  If the memory pressure on the system is high and the 
kernel sends forgets to the glusterfs client, then there is a 
possibility that the inodes are forgotten along with the data cached 
within them.
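
For reference, a minimal way to repeat this check on one of your client 
nodes would be something like the following (eth1 being the gluster 
backend interface from your setup, and the file path being the one you 
used in your tests):

# drop the kernel page cache on the client
echo 1 > /proc/sys/vm/drop_caches
# read the hot file again through the fuse mount
cat /path/to/file > /dev/null
# meanwhile, watch the backend interface for brick traffic
iftop -i eth1

If the io-cache xlator is serving the read, the cat should generate 
little or no traffic on the backend network even though the kernel 
cache has just been dropped.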

Can you please enable the trace log level for the client 
(gluster volume set <volname> client-log-level TRACE) and run your 
tests? Once your tests are done, please share the logs.

NOTE: Enabling the trace log level will make the log file grow much 
faster because of the additional logging.
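
For example, a minimal sequence would be something like this (DOCROOT 
being the volume name from your volume info below; the fuse client log 
is normally written under /var/log/glusterfs/ with a name derived from 
the mount point):

gluster volume set DOCROOT client-log-level TRACE
# run the curl/cat tests that reproduce the problem, then lower the
# log level again so the log file does not keep growing
gluster volume set DOCROOT client-log-level INFO

Please attach the client log that covers the test window.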

Raghavendra Bhat
> On 09/02/2015 12:45 AM, Christian Rice wrote:
>> This is still an issue for me. I don’t need anyone to tear the code 
>> apart, but I’d be grateful if someone would even chime in and say, 
>> “yeah, we’ve seen that too.”
>> From: Christian Rice <crice at pandora.com>
>> Date: Sunday, August 30, 2015 at 11:18 PM
>> To: gluster-users at gluster.org
>> Subject: [Gluster-users] Gluster 3.6.3 performance.cache-size not 
>> working as expected in some cases
>> I am confused about my caching problem.  I’ll try to keep this as 
>> straightforward as possible and include the basic details...
>> I have a sixteen node distributed volume, one brick per node, XFS 
>> isize=512, Debian 7/Wheezy, 32GB RAM minimally.  Every brick node is 
>> also a gluster client, and also importantly an HTTP server.  We use a 
>> back-end 1GbE network for gluster traffic (eth1).  There are a couple 
>> dozen gluster client-only systems accessing this volume, as well.
>> We had a really hot spot on one brick due to an oft-requested file, 
>> and every time any httpd process on any gluster client was asked to 
>> deliver the file, it was physically fetching it (we could see this 
>> traffic using, say, ‘iftop -i eth1’), so we thought to increase the 
>> volume cache timeout and cache size.  We set the following values for 
>> testing:
>> performance.cache-size: 16GB
>> performance.cache-refresh-timeout: 30
>> This test was run from a node that didn’t have the requested file on 
>> the local brick:
>> while(true); do cat /path/to/file > /dev/null; done
>> and what had been very high traffic on the gluster backend network, 
>> delivering the data repeatedly to my requesting node, dropped to 
>> nothing visible.
>> I thought good, problem fixed.  Caching works.  My colleague had run 
>> a test early on to show this perf issue, so he ran it again to sign off.
>> His testing used curl, because all the real front end traffic is 
>> HTTP, and all the gluster nodes are web servers, which are of course 
>> using the fuse mount to access the document root.  Even with our 
>> performance tuning, the traffic on the gluster backend subnet was 
>> continuous and undiminished.  I saw no evidence of caching (again 
>> using ‘iftop -i eth1’, which showed a steady 75+% of line rate on a 
>> 1GbE link).
>> Does that make sense at all?  We had theorized that we wouldn’t get 
>> to use VFS/kernel page cache on any node except maybe the one which 
>> held the data in the local brick.  That’s what drove us to setting 
>> the gluster performance cache.  But it doesn’t seem to come into play 
>> with http access.
>> Volume info:
>> Volume Name: DOCROOT
>> Type: Distribute
>> Volume ID: 3aecd277-4d26-44cd-879d-cffbb1fec6ba
>> Status: Started
>> Number of Bricks: 16
>> Transport-type: tcp
>> Bricks:
>> <snipped list of bricks>
>> Options Reconfigured:
>> performance.cache-refresh-timeout: 30
>> performance.cache-size: 16GB
>> The net result of being overwhelmed by a hot spot is all the gluster 
>> client nodes lose access to the gluster volume—it becomes so busy it 
>> hangs.  When the traffic goes away (failing health checks by load 
>> balancers causes requests to be redirected elsewhere), the volume 
>> eventually unfreezes and life goes on.
>> I wish I could type ALL that into a google query and get a lucid 
>> answer :)
>> Regards,
>> Christian
