[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins
Matt
matt at mattlantis.com
Tue Feb 3 16:38:41 UTC 2015
I've been trying for weeks to reproduce the performance problems in
our preproduction environments, but I can't. As a result, selling the
idea of just upgrading to 3.6.x and hoping the problem goes away might
be tricky. 3.6 is perceived as a little too bleeding edge, and we've
actually had some other not fully explained issues with this cluster
recently that make us hesitate. I don't think they're related.
On Tue, Feb 3, 2015 at 4:58 AM, Justin Clift <justin at gluster.org> wrote:
> ----- Original Message -----
>> Hello List,
>>
>> So I've been frustrated by intermittent performance problems
>> throughout January. The problem occurs on a two-node setup running
>> 3.4.5, with 16 GB of RAM and a bunch of local disk. For sometimes an
>> hour, sometimes weeks at a time (I have extensive graphs in OpenNMS),
>> our Gluster boxes will get their CPUs pegged, and vmstat will show
>> extremely high numbers of context switches and interrupts. Eventually
>> things calm down. During this time, memory usage actually drops.
>> Overall usage on the box goes from between 6 and 10 GB to right
>> around 4 GB, and stays there. That's what really puzzles me.
>>
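>> For reference, the raw check behind those numbers is nothing fancier
>> than, roughly (the interval is arbitrary):
>>
>>     vmstat 5    # the 'in' and 'cs' columns are what spike during the bad periods
>>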
>> When performance is problematic, sar shows one device (the device
>> corresponding to the busy glusterfsd process) using all the CPU,
>> doing lots of little reads: sometimes 70k/second, with a very small
>> average request size, say 10-12. I'm afraid I don't have any saved
>> output handy, but I can try to capture some the next time it happens.
>> Frankly, I have tons of information, but I'm trying to keep this
>> reasonably brief.
>>
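>> For the device stats, that's just sar's block device report, along
>> the lines of:
>>
>>     sar -d -p 1 10    # per-device tps, rd_sec/s, and avgrq-sz
>>
>> with avgrq-sz being where the very small 10-12 figure shows up.
>>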
>> There are more than a dozen volumes on this two-node setup. The CPU
>> usage is pretty much entirely contained to one volume, a 1.5 TB
>> volume that is just shy of 70% full. It stores uploaded files for a
>> web app. What I hate about this app, and am therefore always
>> suspicious of, is that it stores a directory for every user at one
>> level, so under the /data directory in the volume there are 450,000
>> subdirectories at this point.
>>
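>> (The 450,000 figure is a straight count of the top-level user
>> directories on one brick, roughly:
>>
>>     find /path/to/brick/data -mindepth 1 -maxdepth 1 -type d | wc -l
>>
>> with /path/to/brick standing in for the actual brick path.)
>>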
>> The only real mitigation step taken so far was to turn off the
>> self-heal daemon on the volume, as I thought maybe crawling that
>> large directory was getting expensive. This doesn't seem to have done
>> anything, as the problem still occurs.
>>
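>> For the record, that was done with the standard volume option,
>> something along the lines of:
>>
>>     gluster volume set <volname> cluster.self-heal-daemon off
>>
>> with <volname> being the 1.5 TB volume.
>>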
>> At this point I figure one of two broad sorts of things is happening:
>> one, we're running into some sort of bug or performance problem with
>> Gluster that we should fix, perhaps by upgrading or tuning around it;
>> or two, some process we're running but aren't aware of is hammering
>> the file system and causing problems.
>>
>> If it's the latter, can anyone give me any tips on figuring out what
>> might be hammering the system? I can use volume top to see what a
>> brick is doing, but I can't figure out how to tell which clients are
>> doing what.
>>
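>> For concreteness, the sort of thing I've been running is roughly:
>>
>>     gluster volume top <volname> read brick <server>:<brick-path>
>>     gluster volume top <volname> open brick <server>:<brick-path>
>>
>> which shows the hot files on a brick, but nothing that ties that
>> activity back to a particular client.
>>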
>> Apologies for the somewhat broad nature of the question; any input or
>> thoughts would be much appreciated. I can certainly provide more info
>> about some things if it would help, but I've tried not to write a
>> novel here.
>
> Out of curiosity, are you able to test using GlusterFS 3.6.2? We've
> done a bunch of pretty in-depth upstream testing at decent scale (100+
> nodes) from 3.5.x onwards, with lots of performance issues identified
> and fixed along the way.
>
> So, I'm kinda hopeful the problem you're describing is fixed in newer
> releases. :D
>
> Regards and best wishes,
>
> Justin Clift
>
> --
> GlusterFS - http://www.gluster.org
>
> An open source, distributed file system scaling to several
> petabytes, and handling thousands of clients.
>
> My personal twitter: twitter.com/realjustinclift