[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

Thu Feb 5 11:14:51 UTC 2015

On 02/03/2015 11:16 AM, Matt wrote:
> Hello List,
>
> So I've been frustrated by intermittent performance problems
> throughout January. The problem occurs on a two node setup running
> 3.4.5, 16 gigs of ram with a bunch of local disk. Sometimes for an
> hour, sometimes for weeks at a time (I have extensive graphs in
> OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat
> they'll show extremely high numbers of context switches and
> interrupts. Eventually things calm down. During this time, memory
> usage actually drops. Overall usage on the box goes from between 6-10
> gigs to right around 4 gigs, and stays there. That's what really
> puzzles me.
>
> When performance is problematic, sar shows one device, the device
> corresponding to the glusterfsd process using all the CPU, doing lots
> of little reads, sometimes 70k/second, with a very small avg rq size,
> say 10-12. I'm afraid I don't have any saved output handy, but I can
> try to capture some next time it happens. I have tons of information,
> frankly, but am trying to keep this reasonably brief.
>
> There are more than a dozen volumes on this two node setup. The CPU
> usage is pretty much entirely confined to one volume, a 1.5 TB volume
> that is just shy of 70% full. It stores uploaded files for a web app.
> What I hate about this app, and so am always suspicious of, is that it
> stores a directory for every user in one level, so under the /data
> directory in the volume, there are 450,000 subdirectories at this point.
>
> The only real mitigation step that's been taken so far was to turn off 
> the self-heal daemon on the volume, as I thought maybe crawling that 
> large directory was getting expensive. This doesn't seem to have done 
> anything as the problem still occurs.
>
> At this point I figure one of two sorts of things is happening,
> really broadly: one, we're running into some sort of bug or
> performance problem with gluster that we should either fix, perhaps by
> upgrading, or tune around; or two, some process we're running but not
> aware of is hammering the file system and causing problems.
>
> If it's the latter option, can anyone give me any tips on figuring out 
> what might be hammering the system? I can use volume top to see what a 
> brick is doing, but I can't figure out how to tell what clients are 
> doing what.
>
> Apologies for the somewhat broad nature of the question. Any input or
> thoughts would be much appreciated. I can certainly provide more info
> about some things if it would help, but I've tried not to write a
> novel here.
>
> Thanks,
Could you enable profiling for this volume with 'gluster volume profile
<volname> start'? The next time this issue happens, keep collecting
'gluster volume profile <volname> info' outputs. Mail them and let's see
what is happening.
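
One possible way to keep collecting those outputs while the problem is in progress is a small shell loop. This is only a sketch under assumptions: the function names snapshot_profile and collect_profiles, the GLUSTER variable, the output directory and the file naming are all made up for illustration; only the 'gluster volume profile <volname> info' command itself comes from this thread. Profiling is assumed to have been started already.

```shell
#!/bin/sh
# Sketch: save 'gluster volume profile <volname> info' output at intervals.
# GLUSTER, snapshot_profile and collect_profiles are hypothetical names
# introduced for this example; they are not part of the gluster CLI.

GLUSTER="${GLUSTER:-gluster}"   # can be overridden, e.g. for a dry run

# snapshot_profile VOLNAME OUTDIR
# Writes one timestamped file with the current profile info and prints
# the path of that file.
snapshot_profile() {
    vol="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    out="$outdir/profile-$vol-$stamp.txt"
    "$GLUSTER" volume profile "$vol" info > "$out" 2>&1
    rc=$?
    echo "$out"
    return "$rc"
}

# collect_profiles VOLNAME OUTDIR COUNT INTERVAL_SECONDS
# Takes COUNT snapshots, sleeping INTERVAL_SECONDS between them.
# Use an interval of at least 1 second so file names do not collide.
collect_profiles() {
    vol="$1"
    outdir="$2"
    count="$3"
    interval="$4"
    i=0
    while [ "$i" -lt "$count" ]; do
        snapshot_profile "$vol" "$outdir" > /dev/null
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, after sourcing the file, 'collect_profiles <volname> /var/tmp/gluster-profile 60 60' would take one snapshot a minute for about an hour (the directory is an arbitrary choice). On the question of which clients are doing what, 'gluster volume status <volname> clients' may also be worth capturing alongside these snapshots, since it lists the clients connected to each brick.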

Pranith
>
> -Matt
