[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

Matt matt at mattlantis.com
Tue Feb 3 05:46:11 UTC 2015

Hello List,

So I've been frustrated by intermittent performance problems throughout
January. The problem occurs on a two node setup running 3.4.5, 16 gigs
of RAM with a bunch of local disk. Sometimes for an hour, sometimes for
weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes
will get their CPUs pegged, and in vmstat they'll show extremely high
numbers of context switches and interrupts. Eventually things calm
down. During this time, memory usage actually drops. Overall usage on 
the box goes from between 6-10 gigs to right around 4 gigs, and stays 
there. That's what really puzzles me.

When performance is problematic, sar shows one device, the device
corresponding to the glusterfsd process using all the CPU, doing lots of
little reads, sometimes 70k/second, with a very small avg rq size, say 10-12.
Afraid I don't have any saved output handy, but I can try to capture 
some next time it happens. I have tons of information frankly, but am 
trying to keep this reasonably brief.
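
For capturing numbers the next time the CPUs get pegged, a minimal
sketch along these lines could log the same context switch and interrupt
rates that vmstat reports. It assumes a Linux host where /proc/stat
exposes cumulative "ctxt" and "intr" counters; the function names and
the 5 second interval are illustrative choices, not anything from the
setup described above.

```python
import time


def parse_counters(stat_text):
    """Extract cumulative context-switch and interrupt totals from /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            # "ctxt N": total context switches since boot
            counters["ctxt"] = int(fields[1])
        elif fields[0] == "intr":
            # "intr N ...": the first number is the total of all interrupts
            counters["intr"] = int(fields[1])
    return counters


def per_second(prev, cur, seconds):
    """Turn two cumulative samples into per-second rates."""
    return {key: (cur[key] - prev[key]) / seconds for key in prev}


def sample_rates(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart, and return the rates."""
    with open(path) as handle:
        prev = parse_counters(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        cur = parse_counters(handle.read())
    return per_second(prev, cur, interval)


def log_rates(samples, interval=5.0):
    """Print `samples` timestamped rate lines, suitable for redirecting to a file."""
    for _ in range(samples):
        rates = sample_rates(interval)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} ctxt/s={rates['ctxt']:.0f} intr/s={rates['intr']:.0f}")
```

Calling log_rates(720) would cover roughly an hour at the default
interval. Redirecting the output to a file gives something that can be
lined up against the OpenNMS graphs afterwards.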

There are more than a dozen volumes on this two node setup. The CPU 
usage is pretty much entirely contained to one volume, a 1.5 TB volume 
that is just shy of 70% full. It stores uploaded files for a web app. 
What I hate about this app, and so am always suspicious of, is that it
stores a directory for every user in one level, so under the /data
directory in the volume, there are 450,000 subdirectories at this point.

The only real mitigation step that's been taken so far was to turn off 
the self-heal daemon on the volume, as I thought maybe crawling that 
large directory was getting expensive. This doesn't seem to have done 
anything as the problem still occurs.

At this point I figure one of two sorts of things is happening, really
broadly: one, we're running into some sort of bug or performance problem
with Gluster that we should either fix, perhaps by upgrading, or tune
around; or two, some process we're running but not aware of is
hammering the file system and causing problems.

If it's the latter option, can anyone give me any tips on figuring out 
what might be hammering the system? I can use volume top to see what a 
brick is doing, but I can't figure out how to tell what clients are 
doing what.
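
As a starting point for the per-client question, the gluster CLI has a
"volume status <VOLNAME> clients" subcommand that lists the clients
connected to each brick, and "volume profile" reports per-brick
operation counts and latencies. The sketch below only assembles and
runs those commands alongside volume top. It assumes they behave in
3.4.5 as they do in the general CLI documentation; the volume name
"uploads" in the usage note is a placeholder, not the real volume.

```python
import subprocess


def diagnostic_commands(volume, list_cnt=10):
    """Build argv lists for a few read-only-ish gluster diagnostics on one volume."""
    count = str(list_cnt)
    return [
        # Clients connected to each brick of the volume
        ["gluster", "volume", "status", volume, "clients"],
        # Per-brick operation statistics; profiling must be started first
        ["gluster", "volume", "profile", volume, "start"],
        ["gluster", "volume", "profile", volume, "info"],
        # Files with the most reads, and directories opened most often
        ["gluster", "volume", "top", volume, "read", "list-cnt", count],
        ["gluster", "volume", "top", volume, "opendir", "list-cnt", count],
    ]


def run_all(volume):
    """Run each diagnostic command and print whatever it returns."""
    for argv in diagnostic_commands(volume):
        print("$", " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

Running run_all("uploads") on one of the storage nodes during a bad
period would print everything in one pass. Profiling adds some
overhead, so "gluster volume profile <VOLNAME> stop" is worth running
once the data has been collected.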

Apologies for the somewhat broad nature of the question; any input or
thoughts would be much appreciated. I can certainly provide more info
about some things if it would help, but I've tried not to write a novel 


