[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins
matt at mattlantis.com
Tue Feb 3 05:46:11 UTC 2015
So I've been frustrated by intermittent performance problems throughout
January. The problem occurs on a two-node setup running 3.4.5, with 16
gigs of RAM and a bunch of local disk. For anywhere from an hour to
weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes
get their CPUs pegged, and vmstat shows extremely high numbers of
context switches and interrupts. Eventually things calm down. During
these episodes, memory usage actually drops: overall usage on the box
goes from between 6 and 10 gigs to right around 4 gigs, and stays
there. That's what really puzzles me.
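One way to have numbers on hand for the next episode is to log the same counters vmstat reports. The following is a minimal sketch, assuming a Linux host where /proc/stat exposes `ctxt` and `intr` lines; the function names and the 5-second interval are illustrative choices, not anything from the setup described above.

```python
import time


def parse_proc_stat(text):
    """Return (context_switches, interrupts) totals from /proc/stat text."""
    ctxt = intr = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            ctxt = int(fields[1])
        elif fields[0] == "intr":
            # The first number on the intr line is the total across all IRQs.
            intr = int(fields[1])
    if ctxt is None or intr is None:
        raise ValueError("ctxt or intr line missing")
    return ctxt, intr


def rates(before, after, seconds):
    """Turn two cumulative samples into per-second rates."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(path="/proc/stat", interval=5.0):
    """Read the counters twice, `interval` seconds apart, and return rates."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return rates(before, after, interval)
```

Calling `sample()` in a loop and appending a timestamped line to a file would give a record that can be lined up against the OpenNMS graphs afterwards.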
When performance is problematic, sar shows one device, the one
corresponding to the glusterfsd process that is using all the CPU, doing
lots of little reads, sometimes 70k/second, with a very small avg rq
size, say 10-12.
Afraid I don't have any saved output handy, but I can try to capture
some next time it happens. I have tons of information frankly, but am
trying to keep this reasonably brief.
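For capturing that read pattern when it recurs, here is a minimal sketch that derives reads per second and average sectors per read from two snapshots of /proc/diskstats. It assumes the standard Linux field layout (device name in the third column, reads completed in the fourth, sectors read in the sixth, with sectors counted in 512-byte units). It only looks at reads, whereas the average request size sar reports covers reads and writes together, so the two figures will not match exactly.

```python
def parse_diskstats(text):
    """Map device name -> (reads_completed, sectors_read) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 6:
            continue
        # f[2] = device name, f[3] = reads completed, f[5] = sectors read
        stats[f[2]] = (int(f[3]), int(f[5]))
    return stats


def read_profile(before, after, seconds):
    """Per device: (reads per second, average sectors per read) over the interval."""
    out = {}
    for dev, (r1, s1) in after.items():
        if dev not in before:
            continue
        r0, s0 = before[dev]
        reads = r1 - r0
        avg_sectors = (s1 - s0) / reads if reads else 0.0
        out[dev] = (reads / seconds, avg_sectors)
    return out
```

Taking one snapshot, sleeping for a known interval, taking another, and passing both to `read_profile` gives per-device numbers that can be saved alongside the sar output.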
There are more than a dozen volumes on this two-node setup. The CPU
usage is pretty much entirely confined to one volume, a 1.5 TB volume
that is just shy of 70% full. It stores uploaded files for a web app.
What I hate about this app, and so am always suspicious of, is that it
stores a directory for every user in one level, so under the /data
directory in the volume there are 450,000 subdirectories at this point.
The only real mitigation step that's been taken so far was to turn off
the self-heal daemon on the volume, as I thought maybe crawling that
large directory was getting expensive. This doesn't seem to have done
anything as the problem still occurs.
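To get a rough feel for how expensive a crawl of that one flat directory is, a sketch like the following times a single-level enumeration. The function name is hypothetical, and the path to point it at is left to the reader: run against a client mount it measures listing through Gluster, while run against the brick's backing directory it measures only the local filesystem.

```python
import os
import time


def count_subdirs(path):
    """Return (number of subdirectories, seconds taken) for one directory level."""
    start = time.monotonic()
    n = 0
    with os.scandir(path) as it:
        for entry in it:
            # Count directories only, without following symlinks.
            if entry.is_dir(follow_symlinks=False):
                n += 1
    return n, time.monotonic() - start
```

Comparing the timing during a quiet period with the timing during an episode would show whether listing this directory gets slower when the CPUs are pegged, though the listing itself adds load, so it is best run sparingly.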
At this point I figure one of two sorts of things is happening, very
broadly: one, we're running into some sort of bug or performance
problem with Gluster that we should fix, perhaps by upgrading, or tune
around; or two, some process we're running but aren't aware of is
hammering the file system and causing problems.
If it's the latter option, can anyone give me any tips on figuring out
what might be hammering the system? I can use volume top to see what a
brick is doing, but I can't figure out how to tell what clients are
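For the per-client question, a possible starting point is sketched below, assuming the `gluster volume status <VOLNAME> clients`, `gluster volume profile` and `gluster volume top` subcommands are available in the installed release (worth confirming with `gluster volume help`). The helper names and the list count of 20 are illustrative. The `clients` view lists the clients connected to each brick with bytes read and written, which may be enough to spot a heavy hitter.

```python
import shutil
import subprocess


def diagnostic_commands(volume):
    """Build the gluster CLI invocations for a first look at one volume."""
    return [
        # Clients connected to each brick, with bytes read and written.
        ["gluster", "volume", "status", volume, "clients"],
        # Per-brick operation counts and latencies; profiling must be started first.
        ["gluster", "volume", "profile", volume, "start"],
        ["gluster", "volume", "profile", volume, "info"],
        # Files with the most read calls on the volume.
        ["gluster", "volume", "top", volume, "read", "list-cnt", "20"],
    ]


def run_diagnostics(volume):
    """Run each command in turn and print whatever it returns."""
    if shutil.which("gluster") is None:
        raise RuntimeError("gluster CLI not found on PATH")
    for argv in diagnostic_commands(volume):
        print("$", " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

Profile output is only informative after it has been collecting for a while, so in practice `profile info` would be re-run some minutes after `profile start`, and profiling stopped with `gluster volume profile <VOLNAME> stop` once done.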
Apologies for the somewhat broad nature of the question; any input or
thoughts would be much appreciated. I can certainly provide more info
about some things if it would help, but I've tried not to write a novel.