[Gluster-devel] OOM issue in openstack Cinder - GlusterFS CI env

Deepak Shetty dpkshetty at gmail.com
Sat Feb 21 16:30:18 UTC 2015


Hi All,
  I am looking for some help from the glusterfs side with the Out of Memory
(OOM) issue we are seeing when using GlusterFS as the storage backend for
openstack Cinder (the block storage service).

    openstack has an upstream CI env managed by the openstack infra team,
where we added a new job that creates a devstack env (openstack all-in-one,
for newbies) and configures the block storage service (Cinder) with
GlusterFS as the storage backend. Once set up, the CI job runs openstack
tempest (the integration test suite of openstack), which does API-level
testing of the whole openstack env.

    As part of that testing, ~1.5 to 2 hours into the run, the tempest job
(VM) hits OOM and the kernel oom-killer kills the process using the most
memory to relieve memory pressure.

    The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the
storage backend for openstack Cinder.

    The openstack-dev thread at
http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861 has details,
including links to the logs captured from the tempest jobs.

Per the openstack infra folks, they have other jobs based on CentOS 7 that
don't hit this issue; the only change in our job is configuring cinder with
glusterfs, so right now glusterfs is in the spotlight for causing this.

I am looking through the logs, trying to correlate the syslog, dstat and
tempest info to figure out the state of the system and what was happening at
and before the OOM, but wanted to start this thread on gluster-devel so
others can pitch in with their ideas to speed up the debugging and help find
the root cause.
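
Something along the lines of the throwaway sketch below can be used to pull
the kernel oom-killer events out of syslog and print the dstat samples from
the minute leading up to each one. The file names (syslog.txt, dstat.csv),
the dstat CSV layout, the timestamp formats and the hard-coded year are all
assumptions and need adjusting to the logs the job actually captures:

# Rough sketch: correlate kernel oom-killer events with dstat samples.
# Assumes syslog-style "Feb 20 21:47:10 host kernel: ..." timestamps and
# dstat run with --time and --output so the first CSV column is the time.

import csv
import re
from datetime import datetime

YEAR = 2015  # syslog lines carry no year
OOM_RE = re.compile(r'^(\w{3}\s+\d+\s+[\d:]{8}).*(oom-killer|Out of memory)')

def oom_events(path='syslog.txt'):
    """Yield a datetime for every kernel OOM line found in syslog."""
    with open(path) as f:
        for line in f:
            m = OOM_RE.match(line)
            if m:
                yield datetime.strptime('%d %s' % (YEAR, m.group(1)),
                                        '%Y %b %d %H:%M:%S')

def dstat_samples(path='dstat.csv'):
    """Yield (datetime, row) per sample; assumes dstat's "DD-MM HH:MM:SS"
    time string in the first CSV column."""
    with open(path) as f:
        for row in csv.reader(f):
            try:
                ts = datetime.strptime('%d-%s' % (YEAR, row[0]),
                                       '%Y-%d-%m %H:%M:%S')
            except (ValueError, IndexError):
                continue  # skip dstat's banner/header rows and blank lines
            yield ts, row

if __name__ == '__main__':
    samples = list(dstat_samples())
    for oom in oom_events():
        print('oom-killer hit at %s' % oom)
        for ts, row in samples:
            # show the minute of dstat data leading up to the kill
            if 0 <= (oom - ts).total_seconds() <= 60:
                print('  %s' % ','.join(row))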

Also pasting the relevant part of the chat log I had with the infra folks ...

Feb 20 21:46:28 <sdague>        deepakcs: you are at 70% wait time at the
end of that

Feb 20 21:46:37 <sdague>        so your io system is just gone bonkers

Feb 20 21:47:14 <fungi> sdague: that would explain why the console login
prompt and ssh daemon both stopped working, and the df loop I had going in
my second ssh session hung around the same time
Feb 20 21:47:26 <sdague>        yeh, dstat even says it's skipping ticks
there
Feb 20 21:47:29 <sdague>        for that reason

Feb 20 21:47:48 <fungi> likely complete i/o starvation for an extended
period at around that timeframe
Feb 20 21:48:05 <fungi> that would also definitely cause jenkins to give up
on the worker if it persisted for very long at all

Feb 20 21:48:09 <sdague>        yeh, cached memory is down to double digit M

Feb 20 21:49:21 <sdague>        deepakcs: so, honestly, what it means to me
is that glusterfs may be too inefficient to function in this environment
Feb 20 21:49:34 <sdague>        because it's kind of a constrained
environment