[Gluster-devel] OOM issue in openstack Cinder - GlusterFS CI env

Mon Feb 23 12:31:58 UTC 2015

On Sun, Feb 22, 2015 at 8:46 PM, Niels de Vos <ndevos at redhat.com> wrote:

> Hi Deepak and others,
>
> On Sat, Feb 21, 2015 at 10:00:18PM +0530, Deepak Shetty wrote:
> > Hi All,
> >   I am looking for some help from glusterfs side for the Out of Memory
> > (OOM) issue
> > we are seeing when using GlusterFS as a storage backend for openstack
> > Cinder (block storage service)
>
> This seems to be using the FUSE mount points for the VM images, right?
>

Yes, no libgfapi.

>
> >     openstack has a upstream CI env managed by openstack infra team,
> where
> > we added a new job that creates a devstack env (openstack all in one for
> > newbies) and configures the block service (Cinder) with GlusterFS as
> > storage backend. Once setup, the CI job runs openstack tempest
> (Integration
> > test suite of openstack) that does API level testing of the whole
> openstack
> > env.
> >
> >     As part of that testing, ~1.5 to 2 hours into the run, the tempest
> job
> > (VM) hits OOM and the kernel oom-killer kills the process with the max
> > memory to reduce memory pressure.
> >
> >     The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the
> > storage backend for openstack Cinder
> >
> >     The openstack-dev thread @
> > *http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861
> > <http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861>* has
> > details including links to the logs captured from the tempest jobs.
>
> http://fungi.yuggoth.org/tmp/logs2.tar seems to be the most useful.
>
> > Per the openstack infra folks, they have other jobs based off CentOS7
> that
> > doesn't hit this issue, the only change we are adding is configuring
> cinder
> > with glusterfs when this happens, so right now glusterfs is in the
> > spotlight for causing this.
>
> Well, there definitely is something wrong on the Gluster side. After the
> 'blackout' between 17:47 - 18:17, this is logged in
> logs/screen-c-vol.txt.gz:
>
>   mount -t glusterfs 127.0.0.1:/vol1
> /opt/stack/data/cinder/mnt/f4c6ad7e3bba4ad1195b3b538efab64a" returned: 32
>
> 32 as exit code from "mount" means "mount failure".
>

Yeah, but thats the time when things went bad in the system leading to OOM
So can't we safely assume that system was in a bad state, hence mount
failure...

We actually need to look before 17:47 and at/around 17:47 to figure what
was running that caused the
blackout to happen.

I do see some errors I see in the brick logs and warnings in the mount logs
but I Don't think they are anyway related to or cause of creatign OOM.

>
> The mountpoint would log details in
>
> logs/glusterfs/opt-stack-data-cinder-mnt-f4c6ad7e3bba4ad1195b3b538efab64a.log.
> At 18:00 in that file, there is this message:
>
>   0-vol1-client-0: server 104.130.30.179:49152 has not responded in the
> last 42 seconds, disconnecting.
>
> Its a one-brick volume, so this error is pretty fatal. This error does
> not cause an unmount, but the mountpoint will return errors. Subsequent
> mount tries will fail, because the mountpoint is not unmounted yet.
> Tests to see if the target directory can be used to mount a volume, will
> fail, causing the mount process to return error 32.
>

Thats in the middle of the blackout, so disconnections are ok, i feel.

>
> A little later, the (still) existing fuse mount client manages to
> connect to the brick again:
>
>   0-vol1-client-0: changing port to 49152 (from 0)
>
> The log of the brick does not contain anything useful for this time :-/
>
> > I am looking thru the logs trying to co-relate syslog, dstat, tempest
> info
> > to figure the state of the system and what was happening at and before
> the
> > OOM to get any clues, but wanted to start this thread in gluster-devel to
> > see if others can pitch in with their ideas to accelerate the debug and
> > help root cause.
>
> I have not yet found any pointer to what could have taken place between
> 17:47-18:17. It seems that network I/O stopped (ping timeout in FUSE).
>

During that tiem from the tempest logs i could figure that among other tests
VolumesBackupV1Test was running, which reads from Cinder- Gluster backend
and write to swift service, which will add good IO traffic.

> One of the few logs that I came across which contains some messages
> during this 'blackout', is logs/screen-n-cond.txt.gz. But I assume these
> messages are triggered by the underlying network issues.
>
>
> You mention that there was a OOM? Which logs contain some details about
> that? Also, could you explain what test (workload) was running in the
> VM(s?) on the FUSE mountpoint?
>

Look at logs/syslog.txt (thats /var/log/messages), grep for oom-killer
so at 18:17 when it happend it killed nova-compute and mysqld as they were
the
most memory hoggers. But in talking with openstack infra folks, they
indicated
that they have similar CI jobs where mysqld doesn't get killed, so adding
gluster to
the mix is definitely the reason for reaching OOM state.

There is really no application running workload here. Most of the IO is
generated as part
of oepnstack orchestration tests, liek backup (which reads from cinder and
writes to swift), clone
(create a snap and then copy it), create/delete cinder volumes (which are
files in glusterfs) etc

>From what i understood in talking with the openstack infra folks they
referred me to
the dstat logs sayign the kernel buffer cache is going as low as double
digit M and system is in
70-90 % cpu wait time, which to them indicates that the IO of the system
has gone bonkers

Below is the exact thing from the chat log (for the 1st re-create / logs) :

Feb 20 21:46:28 <sdague>    deepakcs: you are at 70% wait time at the end
of that

Feb 20 21:46:37 <sdague>    so your io system is just gone bonkers

Feb 20 21:47:14 <fungi>    sdague: that would explain why the console login
prompt and ssh daemon both stopped working, and the df loop in had going in
my second ssh session hung around the same time
Feb 20 21:47:26 <sdague>    yeh, dstat even says it's skipping ticks there
Feb 20 21:47:29 <sdague>    for that reason

Feb 20 21:47:48 <fungi>    likely complete i/o starvation for an extended
period at around that timeframe
Feb 20 21:48:05 <fungi>    that would also definitely cause jenkins to give
up on the worker if it persisted for very long at all

Feb 20 21:48:09 <sdague>    yeh, cached memory is down to double digit M

Feb 20 21:49:19 <fungi>    because that's 00% of what it entails, at least
for the hard problems
Feb 20 21:49:21 <sdague>    deepakcs: so, honestly, what it means to me is
that glusterfs is may be too inefficient to function in this environment
Feb 20 21:49:25 <fungi>    er, 99%
Feb 20 21:49:34 <sdague>    because it's kind of a constrained environment

Feb 20 21:51:18 <deepakcs>    sdague, skipping ticks means ?

Feb 20 21:51:32 <sdague>    deepakcs: dstat runs once a second
Feb 20 21:51:38 <sdague>    unless the system gets under preasure
Feb 20 21:51:54 <sdague>    then it starts skipping it's runs to prevent
things from getting worse
Feb 20 21:52:01 <sdague>    i.e. to not be part of the problem

Feb 20 21:53:24 <dansmith>    deepakcs: dstat is a well-understood thing,
so you can probably figure out all you need to know from the docs

Feb 20 21:53:41 <dansmith>    deepakcs: it's run in the jobs so we can
identify issues like this where the system is overly taxed and unable to
actually run the tests
Feb 20 21:53:57 <deepakcs>    dansmith, agreed, just trying to understand
system debug better :)

Feb 20 21:54:39 <dansmith>    deepakcs: http://linux.die.net/man/1/dstat
Feb 20 21:54:54 <fungi>    deepakcs: i have another manual run started just
now, with the devstack-gate timeout cranked up to 5 hours just to make sure
we capture the full job duration

Feb 20 21:55:37 <deepakcs>    dansmith, thanks :)
Feb 20 21:55:41 <fungi>    deepakcs: i'll keep you posted in here or follow
up to the e-mail thread if you're gone

Feb 20 21:57:42 <deepakcs>    sdague, dansmith what i am not clear is that
mysqld was the one with the max mem taken, which is why oom killer killed
it, so howz gluster connected to creating oom here ?
Feb 20 21:58:19 <dansmith>    deepakcs: buffer cache can create huge
amounts of memory pressure
Feb 20 21:58:34 <dansmith>    deepakcs: the largest memory user gets the
boot just because it's the most likely to reduce the pressure

Feb 20 21:58:48 <dansmith>    deepakcs: there is a good data point here
that mysql doesn't get killed in the other jobs

Feb 20 21:59:20 <deepakcs>    dansmith, ok, i looked up the glusterfsd (the
process doing the IO to the disk) and its max 1.7G ram
Feb 20 21:59:21 <dansmith>    deepakcs: nor do I know what's going on in
your job
Feb 20 21:59:31 <dansmith>    deepakcs: that's user memory
Feb 20 21:59:49 <deepakcs>    dansmith, yes and glusterfs is a user space
FS entirely
Feb 20 22:00:03 <dansmith>    deepakcs: is it writing with O_DIRECT?

Feb 20 22:00:38 <deepakcs>    dansmith, the logs doesn't have glusterfs
logs, so cant tell, but we don't configure it w/ o_direct by default, so it
shouldn't
Feb 20 22:00:55 <dansmith>    deepakcs: so if not, every write it makes
generates buffer cache load in the kernel

Feb 20 22:01:24 <dansmith>    deepakcs: regardless, this is really not the
right forum to discuss this level of system debug I think
Feb 20 22:01:35 <deepakcs>    dansmith, ok
Feb 20 22:01:59 <dansmith>    deepakcs: the stats runs from your job and
other non-failing jobs should be helpful, and it's clear that something in
your job is eating all the resources

>
> In logs/screen-dstat.txt.gz there seems to be a process called
> "cinder-backup" that is running around that time. From the details of
> dstat, I guess that this process causes the reading of files from the
> Gluster Volume. When the system becomes responsive again, dstat has 11G
> as "dsk/total read", it jumped from ~400M in 30 minutes. If this amount
> of data is not flushed but kept in memory, I don't doubt that an OOM
> could be triggered.
>

You do or don't doubt ? - thats confusing :)

Yes, I am also doubting that IO traffic generated by doing backup tests is
the only scenario
where most IO is generated (from Cinder-GlusterFS perspective). I looked at
all the tests
that openstack Cinder has, and this was the only intensive test.

Given that Backup was running before the time system went OOM, it seems
liek a good
one to doubt upon.

Plan is to disable Backup tests and see if we still hit the OOM issue.

>
> Do you know where this backup is written to? If we can setup a
> reproducer environment that emulates the workload, things would become
> much easier for debugging :)
>

In the current setup, backup source is GlusterFS and target is swift
(configured to
store its data in /opt/stack/swift/ mounted on 2nd disk)

>
> One of the 1st things I would try, is to configure the kernel to flush
> dirty memory more aggressively ('sysctl' values for 'vm.*' options). An
> other thing that may help to identify the memory usage, is capturing the
> output of 'slabtop --once' in a 1-minute interval.
>

Unfortunately, openstack infra doesn't provide that kind of job control.
The idea is to have a uniform
env for all storage backends and just change the specific backend that one
wants to  test, so that the runs are
comparable agaisnt a common base. Atleast that my understanding.

But yes there are knobs given to tweak/tune the kind of tests you wanna
run, which is what I plan
to exploit and turn off Backup tests and see if it helps keep system in
sane state.

Thanks Neils for your time and analysis, it really helped.

I m hoping now with the backup tests disabled, we won't hit, will update
here once openstack infra folks co-operate
to run the CI job manually for us!

thanx,
deepak
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20150223/9acadf61/attachment-0001.html>