[Gluster-devel] OOM issue in openstack Cinder - GlusterFS CI env

Thu Feb 26 05:24:48 UTC 2015

A quick update

   By doing more experiments, we were able to figure that the failures are
happening only on the rax (rackspace) cloud VMs, but not on the hpcloud VMs
Also the OOM issue happened on rax VMs even when gluster was not used as
the storage (which means the cinder default storage uses LVMs)

So right now its seems like a combination of rax & centos7 is causing the
fail, making centos7 CI job unsuitable for openstack tempest runs.
Since openstack-infra jobs can be schedule across any cloud provider, we
cannot use centos7 as the distro, until the rax+centos7 oom issue is
resolved

Given the strict cinder community guidelines for having a CI job in place,
and openstack infra is having other issues with F20 and F21 based VMs,
(which are under debug as I was told on -infra IRC by infra core members )
the only option left is to use ubuntu as the base OS for running
cinder-gluster CI job, which we are in the process of evaluating and
working on.

 For full details refer to thread @
http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861

 For a quick summary, see this mail of the above thread @
http://article.gmane.org/gmane.comp.cloud.openstack.devel/47287

 thanx,
deepak

On Mon, Feb 23, 2015 at 6:01 PM, Deepak Shetty <dpkshetty at gmail.com> wrote:

>
>
> On Sun, Feb 22, 2015 at 8:46 PM, Niels de Vos <ndevos at redhat.com> wrote:
>
>> Hi Deepak and others,
>>
>> On Sat, Feb 21, 2015 at 10:00:18PM +0530, Deepak Shetty wrote:
>> > Hi All,
>> >   I am looking for some help from glusterfs side for the Out of Memory
>> > (OOM) issue
>> > we are seeing when using GlusterFS as a storage backend for openstack
>> > Cinder (block storage service)
>>
>> This seems to be using the FUSE mount points for the VM images, right?
>>
>
> Yes, no libgfapi.
>
>
>>
>> >     openstack has a upstream CI env managed by openstack infra team,
>> where
>> > we added a new job that creates a devstack env (openstack all in one for
>> > newbies) and configures the block service (Cinder) with GlusterFS as
>> > storage backend. Once setup, the CI job runs openstack tempest
>> (Integration
>> > test suite of openstack) that does API level testing of the whole
>> openstack
>> > env.
>> >
>> >     As part of that testing, ~1.5 to 2 hours into the run, the tempest
>> job
>> > (VM) hits OOM and the kernel oom-killer kills the process with the max
>> > memory to reduce memory pressure.
>> >
>> >     The tempest job is based on CentOS 7 and uses glusterfs 3.6.2 as the
>> > storage backend for openstack Cinder
>> >
>> >     The openstack-dev thread @
>> > *http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861
>> > <http://thread.gmane.org/gmane.comp.cloud.openstack.devel/46861>* has
>> > details including links to the logs captured from the tempest jobs.
>>
>> http://fungi.yuggoth.org/tmp/logs2.tar seems to be the most useful.
>>
>> > Per the openstack infra folks, they have other jobs based off CentOS7
>> that
>> > doesn't hit this issue, the only change we are adding is configuring
>> cinder
>> > with glusterfs when this happens, so right now glusterfs is in the
>> > spotlight for causing this.
>>
>> Well, there definitely is something wrong on the Gluster side. After the
>> 'blackout' between 17:47 - 18:17, this is logged in
>> logs/screen-c-vol.txt.gz:
>>
>>   mount -t glusterfs 127.0.0.1:/vol1
>> /opt/stack/data/cinder/mnt/f4c6ad7e3bba4ad1195b3b538efab64a" returned: 32
>>
>> 32 as exit code from "mount" means "mount failure".
>>
>
> Yeah, but thats the time when things went bad in the system leading to OOM
> So can't we safely assume that system was in a bad state, hence mount
> failure...
>
> We actually need to look before 17:47 and at/around 17:47 to figure what
> was running that caused the
> blackout to happen.
>
> I do see some errors I see in the brick logs and warnings in the mount logs
> but I Don't think they are anyway related to or cause of creatign OOM.
>
>
>
>>
>> The mountpoint would log details in
>>
>> logs/glusterfs/opt-stack-data-cinder-mnt-f4c6ad7e3bba4ad1195b3b538efab64a.log.
>> At 18:00 in that file, there is this message:
>>
>>   0-vol1-client-0: server 104.130.30.179:49152 has not responded in the
>> last 42 seconds, disconnecting.
>>
>> Its a one-brick volume, so this error is pretty fatal. This error does
>> not cause an unmount, but the mountpoint will return errors. Subsequent
>> mount tries will fail, because the mountpoint is not unmounted yet.
>> Tests to see if the target directory can be used to mount a volume, will
>> fail, causing the mount process to return error 32.
>>
>
> Thats in the middle of the blackout, so disconnections are ok, i feel.
>
>
>
>>
>> A little later, the (still) existing fuse mount client manages to
>> connect to the brick again:
>>
>>   0-vol1-client-0: changing port to 49152 (from 0)
>>
>> The log of the brick does not contain anything useful for this time :-/
>>
>> > I am looking thru the logs trying to co-relate syslog, dstat, tempest
>> info
>> > to figure the state of the system and what was happening at and before
>> the
>> > OOM to get any clues, but wanted to start this thread in gluster-devel
>> to
>> > see if others can pitch in with their ideas to accelerate the debug and
>> > help root cause.
>>
>> I have not yet found any pointer to what could have taken place between
>> 17:47-18:17. It seems that network I/O stopped (ping timeout in FUSE).
>>
>
> During that tiem from the tempest logs i could figure that among other
> tests
> VolumesBackupV1Test was running, which reads from Cinder- Gluster backend
> and write to swift service, which will add good IO traffic.
>
>
>> One of the few logs that I came across which contains some messages
>> during this 'blackout', is logs/screen-n-cond.txt.gz. But I assume these
>> messages are triggered by the underlying network issues.
>>
>>
>> You mention that there was a OOM? Which logs contain some details about
>> that? Also, could you explain what test (workload) was running in the
>> VM(s?) on the FUSE mountpoint?
>>
>
> Look at logs/syslog.txt (thats /var/log/messages), grep for oom-killer
> so at 18:17 when it happend it killed nova-compute and mysqld as they were
> the
> most memory hoggers. But in talking with openstack infra folks, they
> indicated
> that they have similar CI jobs where mysqld doesn't get killed, so adding
> gluster to
> the mix is definitely the reason for reaching OOM state.
>
> There is really no application running workload here. Most of the IO is
> generated as part
> of oepnstack orchestration tests, liek backup (which reads from cinder and
> writes to swift), clone
> (create a snap and then copy it), create/delete cinder volumes (which are
> files in glusterfs) etc
>
> From what i understood in talking with the openstack infra folks they
> referred me to
> the dstat logs sayign the kernel buffer cache is going as low as double
> digit M and system is in
> 70-90 % cpu wait time, which to them indicates that the IO of the system
> has gone bonkers
>
> Below is the exact thing from the chat log (for the 1st re-create / logs) :
>
> Feb 20 21:46:28 <sdague>    deepakcs: you are at 70% wait time at the end
> of that
>
> Feb 20 21:46:37 <sdague>    so your io system is just gone bonkers
>
> Feb 20 21:47:14 <fungi>    sdague: that would explain why the console
> login prompt and ssh daemon both stopped working, and the df loop in had
> going in my second ssh session hung around the same time
> Feb 20 21:47:26 <sdague>    yeh, dstat even says it's skipping ticks there
> Feb 20 21:47:29 <sdague>    for that reason
>
> Feb 20 21:47:48 <fungi>    likely complete i/o starvation for an extended
> period at around that timeframe
> Feb 20 21:48:05 <fungi>    that would also definitely cause jenkins to
> give up on the worker if it persisted for very long at all
>
> Feb 20 21:48:09 <sdague>    yeh, cached memory is down to double digit M
>
>
> Feb 20 21:49:19 <fungi>    because that's 00% of what it entails, at least
> for the hard problems
> Feb 20 21:49:21 <sdague>    deepakcs: so, honestly, what it means to me is
> that glusterfs is may be too inefficient to function in this environment
> Feb 20 21:49:25 <fungi>    er, 99%
> Feb 20 21:49:34 <sdague>    because it's kind of a constrained environment
>
> Feb 20 21:51:18 <deepakcs>    sdague, skipping ticks means ?
>
> Feb 20 21:51:32 <sdague>    deepakcs: dstat runs once a second
> Feb 20 21:51:38 <sdague>    unless the system gets under preasure
> Feb 20 21:51:54 <sdague>    then it starts skipping it's runs to prevent
> things from getting worse
> Feb 20 21:52:01 <sdague>    i.e. to not be part of the problem
>
> Feb 20 21:53:24 <dansmith>    deepakcs: dstat is a well-understood thing,
> so you can probably figure out all you need to know from the docs
>
> Feb 20 21:53:41 <dansmith>    deepakcs: it's run in the jobs so we can
> identify issues like this where the system is overly taxed and unable to
> actually run the tests
> Feb 20 21:53:57 <deepakcs>    dansmith, agreed, just trying to understand
> system debug better :)
>
> Feb 20 21:54:39 <dansmith>    deepakcs: http://linux.die.net/man/1/dstat
> Feb 20 21:54:54 <fungi>    deepakcs: i have another manual run started
> just now, with the devstack-gate timeout cranked up to 5 hours just to make
> sure we capture the full job duration
>
> Feb 20 21:55:37 <deepakcs>    dansmith, thanks :)
> Feb 20 21:55:41 <fungi>    deepakcs: i'll keep you posted in here or
> follow up to the e-mail thread if you're gone
>
> Feb 20 21:57:42 <deepakcs>    sdague, dansmith what i am not clear is that
> mysqld was the one with the max mem taken, which is why oom killer killed
> it, so howz gluster connected to creating oom here ?
> Feb 20 21:58:19 <dansmith>    deepakcs: buffer cache can create huge
> amounts of memory pressure
> Feb 20 21:58:34 <dansmith>    deepakcs: the largest memory user gets the
> boot just because it's the most likely to reduce the pressure
>
> Feb 20 21:58:48 <dansmith>    deepakcs: there is a good data point here
> that mysql doesn't get killed in the other jobs
>
> Feb 20 21:59:20 <deepakcs>    dansmith, ok, i looked up the glusterfsd
> (the process doing the IO to the disk) and its max 1.7G ram
> Feb 20 21:59:21 <dansmith>    deepakcs: nor do I know what's going on in
> your job
> Feb 20 21:59:31 <dansmith>    deepakcs: that's user memory
> Feb 20 21:59:49 <deepakcs>    dansmith, yes and glusterfs is a user space
> FS entirely
> Feb 20 22:00:03 <dansmith>    deepakcs: is it writing with O_DIRECT?
>
> Feb 20 22:00:38 <deepakcs>    dansmith, the logs doesn't have glusterfs
> logs, so cant tell, but we don't configure it w/ o_direct by default, so it
> shouldn't
> Feb 20 22:00:55 <dansmith>    deepakcs: so if not, every write it makes
> generates buffer cache load in the kernel
>
> Feb 20 22:01:24 <dansmith>    deepakcs: regardless, this is really not the
> right forum to discuss this level of system debug I think
> Feb 20 22:01:35 <deepakcs>    dansmith, ok
> Feb 20 22:01:59 <dansmith>    deepakcs: the stats runs from your job and
> other non-failing jobs should be helpful, and it's clear that something in
> your job is eating all the resources
>
>
>
>
>>
>> In logs/screen-dstat.txt.gz there seems to be a process called
>> "cinder-backup" that is running around that time. From the details of
>> dstat, I guess that this process causes the reading of files from the
>> Gluster Volume. When the system becomes responsive again, dstat has 11G
>> as "dsk/total read", it jumped from ~400M in 30 minutes. If this amount
>> of data is not flushed but kept in memory, I don't doubt that an OOM
>> could be triggered.
>>
>
> You do or don't doubt ? - thats confusing :)
>
> Yes, I am also doubting that IO traffic generated by doing backup tests is
> the only scenario
> where most IO is generated (from Cinder-GlusterFS perspective). I looked
> at all the tests
> that openstack Cinder has, and this was the only intensive test.
>
> Given that Backup was running before the time system went OOM, it seems
> liek a good
> one to doubt upon.
>
> Plan is to disable Backup tests and see if we still hit the OOM issue.
>
>
>
>>
>> Do you know where this backup is written to? If we can setup a
>> reproducer environment that emulates the workload, things would become
>> much easier for debugging :)
>>
>
> In the current setup, backup source is GlusterFS and target is swift
> (configured to
> store its data in /opt/stack/swift/ mounted on 2nd disk)
>
>
>>
>> One of the 1st things I would try, is to configure the kernel to flush
>> dirty memory more aggressively ('sysctl' values for 'vm.*' options). An
>> other thing that may help to identify the memory usage, is capturing the
>> output of 'slabtop --once' in a 1-minute interval.
>>
>
> Unfortunately, openstack infra doesn't provide that kind of job control.
> The idea is to have a uniform
> env for all storage backends and just change the specific backend that one
> wants to  test, so that the runs are
> comparable agaisnt a common base. Atleast that my understanding.
>
> But yes there are knobs given to tweak/tune the kind of tests you wanna
> run, which is what I plan
> to exploit and turn off Backup tests and see if it helps keep system in
> sane state.
>
> Thanks Neils for your time and analysis, it really helped.
>
> I m hoping now with the backup tests disabled, we won't hit, will update
> here once openstack infra folks co-operate
> to run the CI job manually for us!
>
>
> thanx,
> deepak
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20150226/66927df3/attachment-0001.html>