[Gluster-users] CentOS Freeze with GlusterFS Error

chamara samarakoon chthsa123 at gmail.com
Fri Jan 23 05:32:28 UTC 2015


Hi Deepak,

Please find the details below.

* cat multipath.conf

multipath {
    uid   162
    gid   162
    wwid  "360050763008084b07800000000000008"
    mode  0777
    alias nova
}

* ls -l /dev/mapper/

nova -> ../dm-0

* df -h

/dev/mapper/nova      120T  4.1T  116T   4% /gluster1

* ls /gluster1/nova/

brick0  brick1  brick2  brick3



* gluster volume status nova

Status of volume: nova
Gluster process                             Port   Online  Pid
------------------------------------------------------------------------------
Brick lkcontroller:/gluster1/nova/brick0    49152  Y       3708
Brick lkcontroller:/gluster1/nova/brick1    49153  Y       3707
Brick lkcontroller:/gluster1/nova/brick2    49154  Y       3716
Brick lkcontroller:/gluster1/nova/brick3    49155  Y       3723
NFS Server on localhost                     2049   Y       7444
Self-heal Daemon on localhost               N/A    Y       7445
NFS Server on lkcompute03                   2049   Y       18288
Self-heal Daemon on lkcompute03             N/A    Y       18295
NFS Server on lkcompute01                   2049   Y       13722
Self-heal Daemon on lkcompute01             N/A    Y       13728
NFS Server on lkcompute02                   2049   Y       28264
Self-heal Daemon on lkcompute02             N/A    Y       28274


Thank You,
Chamara.


On Fri, Jan 23, 2015 at 10:50 AM, Deepak Shetty <dpkshetty at gmail.com> wrote:

> My gut still says it could be related to the multipath.
> I never got an answer on whether the bricks are using the multipathed
> devices via the mpathXX device, or whether you are directly using the
> dm-X devices.
>
> If dm-X, are you ensuring that you are NOT using 2 dm-X devices that
> map to the same LUN on the backend SAN?
> My hunch is that if you are doing that, putting XFS on the 2 dm-X
> devices and using them as separate bricks, anything can happen.
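>
> A quick way to check for that (a sketch, assuming the device-mapper
> multipath tools are installed):
>
>     # list every map with its WWID and the sd* paths behind it; two
>     # dm-X devices showing the same WWID would mean the same LUN is
>     # exposed twice
>     multipath -ll
>
>     # cross-check which dm-X each mapper alias resolves to
>     ls -l /dev/mapper/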
>
> So try removing multipath, or even before that, stop the gluster volumes
> (which should stop the glusterfsd processes, hence no I/O on the XFS
> bricks) and see if this re-creates.
> Since we are seeing glusterfsd every time the kernel bug shows up, it may
> not be a coincidence but rather the result of an invalid multipath setup.
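>
> Something along these lines (a sketch; the volume name is assumed to be
> nova, substitute yours):
>
>     # stop the volume so the glusterfsd brick processes exit and no
>     # more I/O hits the XFS bricks
>     gluster volume stop nova
>
>     # confirm no brick processes are left running
>     ps -C glusterfsd
>
>     # after testing, bring the volume back
>     gluster volume start nova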
>
> thanx,
> deepak
>
>
> On Thu, Jan 22, 2015 at 12:57 AM, Niels de Vos <ndevos at redhat.com> wrote:
>
>> On Wed, Jan 21, 2015 at 10:11:20PM +0530, chamara samarakoon wrote:
>> > HI All,
>> >
>> >
>> > The same error was encountered again before trying anything else, so I
>> > took a screenshot with more details of the incident.
>>
>> This shows an XFS error. So it can be a problem with XFS, or something
>> that contributes to it in the XFS path. I would guess it is caused by an
>> issue on the disk(s), because corruption is mentioned. However, it
>> could also be bad RAM, or another hardware component that is used to
>> access data from the disks. I suggest you take two approaches:
>>
>> 1. run hardware tests - if the error is detected, contact your HW vendor
>> 2. open a support case with the vendor of the OS and check for updates
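>>
>> For (1), the usual suspects can be checked with standard tools (a rough
>> sketch; smartmontools and memtester availability is an assumption, and
>> device names need adjusting):
>>
>>     # SMART health of each disk behind the LUN
>>     smartctl -a /dev/sda
>>
>>     # userspace RAM test (memtest86+ from the boot menu is more thorough)
>>     memtester 1024M 5
>>
>>     # read-only XFS consistency check (device is a placeholder; run
>>     # only with the filesystem unmounted)
>>     xfs_repair -n /dev/<brick-device>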
>>
>> Gluster can stress filesystems in ways that are not very common, and
>> there have been issues found in XFS due to this. Your OS support vendor
>> should be able to tell you if the latest and related XFS fixes are
>> included in your kernel.
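>>
>> For the support case it also helps to include the exact kernel and
>> xfsprogs builds, e.g.:
>>
>>     uname -r
>>     rpm -q kernel xfsprogs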
>>
>> HTH,
>> Niels
>>
>> >
>> >
>> > [screenshot attachment scrubbed]
>> >
>> > Thank You,
>> > Chamara
>> >
>> >
>> >
>> > On Tue, Jan 20, 2015 at 5:33 PM, chamara samarakoon
>> > <chthsa123 at gmail.com> wrote:
>> >
>> > > HI All,
>> > >
>> > > Thank you for the valuable feedback. I will test the suggested
>> > > solutions and update the thread.
>> > >
>> > > Regards,
>> > > Chamara
>> > >
>> > > On Tue, Jan 20, 2015 at 4:17 PM, Deepak Shetty <dpkshetty at gmail.com>
>> > > wrote:
>> > >
>> > >> In addition, I would also like to add that I do suspect (just my
>> > >> hunch) that it could be related to multipath.
>> > >> If you can try without multipath and it doesn't re-create, I think
>> > >> that would be a good data point for the kernel/OS vendor to debug
>> > >> further.
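>> > >>
>> > >> One way to take multipath out of the picture without unzoning the
>> > >> LUN (a sketch; the wwid value is a placeholder for your LUN's id):
>> > >>
>> > >>     # in /etc/multipath.conf
>> > >>     blacklist {
>> > >>         wwid "<your-LUN-wwid>"
>> > >>     }
>> > >>
>> > >>     # then flush the existing map and restart multipathd
>> > >>     multipath -f <alias>
>> > >>     service multipathd restart
>> > >>
>> > >> ...and mount the underlying single-path /dev/sdX for the bricks
>> > >> instead.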
>> > >>
>> > >> my 2 cents again :)
>> > >>
>> > >> thanx,
>> > >> deepak
>> > >>
>> > >>
>> > >> On Tue, Jan 20, 2015 at 2:32 PM, Niels de Vos <ndevos at redhat.com>
>> > >> wrote:
>> > >>
>> > >>> On Tue, Jan 20, 2015 at 11:55:40AM +0530, Deepak Shetty wrote:
>> > >>> > What does "Controller" mean: the OpenStack controller node, or
>> > >>> > something else (like an HBA)?
>> > >>> > Your picture says it is a SAN, but the text says multipath mount.
>> > >>> > SAN would mean block devices, so I am assuming you have redundant
>> > >>> > block devices on the compute host, are mkfs'ing them, and then
>> > >>> > creating bricks for Gluster?
>> > >>> >
>> > >>> >
>> > >>> > The stack trace looks like you hit a kernel bug and glusterfsd
>> > >>> > happens to be running on the CPU at the time... my 2 cents
>> > >>>
>> > >>> That definitely is a kernel issue. You should contact your OS
>> > >>> support vendor about this.
>> > >>>
>> > >>> The bits you copy/pasted are not sufficient to see what caused it.
>> > >>> The glusterfsd process is just a casualty of the kernel issue, and
>> > >>> it is not likely this can be fixed in Gluster. I suspect you need a
>> > >>> kernel patch/update.
>> > >>>
>> > >>> Niels
>> > >>>
>> > >>> >
>> > >>> > thanx,
>> > >>> > deepak
>> > >>> >
>> > >>> > On Tue, Jan 20, 2015 at 11:29 AM, chamara samarakoon
>> > >>> > <chthsa123 at gmail.com> wrote:
>> > >>> >
>> > >>> > > Hi All,
>> > >>> > >
>> > >>> > >
>> > >>> > > We have set up an OpenStack cloud as below, and
>> > >>> > > "/var/lib/nova/instances" is a Gluster volume.
>> > >>> > >
>> > >>> > > CentOS - 6.5
>> > >>> > > Kernel -  2.6.32-431.29.2.el6.x86_64
>> > >>> > > GlusterFS - glusterfs 3.5.2 built on Jul 31 2014 18:47:54
>> > >>> > > OpenStack - RDO using Packstack
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > > [setup diagram attachment scrubbed]
>> > >>> > >
>> > >>> > >
>> > >>> > > Recently the controller node froze with the following error
>> > >>> > > (which required a hard reboot). As a result, the Gluster volumes
>> > >>> > > on the compute nodes cannot reach the controller, and all the
>> > >>> > > instances on the compute nodes go read-only, which forces us to
>> > >>> > > restart all instances.
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > > BUG: scheduling while atomic: glusterfsd/42725/0xffffffff
>> > >>> > > BUG: unable to handle kernel paging request at 0000000038a60d0a8
>> > >>> > > IP: [<fffffffff81058e5d>] task_rq_lock+0x4d/0xa0
>> > >>> > > PGD 1065525067 PUD 0
>> > >>> > > Oops: 0000 [#1] SMP
>> > >>> > > last sysfs file:
>> > >>> > > /sys/devices/pci0000:80/0000:80:02.0/0000:86:00.0/host2/port-2:0/end_device-2:0/target2:0:0/2:0:0:1/state
>> > >>> > > CPU 0
>> > >>> > > Modules linked in: xt_conntrack iptable_filter ip_tables
>> > >>> > > ipt_REDIRECT fuse ipv6 openvswitch vxlan iptable_mangle
>> > >>> > >
>> > >>> > > Please advise on the above incident; feedback on the OpenStack +
>> > >>> > > GlusterFS setup is also appreciated.
>> > >>> > >
>> > >>> > > Thank You,
>> > >>> > > Chamara
>> > >>> > >
>> > >>> > >
>> > >>> > > _______________________________________________
>> > >>> > > Gluster-users mailing list
>> > >>> > > Gluster-users at gluster.org
>> > >>> > > http://www.gluster.org/mailman/listinfo/gluster-users
>> > >>> > >
>> > >>>
>> > >>>
>> > >>>
>> > >>> > _______________________________________________
>> > >>> > Gluster-users mailing list
>> > >>> > Gluster-users at gluster.org
>> > >>> > http://www.gluster.org/mailman/listinfo/gluster-users
>> > >>>
>> > >>>
>> > >>
>> > >
>> > >
>> > > --
>> > > chthsa
>> > >
>> >
>> >
>> >
>> > --
>> > chthsa
>>
>>
>>
>


-- 
chthsa