[Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached

Kiran Patil kiran at fractalio.com
Tue Oct 28 08:38:32 UTC 2014


The content of the file zp2-brick2.log is at http://ur1.ca/iku0l (
http://fpaste.org/145714/44849041/ )

I can't open the file /zp2/brick2/.glusterfs/health_check; the read hangs
because the disk is no longer present.

Let me know the filename pattern so that I can find it.
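In case it helps, here is a minimal sketch of the check I can try instead,
assuming GNU coreutils timeout and stat are available (and assuming stat
itself does not also hang on the suspended pool):

  # Read the last-modified time of the health_check file, but give up after
  # 5 seconds so the hung brick filesystem does not block the shell forever.
  timeout 5 stat -c '%y' /zp2/brick2/.glusterfs/health_check \
      || echo 'stat timed out or failed; the brick filesystem looks hung'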

On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos <ndevos at redhat.com> wrote:

> On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
> > I applied the patches, compiled and installed the gluster.
> >
> > # glusterfs --version
> > glusterfs 3.7dev built on Oct 28 2014 12:03:10
> > Repository revision: git://git.gluster.com/glusterfs.git
> > Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
> > GlusterFS comes with ABSOLUTELY NO WARRANTY.
> > It is licensed to you under your choice of the GNU Lesser
> > General Public License, version 3 or any later version (LGPLv3
> > or later), or the GNU General Public License, version 2 (GPLv2),
> > in all cases as published by the Free Software Foundation.
> >
> > # git log
> > commit 990ce16151c3af17e4cdaa94608b737940b60e4d
> > Author: Lalatendu Mohanty <lmohanty at redhat.com>
> > Date:   Tue Jul 1 07:52:27 2014 -0400
> >
> >     Posix: Brick failure detection fix for ext4 filesystem
> > ...
> > ...
> >
> > I see the messages below:
>
> Many thanks Kiran!
>
> Do you have the messages from the brick that uses the zp2 mountpoint?
>
> There should also be a file containing a timestamp of when the last
> check completed successfully. If the brick is still running, this
> timestamp should get updated every storage.health-check-interval
> seconds:
>     /zp2/brick2/.glusterfs/health_check
>
> Niels
>
> >
> > File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :
> >
> > The message "I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
> > [2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
> > [2014-10-28 06:00:09.226712] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
> > [2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
> > [2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
> > [2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
> > [2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management: readv on
> >
> > .....
> > .....
> >
> > [2014-10-28 06:19:15.142867] I [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
> > The message "I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 12 times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
> > [2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
> >
> >
> > dmesg output:
> >
> > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> > encountered an uncorrectable I/O failure and has been suspended.
> >
> > SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> > encountered an uncorrectable I/O failure and has been suspended.
> >
> > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> > encountered an uncorrectable I/O failure and has been suspended.
> >
> > The brick is still online.
> >
> > # gluster volume status
> > Status of volume: repvol
> > Gluster process                                 Port    Online  Pid
> > ------------------------------------------------------------------------------
> > Brick 192.168.1.246:/zp1/brick1                 49152   Y       4067
> > Brick 192.168.1.246:/zp2/brick2                 49153   Y       4078
> > NFS Server on localhost                         2049    Y       4092
> > Self-heal Daemon on localhost                   N/A     Y       4097
> >
> > Task Status of Volume repvol
> > ------------------------------------------------------------------------------
> > There are no active volume tasks
> >
> > # gluster volume info
> >
> > Volume Name: repvol
> > Type: Replicate
> > Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
> > Status: Started
> > Number of Bricks: 1 x 2 = 2
> > Transport-type: tcp
> > Bricks:
> > Brick1: 192.168.1.246:/zp1/brick1
> > Brick2: 192.168.1.246:/zp2/brick2
> > Options Reconfigured:
> > storage.health-check-interval: 30
> >
> > Let me know if you need further information.
> >
> > Thanks,
> > Kiran.
> >
> > On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil <kiran at fractalio.com> wrote:
> >
> > > I changed "git fetch git://review.gluster.org/glusterfs" to
> > > "git fetch http://review.gluster.org/glusterfs" and now it works.
> > >
> > > Thanks,
> > > Kiran.
> > >
> > > On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil <kiran at fractalio.com> wrote:
> > >
> > >> Hi Niels,
> > >>
> > >> I am getting a "fatal: Couldn't find remote ref refs/changes/13/8213/9"
> > >> error.
> > >>
> > >> Steps to reproduce the issue.
> > >>
> > >> 1) # git clone git://review.gluster.org/glusterfs
> > >> Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
> > >> remote: Counting objects: 84921, done.
> > >> remote: Compressing objects: 100% (48307/48307), done.
> > >> remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
> > >> Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
> > >> Resolving deltas: 100% (57264/57264), done.
> > >>
> > >> 2) # cd glusterfs
> > >>     # git branch
> > >>     * master
> > >>
> > >> 3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
> > >> fatal: Couldn't find remote ref refs/changes/13/8213/9
> > >>
> > >> Note: I also tried the above steps on the git repo
> > >> https://github.com/gluster/glusterfs and the result is the same as above.
> > >>
> > >> Please let me know if I missed any steps.
> > >>
> > >> Thanks,
> > >> Kiran.
> > >>
> > >> On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos <ndevos at redhat.com> wrote:
> > >>
> > >>> On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
> > >>> > Hi,
> > >>> >
> > >>> > I created a replicated volume with two bricks on the same node
> > >>> > and copied some data to it.
> > >>> >
> > >>> > Then I removed the disk that hosts one of the bricks of the
> > >>> > volume.
> > >>> >
> > >>> > storage.health-check-interval is set to 30 seconds.
> > >>> >
> > >>> > I can see that the disk is unavailable using the zpool command of
> > >>> > ZFS on Linux, but gluster volume status still shows the brick
> > >>> > process as running, even though it should have been shut down by
> > >>> > this time.
> > >>> >
> > >>> > Is this a bug in 3.6, since it is documented as a feature
> > >>> > (https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md),
> > >>> > or am I making a mistake here?
> > >>>
> > >>> The initial detection of brick failures did not work for all
> > >>> filesystems; it may not work for ZFS either. A fix has been posted,
> > >>> but it has not been merged into the master branch yet. Once the
> > >>> change has been merged, it can be backported to 3.6 and 3.5.
> > >>>
> > >>> You may want to test with the patch applied, and add your
> > >>> "+1 Verified" to the change in case it makes it functional for you:
> > >>> - http://review.gluster.org/8213
> > >>>
> > >>> Cheers,
> > >>> Niels
> > >>>
> > >>> >
> > >>> > [root at fractal-c92e gluster-3.6]# gluster volume status
> > >>> > Status of volume: repvol
> > >>> > Gluster process                                 Port    Online  Pid
> > >>> > ------------------------------------------------------------------------------
> > >>> > Brick 192.168.1.246:/zp1/brick1                 49154   Y       17671
> > >>> > Brick 192.168.1.246:/zp2/brick2                 49155   Y       17682
> > >>> > NFS Server on localhost                         2049    Y       17696
> > >>> > Self-heal Daemon on localhost                   N/A     Y       17701
> > >>> >
> > >>> > Task Status of Volume repvol
> > >>> > ------------------------------------------------------------------------------
> > >>> > There are no active volume tasks
> > >>> >
> > >>> >
> > >>> > [root at fractal-c92e gluster-3.6]# gluster volume info
> > >>> >
> > >>> > Volume Name: repvol
> > >>> > Type: Replicate
> > >>> > Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
> > >>> > Status: Started
> > >>> > Number of Bricks: 1 x 2 = 2
> > >>> > Transport-type: tcp
> > >>> > Bricks:
> > >>> > Brick1: 192.168.1.246:/zp1/brick1
> > >>> > Brick2: 192.168.1.246:/zp2/brick2
> > >>> > Options Reconfigured:
> > >>> > storage.health-check-interval: 30
> > >>> >
> > >>> > [root at fractal-c92e gluster-3.6]# zpool status zp2
> > >>> >   pool: zp2
> > >>> >  state: UNAVAIL
> > >>> > status: One or more devices are faulted in response to IO failures.
> > >>> > action: Make sure the affected devices are connected, then run 'zpool clear'.
> > >>> >    see: http://zfsonlinux.org/msg/ZFS-8000-HC
> > >>> >   scan: none requested
> > >>> > config:
> > >>> >
> > >>> > NAME        STATE     READ WRITE CKSUM
> > >>> > zp2         UNAVAIL      0     0     0  insufficient replicas
> > >>> >   sdb       UNAVAIL      0     0     0
> > >>> >
> > >>> > errors: 2 data errors, use '-v' for a list
> > >>> >
> > >>> >
> > >>> > Thanks,
> > >>> > Kiran.
> > >>>
> > >>> > _______________________________________________
> > >>> > Gluster-devel mailing list
> > >>> > Gluster-devel at gluster.org
> > >>> > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> > >>>
> > >>>
> > >>
> > >
>
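For completeness, this is roughly how the health check was enabled on the
volume and how I would expect to verify it on the healthy brick (a sketch
only; I am assuming the posix health checker rewrites the timestamp in the
health_check file on every pass):

  # Set the brick health-check interval on the volume to 30 seconds.
  gluster volume set repvol storage.health-check-interval 30

  # On a healthy brick, the timestamp stored in the health_check file
  # should then advance roughly every 30 seconds.
  watch -n 30 cat /zp1/brick1/.glusterfs/health_check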