[Gluster-devel] [glusterfs-3.6.0beta3-0.11.gitd01b00a] gluster volume status is running even though the Disk is detached

Fri Oct 31 10:29:21 UTC 2014

I am not seeing the message below in any log file under the
/var/log/glusterfs directory or its subdirectories:

health-check failed, going down
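
The search described above can be scripted so that others can repeat it.
This is only a minimal sketch: the helper name find_message is
hypothetical (it is not part of GlusterFS), and it assumes the logs are
plain text files readable by the current user.

```python
import os


def find_message(root, needle):
    """Return (path, line_number) pairs for every line containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going over binary debris
                # such as compressed, rotated logs.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((path, number))
            except OSError:
                # Unreadable entries (permissions, sockets) are skipped.
                continue
    return hits
```

Calling find_message("/var/log/glusterfs", "health-check failed, going down")
returns an empty list when the message was never logged, which matches
what I see here.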

On Fri, Oct 31, 2014 at 3:16 PM, Kiran Patil <kiran at fractalio.com> wrote:

> I set the ZFS pool failmode to continue, which should block only writes,
> not reads, as explained below:
>
> failmode=wait | continue | panic
>
>            Controls the system behavior in the event of catastrophic pool
>            failure. This condition is typically a result of a loss of
>            connectivity to the underlying storage device(s) or a failure of
>            all devices within the pool. The behavior of such an event is
>            determined as follows:
>
>            wait        Blocks all I/O access until the device connectivity
>                        is recovered and the errors are cleared. This is the
>                        default behavior.
>
>            continue    Returns EIO to any new write I/O requests but allows
>                        reads to any of the remaining healthy devices. Any
>                        write requests that have yet to be committed to disk
>                        would be blocked.
>
>            panic       Prints out a message to the console and generates a
>                        system crash dump.
>
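
For reference, the property quoted above is set with
"zpool set failmode=<mode> <pool>". A small sketch of wrapping that call
follows. The function names are hypothetical, the three accepted values
are the ones from the man page text above, and actually running
set_failmode needs root and an existing pool.

```python
import subprocess

# The three values listed in the zpool man page excerpt above.
VALID_FAILMODES = ("wait", "continue", "panic")


def failmode_command(pool, mode):
    """Build the argv list for 'zpool set failmode=<mode> <pool>'."""
    if mode not in VALID_FAILMODES:
        raise ValueError(f"unknown failmode {mode!r}")
    return ["zpool", "set", f"failmode={mode}", pool]


def set_failmode(pool, mode):
    """Apply the property; raises CalledProcessError if zpool fails."""
    subprocess.run(failmode_command(pool, mode), check=True)
```

In this thread the call would be set_failmode("zp2", "continue").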
>
> Now, I rebuilt glusterfs master and tested whether a failed drive results
> in a failed brick and in turn kills the brick process, but the brick is
> not going offline.
>
> # gluster volume status
> Status of volume: repvol
> Gluster process Port Online Pid
>
> ------------------------------------------------------------------------------
> Brick 192.168.1.246:/zp1/brick1 49152 Y 2400
> Brick 192.168.1.246:/zp2/brick2 49153 Y 2407
> NFS Server on localhost 2049 Y 30488
> Self-heal Daemon on localhost N/A Y 30495
>
> Task Status of Volume repvol
>
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
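
When repeating this test, the Online column can be checked from a script
instead of by eye. The sketch below is an assumption-laden helper, not a
Gluster API: parse_brick_status is a hypothetical name, and it assumes the
"Brick <host:path> <port> <online> <pid>" layout shown in the output
above. Other releases may print different columns.

```python
def parse_brick_status(text):
    """Map each brick to its port, online flag and pid."""
    bricks = {}
    for raw in text.splitlines():
        # Drop mail quote markers so pasted output parses too.
        fields = raw.lstrip("> ").split()
        if len(fields) == 5 and fields[0] == "Brick":
            _, brick, port, online, pid = fields
            bricks[brick] = {"port": port, "online": online == "Y", "pid": pid}
    return bricks
```

Feeding it the output above reports both bricks as online, which is
exactly the problem: the brick on zp2 should not be.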
>
> The /var/log/gluster/mnt.log output:
>
> [2014-10-31 09:18:15.934700] W [rpc-clnt-ping.c:154:rpc_clnt_ping_cbk]
> 0-repvol-client-1: socket disconnected
> [2014-10-31 09:18:15.934725] I [client.c:2215:client_rpc_notify]
> 0-repvol-client-1: disconnected from repvol-client-1. Client process will
> keep trying to connect to glusterd until brick's port is available
> [2014-10-31 09:18:15.935238] I [rpc-clnt.c:1765:rpc_clnt_reconfig]
> 0-repvol-client-1: changing port to 49153 (from 0)
>
> Now if I copy a file to /mnt, it is copied without any hang and the brick
> still shows online.
>
> Thanks,
> Kiran.
>
> On Tue, Oct 28, 2014 at 3:44 PM, Niels de Vos <ndevos at redhat.com> wrote:
>
>> On Tue, Oct 28, 2014 at 02:08:32PM +0530, Kiran Patil wrote:
>> > The content of file zp2-brick2.log is at http://ur1.ca/iku0l (
>> > http://fpaste.org/145714/44849041/ )
>> >
>> > I can't open the file /zp2/brick2/.glusterfs/health_check since it hangs
>> > due to no disk present.
>> >
>> > Let me know the filename pattern, so that I can find it.
>>
>> Hmm, if there is a hang while reading from the disk, it will not get
>> detected in the current solution. We implemented failure detection on
>> top of the detection that is done by the filesystem. Suspending a
>> filesystem with fsfreeze or similar should probably not be seen as a
>> failure.
>>
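
To illustrate the point about hangs: a check that only looks at the error
returned by the filesystem never completes when the filesystem blocks. One
way to tell a hang apart from an error is to run the probe in a separate
thread and stop waiting after a timeout. This is only a sketch of that
idea, not how the GlusterFS posix translator works. probe_with_timeout is
a hypothetical name, and whether "hung" should count as a failure is left
to the caller, given the fsfreeze remark above.

```python
import os
import threading


def probe_with_timeout(probe, timeout):
    """Run probe() in a daemon thread; return 'ok', 'error' or 'hung'."""
    result = {}

    def runner():
        try:
            probe()
            result["status"] = "ok"
        except Exception:
            # The filesystem answered, but with an error such as EIO.
            result["status"] = "error"

    # A daemon thread is used because a thread blocked in I/O cannot be
    # cancelled; it is simply abandoned when the timeout expires.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return "hung"
    return result["status"]
```

For example, probe_with_timeout(lambda: os.stat("/zp2/brick2"), 10.0)
would stat the brick directory from this thread and give up after ten
seconds.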
>> In your case, it seems that the filesystem suspended itself when the disk
>> went away. I have no idea whether ZFS can be configured not to suspend
>> but to return an error to the reading/writing application instead. Please
>> check for such an option.
>>
>> If you find such an option, please update the wiki page and recommend
>> enabling it:
>> - http://gluster.org/community/documentation/index.php/GlusterOnZFS
>>
>>
>> Thanks,
>> Niels
>>
>>
>> >
>> > On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos <ndevos at redhat.com> wrote:
>> >
>> > > On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
>> > > > I applied the patches, then compiled and installed gluster.
>> > > >
>> > > > # glusterfs --version
>> > > > glusterfs 3.7dev built on Oct 28 2014 12:03:10
>> > > > Repository revision: git://git.gluster.com/glusterfs.git
>> > > > Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
>> > > > GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> > > > It is licensed to you under your choice of the GNU Lesser
>> > > > General Public License, version 3 or any later version (LGPLv3
>> > > > or later), or the GNU General Public License, version 2 (GPLv2),
>> > > > in all cases as published by the Free Software Foundation.
>> > > >
>> > > > # git log
>> > > > commit 990ce16151c3af17e4cdaa94608b737940b60e4d
>> > > > Author: Lalatendu Mohanty <lmohanty at redhat.com>
>> > > > Date:   Tue Jul 1 07:52:27 2014 -0400
>> > > >
>> > > >     Posix: Brick failure detection fix for ext4 filesystem
>> > > > ...
>> > > > ...
>> > > >
>> > > > I see the messages below:
>> > >
>> > > Many thanks Kiran!
>> > >
>> > > Do you have the messages from the brick that uses the zp2 mountpoint?
>> > >
>> > > There also should be a file with a timestamp when the last check was
>> > > done successfully. If the brick is still running, this timestamp should
>> > > get updated every storage.health-check-interval seconds:
>> > >     /zp2/brick2/.glusterfs/health_check
>> > >
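
The freshness test described above can be sketched as follows. Several
things here are assumptions: the helper name is hypothetical, the file's
modification time is used instead of its content (the content format is
not given in this thread), and the slack factor of 2 is an arbitrary
allowance for scheduling delay. Note that os.stat itself can block on a
suspended pool, which is what happened when I tried to open this file.

```python
import os
import time


def health_check_is_stale(path, interval, now=None, slack=2.0):
    """Return True if path was last modified more than interval*slack ago."""
    if now is None:
        now = time.time()
    age = now - os.stat(path).st_mtime
    return age > interval * slack
```

With storage.health-check-interval set to 30, the call would be
health_check_is_stale("/zp2/brick2/.glusterfs/health_check", 30).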
>> > > Niels
>> > >
>> > > >
>> > > > File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log :
>> > > >
>> > > > The message "I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 39 times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
>> > > > [2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
>> > > > [2014-10-28 06:00:09.226712] I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
>> > > > [2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
>> > > > [2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
>> > > > [2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management: readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid argument)
>> > > > [2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management: readv on
>> > > >
>> > > > .....
>> > > > .....
>> > > >
>> > > > [2014-10-28 06:19:15.142867] I [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
>> > > > The message "I [MSGID: 106005] [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 12 times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
>> > > > [2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
>> > > >
>> > > >
>> > > > dmesg output:
>> > > >
>> > > > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
>> > > > encountered an uncorrectable I/O failure and has been suspended.
>> > > >
>> > > > The brick is still online.
>> > > >
>> > > > # gluster volume status
>> > > > Status of volume: repvol
>> > > > Gluster process Port Online Pid
>> > > >
>> > > > ------------------------------------------------------------------------------
>> > > > Brick 192.168.1.246:/zp1/brick1 49152 Y 4067
>> > > > Brick 192.168.1.246:/zp2/brick2 49153 Y 4078
>> > > > NFS Server on localhost 2049 Y 4092
>> > > > Self-heal Daemon on localhost N/A Y 4097
>> > > >
>> > > > Task Status of Volume repvol
>> > > >
>> > > > ------------------------------------------------------------------------------
>> > > > There are no active volume tasks
>> > > >
>> > > > # gluster volume info
>> > > >
>> > > > Volume Name: repvol
>> > > > Type: Replicate
>> > > > Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
>> > > > Status: Started
>> > > > Number of Bricks: 1 x 2 = 2
>> > > > Transport-type: tcp
>> > > > Bricks:
>> > > > Brick1: 192.168.1.246:/zp1/brick1
>> > > > Brick2: 192.168.1.246:/zp2/brick2
>> > > > Options Reconfigured:
>> > > > storage.health-check-interval: 30
>> > > >
>> > > > Let me know if you need further information.
>> > > >
>> > > > Thanks,
>> > > > Kiran.
>> > > >
>> > > > On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil <kiran at fractalio.com> wrote:
>> > > >
>> > > > > I changed  git fetch git://review.gluster.org/glusterfs  to
>> > > > > git fetch http://review.gluster.org/glusterfs  and now it works.
>> > > > >
>> > > > > Thanks,
>> > > > > Kiran.
>> > > > >
>> > > > > On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil <kiran at fractalio.com> wrote:
>> > > > >
>> > > > >> Hi Niels,
>> > > > >>
>> > > > >> I am getting "fatal: Couldn't find remote ref refs/changes/13/8213/9"
>> > > > >> error.
>> > > > >>
>> > > > >> Steps to reproduce the issue.
>> > > > >>
>> > > > >> 1) # git clone git://review.gluster.org/glusterfs
>> > > > >> Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
>> > > > >> remote: Counting objects: 84921, done.
>> > > > >> remote: Compressing objects: 100% (48307/48307), done.
>> > > > >> remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
>> > > > >> Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
>> > > > >> Resolving deltas: 100% (57264/57264), done.
>> > > > >>
>> > > > >> 2) # cd glusterfs
>> > > > >>     # git branch
>> > > > >>     * master
>> > > > >>
>> > > > >> 3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD
>> > > > >> fatal: Couldn't find remote ref refs/changes/13/8213/9
>> > > > >>
>> > > > >> Note: I also tried the above steps on git repo
>> > > > >> https://github.com/gluster/glusterfs and the result is same as above.
>> > > > >>
>> > > > >> Please let me know if I miss any steps.
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Kiran.
>> > > > >>
>> > > > >> On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos <ndevos at redhat.com> wrote:
>> > > > >>
>> > > > >>> On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
>> > > > >>> > Hi,
>> > > > >>> >
>> > > > >>> > I created a replicated volume with two bricks on the same node and
>> > > > >>> > copied some data to it.
>> > > > >>> >
>> > > > >>> > I then removed the disk that hosted one of the bricks of the volume.
>> > > > >>> >
>> > > > >>> > storage.health-check-interval is set to 30 seconds.
>> > > > >>> >
>> > > > >>> > I can see that the disk is unavailable using the zpool command of ZFS
>> > > > >>> > on Linux, but gluster volume status still shows the brick process
>> > > > >>> > running, although it should have been shut down by this time.
>> > > > >>> >
>> > > > >>> > Is this a bug in 3.6, since brick failure detection is listed as a
>> > > > >>> > feature
>> > > > >>> > ( https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md ),
>> > > > >>> > or am I making a mistake here?
>> > > > >>>
>> > > > >>> The initial detection of brick failures did not work for all
>> > > > >>> filesystems. It may not work for ZFS too. A fix has been posted, but
>> > > > >>> it has not been merged into the master branch yet. When the change
>> > > > >>> has been merged, it can get backported to 3.6 and 3.5.
>> > > > >>>
>> > > > >>> You may want to test with the patch applied, and add your "+1
>> > > > >>> Verified" to the change in case it makes it functional for you:
>> > > > >>> - http://review.gluster.org/8213
>> > > > >>>
>> > > > >>> Cheers,
>> > > > >>> Niels
>> > > > >>>
>> > > > >>> >
>> > > > >>> > [root at fractal-c92e gluster-3.6]# gluster volume status
>> > > > >>> > Status of volume: repvol
>> > > > >>> > Gluster process Port Online Pid
>> > > > >>> >
>> > > > >>> > ------------------------------------------------------------------------------
>> > > > >>> > Brick 192.168.1.246:/zp1/brick1 49154 Y 17671
>> > > > >>> > Brick 192.168.1.246:/zp2/brick2 49155 Y 17682
>> > > > >>> > NFS Server on localhost 2049 Y 17696
>> > > > >>> > Self-heal Daemon on localhost N/A Y 17701
>> > > > >>> >
>> > > > >>> > Task Status of Volume repvol
>> > > > >>> >
>> > > > >>> > ------------------------------------------------------------------------------
>> > > > >>> > There are no active volume tasks
>> > > > >>> >
>> > > > >>> >
>> > > > >>> > [root at fractal-c92e gluster-3.6]# gluster volume info
>> > > > >>> >
>> > > > >>> > Volume Name: repvol
>> > > > >>> > Type: Replicate
>> > > > >>> > Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
>> > > > >>> > Status: Started
>> > > > >>> > Number of Bricks: 1 x 2 = 2
>> > > > >>> > Transport-type: tcp
>> > > > >>> > Bricks:
>> > > > >>> > Brick1: 192.168.1.246:/zp1/brick1
>> > > > >>> > Brick2: 192.168.1.246:/zp2/brick2
>> > > > >>> > Options Reconfigured:
>> > > > >>> > storage.health-check-interval: 30
>> > > > >>> >
>> > > > >>> > [root at fractal-c92e gluster-3.6]# zpool status zp2
>> > > > >>> >   pool: zp2
>> > > > >>> >  state: UNAVAIL
>> > > > >>> > status: One or more devices are faulted in response to IO failures.
>> > > > >>> > action: Make sure the affected devices are connected, then run 'zpool clear'.
>> > > > >>> >    see: http://zfsonlinux.org/msg/ZFS-8000-HC
>> > > > >>> >   scan: none requested
>> > > > >>> > config:
>> > > > >>> >
>> > > > >>> > NAME        STATE     READ WRITE CKSUM
>> > > > >>> > zp2         UNAVAIL      0     0     0  insufficient replicas
>> > > > >>> >   sdb       UNAVAIL      0     0     0
>> > > > >>> >
>> > > > >>> > errors: 2 data errors, use '-v' for a list
>> > > > >>> >
>> > > > >>> >
>> > > > >>> > Thanks,
>> > > > >>> > Kiran.
>> > > > >>>
>> > > > >>> > _______________________________________________
>> > > > >>> > Gluster-devel mailing list
>> > > > >>> > Gluster-devel at gluster.org
>> > > > >>> > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>> > > > >>>
>> > > > >>>
>> > > > >>
>> > > > >
>> > >
>>
>
>