[Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status

Andreas Mather andreas at allaboutapps.at
Mon Oct 5 12:26:35 UTC 2015


Hi!

It's VMs based on KVM/qemu, managed by libvirtd. I figured I could check
the heal status by comparing the bricks directly: existing files were not
being replicated, but new files were (after a long delay of about 5
minutes). So I wanted to see whether existing files (VM images) would be
healed if I stopped a VM (closing any open handle on the file), which
turned out not to be the case.
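
Roughly how I compared the bricks (a sketch only: the brick paths are the
ones from the logs below, the image path is made up, and it assumes
passwordless ssh between the hosts):

    # checksum one VM image on both replicas (image path is hypothetical)
    ssh vhost3-int 'md5sum /storage/brick2/brick2/images/vm1.img'
    ssh vhost4-int 'md5sum /storage/brick1/brick1/images/vm1.img'
    # or diff whole-brick listings by name and size
    diff <(ssh vhost3-int 'cd /storage/brick2/brick2 && find . -type f -printf "%P %s\n" | sort') \
         <(ssh vhost4-int 'cd /storage/brick1/brick1 && find . -type f -printf "%P %s\n" | sort')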

I ended up shutting down all VMs and restarting the server. Afterwards
healing worked as expected....
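
To confirm that, something along these lines should show zero pending
entries per brick once healing has finished (heal-count is the syntax at
least in 3.7):

    gluster volume heal vol4 info
    gluster volume heal vol4 statistics heal-count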

- Andreas


On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <atalur at redhat.com> wrote:

>
>
> ----- Original Message -----
> > From: "Andreas Mather" <andreas at allaboutapps.at>
> > To: "Anuradha Talur" <atalur at redhat.com>
> > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > Sent: Thursday, September 24, 2015 6:59:38 PM
> > Subject: Re: [Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status
> >
> > Hi Anuradha!
> >
> > Thanks for your reply! You can find the dump files attached. Since I'm
> > not sure they'll make it through as attachments, here are links to them
> > as well:
> >
> > brick1 - http://pastebin.com/3ivkhuRH
> > brick2 - http://pastebin.com/77sT1mut
> Hi,
>
> I see some blocked locks in the statedumps.
> Could you let me know what kind of workload you had when you observed
> the hang?
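>
> If you want to take a look yourself: blocked locks show up in the lock
> sections of the statedump as entries like
> "inodelk.inodelk[1](BLOCKED)=...", so something along these lines should
> surface them on the servers:
>
>     grep -n "(BLOCKED)" /var/run/gluster/*.dump.*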
> >
> > - Andreas
> >
> >
> >
> >
> > On Thu, Sep 24, 2015 at 3:18 PM, Anuradha Talur <atalur at redhat.com> wrote:
> >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Andreas Mather" <andreas at allaboutapps.at>
> > > > To: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > > Sent: Thursday, September 24, 2015 1:24:12 PM
> > > > Subject: [Gluster-users] gluster 3.7.3 - volume heal info hangs - unknown heal status
> > > >
> > > > Hi!
> > > >
> > > > Our provider had network maintenance this night, so 2 of our 4
> > > > servers got disconnected and reconnected. Since we knew this was
> > > > coming, we shifted all workload off the affected servers. This
> > > > morning, most of the cluster seems fine, but for one volume no heal
> > > > info can be retrieved, so we basically don't know the healing state
> > > > of the volume. The volume is a replica 2 volume between
> > > > vhost4-int/brick1 and vhost3-int/brick2.
> > > >
> > > > The volume is accessible, but since I don't get any heal info, I
> > > > don't know if it is properly replicated. Any help to resolve this
> > > > situation is highly appreciated.
> > > >
> > > > hangs forever:
> > > > [root@vhost4 ~]# gluster volume heal vol4 info
> > > >
> > > > glfsheal-vol4.log:
> > > > [2015-09-24 07:47:59.284723] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > > > [2015-09-24 07:47:59.293735] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > > > [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify] 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> > > > [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-1: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-2: parent translators are ready, attempting connect on transport
> > > > [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-1: changing port to 49155 (from 0)
> > > > [2015-09-24 07:47:59.315958] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.316481] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1: Connected to vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> > > > [2015-09-24 07:47:59.316495] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.316538] I [MSGID: 108005] [afr-common.c:3960:afr_notify] 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> > > > [2015-09-24 07:47:59.317150] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1: Server lk version = 1
> > > > [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-2: changing port to 49154 (from 0)
> > > > [2015-09-24 07:47:59.325633] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > > [2015-09-24 07:47:59.325780] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2: Connected to vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> > > > [2015-09-24 07:47:59.325791] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server and Client lk-version numbers are not same, reopening the fds
> > > > [2015-09-24 07:47:59.333346] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2: Server lk version = 1
> > > > [2015-09-24 07:47:59.334545] I [MSGID: 108031] [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting local read_child vol4-client-2
> > > > [2015-09-24 07:47:59.335833] I [MSGID: 104041] [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph 76686f73-7434-2e61-6c6c-61626f757461 (0)
> > > >
> > > > Questions about this output:
> > > > -) Why does it report "Using Program GlusterFS 3.3, Num (1298437), Version (330)"? We run 3.7.3?!
> > > > -) gluster logs timestamps in UTC without taking the server timezone into account. Is there a way to fix this?
> > > >
> > > > etc-glusterfs-glusterd.vol.log:
> > > > no new log entries after the volume heal info command
> > > >
> > > > storage-brick1-brick1.log:
> > > > [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> > > > [2015-09-24 07:47:59.325743] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client from vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0 (version: 3.7.3)
> > > >
> > > > storage-brick2-brick2.log:
> > > > no new log entries after the volume heal info command
> > > >
> > > >
> > > Hi Andreas,
> > >
> > > Could you please provide the following information so that we can
> > > understand why the command is hanging?
> > > When the command is hung, run the following command from one of the
> > > servers:
> > > `gluster volume statedump <volname>`
> > > This command will generate statedumps of the glusterfsd processes on
> > > the servers. You can find them at /var/run/gluster. A typical statedump
> > > for a brick has "<brick-path>.<pid-of-brick>.dump.<timestamp>" as its
> > > name. Could you please attach them and respond?
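> > >
> > > For example (a sketch, using vol4 as the volume name):
> > >
> > >     # terminal 1: reproduce the hang
> > >     gluster volume heal vol4 info
> > >     # terminal 2, on one of the servers, while the above is hung:
> > >     gluster volume statedump vol4
> > >     ls -lt /var/run/gluster/ | head    # newest files are the fresh dumps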
> > >
> > > > Thanks,
> > > >
> > > > - Andreas
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Gluster-users mailing list
> > > > Gluster-users at gluster.org
> > > > http://www.gluster.org/mailman/listinfo/gluster-users
> > >
> > > --
> > > Thanks,
> > > Anuradha.
> > >
> >
>
> --
> Thanks,
> Anuradha.
>