[Gluster-users] Failover problems with gluster 3.8.8-1 (latest Debian stable)

Alex K rightkicktech at gmail.com
Thu Feb 15 19:34:02 UTC 2018


Hi,

Have you checked for any file system errors on the brick mount point?

I was once facing weird I/O errors, and running xfs_repair on the brick filesystem fixed the issue.

What about the heal? Does it report any pending heals?
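
Something along these lines should show both (volume name taken from the
status output below; the brick device is a placeholder, and note that
xfs_repair wants the brick filesystem unmounted -- the -n flag only checks
and does not modify anything):

# gluster volume heal palantir info
# gluster volume heal palantir info split-brain
# umount /var/local/brick0
# xfs_repair -n /dev/<brick-device>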

On Feb 15, 2018 14:20, "Dave Sherohman" <dave at sherohman.org> wrote:

> Well, it looks like I've stumped the list, so I did a bit of additional
> digging myself:
>
> azathoth replicates with yog-sothoth, so I compared their brick
> directories.  `ls -R /var/local/brick0/data | md5sum` gives the same
> result on both servers, so the filenames are identical in both bricks.
> However, `du -s /var/local/brick0/data` shows that azathoth has about 3G
> more data (445G vs 442G) than yog.
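>
> (I suppose the next step in narrowing down where those ~3G live would be
> a per-directory size listing on each brick and then diffing the two
> outputs -- just a sketch, and read-only, so it should be harmless to run
> on both servers:
>
> # du -s /var/local/brick0/data/* /var/local/brick0/data/.glusterfs | sort -k2
>
> .glusterfs is where gluster keeps its gfid links, so stale data could be
> hiding in there as well.)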
>
> This seems consistent with my assumption that the problem is on
> yog-sothoth (everything is fine with only azathoth; there are problems
> with only yog-sothoth) and I am reminded that a few weeks ago,
> yog-sothoth was offline for 4-5 days, although it should have been
> brought back up-to-date once it came back online.
>
> So, assuming that the issue is stale/missing data on yog-sothoth, is
> there a way to force gluster to do a full refresh of the data from
> azathoth's brick to yog-sothoth's brick?  I would have expected running
> heal and/or rebalance to do that sort of thing, but I've run them both
> (with and without fix-layout on the rebalance) and the problem persists.
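>
> (For concreteness, by "heal" and "rebalance" I mean roughly the following
> -- volume name as in the status output further down; the exact
> invocations may have varied slightly:
>
> # gluster volume heal palantir
> # gluster volume heal palantir full
> # gluster volume rebalance palantir fix-layout start
> # gluster volume rebalance palantir start
> )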
>
> If there isn't a way to force a refresh, how risky would it be to kill
> gluster on yog-sothoth, wipe everything from /var/local/brick0, and then
> re-add it to the cluster as if I were replacing a physically failed
> disk?  Seems like that should work in principle, but it feels dangerous
> to wipe the partition and rebuild, regardless.
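>
> (The sequence I have in mind is roughly the usual failed-brick
> replacement dance: kill the brick process on yog-sothoth, wipe
> /var/local/brick0 and recreate an empty brick directory, and then
> something like the sketch below -- the new directory name is only a
> placeholder, since as far as I understand replace-brick won't accept the
> identical old path:
>
> # gluster volume replace-brick palantir \
>       yog-sothoth:/var/local/brick0/data \
>       yog-sothoth:/var/local/brick0/data-new commit force
> # gluster volume heal palantir full
> )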
>
> On Tue, Feb 13, 2018 at 07:33:44AM -0600, Dave Sherohman wrote:
> > I'm using gluster as a virt store, with a 3x2 distributed/replicated
> > volume across six servers, for 16 qemu/kvm/libvirt virtual machines
> > whose image files are stored in gluster and accessed via libgfapi.
> > Eight of these disk images
> > are standalone, while the other eight are qcow2 images which all share a
> > single backing file.
> >
> > For the most part, this is all working very well.  However, one of the
> > gluster servers (azathoth) causes three of the standalone VMs and all
> > eight of the shared-backing-image VMs to fail if it goes down.  Any of the
> > other gluster servers can go down with no problems; only azathoth causes
> > issues.
> >
> > In addition, the kvm hosts have the gluster volume fuse mounted and one
> > of them (out of five) detects an error on the gluster volume and puts
> > the fuse mount into read-only mode if azathoth goes down.  libgfapi
> > connections to the VM images continue to work normally from this host
> > despite this, and the other four kvm hosts are unaffected.
> >
> > It initially seemed relevant that I have the libgfapi URIs specified as
> > gluster://azathoth/..., but I've tried changing them to make the initial
> > connection via other gluster hosts and it had no effect on the problem.
> > Losing azathoth still took them out.
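> >
> > (Concretely, that meant pointing the URIs at another server, e.g.
> > gluster://yog-sothoth/palantir/<image> rather than
> > gluster://azathoth/palantir/<image>, where <image> is a placeholder.
> > For the fuse mounts, my understanding is that the analogous knob is the
> > backup-volfile-servers mount option, roughly (mount point is just an
> > example):
> >
> > # mount -t glusterfs -o backup-volfile-servers=yog-sothoth:cthulhu \
> >     azathoth:/palantir /mnt/palantir
> >
> > though I haven't dug into whether that changes the failover behavior
> > here.)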
> >
> > In addition to changing the mount URI, I've also manually run a heal and
> > rebalance on the volume, enabled the bitrot daemons (then turned them
> > back off a week later, since they reported no activity in that time),
> > and copied one of the standalone images to a new file in case it was a
> > problem with the file itself.  As far as I can tell, none of these
> > attempts changed anything.
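> >
> > (The bitrot part was just the standard enable / scrub-status / disable
> > cycle, roughly -- volume name as in the status output below:
> >
> > # gluster volume bitrot palantir enable
> > # gluster volume bitrot palantir scrub status
> > # gluster volume bitrot palantir disable
> > )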
> >
> > So I'm at a loss.  Is this a known type of problem?  If so, how do I fix
> > it?  If not, what's the next step to troubleshoot it?
> >
> >
> > # gluster --version
> > glusterfs 3.8.8 built on Jan 11 2017 14:07:11
> > Repository revision: git://git.gluster.com/glusterfs.git
> >
> > # gluster volume status
> > Status of volume: palantir
> > Gluster process                             TCP Port  RDMA Port  Online  Pid
> > ------------------------------------------------------------------------------
> > Brick saruman:/var/local/brick0/data        49154     0          Y       10690
> > Brick gandalf:/var/local/brick0/data        49155     0          Y       18732
> > Brick azathoth:/var/local/brick0/data       49155     0          Y       9507
> > Brick yog-sothoth:/var/local/brick0/data    49153     0          Y       39559
> > Brick cthulhu:/var/local/brick0/data        49152     0          Y       2682
> > Brick mordiggian:/var/local/brick0/data     49152     0          Y       39479
> > Self-heal Daemon on localhost               N/A       N/A        Y       9614
> > Self-heal Daemon on saruman.lub.lu.se       N/A       N/A        Y       15016
> > Self-heal Daemon on cthulhu.lub.lu.se       N/A       N/A        Y       9756
> > Self-heal Daemon on gandalf.lub.lu.se       N/A       N/A        Y       5962
> > Self-heal Daemon on mordiggian.lub.lu.se    N/A       N/A        Y       8295
> > Self-heal Daemon on yog-sothoth.lub.lu.se   N/A       N/A        Y       7588
> >
> > Task Status of Volume palantir
> > ------------------------------------------------------------------------------
> > Task                 : Rebalance
> > ID                   : c38e11fe-fe1b-464d-b9f5-1398441cc229
> > Status               : completed
> >
> >
> > --
> > Dave Sherohman
>
>
> --
> Dave Sherohman
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>