[Gluster-users] [Gluster-devel] VM fs becomes read only when one gluster node goes down

Tue Oct 27 09:13:29 UTC 2015

On Tue, Oct 27, 2015 at 01:56:31AM +0200, Roman wrote:
> Aren't we talking about this patch?
> https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD

No, the backup-volserver option only has an effect during the initial
mount: if the first storage server is not available to provide the
volume layout (.vol file), the other servers can be used as a backup.
Once the volume layout is known to the Gluster client, the client talks
to all the bricks directly.

Also, qemu+libgfapi does a "mount" of the volume before it can open the
disk image. This "mount" is a library call, not the usual syscall, and
it only fetches the volume layout from a GlusterD service.
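A minimal sketch of how such a disk is typically handed to qemu, assuming qemu was built with GlusterFS (libgfapi) support. The volume name "vmstore" and the image "vm1.qcow2" are hypothetical. The server part of the gluster:// URL is only used for that initial "mount", i.e. to fetch the volume layout.

```shell
#!/bin/sh
# Build the gluster:// URL that qemu accepts for libgfapi-backed images:
#   gluster://SERVER/VOLUME/IMAGE
# SERVER is only asked for the volume layout (.vol file); after that,
# libgfapi talks to the bricks directly.
gluster_url() {
    server=$1
    volume=$2
    image=$3
    printf 'gluster://%s/%s/%s\n' "$server" "$volume" "$image"
}

# Hypothetical volume and image names.
url=$(gluster_url storage.domain.local vmstore vm1.qcow2)
echo "$url"
# On a host with a Gluster-enabled qemu, this would show the image details:
#   qemu-img info "$url"
```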

HTH,
Niels

> 
> 2015-10-26 22:56 GMT+02:00 Niels de Vos <ndevos at redhat.com>:
> 
> > On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> > > Hi,
> > >
> > > I have a 4-node GlusterFS 3.5.6 cluster.
> > >
> > > My VM images are in a distributed-replicated volume, which is accessed
> > > from KVM/QEMU via libgfapi.
> > >
> > > The mount is against storage.domain.local, which has the IPs of all 4
> > > Gluster nodes set in DNS.
> > >
> > > When one of the Gluster nodes goes down (accidental reboot), a lot of
> > > the VMs get a read-only filesystem, even after the node comes back up.
> > >
> > > How can I prevent this?
> > > I expect the VM to just use the replicated file on the other node,
> > > without getting a read-only filesystem.
> > >
> > > Any hints?
> >
> > There are at least two timeouts that are involved in this problem:
> >
> > 1. The filesystem in a VM can go read-only when the virtual disk where
> >    the filesystem is located does not respond for a while.
> >
> > 2. When a storage server that holds a replica of the virtual disk
> >    becomes unreachable, the Gluster client (qemu+libgfapi) waits at
> >    most network.ping-timeout seconds before it resumes I/O.
> >
> > Once a filesystem in a VM goes read-only, you might be able to fsck it
> > and re-mount it read-write again. A VM will not do this by itself.
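To spot affected guests, a small helper can list read-only mounts. A sketch, assuming the usual /proc/mounts layout (mount point in field 2, options in field 4); the function name is made up, and some pseudo filesystems may be read-only on purpose, so check the output before acting on it.

```shell
#!/bin/sh
# Print the mount points whose mount options start with "ro".
# The file defaults to /proc/mounts; passing another file makes it testable.
list_ro_mounts() {
    mounts=${1:-/proc/mounts}
    awk '$4 ~ /^ro(,|$)/ { print $2 }' "$mounts"
}

# Example inside a guest:
#   list_ro_mounts
# After fsck, a filesystem can be brought back with e.g.:
#   mount -o remount,rw MOUNTPOINT
```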
> >
> >
> > The timeouts for (1) are set in sysfs:
> >
> >     $ cat /sys/block/sda/device/timeout
> >     30
> >
> > 30 seconds is the default for SCSI disk (sd) devices; for testing you
> > can change it with an echo:
> >
> >     # echo 300 > /sys/block/sda/device/timeout
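To check or change this for all SCSI disks in a guest at once, something like the following could be used. A sketch only: it assumes the guest disks appear as sd* devices (virtio-blk disks named vd* typically have no device/timeout file), and the optional sysfs-root argument exists just so the function can be tried against a fake tree.

```shell
#!/bin/sh
# Show, and optionally change, the timeout of every sd* disk.
# $1: new timeout in seconds (empty = only show)
# $2: sysfs root, defaults to /sys
set_disk_timeouts() {
    new_timeout=$1
    sysfs=${2:-/sys}
    for f in "$sysfs"/block/sd*/device/timeout; do
        [ -e "$f" ] || continue          # the glob matched nothing
        dev=${f#"$sysfs"/block/}         # e.g. "sda/device/timeout"
        dev=${dev%%/*}                   # e.g. "sda"
        if [ -n "$new_timeout" ]; then
            echo "$new_timeout" > "$f"   # needs root on a real /sys
        fi
        echo "$dev: $(cat "$f")"
    done
}

# Usage: "set_disk_timeouts" shows the values, "set_disk_timeouts 300"
# changes them (not persistent across reboots).
```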
> >
> > This is not a persistent change; you can create a udev rule to apply
> > it at boot.
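A possible udev rule for that, written out by a small shell function. This is a sketch under assumptions: the file name 99-disk-timeout.rules and the 300 second value are only examples, the rule matches whole sd* disks (not partitions), and it takes effect when the disks are added, i.e. at the next boot.

```shell
#!/bin/sh
# Write a udev rule that sets the timeout of sd* disks when they are added.
# $1: timeout in seconds (default 300, an example value)
# $2: rules directory (default /etc/udev/rules.d)
write_timeout_rule() {
    timeout=${1:-300}
    rules_dir=${2:-/etc/udev/rules.d}
    # $timeout is expanded by the shell; \$devpath stays literal for udev.
    cat > "$rules_dir/99-disk-timeout.rules" <<EOF
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c 'echo $timeout > /sys\$devpath/device/timeout'"
EOF
}

# Usage (as root): write_timeout_rule 300
```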
> >
> > Some filesystems offer a mount option that changes the behaviour after
> > a disk error is detected. "man mount" shows the "errors" option for
> > ext*. Changing this to "continue" is not recommended; "remount-ro" or
> > "panic" are the safest for your data.
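To see which errors= behaviour a guest is configured with, a sketch like this could help. It only looks at /etc/fstab (the function name is made up); when errors= is not set there, the default stored in the ext* superblock applies, which `tune2fs -l DEVICE` reports as "Errors behavior" and `tune2fs -e remount-ro DEVICE` can change.

```shell
#!/bin/sh
# Print the errors= mount option for MOUNTPOINT from an fstab-style file.
# $1: mount point, $2: fstab file (default /etc/fstab)
fstab_errors_option() {
    mountpoint=$1
    fstab=${2:-/etc/fstab}
    awk -v mp="$mountpoint" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] ~ /^errors=/) {
                    sub(/^errors=/, "", opts[i])
                    print opts[i]
                    found = 1
                }
            }
        }
        END { if (!found) print "not set (superblock default applies)" }
    ' "$fstab"
}
```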
> >
> >
> > The timeout mentioned in (2) is set on the Gluster volume and enforced
> > by the client. When a client writes to a replicated volume, the write
> > needs to be acknowledged by both/all replicas. The client (libgfapi)
> > delays the reply to the application (qemu) until the replies from
> > both/all replicas have been received. When a replica does not respond,
> > this delay lasts at most the volume option network.ping-timeout (42
> > seconds by default).
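On a storage server the option is changed with `gluster volume set`. The helper below is a hypothetical sketch that reads the current value from `gluster volume info` output, assuming changed options are listed as "key: value" lines under "Options Reconfigured" and falling back to the 42 second default otherwise.

```shell
#!/bin/sh
# Read network.ping-timeout from `gluster volume info VOLNAME` output on
# stdin. Prints the default of 42 when the option was never changed.
ping_timeout_from_info() {
    awk -F': ' '
        $1 == "network.ping-timeout" { print $2; found = 1 }
        END { if (!found) print 42 }
    '
}

# Usage on a Gluster server ("vmstore" and 20 are example values):
#   gluster volume info vmstore | ping_timeout_from_info
#   gluster volume set vmstore network.ping-timeout 20
```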
> >
> >
> > Now, if the disk in the VM reports errors after 30 seconds while the
> > client may wait up to 42 seconds for recovery, the guest gives up before
> > Gluster resumes I/O. So your solution could be to increase the
> > error-detection timeout of the disks inside the VMs, and/or decrease
> > network.ping-timeout.
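The comparison above can be written down as a tiny check. A sketch with a made-up function name; both arguments are in seconds, the first being the guest's disk timeout and the second the volume's network.ping-timeout.

```shell
#!/bin/sh
# Warn when the guest disk timeout does not exceed network.ping-timeout,
# i.e. when the guest may report I/O errors before Gluster resumes I/O.
check_timeouts() {
    guest_timeout=$1
    ping_timeout=$2
    if [ "$guest_timeout" -gt "$ping_timeout" ]; then
        echo "ok: guest waits ${guest_timeout}s, ping-timeout is ${ping_timeout}s"
    else
        echo "risk: guest gives up after ${guest_timeout}s, Gluster may wait ${ping_timeout}s"
    fi
}

check_timeouts 30 42    # the defaults discussed here
check_timeouts 300 42   # after raising the guest disk timeout
```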
> >
> > It would be interesting to know if adapting these values prevents the
> > read-only occurrences in your environment. If you do any testing with
> > this, please keep me informed about the results.
> >
> > Niels
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> 
> 
> 
> -- 
> Best regards,
> Roman.