[Gluster-devel] VM fs becomes read only when one gluster node goes down

Roman romeo.r at gmail.com
Mon Oct 26 23:56:31 UTC 2015


Aren't we talking about this patch?
https://git.proxmox.com/?p=pve-qemu-kvm.git;a=blob;f=debian/patches/gluster-backupserver.patch;h=ad241ee1154ebbd536d7c2c7987d86a02255aba2;hb=HEAD

2015-10-26 22:56 GMT+02:00 Niels de Vos <ndevos at redhat.com>:

> On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> > Hi,
> >
> > I have a 4-node GlusterFS 3.5.6 cluster.
> >
> > My VM images are in a distributed replicated volume which is accessed
> > from kvm/qemu via libgfapi.
> >
> > The mount is against storage.domain.local, which has the IPs of all 4
> > Gluster nodes set in DNS.
> >
> > When one of the Gluster nodes goes down (accidental reboot), a lot of
> > the VMs get a read-only filesystem, even after the node comes back up.
> >
> > How can I prevent this?
> > I would expect the VM to just use the replica on another node, without
> > the filesystem going read-only.
> >
> > Any hints?
>
> There are at least two timeouts that are involved in this problem:
>
> 1. The filesystem in a VM can go read-only when the virtual disk where
>    the filesystem is located does not respond for a while.
>
> 2. When a storage server that holds a replica of the virtual disk
>    becomes unreachable, the Gluster client (qemu+libgfapi) waits for
>    max. network.ping-timeout seconds before it resumes I/O.
>
> Once a filesystem in a VM goes read-only, you might be able to fsck it
> and re-mount it read-write again. This is not something the VM will do
> by itself.
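>
> If you have to do that, a minimal sketch of the manual recovery inside
> the VM would look like this (the device and mount point are only
> examples, adjust them to your layout):
>
>     # umount /data
>     # fsck -y /dev/vda1
>     # mount /dev/vda1 /data
>
> For the root filesystem this usually means rebooting the VM or using a
> rescue environment, because it cannot be unmounted while in use.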
>
>
> The timeouts for (1) are set in sysfs:
>
>     $ cat /sys/block/sda/device/timeout
>     30
>
> 30 seconds is the default for SCSI (sd*) devices, and for testing you can
> change it with an echo:
>
>     # echo 300 > /sys/block/sda/device/timeout
>
> This is not a persistent change; you can create a udev rule to apply it
> at boot.
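>
> A minimal sketch of such a udev rule (the filename and the 300 second
> value are only examples):
>
>     # /etc/udev/rules.d/99-disk-timeout.rules
>     ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", ATTR{device/timeout}="300"
>
> Newly appearing disks pick the rule up automatically; for devices that
> already exist you can run "udevadm trigger" or reboot.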
>
> Some filesystems offer a mount option that changes the behaviour after
> a disk error is detected. "man mount" shows the "errors" option for
> ext*. Changing this to "continue" is not recommended; "abort" or
> "panic" are the safest for your data.
>
>
> The timeout mentioned in (2) is for the Gluster Volume, and checked by
> the client. When a client does a write to a replicated volume, the write
> needs to be acknowledged by both/all replicas. The client (libgfapi)
> delays the reply to the application (qemu) until both/all replies from
> the replicas have been received. This delay is configured as the volume
> option network.ping-timeout (42 seconds by default).
>
>
> Now, if the VM returns block errors after 30 seconds, and the client
> waits up to 42 seconds for recovery, there is an issue... So, your
> solution could be to increase the timeout for error detection of the
> disks inside the VMs, and/or decrease the network.ping-timeout.
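>
> Lowering the ping timeout is a normal volume option change, for
> example (the volume name and the 10 second value are only examples):
>
>     # gluster volume set myvolume network.ping-timeout 10
>     # gluster volume info myvolume | grep ping-timeout
>
> Keep in mind that a very low value makes clients give up on bricks
> during short network hiccups, so do not set it too aggressively.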
>
> It would be interesting to know if adapting these values prevents the
> read-only occurrences in your environment. If you do any testing with
> this, please keep me informed about the results.
>
> Niels
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Best regards,
Roman.

