[Gluster-devel] VM fs becomes read only when one gluster node goes down

Mon Oct 26 20:56:24 UTC 2015

On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> Hi,
> 
> I have a 4-node GlusterFS 3.5.6 cluster.
> 
> My VM images are in a distributed-replicated volume which is accessed
> from kvm/qemu via libgfapi.
> 
> Mount is against storage.domain.local which has IPs for all 4 Gluster
> nodes set in DNS.
> 
> When one of the Gluster nodes goes down (accidental reboot), a lot of
> the VMs get a read-only filesystem, even after the node comes back up.
> 
> How can I prevent this?
> I expect that the VM just uses the replicated file on the other node,
> without getting a read-only filesystem.
> 
> Any hints?

There are at least two timeouts that are involved in this problem:

1. The filesystem in a VM can go read-only when the virtual disk where
   the filesystem is located does not respond for a while.

2. When a storage server that holds a replica of the virtual disk
   becomes unreachable, the Gluster client (qemu+libgfapi) waits for
   up to network.ping-timeout seconds before it resumes I/O.

Once a filesystem in a VM goes read-only, you might be able to fsck it
and remount it read-write. It is not something a VM will do by itself.

The timeouts for (1) are set in sysfs:

    $ cat /sys/block/sda/device/timeout
    30

30 seconds is the default for SCSI disk (sd) devices, and for testing
you can change it with an echo:

    # echo 300 > /sys/block/sda/device/timeout

This is not a persistent change; you can create a udev rule to apply it
at boot.
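
A rough sketch of such a udev rule, written out by a small shell helper.
This is untested against your setup: the helper name write_timeout_rule,
the rule file name 99-disk-timeout.rules and the value 300 are
assumptions, and the match keys may need adjusting for your
distribution.

```shell
# Write a udev rule that sets the SCSI command timeout for whole sd* disks.
# Hypothetical helper: the rule file name and timeout value are assumptions.
write_timeout_rule() {
    rule_file=$1
    timeout=$2
    # KERNEL=="sd*[!0-9]" matches sda, sdb, ... but not partitions like sda1.
    printf '%s\n' \
        "ACTION==\"add|change\", SUBSYSTEM==\"block\", KERNEL==\"sd*[!0-9]\", ATTR{device/timeout}=\"${timeout}\"" \
        > "$rule_file"
}

# Example (as root, inside the VM):
#   write_timeout_rule /etc/udev/rules.d/99-disk-timeout.rules 300
#   udevadm control --reload-rules
#   udevadm trigger --subsystem-match=block
```

After triggering, "cat /sys/block/sda/device/timeout" should show whether
the rule was applied.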

Some filesystems offer a mount option that changes the behaviour after
a disk error is detected. "man mount" shows the "errors" option for
ext*. Changing this to "continue" is not recommended; "remount-ro" or
"panic" are the safest for your data.
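
To see which policy an ext* filesystem currently uses, something like
the following might help. For ext2/3/4 the values documented in mount(8)
are continue, remount-ro and panic. The helper name errors_behavior and
the device /dev/sda1 are placeholders, not taken from your setup.

```shell
# Print the on-disk error policy from "tune2fs -l" output read on stdin.
# Hypothetical helper; it only filters the "Errors behavior:" line.
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Example (as root, inside the VM; /dev/sda1 is a placeholder):
#   tune2fs -l /dev/sda1 | errors_behavior
# Change the default stored in the superblock:
#   tune2fs -e remount-ro /dev/sda1
```

An "errors=" option given at mount time or in /etc/fstab overrides the
default stored in the superblock.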

The timeout mentioned in (2) is set on the Gluster volume and enforced
by the client. When a client writes to a replicated volume, the write
needs to be acknowledged by both/all replicas. The client (libgfapi)
delays the reply to the application (qemu) until the replies from
both/all replicas have been received. When a replica is unreachable,
this delay is bounded by the volume option network.ping-timeout
(42 seconds by default).

Now, if the VM returns block errors after 30 seconds, and the client
waits up to 42 seconds for recovery, there is an issue... So, your
solution could be to increase the timeout for error detection of the
disks inside the VMs, and/or decrease the network.ping-timeout.
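
The comparison can be sketched as a small shell check. The function name
check_timeouts, the volume name VOLNAME and the value 20 are assumptions
for illustration only; pick values that suit your environment.

```shell
# Report whether the guest disk timeout outlasts the Gluster ping-timeout.
# Both arguments are in seconds. Hypothetical helper for illustration.
check_timeouts() {
    disk_timeout=$1
    ping_timeout=$2
    if [ "$disk_timeout" -gt "$ping_timeout" ]; then
        echo "ok: disk timeout ${disk_timeout}s > ping-timeout ${ping_timeout}s"
    else
        echo "risk: disk timeout ${disk_timeout}s <= ping-timeout ${ping_timeout}s"
        return 1
    fi
}

# Example inside the VM, using the default ping-timeout of 42 seconds:
#   check_timeouts "$(cat /sys/block/sda/device/timeout)" 42
# Lowering the volume option instead (on a Gluster server):
#   gluster volume set VOLNAME network.ping-timeout 20
```

With the defaults (30 and 42) the check reports a risk, which matches
the behaviour you describe.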

It would be interesting to know if adapting these values prevents the
read-only occurrences in your environment. If you do any testing with
this, please keep me informed about the results.

Niels