[Gluster-devel] VM fs becomes read only when one gluster node goes down

André Bauer abauer at magix.net
Tue Oct 27 18:21:35 UTC 2015



Hi Niels,

My network.ping-timeout was already set to 5 seconds.

Unfortunately, it seems I don't have the timeout setting for my vda
disk in Ubuntu 14.04.

ls -al /sys/block/vda/device/ gives me only:

drwxr-xr-x 4 root root    0 Oct 26 20:21 ./
drwxr-xr-x 5 root root    0 Oct 26 20:21 ../
drwxr-xr-x 3 root root    0 Oct 26 20:21 block/
-r--r--r-- 1 root root 4096 Oct 27 18:13 device
lrwxrwxrwx 1 root root    0 Oct 27 18:13 driver -> ../../../../bus/virtio/drivers/virtio_blk/
-r--r--r-- 1 root root 4096 Oct 27 18:13 features
-r--r--r-- 1 root root 4096 Oct 27 18:13 modalias
drwxr-xr-x 2 root root    0 Oct 27 18:13 power/
-r--r--r-- 1 root root 4096 Oct 27 18:13 status
lrwxrwxrwx 1 root root    0 Oct 26 20:21 subsystem -> ../../../../bus/virtio/
-rw-r--r-- 1 root root 4096 Oct 26 20:21 uevent
-r--r--r-- 1 root root 4096 Oct 26 20:21 vendor
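
When checking several guests, a small Python sketch can report whether a disk exposes this attribute at all. It assumes the sysfs layout discussed in this thread (sd devices have device/timeout, while the virtio_blk disk vda listed above has none). The helper name read_disk_timeout and its sysfs_root parameter are hypothetical; sysfs_root exists only so the function can be tried against a fake directory tree.

```python
from pathlib import Path
from typing import Optional


def read_disk_timeout(dev: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the I/O timeout in seconds for a block device, or None.

    Hypothetical helper: reads <sysfs_root>/block/<dev>/device/timeout.
    Devices without that attribute (for example a virtio_blk disk such
    as vda) or devices that do not exist yield None.
    """
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        # No timeout attribute is exposed for this device/driver.
        return None
```

On the guest above, read_disk_timeout("vda") would return None, since the listing shows no timeout file for vda.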


Is the quorum setting a problem if you only have 2 replicas?

My volume has these quorum options set:

cluster.quorum-type: auto
cluster.server-quorum-type: server

As I understand the documentation
(https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Managing_Volumes-Quorum.html),
cluster.server-quorum-ratio is set to "< 50%" by default, which can
never happen if you only have 2 replicas and one node goes down, right?

Do I need cluster.server-quorum-ratio = 50% in this case?
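
The quorum arithmetic behind this question can be written down as a rough Python model. This is an assumption-laden sketch, not Gluster's code: it assumes that server quorum counts servers in the trusted storage pool and, with no ratio set, is met when more than 50% of them are up; that an explicitly set cluster.server-quorum-ratio is met when the active share is at least that ratio; and that client quorum "auto" counts bricks in one replica set and allows writes when more than half are up, or exactly half including the first brick. Both function names are hypothetical.

```python
from typing import Optional, Sequence


def server_quorum_met(active: int, total: int,
                      ratio_percent: Optional[float] = None) -> bool:
    """Simplified server-side quorum check over the servers in the pool.

    Assumption: with no ratio configured, quorum needs more than 50% of
    the servers up; with an explicit ratio, the active share must be
    greater than or equal to that ratio.
    """
    if ratio_percent is None:
        return 2 * active > total
    return 100 * active >= ratio_percent * total


def client_quorum_auto_met(bricks_up: Sequence[bool]) -> bool:
    """Simplified cluster.quorum-type=auto check for one replica set.

    Assumption: writes are allowed when more than half of the bricks are
    up, or exactly half are up and the first brick is one of them.
    """
    n = len(bricks_up)
    if n == 0:
        return False
    up = sum(bricks_up)
    if 2 * up > n:
        return True
    return 2 * up == n and bricks_up[0]
```

Under these assumptions, server_quorum_met(1, 2) is False while server_quorum_met(1, 2, 50) is True, and with 4 servers in the pool, server_quorum_met(3, 4) is True. For a replica pair, client_quorum_auto_met([True, False]) is True but client_quorum_auto_met([False, True]) is False, so in this model it matters which brick of the pair goes down.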



@ Josh

QEMU logged this at the time the VM's filesystem went read-only:

[2015-10-22 17:44:42.699990] E [socket.c:2244:socket_connect_finish] 0-vmimages-client-2: connection to 192.168.0.43:24007 failed (Connection refused)
[2015-10-22 17:45:03.411721] E [client-handshake.c:1760:client_query_portmap_cbk] 0-vmimages-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.

netstat looks good. As expected, I currently have connections to all
4 GlusterFS nodes.



@ Eivind
I don't think I had a split-brain.
Only the VM's filesystem went read-only, not the file on the GlusterFS
node.



Regards
André

On 26.10.2015 at 21:56, Niels de Vos wrote:
> 
> There are at least two timeouts that are involved in this problem:
> 
> 1. The filesystem in a VM can go read-only when the virtual disk
> where the filesystem is located does not respond for a while.
> 
> 2. When a storage server that holds a replica of the virtual disk
> becomes unreachable, the Gluster client (qemu+libgfapi) waits at
> most network.ping-timeout seconds before it resumes I/O.
> 
> Once a filesystem in a VM goes read-only, you might be able to fsck
> it and remount it read-write again. A VM will not do this by
> itself.
> 
> 
> The timeouts for (1) are set in sysfs:
> 
> $ cat /sys/block/sda/device/timeout
> 30
> 
> 30 seconds is the default for sd (SCSI disk) devices; for testing,
> you can change it with echo:
> 
> # echo 300 > /sys/block/sda/device/timeout
> 
> This is not a persistent change; you can create a udev rule to apply
> it at boot.
> 
> Some filesystems offer a mount option that changes the behaviour
> after a disk error is detected. "man mount" shows the "errors"
> option for ext*. Changing this to "continue" is not recommended;
> "remount-ro" or "panic" are the safest choices for your data.
> 
> 
> The timeout mentioned in (2) is set on the Gluster volume and checked
> by the client. When a client writes to a replicated volume, the
> write needs to be acknowledged by both/all replicas. The client
> (libgfapi) delays the reply to the application (qemu) until all
> replicas have replied. How long it waits for an unresponsive server
> is configured with the volume option network.ping-timeout (42
> seconds by default).
> 
> 
> Now, if the VM returns block errors after 30 seconds, and the
> client waits up to 42 seconds for recovery, there is an issue...
> So, your solution could be to increase the timeout for error
> detection of the disks inside the VMs, and/or decrease the
> network.ping-timeout.
> 
> It would be interesting to know if adapting these values prevents
> the read-only occurrences in your environment. If you do any
> testing with this, please keep me informed about the results.
> 
> Niels
> 
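
Niels' comparison of the two timeouts boils down to a single inequality, sketched here in Python. This is a simplified model under the assumption that the guest reports block errors once its disk timeout expires and that the Gluster client can stall I/O for up to network.ping-timeout seconds; it ignores reconnect and retry time, which the hypothetical margin_s parameter can stand in for. The function name is also hypothetical.

```python
def guest_outlasts_failover(disk_timeout_s: float, ping_timeout_s: float,
                            margin_s: float = 0.0) -> bool:
    """Return True if the guest waits longer than the Gluster client may stall.

    Simplified model: the VM reports block errors after disk_timeout_s,
    while libgfapi may hold I/O for up to ping_timeout_s (plus an optional
    safety margin for reconnecting) when a replica becomes unreachable.
    """
    return disk_timeout_s > ping_timeout_s + margin_s


if __name__ == "__main__":
    # Defaults from the thread: 30 s sd-device timeout vs. 42 s ping-timeout.
    print(guest_outlasts_failover(30, 42))   # False: the guest gives up first
    # Guest timeout raised to 300 s with the echo shown above.
    print(guest_outlasts_failover(300, 42))  # True
    # network.ping-timeout of 5 s, as reported above, with the 30 s default.
    print(guest_outlasts_failover(30, 5))    # True
```

This check only applies to disks that expose a timeout at all; the listing earlier in the message shows no such attribute for the virtio_blk disk vda.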


-- 
Best regards
André Bauer

MAGIX Software GmbH
André Bauer
Administrator
August-Bebel-Straße 48
01219 Dresden
GERMANY

tel.: 0351 41884875
e-mail: abauer at magix.net
www.magix.com <http://www.magix.com/>

Managing Directors: Dr. Arnd Schröder, Klaus Schmidt
Commercial Register: Berlin Charlottenburg, HRB 127205
