[Gluster-users] Virtual machines and self-healing on GlusterFS v3.3

Dario Berzano dario.berzano at cern.ch
Mon Sep 17 08:06:56 UTC 2012


Hi Pranith,

  those bricks are on different servers connected to the same switch: the only possibility I see is that the switch went down for some reason, since it is our single point of failure. The servers themselves never went down at the same time.

However, I do not understand why, when I run getfattr continuously:

  watch -n1 'getfattr -d -m . -e hex 1814/images/disk.0'

I get alternating:

trusted.afr.VmDir-client-0=0x000000010000000000000000
trusted.afr.VmDir-client-1=0x000000010000000000000000

and:

trusted.afr.VmDir-client-0=0x000000000000000000000000
trusted.afr.VmDir-client-1=0x000000000000000000000000

This again happens with every "big" file.

Does this suggest a network problem, maybe? One of the servers has 1 GbE while the other one has a faster 10 GbE, but I do not think this is enough to continuously de-synchronize the bricks...
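
For anyone reading the values above, here is a minimal sketch of how they can be decoded. It assumes the usual AFR changelog layout: 12 bytes holding three big-endian 32-bit pending counters, in the order data, metadata, entry. The function names are hypothetical helpers and not part of any GlusterFS tool; the sample input is copied from the getfattr output above.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* value into its pending counters.

    Assumption: 12 bytes = three big-endian uint32 values,
    ordered as (data, metadata, entry).
    """
    text = hex_value.strip()
    if text.lower().startswith("0x"):
        text = text[2:]
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def parse_getfattr(output):
    """Collect the trusted.afr.* lines from `getfattr -d -m . -e hex` output."""
    result = {}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("trusted.afr.") and "=" in line:
            name, value = line.split("=", 1)
            result[name] = decode_afr_changelog(value)
    return result


# Sample taken from the first of the two alternating states.
sample = """\
trusted.afr.VmDir-client-0=0x000000010000000000000000
trusted.afr.VmDir-client-1=0x000000010000000000000000
"""

for name, counters in parse_getfattr(sample).items():
    print(name, counters)
```

Under that assumption, the first state decodes to one pending data operation for each subvolume and no pending metadata or entry operations, while the second state is all zeros.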

Cheers
--
: Dario Berzano
: CERN PH-SFT & Università di Torino (Italy)
: Wiki: http://newton.ph.unito.it/~berzano
: GPG: http://newton.ph.unito.it/~berzano/gpg
: Mobiles: +41 766124782 (CH), +39 3487222520 (IT)



On 17 Sep 2012, at 00:11, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:

> 1814/images/disk.0 has a pending data changelog for both subvolumes, i.e. 0x00000001. This happens when both bricks go down at the same time while an operation is in progress. Did that happen?
> 
> Pranith.
> 
> ----- Original Message -----
> From: "Dario Berzano" <dario.berzano at cern.ch>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "gluster-users" <gluster-users at gluster.org>
> Sent: Sunday, September 16, 2012 9:20:23 PM
> Subject: Re: [Gluster-users] Virtual machines and self-healing on GlusterFS v3.3
> 
> Ok, here's the output for 1816/images/disk.0:
> 
> # file: bricks/VmDir01/1816/images/disk.0
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.VmDir-client-0=0x000000000000000000000000
> trusted.afr.VmDir-client-1=0x000000000000000000000000
> trusted.gfid=0x1cef9d386f1c4424af6d95dfbcf2989b
> 
> # file: bricks/VmDir02/1816/images/disk.0
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.VmDir-client-0=0x000000000000000000000000
> trusted.afr.VmDir-client-1=0x000000000000000000000000
> trusted.gfid=0x1cef9d386f1c4424af6d95dfbcf2989b
> 
> And for 1814/images/disk.0:
> 
> # file: bricks/VmDir01/1814/images/disk.0
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.VmDir-client-0=0x000000010000000000000000
> trusted.afr.VmDir-client-1=0x000000010000000000000000
> trusted.gfid=0xaabc0c344ccc4cfe8e2ed588dd78323b
> 
> # file: bricks/VmDir02/1814/images/disk.0
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.VmDir-client-0=0x000000010000000000000000
> trusted.afr.VmDir-client-1=0x000000010000000000000000
> trusted.gfid=0xaabc0c344ccc4cfe8e2ed588dd78323b
> 
> Note that these are just two sample files, since the problem occurs with 100% of our "big" virtual machines. Here's the whole content of the GlusterFS volume along with file sizes:
> 
> 6.3G ./1981/images/disk.0
> 53M ./1820/images/disk.0
> 9.7G ./1838/images/disk.0
> 10G ./1819/images/disk.0
> 9.2G ./1818/images/disk.0
> 10G ./1816/images/disk.0
> 53M ./1962/images/disk.0
> 10G ./1814/images/disk.0
> 6.2G ./1988/images/disk.0
> 10G ./1817/images/disk.0
> 53M ./1821/images/disk.0
> 
> We currently have 11 running VMs. The "small" ones (53 MB) have never shown any problem so far. *All* the other VMs (6 to 10 GB) periodically show up in the output of:
> 
> gluster volume heal VmDir info
> 
> when intense I/O is occurring, and disappear shortly afterwards.
> 
> Thanks, cheers,
> 
> 
> On 14 Sep 2012, at 18:21, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
> 
>> Dario,
>> Ok, that confirms that it is not a split-brain. Could you post the getfattr output I requested as well? What is the size of the VM files?
>> 
>> Pranith
>> ----- Original Message -----
>> From: "Dario Berzano" <dario.berzano at cern.ch>
>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> Cc: "<gluster-users at gluster.org>" <gluster-users at gluster.org>
>> Sent: Friday, September 14, 2012 9:42:38 PM
>> Subject: Re: [Gluster-users] Virtual machines and self-healing on GlusterFS v3.3
>> 
>> 
>> # gluster volume heal VmDir info healed 
>> 
>> 
>> Heal operation on volume VmDir has been successful 
>> 
>> 
>> Brick one-san-01:/bricks/VmDir01 
>> Number of entries: 259 
>> Segmentation fault (core dumped) 
>> 
>> 
>> (same story for heal-failed) which seems to be exactly this bug: 
>> 
>> 
>> https://bugzilla.redhat.com/show_bug.cgi?id=836421 
>> 
>> 
>> Should I upgrade to latest QA RPMs to see what is going on? 
>> 
>> 
>> Btw, with split-brain I have no entries: 
>> 
>> 
>> 
>> Heal operation on volume VmDir has been successful 
>> 
>> 
>> Brick one-san-01:/bricks/VmDir01 
>> Number of entries: 0 
>> 
>> 
>> Brick one-san-02:/bricks/VmDir02 
>> Number of entries: 0 
>> 
>> 
>> Thank you, cheers, 
>> 
>> 
>> 
>> 
>> On 14 Sep 2012, at 17:16, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
>> 
>> 
>> hi Dario, 
>> Could you post the output of the following commands: 
>> gluster volume heal VmDir info healed 
>> gluster volume heal VmDir info split-brain 
>> 
>> Also provide the output of 'getfattr -d -m . -e hex' on both bricks for the two files listed in the output of 'gluster volume heal VmDir info'.
>> 
>> Pranith. 
>> 
>> ----- Original Message ----- 
>> From: "Dario Berzano" < dario.berzano at cern.ch > 
>> To: gluster-users at gluster.org 
>> Sent: Friday, September 14, 2012 6:57:32 PM 
>> Subject: [Gluster-users] Virtual machines and self-healing on GlusterFS v3.3 
>> 
>> 
>> 
>> Hello, 
>> 
>> 
>> in our computing centre we have an infrastructure with a GlusterFS volume made of two bricks in replicated mode: 
>> 
>> 
>> 
>> 
>> 
>> Volume Name: VmDir 
>> Type: Replicate 
>> Volume ID: 9aab85df-505c-460a-9e5b-381b1bf3c030 
>> Status: Started 
>> Number of Bricks: 1 x 2 = 2 
>> Transport-type: tcp 
>> Bricks: 
>> Brick1: one-san-01:/bricks/VmDir01 
>> Brick2: one-san-02:/bricks/VmDir02 
>> 
>> 
>> 
>> 
>> We are using this volume to store running images of some KVM virtual machines and thought we could benefit from the replicated storage in order to achieve more robustness as well as the ability to live-migrate VMs. 
>> 
>> 
>> Our GlusterFS volume VmDir is mounted on several (three at the moment) hypervisors. 
>> 
>> 
>> However, in many cases, either when one brick becomes unavailable for some reason or when we perform live migrations, virtual machines remount the filesystems on their virtual disks read-only. This is difficult to reproduce: the best way is to stress VM I/O. At the same time, on the hypervisors mounting the GlusterFS volume, we spot kernel messages like:
>> 
>> 
>> 
>> 
>> INFO: task kvm:13560 blocked for more than 120 seconds. 
>> 
>> 
>> 
>> 
>> By googling it I have found some "workarounds" to mitigate this problem, like mounting disks within virtual machines with barrier=0: 
>> 
>> 
>> http://invalidlogic.com/2012/04/28/ubuntu-precise-on-xenserver-disk-errors/ 
>> 
>> 
>> but I am actually afraid of damaging my virtual machine disks by doing so!
>> 
>> 
>> AFAIK, starting with GlusterFS v3.3, self-healing should be performed server-side, with granular locking of big files (and no self-healing at all performed on the clients). When I connect to my GlusterFS pool and monitor the self-healing status continuously:
>> 
>> 
>> watch -n1 'gluster volume heal VmDir info' 
>> 
>> 
>> I obtain an output like: 
>> 
>> 
>> 
>> 
>> 
>> Heal operation on volume VmDir has been successful 
>> 
>> 
>> Brick one-san-01:/bricks/VmDir01 
>> Number of entries: 2 
>> /1814/images/disk.0 
>> /1816/images/disk.0 
>> 
>> 
>> Brick one-san-02:/bricks/VmDir02 
>> Number of entries: 2 
>> /1816/images/disk.0 
>> /1814/images/disk.0 
>> 
>> 
>> 
>> 
>> with a list of virtual machine disks healed by GlusterFS. Those and other files continuously appear and disappear from the list. 
>> 
>> 
>> This is a behavior I don't understand at all: does this mean that those files continuously get corrupted and healed, or is self-healing just a natural part of the replication process? Or is some kind of corruption actually happening on our virtual disks for some reason? Is this related to the "remount read-only" problem?
>> 
>> 
>> A more general question maybe would be: is GlusterFS v3.3 ready for storing running virtual machines (and is there some special configuration option needed on the volumes and clients for that)? 
>> 
>> Thank you in advance for shedding some light... 
>> 
>> 
>> Regards, 
>> 
>> 
