[Gluster-users] libgfapi failover problem on replica bricks

Paul Penev ppquant at gmail.com
Wed Apr 16 16:20:42 UTC 2014


>>I can easily reproduce the problem on this cluster. It appears that
>>there is a "primary" replica and a "secondary" replica.
>>
>>If I reboot or kill the glusterfs process there are no problems on the
>>running VM.
>
> Good. That is as expected.

Sorry, I was not clear enough. I meant that if I reboot the
"secondary" replica, there are no problems.

>>If I reboot or "killall -KILL glusterfsd" the primary replica (so I
>>don't let it terminate properly), I can block the the VM each time.
>
> Have you followed my blog advice to prevent the vm from remounting the image filesystem read-only and waited ping-timeout seconds (42 by default)?

I have not followed your advice, but there is a difference: I get I/O
errors *reading* from the disk. Once the problem kicks in, I cannot issue
commands (like ls) because their binaries can't be read from the disk.

There is a problem with that setup: it cannot be implemented on
Windows machines (which are more vulnerable), and it also cannot be
implemented on machines over which I have no control (customer machines).
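
For the Linux guests I do control, I understand the idea to be roughly
this (device names below are only examples): keep ext3/ext4 from
remounting read-only on errors, and raise the guest's disk timeout above
the ping-timeout:

# tune2fs -e continue /dev/sda1
# echo 300 > /sys/block/sda/device/timeout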

>>If I "reset" the VM it will not find the boot disk.
>
> Somewhat expected if within the ping-timeout.

The issue persists beyond the ping-timeout. The KVM process needs to
be reinitialized. I guess libgfapi needs to reconnect from scratch.

>>If I power down and power up the VM, then it will boot but will find
>>corruption on disk during the boot that requires fixing.
>
> Expected since the vm doesn't use the image filesystem synchronously. You can change that with mount options at the cost of performance.

Ok. I understand this point.
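
If I read it correctly, that means mounting the guest filesystem with the
sync option, e.g. something like this in the guest's /etc/fstab (device
name is only an example):

/dev/vda1  /  ext4  defaults,sync  0  1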

> Unless you wait for ping-timeout and then continue writing the replica is actually still in sync. It's only out of sync if you write to one replica but not the other.
>
> You can shorten the ping timeout. There is a cost to reconnection if you do.  Be sure to test a scenario with servers under production loads and see what the performance degradation during a reconnect is. Balance your needs appropriately.

Could you please elaborate on the cost of reconnection? I will try
running with a very short ping-timeout (2 sec) and see whether the
problem lies with the ping-timeout or somewhere else.
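
For that test I plan to set it on the volume with:

# gluster volume set pool network.ping-timeout 2

and restore it afterwards the same way (or with "gluster volume reset
pool network.ping-timeout").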


Paul

2014-04-06 17:52 GMT+02:00 Paul Penev <ppquant at gmail.com>:
> Hello,
>
> I'm having an issue with rebooting bricks holding images for live KVM
> machines (using libgfapi).
>
> I have a replicated+distributed setup of 4 bricks (2x2). The cluster
> contains images for a couple of kvm virtual machines.
>
> My problem is that when I reboot a brick containing an image of a
> VM, the VM will start throwing disk errors and eventually die.
>
> The gluster volume is made like this:
>
> # gluster vol info pool
>
> Volume Name: pool
> Type: Distributed-Replicate
> Volume ID: xxxxxxxxxxxxxxxxxxxx
> Status: Started
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: srv10g:/data/gluster/brick
> Brick2: srv11g:/data/gluster/brick
> Brick3: srv12g:/data/gluster/brick
> Brick4: srv13g:/data/gluster/brick
> Options Reconfigured:
> network.ping-timeout: 10
> cluster.server-quorum-type: server
> diagnostics.client-log-level: WARNING
> auth.allow: 192.168.0.*,127.*
> nfs.disable: on
>
> The KVM instances run on the same gluster bricks, with disks attached
> as: file=gluster://localhost/pool/images/vm-xxx-disk-1.raw,.......,cache=writethrough,aio=native
>
> My self-heal backlog is not always 0. It looks like some writes are
> not going to all bricks at the same time (?).
>
> gluster vol heal pool info
>
> sometimes shows images needing sync on one brick, the other, or both.
>
> There are no network problems or errors on the wire.
>
> Any ideas what could be causing this?
>
> Thanks.


