[Gluster-users] libgfapi failover problem on replica bricks

Sun Apr 6 15:52:53 UTC 2014

Hello,

I'm having an issue with rebooting bricks holding images for live KVM
machines (using libgfapi).

I have a replicated+distributed setup of 4 bricks (2x2). The cluster
contains images for a couple of KVM virtual machines.

My problem is that when I reboot a brick containing an image of a
VM, the VM starts throwing disk errors and eventually dies.

The gluster volume is made like this:

# gluster vol info pool

Volume Name: pool
Type: Distributed-Replicate
Volume ID: xxxxxxxxxxxxxxxxxxxx
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: srv10g:/data/gluster/brick
Brick2: srv11g:/data/gluster/brick
Brick3: srv12g:/data/gluster/brick
Brick4: srv13g:/data/gluster/brick
Options Reconfigured:
network.ping-timeout: 10
cluster.server-quorum-type: server
diagnostics.client-log-level: WARNING
auth.allow: 192.168.0.*,127.*
nfs.disable: on

The KVM instances run on the same hosts as the gluster bricks, with
disks attached as:

file=gluster://localhost/pool/images/vm-xxx-disk-1.raw,.......,cache=writethrough,aio=native

My self-heal backlog is not always 0. It looks like some writes are
not going to all bricks at the same time (?).

gluster vol heal pool info

sometimes shows the images needing sync on one brick, the other, or both.
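
To see how the backlog changes over time (for example around a brick
reboot), the heal output could be reduced to a single number. This is
only a rough sketch: it assumes the command prints one
"Number of entries: N" line per brick, and the helper name
sum_heal_entries is made up for illustration.

```shell
#!/bin/sh
# Sum the per-brick "Number of entries: N" lines read on stdin.
# Assumption: `gluster volume heal <vol> info` prints lines in exactly
# that form; adjust the pattern if your release prints something else.
sum_heal_entries() {
    awk -F': *' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}

# Hypothetical polling loop (left commented out so nothing runs on load):
#   while sleep 10; do
#       printf '%s %s\n' "$(date +%T)" \
#           "$(gluster volume heal pool info | sum_heal_entries)"
#   done
```

A total that stays above 0 while all bricks are up would suggest the
pending heals are not only a side effect of the reboot.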

There are no network problems or errors on the wire.

Any ideas what could be causing this?

Thanks.