[Gluster-users] libgfapi failover problem on replica bricks

Paul Penev ppquant at gmail.com
Mon Apr 21 07:47:48 UTC 2014


Ok, here is one more hint pointing in the direction of libgfapi not
re-establishing its connections to the bricks after they come back
online: if I migrate the KVM machine (live) from one node to another
after the bricks are back online and then kill the second brick, the
KVM does not suffer any disk problems. It is obvious that during
migration the new process on the destination node is forced to
reconnect to the gluster volume, hence re-establishing both links.
After this it is ready to lose either of the links without problems.
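
A quick way to confirm which brick connections the running qemu
process actually holds is to look at its TCP sockets on the
hypervisor. This is only a sketch: the VM name "vm1" and the volume
name <vol> are placeholders, and I'm assuming the default 3.4 brick
port range (49152 and up):

  # find the qemu process serving the VM (name "vm1" is hypothetical)
  pgrep -f 'qemu.*vm1'

  # list that process' TCP connections; a healthy replica-2 client
  # should show one ESTABLISHED socket per brick
  ss -tnp | grep qemu

  # cross-check from the gluster side which clients each brick sees
  gluster volume status <vol> clients

After a brick is killed and brought back, the missing ESTABLISHED
socket is exactly what I would expect libgfapi to re-open on its own.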

Steps to replicate (a rough command version follows the list):

1. Start KVM VM and boot from a replicated volume
2. killall -KILL glusterfsd on one brick (brick1). Verify that the KVM
is still working.
3. Bring back the glusterfsd on brick1.
4. heal the volume (gluster vol heal <vol>) and wait until gluster vol
heal <vol> info shows no self-heal backlog.
5. Now migrate the KVM from one node to another node.
6. killall -KILL glusterfsd on the second brick (brick2).
7. Verify that the KVM is still working (!). Without step 5, it would
have died from disk errors at this point.
8. Bring back glusterfsd on brick2, heal and enjoy.
9. Repeat at will: the KVM will never die again, provided you migrate
it once before the brick failure.
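
For completeness, here is the same sequence as rough commands. The
hostnames, the volume name <vol> and the VM name "vm1" are
placeholders, and I'm assuming a libvirt-managed guest for the
migration step (adjust if you drive qemu directly):

  # 1. the VM boots with its disk on the replicated volume, e.g.
  #    qemu ... -drive file=gluster://server1/<vol>/vm1.img,if=virtio
  # 2. on the node hosting brick1:
  killall -KILL glusterfsd
  # 3. one way to respawn the killed brick process:
  gluster volume start <vol> force
  # 4. heal and wait until the backlog is empty:
  gluster vol heal <vol>
  gluster vol heal <vol> info
  # 5. live-migrate the VM to the other node:
  virsh migrate --live vm1 qemu+ssh://othernode/system
  # 6. on the node hosting brick2:
  killall -KILL glusterfsd
  # 7.-9. the VM keeps working; bring brick2 back and heal as in 3.-4.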

What this means to me: there is a problem in libgfapi with GlusterFS
3.4.2 and 3.4.3 (at least) and/or in KVM 1.7.1 (I'm running the latest
1.7 source tree in production).

Joe: we're in your hands. I hope you find the problem somewhere.

Paul.


