[Gluster-users] libgfapi failover problem on replica bricks

Wed Apr 16 16:20:52 UTC 2014

libgfapi uses the same translators as the fuse client. That means you 
have the same client translator with the same behavior as any other 
client. Since the client translator connects to all servers, the loss of 
any one server without closing the tcp connection /should/ result in the 
same ping-timeout->continued-use as any other client. Since this isn't 
happening, I would look at the client logs and/or network captures. 
There are, as you know, no "primary" or "secondary" bricks. They're all 
equal. Failure to continue using any particular server suggests to me 
that maybe there's some problem with that server.

I'll see if I can put together some sort of simulation today to test it 
myself though.
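
For reference, here is a toy model of the ping-timeout->continued-use
behavior I'd expect. It is only a sketch in Python, not GlusterFS code:
the Brick class, write_completes_at() and the assumption that healthy
bricks answer instantly are all made up for illustration. The only
number taken from this thread is the 42 second default.

```python
from dataclasses import dataclass
from typing import List, Optional

PING_TIMEOUT = 42.0  # seconds; the default mentioned in this thread


@dataclass
class Brick:
    """Hypothetical stand-in for one replica brick."""
    name: str
    died_at: Optional[float] = None  # when it went silent; None = healthy


def write_completes_at(bricks: List[Brick], issued_at: float,
                       ping_timeout: float = PING_TIMEOUT) -> Optional[float]:
    """Return the time a write issued at `issued_at` completes.

    Toy assumptions: a healthy brick answers instantly, and a brick that
    went silent without closing its TCP connection holds the write up
    until died_at + ping_timeout, after which the client gives up on it.
    Returns None when no brick is left to serve the write.
    """
    done = issued_at
    any_alive = False
    for brick in bricks:
        if brick.died_at is None or brick.died_at > issued_at:
            any_alive = True  # this brick still answers
        else:
            # Silent brick: wait until the client declares it dead.
            done = max(done, brick.died_at + ping_timeout)
    return done if any_alive else None


if __name__ == "__main__":
    replica = [Brick("server1"), Brick("server2", died_at=100.0)]
    for t in (90.0, 110.0, 150.0):
        print(t, "->", write_completes_at(replica, t))
```

In this model it makes no difference which of the two bricks goes
silent: a write issued inside the timeout window stalls until the
window closes, and writes after that go straight through on the
surviving brick.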

On 4/16/2014 8:04 AM, Paul Penev wrote:
>>> I can easily reproduce the problem on this cluster. It appears that
>>> there is a "primary" replica and a "secondary" replica.
>>>
>>> If I reboot or kill the glusterfs process there are no problems on the
>>> running VM.
>> Good. That is as expected.
> Sorry, I was not clear enough. I meant that if I reboot the
> "secondary" replica, there are no problems.
>
>>> If I reboot or "killall -KILL glusterfsd" the primary replica (so I
>>> don't let it terminate properly), I can block the VM each time.
>> Have you followed my blog advice to prevent the VM from remounting the image filesystem read-only and waited ping-timeout seconds (42 by default)?
> I have not followed your advice, but there is a difference: I get i/o
> errors *reading* from the disk. Once the problem kicks in, I cannot issue
> commands (like ls) because they can't be read.
>
> There is a problem with that setup: It cannot be implemented on
> Windows machines (which are more vulnerable) and also cannot be
> implemented on machines which I have no control over (customers).
>
>>> If I "reset" the VM it will not find the boot disk.
>> Somewhat expected if within the ping-timeout.
> The issue persists beyond the ping-timeout. The KVM process needs to
> be reinitialized. I guess libgfapi needs to reconnect from scratch.
>
>>> If I power down and power up the VM, then it will boot but will find
>>> corruption on disk during the boot that requires fixing.
>> Expected since the vm doesn't use the image filesystem synchronously. You can change that with mount options at the cost of performance.
> Ok. I understand this point.
>
>> Unless you wait for ping-timeout and then continue writing, the replica is actually still in sync. It's only out of sync if you write to one replica but not the other.
>>
>> You can shorten the ping timeout. There is a cost to reconnection if you do.  Be sure to test a scenario with servers under production loads and see what the performance degradation during a reconnect is. Balance your needs appropriately.
> Could you please elaborate on the cost of reconnection? I will try to
> run with a very short ping timeout (2sec) and see if the problem is in
> the ping-timeout or perhaps not.
>
> Paul
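
On shortening the ping timeout: the volume option is
network.ping-timeout, set per volume with "gluster volume set". Below
is a small Python sketch that builds that command. The volume name
"vmstore" is hypothetical and the helper names are my own; it defaults
to a dry run so nothing is changed unless you ask for it.

```python
import subprocess
from typing import List


def ping_timeout_command(volume: str, seconds: int) -> List[str]:
    """Build the gluster CLI call that sets network.ping-timeout."""
    if seconds <= 0:
        raise ValueError("ping timeout must be a positive number of seconds")
    return ["gluster", "volume", "set", volume,
            "network.ping-timeout", str(seconds)]


def set_ping_timeout(volume: str, seconds: int, dry_run: bool = True) -> str:
    """Return the command line; run it only when dry_run is False."""
    cmd = ping_timeout_command(volume, seconds)
    if not dry_run:
        # Needs the gluster CLI on a server in the trusted pool.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # "vmstore" is a placeholder volume name, not one from this thread.
    print(set_ping_timeout("vmstore", 2))
```

Try a value like 2 seconds on a test volume under production-like load
first, since every timeout that fires carries the reconnection cost
mentioned above.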
