[Gluster-users] libgfapi failover problem on replica bricks

Paul Penev ppquant at gmail.com
Sun Apr 20 22:15:54 UTC 2014


Hi All,

I'm trying to make sense of this problem. I did more testing on a
two-node cluster and came up with a reliable way to reproduce the problem.

First, I create a KVM machine (Linux) and start it on the 2x replica
gluster volume, which is backed by two XFS partitions. I am using the
Debian gluster 3.4.3 packages, KVM 1.7.1 and a 2.6.32-openvz kernel.

If I simply stop glusterfsd using the Debian init.d script, I see
that the daemon never actually quits: it stays in memory and keeps
working.

If I then kill it, the KVM guest continues working. I use kill -KILL,
so it is not a clean shutdown.

I then bring the gluster brick back, issue 'gluster vol heal <vol>'
and *wait* for the heal to finish, i.e. until 'gluster vol heal <vol>
info' reports 0 entries for both nodes. This can take several minutes
for a 32G image.

Now I repeat the process of stopping and killing glusterfsd on the other node.

The moment I kill glusterfsd, I get a disk error in the VM and its
filesystem is remounted read-only (I have network.ping-timeout set to
2 seconds).
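
(For completeness: I lowered the timeout with the usual volume option,
i.e. something like

    gluster volume set <vol> network.ping-timeout 2

run on one of the nodes.)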

From this moment on, the VM cannot be brought back. If I reset it (via
the KVM monitor) it will not boot, because it cannot find its boot
disk.

To boot again I need to stop the KVM process and start a new one from
scratch. In that case the VM finds the disk, recovers it (fsck finds
errors) and boots from it.

My impression is that the client (libgfapi) does not reconnect to
bricks after they go away. So losing the first brick does no harm, but
losing the other one later leaves the client with no bricks at all.

If I repeat the process with the bricks swapped, I still trigger the
issue on the second "failover".

One more interesting thing: if I stop the same brick over and over,
the KVM guest keeps working.

So, from all this testing, I think the problem is not
network.ping-timeout. It is reproducible every time and looks like a
bug.

However, I do not know how to get debug output from KVM directly. I
tried looking at the code of KVM's gluster disk driver (block/gluster.c)
but I see nothing that deals with reconnecting the client.
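
For reference, here is my (simplified and possibly incomplete) picture
of what a libgfapi client does when it opens a volume. This is only a
sketch based on the public api/glfs.h calls, not the actual qemu code;
the volume name, host and log path below are placeholders:

#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
        /* One glfs_t object holds the whole client-side graph,
         * including the connections to every brick of the replica. */
        glfs_t *fs = glfs_new("pool");              /* volume name */
        if (!fs)
                return 1;

        /* Where to fetch the volfile from; during glfs_init() the
         * client then connects to all bricks listed in that volfile. */
        glfs_set_volfile_server(fs, "tcp", "localhost", 24007);
        glfs_set_logging(fs, "/tmp/gfapi.log", 7);

        if (glfs_init(fs) != 0) {
                fprintf(stderr, "glfs_init failed\n");
                return 1;
        }

        /* ... glfs_open() / glfs_pread() / glfs_pwrite() on the image ... */

        /* As far as I can tell, only tearing the object down and doing
         * a new glfs_new()/glfs_init() cycle gives a fresh set of brick
         * connections -- which is exactly what restarting the whole KVM
         * process does. */
        glfs_fini(fs);
        return 0;
}

So if per-brick reconnection happens anywhere, it has to happen below
glfs_init(), inside the client translator, not in qemu.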

This leads me to think that the problem is inside libgfapi.

I hope this helps for diagnosing the issue.

2014-04-17 17:52 GMT+02:00 Paul Penev <ppquant at gmail.com>:
> Joe, this will be greatly appreciated.
>
> All I see in the client logs is:
>
> [2014-04-15 15:11:08.213748] W [socket.c:514:__socket_rwv]
> 0-pool-client-2: readv failed (No data available)
> [2014-04-15 15:11:08.214165] W [socket.c:514:__socket_rwv]
> 0-pool-client-3: readv failed (No data available)
> [2014-04-15 15:11:08.214596] W [socket.c:514:__socket_rwv]
> 0-pool-client-0: readv failed (No data available)
> [2014-04-15 15:11:08.214941] W [socket.c:514:__socket_rwv]
> 0-pool-client-1: readv failed (No data available)
> [2014-04-15 15:35:24.165391] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-15 15:35:24.165437] W
> [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from
> socket failed. Error (No data available), peer (127.0.0.1:24007)
> [2014-04-15 15:35:34.419719] E [socket.c:2157:socket_connect_finish]
> 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
> [2014-04-15 15:35:34.419757] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-15 15:35:37.420492] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-15 15:35:39.330948] W [glusterfsd.c:1002:cleanup_and_exit]
> (-->/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9a705460ed]
> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f9a70bf2b50] (-
> [2014-04-15 15:37:52.849982] I [glusterfsd.c:1910:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version
> 3.4.3 (/usr/sbin/glusterfs --volfile-id=pool
> --volfile-server=localhost /mnt/pve
> [2014-04-15 15:37:52.879574] I [socket.c:3480:socket_init]
> 0-glusterfs: SSL support is NOT enabled
> [2014-04-15 15:37:52.879617] I [socket.c:3495:socket_init]
> 0-glusterfs: using system polling thread
>
> Sometimes I see a lot of it:
>
> [2014-04-16 13:29:29.521516] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:32.522267] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:35.523006] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:38.523773] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:41.524456] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:44.525324] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:47.526080] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:50.526819] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:53.527617] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:56.528228] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:29:59.529023] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
> [2014-04-16 13:30:02.529772] W [socket.c:514:__socket_rwv]
> 0-glusterfs: readv failed (No data available)
>
> 2014-04-16 18:20 GMT+02:00 Joe Julian <joe at julianfamily.org>:
>> libgfapi uses the same translators as the fuse client. That means you have
>> the same client translator with the same behavior as any other client. Since
>> the client translator connects to all servers, the loss of any one server
>> without closing the tcp connection should result in the same
>> ping-timeout->continued-use as any other client. Since this isn't happening,
>> I would look to the client logs and/or network captures. There are, as you
>> know, no "primary" or "secondary" bricks. They're all equal. Failure to
>> continue using any particular server suggests to me that maybe there's some
>> problem there.
>>
>> I'll see if I can put together some sort of simulation today to test it
>> myself though.
>>
>>
>> On 4/16/2014 8:04 AM, Paul Penev wrote:
>>
>> I can easily reproduce the problem on this cluster. It appears that
>> there is a "primary" replica and a "secondary" replica.
>>
>> If I reboot or kill the glusterfs process there is no problems on the
>> running VM.
>>
>> Good. That is as expected.
>>
>> Sorry, I was not clear enough. I meant that if I reboot the
>> "secondary" replica, there are no problems.
>>
>> If I reboot or "killall -KILL glusterfsd" the primary replica (so I
>> don't let it terminate properly), I can block the VM each time.
>>
>> Have you followed my blog advice to prevent the VM from remounting the image
>> filesystem read-only and waited ping-timeout seconds (42 by default)?
>>
>> I have not followed your advice, but there is a difference: I get i/o
>> errors *reading* from the disk. Once the problem kicks in, I cannot issue
>> commands (like ls) because they can't be read.
>>
>> There is a problem with that setup: it cannot be implemented on
>> Windows machines (which are more vulnerable) and also cannot be
>> implemented on machines over which I have no control (customers).
>>
>> If I "reset" the VM it will not find the boot disk.
>>
>> Somewhat expected if within the ping-timeout.
>>
>> The issue persists beyond the ping-timeout. The KVM process needs to
>> be reinitialized. I guess libgfapi needs to reconnect from scratch.
>>
>> If I power down and power up the VM, then it will boot but will find
>> corruption on disk during the boot that requires fixing.
>>
>> Expected since the VM doesn't use the image filesystem synchronously. You
>> can change that with mount options at the cost of performance.
>>
>> Ok. I understand this point.
>>
>> Unless you wait for ping-timeout and then continue writing the replica is
>> actually still in sync. It's only out of sync if you write to one replica
>> but not the other.
>>
>> You can shorten the ping timeout. There is a cost to reconnection if you do.
>> Be sure to test a scenario with servers under production loads and see what
>> the performance degradation during a reconnect is. Balance your needs
>> appropriately.
>>
>> Could you please elaborate on the cost of reconnection? I will try to
>> run with a very short ping timeout (2sec) and see if the problem is in
>> the ping-timeout or perhaps not.
>>
>> Paul
>>
>>


