[Gluster-devel] GlusterFS AFR not failing over

gordan at bobich.net gordan at bobich.net
Wed Jun 11 14:05:58 UTC 2008


Will do if I catch it at the time. It only happened twice in the last 
month or so. I'm not even sure if the problem that causes the lock-up is 
in the client or the server, as the only way I've managed to get it going 
again in both cases was by restarting both.

Gordan
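
[Editor's note] The request quoted below (attach gdb to the hung glusterfs
and glusterfsd processes and get a backtrace) can be scripted so it is ready
to run the next time the hang occurs. The following is a minimal sketch, not
something posted in the thread. It assumes gdb and pgrep are installed and
that the caller has permission to attach to the processes. The helper names
(gdb_backtrace_command, backtrace_filename, capture_backtraces) and the output
file naming are hypothetical, chosen only for illustration.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, dumps all thread
    backtraces in batch mode, then detaches and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def backtrace_filename(name, pid):
    """Hypothetical naming scheme for the saved backtrace files."""
    return f"{name}-{pid}.bt.txt"


def find_pids(name):
    """Return the PIDs of processes whose name matches exactly (pgrep -x)."""
    out = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]


def capture_backtraces(names=("glusterfs", "glusterfsd")):
    """Save one backtrace file per running process and return the file names."""
    written = []
    for name in names:
        for pid in find_pids(name):
            result = subprocess.run(
                gdb_backtrace_command(pid), capture_output=True, text=True
            )
            path = backtrace_filename(name, pid)
            with open(path, "w") as fh:
                # Keep stderr too: gdb reports attach failures there.
                fh.write(result.stdout)
                fh.write(result.stderr)
            written.append(path)
    return written
```

Calling capture_backtraces() on both a client and a server while the hang is
in progress would produce the files to attach to a reply. Backtraces are
generally more useful when the binaries carry debug symbols.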

On Wed, 11 Jun 2008, Krishna Srinivas wrote:

> Gordan,
>
> So glusterfs/glusterfsd hang when one of the servers goes down
> and they don't recover. At the point when they hang, can you
> attach gdb to the processes and get a backtrace (bt) of each?
> (glusterfs and glusterfsd)
>
> Are you using a 1.3.* release? Non-blocking read/write fixes
> have gone into the 1.4.* release, where I think this behavior might be
> fixed. The backtrace will help.
>
> Thanks
> Krishna
>
> On Mon, Jun 9, 2008 at 7:11 PM,  <gordan at bobich.net> wrote:
>> No - this is a different problem. If the transport timeout was the problem,
>> the access should return after < 60 seconds, should it not? In the case I'm
>> seeing, something goes wrong and the only way to recover is to restart
>> glusterfsd on the server(s) _AND_ glusterfs on the clients.
>>
>> It's kind of hard to reproduce, as I only see it happening about once every
>> week or so.
>>
>> Gordan
>>
>> On Sat, 7 Jun 2008, Krishna Srinivas wrote:
>>
>>> Gordan,
>>>
>>> Is this a case of transport-timeout being set too high?
>>>
>>> Krishna
>>>
>>> On Sat, Jun 7, 2008 at 1:04 AM, Gordan Bobic <gordan at bobich.net> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have /home mounted from GlusterFS with AFR, and if one of the servers
>>>> (secondary) goes away, I cannot log in. sshd tries to read ~/.ssh and
>>>> bash tries to read ~/.bashrc and this seems to fail - or at least take a
>>>> very long time to time out and try the remaining server (which verifiably
>>>> works).
>>>>
>>>> I get this sort of thing in the logs:
>>>>
>>>> E [tcp-client.c:190:tcp_connect] home2: non-blocking connect() returned: 110 (Connection timed out)
>>>> E [client-protocol.c:4423:client_lookup_cbk] home2: no proper reply from server, returning ENOTCONN
>>>> C [client-protocol.c:212:call_bail] home2: bailing transport
>>>>
>>>> where home2 is the name of the GlusterFS export on the secondary.
>>>>
>>>> Is this a known issue or have I managed to trip another error case?
>>>>
>>>> Gordan
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at nongnu.org
>>>> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>>
>>>
>>
>
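
[Editor's note] The 9 June message argues that a transport-timeout problem
would make the access return in under 60 seconds, whereas the lock-up never
returns. A timed probe can tell the two cases apart. The following is a
minimal sketch under stated assumptions, not something from the thread: the
function name probe_access, the "ok"/"error"/"hung" labels, and the 60-second
default limit (taken from the figure quoted in the message) are all
illustrative. The stat call runs in a child process so that a blocked
filesystem call cannot stall the probe itself.

```python
import subprocess
import sys
import time


def probe_access(path, limit=60.0, poll=0.1):
    """Stat path in a child process and classify the outcome.

    Returns (status, elapsed_seconds), where status is:
      "ok"    - stat succeeded within the limit
      "error" - stat failed within the limit (e.g. ENOTCONN, ENOENT)
      "hung"  - no result before the limit expired
    """
    start = time.monotonic()
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    while True:
        rc = child.poll()
        elapsed = time.monotonic() - start
        if rc is not None:
            return ("ok" if rc == 0 else "error", elapsed)
        if elapsed >= limit:
            # Best effort only: a process blocked inside the kernel may
            # not exit until the filesystem responds, so do not wait on it.
            child.kill()
            return ("hung", elapsed)
        time.sleep(poll)
```

Running probe_access("/home") with one server down would be expected to
report "ok" or "error" after a delay if a timeout is at work, and "hung" if
the lock-up described in the thread has occurred.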




