[Gluster-devel] ping timeout

Stephan von Krawczynski skraw at ithnet.com
Thu Mar 18 16:29:38 UTC 2010


On Thu, 18 Mar 2010 10:59:41 -0400 (EDT)
Christopher Hawkins <chawkins at bplinux.com> wrote:

> Thanks Stephan. But in my testing, I see the exact opposite. The hang is painful (everything stops) but the reconnect causes no problems at all. It seems to work great (good job on 3.0!). What kind of problems is it causing for you? Maybe there is something I am missing in my test setup.

We experienced _server_ hangs that could only be cured by hard-resetting the
box. That was with glusterfs 2.X; we have not tried 3.X.
 
> You mention that stopping and restarting glusterfsd on one box works out well... That is a reconnect, as far as I can tell. There is no hang because when you shut it down, the gluster client immediately gets a connection refused and doesn't wait for the timeout period:
> [2010-03-18 10:04:46] E [socket.c:760:socket_connect_finish] master2: connection to 10.0.0.102:3302 failed (Connection refused)

Yes, of course. But the servers are healthy in this case.
 
> As opposed to the server just going away, which hangs for a while:
> [2010-03-18 10:05:44] E [client-protocol.c:415:client_ping_timer_expired] master2: Server 10.0.0.102:3302 has not responded in the last 42 seconds, disconnecting.
> 
> But when you start it up again, you should get reconnected quickly and with no problems:
> [2010-03-18 09:00:00] N [afr.c:2625:notify] mirror1: Subvolume 'master1' came back up; going online.
> [2010-03-18 09:00:00] N [client-protocol.c:6228:client_setvolume_cbk] master1: Connected to 10.0.0.101:3301, attached to remote volume 'threads2'.
>   
> Seems to me that disconnect / reconnect is only painful because ping timeout is so long... And on a high latency network, maybe you need that to avoid frequent little split brains, but on a low latency network, long ping timeouts seem to cause more problems than they fix. Or are you experiencing something that I am not? 

There is one thing that we did not think of at first either: network packet
loss. We came across the whole problem because every now and then pings just
seem to vanish. Then, after the client had kicked out the corresponding
server, the server entered a frozen state in which its local fs seemed to
hang indefinitely.
Even the best switches have some minimal amount of packet loss. If you reduce
the ping timeout to very low values, you ensure that your servers get
disconnected once a day (if you have enough data throughput). Combined with
another phenomenon, glusterfs failing to identify the latest file version,
your data may be trash within a month of runtime.
This is what we experienced over the last few months.
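
To put rough numbers on the packet-loss argument, here is a back-of-the-envelope
sketch in Python. It is not glusterfs code and does not model the real ping
timer. It assumes, purely for illustration, that every ping independently has
some small probability p_miss of not being answered within the timeout, and
that one ping is sent per interval. The function names and the p_miss values
are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def _pings_per_day(ping_interval_s: float) -> float:
    """Number of pings sent in one day at the given interval."""
    if ping_interval_s <= 0:
        raise ValueError("ping_interval_s must be positive")
    return SECONDS_PER_DAY / ping_interval_s


def expected_disconnects_per_day(p_miss: float, ping_interval_s: float) -> float:
    """Expected number of spurious disconnects per day.

    p_miss is the ASSUMED probability that a single ping goes unanswered
    within the timeout; pings are ASSUMED to be independent.
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be between 0 and 1")
    return p_miss * _pings_per_day(ping_interval_s)


def p_any_disconnect_per_day(p_miss: float, ping_interval_s: float) -> float:
    """Probability of at least one spurious disconnect within a day."""
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be between 0 and 1")
    return 1.0 - (1.0 - p_miss) ** _pings_per_day(ping_interval_s)


if __name__ == "__main__":
    # Hypothetical comparison: a long timeout that is rarely missed
    # versus a very short one that is missed more often.
    for label, p_miss, interval in [("long", 1e-6, 42.0), ("short", 1e-4, 1.0)]:
        print(label,
              expected_disconnects_per_day(p_miss, interval),
              p_any_disconnect_per_day(p_miss, interval))
```

Under these assumptions, a one-second interval with a miss probability of 1e-4
gives roughly 8.6 expected disconnects per day, so even a tiny per-ping miss
rate adds up once the timeout is short enough to be missed at all.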

--
Regards
Stephan


> 
> Christopher Hawkins
> 
> ----- "Stephan von Krawczynski" <skraw at ithnet.com> wrote:
> 
> > Hi Christopher,
> > 
> > I advise you to really try the most important part of your description,
> > the one you take for granted: the reconnect case.
> > Our experiences are quite far from what you think is the worst case. You
> > can easily check what happens if you just pull the network cable 5 times
> > in 10 minutes. We came to the conclusion that disconnect/reconnect should
> > be avoided under all circumstances. Interestingly, stopping one server's
> > glusterfsd and restarting it works out quite well in our setup. So
> > offline-updating a server (which was our main purpose) is quite ok.
> > 
> > -- 
> > Regards,
> > Stephan
> > 
> > 
> > 
> > On Thu, 18 Mar 2010 08:33:51 -0400 (EDT)
> > Christopher Hawkins <chawkins at bplinux.com> wrote:
> > 
> > > I have a question re: ping timeout for any of the devs. The minimum
> > > value is 5 and the max is 1013... But in my case, I use replicate to
> > > mirror server pairs that are each gigabit connected by crossover
> > > cables. The latency is very low. 5 seconds is a long time and
> > > personally I would like them to give up on the failed link after 500ms
> > > or so, so the mountpoint becomes available quickly to the remaining
> > > node.
> > >
> > > Or I would at least like to test it and see if it's stable that way;
> > > I don't mind getting disconnected early in the case of a slow server,
> > > because it will just reconnect when the server comes back. Is there
> > > any hope for being able to tweak this parameter? Or is there a reason
> > > why it simply cannot be lower than 5?
> > > 
> > > Thanks for any insight and for glusterfs!
> > > 
> > > Christopher Hawkins
> > > 
> > > 
> 





