[Gluster-devel] Priority based ping packet for 3.10
Raghavendra G
raghavendra at gluster.com
Wed Jan 25 05:04:28 UTC 2017
On Tue, Jan 24, 2017 at 10:39 AM, Vijay Bellur <vbellur at redhat.com> wrote:
>
>
> On Thu, Jan 19, 2017 at 8:06 AM, Jeff Darcy <jdarcy at redhat.com> wrote:
>
>> > The more relevant question would be: with TCP_KEEPALIVE and
>> > TCP_USER_TIMEOUT on sockets, do we really need a ping-pong framework
>> > in clients? We might need that in transport/rdma setups, but my
>> > question is concentrated on transport/socket. In other words, I would
>> > like to hear why we need a heart-beat mechanism in the first place.
>> > One scenario might be a healthy socket-level connection but an
>> > unhealthy brick/client (like a deadlocked one).
>>
>> This is an important case to consider. On the one hand, I think it
>> answers your question about TCP_KEEPALIVE. What we really care about is
>> whether a brick's request queue is moving. In other words, what's the
>> time since the last reply from that brick, and does that time exceed
>> some threshold?
>
>
I agree with this.
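For reference, a minimal sketch of the socket-level knobs in question,
with illustrative values (not necessarily what gluster's transport/socket
sets today). Note that these detect a dead *transport*, not a wedged
brick:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int
    set_transport_timeouts (int sock)
    {
            int          on       = 1;     /* enable keepalive probes */
            int          idle     = 20;    /* secs idle before first probe */
            int          interval = 2;     /* secs between probes */
            int          count    = 9;     /* failed probes => peer is dead */
            unsigned int utimeout = 42000; /* ms unacked data may linger */

            if (setsockopt (sock, SOL_SOCKET, SO_KEEPALIVE,
                            &on, sizeof (on)) < 0)
                    return -1;
            if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPIDLE,
                            &idle, sizeof (idle)) < 0)
                    return -1;
            if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPINTVL,
                            &interval, sizeof (interval)) < 0)
                    return -1;
            if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPCNT,
                            &count, sizeof (count)) < 0)
                    return -1;
            /* older glibc may need <linux/tcp.h> for TCP_USER_TIMEOUT */
            if (setsockopt (sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                            &utimeout, sizeof (utimeout)) < 0)
                    return -1;
            return 0;
    }

None of these options notice a brick that keeps its connection alive but
never dequeues requests, which is exactly the deadlocked-brick case above.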
>> On a busy system, we don't even need ping packets to know that. We can
>> just use responses on other requests to set/reset that timer. We only
>> need to send ping packets when our *outbound* queue has remained empty
>> for some fraction of our timeout.
>>
>
Do we need ping packets sent even when the client is not waiting for any
replies? I assume not. If there are no responses to be received and no
requests being sent to a brick, why would a client be interested in the
health of the server/brick?
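A rough sketch of that policy, with hypothetical names (this is not the
current rpc-clnt code): any reply resets the liveness clock, and a ping is
sent only when requests are outstanding but the clock has gone quiet.

    /* called from the reply path: any reply proves the brick is alive */
    static void
    note_reply (client_conn_t *conn)
    {
            conn->last_reply = now ();
    }

    /* periodic timer callback */
    static void
    ping_timer_cb (client_conn_t *conn)
    {
            if (conn->outstanding == 0)
                    return; /* nothing in flight, nothing to prove */
            if (now () - conn->last_reply < conn->ping_timeout / 2)
                    return; /* regular fops already reset the clock */
            send_ping (conn); /* queue has been quiet for too long */
    }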
>
>> However, it's important that our measurements be *end to end* and not just
>> at the transport level. This is particularly true with multiplexing,
>> where multiple bricks will share and contend on various resources. We
>> should ping *through* client and server, with separate translators above
>> and below each. This would give us a true end-to-end ping *for that
>> brick*, and also keep the code nicely modular.
>>
>
Agreed. My understanding is that the ping framework is a tool to identify
unhealthy bricks (we are interested in bricks since they are the ones that
serve fops). With that understanding, ping-pong should be end to end (to
whatever logical entity constitutes a brick). However, where in the brick
xlator stack should ping packets be answered? Should they go all the way
down to storage/posix?
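One way to picture that layout (translator names are made up; today the
ping handler lives inside protocol/server itself):

    client graph                           brick graph
    ------------                           -----------
    ... upper xlators ...                  protocol/server
    ping-sender                            pong-responder
    protocol/client  ----- network ----->  io-threads, ..., storage/posix

A pong from below protocol/server would prove that a request moved through
both protocol translators and the brick's request queue, not merely that
the socket is up; whether it should additionally touch storage/posix (and
hence the disk) is the open question.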
>
> +1 to this. Having ping and pong xlators immediately above and below the
> protocol translators would also address the problem of epoll threads
> getting blocked in gluster's xlator stacks on busy systems.
>
> Having said that, I do see value in Rafi's patch that prompted this
> thread. Would it not help to prioritize ping-pong traffic in all parts of
> the gluster stack, including the send queue on the client?
>
I have two concerns here:

1. The responsiveness of a brick to a client invariably includes the
latency of the network and of our own transport's io-queue. Wouldn't
prioritizing ping packets over normal data give us a skewed view of the
brick's responsiveness? For example, on a network with heavy traffic,
ping-pong might be succeeding while fops move very slowly. What do we
achieve with a successful ping-pong in that scenario? Conversely, does our
response to the opposite scenario, a ping-timeout followed by
disconnecting the transport, achieve anything substantially good? Maybe it
helps bring down the latency of syscalls as experienced by the
application, since our HA translators like afr and EC add the latency of
identifying a disconnect (or a successful fop) to the latency of syscalls.
As developers, many of us keep wondering what it is that we are trying to
achieve with a heart-beat mechanism.
2. Assuming we do want to prioritize ping traffic over normal traffic
(which we already do logically, since ping packets don't traverse the
entire brick xlator stack down to posix but instead short-circuit at
protocol/server), the fix under discussion is partial: we cannot
prioritize ping traffic ON the WIRE and through the tcp/ip stack. While I
don't have strong objections to it, I feel it is a partial solution and
might be inconsequential (just a hunch, no data). However, I can accept
the patch if we feel it helps.
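For context, the gist of the prioritization under discussion, in
illustrative form (field and function names are made up; this is not the
actual patch): on the client, a ping record jumps to the head of the
connection's output queue instead of the tail.

    /* assumes kernel-style list macros, e.g. libglusterfs's list.h */
    struct ioq_entry {
            struct list_head list;
            int              is_ping;
            /* ... iovecs, refcounts, etc. ... */
    };

    static void
    ioq_enqueue (struct list_head *ioq, struct ioq_entry *entry)
    {
            if (entry->is_ping)
                    /* head: flushed to the socket before queued fops */
                    list_add (&entry->list, ioq);
            else
                    /* tail: normal fops keep FIFO order */
                    list_add_tail (&entry->list, ioq);
    }

    /* note: this reorders only our own queue; once bytes reach the
     * socket buffer and the wire, ping shares the TCP stream with
     * everything else, hence the "partial" concern above */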
> Regards,
> Vijay
>
--
Raghavendra G