[Gluster-devel] Blocking client when server is down
Raghavendra G
raghavendra.hg at gmail.com
Wed Dec 31 05:49:30 UTC 2008
Hi Martin,
Please find my comments inline.
On Wed, Dec 31, 2008 at 9:03 AM, Martin Fick <mogulguy at yahoo.com> wrote:
> --- On Tue, 12/30/08, Basavanagowda Kanur <gowda at zresearch.com> wrote:
>
> > If the server is down for the transport-timeout period, the client
> > returns all the calls with a 'Transport Endpoint not connected'
> > error.
>
> Yes, this is exactly what I do not want. I want reads/writes to simply
> block when the server is down and to complete (the blocked calls) when the
> server returns. I do not want my applications to get an error, only a
> delay. Without this it is not possible to recover gracefully from a
> server/network failure.
>
> While we are at it, what unit is the timeout in: seconds or milliseconds?
>
>
> I have been trying to understand what it would take to implement this
> feature in the client protocol translator. At first thought, it seems like
> there are two main cases that would need to be dealt with, 1) requests which
> have not yet hit the wire, and fail when they attempt to, and 2) requests
> which have already hit the wire but have not been responded to by the
> server. Possibly a third more complex case 3) would be requests which hit
> the server and were responded to, but the response was never received by the
> client.
>
> The simplest case seems to be # 1: simply wait for the connection to
> reestablish itself and retry submitting the request to the wire. I hacked a
> simple implementation of this (looping in protocol_client_xfer, without
> holding the lock, until the connection is reestablished) which seems to work,
> but I have no clue if it is correct. ;) I will attach it below.
Blocking in protocol_client_xfer until the server comes back up is not always
a good idea. It may make no difference in a simple client/server setup, but
in a setup with cluster translators, say afr, glusterfs would block trying to
send requests to the server that is down, even though the request could be
served by the other server(s) that are up.
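
To illustrate the concern, here is a minimal sketch of why a fail-fast client
lets a cluster translator move on to another server. It is not glusterfs code:
child_t, child_submit and cluster_submit are hypothetical names, and the model
assumes a read-style request that any one connected child can serve.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one protocol/client child of a cluster translator. */
typedef struct {
    const char *name;
    int connected;   /* 1 if the transport is up, 0 otherwise */
    int submitted;   /* number of requests handed to this child */
} child_t;

/* Fail fast: report -ENOTCONN right away instead of waiting for a reconnect. */
static int child_submit(child_t *c)
{
    if (!c->connected)
        return -ENOTCONN;
    c->submitted++;
    return 0;
}

/* Try each child in order. Returns the index of the child that accepted the
 * request, or -ENOTCONN if every child is down. */
static int cluster_submit(child_t *children, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (child_submit(&children[i]) == 0)
            return (int)i;
    }
    return -ENOTCONN;
}
```

If child_submit looped until its server came back, cluster_submit would never
reach the second child while the first one is down.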
>
>
> For # 2, it looks like the client protocol keeps a list of outstanding
> requests in the saved_frames list. Is there any reason this list could not
> be resubmitted when the connection is reestablished instead of it being
> purged when the connection fails (apart from the problems associated with
> corner case #3)? Is all the required data still in the frame at this point
> (before protocol_client_cleanup is called)?
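
The resubmission idea for # 2 could be sketched roughly as follows. This is a
simplified model and not the actual saved_frames code: conn_t, saved_frame_t
and the conn_* helpers are hypothetical, and incrementing a counter stands in
for handing the request to the transport again. It also ignores corner case
# 3, so a request the server already executed would be executed twice.

```c
#include <stddef.h>

/* Hypothetical record of one request that is awaiting a reply. */
typedef struct saved_frame {
    int callid;
    int sent;                    /* how many times it was put on the wire */
    struct saved_frame *next;
} saved_frame_t;

/* Hypothetical per-connection state. */
typedef struct {
    saved_frame_t *head;         /* outstanding requests */
    int connected;
} conn_t;

/* Remember a request that has been submitted but not yet answered. */
static void conn_save(conn_t *c, saved_frame_t *f)
{
    f->next = c->head;
    c->head = f;
}

/* On disconnect: mark the connection down but keep the list intact
 * (the purge-on-failure behaviour would unwind every frame here). */
static void conn_on_disconnect(conn_t *c)
{
    c->connected = 0;
}

/* On reconnect: walk the list and resubmit every outstanding request.
 * Returns the number of requests that were resent. */
static int conn_on_reconnect(conn_t *c)
{
    int resent = 0;
    c->connected = 1;
    for (saved_frame_t *f = c->head; f != NULL; f = f->next) {
        f->sent++;               /* stand-in for a real transport submit */
        resent++;
    }
    return resent;
}
```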
>
> Corner case # 3 seems like it would require the server to keep track of
> responses it knows did not reach the client. If it can resend these
> responses to the client when the connection is reestablished, the client
> could process those requests without resending them.
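
One way to picture the bookkeeping for # 3 is a small server-side log of
replies, each flagged by whether it reached the client. This is only a sketch
under assumptions: reply_log_t and its helpers are hypothetical, the log is a
fixed-size array, and nothing like it is claimed to exist in the glusterfs
server.

```c
#include <stddef.h>

#define MAX_REPLIES 8

/* Hypothetical record of one reply the server produced. */
typedef struct {
    int callid;
    int result;
    int delivered;   /* 1 once the reply was written to the client */
    int used;        /* 1 if this slot holds a reply */
} reply_t;

typedef struct {
    reply_t slots[MAX_REPLIES];
} reply_log_t;

/* Record a reply. Returns 0 on success, -1 if the log is full. */
static int reply_log_add(reply_log_t *rl, int callid, int result, int delivered)
{
    for (size_t i = 0; i < MAX_REPLIES; i++) {
        if (!rl->slots[i].used) {
            rl->slots[i] = (reply_t){ callid, result, delivered, 1 };
            return 0;
        }
    }
    return -1;
}

/* On reconnect: collect the call ids of replies that never reached the
 * client, mark them delivered, and return how many were collected. */
static size_t reply_log_replay(reply_log_t *rl, int *out, size_t out_len)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_REPLIES && n < out_len; i++) {
        reply_t *r = &rl->slots[i];
        if (r->used && !r->delivered) {
            out[n++] = r->callid;
            r->delivered = 1;
        }
    }
    return n;
}
```

A client that receives a replayed reply for a call id can complete that
request locally instead of resending it.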
>
> This is my simplistic understanding of the problem. Am I overlooking
> something major that would prevent this from working? Is this something you
> would consider implementing or accepting patches for if I can get it to work
> (although it might be way beyond my abilities)? Am I way off and wasting my
> time? :(
>
> Thanks,
>
> -Martin
>
>
> --- xlators/protocol/client/src/client-protocol.c 2008-12-30 17:24:34.000000000 -0700
> +++ xlators/protocol/client/src/client-protocol.c.orig 2008-12-30 13:23:26.000000000 -0700
> @@ -388,7 +388,6 @@
> gf_hdr_common_t rsphdr = {0, };
> client_forget_t forget = {0, };
> uint8_t send_forget = 0;
> - uint8_t reconnect = 1;
>
> priv = this->private;
> trans = priv->transport;
> @@ -431,32 +430,14 @@
> hdr->req.pid = hton32 (frame->root->pid);
> }
>
> - if(type == GF_OP_TYPE_MOP_REQUEST &&
> - op == GF_MOP_SETVOLUME)
> - reconnect = 0;
> -
> - while(1) {
> - if (cprivate->connected == 0)
> - transport_connect (trans);
> -
> - if (cprivate->connected ||
> - ((type == GF_OP_TYPE_MOP_REQUEST) &&
> - (op == GF_MOP_SETVOLUME))) {
> - ret = transport_submit (trans, (char *)hdr, hdrlen,
> - vector, count, refs);
> - }
> -
> - if (!reconnect || ret >= 0 || cprivate->connected > 0)
> - break;
> + if (cprivate->connected == 0)
> + transport_connect (trans);
>
> - pthread_mutex_unlock (&cprivate->lock);
> - while (cprivate->connected <= 0) {
> - gf_log (this->name, GF_LOG_DEBUG,
> - "protocol_client_xfer waiting for connection(%i)",
> - cprivate->connected);
> - sleep(1);
> - }
> - pthread_mutex_lock (&cprivate->lock);
> + if (cprivate->connected ||
> + ((type == GF_OP_TYPE_MOP_REQUEST) &&
> + (op == GF_MOP_SETVOLUME))) {
> + ret = transport_submit (trans, (char *)hdr, hdrlen,
> + vector, count, refs);
> }
>
> if ((ret >= 0) && frame) {
--
Raghavendra G