[Gluster-devel] How does GD_SYNCOP work?

Fri Sep 12 03:51:03 UTC 2014

----- Original Message -----
> Krishnan Parthasarathi <kparthas at redhat.com> wrote:
> 
> > The scheduling of a paused task happens when the epoll thread receives a
> > POLLIN event along with the response from the remote endpoint. This
> > depends on the callback issuing a synctask_wake, which triggers the
> > resumption of the task (in one of the threads from the syncenv). In
> > summary, the callback code triggers the rescheduling of the paused task.
> 
> Right, this seems to work. I found the __wake() call at the end of
> _gd_syncop_brick_op_cbk() and it is executed.  The problem is therefore
> not there.
> 
> I tried running the test steps one by one. The offending command is
> "gluster volume heal $V0 info", so I ran it between each step.
> 
> It works at the beginning, it works if I kill 3 out of 6 bricks, and it
> hangs after I create files in the volume (with 3 out of 6 bricks down).
> 
> And at that time, the bricks that are still up show this in the logs:
> 
> [2014-09-11 17:47:31.452067] I [server.c:518:server_rpc_notify] 0-patchy-server: disconnecting connection from netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client-1-0-0
> [2014-09-11 17:47:31.452142] I [server-helpers.c:290:do_fd_cleanup] 0-patchy-server: fd cleanup on /a/a/a/a/a/a/a/a/a/a
> [2014-09-11 17:47:31.452689] I [client_t.c:417:gf_client_unref] 0-patchy-server: Shutting down connection netbsd0.cloud.gluster.org-24431-2014/09/11-17:40:47:719843-patchy-client-1-0-0
> [2014-09-11 17:47:31.455145] I [server.c:518:server_rpc_notify] 0-patchy-server: disconnecting connection from netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-1-0-0
> [2014-09-11 17:47:31.455172] I [client_t.c:417:gf_client_unref] 0-patchy-server: Shutting down connection netbsd0.cloud.gluster.org-3612-2014/09/11-17:40:28:979958-patchy-client-1-0-0
> [2014-09-11 17:47:31.455208] I [server.c:518:server_rpc_notify] 0-patchy-server: disconnecting connection from netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client-1-0-0
> [2014-09-11 17:47:31.455230] I [client_t.c:417:gf_client_unref] 0-patchy-server: Shutting down connection netbsd0.cloud.gluster.org-26218-2014/09/11-17:40:28:900316-patchy-client-1-0-0
> 
> If I understood correctly, gluster volume heal info causes glusterd to
> send requests to bricks that are alive. If they go offline at that time
> it may explain why the command hangs. What is the correct behavior here?

How were the bricks brought down? 

If glusterd doesn't get a POLLERR for each brick that went down, then the
ping timer mechanism in glusterd should kick in and 'abort' the RPC.
If that didn't fire for some reason (which could be a bug), the frame-timeout
for glusterd RPCs should kick in and the frame corresponding to the RPC
should 'bail'. If you leave the hung setup alone for over ten minutes from
the time the bricks went down, you should see logs corresponding to one of
these two mechanisms in action. Let me know if you don't; then we need to
investigate further.
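
To make the fallback chain concrete, here is a rough Python sketch. It is
not the glusterd code, only a toy model of a paused task that can be resumed
by the reply callback, by a ping timer, or by a frame timeout. The names
PausedTask and run_rpc, the outcome strings and the (very short) timeout
values are all made up for the example; only synctask_wake, the ping timer
and the frame-timeout come from the discussion above.

```python
import threading


class PausedTask:
    """Stand-in for a synctask that has yielded while waiting for a brick."""

    def __init__(self):
        self._lock = threading.Lock()
        self._woken = threading.Event()
        self.outcome = None

    def wake(self, outcome):
        # Plays the role of synctask_wake(): the first caller decides the
        # outcome, later wake-ups are ignored.
        with self._lock:
            if self.outcome is None:
                self.outcome = outcome
                self._woken.set()

    def yield_until_woken(self):
        # The task stays paused until somebody calls wake().
        self._woken.wait()
        return self.outcome


def run_rpc(reply_after=None, ping_timer_armed=True,
            ping_timeout=0.2, frame_timeout=0.5):
    """Send one pretend RPC and report what resumed the paused task."""
    task = PausedTask()
    timers = []
    if reply_after is not None:
        # Normal path: the callback runs when the reply arrives.
        timers.append(threading.Timer(reply_after, task.wake, args=("reply",)))
    if ping_timer_armed:
        # First fallback: the ping timer aborts the RPC.
        timers.append(
            threading.Timer(ping_timeout, task.wake, args=("ping-abort",)))
    # Last resort: the frame timeout bails the frame out.
    timers.append(
        threading.Timer(frame_timeout, task.wake, args=("frame-bail",)))
    for t in timers:
        t.daemon = True
        t.start()
    outcome = task.yield_until_woken()
    for t in timers:
        t.cancel()
    return outcome


if __name__ == "__main__":
    print(run_rpc(reply_after=0.01))        # brick answers in time
    print(run_rpc())                        # brick silent, ping timer armed
    print(run_rpc(ping_timer_armed=False))  # ping timer missing
```

In this model, if none of the three paths ever calls wake(), the task stays
in yield_until_woken() forever. That would be one way to end up with a hang
like the one you describe, which is why the logs from the two timers matter.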

~KP