[Gluster-devel] How does GD_SYNCOP work?

Fri Sep 12 04:31:55 UTC 2014

Krishnan Parthasarathi <kparthas at redhat.com> wrote:

> If you left the hung setup for over ten minutes from the time the bricks
> went down, you should see logs corresponding to one of the above two
> mechanisms in action. Let me know if you don't. Then we need to
> investigate further.

Yes, it does die miserably after 10 minutes :-)

I added some debug printfs to see what was hanging. Here is the path
glusterd takes when it receives a gluster volume heal info request:

gd_brick_op_phase
  glusterd_volinfo_find 
    glusterd_bricks_select_heal_volume -> rxlator_count = 3
  glusterd_syncop_aggr_rsp_dict
  list_for_each_entry (pending_node, &selected, list) {
    First in list is rpc->conn.name = "management"
    gd_syncop_mgmt_brick_op
       glusterd_brick_op_build_payload
       GD_SYNCOP -> never resumes
  } 

I am fine with glusterd_bricks_select_heal_volume() finding 3 bricks:
they are the 3 bricks that are still alive. However, I am surprised to
see that the first entry in the list has rpc->conn.name = "management".
Should it not be a brick name here? Or is this glustershd?

The logs give a hint about GD_SYNCOP not returning:

[2014-09-12 04:19:35.266126] I [socket.c:3277:socket_submit_reply]
0-socket.management: not connected (priv->connected = -1)
[2014-09-12 04:19:35.266139] E [rpcsvc.c:1249:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x1, Program: GlusterD svc
cli, ProgVers: 2, Proc: 31) to rpc-transport (socket.management)

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org