[Gluster-devel] SSL enabled glusterd crash

Thu Aug 6 09:59:36 UTC 2015

On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable:
peers and bricks may or may not show up in gluster status output.
Results can differ from one peer to another and are not even
symmetrical: a peer may see the bricks of another, but not the other
way around.

After experimenting a bit, I managed to get a real crash when restarting
glusterd on all peers. Three of them crash here:

Program terminated with signal 11, Segmentation fault.
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
409                             gf_timer_call_cancel (clnt->ctx,
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
#1  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 
                                 0xba9fffd8) at timer.c:194
(gdb) list
404                     if (!trans) {
405                             pthread_mutex_unlock (&conn->lock);
406                             return;
407                     }
408                     if (conn->reconnect)
409                             gf_timer_call_cancel (clnt->ctx,
410                                                   conn->reconnect);
411                     conn->reconnect = 0;
412     
413                     if ((conn->connected == 0) && !clnt->disabled) {
(gdb) print clnt
$1 = (struct rpc_clnt *) 0x39bb
(gdb) print conn
$2 = (rpc_clnt_connection_t *) 0xb9ce5150
(gdb) print conn->lock
$3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = "0Q\316", 
      ptm_interlock = 185 '\271', ptm_pad2 = "\336\300\255", 
      ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200, 
      ptm_spare2 = 0xce513000}

ptm_magic is wrong. NetBSD libpthread sets it to 0x33330003 when the
mutex is created and to 0xDEAD0003 when it is destroyed. This means we
either have memory corruption, or the mutex was never initialized.
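
The magic check can be sketched as below. This is only an illustration:
it assumes nothing beyond the two magic values quoted in this message,
and the helper name classify_ptm_magic is hypothetical, not part of
libpthread or GlusterFS.

```c
/* Magic values for NetBSD libpthread mutexes, as quoted in this message. */
#define PTM_MAGIC_LIVE 0x33330003u  /* set when the mutex is created */
#define PTM_MAGIC_DEAD 0xDEAD0003u  /* set when the mutex is destroyed */

/*
 * Hypothetical helper: classify a ptm_magic value read from a core.
 * Anything other than the two known values means the memory does not
 * hold a valid mutex (corrupted, or never initialized).
 */
static const char *
classify_ptm_magic (unsigned int magic)
{
        if (magic == PTM_MAGIC_LIVE)
                return "initialized";
        if (magic == PTM_MAGIC_DEAD)
                return "destroyed";
        return "corrupted or never initialized";
}
```

The value seen in the core above, ptm_magic = 51200, matches neither
constant, so it falls into the last case.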

The last one crashes somewhere else:

Program terminated with signal 11, Segmentation fault.
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
241             if (!ctx->timer) {
(gdb) bt
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
#1  0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24) 
    at timer.c:121
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
#3  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at 
                                 address 0xba9fffd8) at timer.c:194
(gdb) print ctx
$1 = (glusterfs_ctx_t *) 0x80
(gdb) frame 2
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
409                             gf_timer_call_cancel (clnt->ctx,
(gdb) print clnt
$2 = (struct rpc_clnt *) 0xb9dffd94
(gdb) print clnt->lock.ptm_magic
$3 = 1

Here again, the mutex is either corrupted or was never initialized.

I kept the cores in case further investigation is needed.

-- 
Emmanuel Dreyfus
manu at netbsd.org