[Gluster-devel] SSL enabled glusterd crash
Raghavendra Gowdappa
rgowdapp at redhat.com
Thu Aug 6 12:13:02 UTC 2015
There is a race between gf_timer_call_cancel() and the firing of the timer, which is addressed by [1]. Could this be the cause? Also note that [1] by itself is not sufficient: callers of gf_timer_call_cancel() should check its return value, and when it returns -1 they should not free the opaque pointer they passed to gf_timer_call_after() during timer registration. Note that [1] is not in 3.7.3.
[1] http://review.gluster.org/6459
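
For illustration, a minimal sketch of that caller-side contract, assuming
[1]'s semantics (gf_timer_call_cancel() returns -1 once the callback has
already fired or is running). The timer API follows libglusterfs/src/timer.h;
my_data_t and the two helpers are hypothetical:

#include "timer.h"      /* gf_timer_call_after(), gf_timer_call_cancel() */
#include "mem-pool.h"   /* GF_CALLOC(), GF_FREE() */
#include "mem-types.h"  /* gf_common_mt_char */

typedef struct {
        gf_timer_t *event;   /* handle returned by gf_timer_call_after() */
        /* ... payload ... */
} my_data_t;

static void
my_timer_cbk (void *data)
{
        my_data_t *d = data;

        /* ... do the timed work ... */
        GF_FREE (d);         /* once fired, the callback owns 'data' */
}

static my_data_t *
start_my_timer (glusterfs_ctx_t *ctx)
{
        struct timespec delta = {3, 0};   /* fire after 3 seconds */
        my_data_t *d = GF_CALLOC (1, sizeof (*d), gf_common_mt_char);

        if (!d)
                return NULL;
        d->event = gf_timer_call_after (ctx, delta, my_timer_cbk, d);
        return d;
}

static void
stop_my_timer (glusterfs_ctx_t *ctx, my_data_t *d)
{
        if (gf_timer_call_cancel (ctx, d->event) == 0) {
                GF_FREE (d);  /* cancelled in time: we still own 'data' */
        } else {
                /* -1: the timer already fired (or is firing) and the
                 * callback will free 'data'; freeing it here as well is
                 * exactly the double free / use-after-free that shows up
                 * as the corrupted pointers in the traces below. */
        }
}
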
----- Original Message -----
> From: "Emmanuel Dreyfus" <manu at netbsd.org>
> To: gluster-devel at gluster.org
> Sent: Thursday, August 6, 2015 3:29:36 PM
> Subject: [Gluster-devel] SSL enabled glusterd crash
>
> On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable:
> peers and bricks may or may not show up in gluster status output.
> The results can differ from one peer to another, and are not even
> symmetrical: one peer sees the bricks of another, but not the other
> way around.
>
> After playing a bit, I managed to get a real crash when restarting
> glusterd on all peers. Three of them crash here:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> 409 gf_timer_call_cancel (clnt->ctx,
> #0 0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> #1 0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address
> 0xba9fffd8) at timer.c:194
> (gdb) list
> 404 if (!trans) {
> 405 pthread_mutex_unlock (&conn->lock);
> 406 return;
> 407 }
> 408 if (conn->reconnect)
> 409 gf_timer_call_cancel (clnt->ctx,
> 410 conn->reconnect);
> 411 conn->reconnect = 0;
> 412
> 413 if ((conn->connected == 0) && !clnt->disabled) {
> (gdb) print clnt
> $1 = (struct rpc_clnt *) 0x39bb
> (gdb) print conn
> $2 = (rpc_clnt_connection_t *) 0xb9ce5150
> (gdb) print conn->lock
> $3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = "0Q\316",
> ptm_interlock = 185 '\271', ptm_pad2 = "\336\300\255",
> ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200,
> ptm_spare2 = 0xce513000}
>
> ptm_magic is wrong. NetBSD libpthread sets it to 0x33330003 on creation
> and to 0xDEAD0003 on destruction. This means we either have memory
> corruption, or the mutex was never initialized.
>
> The last one crashes somewhere else:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> 241 if (!ctx->timer) {
> (gdb) bt
> #0 0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> #1 0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24)
> at timer.c:121
> #2 0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> #3 0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at
> address 0xba9fffd8) at timer.c:194
> (gdb) print ctx
> $1 = (glusterfs_ctx_t *) 0x80
> (gdb) frame 2
> #2 0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> 409 gf_timer_call_cancel (clnt->ctx,
> (gdb) print clnt
> $2 = (struct rpc_clnt *) 0xb9dffd94
> (gdb) print clnt->lock.ptm_magic
> $3 = 1
>
> Here again, the lock is either corrupted or was never initialized.
>
>
> I kept the cores for further investigation if this is needed.
>
> --
> Emmanuel Dreyfus
> manu at netbsd.org