[Gluster-devel] SSL enabled glusterd crash
Raghavendra Gowdappa
rgowdapp at redhat.com
Thu Aug 6 12:13:02 UTC 2015
There is a race between gf_timer_call_cancel() and the firing of the timer, which is addressed by [1]. Could this be the cause? Also note that [1] by itself is not sufficient: callers of gf_timer_call_cancel() should check its return value, and when it returns -1 they should not free the opaque pointer they passed to gf_timer_call_after() during timer registration. Note that [1] is not in 3.7.3.
[1] http://review.gluster.org/6459
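
For illustration, a minimal sketch of that caller-side contract, assuming
[1]'s semantics (gf_timer_call_cancel() returns -1 once the callback has
already fired or is running). The timer API follows libglusterfs/src/timer.h;
my_data_t and the two helpers are hypothetical:

#include "timer.h"      /* gf_timer_call_after(), gf_timer_call_cancel() */
#include "mem-pool.h"   /* GF_CALLOC(), GF_FREE() */
#include "mem-types.h"  /* gf_common_mt_char */

typedef struct {
        gf_timer_t *event;   /* handle returned by gf_timer_call_after() */
        /* ... payload ... */
} my_data_t;

static void
my_timer_cbk (void *data)
{
        my_data_t *d = data;

        /* ... do the timed work ... */
        GF_FREE (d);         /* once fired, the callback owns 'data' */
}

static my_data_t *
start_my_timer (glusterfs_ctx_t *ctx)
{
        struct timespec delta = {3, 0};   /* fire after 3 seconds */
        my_data_t *d = GF_CALLOC (1, sizeof (*d), gf_common_mt_char);

        if (!d)
                return NULL;
        d->event = gf_timer_call_after (ctx, delta, my_timer_cbk, d);
        return d;
}

static void
stop_my_timer (glusterfs_ctx_t *ctx, my_data_t *d)
{
        if (gf_timer_call_cancel (ctx, d->event) == 0) {
                GF_FREE (d);  /* cancelled in time: we still own 'data' */
        } else {
                /* -1: the timer already fired (or is firing) and the
                 * callback will free 'data'; freeing it here as well is
                 * exactly the double free / use-after-free that shows up
                 * as the corrupted pointers in the traces below. */
        }
}
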
----- Original Message -----
> From: "Emmanuel Dreyfus" <manu at netbsd.org>
> To: gluster-devel at gluster.org
> Sent: Thursday, August 6, 2015 3:29:36 PM
> Subject: [Gluster-devel] SSL enabled glusterd crash
>
> On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable:
> peers and bricks may or may not show up in gluster status output.
> The results can differ from one peer to another, and are not even
> symmetrical: one peer sees the bricks of another, but not the other
> way around.
>
> After playing a bit, I managed to get a real crash when restarting
> glusterd on all peers. Three of them crash here:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> 409 gf_timer_call_cancel (clnt->ctx,
> #0 0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> #1 0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address
> 0xba9fffd8) at timer.c:194
> (gdb) list
> 404 if (!trans) {
> 405 pthread_mutex_unlock (&conn->lock);
> 406 return;
> 407 }
> 408 if (conn->reconnect)
> 409 gf_timer_call_cancel (clnt->ctx,
> 410 conn->reconnect);
> 411 conn->reconnect = 0;
> 412
> 413 if ((conn->connected == 0) && !clnt->disabled) {
> (gdb) print clnt
> $1 = (struct rpc_clnt *) 0x39bb
> (gdb) print conn
> $2 = (rpc_clnt_connection_t *) 0xb9ce5150
> (gdb) print conn->lock
> $3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = "0Q\316",
> ptm_interlock = 185 '\271', ptm_pad2 = "\336\300\255",
> ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200,
> ptm_spare2 = 0xce513000}
>
> ptm_magic is wrong. NetBSD libpthread sets it to 0x33330003 on creation
> and to 0xDEAD0003 on destruction. This means we either have memory
> corruption, or the mutex was never initialized.
>
> The last one crashes somewhere else:
>
> Program terminated with signal 11, Segmentation fault.
> #0 0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> 241 if (!ctx->timer) {
> (gdb) bt
> #0 0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> #1 0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24)
> at timer.c:121
> #2 0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> #3 0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at
> address 0xba9fffd8) at timer.c:194
> (gdb) print ctx
> $1 = (glusterfs_ctx_t *) 0x80
> (gdb) frame 2
> #2 0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> 409 gf_timer_call_cancel (clnt->ctx,
> (gdb) print clnt
> $2 = (struct rpc_clnt *) 0xb9dffd94
> (gdb) print clnt->lock.ptm_magic
> $3 = 1
>
> Here again, the lock is either corrupted or was never initialized.
>
>
> I kept the cores for further investigation if this is needed.
>
> --
> Emmanuel Dreyfus
> manu at netbsd.org