[Bugs] [Bug 1486134] glusterfsd (brick) process crashed

bugzilla at redhat.com bugzilla at redhat.com
Tue Aug 29 07:21:16 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1486134



--- Comment #3 from Raghavendra G <rgowdapp at redhat.com> ---
Some observations by reading code:

* A slot won't be deallocated, and hence the fd associated with it won't change,
as long as its refcount is non-zero. Since we increment the refcount by one
before calling the handler, the slot's fd won't change until the handler returns.
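
To illustrate the refcount argument above, here is a minimal single-threaded
sketch. The names (struct slot, slot_ref, slot_unref, dispatch) are
hypothetical stand-ins and not the actual event-pool code, and locking is
omitted on purpose:

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for an event-pool slot. */
struct slot {
    int fd;  /* fd registered in this slot, -1 once the slot is free */
    int ref; /* number of outstanding references */
};

static void slot_ref(struct slot *s)
{
    s->ref++;
}

/* Drop one reference; the slot is recycled only when the count hits zero. */
static void slot_unref(struct slot *s)
{
    assert(s->ref > 0);
    if (--s->ref == 0)
        s->fd = -1; /* deallocated: the slot may now be reused */
}

/* A handler that unregisters its own slot by dropping the registration ref. */
static void unregistering_handler(struct slot *s)
{
    slot_unref(s);
}

/*
 * Dispatch holds an extra reference around the handler call, so the fd
 * cannot change before the handler returns. Returns the fd as observed
 * right after the handler finished.
 */
static int dispatch(struct slot *s, void (*handler)(struct slot *))
{
    slot_ref(s);
    handler(s);
    int fd_after_handler = s->fd;
    slot_unref(s);
    return fd_after_handler;
}
```

With a slot that starts with fd 7 and one registration reference, dispatch()
still observes fd 7 after the handler has unregistered, and the slot is freed
only by the final slot_unref.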

* From __socket_reset,

        event_unregister_close (this->ctx->event_pool, priv->sock, priv->idx);

        priv->sock = -1;
        priv->idx = -1;
        priv->connected = -1;

  As can be seen above, priv->sock is passed as an argument to
event_unregister_close and is set to -1 only after that. So, I guess we got more
than one POLLERR event on the socket. The first event set priv->sock to -1, and
the second event then passed -1 as the fd to event_unregister_close, which
crashed because slot->fd != -1.
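
The double-reset hypothesis can be sketched as follows. Everything prefixed
with fake_ is a hypothetical stand-in; in particular, the fd comparison in
fake_unregister_close is only an assumption about what the failing assertion
checks, and the slot keeps its fd because references to it are assumed to be
still held:

```c
#include <assert.h>
#include <stdbool.h>

struct fake_slot { int fd; }; /* hypothetical event-pool slot */
struct fake_priv { int sock; int idx; int connected; };

/*
 * Returns false where the real code would presumably hit its assertion:
 * the fd passed in does not match the fd recorded in the slot.
 */
static bool fake_unregister_close(struct fake_slot *slot, int fd)
{
    return slot->fd == fd;
}

/* Mirrors the order of operations in the __socket_reset excerpt above. */
static bool fake_socket_reset(struct fake_slot *slot, struct fake_priv *priv)
{
    bool ok = fake_unregister_close(slot, priv->sock);

    priv->sock = -1;
    priv->idx = -1;
    priv->connected = -1;
    return ok;
}
```

Calling fake_socket_reset twice on the same slot and priv passes the check the
first time and fails it the second time, since the second call passes -1.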

Since socket_event_handler logs at DEBUG log level, the logs cannot help me
confirm this hypothesis.

As to parallel POLLERR events: we register the socket with EPOLLONESHOT, which
means it has to be explicitly added back through epoll_ctl to receive more
events. Normally we do this once the handler completes processing of the current
event. But event_select_on_epoll is one asynchronous codepath where the socket
can be added back for polling while an event on the same socket is being
processed. event_select_on_epoll does check whether an event is being
processed, in the form of slot->in_handler. But this check is not sufficient to
prevent parallel events, as slot->in_handler is not incremented atomically with
respect to reception of the event. This means the following hypothetical
sequence of events can happen:

1. epoll_wait returns with a POLLERR - say POLLERR1 - on a socket (sock1)
associated with slot s1. socket_event_handle_pollerr is yet to be invoked.
2. event_select_on, called from __socket_ioq_churn in the request/reply/msg
submission codepath (as opposed to __socket_ioq_churn called as part of POLLOUT
handling - we cannot receive a POLLOUT due to EPOLLONESHOT), adds sock1 back
for polling.
3. Since sock1 was added back for polling in step 2 and our polling is
level-triggered, another thread picks up another POLLERR event - say POLLERR2.
socket_event_handler is invoked as part of processing POLLERR2 and it completes
execution, setting priv->sock to -1.
4. event_unregister_epoll, called as part of __socket_reset due to POLLERR1,
receives fd as -1, resulting in the assert failure.
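
The re-arm behaviour that this sequence relies on can be observed in isolation
with plain epoll on Linux. This is a minimal sketch, not GlusterFS code: it
uses a socketpair and a pending EPOLLIN instead of POLLERR for simplicity, and
poll_once and oneshot_demo are hypothetical helper names:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Number of events reported by a non-blocking (zero-timeout) epoll_wait. */
static int poll_once(int epfd)
{
    struct epoll_event ev;
    return epoll_wait(epfd, &ev, 1, 0);
}

/*
 * counts[0]: first wait, counts[1]: wait without re-arming,
 * counts[2]: wait after re-arming with EPOLL_CTL_MOD.
 * Returns 0 on success, -1 if setup failed.
 */
static int oneshot_demo(int counts[3])
{
    int sv[2];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_sockets;

    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto out_epoll;

    /* Make sv[0] readable. The byte is never read, so the condition persists. */
    if (write(sv[1], "x", 1) != 1)
        goto out_epoll;

    counts[0] = poll_once(epfd); /* event delivered, fd is now disarmed */
    counts[1] = poll_once(epfd); /* nothing: EPOLLONESHOT disabled the fd */

    /* Re-arm while the condition is still pending, as in step 2. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, sv[0], &ev) < 0)
        goto out_epoll;

    counts[2] = poll_once(epfd); /* same condition is reported again */
    rc = 0;

out_epoll:
    close(epfd);
out_sockets:
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

If the re-arm happens before the first event's handler has run, a second
thread blocked in epoll_wait can receive the same condition again, which is
the window described in steps 2 and 3.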

Also, since the first POLLERR event has done rpc_transport_unref, subsequent
parallel events (not just POLLERR, but other events too) could be acting on a
freed transport.


