[Bugs] [Bug 1449169] Multiple bricks WILL crash after TCP port probing

bugzilla at redhat.com bugzilla at redhat.com
Thu May 11 05:51:11 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1449169



--- Comment #3 from Worker Ant <bugzilla-bot at gluster.org> ---
COMMIT: https://review.gluster.org/17217 committed in release-3.10 by
Raghavendra Talur (rtalur at redhat.com) 
------
commit eb7597b1a20c04a7fd017f7b0f620a5d11eb2769
Author: Milind Changire <mchangir at redhat.com>
Date:   Tue May 9 17:02:27 2017 +0530

    rpc: fix transport add/remove race on port probing

    Problem:
    Spurious __gf_free() assertion failures are seen in many places, with
    header->magic being overwritten, when running port probing tests with
    'nmap'.

    Solution:
    Fix sequence of:
    1. add accept()ed socket connection fd to epoll set
    2. add newly created rpc_transport_t object in RPCSVC service list

    Correct sequence is #2 followed by #1.

    Reason:
    Adding new fd returned by accept() to epoll set causes an epoll_wait()
    to return immediately with a POLLIN event. This races ahead to a readv()
    which returns with errno:104 (Connection reset by peer) during port
    probing using 'nmap'. The error is then handled by POLLERR code to
    remove the new transport object from RPCSVC service list and later
    unref and destroy the rpc transport object.
    socket_server_event_handler() then catches up with registering the
    unref'd/destroyed rpc transport object. This later manifests as
    assertion failures in __gf_free() with the header->magic field botched
    due to invalid address references.
    All this does not result in a Segmentation Fault since the address
    space continues to be mapped into the process and the pages are still
    being referenced elsewhere.
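
The ordering problem described above can be illustrated with a small
single-threaded model. This is only a sketch, not the GlusterFS code: the
type demo_transport_t and all function names are hypothetical, the
"destroyed" flag stands in for the real free(), and the poller thread's
error path is simply called inline to represent the worst-case
interleaving.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for rpc_transport_t; not the GlusterFS type. */
typedef struct {
    int  refcount;
    bool in_service_list;
    bool destroyed; /* set instead of free() so the demo can inspect it */
} demo_transport_t;

static void demo_unref(demo_transport_t *t)
{
    if (--t->refcount == 0)
        t->destroyed = true;
}

/* Error path run by the poller thread on "Connection reset by peer". */
static void demo_on_reset(demo_transport_t *t)
{
    t->in_service_list = false; /* remove from the service list */
    demo_unref(t);              /* drop the last reference */
}

/* Returns true if a destroyed transport ends up in the service list. */
static bool stale_entry_after_accept(bool list_before_epoll)
{
    demo_transport_t t = { .refcount = 1 };

    if (list_before_epoll)
        t.in_service_list = true; /* step 2 first (fixed order) */

    /* Step 1: the fd joins the epoll set. Assume the worst case, where
     * the poller thread handles the reset before this function resumes. */
    demo_on_reset(&t);

    if (!list_before_epoll)
        t.in_service_list = true; /* step 2 last (buggy order) */

    return t.destroyed && t.in_service_list;
}
```

In this model, stale_entry_after_accept(false) reports a stale entry while
stale_entry_after_accept(true) does not, which matches the reasoning for
doing #2 before #1.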

    As a further note:
    This race happens only in accept() codepath. Only in this codepath,
    the notify will be referring to two transports:
    1. listener transport and
    2. newly accepted transport
    All other notifications refer to only one transport, i.e., the
    transport/socket on which the event is received. Since epoll is
    ONE_SHOT, another event won't arrive on the same socket till the
    current event is processed.
    However, in the accept() codepath, the current event - ACCEPT - and the
    new event - POLLIN/POLLERR - arrive on two different sockets:
    1. ACCEPT on listener socket and
    2. POLLIN/POLLERR on newly registered socket.
    Also, note that these two events are handled in different thread contexts.
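
The one-shot behaviour mentioned in the note can be observed with plain
Linux epoll. The sketch below uses pipes rather than sockets and is not
taken from the GlusterFS event code; the names oneshot_demo,
oneshot_counts and oneshot_second_fd_matched are made up for
illustration. It shows that a disarmed fd stays silent until it is
re-armed, while a second, newly registered fd can still deliver an event.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Ready-event counts seen at each step; filled by oneshot_demo(). */
static int oneshot_counts[4];
static int oneshot_second_fd_matched;

static int oneshot_demo(void)
{
    int a[2], b[2], epfd = -1, rc = -1;
    struct epoll_event ev, out[2];

    if (pipe(a) != 0)
        return -1;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto done;

    /* Make "a" readable, then register it as one-shot. */
    if (write(a[1], "x", 1) != 1)
        goto done;
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = a[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, a[0], &ev) != 0)
        goto done;

    oneshot_counts[0] = epoll_wait(epfd, out, 2, 0); /* delivered once */
    oneshot_counts[1] = epoll_wait(epfd, out, 2, 0); /* disarmed, data unread */

    /* A different fd is not affected by the first one being disarmed. */
    ev.data.fd = b[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, b[0], &ev) != 0)
        goto done;
    if (write(b[1], "y", 1) != 1)
        goto done;
    oneshot_counts[2] = epoll_wait(epfd, out, 2, 0);
    oneshot_second_fd_matched =
        (oneshot_counts[2] == 1 && out[0].data.fd == b[0]);

    /* Re-arming "a" (event processed) makes it eligible again. */
    ev.data.fd = a[0];
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, a[0], &ev) != 0)
        goto done;
    oneshot_counts[3] = epoll_wait(epfd, out, 2, 0);
    rc = 0;

done:
    if (epfd >= 0)
        close(epfd);
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return rc;
}
```

In the accept() codepath the listener plays the role of "a" and the newly
accepted socket the role of "b": one-shot mode serializes events per fd,
not across fds.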

    Cleanup:
    Critical section in socket_server_event_handler() has been removed.
    Instead, an additional ref on new_trans has been used to avoid ref/unref
    race when notifying RPCSVC.
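
The effect of the additional ref can be modelled roughly as follows. This
is a hedged, single-threaded sketch with hypothetical names (demo_xprt_t,
alive_at_notify); it does not reproduce the actual
socket_server_event_handler() code, and the concurrent unref from the
error path is called inline to stand for the worst-case interleaving.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the newly accepted transport (new_trans). */
typedef struct {
    int  refcount;
    bool destroyed; /* set instead of free() so the demo can inspect it */
} demo_xprt_t;

static void demo_xprt_ref(demo_xprt_t *t)
{
    t->refcount++;
}

static void demo_xprt_unref(demo_xprt_t *t)
{
    if (--t->refcount == 0)
        t->destroyed = true;
}

/* Returns true if the transport is still alive when the notification
 * would touch it. */
static bool alive_at_notify(bool take_extra_ref)
{
    demo_xprt_t t = { .refcount = 1, .destroyed = false };
    bool alive;

    if (take_extra_ref)
        demo_xprt_ref(&t); /* handler holds its own reference */

    /* Worst case: the error path on another thread drops the original
     * reference before the handler gets to notify. */
    demo_xprt_unref(&t);

    alive = !t.destroyed; /* the notify would use the object here */

    if (take_extra_ref)
        demo_xprt_unref(&t); /* handler is done; object may now go away */

    return alive;
}
```

With the extra reference held, the object outlives the notification even
if the error path runs first; without it, the notification would see a
destroyed object.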

    mainline:
    > BUG: 1438966
    > Signed-off-by: Milind Changire <mchangir at redhat.com>
    > Reviewed-on: https://review.gluster.org/17139
    > Smoke: Gluster Build System <jenkins at build.gluster.org>
    > NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    > CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    > Reviewed-by: Amar Tumballi <amarts at redhat.com>
    > Reviewed-by: Oleksandr Natalenko <oleksandr at natalenko.name>
    > Reviewed-by: Jeff Darcy <jeff at pl.atyp.us>
    (cherry picked from commit 4f7ef3020edcc75cdeb22d8da8a1484f9db77ac9)

    Change-Id: I4417924bc9e6277d24bd1a1c5bcb7445bcb226a3
    BUG: 1449169
    Signed-off-by: Milind Changire <mchangir at redhat.com>
    Reviewed-on: https://review.gluster.org/17217
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>


