[Bugs] [Bug 1763036] glusterfsd crashed with "'MemoryError' Cannot access memory at address"

bugzilla at redhat.com bugzilla at redhat.com
Fri Oct 18 06:01:22 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1763036



--- Comment #1 from Mohit Agrawal <moagrawa at redhat.com> ---
Hi,

It seems the brick process crashed because the function event_slot_alloc
did not return a valid slot.

bt
#0  0x00007f9efaed5207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f9efaed68f8 in __GI_abort () at abort.c:90
#2  0x00007f9efaece026 in __assert_fail_base (fmt=0x7f9efb028ea0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f9efc94507a "slot->fd == fd", file=file@entry=0x7f9efc945054 "event-epoll.c",
    line=line@entry=417, function=function@entry=0x7f9efc945420 <__PRETTY_FUNCTION__.11118> "event_register_epoll")
    at assert.c:92
#3  0x00007f9efaece0d2 in __GI___assert_fail (assertion=assertion@entry=0x7f9efc94507a "slot->fd == fd",
    file=file@entry=0x7f9efc945054 "event-epoll.c", line=line@entry=417,
    function=function@entry=0x7f9efc945420 <__PRETTY_FUNCTION__.11118> "event_register_epoll") at assert.c:101
#4  0x00007f9efc8f7d04 in event_register_epoll (event_pool=0x563fac588150, fd=<optimized out>, handler=<optimized out>,
    data=<optimized out>, poll_in=<optimized out>, poll_out=<optimized out>, notify_poller_death=0 '\000')
    at event-epoll.c:417
#5  0x00007f9ef798ceb2 in socket_server_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>,
    data=0x7f9ee80403f0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000')
    at socket.c:2950
#6  0x00007f9efc8f8870 in event_dispatch_epoll_handler (event=0x7f98fffc9e70, event_pool=0x563fac588150) at event-epoll.c:643
#7  event_dispatch_epoll_worker (data=0x7f992c8ea110) at event-epoll.c:759
#8  0x00007f9efb6d5dd5 in start_thread (arg=0x7f98fffca700) at pthread_create.c:307
#9  0x00007f9efaf9cead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 4
#4  0x00007f9efc8f7d04 in event_register_epoll (event_pool=0x563fac588150, fd=<optimized out>, handler=<optimized out>,
    data=<optimized out>, poll_in=<optimized out>, poll_out=<optimized out>, notify_poller_death=0 '\000')
    at event-epoll.c:417
417             assert (slot->fd == fd);
(gdb) p slot
$3184 = (struct event_slot_epoll *) 0x7f9e247f81b0
(gdb) p *slot
$3185 = {fd = -1, events = 1073741851, gen = 216, idx = 0, ref = 1, do_close = 1, in_handler = 0, handled_error = 0,
  data = 0x7f9e3c050be0, handler = 0x7f9ef7989980 <socket_event_handler>, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
          __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, poller_death = {next = 0x7f9e247f8208,
    prev = 0x7f9e247f8208}}

p event_pool->slots_used
$3188 = {1021, 10, 0 <repeats 1022 times>}
f 4
After printing the complete fd table, I was not able to find any entry for
socket 1340:
set $tmp = 1
while ($tmp != 1023)
print ((struct event_slot_epoll *)event_pool->ereg[0])[$tmp].fd
set $tmp = $tmp + 1
end
f 5
p new_sock
1340


In the current code, event_slot_alloc first checks the value of slots_used
to find a table with a free entry. The backtrace shows that slots_used is
1021 (less than 1024), which means the table should still have some free
slots, so the function selects the table at this registry index (0).

>>>>>>>>>>>>>>>>>>>>>

    for (i = 0; i < EVENT_EPOLL_TABLES; i++) {
        switch (event_pool->slots_used[i]) {
            case EVENT_EPOLL_SLOTS:
                continue;
            case 0:
                if (!event_pool->ereg[i]) {
                    table = __event_newtable(event_pool, i);
                    if (!table)
                        return -1;
                } else {
                    table = event_pool->ereg[i];
                }
                break;
            default:
                table = event_pool->ereg[i];
                break;
        }

        if (table)
            /* break out of the loop */
            break;
    }

    if (!table)
        return -1;

    table_idx = i;

>>>>>>>>>>>>>>>>>>>>>>>>>

The code below then looks for a free entry in the selected table. Going by
the current slots_used value, 3 entries (1024 - 1021) should be free in the
table, but no entry is actually free.

The code does not validate that the fd was assigned to a slot in the table.
It just returns idx.

>>>>>>>>>>>>>>>>>>>>>>>>>>

    for (i = 0; i < EVENT_EPOLL_SLOTS; i++) {
        if (table[i].fd == -1) {
            /* wipe everything except bump the generation */
            gen = table[i].gen;
            memset(&table[i], 0, sizeof(table[i]));
            table[i].gen = gen + 1;

            LOCK_INIT(&table[i].lock);
            INIT_LIST_HEAD(&table[i].poller_death);

            table[i].fd = fd;
            if (notify_poller_death) {
                table[i].idx = table_idx * EVENT_EPOLL_SLOTS + i;
                list_add_tail(&table[i].poller_death,
                              &event_pool->poller_death);
            }

            event_pool->slots_used[table_idx]++;

            break;
        }
    }

>>>>>>>>>>>>>>>>>
    return table_idx * EVENT_EPOLL_SLOTS + i;


I think we need to update the code.
I have checked the event code, but I am not able to figure out why slots_used
is not showing the correct value.
Ideally, slots_used should be 1024 because no index is free in the registry
table, and before returning the index the code should validate whether the
fd was successfully assigned.
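
The missing validation could look roughly like the sketch below. It is a
simplified stand-in and not the glusterfs code: sketch_slot,
sketch_slot_claim and SKETCH_SLOTS are hypothetical names, and the slot
struct is reduced to the two fields that matter here. The point is only that
the scan reports failure when no entry with fd == -1 exists, whatever the
counter says.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 4 /* hypothetical stand-in for EVENT_EPOLL_SLOTS */

struct sketch_slot {
    int fd;  /* -1 means the slot is free */
    int gen; /* generation, bumped on every reuse */
};

/* Claim the first free slot for fd.
 * Returns the slot index, or -1 when the scan finds no free slot. */
static int
sketch_slot_claim(struct sketch_slot *table, int *slots_used, int fd)
{
    int i;

    assert(table != NULL && slots_used != NULL);

    for (i = 0; i < SKETCH_SLOTS; i++) {
        if (table[i].fd == -1) {
            table[i].gen++;
            table[i].fd = fd;
            (*slots_used)++;
            return i;
        }
    }

    /* Nothing was free, even if *slots_used suggested otherwise. */
    return -1;
}
```

With this shape, a caller that gets -1 can fail the registration cleanly
instead of later hitting the assert on slot->fd == fd.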

RCA: slot->ref is not incremented atomically when the slot is allocated.
Instead, it is incremented later, as part of event_slot_get. If ref/unref
cycles happen in this window (as explained above), they result in more slot
deallocations than are actually needed, which leads to wrong accounting of
slots_used in the slot table.
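
To make the accounting problem concrete, here is a toy model of how one
extra deallocation makes the counter drift away from the real table
contents. It is only a sketch under simplifying assumptions: drift_table,
drift_alloc, drift_dealloc and DRIFT_SLOTS are hypothetical names, there is
a single table, and there is no locking or reference counting. The
allocation path deliberately mirrors the unchecked loop index of the
excerpt above.

```c
#include <assert.h>

#define DRIFT_SLOTS 4 /* hypothetical stand-in for EVENT_EPOLL_SLOTS */

struct drift_table {
    int fd[DRIFT_SLOTS]; /* -1 marks a free slot */
    int slots_used;      /* counter that is trusted when picking a table */
};

static void
drift_init(struct drift_table *t)
{
    int i;

    for (i = 0; i < DRIFT_SLOTS; i++)
        t->fd[i] = -1;
    t->slots_used = 0;
}

/* Like the excerpt: trust the counter, then scan without checking the
 * result. Returns DRIFT_SLOTS (out of range) when the scan finds nothing. */
static int
drift_alloc(struct drift_table *t, int fd)
{
    int i;

    if (t->slots_used == DRIFT_SLOTS)
        return -1; /* counter says the table is full */

    for (i = 0; i < DRIFT_SLOTS; i++) {
        if (t->fd[i] == -1) {
            t->fd[i] = fd;
            t->slots_used++;
            break;
        }
    }

    return i;
}

/* Frees a slot and decrements the counter unconditionally, so a second
 * call for the same slot undercounts slots_used. */
static void
drift_dealloc(struct drift_table *t, int i)
{
    assert(i >= 0 && i < DRIFT_SLOTS);
    t->fd[i] = -1;
    t->slots_used--;
}
```

Filling all four slots, deallocating slot 0 twice, and allocating once more
leaves slots_used at 3 while every entry holds a valid fd. The next
drift_alloc call passes the counter check, finds no free entry, and returns
DRIFT_SLOTS, which is the same pattern as slots_used showing 1021 with no
free entry in the table.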

The fix would be:
1. Increment slot->ref atomically in __event_slot_alloc.
2. Add checks for whether __event_slot_alloc actually returns a valid slot
instead of assuming it does.
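
A rough sketch of the first point, taking the reference in the same step
that claims the slot, is given below. This is not the actual patch:
ref_slot, sketch_alloc_with_ref and sketch_unref are hypothetical names, and
the sketch assumes the caller already holds the pool lock, so no locking is
shown.

```c
#include <assert.h>
#include <stddef.h>

struct ref_slot {
    int fd;  /* -1 means the slot is free */
    int ref; /* reference count; the slot is freed when it drops to 0 */
};

/* Claim a free slot and take the first reference in the same step, so
 * there is no window in which the slot is allocated but has ref == 0.
 * Assumption: the caller holds the pool lock.
 * Returns the slot index, or -1 if no slot is free. */
static int
sketch_alloc_with_ref(struct ref_slot *table, int nslots, int fd)
{
    int i;

    assert(table != NULL);

    for (i = 0; i < nslots; i++) {
        if (table[i].fd == -1) {
            table[i].fd = fd;
            table[i].ref = 1; /* reference owned by the allocator */
            return i;
        }
    }

    return -1;
}

/* Drop one reference. Returns 1 if this call freed the slot, else 0. */
static int
sketch_unref(struct ref_slot *slot)
{
    assert(slot != NULL && slot->ref > 0);

    if (--slot->ref == 0) {
        slot->fd = -1;
        return 1;
    }

    return 0;
}
```

Because the slot never exists with ref == 0 after allocation, a ref/unref
cycle from another thread cannot drive the count to zero before the
allocator has taken its own reference, which is the window the RCA
describes.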

Thanks,
Mohit Agrawal
