[Bugs] [Bug 1488354] gluster-blockd process crashed and core generated

bugzilla at redhat.com bugzilla at redhat.com
Tue Sep 5 08:10:46 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1488354

Pranith Kumar K <pkarampu at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|bugs at gluster.org            |pkarampu at redhat.com



--- Comment #1 from Pranith Kumar K <pkarampu at redhat.com> ---
(gdb) bt
#0  0x00007f549eaf3c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f549e207f15 in fd_anonymous () from /lib64/libglusterfs.so.0
#2  0x00007f54869d1927 in shard_common_inode_write_do ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#3  0x00007f54869d1c7d in shard_common_inode_write_post_mknod_handler ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#4  0x00007f54869ca77f in shard_common_mknod_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/features/shard.so
#5  0x00007f5486c1164b in dht_newfile_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so
#6  0x00007f5486e71ab1 in afr_mknod_unwind ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#7  0x00007f5486e73eeb in __afr_dir_write_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#8  0x00007f5486e7482d in afr_mknod_wind_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so
#9  0x00007f54870f6168 in client3_3_mknod_cbk ()
   from /usr/lib64/glusterfs/3.8.4/xlator/protocol/client.so
#10 0x00007f549dfac840 in rpc_clnt_handle_reply () from /lib64/libgfrpc.so.0
#11 0x00007f549dfacb27 in rpc_clnt_notify () from /lib64/libgfrpc.so.0
#12 0x00007f549dfa89e3 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#13 0x00007f5490be63d6 in socket_event_poll_in ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#14 0x00007f5490be897c in socket_event_handler ()
   from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#15 0x00007f549e23e1e6 in event_dispatch_epoll_worker ()
   from /lib64/libglusterfs.so.0
#16 0x00007f549eaf1e25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f549d8b034d in clone () from /lib64/libc.so.6


Based on the core file, the only way this can happen is if not all of the
shards were created.
(gdb) fr 2
#2  0x00007f54869d1927 in shard_common_inode_write_do (frame=0x7f548c0dbbe0, 
    this=0x7f54800120d0) at shard.c:3883
3883                            anon_fd = fd_anonymous (local->inode_list[i]);
(gdb) p i
$1 = 255
(gdb) p local->inode_list[i]
$2 = (inode_t *) 0x0
(gdb) p local->inode_list[i-1]
$3 = (inode_t *) 0x7f5474765440
(gdb) p local->offset
$4 = 0
(gdb) p local->num_blocks
$5 = 256
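
For context on the crash itself: fd_anonymous() is called with that NULL
inode and ends up taking the inode's lock, so the NULL pointer is
dereferenced inside pthread_mutex_lock(), which matches frames #0/#1 of the
backtrace. A toy reproduction of just that failure mode (plain pthreads, not
the actual gluster code):

/* Toy reproduction only -- not gluster code.  Mimics the crash: a shard
 * inode that was never created leaves a NULL entry in the list, and taking
 * a mutex embedded in that (NULL) structure faults inside
 * pthread_mutex_lock(). */
#include <pthread.h>

struct toy_inode {
        pthread_mutex_t lock;
};

int
main (void)
{
        /* 256 expected shards, but slot 255 was never populated. */
        struct toy_inode *inode_list[256] = { 0 };

        /* Equivalent of fd_anonymous (local->inode_list[255]) when the
         * shard is missing: locking through the NULL entry segfaults. */
        pthread_mutex_lock (&inode_list[255]->lock);
        return 0;
}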

Based on this data, I went through the code and found two races:
1) In shard_common_mknod_cbk(), local->eexist_count is incremented without
   holding frame->lock.
2) In shard_common_lookup_shards_cbk(), local->create_count is incremented
   without holding frame->lock.

Either race can leave these counts lower than they should be, so mknod ends
up being issued for only 255 shards instead of all 256.
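
To illustrate the undercount, here is a minimal, self-contained sketch
(hypothetical names, plain pthreads; not the shard xlator code) of why an
increment performed without frame->lock can lose updates when callbacks run
concurrently, and how serializing the increment with a lock avoids it:

/* Toy demonstration only -- hypothetical names, not gluster code.
 * NUM_SHARDS threads each report completion by bumping a shared counter,
 * the same shape as eexist_count/create_count being bumped from parallel
 * callbacks.  Without the mutex, the read-modify-write cycles can
 * interleave and the final count comes up short (e.g. 255 instead of 256).
 * Holding the lock around the increment (analogous to protecting the
 * counter with frame->lock) makes the count reliable. */
#include <pthread.h>
#include <stdio.h>

#define NUM_SHARDS 256

static int             create_count = 0;
static pthread_mutex_t count_lock   = PTHREAD_MUTEX_INITIALIZER;

static void *
shard_done (void *arg)
{
        (void) arg;

        /* Remove these two lock calls to reproduce the lost-update race. */
        pthread_mutex_lock (&count_lock);
        create_count++;
        pthread_mutex_unlock (&count_lock);

        return NULL;
}

int
main (void)
{
        pthread_t threads[NUM_SHARDS];
        int       i;

        for (i = 0; i < NUM_SHARDS; i++)
                pthread_create (&threads[i], NULL, shard_done, NULL);
        for (i = 0; i < NUM_SHARDS; i++)
                pthread_join (threads[i], NULL);

        printf ("create_count = %d (expected %d)\n", create_count, NUM_SHARDS);
        return 0;
}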

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.
You are the assignee for the bug.

