[Bugs] [Bug 1751085] New: Gluster fuse mount crashed during truncate

Wed Sep 11 07:16:39 UTC 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1751085

            Bug ID: 1751085
           Summary: Gluster fuse mount crashed during truncate
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: sharding
          Keywords: Triaged
          Assignee: kdhananj at redhat.com
          Reporter: kdhananj at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community

Description of problem:

The Gluster FUSE mount crashes in the shard translator while truncating a file
from a very large size (on the order of exabytes) to a very small size.

See bt:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --process-name fuse
--volfile-server=tendrl25.lab.eng.blr.r'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f01260a2eca in shard_common_resolve_shards
(frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60,
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at
shard.c:1030
1030                local->inode_list[i] = inode_ref(res_inode);
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64
libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64
libselinux-2.5-14.1.el7.x86_64 pcre-8.32-17.el7.x86_64
sssd-client-1.16.2-13.el7_6.8.x86_64
(gdb) bt
#0  0x00007f01260a2eca in shard_common_resolve_shards
(frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60,
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at
shard.c:1030
#1  0x00007f01260a3bfd in shard_refresh_internal_dir
(frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60,
type=type@entry=SHARD_INTERNAL_DIR_DOT_SHARD) at shard.c:1317
#2  0x00007f01260ad90d in shard_truncate_begin
(frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60) at
shard.c:2596
#3  0x00007f01260b506d in shard_post_lookup_truncate_handler
(frame=0x7f0104000a58, this=0x7f0118015b60) at shard.c:2659
#4  0x00007f01260a1f8b in shard_lookup_base_file_cbk (frame=0x7f0104000a58,
cookie=<optimized out>, this=0x7f0118015b60, op_ret=<optimized out>,
op_errno=<optimized out>, 
    inode=<optimized out>, buf=0x7f0104013bb0, xdata=0x7f010c00faf8,
postparent=0x7f0104013c48) at shard.c:1702
#5  0x00007f01265e4922 in afr_discover_unwind (frame=0x7f010400bd28,
this=<optimized out>) at afr-common.c:3011
#6  0x00007f01265e4eeb in afr_discover_done (frame=<optimized out>,
this=<optimized out>) at afr-common.c:3106
#7  0x00007f01265f27fd in afr_lookup_metadata_heal_check
(frame=frame@entry=0x7f010400bd28, this=this@entry=0x7f0118010da0) at
afr-common.c:2761
#8  0x00007f01265f3608 in afr_discover_cbk (frame=frame@entry=0x7f010400bd28,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>,
    inode=inode@entry=0x7f011805b9d8, buf=buf@entry=0x7f011f7fcb40,
xdata=0x7f010c00faf8, postparent=postparent@entry=0x7f011f7fcbe0) at
afr-common.c:3147
#9  0x00007f012687c412 in client4_0_lookup_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f010c010838) at
client-rpc-fops_v2.c:2641
#10 0x00007f012ee6c021 in rpc_clnt_handle_reply
(clnt=clnt@entry=0x7f01180510c0, pollin=pollin@entry=0x7f010c001a20) at
rpc-clnt.c:755
#11 0x00007f012ee6c387 in rpc_clnt_notify (trans=0x7f0118051380,
mydata=0x7f01180510f0, event=<optimized out>, data=0x7f010c001a20) at
rpc-clnt.c:922
#12 0x00007f012ee689f3 in rpc_transport_notify (this=this@entry=0x7f0118051380,
event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f010c001a20)
at rpc-transport.c:542
#13 0x00007f0129778875 in socket_event_poll_in (notify_handled=true,
this=0x7f0118051380) at socket.c:2522
#14 socket_event_handler (fd=<optimized out>, idx=<optimized out>,
gen=<optimized out>, data=0x7f0118051380, poll_in=<optimized out>,
poll_out=<optimized out>, poll_err=0, 
    event_thread_died=0 '\000') at socket.c:2924
#15 0x00007f012f126806 in event_dispatch_epoll_handler (event=0x7f011f7fd130,
event_pool=0x55cbb4dc2300) at event-epoll.c:648
#16 event_dispatch_epoll_worker (data=0x7f0118049a60) at event-epoll.c:761
#17 0x00007f012df01dd5 in start_thread (arg=0x7f011f7fe700) at
pthread_create.c:307
#18 0x00007f012d7c902d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p local->first_block
$1 = 0
(gdb) p local->last_block 
$2 = -1
(gdb) p local->prebuf.ia_size
$3 = 18446744073709547520
(gdb) p local->num_blocks
$4 = 0
(gdb) p local->block_size 
$5 = 67108864
(gdb) p (local->prebuf.ia_size - 1)/local->block_size
$6 = 274877906943
(gdb) p (int) $6
$7 = -1

It turns out that the quotient of a very large unsigned 64-bit integer divided
by a relatively small unsigned 64-bit integer is assigned to a signed 32-bit
variable, and this quotient can exceed the largest signed 32-bit value. Here the
quotient is 274877906943 (2^38 - 1), whose low 32 bits are all ones, so
local->last_block is assigned -1 and local->num_blocks, which is derived from
it, becomes 0. This leads to a GF_CALLOC of local->inode_list[] with zero
elements here in shard_truncate_begin():

        local->inode_list = GF_CALLOC(local->num_blocks, sizeof(inode_t *),
                                      gf_shard_mt_inode_list);
        if (!local->inode_list)
            goto err;
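
The arithmetic seen in the gdb session can be reproduced in isolation. The
following is a minimal sketch, not code from shard.c: the function names are
hypothetical, and the formula num_blocks = last_block - first_block + 1 is an
assumption that happens to match the values printed above (0, -1, 0).

```c
#include <stdint.h>

/* Hypothetical helper: the same expression evaluated in gdb,
 * (ia_size - 1) / block_size, computed entirely in 64 bits. */
uint64_t highest_block_u64(uint64_t ia_size, uint64_t block_size)
{
    return (ia_size - 1) / block_size;
}

/* Hypothetical helper: keep only the low 32 bits of the quotient, as
 * happens when it is stored in a signed 32-bit variable. Converting an
 * out-of-range value to a signed type is implementation-defined in C;
 * on common two's-complement compilers the bit pattern is preserved. */
int32_t narrow_to_int32(uint64_t quotient)
{
    return (int32_t)(uint32_t)quotient;
}

/* ASSUMPTION: the block count is derived as last - first + 1. This is
 * consistent with the gdb values but is not quoted from the source. */
int32_t num_blocks_assumed(int32_t first_block, int32_t last_block)
{
    return last_block - first_block + 1;
}
```

With ia_size = 18446744073709547520 and block_size = 67108864, the 64-bit
quotient is 274877906943; narrowing it gives -1, and the assumed block count
for first_block = 0 is then 0.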

When members of local->inode_list[] beyond the boundary of this zero-length
allocation are accessed (shard.c:1030 in the backtrace above), the memory
access is illegal and the process crashes.
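
One way to avoid indexing into an undersized list is to compute the block
range in 64 bits and reject it before allocating. The sketch below only
illustrates that idea; it is not the fix applied to the shard translator, and
block_range_count() and its max_blocks cap are hypothetical names introduced
here.

```c
#include <stdint.h>

/* Hypothetical guard: compute the number of blocks in [first, last] using
 * 64-bit arithmetic and refuse ranges that are inverted, that wrap around,
 * or that exceed a caller-chosen cap. Returns 0 on success and stores the
 * count in *count_out; returns -1 otherwise and leaves *count_out alone. */
int block_range_count(uint64_t first_block, uint64_t last_block,
                      uint64_t max_blocks, uint64_t *count_out)
{
    uint64_t count;

    if (last_block < first_block)
        return -1;

    /* Wraps to 0 only when the range spans the full 64-bit space. */
    count = last_block - first_block + 1;
    if (count == 0 || count > max_blocks)
        return -1;

    *count_out = count;
    return 0;
}
```

A caller would allocate the inode list only when this returns 0, so a range
such as 0..274877906943 is rejected up front instead of silently becoming a
zero-element allocation.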

This explains the crash, but not why the file size was as large as 16 EB when
the writes to the file in question should not have extended it beyond 4 KB.
That is still being investigated. I'll update the bz once I have the root
cause.

Version-Release number of selected component (if applicable):

master

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
