[Bugs] [Bug 1669382] New: [ovirt-gluster] Fuse mount crashed while creating the preallocated image

bugzilla at redhat.com
Fri Jan 25 04:54:28 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1669382

            Bug ID: 1669382
           Summary: [ovirt-gluster] Fuse mount crashed while creating the
                    preallocated image
           Product: GlusterFS
           Version: 5
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: sharding
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org
        Depends On: 1668304, 1669077
            Blocks: 1668309
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1669077 +++

+++ This bug was initially created as a clone of Bug #1668304 +++

Description of problem:
------------------------
Fuse mount crashed while creating a preallocated image of size 1TB on an
arbitrated replicate volume

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.4.3 nightly ( glusterfs-3.12.2-38.el7rhgs )
RHV 4.2.8

How reproducible:
-----------------
1 out of 2 times

Steps to Reproduce:
-------------------
1. Create an arbitrated replicate volume
2. Fuse mount it
3. Create a preallocated image of size 1TB:
# qemu-img create -f qcow2 -o preallocation=falloc /mnt/test1/vm1.img 1072G

Actual results:
---------------
Fuse mount crashed with a segfault

Expected results:
-----------------
Preallocated image should be created successfully

--- Additional comment from SATHEESARAN on 2019-01-22 12:10:19 UTC ---

1. Cluster info
----------------
There are 3 nodes in the cluster


2. Volume info
---------------
[root@ ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 7eb49e90-e2b6-4f8f-856e-7108212dbb72
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: server1:/gluster_bricks/data/data
Brick2: server2:/gluster_bricks/data/data
Brick3: server3:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
cluster.enable-shared-storage: enable


--- Additional comment from SATHEESARAN on 2019-01-22 12:11:03 UTC ---

Backtrace
-----------

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --volfile-server=localhost
--volfile-id=/data /mnt/test1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0,
inode=0x7fb104026258) at inode.c:455
455             if ((inode->_ctx[xlator->xl_id].xl_key != NULL) &&
(gdb) bt
#0  0x00007fb1233e0a7a in __inode_get_xl_index (xlator=0x7fb1100143b0,
inode=0x7fb104026258) at inode.c:455
#1  __inode_ref (inode=inode at entry=0x7fb104026258) at inode.c:537
#2  0x00007fb1233e0b81 in inode_ref (inode=inode at entry=0x7fb104026258) at
inode.c:581
#3  0x00007fb1233f5d2b in __fd_create (inode=inode at entry=0x7fb104026258,
pid=pid at entry=0) at fd.c:633
#4  0x00007fb1233f6f4a in __fd_anonymous (inode=inode at entry=0x7fb104026258,
flags=flags at entry=2) at fd.c:779
#5  0x00007fb1233f729d in fd_anonymous (inode=0x7fb104026258) at fd.c:803
#6  0x00007fb115161534 in shard_post_lookup_fsync_handler
(frame=0x7fb0c05eb178, this=0x7fb1100143b0) at shard.c:5936
#7  0x00007fb11514913c in shard_lookup_base_file
(frame=frame at entry=0x7fb0c05eb178, this=this at entry=0x7fb1100143b0,
loc=loc at entry=0x7fb10800a158, 
    handler=handler at entry=0x7fb115161030 <shard_post_lookup_fsync_handler>) at
shard.c:1746
#8  0x00007fb1151544c3 in shard_fsync (frame=0x7fb0c05eb178,
this=0x7fb1100143b0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at shard.c:6015
#9  0x00007fb114f30189 in wb_fsync_helper (frame=0x7fb0f80022e8,
this=0x7fb1100159d0, fd=0x7fb0f800eb78, datasync=1, xdata=0x0) at
write-behind.c:1974
#10 0x00007fb1233f5b15 in call_resume_keep_stub (stub=0x7fb0f80250f8) at
call-stub.c:2582
#11 0x00007fb114f35a69 in wb_do_winds (wb_inode=wb_inode at entry=0x7fb0f800dd70,
tasks=tasks at entry=0x7fb10dae7510) at write-behind.c:1672
#12 0x00007fb114f35b7b in wb_process_queue
(wb_inode=wb_inode at entry=0x7fb0f800dd70) at write-behind.c:1709
#13 0x00007fb114f35c57 in wb_fulfill_cbk (frame=0x7fb0f8010a58,
cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>,
op_errno=<optimized out>, prebuf=<optimized out>, 
    postbuf=0x7fb0f8003670, xdata=0x7fb0c065ee98) at write-behind.c:1054
#14 0x00007fb115156840 in shard_common_inode_write_success_unwind
(fop=<optimized out>, frame=0x7fb0f80019b8, op_ret=65536) at shard.c:903
#15 0x00007fb115156bc0 in shard_common_inode_write_post_update_size_handler
(frame=<optimized out>, this=<optimized out>) at shard.c:5214
#16 0x00007fb115147cc0 in shard_update_file_size
(frame=frame at entry=0x7fb0f80019b8, this=this at entry=0x7fb1100143b0,
fd=0x7fb0f800eb78, loc=loc at entry=0x0, 
    handler=handler at entry=0x7fb115156ba0
<shard_common_inode_write_post_update_size_handler>) at shard.c:1201
#17 0x00007fb11515e811 in shard_common_inode_write_do_cbk
(frame=frame at entry=0x7fb0f80019b8, cookie=0x7fb0f800eb78, this=0x7fb1100143b0,
op_ret=op_ret at entry=65536, 
    op_errno=op_errno at entry=0, pre=pre at entry=0x7fb0f8029730,
post=post at entry=0x7fb0f80297a0, xdata=xdata at entry=0x7fb0c065ee98) at
shard.c:5326
#18 0x00007fb1153d467e in dht_writev_cbk (frame=0x7fb0f80021d8,
cookie=<optimized out>, this=<optimized out>, op_ret=65536, op_errno=0,
prebuf=0x7fb0f8029730, postbuf=0x7fb0f80297a0, 
    xdata=0x7fb0c065ee98) at dht-inode-write.c:119
#19 0x00007fb115630b32 in afr_writev_unwind (frame=frame at entry=0x7fb0f8004888,
this=this at entry=0x7fb11000fff0) at afr-inode-write.c:246
#20 0x00007fb11563105e in afr_writev_wind_cbk (frame=0x7fb0f800bd08,
cookie=<optimized out>, this=0x7fb11000fff0, op_ret=<optimized out>,
op_errno=<optimized out>, 
    prebuf=<optimized out>, postbuf=0x7fb10dae7990, xdata=0x7fb0c065ee98) at
afr-inode-write.c:406
#21 0x00007fb1158a7ffa in client3_3_writev_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7fb0f802d148) at
client-rpc-fops.c:838
#22 0x00007fb123198b30 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7fb11004a940, pollin=pollin at entry=0x7fb10bb95520) at
rpc-clnt.c:778
#23 0x00007fb123198ed3 in rpc_clnt_notify (trans=<optimized out>,
mydata=0x7fb11004a970, event=<optimized out>, data=0x7fb10bb95520) at
rpc-clnt.c:971
#24 0x00007fb123194c33 in rpc_transport_notify (this=this at entry=0x7fb11004ac90,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7fb10bb95520)
at rpc-transport.c:552
#25 0x00007fb117d89576 in socket_event_poll_in (this=this at entry=0x7fb11004ac90,
notify_handled=<optimized out>) at socket.c:2322
#26 0x00007fb117d8bb1c in socket_event_handler (fd=11, idx=4, gen=1,
data=0x7fb11004ac90, poll_in=1, poll_out=0, poll_err=0) at socket.c:2474
#27 0x00007fb12342ee84 in event_dispatch_epoll_handler (event=0x7fb10dae7e80,
event_pool=0x56476beb1ec0) at event-epoll.c:583
#28 event_dispatch_epoll_worker (data=0x56476bf0b1d0) at event-epoll.c:659
#29 0x00007fb12222fdd5 in start_thread (arg=0x7fb10dae8700) at
pthread_create.c:307
#30 0x00007fb121af7ead in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

--- Additional comment from SATHEESARAN on 2019-01-22 12:12:24 UTC ---

Errors as seen in the fuse mount log:
[2019-01-22 11:29:05.487058] E [shard.c:5935:shard_post_lookup_fsync_handler]
(-->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3)
[0x7fb1151544c3]
-->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c)
[0x7fb11514913c]
-->/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x2083e)
[0x7fb11516183e] ) 0-: Assertion failed: fsync_count > 0
pending frames:
frame : type(0) op(0)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-01-22 11:29:05
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fb1233d0e8d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb1233daf54]
/lib64/libc.so.6(+0x36280)[0x7fb121a30280]
/lib64/libglusterfs.so.0(+0x3ba7a)[0x7fb1233e0a7a]
/lib64/libglusterfs.so.0(inode_ref+0x21)[0x7fb1233e0b81]
/lib64/libglusterfs.so.0(+0x50d2b)[0x7fb1233f5d2b]
/lib64/libglusterfs.so.0(+0x51f4a)[0x7fb1233f6f4a]
/lib64/libglusterfs.so.0(fd_anonymous+0x2d)[0x7fb1233f729d]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x20534)[0x7fb115161534]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x813c)[0x7fb11514913c]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x134c3)[0x7fb1151544c3]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x3189)[0x7fb114f30189]
/lib64/libglusterfs.so.0(call_resume_keep_stub+0x75)[0x7fb1233f5b15]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8a69)[0x7fb114f35a69]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8b7b)[0x7fb114f35b7b]
/usr/lib64/glusterfs/3.12.2/xlator/performance/write-behind.so(+0x8c57)[0x7fb114f35c57]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15840)[0x7fb115156840]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x15bc0)[0x7fb115156bc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x6cc0)[0x7fb115147cc0]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x1d811)[0x7fb11515e811]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x6967e)[0x7fb1153d467e]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x22b32)[0x7fb115630b32]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2305e)[0x7fb11563105e]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x18ffa)[0x7fb1158a7ffa]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fb123198b30]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a3)[0x7fb123198ed3]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb123194c33]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7576)[0x7fb117d89576]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9b1c)[0x7fb117d8bb1c]
/lib64/libglusterfs.so.0(+0x89e84)[0x7fb12342ee84]
/lib64/libpthread.so.0(+0x7dd5)[0x7fb12222fdd5]
/lib64/libc.so.6(clone+0x6d)[0x7fb121af7ead]
---------

--- Additional comment from Krutika Dhananjay on 2019-01-23 16:37:55 UTC ---

So this seems to happen when the VM image creation fills the lru list (whose
size is 16K) maintained by the shard translator.
In any case, the bug isn't related to the shard deletion patch. The crash is
seen during the fsync that follows the writes, and at the moment it appears to
be caused by this Dalton fix @
https://bugzilla.redhat.com/show_bug.cgi?id=1583462.

Managed to recreate the bug with a smaller data set:
1. Set the shard-lru-limit option to, say, 100. (In fact, the very reason I
added this option in https://bugzilla.redhat.com/show_bug.cgi?id=1603118#c13
was to enable easier testing of this code path, since it is a bit complicated
and prone to races.)

2. Execute the qemu-img command from the "Description" above to create a
smaller file (say 10G), and the crash occurs.
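The smaller reproducer above can be sketched as the following commands (the volume name and mount point are placeholders; I'm assuming the lru-limit option from bug 1603118 is exposed as features.shard-lru-limit, so verify the option name on your build):

```sh
# Assuming an existing sharded replicate volume VOLNAME, fuse-mounted at /mnt/test1
gluster volume set VOLNAME features.shard-lru-limit 100

# A 10G preallocated image now spans more shards than the shrunken lru list
# can hold, which is enough to trigger the crash path
qemu-img create -f qcow2 -o preallocation=falloc /mnt/test1/vm1.img 10G
```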

Made some progress: identified that this bug is hit when the shard is part of
the fsync list but NOT the lru list. This defies an assumption made at the time
of writing that patch: that if a shard is part of the fsync list, it is
guaranteed to be part of the lru list as well, which is where it gets ref'd.
But that is not what happened.

So no lru-list presence => no ref. The inode is destroyed in-memory at some
point with its "ctx" object still on the fsync list. When this list is
traversed during fsync, the client crashes because that memory has been freed
by then.

What's left is to identify the specific case in which this can happen.
Calling it a day for now. Will resume the investigation tomorrow.

-Krutika

--- Additional comment from Krutika Dhananjay on 2019-01-24 06:33:26 UTC ---

Found the issue.

When an inode-write fop (in this case FALLOCATE) needs to write to more shards
than the lru list can hold, some of the shards participating in this fop get
evicted from the lru list and later added to the fsync list (because these
shards were modified and need to be fsync'd when the application sends one),
but without holding a ref.
Because of this, at the end of the fop, the inode and its ctx get unref'd and
destroyed while still being part of the fsync list.

When the application does send an fsync later, the fsync list is traversed and
the (now) bad address is accessed, leading to a crash.

The fix is simple: ref the shard at the time it is added to the fsync list as
well, and unref it once it has been fsync'd.

I need to make this change and run some tests (to be sure the new ref doesn't
cause memory leaks in case it isn't unref'd wherever it needs to be), after
which I'll send the patch upstream.

-Krutika

--- Additional comment from Worker Ant on 2019-01-24 18:15:43 UTC ---

REVIEW: https://review.gluster.org/22091 (features/shard: Ref shard inode while
adding to fsync list) merged (#4) on master by Xavi Hernandez


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1668304
[Bug 1668304] [RHHI-V] Fuse mount crashed while creating the preallocated image
https://bugzilla.redhat.com/show_bug.cgi?id=1668309
[Bug 1668309] Fuse mount crashed while creating the preallocated image with
size > 1TB
https://bugzilla.redhat.com/show_bug.cgi?id=1669077
[Bug 1669077] [ovirt-gluster] Fuse mount crashed while creating the
preallocated image