[Gluster-users] query about glusterfs 3.12-3 write-behind.c coredump

Wed Jan 30 04:15:11 UTC 2019

On Wed, Jan 30, 2019 at 9:44 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

>
>
> On Wed, Jan 30, 2019 at 7:35 AM Li, Deqian (NSB - CN/Hangzhou) <
> deqian.li at nokia-sbell.com> wrote:
>
>> Hi,
>>
>>
>>
>> Could you help to check this coredump?
>>
>> We are using glusterfs 3.12-3(3 replicated bricks solution ) to do
>> stability testing under high CPU load like 80% by stress and doing I/O.
>>
>> After several hours, coredump happened in glusterfs side .
>>
>>
>>
>> [Current thread is 1 (Thread 0x7ffff37d2700 (LWP 3696))]
>>
>> Missing separate debuginfos, use: dnf debuginfo-install
>> rcp-pack-glusterfs-1.8.1_11_g99e9ca6-RCP2.wf28.x86_64
>>
>> (gdb) bt
>>
>> #0  0x00007ffff0d5c845 in wb_fulfill (wb_inode=0x7fffd406b3b0,
>> liabilities=0x7fffdc234b50) at write-behind.c:1148
>>
>> #1  0x00007ffff0d5e4d5 in wb_process_queue (wb_inode=0x7fffd406b3b0) at
>> write-behind.c:1718
>>
>> #2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290,
>> this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1,
>> offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
>>
>>     at write-behind.c:1825
>>
>> #3  0x00007ffff0b51fcb in du_writev_resume (ret=0, frame=0x7fffdc0305a0,
>> opaque=0x7fffdc0305a0) at disk-usage.c:490
>>
>> #4  0x00007ffff7b3510d in synctask_wrap () at syncop.c:377
>>
>> #5  0x00007ffff60d0660 in ?? () from /lib64/libc.so.6
>>
>> #6  0x0000000000000000 in ?? ()
>>
>> (gdb) p wb_inode
>>
>> $1 = (wb_inode_t *) 0x7fffd406b3b0
>>
>> (gdb) frame 2
>>
>> #2  0x00007ffff0d5eda7 in wb_writev (frame=0x7fffe0086290,
>> this=0x7fffec014b00, fd=0x7fffe4034070, vector=0x7fffdc445720, count=1,
>> offset=67108863, flags=32770, iobref=0x7fffdc00d550, xdata=0x0)
>>
>>     at write-behind.c:1825
>>
>> 1825         in write-behind.c
>>
>> (gdb) p *fd
>>
>> $2 = {pid = 18154, flags = 32962, refcount = 0, inode_list = {next =
>> 0x7fffe4034080, prev = 0x7fffe4034080}, inode = 0x0, lock = {spinlock = 0,
>> mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
>>
>>         __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list =
>> {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>,
>> "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}},
>>
>>   _ctx = 0x7fffe4022930, xl_count = 17, lk_ctx = 0x7fffe40350e0,
>> anonymous = _gf_false}
>>
>> (gdb) p fd
>>
>> $3 = (fd_t *) 0x7fffe4034070
>>
>>
>>
>> (gdb) p wb_inode->this
>>
>> $1 = (xlator_t *) 0xffffffffffffff00
>>
>>
>>
>> After adding test log I found the FOP  sequence in write-behind xlator
>> side was mass as bellow showing.  In the FUSE side the FLUSH is after
>> write2, but in the WB side, FLUSH is between write2 ‘wb_do_unwinds’ and
>> ‘wb_fulfill’.
>>
>> So I think this should has problem.
>>
>
> wb_do_unwinds the write after caching it.
>

* wb_do_unwinds unwind the write after caching it.

Once the write response is unwound, kernel issues a FLUSH. So, this is a
> valid sequence of operations and nothing wrong.
>
> I think it’s possible that the FLUSH and later RELEASE operation will
>> destroy the fd , it will cause ‘wb_in->this(0xffffffffffffff00)’. Do you
>> think so?
>>
>> And I think our new adding disk-usage xlator’s synctask_new will dealy
>> the write operation, but the FLUSH operation without this delay(because not
>> invoked the disk-usage xlator).
>>
>
> Flush and release wait for completion of any on-going writes. So, unless
> write-behind has unwound the write, it won't see a flush. By the time write
> is unwound, write-behind makes sure to take references on objects it uses
> (like fds, iobref etc). So, I don't see a problem there.
>
>
>>
>> Do you agree with my speculation ? and how to fix?(we don’t want to move
>> the disk-usage xlator)
>>
>
> I've still not found the RCA. We can discuss about the fix once RCA is
> found.
>
>
>>
>>
>>
>> Problematic FOP sequence :
>>
>>
>>
>> FUSE side:             WB side:
>>
>>
>>
>> Write 1                write1
>>
>> Write2 do unwind
>>
>> Write 2                FLUSH
>>
>>                       Release(destroy fd)
>>
>> FLUSH                write2  (wb_fulfill)  then coredump.
>>
>> Release
>>
>>
>>
>>
>>
>> int
>>
>> wb_fulfill (wb_inode_t *wb_inode, list_head_t *liabilities)
>>
>> {
>>
>>          wb_request_t  *req     = NULL;
>>
>>          wb_request_t  *head    = NULL;
>>
>>          wb_request_t  *tmp     = NULL;
>>
>>          wb_conf_t     *conf    = NULL;
>>
>>          off_t          expected_offset = 0;
>>
>>          size_t         curr_aggregate = 0;
>>
>>          size_t         vector_count = 0;
>>
>>         int            ret          = 0;
>>
>>
>>
>>          conf = wb_inode->this->private;   à this line coredump
>>
>>
>>
>>          list_for_each_entry_safe (req, tmp, liabilities, winds) {
>>
>>                   list_del_init (&req->winds);
>>
>>
>>
>> ….
>>
>>
>>
>>
>>
>> volume ccs-write-behind
>>
>> 68:     type performance/write-behind
>>
>> 69:     subvolumes ccs-dht
>>
>> 70: end-volume
>>
>> 71:
>>
>> * 72: volume ccs-disk-usage                      **à we add a new xlator
>> here for write op ,just for checking if disk if full.  And synctask_new for
>> write.*
>>
>> *73:     type performance/disk-usage*
>>
>> *74:     subvolumes ccs-write-behind*
>>
>> *75: end-volume*
>>
>> 76:
>>
>>  77: volume ccs-read-ahead
>>
>> 78:     type performance/read-ahead
>>
>> 79:     subvolumes ccs-disk-usage
>>
>> 80: end-volume
>>
>>
>>
>>
>>
>>
>>
>> Ps. Part of  Our new translator code
>>
>>
>>
>> int
>>
>> du_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
>>
>>             struct iovec *vector, int count, off_t off, uint32_t flags,
>>
>>             struct iobref *iobref, dict_t *xdata)
>>
>> {
>>
>>     int           op_errno = -1;
>>
>>     int           ret = -1;
>>
>>     du_local_t  *local = NULL;
>>
>>     loc_t          tmp_loc      = {0,};
>>
>>
>>
>>     VALIDATE_OR_GOTO (frame, err);
>>
>>     VALIDATE_OR_GOTO (this, err);
>>
>>     VALIDATE_OR_GOTO (fd, err);
>>
>>
>>
>>     tmp_loc.gfid[15] = 1;
>>
>>     tmp_loc.inode = fd->inode;
>>
>>     tmp_loc.parent = fd->inode;
>>
>>     local = du_local_init (frame, &tmp_loc, fd, GF_FOP_WRITE);
>>
>>     if (!local) {
>>
>>
>>
>>             op_errno = ENOMEM;
>>
>>             goto err;
>>
>>     }
>>
>>     local->vector = iov_dup (vector, count);
>>
>>     local->offset = off;
>>
>>     local->count = count;
>>
>>     local->flags = flags;
>>
>>     local->iobref = iobref_ref (iobref);
>>
>>
>>
>>     ret = synctask_new(this->ctx->env,
>> du_get_du_info,du_writev_resume,frame,frame);
>>
>>     if(ret)
>>
>>     {
>>
>>             op_errno = -1;
>>
>>             gf_log (this->name, GF_LOG_WARNING,"synctask_new return
>> failure ret(%d)  ",ret);
>>
>>             goto err;
>>
>>     }
>>
>>     return 0;
>>
>> err:
>>
>>     op_errno = (op_errno == -1) ? errno : op_errno;
>>
>>     DU_STACK_UNWIND (writev, frame, -1, op_errno, NULL, NULL, NULL);
>>
>>     return 0;
>>
>> }
>>
>>
>>
>> Br,
>>
>> Li Deqian
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20190130/43399621/attachment.html>