[Gluster-Maintainers] [Gluster-devel] Memory overwrites due to processing vol files???

FNU Raghavendra Manjunath rabhat at redhat.com
Wed Oct 3 16:58:08 UTC 2018


On Fri, Sep 28, 2018 at 4:01 PM Shyam Ranganathan <srangana at redhat.com>
wrote:

> We tested with ASAN and without the fix at [1], and it consistently
> crashes at the mdcache xlator when brick mux is enabled.
> On 09/28/2018 03:50 PM, FNU Raghavendra Manjunath wrote:
> >
> > I was looking into the issue, and this is what I could find while
> > working with Shyam.
> >
> > There are 2 things here.
> >
> > 1) The multiplexed brick process for the snapshot(s) getting the client
> > volfile (I suspect this happened when the restore operation was
> >      performed).
> > 2) Memory corruption happening while the multiplexed brick process is
> > building the graph (for the client
> >      volfile it got above)
> >
> > I have been able to reproduce the issue on my local computer once, when
> > I ran the test case tests/bugs/snapshot/bug-1275616.t
> >
> > Upon comparison, we found that the backtrace of the core I got and the
> > core generated in the regression runs were similar.
> > In fact, the victim information Shyam mentioned before is also similar
> > in the core that I was able to get.
> >
> > On top of that, when the brick process was run under valgrind, it
> > reported the following memory corruption:
> >
> > ==31257== Conditional jump or move depends on uninitialised value(s)
> > ==31257==    at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)
> > ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
> > ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
> > ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
> > ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> > ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> > ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> > ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> > ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> > ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
> > ==31257==    by 0x518084E: rpc_transport_notify (rpc-transport.c:525)
> > ==31257==    by 0x123273DF: socket_event_poll_in (socket.c:2504)
> > ==31257==  Uninitialised value was created by a heap allocation
> > ==31257==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
> > ==31257==    by 0x4E9F58E: __gf_malloc (mem-pool.c:136)
> > ==31257==    by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)
> > ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
> > ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
> > ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
> > ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
> > ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
> > ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
> > ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
> > ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
> > ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
> >
> > Based on the above observations, I think the below patch by Shyam
> > should fix the crash.
>
> [1]
>
> > https://review.gluster.org/#/c/glusterfs/+/21299/
> >
> > But I am still trying to understand why a brick process should get a
> > client volfile (i.e. the 1st issue mentioned above).
> >
>

It was glusterd that was serving the client volfile instead of the brick
volfile.

The following patch has been submitted for review to address the cause of
this problem.

https://review.gluster.org/#/c/glusterfs/+/21314/

Regards,
Raghavendra


> > Regards,
> > Raghavendra
> >
> > On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan <srangana at redhat.com
> > <mailto:srangana at redhat.com>> wrote:
> >
> >     On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> >     > 2. Testing dashboard to maintain release health (new, thanks Nigel)
> >     >   - Dashboard at [2]
> >     >   - We already have 3 failures here as follows, needs attention
> from
> >     > appropriate *maintainers*,
> >     >     (a)
> >     >
> >
> https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText
> >     >       - Failed with core:
> >     ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t
> >     >     (b)
> >     >
> >
> https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText
> >     >       - Failed with core: ./tests/bugs/snapshot/bug-1275616.t
> >     >       - Also test ./tests/bugs/glusterd/validating-server-quorum.t
> >     had to be
> >     > retried
> >
> >     I was looking at the cores from the above 2 instances; the one in
> >     job 873 shows a typical pattern, where malloc fails because there
> >     is internal header corruption in the free bins.
> >
> >     When examining the victim chunk that would have been allocated, it
> >     often carries an incorrect size and other bad magic information. If
> >     the data in the victim is inspected, it looks like a volfile.
> >
> >     With the crash in 871, I thought there may be a point where this is
> >     detected earlier, but I was not able to make headway on that.
> >
> >     So, what could be corrupting this memory, and is it happening when
> >     the graph is being processed? Can we run this with ASAN or similar?
> >     (I have not tried, but need pointers if anyone has run tests with
> >     ASAN.)
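For anyone wanting to try this: ASAN needs no code changes, only compile and link flags. The flags below are generic gcc/clang options; `--enable-debug` and `run-tests.sh` are assumed from a standard glusterfs source checkout, so treat this as a sketch rather than a verified recipe.

```shell
# Build glusterfs with AddressSanitizer instrumentation.
CFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=address" \
./configure --enable-debug
make -j4 && make install

# Re-run the failing test; ASAN aborts at the first invalid access and
# prints the offending stack instead of letting corruption propagate.
ASAN_OPTIONS=log_path=/var/log/glusterfs/asan \
    ./run-tests.sh tests/bugs/snapshot/bug-1275616.t
```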
> >
> >     Here is the (brief) stack analysis of the core in 873:
> >     NOTE: we need to start avoiding flushing the logs when we are dumping
> >     core, as that leads to more memory allocations and causes a sort of
> >     double fault in such cases.
> >
> >     Core was generated by `/build/install/sbin/glusterfsd -s
> >     builder101.cloud.gluster.org <http://builder101.cloud.gluster.org>
> >     --volfile-id /sn'.
> >     Program terminated with signal 6, Aborted.
> >     #0  0x00007f23cf590277 in __GI_raise (sig=sig at entry=6) at
> >     ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> >     56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> >     (gdb) bt
> >     #0  0x00007f23cf590277 in __GI_raise (sig=sig at entry=6) at
> >     ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> >     #1  0x00007f23cf591968 in __GI_abort () at abort.c:90
> >     #2  0x00007f23cf5d2d37 in __libc_message (do_abort=do_abort at entry=2,
> >     fmt=fmt at entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at
> >     ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> >     #3  0x00007f23cf5db499 in malloc_printerr (ar_ptr=0x7f23bc000020,
> >     ptr=<optimized out>, str=0x7f23cf6e4ea8 "free(): corrupted unsorted
> >     chunks", action=3) at malloc.c:5025
> >     #4  _int_free (av=0x7f23bc000020, p=<optimized out>, have_lock=0) at
> >     malloc.c:3847
> >     #5  0x00007f23d0f7c6e4 in __gf_free (free_ptr=0x7f23bc0a56a0) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.c:356
> >     #6  0x00007f23d0f41821 in log_buf_destroy (buf=0x7f23bc0a5568) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:358
> >     #7  0x00007f23d0f44e55 in gf_log_flush_list (copy=0x7f23c404a290,
> >     ctx=0x1ff6010) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1739
> >     #8  0x00007f23d0f45081 in gf_log_flush_extra_msgs (ctx=0x1ff6010,
> new=0)
> >     at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1807
> >     #9  0x00007f23d0f4162d in gf_log_set_log_buf_size (buf_size=0) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:290
> >     #10 0x00007f23d0f41acc in gf_log_disable_suppression_before_exit
> >     (ctx=0x1ff6010) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:444
> >     #11 0x00007f23d0f4c027 in gf_print_trace (signum=6, ctx=0x1ff6010) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/common-utils.c:922
> >     #12 0x000000000040a84a in glusterfsd_print_trace (signum=6) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2316
> >     #13 <signal handler called>
> >     #14 0x00007f23cf590277 in __GI_raise (sig=sig at entry=6) at
> >     ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> >     #15 0x00007f23cf591968 in __GI_abort () at abort.c:90
> >     #16 0x00007f23cf5d2d37 in __libc_message (do_abort=2,
> >     fmt=fmt at entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at
> >     ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> >     #17 0x00007f23cf5dcc86 in malloc_printerr (ar_ptr=0x7f23bc000020,
> >     ptr=0x7f23bc003cd0, str=0x7f23cf6e245b "malloc(): memory corruption",
> >     action=<optimized out>) at malloc.c:5025
> >     #18 _int_malloc (av=av at entry=0x7f23bc000020, bytes=bytes at entry=15664)
> at
> >     malloc.c:3473
> >     #19 0x00007f23cf5df84c in __GI___libc_malloc (bytes=15664) at
> >     malloc.c:2899
> >     #20 0x00007f23d0f3bbbf in __gf_default_malloc (size=15664) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.h:106
> >     #21 0x00007f23d0f3f02f in xlator_mem_acct_init (xl=0x7f23bc082b20,
> >     num_types=163) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:800
> >     #22 0x00007f23b90a37bf in mem_acct_init (this=0x7f23bc082b20) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/performance/open-behind/src/open-behind.c:1189
> >     #23 0x00007f23d0f3ebe8 in xlator_init (xl=0x7f23bc082b20) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:700
> >     #24 0x00007f23d0f8fb5f in glusterfs_graph_init
> (graph=0x7f23bc010570) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:359
> >     #25 0x00007f23d0f907ac in glusterfs_graph_activate
> >     (graph=0x7f23bc010570, ctx=0x1ff6010) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:722
> >     #26 0x000000000040af89 in glusterfs_process_volfp (ctx=0x1ff6010,
> >     fp=0x7f23bc00b0a0) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2528
> >     #27 0x000000000041094c in mgmt_getspec_cbk (req=0x7f23a4004f78,
> >     iov=0x7f23a4004fb8, count=1, myframe=0x7f23a4002b88)
> >         at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd-mgmt.c:2076
> >     #28 0x00007f23d0d0617d in rpc_clnt_handle_reply (clnt=0x2077910,
> >     pollin=0x7f23bc001e80) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:755
> >     #29 0x00007f23d0d066ad in rpc_clnt_notify (trans=0x2077c70,
> >     mydata=0x2077940, event=RPC_TRANSPORT_MSG_RECEIVED,
> data=0x7f23bc001e80)
> >         at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:923
> >     #30 0x00007f23d0d02895 in rpc_transport_notify (this=0x2077c70,
> >     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80)
> >         at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-transport.c:525
> >     #31 0x00007f23c5b143ff in socket_event_poll_in (this=0x2077c70,
> >     notify_handled=true) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2504
> >     #32 0x00007f23c5b153e0 in socket_event_handler (fd=9, idx=1, gen=1,
> >     data=0x2077c70, poll_in=1, poll_out=0, poll_err=0)
> >         at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2905
> >     #33 0x00007f23d0fbd3bc in event_dispatch_epoll_handler
> >     (event_pool=0x202dc40, event=0x7f23c404bea0) at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:591
> >     #34 0x00007f23d0fbd6b5 in event_dispatch_epoll_worker
> (data=0x2079470)
> >     at
> >
>  /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:668
> >     #35 0x00007f23cfd8fe25 in start_thread (arg=0x7f23c404c700) at
> >     pthread_create.c:308
> >     #36 0x00007f23cf658bad in clone () at
> >     ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> >
> >     (gdb) p victim
> >     $1 = (struct malloc_chunk *) 0x7f23bc003cc0
> >
> >     (gdb) x/16c (char *)victim - 16
> >     0x7f23bc003cb0: 54 '6'  48 '0'  57 '9'  53 '5'  13 '\r' -16 '\360'
> >         -83
> >     '\255'  -70 '\272'
> >     0x7f23bc003cb8: 56 '8'  57 '9'  51 '3'  48 '0'  50 '2'  99 'c'  99
> >     'c'  55 '7'
> >     (gdb)
> >     0x7f23bc003cc0: 50 '2'  52 '4'  47 '/'  98 'b'  114 'r' 105 'i' 99
> >     'c'  107 'k'
> >     0x7f23bc003cc8: 33 '!'  4 '\004'        115 's' 101 'e' 99 'c'  117
> >     'u' 114 'r'
> >     105 'i'
> >     (gdb)
> >     0x7f23bc003cd0: 116 't' 121 'y' 46 '.'  99 'c'  97 'a'  112 'p' 97
> >     'a'  98 'b'
> >     0x7f23bc003cd8: 105 'i' 108 'l' 105 'i' 116 't' 121 'y' 44 ','  115
> 's'
> >     101 'e'
> >     (gdb)
> >     0x7f23bc003ce0: 99 'c'  117 'u' 114 'r' 105 'i' 116 't' 121 'y' 46
> '.'
> >     105 'i'
> >     0x7f23bc003ce8: 109 'm' 97 'a'  44 ','  117 'u' 115 's' 101 'e' 114
> >     'r' 46 '.'
> >     (gdb)
> >     0x7f23bc003cf0: 115 's' 119 'w' 105 'i' 102 'f' 116 't' 46 '.'  109
> 'm'
> >     101 'e'
> >     0x7f23bc003cf8: 116 't' 97 'a'  100 'd' 97 'a'  116 't' 97 'a'  44
> >     ','  0 '\000'
> >     _______________________________________________
> >     Gluster-devel mailing list
> >     Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> >     https://lists.gluster.org/mailman/listinfo/gluster-devel
> >
>