[Gluster-Maintainers] [Gluster-devel] Memory overwrites due to processing vol files???

FNU Raghavendra Manjunath rabhat at redhat.com
Fri Sep 28 19:50:51 UTC 2018


I was looking into the issue, and this is what I found while working with
Shyam.

There are two things here:

1) The multiplexed brick process for the snapshot(s) getting the client
   volfile (I suspect this happened when the restore operation was
   performed).
2) Memory corruption happening while the multiplexed brick process is
   building the graph (for the client volfile it got above).

I have been able to reproduce the issue on my local machine once, by
running the test case tests/bugs/snapshot/bug-1275616.t.

Upon comparison, we found that the backtraces of the core I got and of the
core generated in the regression runs were similar. In fact, the victim
information Shyam mentioned earlier is also similar in the core I was able
to get.

On top of that, when the brick process was run under Valgrind, it reported
the following use of uninitialised memory in md-cache:

==31257== Conditional jump or move depends on uninitialised value(s)
==31257==    at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)
==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
==31257==    by 0x518084E: rpc_transport_notify (rpc-transport.c:525)
==31257==    by 0x123273DF: socket_event_poll_in (socket.c:2504)
==31257==  Uninitialised value was created by a heap allocation
==31257==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==31257==    by 0x4E9F58E: __gf_malloc (mem-pool.c:136)
==31257==    by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)
==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)
==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)
==31257==    by 0x4E62E67: xlator_init (xlator.c:709)
==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)
==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)
==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)
==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)
==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)
==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)
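
To illustrate what Valgrind is complaining about, here is a minimal,
hypothetical C sketch (the function and parameter names are made up for
illustration; this is not the actual md-cache.c code): the buffer is
allocated with malloc(), so its contents are indeterminate, and the first
strcat() into it has to scan that garbage for a terminating NUL, so both
the read and the position of the eventual write depend on uninitialised
memory.

#include <stdlib.h>
#include <string.h>

/* Hypothetical simplification of the pattern flagged above; NOT the
 * actual mdc_xattr_list_populate() code. */
static char *
build_xattr_list_buggy (const char **keys, size_t n, size_t bufsize)
{
        char *list = malloc (bufsize);      /* contents indeterminate */
        if (!list)
                return NULL;

        for (size_t i = 0; i < n; i++) {
                /* strcat() first searches for a '\0' in uninitialised
                 * memory ("Conditional jump or move depends on
                 * uninitialised value(s)"), and the append can land far
                 * away from where it was meant to go. */
                strcat (list, keys[i]);
                strcat (list, ",");
        }
        return list;
}

static char *
build_xattr_list_fixed (const char **keys, size_t n, size_t bufsize)
{
        char *list = calloc (1, bufsize);   /* zero-filled: list[0] == '\0' */
        if (!list)
                return NULL;

        for (size_t i = 0; i < n; i++) {
                strncat (list, keys[i], bufsize - strlen (list) - 1);
                strncat (list, ",", bufsize - strlen (list) - 1);
        }
        return list;
}

The general remedy for this class of bug is to make sure the buffer starts
out as a valid empty string before anything is appended to it, either by
allocating it zero-filled or by NUL-terminating it right after the
allocation; please see the review linked below for what the patch actually
changes.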

Based on the above observations, I think the patch below by Shyam should
fix the crash:
https://review.gluster.org/#/c/glusterfs/+/21299/

But I am still trying to understand why a brick process should get a
client volfile (i.e. the first issue mentioned above).

Regards,
Raghavendra

On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan <srangana at redhat.com>
wrote:

> On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:
> > 2. Testing dashboard to maintain release health (new, thanks Nigel)
> >   - Dashboard at [2]
> >   - We already have 3 failures here as follows, needs attention from
> > appropriate *maintainers*,
> >     (a) https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText
> >       - Failed with core: ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t
> >     (b) https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText
> >       - Failed with core: ./tests/bugs/snapshot/bug-1275616.t
> >       - Also test ./tests/bugs/glusterd/validating-server-quorum.t had to be retried
>
> I was looking at the cores from the above 2 instances; the one in job
> 873 shows a typical pattern, where malloc fails because there is internal
> header corruption in the free bins.
>
> When examining the victim that would have been allocated, it often
> carries an incorrect size and other bad magic information. If the data in
> the victim is inspected, it looks like a volfile.
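
As a rough illustration of the failure mode described above, here is a
stand-alone, hypothetical C snippet (not GlusterFS code): a write past the
end of one heap allocation silently overwrites the malloc header of the
adjacent chunk, and glibc only aborts later, when that bookkeeping is next
consulted, which is why the crash surfaces inside unrelated calls such as
__gf_free() and the malloc() inside xlator_mem_acct_init() in the
backtrace below.

#include <stdlib.h>
#include <string.h>

/* Rough, deliberately broken illustration; not GlusterFS code. */
int
main (void)
{
        char *a = malloc (32);
        char *b = malloc (32);   /* 'b' plays the role of the "victim" */

        if (!a || !b)
                return 1;

        /* Writing 80 bytes into a 32-byte buffer runs over the rest of
         * a's chunk, the size/flags header of b, and into b's data,
         * which is why the victim chunk ends up holding foreign data
         * (here 'x's, in the core a volfile string). */
        memset (a, 'x', 80);

        /* glibc does not notice at the time of the overflow; it aborts
         * when its bookkeeping is next consulted (the exact message
         * varies: "free(): corrupted unsorted chunks",
         * "malloc(): memory corruption", ...). */
        free (b);
        free (a);
        return 0;
}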
>
> With the crash in 871, I thought there may be a point where this is
> detected earlier, but I have not been able to make headway on that.
>
> So, what could be corrupting this memory, and does it happen when the
> graph is being processed? Can we run this with ASAN or something similar?
> (I have not tried, but I need pointers if anyone has run tests with ASAN.)
>
> Here is the (brief) stack analysis of the core in 873:
> NOTE: we need to start avoiding flushing the logs when we are dumping
> core, as that leads to more memory allocations and causes a sort of
> double fault in such cases.
>
> Core was generated by `/build/install/sbin/glusterfsd -s builder101.cloud.gluster.org --volfile-id /sn'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> 56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f23cf591968 in __GI_abort () at abort.c:90
> #2  0x00007f23cf5d2d37 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> #3  0x00007f23cf5db499 in malloc_printerr (ar_ptr=0x7f23bc000020, ptr=<optimized out>, str=0x7f23cf6e4ea8 "free(): corrupted unsorted chunks", action=3) at malloc.c:5025
> #4  _int_free (av=0x7f23bc000020, p=<optimized out>, have_lock=0) at malloc.c:3847
> #5  0x00007f23d0f7c6e4 in __gf_free (free_ptr=0x7f23bc0a56a0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.c:356
> #6  0x00007f23d0f41821 in log_buf_destroy (buf=0x7f23bc0a5568) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:358
> #7  0x00007f23d0f44e55 in gf_log_flush_list (copy=0x7f23c404a290, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1739
> #8  0x00007f23d0f45081 in gf_log_flush_extra_msgs (ctx=0x1ff6010, new=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1807
> #9  0x00007f23d0f4162d in gf_log_set_log_buf_size (buf_size=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:290
> #10 0x00007f23d0f41acc in gf_log_disable_suppression_before_exit (ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:444
> #11 0x00007f23d0f4c027 in gf_print_trace (signum=6, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/common-utils.c:922
> #12 0x000000000040a84a in glusterfsd_print_trace (signum=6) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2316
> #13 <signal handler called>
> #14 0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #15 0x00007f23cf591968 in __GI_abort () at abort.c:90
> #16 0x00007f23cf5d2d37 in __libc_message (do_abort=2, fmt=fmt@entry=0x7f23cf6e4d58 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
> #17 0x00007f23cf5dcc86 in malloc_printerr (ar_ptr=0x7f23bc000020, ptr=0x7f23bc003cd0, str=0x7f23cf6e245b "malloc(): memory corruption", action=<optimized out>) at malloc.c:5025
> #18 _int_malloc (av=av@entry=0x7f23bc000020, bytes=bytes@entry=15664) at malloc.c:3473
> #19 0x00007f23cf5df84c in __GI___libc_malloc (bytes=15664) at malloc.c:2899
> #20 0x00007f23d0f3bbbf in __gf_default_malloc (size=15664) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.h:106
> #21 0x00007f23d0f3f02f in xlator_mem_acct_init (xl=0x7f23bc082b20, num_types=163) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:800
> #22 0x00007f23b90a37bf in mem_acct_init (this=0x7f23bc082b20) at /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/performance/open-behind/src/open-behind.c:1189
> #23 0x00007f23d0f3ebe8 in xlator_init (xl=0x7f23bc082b20) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:700
> #24 0x00007f23d0f8fb5f in glusterfs_graph_init (graph=0x7f23bc010570) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:359
> #25 0x00007f23d0f907ac in glusterfs_graph_activate (graph=0x7f23bc010570, ctx=0x1ff6010) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:722
> #26 0x000000000040af89 in glusterfs_process_volfp (ctx=0x1ff6010, fp=0x7f23bc00b0a0) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2528
> #27 0x000000000041094c in mgmt_getspec_cbk (req=0x7f23a4004f78, iov=0x7f23a4004fb8, count=1, myframe=0x7f23a4002b88) at /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd-mgmt.c:2076
> #28 0x00007f23d0d0617d in rpc_clnt_handle_reply (clnt=0x2077910, pollin=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:755
> #29 0x00007f23d0d066ad in rpc_clnt_notify (trans=0x2077c70, mydata=0x2077940, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:923
> #30 0x00007f23d0d02895 in rpc_transport_notify (this=0x2077c70, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-transport.c:525
> #31 0x00007f23c5b143ff in socket_event_poll_in (this=0x2077c70, notify_handled=true) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2504
> #32 0x00007f23c5b153e0 in socket_event_handler (fd=9, idx=1, gen=1, data=0x2077c70, poll_in=1, poll_out=0, poll_err=0) at /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2905
> #33 0x00007f23d0fbd3bc in event_dispatch_epoll_handler (event_pool=0x202dc40, event=0x7f23c404bea0) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:591
> #34 0x00007f23d0fbd6b5 in event_dispatch_epoll_worker (data=0x2079470) at /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:668
> #35 0x00007f23cfd8fe25 in start_thread (arg=0x7f23c404c700) at pthread_create.c:308
> #36 0x00007f23cf658bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>
> (gdb) p victim
> $1 = (struct malloc_chunk *) 0x7f23bc003cc0
>
> (gdb) x/16c (char *)victim - 16
> 0x7f23bc003cb0: 54 '6'  48 '0'  57 '9'  53 '5'  13 '\r'  -16 '\360'  -83 '\255'  -70 '\272'
> 0x7f23bc003cb8: 56 '8'  57 '9'  51 '3'  48 '0'  50 '2'  99 'c'  99 'c'  55 '7'
> (gdb)
> 0x7f23bc003cc0: 50 '2'  52 '4'  47 '/'  98 'b'  114 'r'  105 'i'  99 'c'  107 'k'
> 0x7f23bc003cc8: 33 '!'  4 '\004'  115 's'  101 'e'  99 'c'  117 'u'  114 'r'  105 'i'
> (gdb)
> 0x7f23bc003cd0: 116 't'  121 'y'  46 '.'  99 'c'  97 'a'  112 'p'  97 'a'  98 'b'
> 0x7f23bc003cd8: 105 'i'  108 'l'  105 'i'  116 't'  121 'y'  44 ','  115 's'  101 'e'
> (gdb)
> 0x7f23bc003ce0: 99 'c'  117 'u'  114 'r'  105 'i'  116 't'  121 'y'  46 '.'  105 'i'
> 0x7f23bc003ce8: 109 'm'  97 'a'  44 ','  117 'u'  115 's'  101 'e'  114 'r'  46 '.'
> (gdb)
> 0x7f23bc003cf0: 115 's'  119 'w'  105 'i'  102 'f'  116 't'  46 '.'  109 'm'  101 'e'
> 0x7f23bc003cf8: 116 't'  97 'a'  100 'd'  97 'a'  116 't'  97 'a'  44 ','  0 '\000'