<div dir="ltr"><div dir="ltr"><br><div><br></div><div><br></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Sep 28, 2018 at 4:01 PM Shyam Ranganathan &lt;<a href="mailto:srangana@redhat.com">srangana@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We tested with ASAN and without the fix at [1], and it consistently<br>
crashes at the mdcache xlator when brick mux is enabled.<br>
On 09/28/2018 03:50 PM, FNU Raghavendra Manjunath wrote:<br>
&gt; <br>
&gt; I was looking into the issue, and this is what I found while<br>
&gt; working with Shyam.<br>
&gt; <br>
&gt; There are 2 things here.<br>
&gt; <br>
&gt; 1) The multiplexed brick process for the snapshot(s) getting the client<br>
&gt; volfile (I suspect this happened<br>
&gt;      when the restore operation was performed).<br>
&gt; 2) Memory corruption happening while the multiplexed brick process is<br>
&gt; building the graph (for the client<br>
&gt;      volfile it got above)<br>
&gt; <br>
&gt; I was able to reproduce the issue on my local machine once, when<br>
&gt; I ran the test case tests/bugs/snapshot/bug-1275616.t<br>
&gt; <br>
&gt; Upon comparison, we found that the backtrace of the core I got and that of the<br>
&gt; core generated in the regression runs were similar.<br>
&gt; In fact, the victim information Shyam mentioned before is also similar<br>
&gt; in the core that I was able to get.<br>
&gt; <br>
&gt; On top of that, when the brick process was run under valgrind, it<br>
&gt; reported the following use of uninitialised memory:<br>
&gt; <br>
&gt; ==31257== Conditional jump or move depends on uninitialised value(s)<br>
&gt; ==31257==    at 0x1A7D0564: mdc_xattr_list_populate (md-cache.c:3127)<br>
&gt; ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)<br>
&gt; ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)<br>
&gt; ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)<br>
&gt; ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)<br>
&gt; ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)<br>
&gt; ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)<br>
&gt; ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)<br>
&gt; ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)<br>
&gt; ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)<br>
&gt; ==31257==    by 0x518084E: rpc_transport_notify (rpc-transport.c:525)<br>
&gt; ==31257==    by 0x123273DF: socket_event_poll_in (socket.c:2504)<br>
&gt; ==31257==  Uninitialised value was created by a heap allocation<br>
&gt; ==31257==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)<br>
&gt; ==31257==    by 0x4E9F58E: __gf_malloc (mem-pool.c:136)<br>
&gt; ==31257==    by 0x1A7D052A: mdc_xattr_list_populate (md-cache.c:3123)<br>
&gt; ==31257==    by 0x1A7D1903: mdc_init (md-cache.c:3486)<br>
&gt; ==31257==    by 0x4E62D41: __xlator_init (xlator.c:684)<br>
&gt; ==31257==    by 0x4E62E67: xlator_init (xlator.c:709)<br>
&gt; ==31257==    by 0x4EB2BEB: glusterfs_graph_init (graph.c:359)<br>
&gt; ==31257==    by 0x4EB37F8: glusterfs_graph_activate (graph.c:722)<br>
&gt; ==31257==    by 0x40AEC3: glusterfs_process_volfp (glusterfsd.c:2528)<br>
&gt; ==31257==    by 0x410868: mgmt_getspec_cbk (glusterfsd-mgmt.c:2076)<br>
&gt; ==31257==    by 0x518408D: rpc_clnt_handle_reply (rpc-clnt.c:755)<br>
&gt; ==31257==    by 0x51845C1: rpc_clnt_notify (rpc-clnt.c:923)<br>
&gt; <br>
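The pattern valgrind flags above (a heap buffer from malloc that is read before anything has been written to it) can be sketched in isolation as follows.<br>
This is a hypothetical reduction, not the md-cache code: join_names and its parameters are names made up for illustration, and the sketch only assumes that the list is built by appending to a freshly allocated string.<br>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the md-cache code): join `count` names into one
 * comma-terminated list, e.g. "a,b,".
 * Returns a heap string the caller must free(), or NULL on failure.
 */
char *
join_names(const char *const names[], size_t count)
{
    size_t total = 1; /* room for the terminating NUL */
    size_t i;
    char *list;

    assert(names != NULL || count == 0);

    for (i = 0; i < count; i++)
        total += strlen(names[i]) + 1; /* name plus trailing comma */

    list = malloc(total);
    if (list == NULL)
        return NULL;

    /*
     * The crucial line. malloc() does not zero memory, so without this the
     * first strcat() would scan stale heap bytes looking for a NUL and could
     * start writing at an arbitrary offset, possibly past the end of the
     * buffer and into the next chunk's header.
     */
    list[0] = '\0';

    for (i = 0; i < count; i++) {
        strcat(list, names[i]);
        strcat(list, ",");
    }
    return list;
}
```

Called with the three names that are visible in the gdb dump further down this thread, the helper returns "security.capability,security.ima,user.swift.metadata,". Dropping the list[0] = '\0' line is enough to make valgrind report a conditional jump on an uninitialised value inside strcat.<br>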
&gt; Based on the above observations, I think the patch below by Shyam<br>
&gt; should fix the crash.<br>
<br>
[1]<br>
<br>
&gt; <a href="https://review.gluster.org/#/c/glusterfs/+/21299/" rel="noreferrer" target="_blank">https://review.gluster.org/#/c/glusterfs/+/21299/</a><br>
&gt; <br>
&gt; However, I am still trying to understand why a brick process would get a<br>
&gt; client volfile (i.e. the first issue mentioned above).<br>
&gt; <br></blockquote><div><br></div><div>It was glusterd which was giving the client volfile instead of the brick volfile. </div><div><br></div><div>The following patch has been submitted for review to address the cause of this problem.</div><div><br></div><div><a href="https://review.gluster.org/#/c/glusterfs/+/21314/">https://review.gluster.org/#/c/glusterfs/+/21314/</a></div><div><br></div><div>Regards,</div><div>Raghavendra</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
&gt; Regards,<br>
&gt; Raghavendra<br>
&gt; <br>
&gt; On Wed, Sep 26, 2018 at 9:00 PM Shyam Ranganathan &lt;<a href="mailto:srangana@redhat.com" target="_blank">srangana@redhat.com</a>&gt; wrote:<br>
&gt; <br>
&gt;     On 09/26/2018 10:21 AM, Shyam Ranganathan wrote:<br>
&gt;     &gt; 2. Testing dashboard to maintain release health (new, thanks Nigel)<br>
&gt;     &gt;   - Dashboard at [2]<br>
&gt;     &gt;   - We already have 3 failures here as follows; they need attention from the<br>
&gt;     &gt; appropriate *maintainers*,<br>
&gt;     &gt;     (a)<br>
&gt;     &gt;<br>
&gt;     <a href="https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/871/consoleText</a><br>
&gt;     &gt;       - Failed with core:<br>
&gt;     ./tests/basic/afr/gfid-mismatch-resolution-with-cli.t<br>
&gt;     &gt;     (b)<br>
&gt;     &gt;<br>
&gt;     <a href="https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText" rel="noreferrer" target="_blank">https://build.gluster.org/job/regression-test-with-multiplex/873/consoleText</a><br>
&gt;     &gt;       - Failed with core: ./tests/bugs/snapshot/bug-1275616.t<br>
&gt;     &gt;       - Also test ./tests/bugs/glusterd/validating-server-quorum.t<br>
&gt;     had to be<br>
&gt;     &gt; retried<br>
&gt; <br>
&gt;     I was looking at the cores from the above 2 instances. The one in job<br>
&gt;     873 has been a typical pattern, where malloc fails because there is internal<br>
&gt;     header corruption in the free bins.<br>
&gt; <br>
&gt;     When examining the victim that would have been allocated, it often<br>
&gt;     carries an incorrect size and other magic information. If the data in the<br>
&gt;     victim is inspected, it looks like a volfile.<br>
&gt; <br>
&gt;     With the crash in 871, I thought there may be a point where this is<br>
&gt;     detected earlier, but I have not been able to make headway on that.<br>
&gt; <br>
&gt;     So, what could be corrupting this memory, and does it happen when the graph is<br>
&gt;     being processed? Can we run this with ASAN or similar? (I have not tried,<br>
&gt;     and would need pointers from anyone who has run tests with ASAN.)<br>
&gt; <br>
&gt;     Here is the (brief) stack analysis of the core in 873:<br>
&gt;     NOTE: we need to start avoiding flushing the logs when we are dumping<br>
&gt;     core, as that leads to more memory allocations and causes a sort of<br>
&gt;     double fault in such cases.<br>
&gt; <br>
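One way to picture the NOTE above is a crash handler that never touches the heap: it formats a short line into a stack buffer, emits it with write(2), then restores the default action and re-raises so the core is still produced.<br>
This is only a sketch under that assumption, not the glusterfsd handler; format_crash_line, crash_handler, install_crash_handler and CRASH_LINE_MAX are hypothetical names.<br>

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define CRASH_LINE_MAX 48

static_assert(CRASH_LINE_MAX >= 32, "crash line buffer too small");

/*
 * Hypothetical helper: write "crash: signal N\n" into a caller-supplied
 * buffer without malloc or stdio. Returns the number of bytes written
 * (excluding the terminating NUL), or 0 if `cap` is too small.
 */
size_t
format_crash_line(int signum, char *buf, size_t cap)
{
    static const char prefix[] = "crash: signal ";
    char digits[12];
    size_t n = 0;
    size_t d = 0;
    unsigned int v = (signum < 0) ? 0u : (unsigned int)signum;

    if (cap < sizeof(prefix) + sizeof(digits) + 1)
        return 0;

    memcpy(buf, prefix, sizeof(prefix) - 1);
    n = sizeof(prefix) - 1;

    /* Convert the number by hand; snprintf() is not async-signal-safe. */
    do {
        digits[d++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (d > 0)
        buf[n++] = digits[--d];

    buf[n++] = '\n';
    buf[n] = '\0';
    return n;
}

static void
crash_handler(int signum)
{
    char line[CRASH_LINE_MAX];
    size_t len = format_crash_line(signum, line, sizeof(line));
    ssize_t rc;

    /* write(2) is async-signal-safe and allocates nothing. */
    if (len > 0) {
        rc = write(STDERR_FILENO, line, len);
        (void)rc;
    }

    /* Restore the default action and re-raise so a core is still dumped. */
    signal(signum, SIG_DFL);
    raise(signum);
}

/* Install the handler for SIGABRT; returns 0 on success, -1 on error. */
int
install_crash_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGABRT, &sa, NULL);
}
```

Because nothing in the handler calls malloc or free, a heap that is already corrupted cannot trigger a second abort from inside the handler, which is the double fault visible in frames #0 to #12 of the backtrace below.<br>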
&gt;     Core was generated by `/build/install/sbin/glusterfsd -s<br>
&gt;     <a href="http://builder101.cloud.gluster.org" rel="noreferrer" target="_blank">builder101.cloud.gluster.org</a><br>
&gt;     --volfile-id /sn&#39;.<br>
&gt;     Program terminated with signal 6, Aborted.<br>
&gt;     #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at<br>
&gt;     ../nptl/sysdeps/unix/sysv/linux/raise.c:56<br>
&gt;     56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);<br>
&gt;     (gdb) bt<br>
&gt;     #0  0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at<br>
&gt;     ../nptl/sysdeps/unix/sysv/linux/raise.c:56<br>
&gt;     #1  0x00007f23cf591968 in __GI_abort () at abort.c:90<br>
&gt;     #2  0x00007f23cf5d2d37 in __libc_message (do_abort=do_abort@entry=2,<br>
&gt;     fmt=fmt@entry=0x7f23cf6e4d58 &quot;*** Error in `%s&#39;: %s: 0x%s ***\n&quot;) at<br>
&gt;     ../sysdeps/unix/sysv/linux/libc_fatal.c:196<br>
&gt;     #3  0x00007f23cf5db499 in malloc_printerr (ar_ptr=0x7f23bc000020,<br>
&gt;     ptr=&lt;optimized out&gt;, str=0x7f23cf6e4ea8 &quot;free(): corrupted unsorted<br>
&gt;     chunks&quot;, action=3) at malloc.c:5025<br>
&gt;     #4  _int_free (av=0x7f23bc000020, p=&lt;optimized out&gt;, have_lock=0) at<br>
&gt;     malloc.c:3847<br>
&gt;     #5  0x00007f23d0f7c6e4 in __gf_free (free_ptr=0x7f23bc0a56a0) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.c:356<br>
&gt;     #6  0x00007f23d0f41821 in log_buf_destroy (buf=0x7f23bc0a5568) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:358<br>
&gt;     #7  0x00007f23d0f44e55 in gf_log_flush_list (copy=0x7f23c404a290,<br>
&gt;     ctx=0x1ff6010) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1739<br>
&gt;     #8  0x00007f23d0f45081 in gf_log_flush_extra_msgs (ctx=0x1ff6010, new=0)<br>
&gt;     at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:1807<br>
&gt;     #9  0x00007f23d0f4162d in gf_log_set_log_buf_size (buf_size=0) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:290<br>
&gt;     #10 0x00007f23d0f41acc in gf_log_disable_suppression_before_exit<br>
&gt;     (ctx=0x1ff6010) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/logging.c:444<br>
&gt;     #11 0x00007f23d0f4c027 in gf_print_trace (signum=6, ctx=0x1ff6010) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/common-utils.c:922<br>
&gt;     #12 0x000000000040a84a in glusterfsd_print_trace (signum=6) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2316<br>
&gt;     #13 &lt;signal handler called&gt;<br>
&gt;     #14 0x00007f23cf590277 in __GI_raise (sig=sig@entry=6) at<br>
&gt;     ../nptl/sysdeps/unix/sysv/linux/raise.c:56<br>
&gt;     #15 0x00007f23cf591968 in __GI_abort () at abort.c:90<br>
&gt;     #16 0x00007f23cf5d2d37 in __libc_message (do_abort=2,<br>
&gt;     fmt=fmt@entry=0x7f23cf6e4d58 &quot;*** Error in `%s&#39;: %s: 0x%s ***\n&quot;) at<br>
&gt;     ../sysdeps/unix/sysv/linux/libc_fatal.c:196<br>
&gt;     #17 0x00007f23cf5dcc86 in malloc_printerr (ar_ptr=0x7f23bc000020,<br>
&gt;     ptr=0x7f23bc003cd0, str=0x7f23cf6e245b &quot;malloc(): memory corruption&quot;,<br>
&gt;     action=&lt;optimized out&gt;) at malloc.c:5025<br>
&gt;     #18 _int_malloc (av=av@entry=0x7f23bc000020, bytes=bytes@entry=15664) at<br>
&gt;     malloc.c:3473<br>
&gt;     #19 0x00007f23cf5df84c in __GI___libc_malloc (bytes=15664) at<br>
&gt;     malloc.c:2899<br>
&gt;     #20 0x00007f23d0f3bbbf in __gf_default_malloc (size=15664) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/mem-pool.h:106<br>
&gt;     #21 0x00007f23d0f3f02f in xlator_mem_acct_init (xl=0x7f23bc082b20,<br>
&gt;     num_types=163) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:800<br>
&gt;     #22 0x00007f23b90a37bf in mem_acct_init (this=0x7f23bc082b20) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/xlators/performance/open-behind/src/open-behind.c:1189<br>
&gt;     #23 0x00007f23d0f3ebe8 in xlator_init (xl=0x7f23bc082b20) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/xlator.c:700<br>
&gt;     #24 0x00007f23d0f8fb5f in glusterfs_graph_init (graph=0x7f23bc010570) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:359<br>
&gt;     #25 0x00007f23d0f907ac in glusterfs_graph_activate<br>
&gt;     (graph=0x7f23bc010570, ctx=0x1ff6010) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/graph.c:722<br>
&gt;     #26 0x000000000040af89 in glusterfs_process_volfp (ctx=0x1ff6010,<br>
&gt;     fp=0x7f23bc00b0a0) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd.c:2528<br>
&gt;     #27 0x000000000041094c in mgmt_getspec_cbk (req=0x7f23a4004f78,<br>
&gt;     iov=0x7f23a4004fb8, count=1, myframe=0x7f23a4002b88)<br>
&gt;         at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/glusterfsd/src/glusterfsd-mgmt.c:2076<br>
&gt;     #28 0x00007f23d0d0617d in rpc_clnt_handle_reply (clnt=0x2077910,<br>
&gt;     pollin=0x7f23bc001e80) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:755<br>
&gt;     #29 0x00007f23d0d066ad in rpc_clnt_notify (trans=0x2077c70,<br>
&gt;     mydata=0x2077940, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80)<br>
&gt;         at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-clnt.c:923<br>
&gt;     #30 0x00007f23d0d02895 in rpc_transport_notify (this=0x2077c70,<br>
&gt;     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f23bc001e80)<br>
&gt;         at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-lib/src/rpc-transport.c:525<br>
&gt;     #31 0x00007f23c5b143ff in socket_event_poll_in (this=0x2077c70,<br>
&gt;     notify_handled=true) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2504<br>
&gt;     #32 0x00007f23c5b153e0 in socket_event_handler (fd=9, idx=1, gen=1,<br>
&gt;     data=0x2077c70, poll_in=1, poll_out=0, poll_err=0)<br>
&gt;         at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/rpc/rpc-transport/socket/src/socket.c:2905<br>
&gt;     #33 0x00007f23d0fbd3bc in event_dispatch_epoll_handler<br>
&gt;     (event_pool=0x202dc40, event=0x7f23c404bea0) at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:591<br>
&gt;     #34 0x00007f23d0fbd6b5 in event_dispatch_epoll_worker (data=0x2079470)<br>
&gt;     at<br>
&gt;     /home/jenkins/root/workspace/regression-test-with-multiplex/libglusterfs/src/event-epoll.c:668<br>
&gt;     #35 0x00007f23cfd8fe25 in start_thread (arg=0x7f23c404c700) at<br>
&gt;     pthread_create.c:308<br>
&gt;     #36 0x00007f23cf658bad in clone () at<br>
&gt;     ../sysdeps/unix/sysv/linux/x86_64/clone.S:113<br>
&gt; <br>
&gt;     (gdb) p victim<br>
&gt;     $1 = (struct malloc_chunk *) 0x7f23bc003cc0<br>
&gt; <br>
&gt;     (gdb) x/16c (char *)victim - 16<br>
&gt;     0x7f23bc003cb0: 54 &#39;6&#39;  48 &#39;0&#39;  57 &#39;9&#39;  53 &#39;5&#39;  13 &#39;\r&#39; -16 &#39;\360&#39; -83 &#39;\255&#39; -70 &#39;\272&#39;<br>
&gt;     0x7f23bc003cb8: 56 &#39;8&#39;  57 &#39;9&#39;  51 &#39;3&#39;  48 &#39;0&#39;  50 &#39;2&#39;  99 &#39;c&#39;  99 &#39;c&#39;  55 &#39;7&#39;<br>
&gt;     (gdb)<br>
&gt;     0x7f23bc003cc0: 50 &#39;2&#39;  52 &#39;4&#39;  47 &#39;/&#39;  98 &#39;b&#39;  114 &#39;r&#39; 105 &#39;i&#39; 99 &#39;c&#39;  107 &#39;k&#39;<br>
&gt;     0x7f23bc003cc8: 33 &#39;!&#39;  4 &#39;\004&#39; 115 &#39;s&#39; 101 &#39;e&#39; 99 &#39;c&#39;  117 &#39;u&#39; 114 &#39;r&#39; 105 &#39;i&#39;<br>
&gt;     (gdb)<br>
&gt;     0x7f23bc003cd0: 116 &#39;t&#39; 121 &#39;y&#39; 46 &#39;.&#39;  99 &#39;c&#39;  97 &#39;a&#39;  112 &#39;p&#39; 97 &#39;a&#39;  98 &#39;b&#39;<br>
&gt;     0x7f23bc003cd8: 105 &#39;i&#39; 108 &#39;l&#39; 105 &#39;i&#39; 116 &#39;t&#39; 121 &#39;y&#39; 44 &#39;,&#39;  115 &#39;s&#39; 101 &#39;e&#39;<br>
&gt;     (gdb)<br>
&gt;     0x7f23bc003ce0: 99 &#39;c&#39;  117 &#39;u&#39; 114 &#39;r&#39; 105 &#39;i&#39; 116 &#39;t&#39; 121 &#39;y&#39; 46 &#39;.&#39;  105 &#39;i&#39;<br>
&gt;     0x7f23bc003ce8: 109 &#39;m&#39; 97 &#39;a&#39;  44 &#39;,&#39;  117 &#39;u&#39; 115 &#39;s&#39; 101 &#39;e&#39; 114 &#39;r&#39; 46 &#39;.&#39;<br>
&gt;     (gdb)<br>
&gt;     0x7f23bc003cf0: 115 &#39;s&#39; 119 &#39;w&#39; 105 &#39;i&#39; 102 &#39;f&#39; 116 &#39;t&#39; 46 &#39;.&#39;  109 &#39;m&#39; 101 &#39;e&#39;<br>
&gt;     0x7f23bc003cf8: 116 &#39;t&#39; 97 &#39;a&#39;  100 &#39;d&#39; 97 &#39;a&#39;  116 &#39;t&#39; 97 &#39;a&#39;  44 &#39;,&#39;  0 &#39;\000&#39;<br>
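To make the dump above easier to read, here is a small sketch that reinterprets the first 32 bytes as little-endian integers, the way glibc would see them on x86_64.<br>
The byte values are copied from the gdb output; read_le and print_decoded are hypothetical helper names. It assumes the usual 64-bit glibc chunk layout, where the size field sits 8 bytes into struct malloc_chunk.<br>

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* The 32 bytes gdb printed starting at 0x7f23bc003cb0 (victim - 16). */
static const unsigned char dump[32] = {
    '6', '0', '9', '5', 0x0d, 0xf0, 0xad, 0xba, /* ...cb0 */
    '8', '9', '3', '0', '2', 'c', 'c', '7',     /* ...cb8 */
    '2', '4', '/', 'b', 'r', 'i', 'c', 'k',     /* ...cc0: victim, prev_size */
    0x21, 0x04, 's', 'e', 'c', 'u', 'r', 'i',   /* ...cc8: victim, size */
};

/* Read `len` (at most 8) bytes at `p` as a little-endian integer. */
uint64_t
read_le(const unsigned char *p, size_t len)
{
    uint64_t v = 0;
    size_t i;

    assert(len <= 8);
    for (i = len; i > 0; i--)
        v = (v << 8) | p[i - 1];
    return v;
}

/* Print the two words discussed in the note that follows. */
void
print_decoded(void)
{
    printf("word at +4:   0x%08llx\n",
           (unsigned long long)read_le(dump + 4, 4));
    printf("victim->size: 0x%016llx\n",
           (unsigned long long)read_le(dump + 24, 8));
}
```

The 32-bit word at offset 4 decodes to 0xbaadf00d, which looks like an allocator trailer magic (I believe libglusterfs uses this value for GF_MEM_TRAILER_MAGIC, but treat that as an assumption). The size field decodes to 0x6972756365730421: the low two bytes, 0x0421, look like a plausible chunk size with the in-use bit set, while the upper six bytes are the ASCII text "securi" from the xattr list, which is consistent with a string having been written over the chunk header.<br>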
&gt;     _______________________________________________<br>
&gt;     Gluster-devel mailing list<br>
&gt;     <a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a> &lt;mailto:<a href="mailto:Gluster-devel@gluster.org" target="_blank">Gluster-devel@gluster.org</a>&gt;<br>
&gt;     <a href="https://lists.gluster.org/mailman/listinfo/gluster-devel" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-devel</a><br>
&gt; <br>
</blockquote></div></div></div>