[Bugs] [Bug 1373630] New: Unable to profile GlusterFS FUSE client with Valgrind's Massif tool

Tue Sep 6 19:28:32 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1373630

            Bug ID: 1373630
           Summary: Unable to profile GlusterFS FUSE client with
                    Valgrind's Massif tool
           Product: GlusterFS
           Version: 3.7.15
         Component: fuse
          Severity: low
          Assignee: bugs at gluster.org
          Reporter: oleksandr at natalenko.name
                CC: bugs at gluster.org

Description of problem:

In order to find out why the GlusterFS FUSE client leaks memory, I would like to use
Valgrind's Massif tool (because Memcheck does not show any meaningful leaks).
So I install the GlusterFS packages plus their debug packages and run the following:

===
valgrind --tool=massif --smc-check=all --trace-children=yes \
    --sim-hints=fuse-compatible /usr/sbin/glusterfs -N \
    --volfile-server=glusterfs.example.com --volfile-id=some_volume \
    /mnt/net/glusterfs/test
===

This command immediately produces the following output:

===
==25482== Massif, a heap profiler
==25482== Copyright (C) 2003-2013, and GNU GPL'd, by Nicholas Nethercote
==25482== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==25482== Command: /usr/sbin/glusterfs -N --volfile-server=glusterfs.la.net.ua --volfile-id=mail_boxes /mnt/net/glusterfs/test
==25482== 
==25483== 
==25484==
===

Immediately after this, Valgrind also generates two files:

===
-rw------- 1 root root  20K Sep  6 22:17 massif.out.25483
-rw------- 1 root root 9.1K Sep  6 22:17 massif.out.25484
===

(both files are attached).

Then I start manipulating files within the mounted volume to provoke the memory
leak. After a while, once I can see memory usage growing in the top/htop
output, I unmount the volume to get my memory profile:

===
umount /mnt/net/glusterfs/test
===

Right after this command is executed, Valgrind shows me the following:

===
valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 1, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

host stacktrace:
==25482==    at 0x3802FC56: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3802FD84: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3802FF06: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3803D5E1: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3807F6C5: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x380349EF: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x38034D53: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3808E2D4: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x3808E55A: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0x380B5B0D: ??? (in /usr/lib64/valgrind/massif-amd64-linux)
==25482==    by 0xDEADBEEFDEADBEEE: ???
==25482==    by 0xDEADBEEFDEADBEEE: ???
==25482==    by 0xDEADBEEFDEADBEEE: ???

sched status:
  running_tid=3

Thread 3: status = VgTs_Runnable
==25482==    at 0x4C29037: free (in /usr/lib64/valgrind/vgpreload_massif-amd64-linux.so)
==25482==    by 0x67CE63B: __libc_freeres (in /usr/lib64/libc-2.17.so)
==25482==    by 0x4A246B4: _vgnU_freeres (in /usr/lib64/valgrind/vgpreload_core-amd64-linux.so)
==25482==    by 0x66A2E2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==25482==    by 0x66A2EB4: exit (in /usr/lib64/libc-2.17.so)
==25482==    by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308)
==25482==    by 0x111914: glusterfs_sigwaiter (glusterfsd.c:2029)
==25482==    by 0x606DDC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==25482==    by 0x6760CEC: clone (in /usr/lib64/libc-2.17.so)

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.
===

I can clearly see something going wrong in PID 25482. Okay, let's check Valgrind's
output:

===
-rw------- 1 root root  20K Sep  6 22:17 massif.out.25483
-rw------- 1 root root 9.1K Sep  6 22:17 massif.out.25484
===

No changes! The files were not updated, and no output for the misbehaving PID
25482 appeared!

I see the 0xDEADBEEFDEADBEEE pattern in Valgrind's output, which I take to mean
that some memory gets corrupted.

Okay, let's re-run Valgrind with the Memcheck tool, as the output above
suggests:

===
valgrind --leak-check=full --show-leak-kinds=all --log-file="valgrind_fuse.log" \
    /usr/sbin/glusterfs -N --volfile-server=glusterfs.example.com \
    --volfile-id=some_volume /mnt/net/glusterfs/test
===

valgrind_fuse.log is attached as well. In it, I noticed the following
warnings/errors for the main PID:

===
==26441== Thread 7:
==26441== Syscall param writev(vector[...]) points to uninitialised byte(s)
==26441==    at 0x675FEA0: writev (in /usr/lib64/libc-2.17.so)
==26441==    by 0xE664795: send_fuse_iov (fuse-bridge.c:158)
==26441==    by 0xE6649B9: send_fuse_data (fuse-bridge.c:197)
==26441==    by 0xE666F7A: fuse_attr_cbk (fuse-bridge.c:753)
==26441==    by 0xE6671A6: fuse_root_lookup_cbk (fuse-bridge.c:783)
==26441==    by 0x1451A937: io_stats_lookup_cbk (io-stats.c:1512)
==26441==    by 0x14301B3E: mdc_lookup_cbk (md-cache.c:867)
==26441==    by 0x13EEA226: qr_lookup_cbk (quick-read.c:446)
==26441==    by 0x13CD9B66: ioc_lookup_cbk (io-cache.c:260)
==26441==    by 0x1346515D: dht_revalidate_cbk (dht-common.c:985)
==26441==    by 0x1320F0F0: afr_discover_done (afr-common.c:2429)
==26441==    by 0x1320F0F0: afr_discover_cbk (afr-common.c:2474)
==26441==    by 0x12F9B6F8: client3_3_lookup_cbk (client-rpc-fops.c:2988)
==26441==  Address 0x168b538c is on thread 7's stack
==26441==  in frame #3, created by fuse_attr_cbk (fuse-bridge.c:723)
==26441== 
==26441== Warning: invalid file descriptor -1 in syscall close()
==26441== Thread 3:
==26441== Invalid free() / delete / delete[] / realloc()
==26441==    at 0x4C2AD17: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==26441==    by 0x67D663B: __libc_freeres (in /usr/lib64/libc-2.17.so)
==26441==    by 0x4A246B4: _vgnU_freeres (in /usr/lib64/valgrind/vgpreload_core-amd64-linux.so)
==26441==    by 0x66AAE2A: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==26441==    by 0x66AAEB4: exit (in /usr/lib64/libc-2.17.so)
==26441==    by 0x1117E9: cleanup_and_exit (glusterfsd.c:1308)
==26441==    by 0x111914: glusterfs_sigwaiter (glusterfsd.c:2029)
==26441==    by 0x6075DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==26441==    by 0x6768CEC: clone (in /usr/lib64/libc-2.17.so)
==26441==  Address 0x6a2d3d0 is 0 bytes inside data symbol "noai6ai_cached"
===

Could this be the reason Massif fails?

Version-Release number of selected component (if applicable):

GlusterFS 3.7.15, CentOS 7.2.

How reproducible:

Always.

Steps to Reproduce:

See above.

Actual results:

The Massif tool does not produce usable output.

Expected results:

I expect to get a memory profile of the GlusterFS FUSE client.

Additional info:

Feel free to ask me for any additional info.
