[Bugs] [Bug 1354250] New: Gluster fuse client crashed generating core dump

bugzilla at redhat.com bugzilla at redhat.com
Mon Jul 11 04:21:47 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1354250

            Bug ID: 1354250
           Summary: Gluster fuse client crashed generating core dump
           Product: GlusterFS
           Version: 3.8.0
         Component: transport
          Severity: medium
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: bkunal at redhat.com, bugs at gluster.org, csaba at redhat.com,
                    rhs-bugs at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1343320, 1343374



+++ This bug was initially created as a clone of Bug #1343374 +++

+++ This bug was initially created as a clone of Bug #1343320 +++

Description of problem:
Client crash with core dump due to excessive memory consumption


Version-Release number of selected component (if applicable):

3.7.5-19.el7rhgs.x86_64
RHEL 5

Additional info:
Lots of DNS resolution errors were found in the client logs.

In the following logs I can see continuous error messages:
[2016-04-27 10:33:29.833969] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:32.843124] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:35.850581] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:38.858181] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:41.865251] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
The message "E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver:
getaddrinfo failed (Name or service not known)" repeated 39 times between
[2016-04-27 10:31:44.561995] and [2016-04-27 10:33:41.865245]
[2016-04-27 10:33:44.873510] E [MSGID: 101075]
[common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or
service not known)
[2016-04-27 10:33:44.873599] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:47.881687] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:50.890768] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1
.
.
.
.
.

and after some time (almost 27 hours later):

[2016-04-28 13:47:23.002272] E [socket.c:3124:socket_connect] 0-vol01-client-1:
pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:23.002528] E [socket.c:3126:socket_connect]
(-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65]
-->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a]
-->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-:
Assertion failed: 0
[2016-04-28 13:47:26.008762] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1

[2016-04-28 13:47:26.008933] E [socket.c:3124:socket_connect] 0-vol01-client-1:
pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:26.009134] E [socket.c:3126:socket_connect]
(-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65]
-->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a]
-->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-:
Assertion failed: 0
[2016-04-28 13:47:29.015862] E [name.c:242:af_inet_client_get_remote_sockaddr]
0-vol01-client-1: DNS resolution failed on host server1



.
.
This continued for almost a week, followed by:
.
.

[2016-05-15 04:12:17.272132] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no
memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
[2016-05-15 04:12:18.863904] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no
memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
.
.
.
.
.
.
[2016-05-15 04:12:31.572526] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no
memory available for size (124) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(mem_get+0xb8)[0x3bb805be98]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
.
.
.
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-05-15 04:12:31
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x338)[0x3bb8042378]
/lib64/libc.so.6[0x34f2030030]
/usr/lib64/libglusterfs.so.0(mem_get+0x6e)[0x3bb805be4e]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
/usr/lib64/libglusterfs.so.0(get_new_data+0x20)[0x3bb801f260]
/usr/lib64/libglusterfs.so.0(dict_unserialize+0xf4)[0x3bb801f374]
/usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so(client3_3_lookup_cbk+0x7bc)[0x2b2a741f5acc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa0)[0x3bb7c0fa70]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1b4)[0x3bb7c0fd34]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x3bb7c0b517]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a3f68]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a4994]
/usr/lib64/libglusterfs.so.0[0x3bb808b363]
/lib64/libpthread.so.0[0x34f280683d]
/lib64/libc.so.6(clone+0x6d)[0x34f20d4fcd]

--- Additional comment from Nithya Balachandran on 2016-06-07 04:50:10 EDT ---

RCA:

There is a memory leak in the socket_connect code in case of failure. 

In socket_connect ():

        /* if sock != -1, then cleanup is done from the event handler */
        if (ret == -1 && sock == -1) {
                /* Cleaup requires to send notification to upper layer which
                   intern holds the big_lock. There can be dead-lock situation
                   if big_lock is already held by the current thread. 
                   So transfer the ownership to seperate thread for cleanup.
                */      
                arg = GF_CALLOC (1, sizeof (*arg), 
                                 gf_sock_connect_error_state_t);
                arg->this = THIS; 
                arg->trans = this; 
                arg->refd = refd; 
                th_ret = pthread_create (&th_id, NULL,
                                         socket_connect_error_cbk,
                                         arg);
                if (th_ret) {
                        gf_log (this->name, GF_LOG_ERROR, "pthread_create"
                                "failed: %s", strerror(errno));
                        GF_FREE (arg);
                        GF_ASSERT (0);
                }       
        }       


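pthread_create is called here with NULL attributes, so the new thread is
joinable rather than detached. As pthread_join is never called on it, the
thread's resources are not released when it exits. socket_connect is called at
3 second intervals while the connection keeps failing, so the leaked resources
quickly add up and the process runs out of memory.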

--- Additional comment from Vijay Bellur on 2016-06-07 04:56:31 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not
cleanup up) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Nithya Balachandran on 2016-06-07 05:01:17 EDT ---

Fix:

Create a detached thread so all thread resources are cleaned up automatically.
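
For illustration, a minimal standalone sketch of creating a detached thread
with plain POSIX calls. This is not the committed patch, which may be written
differently. The names cleanup_arg, cleanup_cbk and spawn_detached_cleanup are
hypothetical stand-ins for gf_sock_connect_error_state_t,
socket_connect_error_cbk and the failure branch of socket_connect.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for gf_sock_connect_error_state_t. */
struct cleanup_arg {
        int id;
};

/* Hypothetical stand-in for socket_connect_error_cbk: the thread owns
 * its argument and frees it when the cleanup work is done. */
static void *
cleanup_cbk (void *data)
{
        struct cleanup_arg *arg = data;

        /* ... cleanup / notification work would go here ... */
        free (arg);
        return NULL;
}

/* Run cleanup_cbk in a detached thread.
 * Returns 0 on success or a pthread error number on failure. */
static int
spawn_detached_cleanup (int id)
{
        pthread_attr_t      attr;
        pthread_t           tid;
        struct cleanup_arg *arg = NULL;
        int                 ret = 0;

        arg = calloc (1, sizeof (*arg));
        if (!arg)
                return ENOMEM;
        arg->id = id;

        ret = pthread_attr_init (&attr);
        if (ret) {
                free (arg);
                return ret;
        }

        /* A detached thread releases its resources as soon as it exits;
         * no pthread_join is needed (or allowed) afterwards. */
        ret = pthread_attr_setdetachstate (&attr, PTHREAD_CREATE_DETACHED);
        if (ret == 0)
                ret = pthread_create (&tid, &attr, cleanup_cbk, arg);
        pthread_attr_destroy (&attr);

        if (ret) {
                /* pthread functions return the error number directly
                 * instead of setting errno. */
                fprintf (stderr, "pthread_create failed: %s\n",
                         strerror (ret));
                free (arg);
        }
        return ret;
}
```

Calling spawn_detached_cleanup () on every failed connect attempt should not
accumulate thread resources, because each thread is reclaimed when it returns.
An alternative with the same effect would be to keep the default attributes
and call pthread_detach () on the new thread id after a successful
pthread_create ().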

--- Additional comment from Vijay Bellur on 2016-06-07 05:09:38 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not
cleaned up) posted (#2) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:18:43 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not
cleaned up) posted (#1) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:22:40 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not
cleaned up) posted (#2) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:54:26 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not
cleaned up) posted (#3) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 16:17:16 EDT ---

COMMIT: http://review.gluster.org/14875 committed in master by Jeff Darcy
(jdarcy at redhat.com) 
------
commit 9886d568a7a8839bf3acc81cb1111fa372ac5270
Author: N Balachandran <nbalacha at redhat.com>
Date:   Fri Jul 8 10:46:46 2016 +0530

    rpc/socket: pthread resources are not cleaned up

    A socket_connect failure creates a new pthread which
    is not a detached thread. As no pthread_join is called,
    the thread resources are not cleaned up causing a memory leak.

    Now, socket_connect creates a detached thread to handle failure.

    Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    BUG: 1343374
    Signed-off-by: N Balachandran <nbalacha at redhat.com>
    Reviewed-on: http://review.gluster.org/14875
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj at redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1343320
[Bug 1343320] [GSS] Gluster fuse client crashed generating core dump
https://bugzilla.redhat.com/show_bug.cgi?id=1343374
[Bug 1343374] Gluster fuse client crashed generating core dump