[Bugs] [Bug 1311377] New: Memory leak in glusterd

bugzilla at redhat.com bugzilla at redhat.com
Wed Feb 24 05:46:31 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1311377

            Bug ID: 1311377
           Summary: Memory leak in glusterd
           Product: GlusterFS
           Version: 3.7.0
         Component: glusterd
          Keywords: Triaged
          Severity: urgent
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: ggarg at redhat.com
                CC: amukherj at redhat.com, bordas.csaba at gmail.com,
                    bugs at gluster.org, chris.l.cao at nokia.com,
                    kostas.makedos at gmail.com, tereshonkov.roman at gmail.com
        Depends On: 1287517



+++ This bug was initially created as a clone of Bug #1287517 +++

Description of problem:
I have a deployment with 3 nodes and 6 bricks.
Tried changing the glusterfs version but the problem persists.
glusterd process memory consumption is increasing constantly.
Using top to check its RSS values.
On a fresh installation it started at 27M.
After 3 hours it was 71M.
Today it is 515M.
The attached valgrind file shows some possibly-lost records.
Tried 'echo 2 > /proc/sys/vm/drop_caches' but it had no effect on memory
consumption.
Consumption seems to increase when the self-heal daemon is stopped.

Version-Release number of selected component (if applicable):
Tried with 3.6.2, 3.7.5 and 3.7.6

How reproducible:
Install a 3-node cluster.
glusterd on the storage master keeps increasing its memory consumption.

Additional info:

--- Additional comment from Atin Mukherjee on 2015-12-02 04:52:08 EST ---

(In reply to kmak from comment #0)
> Created attachment 1101371 [details]
> valgrind report
> 
> Description of problem:
> I have a deployment with 3 nodes and 6 bricks.
> Tried changing the glusterfs version but the problem persists.
What command did you run here? Could you please take a statedump of the glusterd
process by executing 'kill -USR1 $(pidof glusterd)'? It will generate a file in
/var/run/gluster. Please also attach *glusterd.log & cmd_history.log.

> glusterd process memory consumption is increasing constantly.
> Using top to check its RSS values.
> On a fresh installation it started at 27M.
> After 3 hours it was 71M.
> Today it is 515M.
> The attached valgrind file shows some possibly-lost records.
> Tried 'echo 2 > /proc/sys/vm/drop_caches' but it had no effect on memory
> consumption.
> Consumption seems to increase when the self-heal daemon is stopped.
> 
> Version-Release number of selected component (if applicable):
> Tried with 3.6.2, 3.7.5 and 3.7.6
> 
> How reproducible:
> Install a 3-node cluster.
> glusterd on the storage master keeps increasing its memory consumption.
> 
> Additional info:

--- Additional comment from kmak on 2015-12-02 05:46 EST ---

Logs and statedump of glusterd

--- Additional comment from kmak on 2015-12-02 05:47:04 EST ---

Thanks for the fast response.

--- Additional comment from kmak on 2015-12-04 03:24:10 EST ---

Do we have any updates on this?

--- Additional comment from Atin Mukherjee on 2015-12-06 23:03:40 EST ---

(In reply to kmak from comment #4)
> Do we have any updates on this?

You should hear something from us in a couple of days' time. We haven't had a
chance to look at it yet.

--- Additional comment from Vijay Bellur on 2015-12-09 09:44:14 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#1) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 02:24:22 EST ---

Hi kmak,

The statedump + log files you have attached are in a binary file that is not
in a human-readable format. Could you attach the statedump + log files in a
readable format?

One more thing:

Did you check the memory leak of glusterd, of glusterfsd, or of both?

Could you give us information on how you checked this memory leak?

--- Additional comment from kmak on 2015-12-10 02:45 EST ---



--- Additional comment from kmak on 2015-12-10 02:47 EST ---



--- Additional comment from kmak on 2015-12-10 02:47 EST ---



--- Additional comment from kmak on 2015-12-10 02:48 EST ---



--- Additional comment from kmak on 2015-12-10 02:50:39 EST ---

Hello,
Yes, sorry, I initially uploaded them as a zip file.
Checking was done through top; the process was glusterd.
The glusterfs processes were using a stable amount of memory, if I remember
correctly.
Is the review mentioned above correlated with my bug?
(http://review.gluster.org/12927)

--- Additional comment from Vijay Bellur on 2015-12-10 06:53:10 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#2) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 07:06:22 EST ---

Hi kmak,

Yes, there is a memory leak for sure, but it is not in MB; what I saw is in KB. I
created 6 distributed volumes with 2 nodes in the cluster, and after executing
1000 gluster commands I saw only a 10 KB memory leak. I am wondering how the
memory leak can be in MB on your system.

Will analyse this issue further.

--- Additional comment from Roman Tereshonkov on 2015-12-10 07:52:30 EST ---

I observed that the memory leak can be amplified by a continuous run of
for ((N=0;N<2000;N++)) ; do gluster volume status all ; done

This brings the leak rate to the order of 1 MB/min.


In addition, with some hard-coded instrumentation I saw cases where the dict_t
refcount was -1.

--- Additional comment from Gaurav Kumar Garg on 2015-12-10 08:09:40 EST ---

I have just executed the loop from 0 to 2000 and saw a 22 KB memory leak. I will
check that 22 KB leak further. I do not see how your system can leak 1 MB/min;
I will investigate further.

--- Additional comment from  on 2015-12-10 08:27:05 EST ---

Hi! 

I have the same problem on my Debian (Jessie) system!

I have tested my cluster (Distributed-Replicated, 2x2 bricks) with the following
bash script, run from 5 clients simultaneously:

#!/bin/bash

while true;
do

        for (( c=1; c<=100; c++ ))
        do
                dd if=/dev/zero of=/mnt/vol_www/960b_${HOSTNAME}_$c.txt bs=960 count=1
        done

        sleep 3

        rm -f /mnt/vol_www/960b_$HOSTNAME*

        sleep 3

done

The consumed memory grows very quickly!
(More than 1MB/min)

--- Additional comment from  on 2015-12-10 09:32:56 EST ---

Hi!

I have an update!

I did "echo 2 > /proc/sys/vm/drop_caches" and the memory reduced back to
"normal" level!

Also tried some similar like in comment 15 (gluster volume status)!

My test was:

on console 1: watch -n 0.1 'gluster volume status'
on console 2: watch -n 0.1 'gluster volume status vol_one clients'

It was running for 15-20 minutes!

After that I did "echo 2 > /proc/sys/vm/drop_caches" again, but
the memory did not drop back to a "normal" level this time!

--- Additional comment from kmak on 2015-12-15 05:45:04 EST ---

Do you have any findings on this? 
Can you tell us what is the best way to debug it?

--- Additional comment from Atin Mukherjee on 2015-12-15 07:25:02 EST ---

(In reply to kmak from comment #19)
> Do you have any findings on this? 
> Can you tell us what is the best way to debug it?

We haven't found the RCA for this, as it is not reproducible at our end. We have
identified a few places in the code that could lead to small memory leaks, but
nothing as severe as what you have reported.

I have one more follow-up question: you mentioned that you tried to change the
version but the problem persisted. Did you mean that you tried to downgrade?

--- Additional comment from kmak on 2015-12-15 09:56:16 EST ---

We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6.
So, do you think that this could be a configuration issue?
We surely have a lot of network activity and a lot of nodes.

--- Additional comment from Atin Mukherjee on 2015-12-15 22:39:11 EST ---

(In reply to kmak from comment #21)
> We tried the same configuration with GlusterFS 3.6.2, 3.7.5 and 3.7.6.
> So, do you think that this could be a configuration issue?
> We surely have a lot of network activity and a lot of nodes.

If you mean that you are hitting the memory leak across different versions, then
I agree with you that the configuration could be a suspect.
Can you upload the complete sosreport of the node where you see the memory
leak? Are you sure that you have not modified any configuration in
glusterd.vol?

--- Additional comment from Roman Tereshonkov on 2015-12-16 04:01:51 EST ---

The problem is reproducible with glusterfs 3.7.6 using a simple two-node
cluster.
The proposed fix http://review.gluster.org/#/c/12927/ has no effect.

The setup:
1. any two nodes with network communication (node1, node2);
2. glusterfs 3.7.6 (./configure --prefix=/usr --libdir="/usr/lib64"
--sysconfdir=/etc --localstatedir=/var && make && make install)
3. rdma is not used (is this critical or not?). glusterd.vol: option
transport-type socket
4. simple volume configuration for nodes:
     mkdir -p /brick/brick1
     gluster peer probe node1
     gluster volume create MYVOL replica 2 node1:/brick/brick1
node2:/brick/brick1 force
     gluster volume start MYVOL
     mount -t glusterfs node1:MYVOL /mnt/test (I do not actually know whether
mounting is needed to see the bug; not tested yet)
5. for ((N=0;N<20000;N++)) ; do gluster volume status all ; done
6. "top" to observe glusterd RSS memory growing (and it grows in both nodes)


NB: If you use only one node with this setup, the problem is not
reproduced.
    It looks like two nodes communicating over the network is the important
precondition.

--- Additional comment from kmak on 2015-12-17 04:32:41 EST ---

Hello,
the problem is verified from my side as well (my colleague Roman's comment 23 is
also valid), on my laptop (RHEL 7.2) with two VMs (node1, node2).

Description of my setup:
1) Create two nodes connected to a common network.
2) Create two extra partitions, format them to ext4 and mount them to
/data/brick/gv0/gv in both servers
3) Download glusterfs repo from here:
http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/glusterfs-epel.repo
& install glusterfs-server
4) gluster peer probe node1/node2 to both nodes
5) gluster volume create gv0 replica 2 node1:/data/brick/gv0/gv
node2:/data/brick/gv0/gv 
6) gluster volume start gv0
7) Run "gluster volume status all" x 2000 times every 30sec
8) See glusterd memory (rss taken from ps aux) grow constantly with 1MB/min

I will attach a result log with the first runs.

--- Additional comment from kmak on 2015-12-17 04:35 EST ---



--- Additional comment from Roman Tereshonkov on 2015-12-22 09:42 EST ---

Fix glusterd 3.7.6 memory leaks in the serialized buffer.

Memory is allocated by the dict_allocate_and_serialize function and the
pointer is assigned to the req.dict.dict_val variable.
On exit this memory has to be freed.

These leaks can be observed by running:
gluster volume status all
The leak is of the order of ~2 KB per command run.
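
For readers following the thread, the ownership pattern described above can be
pictured as a small standalone C program. The glusterfs dict_t and
dict_allocate_and_serialize() are stood in by minimal stubs so the example
compiles on its own; it is illustrative only, not the actual glusterd code.

/* Standalone, simplified sketch of the leak pattern described above: a
 * serialize call allocates a flat buffer, hands ownership to the caller
 * via the request (req.dict.dict_val in the real code), and the caller
 * must release it on every exit path. Stubs stand in for glusterfs APIs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct request { char *dict_val; size_t dict_len; };  /* stand-in for req.dict */

/* Stand-in for dict_allocate_and_serialize(): allocates a buffer that the
 * caller now owns. */
static int serialize_dict (const char *payload, char **buf, size_t *len)
{
        *len = strlen (payload) + 1;
        *buf = malloc (*len);
        if (!*buf)
                return -1;
        memcpy (*buf, payload, *len);
        return 0;
}

static int send_volume_status (const char *payload)
{
        struct request req = {0};
        int ret = serialize_dict (payload, &req.dict_val, &req.dict_len);
        if (ret)
                goto out;

        /* ... submit the RPC; the transport layer copies the buffer ... */
        printf ("sent %zu bytes\n", req.dict_len);
        ret = 0;

out:
        /* The fix: release the serialized buffer before returning, on both
         * the success and the error path (free(NULL) is a no-op). In the
         * real code this is GF_FREE (req.dict.dict_val). */
        free (req.dict_val);
        req.dict_val = NULL;
        return ret;
}

int main (void) { return send_volume_status ("volume status all"); }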

--- Additional comment from Roman Tereshonkov on 2015-12-22 09:52:29 EST ---

The attached patch fixes the problem partially.
It reduces the memory leaks by about 5 times.
I did not observe any visible leaks on the host sending the "gluster volume
status all" command.
But the receiving host still has some small leaks.

I suspect the problem is somewhere in the glusterd-op-sm.c file.
The volume status command initiates two transactions.
For each transaction the glusterd_set_txn_opinfo function is called and
opinfo_obj = GF_CALLOC (1, sizeof(glusterd_txn_opinfo_obj),
gf_common_mt_txn_opinfo_obj_t);
memory is allocated.
On finishing, the glusterd_clear_txn_opinfo function is called to free the
allocated memory. But this happens only once, for the second transaction.
So the memory allocated for the first transaction is never freed.

Can anybody check this and review?

--- Additional comment from Gaurav Kumar Garg on 2015-12-22 13:22:09 EST ---

Hi Roman,

Thanks for your efforts.

I will check it and get back to you.

~Gaurav

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 02:37:08 EST ---

Hi Roman,

Thank you for this finding. That actually solves the memory leak problem. I have
tested it and found that there are only a few bytes of memory leak left after
this patch.

Regarding your suspicion about the glusterd-op-sm.c file: volume status initiates
a transaction and calls the glusterd_set_txn_opinfo function, and it calls
glusterd_clear_txn_opinfo the same number of times.

--- Additional comment from Vijay Bellur on 2015-12-23 03:46:47 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#3) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Roman Tereshonkov on 2015-12-23 03:48:04 EST ---

> Regarding your suspicion about the glusterd-op-sm.c file: volume status initiates a transaction and calls the glusterd_set_txn_opinfo function, and it calls glusterd_clear_txn_opinfo the same number of times.

According to the logs, glusterd_clear_txn_opinfo is not called for the first
transaction.
This chunk of log was created on the local host by running volume status on the
remote host.
As you can see, "cleared opinfo" appears only for the second transaction.
What is your case? Do you see two "cleared opinfo" entries in your log?

[2015-12-22 11:09:01.321481] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.328820] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.343665] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.356197] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.366702] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.371448] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.382440] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4
[2015-12-22 11:09:01.433093] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.440280] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.452016] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.468647] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.469808] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.472555] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.477988] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.482711] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.484789] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.496658] D [MSGID: 0]
[glusterd-op-sm.c:311:glusterd_set_txn_opinfo] 0-management: Successfully set
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.504998] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.509757] D [MSGID: 0]
[glusterd-op-sm.c:255:glusterd_get_txn_opinfo] 0-management: Successfully got
opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556
[2015-12-22 11:09:01.511833] D [MSGID: 0]
[glusterd-op-sm.c:360:glusterd_clear_txn_opinfo] 0-management: Successfully
cleared opinfo for transaction ID : 1e3ebf40-f38b-406e-a159-0a7776439556

--- Additional comment from Vijay Bellur on 2015-12-23 03:54:43 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#4) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 04:18:31 EST ---

(In reply to Roman Tereshonkov from comment #31)
> > Regarding your suspicion about the glusterd-op-sm.c file: volume status initiates a transaction and calls the glusterd_set_txn_opinfo function, and it calls glusterd_clear_txn_opinfo the same number of times.
> 
> According to the logs, glusterd_clear_txn_opinfo is not called for the first
> transaction.
> This chunk of log was created on the local host by running volume status on
> the remote host.
> As you can see, "cleared opinfo" appears only for the second transaction.
> What is your case? Do you see two "cleared opinfo" entries in your log?
> 
> [debug log snipped; it is identical to the log quoted in full in the comment
> above]

Hi Roman,

In my test case I was using gdb. I just attached gdb to the glusterd process,
put breakpoints in the two functions that you mentioned
(glusterd_set_txn_opinfo and glusterd_clear_txn_opinfo), and executed the
gluster volume status command. I saw that both functions were called the same
number of times.

--- Additional comment from Vijay Bellur on 2015-12-23 04:27:24 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#5) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Roman Tereshonkov on 2015-12-23 04:42:06 EST ---

> In my test case I was using gdb. I just attached gdb to the glusterd process, put breakpoints in the two functions that you mentioned (glusterd_set_txn_opinfo and glusterd_clear_txn_opinfo), and executed the gluster volume status command. I saw that both functions were called the same number of times.


If it is not confidential, can you attach the chunk of debug log
/var/log/glusterfs/etc-glusterfs-glusterd.vol.log (--log-level=DEBUG) from both
the local and remote hosts covering one run of the command "gluster volume
status"?

--- Additional comment from Gaurav Kumar Garg on 2015-12-23 07:29:43 EST ---

Hi Roman,

Yes, I still see some memory leak on the remote side; there is no memory leak on
the originator node. The RCA seems to be that set_txn_opinfo is called more
times than clear_txn_opinfo. Will update you about it soon.

--- Additional comment from Gaurav Kumar Garg on 2015-12-24 01:28:27 EST ---

Hi Roman,

There is one update regarding this.

If, from looking into the glusterd logs, you are saying that
glusterd_set_txn_opinfo allocates memory more times than
glusterd_clear_txn_opinfo frees it, that is not correct. It is true that
glusterd_set_txn_opinfo is called more times on the remote node than
glusterd_clear_txn_opinfo, but that does not mean it allocates memory every
time. In the function (glusterd_set_txn_opinfo) you can see that it allocates
memory only if ret is non-zero (meaning it was unable to get the txn opinfo
from the dictionary), and that happens only once, at the start of a
transaction.


If you apply the http://review.gluster.org/12927 patch on all the nodes in the
cluster, you will see only a few bytes of memory leak.

I am checking further why those few bytes of memory leak are happening. Will
update you further.
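
As a rough illustration of the allocation guard described above, here is a small
standalone sketch. A fixed-size lookup table stands in for
priv->glusterd_txn_opinfo and its lookup call; the real glusterd-op-sm.c code
differs in detail.

/* Standalone sketch: the "set" path allocates only when the transaction id
 * is not already present, so "set called many times" does not mean "memory
 * allocated many times". A fixed-size table stands in for the real dict. */
#include <stdio.h>
#include <string.h>

struct opinfo_entry { char txn_id[64]; int used; };

static struct opinfo_entry table[16];  /* stand-in for priv->glusterd_txn_opinfo */
static int allocations;

static void set_txn_opinfo (const char *txn_id)
{
        int i;

        /* Lookup first (the dictionary lookup in the real code). */
        for (i = 0; i < 16; i++)
                if (table[i].used && !strcmp (table[i].txn_id, txn_id))
                        return;  /* already present: update only, no allocation */

        /* Miss: allocate a new opinfo (GF_CALLOC in the real code). */
        for (i = 0; i < 16; i++)
                if (!table[i].used) {
                        table[i].used = 1;
                        snprintf (table[i].txn_id, sizeof (table[i].txn_id),
                                  "%s", txn_id);
                        allocations++;
                        return;
                }
}

int main (void)
{
        int i;

        /* Five "set" calls for the same transaction id, as seen in the log. */
        for (i = 0; i < 5; i++)
                set_txn_opinfo ("e52c9b6e-78d5-449d-a2dc-e45d36cfa9b4");
        printf ("set called 5 times, allocations = %d\n", allocations);  /* 1 */
        return 0;
}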

--- Additional comment from Roman Tereshonkov on 2015-12-28 04:06:28 EST ---

About http://review.gluster.org/#/c/12927/

Why have you moved GF_FREE (req.dict.dict_val) before "out:"?
If an error occurs after GD_ALLOC_COPY_UUID then the allocated memory leaks.
In the patch I proposed it was after the "out:" label.
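
To make the placement question concrete, here is a minimal standalone sketch of
the two orderings. Plain malloc/free stand in for GD_ALLOC_COPY_UUID and
GF_FREE; this is illustrative only, not the actual glusterd code.

#include <stdlib.h>

/* Placement A (the concern raised above): the free sits before the "out:"
 * label, so an error path that jumps to out: after the allocation skips
 * the free and the buffer leaks. */
static int placement_a (int fail_late)
{
        int   ret = -1;
        char *buf = malloc (64);  /* stands in for the GD_ALLOC_COPY_UUID allocation */

        if (!buf)
                goto out;
        if (fail_late)
                goto out;         /* error after the allocation: buf leaks */

        ret = 0;
        free (buf);               /* only reached on the success path */
out:
        return ret;
}

/* Placement B (as in the originally proposed patch): the free sits after
 * the "out:" label, so both success and error paths release the buffer. */
static int placement_b (int fail_late)
{
        int   ret = -1;
        char *buf = malloc (64);

        if (!buf)
                goto out;
        ret = fail_late ? -1 : 0;
out:
        free (buf);               /* free(NULL) is a safe no-op */
        return ret;
}

int main (void)
{
        /* Both calls fail "late"; only placement_b releases its buffer. */
        return (placement_a (1) == -1 && placement_b (1) == -1) ? 0 : 1;
}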

--- Additional comment from Vijay Bellur on 2015-12-29 01:42:14 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#6) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Roman Tereshonkov on 2015-12-29 07:35:23 EST ---

> If, from looking into the glusterd logs, you are saying that glusterd_set_txn_opinfo allocates memory more times than glusterd_clear_txn_opinfo frees it, that is not correct. It is true that glusterd_set_txn_opinfo is called more times on the remote node than glusterd_clear_txn_opinfo, but that does not mean it allocates memory every time. In the function (glusterd_set_txn_opinfo) you can see that it allocates memory only if ret is non-zero (meaning it was unable to get the txn opinfo from the dictionary), and that happens only once, at the start of a transaction.


By "transaction" I mean a sequence of commands with the same transaction id.

As I mentioned earlier, every run of "volume status all" generates two different
transactions with different transaction ids.
glusterd_set_txn_opinfo is called with the transaction id as an argument and
allocates an opinfo whenever there is no opinfo for that transaction in the
dictionary.
glusterd_clear_txn_opinfo is called with the transaction id as an argument and
frees the opinfo with that id from the glusterd_txn_opinfo dictionary.

In our case the glusterd_clear_txn_opinfo function is called only once, with the
transaction id corresponding to the second transaction.

As a proof you can add "dict_dump_to_log(priv->glusterd_txn_opinfo);" at
the end of the glusterd_clear_txn_opinfo function to see the content of the
dictionary.
You will see that after every run of "volume status all" the glusterd_txn_opinfo
dictionary grows, keeping all the un-freed opinfo entries identified by each
first transaction id.


By the way, if you run "volume status VOLNAME" with an existing VOLNAME, then
only one transaction is generated and there appear to be no memory leaks.
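
The growth described above can be pictured with another small standalone sketch.
A plain counter stands in for the number of entries in the glusterd_txn_opinfo
dictionary; the real code keys opinfo objects by transaction id, so this is
illustrative only.

#include <stdio.h>

static int opinfo_entries;  /* stand-in for entries in priv->glusterd_txn_opinfo */

static void set_txn_opinfo (void)   { opinfo_entries++; }  /* allocate for a new txn id */
static void clear_txn_opinfo (void) { opinfo_entries--; }  /* freed on the unlock path  */

/* "gluster volume status all": two transactions per run, but only the
 * second (per-volume) one reaches the unlock path that clears its opinfo,
 * so every run strands one entry in the dictionary. */
static void volume_status_all (void)
{
        set_txn_opinfo ();    /* txn 1: fetch the volume names -- never cleared */
        set_txn_opinfo ();    /* txn 2: per-volume status                       */
        clear_txn_opinfo ();  /* only txn 2 is cleared                          */
}

/* "gluster volume status VOLNAME": a single transaction that is cleared. */
static void volume_status_volname (void)
{
        set_txn_opinfo ();
        clear_txn_opinfo ();
}

int main (void)
{
        int i;

        for (i = 0; i < 2000; i++)
                volume_status_all ();
        printf ("entries after 2000 x 'status all':     %d\n", opinfo_entries);
        for (i = 0; i < 2000; i++)
                volume_status_volname ();
        printf ("entries after 2000 x 'status VOLNAME': %d\n", opinfo_entries);
        return 0;  /* 2000 stranded entries from the first loop, none added by the second */
}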

--- Additional comment from Roman Tereshonkov on 2015-12-30 06:29:56 EST ---

It looks like this is a functional problem.

The command "gluster volume status all" generates a series of transactions
identified by different ids.
The very first one fetches the volume names served by the remote host's
glusterfs server.
The other transactions are issued per volume name and include
locking/unlocking phases.
It is the unlocking phase that marks the end of a transaction, at which point
all allocated resources are to be freed.
There is no code pathway that calls the glusterd_clear_txn_opinfo function
other than the unlock or unlock_all pathways.
Thus, there is no way for the very first transaction to know when it has ended,
so glusterd_clear_txn_opinfo is never called and its opinfo is never freed.

This is my analysis, which needs to be confirmed by you.
Logically this is now a different issue from what we have already fixed.

Should I open a new bug for this functional problem?

Do you see any simple way to fix it?

--- Additional comment from Gaurav Kumar Garg on 2015-12-30 06:37:05 EST ---

Hi Roman,

Yes, it is a functional problem with the "gluster volume status all" command, but
it needs to be fixed. Your analysis is correct. Will update you about it. There
is no need to raise a separate bug for this as of now.

--- Additional comment from kmak on 2016-02-02 09:23:02 EST ---

Hello,
Is there any update on this bug?
Have you implemented any fix for it, and in which version?
Thanks

--- Additional comment from Gaurav Kumar Garg on 2016-02-04 05:26:19 EST ---

Hi Kmak,

Patch http://review.gluster.org/12927 fixes most of the memory leak, but there
is still some leak left. It seems to be in our op-sm framework. This patch is
not yet available in the latest GlusterFS release.

--- Additional comment from Vijay Bellur on 2016-02-22 06:27:38 EST ---

REVIEW: http://review.gluster.org/12927 (glusterd: fixing few memory leak in
glusterd) posted (#7) for review on master by Gaurav Kumar Garg
(ggarg at redhat.com)

--- Additional comment from Roman Tereshonkov on 2016-02-22 06:33:39 EST ---

Has that two-line memory leak fix patch been accepted into upstream 3.7?

What is the status of the memory leak caused by the functional issue? Any
progress on fixing the "gluster volume status all" issue?

--- Additional comment from Gaurav Kumar Garg on 2016-02-23 01:52:44 EST ---

Hi Roman,

The patch is still in review; it has not been merged upstream yet.

--- Additional comment from Vijay Bellur on 2016-02-23 23:19:15 EST ---

COMMIT: http://review.gluster.org/12927 committed in master by Jeff Darcy
(jdarcy at redhat.com) 
------
commit e38bf1bdeda3c7a89be3193ad62a72b9139358dd
Author: Gaurav Kumar Garg <garg.gaurav52 at gmail.com>
Date:   Wed Dec 9 20:12:17 2015 +0530

    glusterd: fixing few memory leak in glusterd

    The current glusterd code base has a memory leak. This is because the
    memory allocated by the dict_allocate_and_serialize function in the
    "gd_syncop_mgmt_v3_lock" and "gd_syncop_mgmt_v3_unlock"
    functions is not freed upon exit.

    The fix is to free the memory on exit from the above functions.

    Thanx Carlos and Roman for finding out the issue and fix.

    Change-Id: Id67aa794c84969830ca7ea8c2374f80c64d7a639
    BUG: 1287517
    Signed-off-by: Gaurav Kumar Garg <ggarg at redhat.com>
    Signed-off-by: Carlos Chinea <carlos.chinea at nokia.com>
    Signed-off-by: Roman Tereshonkov <roman.tereshonkov at nokia.com>
    Reviewed-on: http://review.gluster.org/12927
    Smoke: Gluster Build System <jenkins at build.gluster.com>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Jeff Darcy <jdarcy at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1287517
[Bug 1287517] Memory leak in glusterd
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


More information about the Bugs mailing list