[Bugs] [Bug 1729107] New: Memory leak in glusterfsd process

bugzilla at redhat.com bugzilla at redhat.com
Thu Jul 11 12:18:19 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1729107

            Bug ID: 1729107
           Summary: Memory leak in glusterfsd process
           Product: GlusterFS
           Version: mainline
          Hardware: mips64
                OS: Linux
            Status: NEW
         Component: core
          Severity: urgent
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: abhishpaliwal at gmail.com, amukherj at redhat.com,
                    atumball at redhat.com, bugs at gluster.org,
                    hgowtham at redhat.com, nbalacha at redhat.com,
                    pasik at iki.fi, ryan at magenta.tv
        Depends On: 1718734
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1718734 +++

Description of problem:

We are seeing a memory leak in the glusterfsd process when a specific file
is written and deleted repeatedly at a fixed interval.

Version-Release number of selected component (if applicable): Glusterfs 5.4


How reproducible:

Setup details and the test we are running are as follows:


One client, two Gluster servers.
The client writes and deletes one file every 15 minutes via the script
test_v4.15.sh.

IP
Server side:
128.224.98.157 /gluster/gv0/
128.224.98.159 /gluster/gv0/

Client side:
128.224.98.160 /gluster_mount/

Server side:
gluster volume create gv0 replica 2 128.224.98.157:/gluster/gv0/
128.224.98.159:/gluster/gv0/ force
gluster volume start gv0

root at 128:/tmp/brick/gv0# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 7105a475-5929-4d60-ba23-be57445d97b5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 128.224.98.157:/gluster/gv0
Brick2: 128.224.98.159:/gluster/gv0
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

exec script: ./ps_mem.py -p 605 -w 61 > log
root at 128:/# ./ps_mem.py -p 605
Private + Shared = RAM used Program
23668.0 KiB + 1188.0 KiB = 24856.0 KiB glusterfsd
---------------------------------
24856.0 KiB
=================================


Client side:
mount -t glusterfs -o acl -o resolve-gids 128.224.98.157:gv0 /gluster_mount


We are using the below script to write and delete the file.

test_v4.15.sh
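
For reference, here is a minimal sketch of the kind of loop such a test
script runs. This is not the attached test_v4.15.sh; the mount point, file
name and file size are assumptions used only for illustration.

#!/bin/sh
# Write and then delete a single file on the mounted volume, one
# operation every 15 minutes, indefinitely.
MOUNT=/gluster_mount
while true; do
    dd if=/dev/zero of="$MOUNT/leaktest.dat" bs=1M count=10 2>/dev/null
    sleep 900   # 15 minutes
    rm -f "$MOUNT/leaktest.dat"
    sleep 900   # 15 minutes
done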

We also use the below script to monitor the memory increase while the
above script is running in the background.

ps_mem.py

I am attaching the script files as well as the results obtained after
testing this scenario.


Actual results: Memory leak is present 


Expected results: Leak should not be there


Additional info: Please see the attached file for more details; the
statedumps are also attached.
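
For reference, brick statedumps are typically generated like this (a
sketch; the output directory defaults to /var/run/gluster unless
server.statedump-path has been changed, and the pid placeholder is
illustrative):

# Ask glusterd to dump the state of the volume's brick processes:
gluster volume statedump gv0

# Or send SIGUSR1 directly to a specific glusterfsd process so that it
# writes its own statedump:
kill -USR1 <glusterfsd-pid>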

--- Additional comment from Abhishek on 2019-06-10 06:24:50 UTC ---

Attached are some statedumps taken on the GlusterFS server: an initial
dump, one taken about an hour later, and one taken roughly three hours
after that. I believe we have looked at the statedumps before and not seen
any evidence there of what is going wrong, but please double-check this.

I have a system running Gluster 5.4 set up in replicate mode: an active
and a passive server, plus one client that has mounted the gluster volume
and simply writes and deletes a file on it every 15 minutes. That is all
that is going on.

What we see is that the memory usage of the glusterfsd process increases
slowly in a linear fashion. I am running a python script every minute to
log the memory usage and then plot the result on a graph. I attach the
graph showing glusterfsd private, shared and total memory usage over time
(some 78 days of running). I also attach two screenshots from 'top' taken
at various stages.

This is the graph from the one-file-every-15-minutes write test:

Please see the attached image.png.



And by the way, if we 'hammer' the gluster volume with file writes/deletes
at a much faster rate, i.e. many files written/deleted every second or even
every minute, we see that glusterfsd memory usage increases only for a very
short period, then levels off and stays flat indefinitely at around 35 MB
total. So there is clearly something different happening in the 'slow' file
access case, where the total is at nearly 200 MB and still increasing.
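
(For illustration, a sketch of what such a faster loop could look like is
below; the mount point, file sizes and counts are assumptions, not the
exact script we used.)

# 'Hammer' variant: many small files written and deleted back to back,
# instead of one file every 15 minutes.
MOUNT=/mnt
while true; do
    for i in 1 2 3 4 5 6 7 8 9 10; do
        dd if=/dev/zero of="$MOUNT/hammer.$i" bs=64k count=1 2>/dev/null
        rm -f "$MOUNT/hammer.$i"
    done
done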

If we run Valgrind, we see that memory allocations are freed when the
process exits, but the problem is that this will be on a system where
gluster is up and running all the time. So there seems to be a problem
where memory is dynamically allocated on each write/read on the gluster
volume but is never freed at runtime. The worry is that at some point
glusterfsd will completely use up all the memory on the system - it might
take a long time, but this is not acceptable.
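
For illustration, one way to run the brick process under Valgrind (a
sketch; the brick arguments would need to match those shown in the ps
output below, and -N keeps glusterfsd in the foreground so Valgrind can
track it):

valgrind --leak-check=full --show-leak-kinds=all \
    --log-file=/tmp/glusterfsd-valgrind.log \
    /usr/sbin/glusterfsd -N <brick arguments as in the ps output below>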

My steps are here:

root at board0:/tmp# gluster --version
glusterfs 5.4

root at board0:/tmp# gluster volume status gv0
Status of volume: gv0
Gluster process                           TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick board0:/export/sdb1/brick           49152     0          Y       1702
Brick board1:/export/sdb1/brick           49152     0          Y       1652
Self-heal Daemon on localhost             N/A       N/A        Y       1725
Self-heal Daemon on board1                N/A       N/A        Y       1675

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

root at board0:/tmp# jobs
[1]+ Running ./ps_mem.py -w 61 > /tmp/ps_mem.log & (wd: ~)

root at board0:/tmp# ps -ef | grep gluster
root 1608 1 0 May08 ? 00:00:04 /usr/sbin/glusterd -p /var/run/glusterd.pid
root 1702 1 0 May08 ? 00:00:14 /usr/sbin/glusterfsd -s board0 --volfile-id
gv0.board0.export-sdb1-brick -p
/var/run/gluster/vols/gv0/board0-export-sdb1-brick.pid -S
/var/run/gluster/6c09da8ec6e017c8.socket --brick-name /export/sdb1/brick -l
/var/log/glusterfs/bricks/export-sdb1-brick.log --xlator-option
*-posix.glusterd-uuid=336dc4a8-1371-4366-b2f9-003c35e12ca1 --process-name brick
--brick-port 49152 --xlator-option gv0-server.listen-port=49152
root 1725 1 0 May08 ? 00:00:03 /usr/sbin/glusterfs -s localhost --volfile-id
gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l
/var/log/glusterfs/glustershd.log -S /var/run/gluster/7ae70daf2745f7d4.socket
--xlator-option replicate.node-uuid=336dc4a8-1371-4366-b2f9-003c35e12ca1
--process-name glustershd
root 3115 1241 0 03:00 ttyS0 00:00:00 grep gluster 

This is the command used to create the gluster volume:

gluster volume create gv0 replica 2 board0:/export/sdb1/brick
board1:/export/sdb1/brick

And on the client I mount it like this:

mount -t glusterfs board0:gv0 /mnt

and then just run the one-file-every-15-minutes test:

./test_v4.sh

To extract the data, I run the following after some time:

grep glusterfsd ps_mem.log | awk '{ print $1 "," $4 "," $7 }' >
gluster54-glusterfsd.csv

Then I plot the points in Excel.
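
A quick sanity check of the linear trend can also be done straight from
the CSV with awk (a hypothetical one-liner, assuming the
private,shared,total column order produced by the grep/awk command above):

awk -F, 'NR == 1 { first = $3 } { last = $3 } END {
    printf "first=%.1f KiB  last=%.1f KiB  growth=%.1f KiB  samples=%d\n",
           first, last, last - first, NR }' gluster54-glusterfsd.csv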

--- Additional comment from Abhishek on 2019-06-12 14:54:43 UTC ---

Hi,

As requested, we have tested the same scenario on GlusterFS 6.1, and the
memory leak is present there as well.

Please check the "statedumps log for 6.1" attachment for more details.

--- Additional comment from Abhishek on 2019-06-14 10:15:53 UTC ---

Hi Team,

is there any update on this?

Regards,
Abhishek

--- Additional comment from hari gowtham on 2019-06-14 11:17:32 UTC ---

(In reply to Abhishek from comment #9)
> Hi Team,
> 
> is there any update on this?
> 
> Regards,
> Abhishek

Hi Abhishek,

I'll look into it.
This will take some time.
I will update once I have made some progress.

Regards,
Hari.

--- Additional comment from  on 2019-06-19 16:02:37 UTC ---

We are also seeing this issue, with memory slowly increasing over time until
all system resources (SWAP and RAM) are exhausted.

--- Additional comment from Worker Ant on 2019-06-21 05:25:29 UTC ---

REVIEW: https://review.gluster.org/22918 (Detach iot_worker to release its
resources) posted (#1) for review on master by Liguang Li


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1718734
[Bug 1718734] Memory leak in glusterfsd process