[Bugs] [Bug 1580352] [GSS] Glusterd memory leaking in gf_gld_mt_linebuf
bugzilla at redhat.com
Mon May 21 10:53:29 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1580352
Sanju <srakonde at redhat.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|bugs at gluster.org          |srakonde at redhat.com
--- Comment #2 from Sanju <srakonde at redhat.com> ---
Description of problem:
This customer has two pairs of gluster nodes running RHGS 3.2:
- Production gluster nodes:
glusterldc1fs1up.owfg.com
glusterldc1fs2up.owfg.com
- QA gluster nodes:
glusteredc1fs1uq.owfg.com
glusteredc1fs2uq.owfg.com
Gluster packages installed:
grep -i gluster installed-rpms
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64                   Fri Sep 22 08:15:35 2017
gluster-nagios-common-0.2.4-1.el7rhgs.noarch                   Wed Aug 23 14:16:28 2017
glusterfs-3.8.4-18.6.el7rhgs.x86_64                            Fri Sep 22 08:19:41 2017
glusterfs-api-3.8.4-18.6.el7rhgs.x86_64                        Fri Sep 22 08:19:41 2017
glusterfs-cli-3.8.4-18.6.el7rhgs.x86_64                        Fri Sep 22 08:19:41 2017
glusterfs-client-xlators-3.8.4-18.6.el7rhgs.x86_64             Fri Sep 22 08:19:41 2017
glusterfs-fuse-3.8.4-18.6.el7rhgs.x86_64                       Fri Sep 22 08:19:41 2017
glusterfs-geo-replication-3.8.4-18.6.el7rhgs.x86_64            Fri Sep 22 08:19:42 2017
glusterfs-libs-3.8.4-18.6.el7rhgs.x86_64                       Fri Sep 22 08:19:41 2017
glusterfs-server-3.8.4-18.6.el7rhgs.x86_64                     Fri Sep 22 08:19:41 2017
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64  Fri Sep 22 08:15:41 2017
python-gluster-3.8.4-18.6.el7rhgs.noarch                       Fri Sep 22 08:19:42 2017
samba-vfs-glusterfs-4.4.5-3.el7rhgs.x86_64                     Wed Aug 23 14:18:11 2017
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch                        Fri Sep 22 08:15:44 2017
On all four hosts, the glusterd process has a memory leak. Looking at the ps
output, its resident set size is up to 1.6 GB on the QA nodes and nearly 6 GB
on the production nodes.
- sosreport glusteredc1fs1uq.owfg.com
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1573 0.8 13.8 2842496 1679636 ? Ssl Apr20 154:28 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
==> Here, glusterd is consuming 1.6 GB of resident memory, nearly 14% of the
node's RAM.
- sosreport glusteredc1fs2uq.owfg.com
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 20590 0.5 6.2 1681632 753284 ? Ssl Apr27 41:38 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
- sosreport glusterldc1fs1up.owfg.com
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18199 1.8 7.5 1999648 1230012 ? Ssl May01 20:14 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/9920bccf2a4c92d44d9f991404c5765d.socket
==> Note: this entry is the gluster NFS process rather than glusterd, so the
glusterd RSS for this node is not shown here.
- sosreport glusterldc1fs2up.owfg.com
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9102 3.5 36.7 8990320 5975468 ? Ssl Feb14 3927:15 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
==> On this production node, glusterd is consuming nearly 6 GB, about 36% of
the node's memory.
I've asked the customer for two statedumps of the glusterd process, taken two
hours apart. He has provided this information only from the QA nodes.
Node glusteredc1fs1uq.owfg.com
glusterdump.1573.dump.1525458209
Looking at the usage types with the highest memory consumption:
[mgmt/glusterd.management - usage-type gf_gld_mt_linebuf memusage]
size=909495296 --> 909 MB leaked here
num_allocs=888179
max_size=909495296
max_num_allocs=888179
total_allocs=888179
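Note that size / num_allocs = 909495296 / 888179 = 1024 bytes exactly, and
total_allocs equals num_allocs, i.e. every allocation of this type is a
1024-byte buffer and not a single one has ever been freed.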
[mgmt/glusterd.management - usage-type gf_common_mt_mem_pool memusage]
size=170826728 --> 170 MB leaked here
num_allocs=1607174
max_size=170839680
max_num_allocs=1607398
total_allocs=80039329
The second statedump shows the same structures still growing:
glusterdump.1573.dump.1525466693
[mgmt/glusterd.management - usage-type gf_gld_mt_linebuf memusage]
size=919127040 --> On this second iteration we have 919 MB
num_allocs=897585
max_size=919127040
max_num_allocs=897585
total_allocs=897585
[mgmt/glusterd.management - usage-type gf_common_mt_mem_pool memusage]
size=172089816
num_allocs=1619086
max_size=172099544
max_num_allocs=1619240
total_allocs=80707302
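The two dumps were taken 1525466693 - 1525458209 = 8484 seconds (~2.4 hours)
apart. In that window num_allocs grew by 897585 - 888179 = 9406 and size by
exactly 9406 * 1024 = 9631744 bytes (~9.6 MB), i.e. roughly one leaked
1024-byte line buffer per second, or about 4 MB per hour.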
The same pattern shows up on node glusteredc1fs2uq.owfg.com
glusterdump.20590.dump.1525458352
[mgmt/glusterd.management - usage-type gf_gld_mt_linebuf memusage]
size=476495872 --> 476 MB leaked here
num_allocs=465328
max_size=476495872
max_num_allocs=465328
total_allocs=465328
[mgmt/glusterd.management - usage-type gf_common_mt_mem_pool memusage]
size=70239188 --> 70 MB here
num_allocs=627665
max_size=70284104
max_num_allocs=628212
total_allocs=86062168
glusterdump.20590.dump.1525466708
[mgmt/glusterd.management - usage-type gf_gld_mt_linebuf memusage]
size=485989376 --> On the second iteration this has increased to 485 MB
num_allocs=474599
max_size=485989376
max_num_allocs=474599
total_allocs=474599
[mgmt/glusterd.management - usage-type gf_common_mt_mem_pool memusage]
size=71332824
num_allocs=637632
max_size=71335796
max_num_allocs=637669
total_allocs=87385904
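Again the leak is exactly 1024 bytes per allocation: num_allocs grew by
474599 - 465328 = 9271 over 1525466708 - 1525458352 = 8356 seconds, the same
rate of roughly one allocation per second as on the first node, which would be
consistent with a periodic operation (e.g. a recurring status query) hitting
the same code path on both nodes.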
The only place where I can find a 1024-byte allocation of this type is in the
geo-replication code:
https://github.com/gluster/glusterfs/blob/master/xlators/mgmt/glusterd/src/glusterd-geo-rep.c
Specifically, in glusterd_urltransform ():

        ...
        for (;;) {
                size_t len;
                line = GF_MALLOC (1024, gf_gld_mt_linebuf);
                if (!line) {
                        error = _gf_true;
                        goto out;
                }
        ...
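Each pass through this loop allocates a fresh 1024-byte line buffer for one
line of the spawned command's output, and the buffers that get kept are stored
in an array that has to be freed later by whichever caller ends up owning it.
As a rough standalone model of that pattern (an illustration only: plain libc
calls stand in for GF_MALLOC/GF_FREE and popen() stands in for glusterd's
command runner; this is not the actual glusterd code):

/* Standalone model of the allocation pattern above. If the caller never
 * runs the cleanup loop at the bottom, every call leaks n * 1024 bytes,
 * matching the per-allocation size seen in the statedumps. */
#include <stdio.h>
#include <stdlib.h>

static char **
collect_lines (const char *cmd, unsigned int *count)
{
        FILE *fp = popen (cmd, "r");
        unsigned int n = 0, cap = 32;
        char **arr = NULL;

        *count = 0;
        if (!fp)
                return NULL;
        arr = calloc (cap, sizeof (char *));
        if (!arr) {
                pclose (fp);
                return NULL;
        }

        for (;;) {
                /* mirrors: line = GF_MALLOC (1024, gf_gld_mt_linebuf); */
                char *line = malloc (1024);
                if (!line)
                        break;
                if (!fgets (line, 1024, fp)) {
                        free (line);            /* EOF: drop the unused buffer */
                        break;
                }
                if (n == cap) {
                        char **tmp = realloc (arr, 2 * cap * sizeof (char *));
                        if (!tmp) {
                                free (line);
                                break;
                        }
                        arr = tmp;
                        cap *= 2;
                }
                arr[n++] = line;                /* ownership passes to the caller */
        }

        pclose (fp);
        *count = n;
        return arr;
}

int
main (void)
{
        unsigned int i, n;
        char **lines = collect_lines ("echo line1; echo line2", &n);

        for (i = 0; i < n; i++) {
                fputs (lines[i], stdout);
                free (lines[i]);        /* without this cleanup, n * 1024 bytes leak per call */
        }
        free (lines);
        return 0;
}

If the real code path that triggers this (whatever polls it roughly once per
second) never releases those line buffers, that alone would account for the
gf_gld_mt_linebuf growth above.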
I believe this is caused by geo-replication. Further assistance from
engineering is required to understand the source of this memory leak.