[Bugs] [Bug 1380249] New: Huge memory usage of FUSE client

bugzilla at redhat.com bugzilla at redhat.com
Thu Sep 29 05:47:55 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1380249

            Bug ID: 1380249
           Summary: Huge memory usage of FUSE client
           Product: GlusterFS
           Version: mainline
         Component: fuse
          Keywords: Triaged
          Severity: medium
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: bugs at gluster.org, glush at jet.msk.su,
                    kjohnson at gnulnx.net, nbalacha at redhat.com,
                    oleksandr at natalenko.name, pkarampu at redhat.com,
                    yannick.perret at liris.cnrs.fr
        Depends On: 1369364



+++ This bug was initially created as a clone of Bug #1369364 +++

Description of problem:

The FUSE client consumes a lot of memory on a volume with lots of small files
(mailbox storage for Dovecot).

Here are volume options:

===
Type: Distributed-Replicate
...
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
...
network.ping-timeout: 10
features.cache-invalidation: on
performance.cache-size: 16777216
performance.cache-max-file-size: 1048576
performance.io-thread-count: 4
performance.write-behind: on
performance.flush-behind: on
performance.read-ahead: on
performance.quick-read: on
performance.stat-prefetch: on
performance.write-behind-window-size: 2097152
storage.linux-aio: on
performance.client-io-threads: off
server.event-threads: 8
network.inode-lru-limit: 4096
client.event-threads: 4
cluster.readdir-optimize: on
cluster.lookup-optimize: on
performance.readdir-ahead: on
nfs.disable: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
===
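
For reference, options like the ones above are set with the gluster CLI; a
minimal sketch, assuming the volume name mail_boxes from the --volfile-id in
the mount command further down (run on any server node):

===
# assumption: the volume is named mail_boxes, as in the --volfile-id below
gluster volume set mail_boxes network.inode-lru-limit 4096
gluster volume set mail_boxes performance.cache-size 16777216
gluster volume set mail_boxes cluster.lookup-optimize on
# confirm the resulting option set
gluster volume info mail_boxes
===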

Here is the memory usage before drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28
/usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable
--volfile-server=glusterfs.server.com --volfile-id=mail_boxes
/var/spool/mail/virtual
===

And after drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28
/usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable
--volfile-server=glusterfs.server.com --volfile-id=mail_boxes
/var/spool/mail/virtual
===

Nothing changes after drop_caches.

pmap of PID 1049 is attached. I see lots of suspicious entries:

===
00007fee74000000  65536K rw---   [ anon ]
===

Are those related to the I/O translator? What else could consume 64M at once?

Also attaching statedumps taken before and after drop_caches (there is almost no
difference between them).
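
For anyone reproducing this, a minimal sketch of how these numbers can be
gathered (PID 1049 is taken from the ps output above; the statedump is written
to glusterfs' default statedump directory, usually /var/run/gluster):

===
# RSS/VSZ of the FUSE client
ps -o pid,vsz,rss,cmd -p 1049

# largest anonymous mappings (the suspicious 64M blocks)
pmap -x 1049 | sort -n -k2 | tail -n 20

# drop kernel caches, then re-check
sync; echo 3 > /proc/sys/vm/drop_caches
ps -o pid,vsz,rss,cmd -p 1049

# ask the client to write a statedump
kill -USR1 1049
===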

Version-Release number of selected component (if applicable):

GlusterFS v3.7.14 plus the following patches (all of them already merged for the
3.7.15 release):

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc : build_prog_details should iterate program list inside critical
section

Jiffin Tony Thottan (2):
      gfapi : Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash : append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

How reproducible:

Always.

Steps to Reproduce:
1. mount volume with lots of small files (mail boxes);
2. use them
3. ...
4. PROFIT!
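
A minimal reproduction sketch, assuming the server and volume names from the
description above and a placeholder mount point; the stat-heavy walk matches
the workload that later comments show triggers the growth reliably:

===
mount -t glusterfs glusterfs.server.com:/mail_boxes /mnt/test

# walk lots of small files, then check the client's memory
find /mnt/test -type f -print0 | xargs -0 stat > /dev/null
ps -o vsz,rss,cmd -C glusterfs
===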

Actual results:

Client memory leaks.

Expected results:

Client memory does not leak :).

Additional info:

Feel free to ask me for additional info.

--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-25 11:03:23 EDT ---

http://termbin.com/66ud is the script I use to try to reproduce the leak.

Observing the following in pmap:

===
00007fe430000000    132K rw---   [ anon ]
00007fe430021000  65404K -----   [ anon ]
00007fe438000000   6564K rw---   [ anon ] ← this value increases over time...
00007fe438669000  58972K -----   [ anon ] ← ...while this value decreases over time
00007fe43c000000    132K rw---   [ anon ]
00007fe43c021000  65404K -----   [ anon ]
00007fe440000000   6800K rw---   [ anon ]
00007fe4406a4000  58736K -----   [ anon ]
00007fe44552c000      4K -----   [ anon ]
00007fe44552d000   8192K rw---   [ anon ]
00007fe445d2d000      4K -----   [ anon ]
00007fe445d2e000  16908K rw---   [ anon ]
00007fe446db1000      4K -----   [ anon ]
00007fe446db2000  18744K rw---   [ anon ]
00007fe448000000   9068K rw---   [ anon ]
00007fe4488db000  56468K -----   [ anon ]
00007fe44c027000  29424K rw---   [ anon ]
00007fe44dce3000     52K r-x-- meta.so
00007fe44dcf0000   2044K ----- meta.so
...snip...
===

Other anonymous allocations change over time as well; I have pointed out only
one pair that shows a similar pattern.
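
A sketch of one way to watch that pair of mappings over time (the address
prefix is taken from the pmap excerpt above; the 5-minute interval and the
single-mount assumption behind pidof are arbitrary):

===
while true
do
    date
    pmap -x $(pidof glusterfs) | awk '$1 ~ /^00007fe438/'
    sleep 300
done
===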

--- Additional comment from Kyle Johnson on 2016-08-25 15:39:28 EDT ---

I may also be experiencing this bug.

Distributed volume w/ 3.7.14, no patches.  One server is FreeBSD 10.x, the
other is CentOS 6.  Both bricks are ZFS backed.

===
[root@colossus2 ~]# gluster volume info ftp

Volume Name: ftp
Type: Distribute
Volume ID: f3f2b222-575c-4c8d-92f1-e640fd7edfbb
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.110.1:/tank/bricks/ftp
Brick2: 192.168.110.2:/ftp/bricks/ftp
Options Reconfigured:
performance.client-io-threads: on
cluster.weighted-rebalance: off
performance.readdir-ahead: on
nfs.disable: on
===

After approximately 24 hours of heavy reading on the client (copying 93728
files totaling 1.4 TB out of the FUSE mount point to a non-Gluster location
elsewhere on the system), I found that the glusterfs process was consuming all
system memory, including swap, approximately 34 GB. Remounting the FUSE volume
freed the memory.

Let me know if you need anything else.

--- Additional comment from Kyle Johnson on 2016-08-25 15:43:04 EDT ---

Oops.  I am on 3.7.13.  Fat fingers.

--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:33:50 EDT ---

I've run another test: http://termbin.com/cgzj

The idea is to split the read/write workload across two clients. I'm attaching
statedumps of both the reader and the writer.

Unfortunately, it seems that memory accounting for the FUSE translator suffers
from an integer overflow (underflow?), because I get something like this
(max_size sits just below 2^32, consistent with a 32-bit counter wrapping):

===
[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=4262955128
num_allocs=1
max_size=4294967088
max_num_allocs=3
total_allocs=18389093
===

Another observation: the reader consumes more memory than the writer. Given that
cache-size is 32M (and even if it is used across 5 xlators), total RAM
consumption should not exceed 160-200M. However, the reader's RAM:

===
root     30865  8.7  3.5 1054244 433676 ?      Ssl  Aug25  61:01
/usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test
/mnt/net/glusterfs/test2
===

and the writer's RAM:

===
root     30796  3.3  2.0 722464 249352 ?       Ssl  Aug25  23:11
/usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test
/mnt/net/glusterfs/test1
===

--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:38 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:38 EDT ---



--- Additional comment from Yannick Perret on 2016-08-29 12:35:51 EDT ---

I also hit the same bug, using versions 3.7.13 and 3.7.14 (compiled from
sources, no patches) and 3.8.1 (using the package from the Gluster download page).

Tested on an up-to-date Debian 8.

Same observations: memory keeps growing on any operation (recursive listing,
reads, writes), even when repeatedly reading the same files, and even after
days of inactivity. Memory can grow until the OOM killer kicks in (>6 GB by the
time I let it get to that point).

--- Additional comment from Yannick Perret on 2016-08-29 13:46:30 EDT ---

Some memory data, from tests performed on Gluster 3.8.1 (both servers and the
client are 64-bit Debian 8, up to date, with default kernel options; the test
volume is a replica-2 volume without any tuning; the mount is performed using
"mount -t glusterfs" without any options):
What I did:
1. mounting the volume, removing all its content, umounting it
2. mounting the volume
3. performing a cp -Rp /usr/* /root/MNT
4. performing a rm -rf /root/MNT/*
5. re-doing 3, 4

VSZ/RSS of the gluster client process are, respectively:
- 381896 / 35688 just after mount
- 644040 / 309240 after 1st cp -Rp
- 644040 / 310128 after 1st rm -rf
- 709576 / 310128 after 1st kill -USR1 (it was just for testing)
- 840648 / 421964 after 2nd cp -Rp
- 840648 / 422224 after 2nd rm -rf

I created a small script in order to perform these operations in a loop (the
"get VSZ/RSS" steps are shown here as ps calls):
while /bin/true
do
  cp -Rp /usr/* /root/MNT/
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS of the glusterfs process
  rm -rf /root/MNT/*
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS again after the removal
done

It gives:
1037256 645500
1037256 645840
1168328 757348
1168328 757620
1299400 869128
1299400 869328
1364936 980712
1364936 980944
1496008 1092384
1496008 1092404
1627080 1203796
1627080 1203996
1692616 1315572
1692616 1315504
1823688 1426812
1823688 1427340
1954760 1538716
1954760 1538772
2085832 1647676
2085832 1647708
2151368 1750392
2151368 1750708
2282440 1853864
2282440 1853764
2413512 1952668
2413512 1952704
2479048 2056500
2479048 2056712

I let it run and the gluster client eventually got killed by the OOM killer:
[1246854.291996] Out of memory: Kill process 931 (glusterfs) score 742 or sacrifice child
[1246854.292102] Killed process 931 (glusterfs) total-vm:3527624kB, anon-rss:3100328kB, file-rss:0kB

So ~3.5 GB of VSZ on this machine.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 04:08 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-08-30 04:09:17 EDT ---

Attached Valgrind's Memcheck tool output for reading workload.

--- Additional comment from Nithya Balachandran on 2016-08-30 05:00:28 EDT ---

Thank you. I will take a look at this.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 05:03:17 EDT ---

Nithya, please take a look at my message to the mailing list regarding the
Massif tool; I'd like to profile the GlusterFS client with it too, but am
having some trouble [1] [2].

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050665.html

[2] https://www.gluster.org/pipermail/gluster-devel/2016-August/050666.html

--- Additional comment from Oleksandr Natalenko on 2016-09-02 06:30:35 EDT ---

I've tried different kinds of workload and rechecked with 3.7.15, and it seems
that the leak is reproduced reliably with the following command on a volume
with lots of files:

find /mnt/volume -type f -print0 | xargs -0 stat

In half an hour RSS went from 150M to 1G.

Valgrind's Memcheck shows nothing. I still need some help with the Massif tool.

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:20:28 EDT ---

Okay, I've finally figured out how to use the Massif tool, and here are the
results.

Volume layout and options are the same as given in this BZ's description. After
mounting, RSS is 90M; after performing this command for some time (there are
millions of files):

===
find /mnt/net/glusterfs/test -print0 | xargs -0 stat >/dev/null
===

RSS grows to 900M (and counting). VSZ also grows from ~400M to ~1200M (and
counting). Performing drop_caches lowers RSS/VSZ usage only by ~100M.

There are 3 Massif logs attached; the main one is for PID 23664. I also attach
a client statedump.

drop_caches was called before taking the statedump and unmounting the volume.
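
For anyone looking at the attachments, a sketch of how the Massif logs are
typically read (the file name follows valgrind's massif.out.<pid> default, with
the main PID from above):

===
# human-readable snapshot table and allocation trees
ms_print massif.out.23664 | less
===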

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:20 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:21 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:21 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:22 EDT ---



--- Additional comment from Pranith Kumar K on 2016-09-07 14:40:22 EDT ---

Oleksandr,
     As per the Massif tool output, the heap size is only ~113MB:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 89 500,793,720,910      112,228,416      109,224,305     3,004,111            0
 90 504,760,946,959      113,760,568      110,649,152     3,111,416            0

I also went through the major stack traces that were active during the last
snapshot. Nothing seems new; most of them are related to the last inode
allocations. But you were saying the RSS was around 900M, so I am wondering
whether you captured this at the same time or not.

Pranith

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:11:06 EDT ---

Re-ran profiling with --pages-as-heap=yes, as requested by Pranith in
#gluster-dev.

===
root     10696 15.2  7.9 1557544 969640 pts/0  Sl+  Sep07  95:47 valgrind
--tool=massif --pages-as-heap=yes --run-libc-freeres=no --trace-children=yes
--sim-hints=fuse-compatible /usr/sbin/glusterfs -N
--volfile-server=glusterfs.example.com --volfile-id=some_volume
/mnt/net/glusterfs/test
===

Main PID is 10696, attaching Massif output and statedump.

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:11 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:12 EDT ---



--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:23:18 EDT ---

Pranith, as you can see in #c23 and #c24, all the memory now shows up in the
statistics:

===
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50 1,164,197,215,106    1,481,912,320    1,481,912,320             0            0
 51 1,166,294,532,574    1,481,912,320    1,481,912,320             0            0
 52 1,168,332,670,161    1,481,912,320    1,481,912,320             0            0
 53 1,176,305,259,696    1,549,017,088    1,549,017,088             0            0
 54 1,176,353,806,434    1,549,021,184    1,549,021,184             0            0
===

--- Additional comment from Nithya Balachandran on 2016-09-09 11:53:51 EDT ---

I can reproduce the memory rise but statedumps don't seem to contain anything
useful. I have started looking through the Massif files and will continue to
run the tests to see what I can find.

--- Additional comment from Nithya Balachandran on 2016-09-12 12:20:38 EDT ---

Status so far:

I don't see anything in the Massif logs to pinpoint a leak. The quick-read
translator was one candidate, but I can see the leak even with it disabled.

I am trying the following:

Created a single-brick distribute volume with default options, except for
quick-read, which I disabled.
Copied 100000 zero-byte files to the gluster FUSE mount. RSS goes from 30m to
69m.
Ran rm -rf *. RSS jumps to 104m but then comes down to 74m once the operation
is complete.

I will take another look at the readdirp code.
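
A sketch of a comparable zero-byte-file workload, assuming a FUSE mount at
/mnt/fuse (placeholder path):

===
cd /mnt/fuse
# create 100000 empty files in batches to stay under the argv limit
seq 1 100000 | xargs -n 1000 touch
ps -o vsz,rss -C glusterfs

# remove them and re-check once the rm finishes
rm -rf /mnt/fuse/*
ps -o vsz,rss -C glusterfs
===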

--- Additional comment from Nithya Balachandran on 2016-09-20 23:24:40 EDT ---

Just letting you know I am still looking into this but have not had much luck
yet.
Will update when I have something.

--- Additional comment from Nithya Balachandran on 2016-09-28 00:26:39 EDT ---

Valgrind etc. don't show any memory leaks so far, but I do see the memory usage
rising. I am wondering if this could be due to memory fragmentation rather than
memory leaks.

Still looking. All ideas welcome.
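
One cheap way to probe the fragmentation idea (not something tried in this
thread): the 64M anonymous blocks seen earlier are the same size as glibc's
per-thread malloc arenas, so restarting the client with the arena count capped
is an easy experiment. MALLOC_ARENA_MAX is the stock glibc tunable; the mount
arguments below are the ones from the original report:

===
umount /var/spool/mail/virtual
MALLOC_ARENA_MAX=2 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable \
    --volfile-server=glusterfs.server.com --volfile-id=mail_boxes \
    /var/spool/mail/virtual
===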

--- Additional comment from Oleksandr Natalenko on 2016-09-28 07:51 EDT ---



--- Additional comment from Worker Ant on 2016-09-28 12:12:23 EDT ---

REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage)
posted (#1) for review on master by N Balachandran (nbalacha at redhat.com)

--- Additional comment from Nithya Balachandran on 2016-09-29 01:19:25 EDT ---

Oleksandr,

Do you see a difference if you turn off performance.readdir-ahead? Apparently
it is a known memory hog.

Still looking into how to improve things.
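
For reference, a minimal sketch of that experiment, assuming the volume name
mail_boxes from the original report:

===
gluster volume set mail_boxes performance.readdir-ahead off
gluster volume info mail_boxes | grep readdir-ahead
===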


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1369364
[Bug 1369364] Huge memory usage of FUSE client
-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

