[Bugs] [Bug 1695099] New: The number of glusterfs processes keeps increasing, using all available resources

bugzilla at redhat.com
Tue Apr 2 13:21:23 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1695099

            Bug ID: 1695099
           Summary: The number of glusterfs processes keeps increasing,
                    using all available resources
           Product: GlusterFS
           Version: 5
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: glusterd
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: christian.ihle at drift.oslo.kommune.no
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:

During normal operations, CPU and memory usage gradually climb to 100%,
consumed by a growing number of glusterfs processes. The result is slowness
and resource starvation. The issue started happening with GlusterFS 5.2 and
did not improve with 5.5. We did not see this issue with 3.12.


Version-Release number of selected component (if applicable): 
GlusterFS 5.2 and 5.5
Heketi 8.0.0
CentOS 7.6


How reproducible:

Users of the cluster hit this issue frequently by creating and deleting
volumes in quick succession from Kubernetes (which uses Heketi to control
GlusterFS). Some days we hit 100% resource usage several times.


Steps to Reproduce:
1. Create volume
2. Delete volume
3. Repeat quickly
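
A minimal reproduction sketch from the command line (assuming a working
heketi-cli pointed at the cluster; the volume size, iteration count, and the
awk pattern used to extract the volume id are illustrative and may need
adjusting to your heketi-cli output):

for i in $(seq 1 20); do
    # Create a small volume and capture its id from the heketi-cli output
    id=$(heketi-cli volume create --size=1 | awk '/Volume Id:/ {print $3}')
    # Delete it again immediately to exercise the rapid create/delete cycle
    heketi-cli volume delete "$id"
done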


Actual results:

CPU and memory usage increase, and the number of glusterfs processes keeps
growing. I have to log in to each node in the cluster and kill the old
processes to make the nodes responsive again; otherwise the nodes eventually
freeze from resource starvation.
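
The manual cleanup looks roughly like this (a sketch, run as root on each
node; it assumes the process recorded in the glustershd pid file is the live
one and that every other glustershd process is stale):

# Pid of the glustershd instance glusterd currently tracks
live=$(cat /var/run/gluster/glustershd/glustershd.pid)
# Kill every other glustershd process
for pid in $(pgrep -f 'glusterfs.*--process-name glustershd'); do
    if [ "$pid" != "$live" ]; then
        kill "$pid"
    fi
done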


Expected results:

CPU and memory usage should spike only briefly and not keep increasing, and
there should be only one glustershd process per node.


Additional info:

I found some issues that look similar:

* https://github.com/gluster/glusterfs/issues/625
* https://github.com/heketi/heketi/issues/1439


Log output from a period when resource usage was increasing
(/var/log/glusterfs/glusterd.log):

[2019-04-01 12:07:23.377715] W [MSGID: 101095]
[xlator.c:180:xlator_volopt_dynload] 0-xlator:
/usr/lib64/glusterfs/5.5/xlator/nfs/server.so: cannot open shared object file:
No such file or directory
[2019-04-01 12:07:23.684561] I [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Ran script: /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh
--volname=vol_45653f46dbc8953f876a009b4ea8dd26
[2019-04-01 12:07:26.931683] I [rpc-clnt.c:1000:rpc_clnt_connection_init]
0-snapd: setting frame-timeout to 600
[2019-04-01 12:07:26.932340] I [rpc-clnt.c:1000:rpc_clnt_connection_init]
0-gfproxyd: setting frame-timeout to 600
[2019-04-01 12:07:26.932667] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:26.932707] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is
stopped
[2019-04-01 12:07:26.932731] I [MSGID: 106599]
[glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so
xlator is not installed
[2019-04-01 12:07:26.963055] I [MSGID: 106568]
[glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd
daemon running in pid: 16020
[2019-04-01 12:07:27.963708] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is
stopped
[2019-04-01 12:07:27.963951] I [MSGID: 106567]
[glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd
service
[2019-04-01 12:07:28.985311] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:28.985478] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is
stopped
[2019-04-01 12:07:28.989024] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already
stopped
[2019-04-01 12:07:28.989098] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is
stopped
[2019-04-01 12:07:29.299841] I [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh
--volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1
--volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:29.338437] E [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Failed to execute script:
/var/lib/glusterd/hooks/1/start/post/S30samba-start.sh
--volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1
--volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:52.658922] I [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh
--volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.679220] E [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh
--volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.681081] I [MSGID: 106542]
[glusterd-utils.c:8440:glusterd_brick_signal] 0-glusterd: sending signal 15 to
brick with pid 27595
[2019-04-01 12:07:53.732699] I [MSGID: 106599]
[glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so
xlator is not installed
[2019-04-01 12:07:53.791560] I [MSGID: 106568]
[glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd
daemon running in pid: 18583
[2019-04-01 12:07:53.791857] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/var/lib/heketi/mounts/vg_799fbf11286fbf497605bbe58c3e9dfa/brick_08bfe132dad6099ab387555298466ca3/brick
on port 49162
[2019-04-01 12:07:53.822032] I [MSGID: 106006]
[glusterd-svc-mgmt.c:356:glusterd_svc_common_rpc_notify] 0-management:
glustershd has disconnected from glusterd.
[2019-04-01 12:07:54.792497] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is
stopped
[2019-04-01 12:07:54.792736] I [MSGID: 106567]
[glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd
service
[2019-04-01 12:07:55.812655] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:55.812837] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is
stopped
[2019-04-01 12:07:55.816580] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already
stopped
[2019-04-01 12:07:55.816672] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is
stopped
[2019-04-01 12:07:59.829927] I [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Ran script: /var/lib/glusterd/hooks/1/delete/pre/S10selinux-del-fcontext.sh
--volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:59.951300] I [MSGID: 106495]
[glusterd-handler.c:3118:__glusterd_handle_getwd] 0-glusterd: Received getwd
req
[2019-04-01 12:07:59.967584] I [run.c:242:runner_log]
(-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a]
-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65]
-->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management:
Ran script: /var/lib/glusterd/hooks/1/delete/post/S57glusterfind-delete-post
--volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:53.732626] I [MSGID: 106131]
[glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:53.732677] I [MSGID: 106568]
[glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is
stopped


Examples of errors about already-deleted volumes from
/var/log/glusterfs/glustershd.log; we get gigabytes of these every day.
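
To quantify the growth, a quick check (assuming the default log location):

grep -c 'All subvolumes are down' /var/log/glusterfs/glustershd.log  # error count
du -h /var/log/glusterfs/glustershd.log                              # current log size

Sample entries: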

[2019-04-02 09:57:08.997572] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
10-vol_3be8a34875cc37098593d4bc8740477b-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.033441] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
26-vol_2399e6ef0347ac569a0b1211f1fd109d-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.036003] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
40-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.077109] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
2-vol_bca47201841f5b50d341eb2bedf5cd46-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.103495] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
24-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.455818] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
30-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.511070] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
14-vol_cf3700764dfdce40d60b89fde7e1a643-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.490714] E [MSGID: 108006]
[afr-common.c:5314:__afr_handle_child_down_event]
0-vol_c5112b1e28a7bbc96640a8572009c6f0-replicate-0: All subvolumes are down.
Going offline until at least one of them comes back up.


Example of concurrent glusterfs (glustershd) processes on a node, from ps output:

root      4559 16.8  6.5 14882048 1060288 ?    Ssl  Apr01 206:00
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root      6507 14.7  6.1 14250324 998280 ?     Ssl  Apr01 178:33
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root      6743  0.0  1.2 4780344 201708 ?      Ssl  Apr01   0:35
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root      7660 17.0  6.3 14859244 1027432 ?    Ssl  Apr01 206:32
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root      7789  0.1  1.5 5390364 250200 ?      Ssl  Apr01   1:08
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root      9259 16.4  6.3 14841432 1029512 ?    Ssl  Apr01 198:12
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     12394 14.0  5.6 13549044 918424 ?     Ssl  Apr01 167:46
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     14980  9.2  4.7 11657716 778876 ?     Ssl  Apr01 110:10
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     16032  8.2  4.4 11040436 716020 ?     Ssl  Apr01  97:39
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     23961  6.3  3.7 9807736 610408 ?      Ssl  Apr01  62:03
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     25560  2.8  3.0 8474704 503488 ?      Ssl  Apr01  27:33
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     26293  3.2  1.2 4812208 200896 ?      Ssl  09:26   0:35
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     28205  1.3  1.8 5992016 300012 ?      Ssl  Apr01  13:31
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     29186  1.4  2.1 6669800 352440 ?      Ssl  Apr01  13:59
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     30485  0.9  0.6 3527080 101552 ?      Ssl  09:35   0:05
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     31171  1.0  0.6 3562360 104908 ?      Ssl  09:35   0:05
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
root     32086  0.6  0.3 2925412 54852 ?       Ssl  09:35   0:03
/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
/var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log
-S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option
*replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name
glustershd
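
Note that all of these instances were started with the same pid file and the
same socket path, so presumably glusterd only tracks the one named in the pid
file and the rest are orphaned. A quick check on each node (a sketch; the
pgrep pattern assumes the command lines shown above):

# Number of glustershd processes actually running
pgrep -cf 'glusterfs.*--process-name glustershd'
# The single pid glusterd believes is current
cat /var/run/gluster/glustershd/glustershd.pid

On a healthy node the count is 1 and matches the pid file.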
