<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Thanks Strahil, <br>
<br>
Would the geo-rep processes be the gsyncd.py processes? <br>
<br>
It seems like it's the glusterfsd processes and the auxiliary mounts that
are holding all the memory right now... <br>
<br>
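(For reference, here's the quick check I've been using to see which processes hold the memory; it only assumes standard procps tools, and the process names are the ones from this setup:)

```shell
# Show resident memory (RSS, in KB) of gluster-related processes, largest first.
# The [g] in the pattern keeps grep from matching its own process entry.
ps -eo rss,pid,comm --sort=-rss | grep -E '[g]luster|[g]syncd' \
    || echo "no gluster/gsyncd processes found on this host"
```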
Could this be related to the open-behind bug mentioned here:
<a class="moz-txt-link-freetext" href="https://github.com/gluster/glusterfs/issues/1444">https://github.com/gluster/glusterfs/issues/1444</a> and here:
<a class="moz-txt-link-freetext" href="https://github.com/gluster/glusterfs/issues/1440">https://github.com/gluster/glusterfs/issues/1440</a> ? <br>
<br>
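(If it is open-behind, I suppose one way to test would be to disable it on the volume and watch memory; performance.open-behind is a standard volume option, though whether it's the cause here is just a guess:)

```shell
# Disable open-behind on the "storage" volume and confirm the new value.
# Guarded so it only runs where the gluster CLI actually exists.
if command -v gluster >/dev/null 2>&1; then
    gluster volume set storage performance.open-behind off
    gluster volume get storage performance.open-behind
else
    echo "gluster CLI not found on this host"
fi
```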
Thanks,<br>
-Matthew<br>
<div class="moz-signature"><br>
Matthew Benstead<br>
System Administrator<br>
<a href="https://pacificclimate.org/">Pacific Climate Impacts
Consortium</a><br>
University of Victoria, UH1<br>
PO Box 1800, STN CSC<br>
Victoria, BC, V8W 2Y2<br>
Phone: 1-250-721-8432<br>
Email: <a href="mailto:matthewb@uvic.ca">matthewb@uvic.ca</a>
</div>
<div class="moz-cite-prefix">On 2020-08-14 10:35 p.m., Strahil
Nikolov wrote:<br>
</div>
<blockquote type="cite"
cite="mid:7A418761-7531-4B2F-9430-20976A22FFE6@yahoo.com">
<pre class="moz-quote-pre" wrap="">Hey Matthew,

Can you check the memory leak with valgrind? It will be something like this:

Find the geo-rep process via ps and note all the parameters it was started with.
Next, stop geo-rep.
Then start it under valgrind:

valgrind --log-file="filename" --tool=memcheck --leak-check=full <georep process binary> <geo rep parameters>

It might help narrow down the problem.
Best Regards,
Strahil Nikolov
На 14 август 2020 г. 20:22:16 GMT+03:00, Matthew Benstead <a class="moz-txt-link-rfc2396E" href="mailto:matthewb@uvic.ca"><matthewb@uvic.ca></a> написа:
</pre>
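Strahil's steps above might look something like this in practice (the session name and worker arguments are placeholders; substitute the values from your own ps output):

```shell
# 1. Record the geo-rep worker's full command line before stopping it.
#    The [g] keeps grep from matching its own process entry.
ps -eo pid,cmd | grep '[g]syncd' || echo "no gsyncd worker running"

# 2. Stop the geo-rep session (placeholder master/slave names):
#      gluster volume geo-replication <mastervol> <slavehost>::<slavevol> stop

# 3. Restart the worker under valgrind with the same arguments, e.g.:
#      valgrind --log-file=gsyncd-valgrind.log --tool=memcheck \
#          --leak-check=full <georep process binary> <geo rep parameters>
if ! command -v valgrind >/dev/null 2>&1; then
    echo "note: valgrind is not installed on this host"
fi
```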
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Hi,

We are building a new storage system, and after geo-replication has been running for a few hours the server runs out of memory and oom-killer starts killing bricks. It runs fine without geo-replication on, and the server has 64GB of RAM. I have stopped geo-replication for now.
Any ideas what to tune?
[root@storage01 ~]# gluster --version | head -1
glusterfs 7.7
[root@storage01 ~]# cat /etc/centos-release; uname -r
CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.10.1.el7.x86_64
[root@storage01 ~]# df -h /storage2/
Filesystem Size Used Avail Use% Mounted on
10.0.231.91:/storage 328T 228T 100T 70% /storage2
[root@storage01 ~]# cat /proc/meminfo | grep MemTotal
MemTotal: 65412064 kB
[root@storage01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          18           0           0          43          43
Swap:             3           0           3
[root@storage01 ~]# gluster volume info
Volume Name: storage
Type: Distributed-Replicate
Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.231.91:/data/storage_a/storage
Brick2: 10.0.231.92:/data/storage_b/storage
Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
Brick4: 10.0.231.92:/data/storage_a/storage
Brick5: 10.0.231.93:/data/storage_b/storage
Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
Brick7: 10.0.231.93:/data/storage_a/storage
Brick8: 10.0.231.91:/data/storage_b/storage
Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
network.ping-timeout: 10
features.inode-quota: on
features.quota: on
nfs.disable: on
features.quota-deem-statfs: on
storage.fips-mode-rchecksum: on
performance.readdir-ahead: on
performance.parallel-readdir: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 256MB
You can see the memory spike and then drop as bricks are killed; this
happened twice in the graph below:
You can see two brick processes are down:
[root@storage01 ~]# gluster volume status
Status of volume: storage
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.231.91:/data/storage_a/storage  N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_b/storage  49152     0          Y       1627
Brick 10.0.231.93:/data/storage_c/storage  49152     0          Y       259966
Brick 10.0.231.92:/data/storage_a/storage  49153     0          Y       1642
Brick 10.0.231.93:/data/storage_b/storage  49153     0          Y       259975
Brick 10.0.231.91:/data/storage_c/storage  49153     0          Y       20656
Brick 10.0.231.93:/data/storage_a/storage  49154     0          Y       259983
Brick 10.0.231.91:/data/storage_b/storage  N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_c/storage  49154     0          Y       1655
Self-heal Daemon on localhost              N/A       N/A        Y       20690
Quota Daemon on localhost                  N/A       N/A        Y       172136
Self-heal Daemon on 10.0.231.93            N/A       N/A        Y       260010
Quota Daemon on 10.0.231.93                N/A       N/A        Y       128115
Self-heal Daemon on 10.0.231.92            N/A       N/A        Y       1702
Quota Daemon on 10.0.231.92                N/A       N/A        Y       128564

Task Status of Volume storage
------------------------------------------------------------------------------
There are no active volume tasks
Logs:
[2020-08-13 20:58:22.186540] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49154
[2020-08-13 20:58:22.196110] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_b/storage has disconnected from glusterd.
[2020-08-13 20:58:22.196752] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_b/storage on port 49154
[2020-08-13 21:05:23.418966] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2020-08-13 21:05:23.420881] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_a/storage has disconnected from glusterd.
[2020-08-13 21:05:23.421334] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_a/storage on port 49152
[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664 (glusterfsd) score 422 or sacrifice child
[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0, total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB
[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647 (glusterfsd) score 467 or sacrifice child
[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0, total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB, shmem-rss:0kB
glustershd logs:
[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv] 0-storage-client-7: readv on 10.0.231.91:49154 failed (No data available)
[2020-08-13 20:58:22.185413] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 20:58:25.211872] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-08-13 20:58:25.211934] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:58.000263] W [MSGID: 114031] [client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2020-08-13 21:02:58.000460] W [MSGID: 114029] [client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7: failed to send the fop
[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv] 0-storage-client-0: readv on 10.0.231.91:49152 failed (No data available)
[2020-08-13 21:05:23.419365] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:05:26.423218] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:26.423274] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown] 0-storage-client-0: intentional socket shutdown(8)
[2020-08-13 21:08:05.660858] I [MSGID: 100041] [glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/storage
[2020-08-13 21:08:05.660948] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-7: changing port to 49154 (from 0)
[2020-08-13 21:08:05.664638] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2020-08-13 21:08:05.665266] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7: Connected to storage-client-7, attached to remote volume '/data/storage_b/storage'.
[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-0: changing port to 49152 (from 0)
[2020-08-13 21:08:05.716535] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2020-08-13 21:08:05.717224] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0: Connected to storage-client-0, attached to remote volume '/data/storage_a/storage'.
Thanks,
-Matthew
</pre>
</blockquote>
</blockquote>
<br>
</body>
</html>