<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    Hi, <br>
    <br>
    We are building a new storage system, and after geo-replication has
    been running for a few hours the server runs out of memory and the
    oom-killer starts killing brick processes. With geo-replication off
    it runs fine, and the server has 64GB of RAM. I have stopped
    geo-replication for now.<br>
    <br>
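    For reference, I stopped it with the standard geo-replication stop
    command (the secondary host and volume names below are placeholders,
    not our real ones):<br>
    <br>
    <pre>[root@storage01 ~]# gluster volume geo-replication storage geouser@backup01::storage-backup stop</pre>
    <br>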
    Any ideas what to tune? <br>
    <br>
    <pre>[root@storage01 ~]# gluster --version | head -1
glusterfs 7.7

[root@storage01 ~]# cat /etc/centos-release; uname -r
CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.10.1.el7.x86_64

[root@storage01 ~]# df -h /storage2/
Filesystem            Size  Used Avail Use% Mounted on
10.0.231.91:/storage  328T  228T  100T  70% /storage2

[root@storage01 ~]# cat /proc/meminfo | grep MemTotal
MemTotal:       65412064 kB

[root@storage01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          18           0           0          43          43
Swap:             3           0           3

[root@storage01 ~]# gluster volume info
 
Volume Name: storage
Type: Distributed-Replicate
Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.231.91:/data/storage_a/storage
Brick2: 10.0.231.92:/data/storage_b/storage
Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
Brick4: 10.0.231.92:/data/storage_a/storage
Brick5: 10.0.231.93:/data/storage_b/storage
Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
Brick7: 10.0.231.93:/data/storage_a/storage
Brick8: 10.0.231.91:/data/storage_b/storage
Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
network.ping-timeout: 10
features.inode-quota: on
features.quota: on
nfs.disable: on
features.quota-deem-statfs: on
storage.fips-mode-rchecksum: on
performance.readdir-ahead: on
performance.parallel-readdir: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 256MB</pre>
    <br>
    You can see memory spike and then drop back as bricks are killed;
    this happened twice in the graph below:<br>
    <br>
    [Inline graph: host memory usage over time, showing two spikes in
    used memory, each ending in a sharp drop when a brick process was
    killed]<br>
    <br>
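    To attribute the growth to individual bricks, the resident set size
    of each glusterfsd can be watched while memory climbs; a minimal
    sketch (no output captured here):<br>
    <br>
    <pre>[root@storage01 ~]# watch -n 60 'ps -o pid,vsz,rss,args -C glusterfsd'</pre>
    <br>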
    You can see two brick processes are down: <br>
    <pre>[root@storage01 ~]# gluster volume status
Status of volume: storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.231.91:/data/storage_a/storage   N/A       N/A        N       N/A  
Brick 10.0.231.92:/data/storage_b/storage   49152     0          Y       1627 
Brick 10.0.231.93:/data/storage_c/storage   49152     0          Y       259966
Brick 10.0.231.92:/data/storage_a/storage   49153     0          Y       1642 
Brick 10.0.231.93:/data/storage_b/storage   49153     0          Y       259975
Brick 10.0.231.91:/data/storage_c/storage   49153     0          Y       20656
Brick 10.0.231.93:/data/storage_a/storage   49154     0          Y       259983
Brick 10.0.231.91:/data/storage_b/storage   N/A       N/A        N       N/A  
Brick 10.0.231.92:/data/storage_c/storage   49154     0          Y       1655 
Self-heal Daemon on localhost               N/A       N/A        Y       20690
Quota Daemon on localhost                   N/A       N/A        Y       172136
Self-heal Daemon on 10.0.231.93             N/A       N/A        Y       260010
Quota Daemon on 10.0.231.93                 N/A       N/A        Y       128115
Self-heal Daemon on 10.0.231.92             N/A       N/A        Y       1702 
Quota Daemon on 10.0.231.92                 N/A       N/A        Y       128564

Task Status of Volume storage
------------------------------------------------------------------------------
There are no active volume tasks</pre>
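    I'll bring the two dead bricks back with a force start of the
    volume, which just respawns the missing brick processes without
    touching the running ones:<br>
    <br>
    <pre>[root@storage01 ~]# gluster volume start storage force</pre>
    <br>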
    glusterd logs: <br>
    <br>
    <pre>[2020-08-13 20:58:22.186540] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49154 
[2020-08-13 20:58:22.196110] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_b/storage has disconnected from glusterd. 
[2020-08-13 20:58:22.196752] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_b/storage on port 49154 

[2020-08-13 21:05:23.418966] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152 
[2020-08-13 21:05:23.420881] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_a/storage has disconnected from glusterd. 
[2020-08-13 21:05:23.421334] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_a/storage on port 49152 
</pre>
    <br>
    dmesg shows the oom-killer events: <br>
    <br>
    <pre>[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664 (glusterfsd) score 422 or sacrifice child
[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0, total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB

[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647 (glusterfsd) score 467 or sacrifice child
[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0, total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB, shmem-rss:0kB</pre>
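    If it would help, I can capture a statedump from the bricks while
    the memory is climbing; a sketch of what I would run (dumps land
    under /var/run/gluster by default):<br>
    <br>
    <pre>[root@storage01 ~]# gluster volume statedump storage
[root@storage01 ~]# ls /var/run/gluster/*.dump.*</pre>
    <br>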
    <br>
    <br>
    glustershd logs: <br>
    <br>
    <pre>[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv] 0-storage-client-7: readv on 10.0.231.91:49154 failed (No data available)
[2020-08-13 20:58:22.185413] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available 
[2020-08-13 20:58:25.211872] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. 
[2020-08-13 20:58:25.211934] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available 
[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:58.000263] W [MSGID: 114031] [client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2020-08-13 21:02:58.000460] W [MSGID: 114029] [client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7: failed to send the fop 
[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv] 0-storage-client-0: readv on 10.0.231.91:49152 failed (No data available)
[2020-08-13 21:05:23.419365] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available 
[2020-08-13 21:05:26.423218] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. 
[2020-08-13 21:05:26.423274] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available 
[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown] 0-storage-client-0: intentional socket shutdown(8)
[2020-08-13 21:08:05.660858] I [MSGID: 100041] [glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/storage 
[2020-08-13 21:08:05.660948] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing 
[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-7: changing port to 49154 (from 0)
[2020-08-13 21:08:05.664638] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437), Version (400) 
[2020-08-13 21:08:05.665266] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7: Connected to storage-client-7, attached to remote volume '/data/storage_b/storage'. 
[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-0: changing port to 49152 (from 0)
[2020-08-13 21:08:05.716535] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400) 
[2020-08-13 21:08:05.717224] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0: Connected to storage-client-0, attached to remote volume '/data/storage_a/storage'. 
</pre>
    <br>
    Thanks,<br>
     -Matthew<br>
  </body>
</html>