<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    Thanks Strahil, <br>
    <br>
    Would the geo-rep process be the gsyncd.py processes? <br>
    <br>
    It seems like it's the glusterfsd brick processes and the auxiliary
    gluster mounts that are holding all the memory right now... <br>
    <br>
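    For reference, here's roughly how I've been checking which processes
    are holding the memory (just a sketch - the ps columns and grep
    pattern are my own choices, not from any docs): <br>
    <pre wrap="">
# list geo-rep sessions and their worker status
gluster volume geo-replication status

# gluster-related processes ordered by resident memory (RSS, kB), largest first
ps -eo pid,rss,etime,args --sort=-rss | egrep '[g]syncd|[g]luster' | head -20
    </pre>
    <br>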
    Could this be related to the open-behind bug mentioned here:
    <a class="moz-txt-link-freetext" href="https://github.com/gluster/glusterfs/issues/1444">https://github.com/gluster/glusterfs/issues/1444</a>  and here:
    <a class="moz-txt-link-freetext" href="https://github.com/gluster/glusterfs/issues/1440">https://github.com/gluster/glusterfs/issues/1440</a> ? <br>
    <br>
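    If it is that open-behind leak, I suppose one quick test (assuming
    open-behind is still at its default of on here, since it isn't in the
    reconfigured options) would be to disable it on the volume and watch
    whether the memory growth stops: <br>
    <pre wrap="">
# disable the open-behind translator on the "storage" volume as a test
gluster volume set storage performance.open-behind off
    </pre>
    <br>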
    Thanks,<br>
     -Matthew<br>
    <div class="moz-signature"><br>
      Matthew Benstead<br>
      System Administrator<br>
      <a href="https://pacificclimate.org/">Pacific Climate Impacts
        Consortium</a><br>
      University of Victoria, UH1<br>
      PO Box 1800, STN CSC<br>
      Victoria, BC, V8W 2Y2<br>
      Phone: 1-250-721-8432<br>
      Email: <a href="mailto:matthewb@uvic.ca">matthewb@uvic.ca</a>
    </div>
    <div class="moz-cite-prefix">On 2020-08-14 10:35 p.m., Strahil
      Nikolov wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:7A418761-7531-4B2F-9430-20976A22FFE6@yahoo.com">
      <pre class="moz-quote-pre" wrap="">Hey Matthew,

Can you check the memory leak with valgrind?

It will be something like:
Find the geo-rep process via ps and note all the parameters it was started with.
Next, stop geo-rep.

Then start it with valgrind:
valgrind --log-file="filename" --tool=memcheck --leak-check=full &lt;georep process binary&gt; &lt;geo rep parameters&gt;

It might help narrow down the problem.

Best Regards,
Strahil Nikolov

On 14 August 2020 20:22:16 GMT+03:00, Matthew Benstead <a class="moz-txt-link-rfc2396E" href="mailto:matthewb@uvic.ca">&lt;matthewb@uvic.ca&gt;</a> wrote:
</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">Hi,

We are building a new storage system, and after geo-replication has been
running for a few hours the server runs out of memory and oom-killer
starts killing bricks. It runs fine without geo-replication on, and the
server has 64GB of RAM. I have stopped geo-replication for now.

Any ideas what to tune?

[root@storage01 ~]# gluster --version | head -1
glusterfs 7.7

[root@storage01 ~]# cat /etc/centos-release; uname -r
CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.10.1.el7.x86_64

[root@storage01 ~]# df -h /storage2/
Filesystem            Size  Used Avail Use% Mounted on
10.0.231.91:/storage  328T  228T  100T  70% /storage2

[root@storage01 ~]# cat /proc/meminfo  | grep MemTotal
MemTotal:       65412064 kB

[root@storage01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          18           0           0          43          43
Swap:             3           0           3


[root@storage01 ~]# gluster volume info

Volume Name: storage
Type: Distributed-Replicate
Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.231.91:/data/storage_a/storage
Brick2: 10.0.231.92:/data/storage_b/storage
Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
Brick4: 10.0.231.92:/data/storage_a/storage
Brick5: 10.0.231.93:/data/storage_b/storage
Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
Brick7: 10.0.231.93:/data/storage_a/storage
Brick8: 10.0.231.91:/data/storage_b/storage
Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
network.ping-timeout: 10
features.inode-quota: on
features.quota: on
nfs.disable: on
features.quota-deem-statfs: on
storage.fips-mode-rchecksum: on
performance.readdir-ahead: on
performance.parallel-readdir: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 256MB

You can see the memory spike and then drop as bricks are killed - this
happened twice in the graph below:

[inline image: memory-usage graph]

You can see two brick processes are down:

[root@storage01 ~]# gluster volume status
Status of volume: storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.231.91:/data/storage_a/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_b/storage   49152     0          Y       1627
Brick 10.0.231.93:/data/storage_c/storage   49152     0          Y       259966
Brick 10.0.231.92:/data/storage_a/storage   49153     0          Y       1642
Brick 10.0.231.93:/data/storage_b/storage   49153     0          Y       259975
Brick 10.0.231.91:/data/storage_c/storage   49153     0          Y       20656
Brick 10.0.231.93:/data/storage_a/storage   49154     0          Y       259983
Brick 10.0.231.91:/data/storage_b/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_c/storage   49154     0          Y       1655
Self-heal Daemon on localhost               N/A       N/A        Y       20690
Quota Daemon on localhost                   N/A       N/A        Y       172136
Self-heal Daemon on 10.0.231.93             N/A       N/A        Y       260010
Quota Daemon on 10.0.231.93                 N/A       N/A        Y       128115
Self-heal Daemon on 10.0.231.92             N/A       N/A        Y       1702
Quota Daemon on 10.0.231.92                 N/A       N/A        Y       128564

Task Status of Volume storage
------------------------------------------------------------------------------
There are no active volume tasks

Logs:

[2020-08-13 20:58:22.186540] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
(null) on port 49154
[2020-08-13 20:58:22.196110] I [MSGID: 106005]
[glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management:
Brick 10.0.231.91:/data/storage_b/storage has disconnected from
glusterd.
[2020-08-13 20:58:22.196752] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/data/storage_b/storage on port 49154

[2020-08-13 21:05:23.418966] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
(null) on port 49152
[2020-08-13 21:05:23.420881] I [MSGID: 106005]
[glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management:
Brick 10.0.231.91:/data/storage_a/storage has disconnected from
glusterd.
[2020-08-13 21:05:23.421334] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/data/storage_a/storage on port 49152



[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664
(glusterfsd) score 422 or sacrifice child
[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0,
total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB

[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647
(glusterfsd) score 467 or sacrifice child
[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0,
total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB,
shmem-rss:0kB



glustershd logs:

[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv]
0-storage-client-7: readv on 10.0.231.91:49154 failed (No data
available)
[2020-08-13 20:58:22.185413] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from
storage-client-7. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 20:58:25.211872] E [MSGID: 114058]
[client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7:
failed to get the port number for remote subvolume. Please run 'gluster
volume status' on server to see if brick process is running.
[2020-08-13 20:58:25.211934] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from
storage-client-7. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:58.000263] W [MSGID: 114031]
[client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo
[Transport endpoint is not connected]
[2020-08-13 21:02:58.000460] W [MSGID: 114029]
[client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7:
failed to send the fop
[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv]
0-storage-client-0: readv on 10.0.231.91:49152 failed (No data
available)
[2020-08-13 21:05:23.419365] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from
storage-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:05:26.423218] E [MSGID: 114058]
[client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0:
failed to get the port number for remote subvolume. Please run 'gluster
volume status' on server to see if brick process is running.
[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:26.423274] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from
storage-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown]
0-storage-client-0: intentional socket shutdown(8)
[2020-08-13 21:08:05.660858] I [MSGID: 100041]
[glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs:
received attach request for volfile-id=shd/storage
[2020-08-13 21:08:05.660948] I [MSGID: 100040]
[glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in
volfile, continuing
[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig]
0-storage-client-7: changing port to 49154 (from 0)
[2020-08-13 21:08:05.664638] I [MSGID: 114057]
[client-handshake.c:1375:select_server_supported_programs]
0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437),
Version (400)
[2020-08-13 21:08:05.665266] I [MSGID: 114046]
[client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7:
Connected to storage-client-7, attached to remote volume
'/data/storage_b/storage'.
[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig]
0-storage-client-0: changing port to 49152 (from 0)
[2020-08-13 21:08:05.716535] I [MSGID: 114057]
[client-handshake.c:1375:select_server_supported_programs]
0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437),
Version (400)
[2020-08-13 21:08:05.717224] I [MSGID: 114046]
[client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0:
Connected to storage-client-0, attached to remote volume
'/data/storage_a/storage'.


Thanks,
 -Matthew
</pre>
      </blockquote>
    </blockquote>
    <br>
  </body>
</html>