[Gluster-users] Geo-Replication memory leak on slave node
mark.betham at performancehorizon.com
Thu Jun 7 07:19:04 UTC 2018
Many thanks for your prompt response.
Below are my responses to your questions:
1. Is this traceback consistently hit? I just wanted to confirm whether
it's transient, occurring once in a while and then returning to normal.
It appears not. As soon as the geo-rep recovered from the high memory usage
yesterday, it immediately began rising again until it had consumed all of
the available RAM, but this time nothing was committed to the log file.
I would like to add that this current instance of geo-rep was only
brought online at the start of this week due to the issues with glibc on
CentOS 7.5. This is the first time I have had geo-rep running with Gluster
version 3.12.9; both storage clusters at each physical site were only rebuilt
approx. 4 weeks ago because the previous version in use had gone EOL. Prior
to this I had been running 3.13.2 (3.13.x is now EOL) at each of the sites,
and it is worth noting that the same behaviour was also seen on that version
of Gluster. Unfortunately I do not have any of the log data from then, but I
do not recall seeing any instances of the traceback message mentioned.
2. Please upload the complete geo-rep logs from both master and slave.
I have the log files; I am just checking to make sure there is no
confidential info inside. The log files are too big to send via email, even
when compressed. Do you have a preferred method to allow me to share this
data with you, or would a share from my Google Drive be sufficient?
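In case it helps, below is a rough sketch of how the logs could be bundled
up before uploading them to a shared drive; the two directories are the
usual geo-rep log locations and may differ between installs.

    # Rough sketch: bundle the master- and slave-side geo-rep logs into a
    # single archive for upload.  The directories are the usual defaults
    # and may differ on a given install.
    import tarfile

    LOG_DIRS = [
        "/var/log/glusterfs/geo-replication",         # master-side session logs
        "/var/log/glusterfs/geo-replication-slaves",  # slave-side logs
    ]

    with tarfile.open("georep-logs.tar.gz", "w:gz") as tar:
        for path in LOG_DIRS:
            # Store each directory under a flattened name inside the archive.
            tar.add(path, arcname=path.strip("/").replace("/", "_"))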
3. Are the Gluster versions the same across master and slave?
Yes, all Gluster versions are the same across the two sites for all storage
nodes. See below for the version info taken from the current geo-rep master.
glusterfs 3.12.9
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
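For completeness, the versions can be confirmed with a quick loop over the
nodes along the lines below; the hostnames are placeholders and passwordless
SSH is assumed.

    # Rough sketch: confirm the glusterfs version reported by every storage
    # node.  Node names are placeholders; passwordless SSH is assumed.
    import subprocess

    NODES = ["master-node-1", "master-node-2", "master-node-3",
             "slave-node-1", "slave-node-2", "slave-node-3"]

    for node in NODES:
        out = subprocess.run(
            ["ssh", node, "glusterfs", "--version"],
            capture_output=True, text=True, check=True,
        )
        # The first line of the output carries the version string.
        print(node, "->", out.stdout.splitlines()[0])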
I have also attached another screenshot showing the memory usage from the
Gluster slave for the last 48 hours. This shows the memory saturation from
yesterday, which correlates with the traceback sent yesterday, and the
subsequent memory saturation which occurred over the last 24 hours. For
info, all times are in UTC.
Please advise the preferred method for getting the log data across to you,
and let me know if you require any further information.
On 7 June 2018 at 04:42, Kotresh Hiremath Ravishankar <khiremat at redhat.com> wrote:
> Hi Mark,
> A few questions.
> 1. Is this traceback consistently hit? I just wanted to confirm whether
> it's transient, occurring once in a while and then returning to normal.
> 2. Please upload the complete geo-rep logs from both master and slave.
> 3. Are the Gluster versions the same across master and slave?
> Kotresh HR
> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham <mark.betham@performancehorizon.com> wrote:
>> Dear Gluster-Users,
>> I have geo-replication set up and configured between two Gluster pools
>> located at different sites. What I am seeing is an error being reported
>> within the geo-replication slave log as follows:
>> [2018-06-05 12:05:26.767615] E [syncdutils(slave):331:log_raise_exception] <top>: FAIL:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 361, in twrap
>>     tf(*aa)
>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1009, in <lambda>
>>     t = syncdutils.Thread(target=lambda: (repce.service_loop(),
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, in
>>     self.q.put(recv(self.inf))
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, in
>>     return pickle.load(inf)
>> ImportError: No module named
>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call failed:
>> Traceback (most recent call last):
>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in
>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>> TypeError: getattr(): attribute name must be string
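The call chain above ends with repce.py unpickling messages from its RPC
stream, so an ImportError raised inside pickle.load aborts the receive loop.
The following is only a minimal sketch of that pattern, not the actual
Gluster code:

    # Minimal sketch (not the actual Gluster code) of the receive loop seen
    # in the traceback: a worker thread unpickles RPC messages from a stream
    # and queues them for dispatch.
    import pickle
    import queue
    import threading

    def recv(inf):
        # Deserialise the next message.  If the payload references a module
        # that cannot be imported on this side (e.g. a master/slave version
        # mismatch), pickle.load() raises "ImportError: No module named ...".
        return pickle.load(inf)

    def service_loop(inf):
        # Keep pulling messages off the stream and hand them to a queue,
        # mirroring the listener thread shown in the traceback.
        q = queue.Queue()

        def listen():
            while True:
                q.put(recv(inf))

        threading.Thread(target=listen, daemon=True).start()
        return q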
>> From this point in time the slave server begins to consume all of its
>> available RAM until it becomes non-responsive. Eventually the gluster
>> service seems to kill off the offending process and the memory is returned
>> to the system. Once the memory has been returned to the remote slave
>> system the geo-replication often recovers and data transfer resumes.
>> I have attached the full geo-replication slave log containing the error
>> shown above. I have also attached an image file showing the memory usage
>> of the affected storage server.
>> We are currently running Gluster version 3.12.9 on top of CentOS 7.5
>> x86_64. The system has been fully patched and is running the latest
>> software, excluding glibc, which had to be downgraded to get
>> geo-replication working.
>> The Gluster volume runs on a dedicated partition using the XFS filesystem,
>> which in turn runs on an LVM thin volume. The physical storage is
>> presented as a single drive because the underlying disks are part of a
>> RAID 10 array.
>> The master volume being replicated holds a total of 2.2 TB of data. The
>> total size of the volume fluctuates very little, as the data being removed
>> equals the new data coming in. This data is made up of many thousands of
>> files across many separate directories. Data file sizes vary from the very
>> small (>1 KB) to the large (>1 GB). The Gluster service itself is running
>> with a single volume in a replicated configuration across 3 bricks at each
>> of the sites. The delta changes being replicated are on average about
>> 100 GB per day, which includes file creation / deletion / modification.
>> The config for the geo-replication session is as follows, taken from the
>> current source server:
>> special_sync_mode: partial
>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
>> change_detector: changelog
>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84
>> gluster_params: aux-gfid-mount acl
>> log_rsync_performance: true
>> remote_gsyncd: /nonexistent/gsyncd
>> gluster_command_dir: /usr/sbin/
>> ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
>> socketdir: /var/run/gluster
>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84
>> ignore_deletes: false
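For reference, the same listing can be pulled from the CLI along these
lines; the master volume name and slave URL below are placeholders for the
real session names:

    # Rough sketch: dump the live geo-rep session config via the gluster CLI.
    # The master volume name and slave URL are placeholders.
    import subprocess

    MASTER_VOL = "mastervol"
    SLAVE = "geoaccount@slave-host::slavevol"

    result = subprocess.run(
        ["gluster", "volume", "geo-replication", MASTER_VOL, SLAVE, "config"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)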
>> If any further information is required in order to troubleshoot this
>> issue then please let me know.
>> I would be very grateful for any help or guidance received.
>> Many thanks,
>> Mark Betham.
> Thanks and Regards,
> Kotresh H R
Mark Betham
Senior System Administrator
+44 (0) 191 261 2444
[An HTML version of this message and a non-text attachment (192981 bytes) were scrubbed by the archive.]