[Gluster-users] Geo-Replication memory leak on slave node

Kotresh Hiremath Ravishankar khiremat at redhat.com
Mon Jun 11 04:53:54 UTC 2018


Hi Mark,

Google drive works for me.

Thanks,
Kotresh HR

On Fri, Jun 8, 2018 at 3:00 PM, Mark Betham <
mark.betham at performancehorizon.com> wrote:

> Hi Kotresh,
>
> The memory issue has recurred.  It appears to be happening roughly once a
> day.
>
> Again there was no traceback in the log; the only new entries were as
> follows:
> [2018-06-08 08:26:43.404261] I [resource(slave):1020:service_loop] GLUSTER: connection inactive, stopping timeout=120
> [2018-06-08 08:29:19.357615] I [syncdutils(slave):271:finalize] <top>: exiting.
> [2018-06-08 08:31:02.432002] I [resource(slave):1502:connect] GLUSTER: Mounting gluster volume locally...
> [2018-06-08 08:31:03.716967] I [resource(slave):1515:connect] GLUSTER: Mounted gluster volume duration=1.2729
> [2018-06-08 08:31:03.717411] I [resource(slave):1012:service_loop] GLUSTER: slave listening
>
> I have attached an image showing the latest memory usage pattern.
>
> Can you please advise how I can pass the log data across to you?  As soon
> as I know this I will get the data uploaded for your review.
>
> Thanks,
>
> Mark Betham
>
>
>
>
> On 7 June 2018 at 08:19, Mark Betham <mark.betham at performancehorizon.com>
> wrote:
>
>> Hi Kotresh,
>>
>> Many thanks for your prompt response.
>>
>> Below are my responses to your questions;
>>
>> 1. Is this traceback consistently hit? I just wanted to confirm whether
>> it is transient, occurring once in a while and then returning to normal.
>> It appears not.  As soon as the geo-rep recovered yesterday from the high
>> memory usage, memory consumption immediately began rising again until it
>> had consumed all of the available RAM, but this time nothing was written
>> to the log file.
>> I would like to add that this instance of geo-rep was only brought online
>> at the start of this week, due to the issues with glibc on CentOS 7.5.
>> This is the first time I have had geo-rep running with Gluster ver 3.12.9;
>> both storage clusters at each physical site were rebuilt approx. 4 weeks
>> ago because the version previously in use had gone EOL.  Prior to this I
>> had been running 3.13.2 (3.13.x is now EOL) at each of the sites, and it
>> is worth noting that the same behaviour was also seen on that version of
>> Gluster.  Unfortunately I do not have any of the log data from then, but
>> I do not recall seeing any instances of the traceback message mentioned.
>>
>> 2. Please upload the complete geo-rep logs from both master and slave.
>> I have the log files; I am just checking to make sure there is no
>> confidential info inside.  The log files are too big to send via email,
>> even when compressed.  Do you have a preferred method for me to share this
>> data with you, or would a share from my Google Drive be sufficient?
>>
>> 3. Are the Gluster versions the same across master and slave?
>> Yes, all gluster versions are the same across the two sites for all
>> storage nodes.  See below for version info taken from the current geo-rep
>> master.
>>
>> glusterfs 3.12.9
>> Repository revision: git://git.gluster.org/glusterfs.git
>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> It is licensed to you under your choice of the GNU Lesser
>> General Public License, version 3 or any later version (LGPLv3
>> or later), or the GNU General Public License, version 2 (GPLv2),
>> in all cases as published by the Free Software Foundation.
>>
>> glusterfs-geo-replication-3.12.9-1.el7.x86_64
>> glusterfs-gnfs-3.12.9-1.el7.x86_64
>> glusterfs-libs-3.12.9-1.el7.x86_64
>> glusterfs-server-3.12.9-1.el7.x86_64
>> glusterfs-3.12.9-1.el7.x86_64
>> glusterfs-api-3.12.9-1.el7.x86_64
>> glusterfs-events-3.12.9-1.el7.x86_64
>> centos-release-gluster312-1.0-1.el7.centos.noarch
>> glusterfs-client-xlators-3.12.9-1.el7.x86_64
>> glusterfs-cli-3.12.9-1.el7.x86_64
>> python2-gluster-3.12.9-1.el7.x86_64
>> glusterfs-rdma-3.12.9-1.el7.x86_64
>> glusterfs-fuse-3.12.9-1.el7.x86_64
>>
>> I have also attached another screenshot showing the memory usage of the
>> Gluster slave for the last 48 hours.  This shows the memory saturation
>> from yesterday, which correlates with the traceback sent yesterday, and
>> the subsequent memory saturation which occurred over the last 24 hours.
>> For info, all times are in UTC.
>>
>> Please advise your preferred method for getting the log data across to
>> you, and let me know if you require any further information.
>>
>> Many thanks,
>>
>> Mark Betham
>>
>>
>> On 7 June 2018 at 04:42, Kotresh Hiremath Ravishankar <
>> khiremat at redhat.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Few questions.
>>>
>>> 1. Is this traceback consistently hit? I just wanted to confirm whether
>>> it is transient, occurring once in a while and then returning to normal.
>>> 2. Please upload the complete geo-rep logs from both master and slave.
>>> 3. Are the Gluster versions the same across master and slave?
>>>
>>> Thanks,
>>> Kotresh HR
>>>
>>> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham <
>>> mark.betham at performancehorizon.com> wrote:
>>>
>>>> Dear Gluster-Users,
>>>>
>>>> I have geo-replication set up and configured between 2 Gluster pools
>>>> located at different sites.  What I am seeing is an error being reported
>>>> within the geo-replication slave log, as follows:
>>>>
>>>> [2018-06-05 12:05:26.767615] E [syncdutils(slave):331:log_raise_exception] <top>: FAIL:
>>>> Traceback (most recent call last):
>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 361, in twrap
>>>>     tf(*aa)
>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1009, in <lambda>
>>>>     t = syncdutils.Thread(target=lambda: (repce.service_loop(),
>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, in service_loop
>>>>     self.q.put(recv(self.inf))
>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, in recv
>>>>     return pickle.load(inf)
>>>> ImportError: No module named h_2013-04-26-04:02:49-2013-04-26_11:02:53.gz.15WBuUh
>>>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call failed:
>>>> Traceback (most recent call last):
>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
>>>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>>> TypeError: getattr(): attribute name must be string
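>>>>
>>>> For reference, the ImportError above comes from repce's recv(), which
>>>> unpickles whatever arrives on the RPC stream (the "return pickle.load(inf)"
>>>> frame in the traceback).  When the unpickler meets a GLOBAL opcode it tries
>>>> to import the module named in the stream, so stray bytes such as a filename
>>>> surface as "No module named <filename>".  A minimal sketch that reproduces
>>>> this failure mode (the module name below is hypothetical, not taken from the
>>>> affected system):
>>>>
>>>> import io
>>>> import pickle
>>>>
>>>> # A protocol-0 GLOBAL opcode ('c') makes the unpickler import the module
>>>> # named on the following line.  A made-up, filename-like name fails the
>>>> # import in the same way as the log entry above.
>>>> bogus = b"cnot_a_real_module.gz.XXXXXX\nBogusClass\n."
>>>> try:
>>>>     pickle.load(io.BytesIO(bogus))
>>>> except ImportError as exc:
>>>>     print("unpickling failed: %s" % exc)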
>>>>
>>>> From this point in time the slave server begins to consume all of its
>>>> available RAM until it becomes non-responsive.  Eventually the gluster
>>>> service seems to kill off the offending process and the memory is returned
>>>> to the system.  Once the memory has been returned to the remote slave
>>>> system the geo-replication often recovers and data transfer resumes.
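>>>>
>>>> To line the memory growth up against the log entries, something as simple
>>>> as the following could be left running on the slave (a rough sketch: it
>>>> assumes a Linux /proc filesystem and that the slave worker shows up in the
>>>> process table with "gsyncd" in its command line; the polling interval is
>>>> arbitrary):
>>>>
>>>> import os
>>>> import time
>>>>
>>>> def gsyncd_rss_kb():
>>>>     """Sum VmRSS (kB) of every process whose cmdline mentions gsyncd."""
>>>>     total = 0
>>>>     for pid in [p for p in os.listdir('/proc') if p.isdigit()]:
>>>>         try:
>>>>             with open('/proc/%s/cmdline' % pid) as f:
>>>>                 if 'gsyncd' not in f.read():
>>>>                     continue
>>>>             with open('/proc/%s/status' % pid) as f:
>>>>                 for line in f:
>>>>                     if line.startswith('VmRSS:'):
>>>>                         total += int(line.split()[1])  # reported in kB
>>>>         except (IOError, OSError):
>>>>             continue  # process exited while /proc was being read
>>>>     return total
>>>>
>>>> while True:
>>>>     print('%s  gsyncd RSS: %d kB' % (time.strftime('%Y-%m-%d %H:%M:%S'),
>>>>                                      gsyncd_rss_kb()))
>>>>     time.sleep(60)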
>>>>
>>>> I have attached the full geo-replication slave log containing the error
>>>> shown above.  I have also attached an image file showing the memory usage
>>>> of the affected storage server.
>>>>
>>>> We are currently running Gluster version 3.12.9 on top of CentOS 7.5
>>>> x86_64.  The system has been fully patched and is running the latest
>>>> software, excluding glibc, which had to be downgraded to get
>>>> geo-replication working.
>>>>
>>>> The Gluster volume runs on a dedicated partition using the XFS
>>>> filesystem, which in turn sits on an LVM thin volume.  The physical
>>>> storage is presented as a single drive because the underlying disks are
>>>> part of a RAID 10 array.
>>>>
>>>> The master volume being replicated holds a total of 2.2 TB of data.
>>>> The total size of the volume fluctuates very little, as the data being
>>>> removed roughly equals the new data coming in.  The data is made up of
>>>> many thousands of files across many separate directories.  File sizes
>>>> vary from the very small (<1 KB) to the large (>1 GB).  The Gluster
>>>> service itself runs a single volume in a replicated configuration across
>>>> 3 bricks at each of the sites.  The delta changes being replicated
>>>> average about 100 GB per day, which includes file creation, deletion and
>>>> modification.
>>>>
>>>> The config for the geo-replication session is as follows, taken from
>>>> the current source server:
>>>>
>>>> special_sync_mode: partial
>>>> gluster_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.gluster.log
>>>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
>>>> change_detector: changelog
>>>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>> state_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.status
>>>> gluster_params: aux-gfid-mount acl
>>>> log_rsync_performance: true
>>>> remote_gsyncd: /nonexistent/gsyncd
>>>> working_dir: /var/lib/misc/glusterfsd/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1
>>>> state_detail_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-detail.status
>>>> gluster_command_dir: /usr/sbin/
>>>> pid_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.pid
>>>> georep_session_working_dir: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/
>>>> ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
>>>> master.stime_xattr_name: trusted.glusterfs.40e9e77a-034c-44a2-896e-59eec47e8a84.ccfaed9b-ff4b-4a55-acfa-03f092cdf460.stime
>>>> changelog_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-changes.log
>>>> socketdir: /var/run/gluster
>>>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>> ignore_deletes: false
>>>> state_socket_unencoded: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.socket
>>>> log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log
>>>>
>>>> If any further information is required in order to troubleshoot this
>>>> issue, please let me know.
>>>>
>>>> I would be very grateful for any help or guidance received.
>>>>
>>>> Many thanks,
>>>>
>>>> Mark Betham.
>>>>
>>>>
>>>>
>>>>
>>>> This email may contain confidential material; unintended recipients
>>>> must not disseminate, use, or act upon any information in it. If you
>>>> received this email in error, please contact the sender and permanently
>>>> delete the email.
>>>> Performance Horizon Group Limited | Registered in England & Wales
>>>> 07188234 | Level 8, West One, Forth Banks, Newcastle upon Tyne, NE1 3PA
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Kotresh H R
>>>
>>
>>
>>
>> --
>> MARK BETHAM
>> Senior System Administrator
>> +44 (0) 191 261 2444
>> performancehorizon.com
>> PerformanceHorizon <https://www.facebook.com/PerformanceHorizon>
>> tweetphg <https://twitter.com/tweetphg>
>> performance-horizon-group
>> <https://www.linkedin.com/company-beta/1484320/>
>>
>


-- 
Thanks and Regards,
Kotresh H R