[Gluster-users] Geo-Replication memory leak on slave node

Kotresh Hiremath Ravishankar khiremat at redhat.com
Wed Jun 20 09:55:52 UTC 2018


Hi Mark,

Sorry, I was busy and could not take a serious look at the logs. I can
update you on Monday.

Thanks,
Kotresh HR

On Wed, Jun 20, 2018 at 12:32 PM, Mark Betham <
mark.betham at performancehorizon.com> wrote:

> Hi Kotresh,
>
> I was wondering whether you had made any progress regarding the issue I
> am currently experiencing with geo-replication.
>
> For info, the fault remains and effectively requires a daily restart of
> the geo-replication service to reclaim the memory used on the slave node.
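>
> In case the context is useful, the daily restart is essentially just
> cycling the geo-rep session from the master node; a rough sketch of the
> kind of thing I mean (session names taken from the config quoted further
> down this thread) would be:
>
> #!/usr/bin/env python
> # Sketch only: cycle the geo-replication session so the slave-side gsyncd
> # worker is re-spawned and its memory is returned to the system.
> import subprocess
>
> MASTER_VOL = "glustervol0"
> SLAVE = "storage-server.local::glustervol1"
>
> def georep(action):
>     # Runs e.g. "gluster volume geo-replication glustervol0
>     # storage-server.local::glustervol1 stop"
>     subprocess.check_call(["gluster", "volume", "geo-replication",
>                            MASTER_VOL, SLAVE, action])
>
> georep("stop")
> georep("start")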
>
> If you require any further information then please do not hesitate to ask.
>
> Many thanks,
>
> Mark Betham
>
>
> On Mon, 11 Jun 2018 at 08:24, Mark Betham <mark.betham@
> performancehorizon.com> wrote:
>
>> Hi Kotresh,
>>
>> Many thanks.  I will shortly set up a share on my GDrive and send the
>> link directly to you.
>>
>> For info:
>> The geo-rep slave failed again over the weekend, and this time it did not
>> recover.  It appears to have become unresponsive at around 14:40 UTC on 9
>> June.  I have attached an image showing the memory usage, from which you
>> can see when the system failed.  The server was completely unresponsive
>> and required a cold power-off and power-on to recover it.
>>
>> Many thanks for your help.
>>
>> Mark Betham.
>>
>> On 11 June 2018 at 05:53, Kotresh Hiremath Ravishankar <
>> khiremat at redhat.com> wrote:
>>
>>> Hi Mark,
>>>
>>> Google Drive works for me.
>>>
>>> Thanks,
>>> Kotresh HR
>>>
>>> On Fri, Jun 8, 2018 at 3:00 PM, Mark Betham <mark.betham@
>>> performancehorizon.com> wrote:
>>>
>>>> Hi Kotresh,
>>>>
>>>> The memory issue has recurred.  It appears to be happening around once
>>>> a day.
>>>>
>>>> Again, no traceback was written to the log; the only new entries were as
>>>> follows:
>>>> [2018-06-08 08:26:43.404261] I [resource(slave):1020:service_loop] GLUSTER: connection inactive, stopping timeout=120
>>>> [2018-06-08 08:29:19.357615] I [syncdutils(slave):271:finalize] <top>: exiting.
>>>> [2018-06-08 08:31:02.432002] I [resource(slave):1502:connect] GLUSTER: Mounting gluster volume locally...
>>>> [2018-06-08 08:31:03.716967] I [resource(slave):1515:connect] GLUSTER: Mounted gluster volume duration=1.2729
>>>> [2018-06-08 08:31:03.717411] I [resource(slave):1012:service_loop] GLUSTER: slave listening
>>>>
>>>> I have attached an image showing the latest memory usage pattern.
>>>>
>>>> Can you please advise how best to get the log data across to you?  As
>>>> soon as I know, I will upload the data for your review.
>>>>
>>>> Thanks,
>>>>
>>>> Mark Betham
>>>>
>>>>
>>>>
>>>>
>>>> On 7 June 2018 at 08:19, Mark Betham <mark.betham@
>>>> performancehorizon.com> wrote:
>>>>
>>>>> Hi Kotresh,
>>>>>
>>>>> Many thanks for your prompt response.
>>>>>
>>>>> Below are my responses to your questions;
>>>>>
>>>>> 1. Is this traceback hit consistently? I just wanted to confirm whether
>>>>> it is transient, occurring once in a while and then returning to normal.
>>>>> It appears not.  As soon as geo-rep recovered from the high memory usage
>>>>> yesterday, usage immediately began rising again until it had consumed
>>>>> all of the available RAM.  This time, however, nothing was written to
>>>>> the log file.
>>>>> I would like to add that this current instance of geo-rep was only
>>>>> brought online at the start of this week because of the glibc issues on
>>>>> CentOS 7.5.  This is the first time I have run geo-rep with Gluster
>>>>> 3.12.9; both storage clusters, one at each physical site, were rebuilt
>>>>> approximately four weeks ago because the previously used version went
>>>>> EOL.  Prior to that I was running 3.13.2 (3.13.x is now EOL) at each
>>>>> site, and it is worth noting that the same behaviour was also seen on
>>>>> that version of Gluster.  Unfortunately I no longer have the log data
>>>>> from then, but I do not recall seeing any instances of the traceback
>>>>> message mentioned.
>>>>>
>>>>> 2. Please upload the complete geo-rep logs from both master and slave.
>>>>> I have the log files; I am just checking to make sure there is no
>>>>> confidential info inside.  The log files are too big to send via email,
>>>>> even when compressed.  Do you have a preferred method for me to share
>>>>> this data with you, or would a share from my Google Drive be sufficient?
>>>>>
>>>>> 3. Are the Gluster versions the same across master and slave?
>>>>> Yes, the Gluster versions are the same across the two sites for all
>>>>> storage nodes.  See below for version info taken from the current
>>>>> geo-rep master.
>>>>>
>>>>> glusterfs 3.12.9
>>>>> Repository revision: git://git.gluster.org/glusterfs.git
>>>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>>>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>>>> It is licensed to you under your choice of the GNU Lesser
>>>>> General Public License, version 3 or any later version (LGPLv3
>>>>> or later), or the GNU General Public License, version 2 (GPLv2),
>>>>> in all cases as published by the Free Software Foundation.
>>>>>
>>>>> glusterfs-geo-replication-3.12.9-1.el7.x86_64
>>>>> glusterfs-gnfs-3.12.9-1.el7.x86_64
>>>>> glusterfs-libs-3.12.9-1.el7.x86_64
>>>>> glusterfs-server-3.12.9-1.el7.x86_64
>>>>> glusterfs-3.12.9-1.el7.x86_64
>>>>> glusterfs-api-3.12.9-1.el7.x86_64
>>>>> glusterfs-events-3.12.9-1.el7.x86_64
>>>>> centos-release-gluster312-1.0-1.el7.centos.noarch
>>>>> glusterfs-client-xlators-3.12.9-1.el7.x86_64
>>>>> glusterfs-cli-3.12.9-1.el7.x86_64
>>>>> python2-gluster-3.12.9-1.el7.x86_64
>>>>> glusterfs-rdma-3.12.9-1.el7.x86_64
>>>>> glusterfs-fuse-3.12.9-1.el7.x86_64
>>>>>
>>>>> I have also attached another screenshot showing the memory usage on the
>>>>> Gluster slave over the last 48 hours.  It shows the memory saturation
>>>>> from yesterday, which correlates with the traceback sent yesterday, and
>>>>> the subsequent memory saturation that occurred over the last 24 hours.
>>>>> For info, all times are in UTC.
>>>>>
>>>>> Please advise the preferred method for getting the log data across to
>>>>> you, and let me know if you require any further information.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Mark Betham
>>>>>
>>>>>
>>>>> On 7 June 2018 at 04:42, Kotresh Hiremath Ravishankar <
>>>>> khiremat at redhat.com> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> A few questions:
>>>>>>
>>>>>> 1. Is this traceback hit consistently? I just wanted to confirm whether
>>>>>> it is transient, occurring once in a while and then returning to normal.
>>>>>> 2. Please upload the complete geo-rep logs from both master and slave.
>>>>>> 3. Are the Gluster versions the same across master and slave?
>>>>>>
>>>>>> Thanks,
>>>>>> Kotresh HR
>>>>>>
>>>>>> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham <mark.betham@
>>>>>> performancehorizon.com> wrote:
>>>>>>
>>>>>>> Dear Gluster-Users,
>>>>>>>
>>>>>>> I have geo-replication set up and configured between two Gluster pools
>>>>>>> located at different sites.  I am seeing an error reported in the
>>>>>>> geo-replication slave log, as follows:
>>>>>>>
>>>>>>> [2018-06-05 12:05:26.767615] E [syncdutils(slave):331:log_raise_exception] <top>: FAIL:
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 361, in twrap
>>>>>>>     tf(*aa)
>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1009, in <lambda>
>>>>>>>     t = syncdutils.Thread(target=lambda: (repce.service_loop(),
>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, in service_loop
>>>>>>>     self.q.put(recv(self.inf))
>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, in recv
>>>>>>>     return pickle.load(inf)
>>>>>>> ImportError: No module named h_2013-04-26-04:02:49-2013-04-26_11:02:53.gz.15WBuUh
>>>>>>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call failed:
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
>>>>>>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>>>>>> TypeError: getattr(): attribute name must be string
>>>>>>>
>>>>>>> From this point the slave server begins to consume all of its
>>>>>>> available RAM until it becomes unresponsive.  Eventually the Gluster
>>>>>>> service seems to kill off the offending process and the memory is
>>>>>>> returned to the system.  Once the memory has been returned on the
>>>>>>> remote slave, geo-replication often recovers and data transfer resumes.
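>>>>>>>
>>>>>>> For what it is worth, my reading of that ImportError (and I may well
>>>>>>> be wrong) is that the slave's repce layer simply unpickles whatever
>>>>>>> arrives on its RPC stream, so stray bytes that happen to resemble a
>>>>>>> pickle GLOBAL opcode surface as an attempt to import a nonsense
>>>>>>> "module" named after a file.  A purely illustrative snippet that
>>>>>>> reproduces the same class of error:
>>>>>>>
>>>>>>> import pickle
>>>>>>>
>>>>>>> # Hypothetical illustration only: a pickle GLOBAL opcode naming a
>>>>>>> # module that does not exist fails the same way the slave log does,
>>>>>>> # which suggests the stream carried bytes that were never a valid
>>>>>>> # gsyncd message.
>>>>>>> bogus = b"cnot_a_real_module\nSomeClass\n."
>>>>>>> try:
>>>>>>>     pickle.loads(bogus)
>>>>>>> except ImportError as err:  # ModuleNotFoundError on Python 3
>>>>>>>     print("pickle.loads failed: %s" % err)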
>>>>>>>
>>>>>>> I have attached the full geo-replication slave log containing the
>>>>>>> error shown above.  I have also attached an image file showing the memory
>>>>>>> usage of the affected storage server.
>>>>>>>
>>>>>>> We are currently running Gluster version 3.12.9 on top of CentOS 7.5
>>>>>>> x86_64.  The system has been fully patched and is running the latest
>>>>>>> software, excluding glibc, which had to be downgraded to get
>>>>>>> geo-replication working.
>>>>>>>
>>>>>>> The Gluster volume runs on a dedicated partition using the XFS
>>>>>>> filesystem, which in turn sits on an LVM thin volume.  The physical
>>>>>>> storage is presented as a single drive because the underlying disks
>>>>>>> are part of a RAID 10 array.
>>>>>>>
>>>>>>> The master volume being replicated holds a total of 2.2 TB of data.
>>>>>>> The total size of the volume fluctuates very little, as the data being
>>>>>>> removed equals the new data coming in.  The data is made up of many
>>>>>>> thousands of files across many separate directories.  File sizes vary
>>>>>>> from very small (<1 KB) to large (>1 GB).  The Gluster service itself
>>>>>>> runs a single volume in a replicated configuration across 3 bricks at
>>>>>>> each of the sites.  The delta changes being replicated average about
>>>>>>> 100 GB per day, including file creation, deletion, and modification.
>>>>>>>
>>>>>>> The config for the geo-replication session is as follows, taken from
>>>>>>> the current source server:
>>>>>>>
>>>>>>> special_sync_mode: partial
>>>>>>> gluster_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.gluster.log
>>>>>>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
>>>>>>> change_detector: changelog
>>>>>>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>>>>> state_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.status
>>>>>>> gluster_params: aux-gfid-mount acl
>>>>>>> log_rsync_performance: true
>>>>>>> remote_gsyncd: /nonexistent/gsyncd
>>>>>>> working_dir: /var/lib/misc/glusterfsd/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1
>>>>>>> state_detail_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-detail.status
>>>>>>> gluster_command_dir: /usr/sbin/
>>>>>>> pid_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.pid
>>>>>>> georep_session_working_dir: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/
>>>>>>> ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
>>>>>>> master.stime_xattr_name: trusted.glusterfs.40e9e77a-034c-44a2-896e-59eec47e8a84.ccfaed9b-ff4b-4a55-acfa-03f092cdf460.stime
>>>>>>> changelog_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-changes.log
>>>>>>> socketdir: /var/run/gluster
>>>>>>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>>>>> ignore_deletes: false
>>>>>>> state_socket_unencoded: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.socket
>>>>>>> log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log
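>>>>>>>
>>>>>>> For reference, I believe the above is simply the session's config
>>>>>>> dump, i.e. roughly the output of something like the following sketch;
>>>>>>> individual options should be settable by appending "<key> <value>" to
>>>>>>> the same command.
>>>>>>>
>>>>>>> # Sketch: dump the geo-rep session config from the master node.
>>>>>>> import subprocess
>>>>>>> out = subprocess.check_output(
>>>>>>>     ["gluster", "volume", "geo-replication",
>>>>>>>      "glustervol0", "storage-server.local::glustervol1", "config"])
>>>>>>> print(out.decode("utf-8"))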
>>>>>>>
>>>>>>> If any further information is required in order to troubleshoot this
>>>>>>> issue then please let me know.
>>>>>>>
>>>>>>> I would be very grateful for any help or guidance received.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>>
>>>>>>> Mark Betham.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-users mailing list
>>>>>>> Gluster-users at gluster.org
>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Kotresh H R
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> MARK BETHAM
>>>>> Senior System Administrator
>>>>> +44 (0) 191 261 2444
>>>>> performancehorizon.com
>>>>> PerformanceHorizon <https://www.facebook.com/PerformanceHorizon>
>>>>> tweetphg <https://twitter.com/tweetphg>
>>>>> performance-horizon-group
>>>>> <https://www.linkedin.com/company-beta/1484320/>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> MARK BETHAM
>>>> Senior System Administrator
>>>> +44 (0) 191 261 2444
>>>> performancehorizon.com
>>>> PerformanceHorizon <https://www.facebook.com/PerformanceHorizon>
>>>> tweetphg <https://twitter.com/tweetphg>
>>>> performance-horizon-group
>>>> <https://www.linkedin.com/company-beta/1484320/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Kotresh H R
>>>
>>
>>
>>
>> --
>> MARK BETHAM
>> Senior System Administrator
>> +44 (0) 191 261 2444
>> performancehorizon.com
>> PerformanceHorizon <https://www.facebook.com/PerformanceHorizon>
>> tweetphg <https://twitter.com/tweetphg>
>> performance-horizon-group
>> <https://www.linkedin.com/company-beta/1484320/>
>>
>
>
> --
> MARK BETHAM
> Senior System Administrator
> +44 (0) 191 261 2444
> performancehorizon.com
> PerformanceHorizon <https://www.facebook.com/PerformanceHorizon>
> tweetphg <https://twitter.com/tweetphg>
> performance-horizon-group <https://www.linkedin.com/company-beta/1484320/>
>
>
>
>
>


-- 
Thanks and Regards,
Kotresh H R