<div dir="ltr"><div dir="ltr">Hello Richard,<div><br></div><div>Thank you for the logs.</div><div><br></div><div>I am wondering if this could be a different memory leak than the one addressed in the bug. Would it be possible for you to obtain a statedump of the client so that we can understand the memory allocation pattern better? Details about gathering a statedump can be found at [1]. Please ensure that /var/run/gluster is present before triggering a statedump.</div><div><br></div><div>Regards,</div><div>Vijay</div><div><br></div><div>[1] <a href="https://docs.gluster.org/en/v3/Troubleshooting/statedump/">https://docs.gluster.org/en/v3/Troubleshooting/statedump/</a></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck <<a href="mailto:hawk@tbi.univie.ac.at">hawk@tbi.univie.ac.at</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi again,<br>
<br>
In my limited understanding (I am not a full-time programmer), this is a memory<br>
leak in the gluster fuse client.<br>
<br>
Should I reopen the mentioned bug report or open a new one? Or would the<br>
community prefer an entirely different approach?<br>
<br>
Thanks<br>
Richard<br>
<br>
On 13.09.18 10:07, Richard Neuboeck wrote:<br>
> Hi,<br>
> <br>
> I've created excerpts from the brick and client logs covering +/- 1 minute<br>
> around the kill event. The excerpts are still ~400-500MB, so I'm putting them<br>
> up for download, since I have no idea what I should be looking<br>
> for and skimming them didn't reveal obvious problems to me.<br>
> <br>
> <a href="http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log" rel="noreferrer" target="_blank">http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log</a><br>
> <a href="http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log" rel="noreferrer" target="_blank">http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log</a><br>
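<br>
For anyone who needs to cut a similar excerpt out of a very large log, here is a minimal Python sketch. It is not the method used for the files above; it only assumes that each log entry starts with a bracketed timestamp in the format visible in the brick log lines quoted further down in this thread. The names excerpt, TS_RE and margin are made up for illustration.<br>
<pre>
```python
import re
from datetime import datetime, timedelta

# Assumed entry prefix, as seen in the quoted brick log:
# [2018-08-18 22:40:35.502510] I [MSGID: 115036] ...
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)?\]")


def excerpt(lines, center, margin=timedelta(minutes=1)):
    """Yield the lines whose timestamp lies within center +/- margin.

    Lines without a timestamp (continuations) inherit the decision made
    for the most recent timestamped line.
    """
    keep = False
    for line in lines:
        m = TS_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            keep = abs(ts - center) <= margin
        if keep:
            yield line
```
</pre>
Because the function iterates lazily, it can be fed an open file object, so the whole log never has to fit in memory.<br>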
> <br>
> I was pointed to the following bug report:<br>
> <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1613512" rel="noreferrer" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=1613512</a><br>
> It sounds right but seems to have been addressed already.<br>
> <br>
> If there is anything I can do to help solve this problem please let<br>
> me know. Thanks for your help!<br>
> <br>
> Cheers<br>
> Richard<br>
> <br>
> <br>
> On 9/11/18 10:10 AM, Richard Neuboeck wrote:<br>
>> Hi,<br>
>><br>
>> since I feared that the logs would fill up the partition (again) I<br>
>> checked the systems daily and finally found the reason. The glusterfs<br>
>> process on the client runs out of memory and gets killed by the OOM killer after<br>
>> about four days. Since rsync runs for a couple of days longer till it<br>
>> ends I never checked the whole time frame in the system logs and never<br>
>> stumbled upon the OOM message.<br>
>><br>
>> Running out of memory on a 128GB RAM system even with a DB occupying<br>
>> ~40% of that is kind of strange though. Might there be a leak?<br>
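<br>
One way to check for steady growth before the OOM killer steps in is to sample the resident set size of the client process at intervals. The following is a minimal sketch, assuming a Linux system where /proc/PID/status contains a VmRSS line; the function names, the interval and the sample count are illustrative and not part of any gluster tooling.<br>
<pre>
```python
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/PID/status text, or None."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def sample_rss(pid, interval_s=60.0, count=5):
    """Yield (unix_time, rss_kb) samples for a running process (Linux only)."""
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            yield time.time(), parse_vmrss_kb(f.read())
        if i + 1 != count:
            time.sleep(interval_s)
```
</pre>
A value that keeps climbing over hours of rsync without levelling off would support the leak suspicion, although it does not show where the memory goes.<br>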
>><br>
>> But this would explain the erratic behavior I've experienced over the<br>
>> last 1.5 years while trying to work with our homes on glusterfs.<br>
>><br>
>> Here is the kernel log message for the killed glusterfs process.<br>
>> <a href="https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a" rel="noreferrer" target="_blank">https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a</a><br>
>><br>
>> I'm checking the brick and client trace logs. But those are respectively<br>
>> 1TB and 2TB in size so searching in them takes a while. I'll be creating<br>
>> gists for both logs about the time when the process died.<br>
>><br>
>> As soon as I have more details I'll post them.<br>
>><br>
>> Here you can see a graphical representation of the memory usage of this<br>
>> system: <a href="https://imgur.com/a/4BINtfr" rel="noreferrer" target="_blank">https://imgur.com/a/4BINtfr</a><br>
>><br>
>> Cheers<br>
>> Richard<br>
>><br>
>><br>
>><br>
>> On 31.08.18 08:13, Raghavendra Gowdappa wrote:<br>
>>><br>
>>><br>
>>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck<br>
>>> <<a href="mailto:hawk@tbi.univie.ac.at" target="_blank">hawk@tbi.univie.ac.at</a>> wrote:<br>
>>><br>
>>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:<br>
>>> > +Mohit. +Milind<br>
>>> > <br>
>>> > @Mohit/Milind,<br>
>>> > <br>
>>> > Can you check logs and see whether you can find anything relevant?<br>
>>><br>
>>> From glances at the system logs nothing out of the ordinary<br>
>>> occurred. However I'll start another rsync and take a closer look.<br>
>>> It will take a few days.<br>
>>><br>
>>> > <br>
>>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck<br>
>>> > <<a href="mailto:hawk@tbi.univie.ac.at" target="_blank">hawk@tbi.univie.ac.at</a>> wrote:<br>
>>> > <br>
>>> > Hi,<br>
>>> > <br>
>>> > I'm attaching a shortened version of the client mount log, since the<br>
>>> > whole log is about 5.8GB. It includes the initial mount messages and the<br>
>>> > last two minutes of log entries.<br>
>>> > <br>
>>> > It ends rather anticlimactically, without an obvious error. Is there<br>
>>> > anything specific I should be looking for?<br>
>>> > <br>
>>> > <br>
>>> > Normally I look at the logs around disconnect messages to find the reason.<br>
>>> > But as you said, sometimes one sees only disconnect messages without<br>
>>> > any reason. That normally points to a cause in the network<br>
>>> > rather than a Glusterfs-initiated disconnect.<br>
>>><br>
>>> The rsync source is serving our homes currently so there are NFS<br>
>>> connections 24/7. There don't seem to be any network-related<br>
>>> interruptions <br>
>>><br>
>>><br>
>>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level<br>
>>> to TRACE and check the logs at both ends of the connection, client and brick?<br>
>>> To reduce the log size, I suggest rotating the existing logs and<br>
>>> starting with fresh logs just before you begin, so that only relevant<br>
>>> logs are captured. Also, can you take an strace of the client and brick<br>
>>> processes using:<br>
>>><br>
>>> strace -o <outputfile> -ff -v -p <pid><br>
>>><br>
>>> Please attach both the logs and the strace output. Let's trace through what<br>
>>> the syscalls on the socket return and then decide whether to inspect tcpdump<br>
>>> or not. If you don't want to repeat the tests, please capture tcpdump too<br>
>>> (on both ends of the connection) and send the captures to us.<br>
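<br>
The requests above can be written down as concrete command lines. This is a minimal sketch that only builds the invocations; it assumes the volume name home from the volume info quoted later in this thread and the usual gluster volume set syntax. The helper names are hypothetical, and the strace flags are copied from the message above.<br>
<pre>
```python
def trace_log_commands(volume, level="TRACE"):
    """Build the `gluster volume set` invocations for both log-level options."""
    options = ("diagnostics.client-log-level", "diagnostics.brick-log-level")
    return [["gluster", "volume", "set", volume, opt, level] for opt in options]


def strace_command(pid, outfile):
    """Build the strace invocation quoted in the message above."""
    return ["strace", "-o", outfile, "-ff", "-v", "-p", str(pid)]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in trace_log_commands("home"):
        print(" ".join(cmd))
```
</pre>
To actually apply the settings, each list can be passed to subprocess.run(cmd, check=True) on a node with the gluster CLI. Passing level="INFO" afterwards returns the log levels to a quieter setting, assuming INFO was the level in use before.<br>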
>>><br>
>>><br>
>>> - a co-worker would be here faster than I could check<br>
>>> the logs if the connection to home would be broken ;-)<br>
>>> Because of this problem the three gluster machines are reduced to<br>
>>> testing only, so nothing else is running on them.<br>
>>><br>
>>><br>
>>> > <br>
>>> > Cheers<br>
>>> > Richard<br>
>>> > <br>
>>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:<br>
>>> > > Normally client logs will give a clue on why the disconnections are<br>
>>> > > happening (ping-timeout, wrong port etc). Can you look into client<br>
>>> > > logs to figure out what's happening? If you can't find anything, can<br>
>>> > > you send across client logs?<br>
>>> > > <br>
>>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck<br>
>>> > > <<a href="mailto:hawk@tbi.univie.ac.at" target="_blank">hawk@tbi.univie.ac.at</a>> wrote:<br>
>>> > ><br>
>>> > > Hi Gluster Community,<br>
>>> > ><br>
>>> > > I have problems with a glusterfs 'Transport endpoint not connected'<br>
>>> > > connection abort during file transfers that I can replicate (all the<br>
>>> > > time now) but cannot pinpoint why it is happening.<br>
>>> > ><br>
>>> > > The volume is set up in replica 3 mode and accessed with the fuse<br>
>>> > > gluster client. Both client and server are running CentOS and the<br>
>>> > > supplied 3.12.11 version of gluster.<br>
>>> > ><br>
>>> > > The connection abort happens at different times during rsync but<br>
>>> > > occurs every time I try to sync all our files (1.1TB) to the empty<br>
>>> > > volume.<br>
>>> > ><br>
>>> > > I don't find errors in the gluster log files on either the client or<br>
>>> > > the server side. rsync logs the obvious transfer problem. The only log<br>
>>> > > that shows anything related is the server brick log, which states that<br>
>>> > > the connection is shutting down:<br>
>>> > ><br>
>>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-home-server: disconnecting connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0<br>
>>> > > [2018-08-18 22:40:35.502620] W [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}<br>
>>> > > [2018-08-18 22:40:35.502692] W [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}<br>
>>> > > [2018-08-18 22:40:35.502719] W [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}<br>
>>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-home-server: Shutting down connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0<br>
>>> > ><br>
>>> > > Since I've been running another replica 3 setup for oVirt for a long<br>
>>> > > time, and it is completely stable, I first thought I had made a mistake<br>
>>> > > by setting different options. However, even after resetting those<br>
>>> > > options I'm able to reproduce the connection problem.<br>
>>> > ><br>
>>> > > The unoptimized volume setup looks like this:<br>
>>> > ><br>
>>> > > Volume Name: home<br>
>>> > > Type: Replicate<br>
>>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8<br>
>>> > > Status: Started<br>
>>> > > Snapshot Count: 0<br>
>>> > > Number of Bricks: 1 x 3 = 3<br>
>>> > > Transport-type: tcp<br>
>>> > > Bricks:<br>
>>> > > Brick1: sphere-four:/srv/gluster_home/brick<br>
>>> > > Brick2: sphere-five:/srv/gluster_home/brick<br>
>>> > > Brick3: sphere-six:/srv/gluster_home/brick<br>
>>> > > Options Reconfigured:<br>
>>> > > nfs.disable: on<br>
>>> > > transport.address-family: inet<br>
>>> > > cluster.quorum-type: auto<br>
>>> > > cluster.server-quorum-type: server<br>
>>> > > cluster.server-quorum-ratio: 50%<br>
>>> > ><br>
>>> > ><br>
>>> > > The following additional options were used before:<br>
>>> > ><br>
>>> > > performance.cache-size: 5GB<br>
>>> > > client.event-threads: 4<br>
>>> > > server.event-threads: 4<br>
>>> > > cluster.lookup-optimize: on<br>
>>> > > features.cache-invalidation: on<br>
>>> > > performance.stat-prefetch: on<br>
>>> > > performance.cache-invalidation: on<br>
>>> > > network.inode-lru-limit: 50000<br>
>>> > > features.cache-invalidation-timeout: 600<br>
>>> > > performance.md-cache-timeout: 600<br>
>>> > > performance.parallel-readdir: on<br>
>>> > ><br>
>>> > ><br>
>>> > > In this case the gluster servers and the client are all using<br>
>>> > > bonded network devices running in adaptive load balancing mode.<br>
>>> > ><br>
>>> > > I've tried using the debug option for the client mount. But except<br>
>>> > > for a ~0.5TB log file I didn't get information that seems<br>
>>> > > helpful to me.<br>
>>> > ><br>
>>> > > Transferring just a couple of GB works without problems.<br>
>>> > ><br>
>>> > > It may very well be that I'm already blind to the obvious but after<br>
>>> > > many long running tests I can't find the crux in the setup.<br>
>>> > ><br>
>>> > > Does anyone have an idea how to approach this problem in a way<br>
>>> > > that yields some useful information?<br>
>>> > ><br>
>>> > > Any help is highly appreciated!<br>
>>> > > Cheers<br>
>>> > > Richard<br>
>>> > ><br>
>>> > > --<br>
>>> > > /dev/null<br>
>>> > ><br>
>>> > ><br>
>>> > ><br>
>>> > ><br>
>>> > > _______________________________________________<br>
>>> > > Gluster-users mailing list<br>
>>> > > <a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
>>> > > <a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
>>> > ><br>
>>> > ><br>
>>> ><br>
>>> ><br>
>>> > --<br>
>>> > /dev/null<br>
>>> ><br>
>>> ><br>
>>><br>
>>><br>
>>> -- <br>
>>> /dev/null<br>
>>><br>
>>><br>
>><br>
>><br>
>><br>
>><br>
> <br>
> <br>
> <br>
> <br>
> <br>
<br>
_______________________________________________<br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org" target="_blank">Gluster-users@gluster.org</a><br>
<a href="https://lists.gluster.org/mailman/listinfo/gluster-users" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-users</a></blockquote></div>