[Gluster-users] gluster connection interrupted during transfer
Richard Neuboeck
hawk at tbi.univie.ac.at
Fri Sep 21 07:14:11 UTC 2018
Hi again,
in my limited (non-full-time-programmer) understanding this is a memory
leak in the gluster fuse client.
Should I reopen the bug report mentioned below or open a new one? Or would
the community prefer an entirely different approach?
Thanks
Richard
On 13.09.18 10:07, Richard Neuboeck wrote:
> Hi,
>
> I've created excerpts from the brick and client logs +/- 1 minute around
> the kill event. Still, the logs are ~400-500MB, so I'll put them
> somewhere to download, since I have no idea what I should be looking
> for and skimming them didn't reveal obvious problems to me.
>
> http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
> http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
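>
> For anyone who wants to cut a different window: the excerpts were made
> roughly like this (a sketch; the two timestamps are placeholders for +/- 1
> minute around the kill, and the log file name follows the mount point):
>
> awk '/^\[2018-09-12 04:29/,/^\[2018-09-12 04:32/' mnt-home.log > mnt_3min_excerpt.log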
>
> I was pointed in the direction of the following bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1613512
> It sounds right but seems to have been addressed already.
>
> If there is anything I can do to help solve this problem please let
> me know. Thanks for your help!
>
> Cheers
> Richard
>
>
> On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> Hi,
>>
>> since I feared that the logs would fill up the partition (again) I
>> checked the systems daily and finally found the reason. The glusterfs
>> process on the client runs out of memory and gets killed by OOM after
>> about four days. Since rsync keeps running for a couple of days after
>> that, I never checked the whole time frame in the system logs and never
>> stumbled upon the OOM message.
>>
>> Running out of memory on a 128GB RAM system even with a DB occupying
>> ~40% of that is kind of strange though. Might there be a leak?
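>>
>> Something like this would have caught the growth much earlier (a sketch;
>> it assumes the fuse mount is the only glusterfs process on the box):
>>
>> # log the client's resident set size once a minute
>> while sleep 60; do
>>     echo "$(date +%F_%T) $(ps -o rss= -p $(pgrep -x glusterfs))" >> /var/tmp/glusterfs-rss.log
>> done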
>>
>> But this would explain the erratic behavior I've experienced over the
>> last 1.5 years while trying to work with our homes on glusterfs.
>>
>> Here is the kernel log message for the killed glusterfs process.
>> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>>
>> I'm checking the brick and client trace logs. But those are 1TB and 2TB
>> in size, respectively, so searching them takes a while. I'll be creating
>> gists for both logs around the time the process died.
>>
>> As soon as I have more details I'll post them.
>>
>> Here you can see a graphical representation of the memory usage of this
>> system: https://imgur.com/a/4BINtfr
>>
>> Cheers
>> Richard
>>
>>
>>
>> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>>>
>>>
>>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>>> <hawk at tbi.univie.ac.at> wrote:
>>>
>>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>>> > +Mohit. +Milind
>>> >
>>> > @Mohit/Milind,
>>> >
>>> > Can you check logs and see whether you can find anything relevant?
>>>
>>> From glancing at the system logs, nothing out of the ordinary
>>> occurred. However, I'll start another rsync and take a closer look.
>>> It will take a few days.
>>>
>>> >
>>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>>> > <hawk at tbi.univie.ac.at> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'm attaching a shortened version of the client mount log since the
>>> > whole file is about 5.8GB. It includes the initial mount messages and
>>> > the last two minutes of log entries.
>>> >
>>> > It ends very anticlimactically, without an obvious error. Is there
>>> > anything specific I should be looking for?
>>> >
>>> >
>>> > Normally I look at the logs around disconnect msgs to find out the
>>> > reason. But as you said, sometimes one can see just disconnect msgs
>>> > without any reason given. That normally points to the reason for the
>>> > disconnect being in the network rather than a Glusterfs-initiated
>>> > disconnect.
>>>
>>> The rsync source is serving our homes currently so there are NFS
>>> connections 24/7. There don't seem to be any network-related
>>> interruptions
>>>
>>>
>>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
>>> to TRACE and check the logs on both ends of the connection - client and
>>> brick? To reduce the log size, I would suggest logrotating the existing
>>> logs and starting with fresh logs right before the test so that only
>>> relevant logs are captured. Also, can you take an strace of the client
>>> and brick processes using:
>>>
>>> strace -o <outputfile> -ff -v -p <pid>
>>>
>>> Attach both logs and straces. Let's trace through what the syscalls on
>>> the socket return and then decide whether to inspect a tcpdump or not.
>>> If you don't want to repeat the tests, please capture a tcpdump too (on
>>> both ends of the connection) and send them to us.
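>>>
>>> For example (a sketch: the volume name is taken from the volume info
>>> further down; the interface and ports are assumptions, gluster volume
>>> status shows the actual brick ports):
>>>
>>> gluster volume set home diagnostics.client-log-level TRACE
>>> gluster volume set home diagnostics.brick-log-level TRACE
>>> # on both ends; 24007 is the management port, bricks typically use 49152+
>>> tcpdump -i bond0 -w /var/tmp/gluster.pcap port 24007 or portrange 49152-49251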
>>>
>>>
>>> - a co-worker would be here faster than I could check
>>> the logs if the connection to home were broken ;-)
>>> Due to this problem the three gluster machines are reduced to testing
>>> only, so there is nothing else running.
>>>
>>>
>>> >
>>> > Cheers
>>> > Richard
>>> >
>>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>>> > > Normally the client logs will give a clue about why the disconnections
>>> > > are happening (ping-timeout, wrong port, etc.). Can you look into the
>>> > > client logs to figure out what's happening? If you can't find anything,
>>> > > can you send across the client logs?
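>>> > >
>>> > > Something like the following usually surfaces the relevant lines (a
>>> > > sketch; the mount log name is an assumption, it is derived from the
>>> > > mount point):
>>> > >
>>> > > grep -E 'disconnect|ping timer|Transport endpoint' /var/log/glusterfs/mnt-home.log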
>>> > >
>>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>>> > > <hawk at tbi.univie.ac.at> wrote:
>>> > >
>>> > > Hi Gluster Community,
>>> > >
>>> > > I have problems with a glusterfs 'Transport endpoint not connected'
>>> > > connection abort during file transfers that I can replicate (all the
>>> > > time now) but not pinpoint as to why this is happening.
>>> > >
>>> > > The volume is set up in replica 3 mode and accessed with the fuse
>>> > > gluster client. Both client and server are running CentOS and the
>>> > > supplied 3.12.11 version of gluster.
>>> > >
>>> > > The connection abort happens at different times during rsync but
>>> > > occurs every time I try to sync all our files (1.1TB) to the empty
>>> > > volume.
>>> > >
>>> > > On the client and server side I don't find errors in the gluster log
>>> > > files. rsync logs the obvious transfer problem. The only log that
>>> > > shows anything related is the server brick log, which states that the
>>> > > connection is shutting down:
>>> > >
>>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-home-server: disconnecting connection from brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>>> > > [2018-08-18 22:40:35.502620] W [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>>> > > [2018-08-18 22:40:35.502692] W [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>>> > > [2018-08-18 22:40:35.502719] W [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-home-server: Shutting down connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>>> > >
>>> > > Since I've been running another replica 3 setup for oVirt for a long
>>> > > time now, which is completely stable, I thought at first that I had
>>> > > made a mistake setting different options. However, even when I reset
>>> > > those options I'm able to reproduce the connection problem.
>>> > >
>>> > > The unoptimized volume setup looks like this:
>>> > >
>>> > > Volume Name: home
>>> > > Type: Replicate
>>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>>> > > Status: Started
>>> > > Snapshot Count: 0
>>> > > Number of Bricks: 1 x 3 = 3
>>> > > Transport-type: tcp
>>> > > Bricks:
>>> > > Brick1: sphere-four:/srv/gluster_home/brick
>>> > > Brick2: sphere-five:/srv/gluster_home/brick
>>> > > Brick3: sphere-six:/srv/gluster_home/brick
>>> > > Options Reconfigured:
>>> > > nfs.disable: on
>>> > > transport.address-family: inet
>>> > > cluster.quorum-type: auto
>>> > > cluster.server-quorum-type: server
>>> > > cluster.server-quorum-ratio: 50%
>>> > >
>>> > >
>>> > > The following additional options were used before:
>>> > >
>>> > > performance.cache-size: 5GB
>>> > > client.event-threads: 4
>>> > > server.event-threads: 4
>>> > > cluster.lookup-optimize: on
>>> > > features.cache-invalidation: on
>>> > > performance.stat-prefetch: on
>>> > > performance.cache-invalidation: on
>>> > > network.inode-lru-limit: 50000
>>> > > features.cache-invalidation-timeout: 600
>>> > > performance.md-cache-timeout: 600
>>> > > performance.parallel-readdir: on
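>>> > >
>>> > > (For reference, resetting an option goes roughly like this sketch,
>>> > > shown with a subset of the list above:
>>> > >
>>> > > for opt in performance.parallel-readdir performance.cache-size client.event-threads; do
>>> > >     gluster volume reset home $opt
>>> > > done)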
>>> > >
>>> > >
>>> > > In this case the gluster servers and also the client are using a
>>> > > bonded network device running in adaptive load balancing mode.
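>>> > >
>>> > > (Adaptive load balancing is bonding mode 6, balance-alb; a quick check
>>> > > of the bond state, assuming the device is called bond0, could be:
>>> > >
>>> > > grep -E 'Bonding Mode|MII Status|Link Failure' /proc/net/bonding/bond0)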
>>> > >
>>> > > I've tried using the debug option for the client mount. But except
>>> > > for a ~0.5TB log file I didn't get information that seems helpful to me.
>>> > >
>>> > > Transferring just a couple of GB works without problems.
>>> > >
>>> > > It may very well be that I'm already blind to the obvious, but after
>>> > > many long-running tests I can't find the crux in the setup.
>>> > >
>>> > > Does anyone have an idea how to approach this problem in a way that
>>> > > sheds some useful information?
>>> > >
>>> > > Any help is highly appreciated!
>>> > > Cheers
>>> > > Richard
>>> > >
>>> > > --
>>> > > /dev/null
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > _______________________________________________
>>> > > Gluster-users mailing list
>>> > > Gluster-users at gluster.org
>>> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>>> > >
>>> > >
>>> >
>>> >
>>> > --
>>> > /dev/null
>>> >
>>> >
>>>
>>>
>>> --
>>> /dev/null
>>>
>>>
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>