[Gluster-users] gluster connection interrupted during transfer
Vijay Bellur
vbellur at redhat.com
Mon Sep 24 22:39:40 UTC 2018
Hello Richard,
Thank you for the logs.
I am wondering if this could be a different memory leak than the one
addressed in the bug. Would it be possible for you to obtain a statedump of
the client so that we can understand the memory allocation pattern better?
Details about gathering a statedump can be found at [1]. Please ensure that
/var/run/gluster is present before triggering a statedump.
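(Roughly, and assuming the default dump location, triggering a statedump of the
fuse client looks like this:)

# on the client: make sure the dump directory exists
mkdir -p /var/run/gluster
# find the PID of the glusterfs fuse client process for the affected mount
ps aux | grep '[g]lusterfs'
# sending SIGUSR1 makes the process write a statedump
kill -USR1 <pid-of-glusterfs-client>
# the dump appears as /var/run/gluster/glusterdump.<pid>.dump.<timestamp>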
Regards,
Vijay
[1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck <hawk at tbi.univie.ac.at>
wrote:
> Hi again,
>
> in my limited understanding - I'm not a full-time programmer - it looks
> like a memory leak in the gluster fuse client.
>
> Should I reopen the mentioned bug report or open a new one? Or would the
> community prefer an entirely different approach?
>
> Thanks
> Richard
>
> On 13.09.18 10:07, Richard Neuboeck wrote:
> > Hi,
> >
> > I've created excerpts from the brick and client logs +/- 1 minute around
> > the kill event. Still, the logs are ~400-500MB, so I'll put them
> > somewhere to download, since I have no idea what I should be looking
> > for and skimming them didn't reveal obvious problems to me.
> >
> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
> >
> > I was pointed in the direction of the following bug report:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
> > It sounds like the same issue but seems to have been addressed already.
> >
> > If there is anything I can do to help solve this problem please let
> > me know. Thanks for your help!
> >
> > Cheers
> > Richard
> >
> >
> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
> >> Hi,
> >>
> >> since I feared that the logs would fill up the partition (again), I
> >> checked the systems daily and finally found the reason. The glusterfs
> >> process on the client runs out of memory and gets killed by the OOM
> >> killer after about four days. Since rsync runs for a couple more days
> >> until it finishes, I never checked that whole time frame in the system
> >> logs and never stumbled upon the OOM message.
> >>
> >> Running out of memory on a 128GB RAM system even with a DB occupying
> >> ~40% of that is kind of strange though. Might there be a leak?
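> >>
> >> (A simple way to document that growth - just a sketch, the log path and
> >> interval are arbitrary - would be to sample the resident set size of the
> >> fuse client process periodically:)
> >>
> >> # append a timestamp and the client's RSS (in kB) every 5 minutes
> >> while true; do
> >>     echo "$(date +%s) $(ps -o rss= -p <pid-of-glusterfs-client>)" >> /tmp/glusterfs-mem.log
> >>     sleep 300
> >> done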
> >>
> >> But this would explain the erratic behavior I've experienced over the
> >> last 1.5 years while trying to work with our homes on glusterfs.
> >>
> >> Here is the kernel log message for the killed glusterfs process.
> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
> >>
> >> I'm checking the brick and client trace logs. But those are 1TB and 2TB
> >> in size respectively, so searching in them takes a while. I'll be creating
> >> gists for both logs around the time the process died.
> >>
> >> As soon as I have more details I'll post them.
> >>
> >> Here you can see a graphical representation of the memory usage of this
> >> system: https://imgur.com/a/4BINtfr
> >>
> >> Cheers
> >> Richard
> >>
> >>
> >>
> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
> >>>
> >>>
> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
> >>> <hawk at tbi.univie.ac.at> wrote:
> >>>
> >>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
> >>> > +Mohit. +Milind
> >>> >
> >>> > @Mohit/Milind,
> >>> >
> >>> > Can you check logs and see whether you can find anything
> >>> > relevant?
> >>>
> >>> From glancing at the system logs, nothing out of the ordinary
> >>> occurred. However, I'll start another rsync and take a closer look.
> >>> It will take a few days.
> >>>
> >>> >
> >>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
> >>> > <hawk at tbi.univie.ac.at> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > I'm attaching a shortened version, since the whole client mount log
> >>> > is about 5.8GB. It includes the initial mount messages and the
> >>> > last two minutes of log entries.
> >>> >
> >>> > It ends very anticlimactically, without an obvious error. Is there
> >>> > anything specific I should be looking for?
> >>> >
> >>> >
> >>> > Normally I look at the logs around the disconnect messages to find
> >>> > out the reason. But as you said, sometimes one sees just the
> >>> > disconnect messages without any reason given. That normally points
> >>> > to the cause of the disconnect lying in the network rather than a
> >>> > GlusterFS-initiated disconnect.
> >>>
> >>> The rsync source is currently serving our homes so there are NFS
> >>> connections 24/7. There don't seem to be any network-related
> >>> interruptions
> >>>
> >>>
> >>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
> >>> to TRACE and check the logs on both ends of the connection - client and
> >>> brick? To reduce the log size, I would suggest rotating the existing
> >>> logs and starting with fresh logs right before the test so that only
> >>> relevant messages are captured. Also, can you take an strace of the
> >>> client and brick processes using:
> >>>
> >>> strace -o <outputfile> -ff -v -p <pid>
> >>>
> >>> Attach both the logs and the strace output. Let's trace through what
> >>> the syscalls on the socket return and then decide whether to inspect a
> >>> tcpdump or not. If you don't want to repeat the test again, please
> >>> capture a tcpdump too (on both ends of the connection) and send it to us.
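> >>>
> >>> (For the volume in this thread, which is named 'home', raising and
> >>> later reverting the log levels should look roughly like this:)
> >>>
> >>> # raise verbosity on both ends for the duration of the test
> >>> gluster volume set home diagnostics.client-log-level TRACE
> >>> gluster volume set home diagnostics.brick-log-level TRACE
> >>> # revert to the defaults once the capture is done
> >>> gluster volume reset home diagnostics.client-log-level
> >>> gluster volume reset home diagnostics.brick-log-level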
> >>>
> >>>
> >>> - a co-worker would be at my desk faster than I could check
> >>> the logs if the connection to the homes were broken ;-)
> >>> Due to this problem, the three gluster machines are reduced to
> >>> testing only, so there is nothing else running on them.
> >>>
> >>>
> >>> >
> >>> > Cheers
> >>> > Richard
> >>> >
> >>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
> >>> > > Normally the client logs will give a clue as to why the
> >>> > > disconnections are happening (ping-timeout, wrong port, etc.).
> >>> > > Can you look into the client logs to figure out what's happening?
> >>> > > If you can't find anything, can you send across the client logs?
> >>> > >
> >>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
> >>> > > <hawk at tbi.univie.ac.at> wrote:
> >>> > >
> >>> > > Hi Gluster Community,
> >>> > >
> >>> > > I have a problem with a glusterfs 'Transport endpoint not
> >>> > > connected' connection abort during file transfers that I can
> >>> > > replicate (every time now) but cannot pinpoint as to why it is
> >>> > > happening.
> >>> > >
> >>> > > The volume is set up in replica 3 mode and accessed with the fuse
> >>> > > gluster client. Both client and server are running CentOS and the
> >>> > > supplied 3.12.11 version of gluster.
> >>> > >
> >>> > > The connection abort happens at different times during rsync but
> >>> > > occurs every time I try to sync all our files (1.1TB) to the empty
> >>> > > volume.
> >>> > >
> >>> > > On neither the client nor the server side do I find errors in the
> >>> > > gluster log files. rsync logs the obvious transfer problem. The
> >>> > > only log that shows anything related is the server brick log,
> >>> > > which states that the connection is shutting down:
> >>> > >
> >>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
> >>> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
> >>> > > connection from
> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> >>> > > [2018-08-18 22:40:35.502620] W
> >>> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
> >>> > > on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
> >>> > > [2018-08-18 22:40:35.502692] W
> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> >>> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
> >>> > > [2018-08-18 22:40:35.502719] W
> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
> >>> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
> >>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
> >>> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
> >>> > > connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
> >>> > >
> >>> > > Since I've been running another replica 3 setup for oVirt for a
> >>> > > long time now, which is completely stable, I thought at first that
> >>> > > I had made a mistake by setting different options. However, even
> >>> > > when I reset those options I'm able to reproduce the connection
> >>> > > problem.
> >>> > >
> >>> > > The unoptimized volume setup looks like this:
> >>> > >
> >>> > > Volume Name: home
> >>> > > Type: Replicate
> >>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
> >>> > > Status: Started
> >>> > > Snapshot Count: 0
> >>> > > Number of Bricks: 1 x 3 = 3
> >>> > > Transport-type: tcp
> >>> > > Bricks:
> >>> > > Brick1: sphere-four:/srv/gluster_home/brick
> >>> > > Brick2: sphere-five:/srv/gluster_home/brick
> >>> > > Brick3: sphere-six:/srv/gluster_home/brick
> >>> > > Options Reconfigured:
> >>> > > nfs.disable: on
> >>> > > transport.address-family: inet
> >>> > > cluster.quorum-type: auto
> >>> > > cluster.server-quorum-type: server
> >>> > > cluster.server-quorum-ratio: 50%
> >>> > >
> >>> > >
> >>> > > The following additional options were used before:
> >>> > >
> >>> > > performance.cache-size: 5GB
> >>> > > client.event-threads: 4
> >>> > > server.event-threads: 4
> >>> > > cluster.lookup-optimize: on
> >>> > > features.cache-invalidation: on
> >>> > > performance.stat-prefetch: on
> >>> > > performance.cache-invalidation: on
> >>> > > network.inode-lru-limit: 50000
> >>> > > features.cache-invalidation-timeout: 600
> >>> > > performance.md-cache-timeout: 600
> >>> > > performance.parallel-readdir: on
> >>> > >
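> >>> > > (For reference, these options were applied and can be removed again
> >>> > > with the standard gluster CLI - shown here only as a sketch, using
> >>> > > one of the options above:)
> >>> > >
> >>> > > gluster volume set home performance.parallel-readdir on
> >>> > > gluster volume reset home performance.parallel-readdir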
> >>> > >
> >>> > > In this case the gluster servers and also the client are using a
> >>> > > bonded network device running in adaptive load balancing mode.
> >>> > >
> >>> > > I've tried using the debug option for the client mount. But apart
> >>> > > from a ~0.5TB log file I didn't get any information that seems
> >>> > > helpful to me.
> >>> > >
> >>> > > Transferring just a couple of GB works without problems.
> >>> > >
> >>> > > It may very well be that I'm already blind to the obvious, but
> >>> > > after many long-running tests I can't find the crux in the setup.
> >>> > >
> >>> > > Does anyone have an idea how to approach this problem in a way
> >>> > > that sheds some useful information?
> >>> > >
> >>> > > Any help is highly appreciated!
> >>> > > Cheers
> >>> > > Richard
> >>> > >
> >>> > > --
> >>> > > /dev/null
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > _______________________________________________
> >>> > > Gluster-users mailing list
> >>> > > Gluster-users at gluster.org
> >>> > > https://lists.gluster.org/mailman/listinfo/gluster-users
> >>> > >
> >>> > >
> >>> >
> >>> >
> >>> > --
> >>> > /dev/null
> >>> >
> >>> >
> >>>
> >>>
> >>> --
> >>> /dev/null
> >>>
> >>>
> >>
> >>
> >>
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users
> >>
> >
> >
> >
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-users
> >
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users