[Gluster-users] gluster connection interrupted during transfer
Richard Neuboeck
hawk at tbi.univie.ac.at
Wed Nov 21 09:22:17 UTC 2018
Hi Vijay,
this is an update on the 8 tests I've run so far. In short, all is well.
I followed your advice and created statedumps every 3 hours. Four tests
ran with the default volume options; the last four ran with all the
performance optimizations I could find to increase small-file performance.
During each run the dump file size grew from ~100KB at the beginning of the
mount to ~1GB, reflecting the memory footprint of the gluster process.
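In case someone wants to reproduce this: a sketch of how such periodic
dumps can be triggered (SIGUSR1 makes a gluster process write a
statedump to /var/run/gluster; the cron file and pgrep pattern below
are just an illustration and need to match the actual mount):

    # /etc/cron.d/gluster-statedump: dump the fuse client state every 3 hours
    0 */3 * * * root kill -USR1 $(pgrep -f 'glusterfs.*home')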
Since every test ran without interruption, the memory leak seems to be
fixed in 3.12.14-1.el7.x86_64 on CentOS 7.
Thanks again for your help.
Cheers
Richard
On 15.10.18 10:48, Richard Neuboeck wrote:
> Hi Vijay,
>
> sorry it took so long. I've upgraded the gluster server and client to
> the latest packages 3.12.14-1.el7.x86_64 available in CentOS.
>
> Incredibly, my first test after the update worked perfectly! I'll do
> another couple of rsyncs, maybe apply the performance improvements again,
> and do statedumps all the way.
>
> I'll report back if there are any more problems or if they are resolved.
>
> Thanks for the help so far!
> Cheers
> Richard
>
>
> On 25.09.18 00:39, Vijay Bellur wrote:
>> Hello Richard,
>>
>> Thank you for the logs.
>>
>> I am wondering if this could be a different memory leak than the one
>> addressed in the bug. Would it be possible for you to obtain a
>> statedump of the client so that we can understand the memory allocation
>> pattern better? Details about gathering a statedump can be found at [1].
>> Please ensure that /var/run/gluster is present before triggering a
>> statedump.
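>>
>> For reference, a minimal sketch of the two usual ways to trigger one
>> (volume name and PID are placeholders; as mentioned, /var/run/gluster
>> must exist first):
>>
>>     gluster volume statedump <volname>      # brick processes, server side
>>     kill -USR1 <pid-of-glusterfs-client>    # fuse client, client side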
>>
>> Regards,
>> Vijay
>>
>> [1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
>>
>>
>> On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck
>> <hawk at tbi.univie.ac.at> wrote:
>>
>> Hi again,
>>
>> in my limited (non-full-time programmer) understanding it's a memory
>> leak in the gluster fuse client.
>>
>> Should I reopen the mentioned bug report or open a new one? Or would
>> the community prefer an entirely different approach?
>>
>> Thanks
>> Richard
>>
>> On 13.09.18 10:07, Richard Neuboeck wrote:
>> > Hi,
>> >
>> > I've created excerpts from the brick and client logs +/- 1 minute
>> > around the kill event. Still, the logs are ~400-500MB, so I will put
>> > them somewhere to download, since I have no idea what I should be
>> > looking for and skimming them didn't reveal obvious problems to me.
>> >
>> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
>> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
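>> >
>> > (In case it helps others: since every gluster log line starts with a
>> > timestamp, such excerpts can be cut with a sketch like the following;
>> > the times are only an example around the kill event:
>> >
>> > awk '/^\[2018-08-18 22:39/,/^\[2018-08-18 22:42/' brick.log > excerpt.log
>> > )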
>> >
>> > I was pointed in the direction of the following bug report:
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
>> > It sounds right but seems to have been addressed already.
>> >
>> > If there is anything I can do to help solve this problem please let
>> > me know. Thanks for your help!
>> >
>> > Cheers
>> > Richard
>> >
>> >
>> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> >> Hi,
>> >>
>> >> since I feared that the logs would fill up the partition (again) I
>> >> checked the systems daily and finally found the reason. The glusterfs
>> >> process on the client runs out of memory and gets killed by the OOM
>> >> killer after about four days. Since rsync runs for a couple of days
>> >> longer before it ends, I never checked the whole time frame in the
>> >> system logs and never stumbled upon the OOM message.
>> >>
>> >> Running out of memory on a 128GB RAM system even with a DB occupying
>> >> ~40% of that is kind of strange though. Might there be a leak?
>> >>
>> >> But this would explain the erratic behavior I've experienced over the
>> >> last 1.5 years while trying to work with our homes on glusterfs.
>> >>
>> >> Here is the kernel log message for the killed glusterfs process.
>> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
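>> >>
>> >> (On systemd-based systems such kills can be found with a generic
>> >> sketch like:
>> >>
>> >> journalctl -k | grep -i 'killed process'
>> >> )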
>> >>
>> >> I'm checking the brick and client trace logs. But those are
>> >> respectively 1TB and 2TB in size, so searching in them takes a
>> >> while. I'll be creating gists for both logs around the time when
>> >> the process died.
>> >>
>> >> As soon as I have more details I'll post them.
>> >>
>> >> Here you can see a graphical representation of the memory usage
>> >> of this system: https://imgur.com/a/4BINtfr
>> >>
>> >> Cheers
>> >> Richard
>> >>
>> >>
>> >>
>> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>> >>>
>> >>>
>> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> >>> <hawk at tbi.univie.ac.at> wrote:
>> >>>
>> >>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>> >>> > +Mohit. +Milind
>> >>> >
>> >>> > @Mohit/Milind,
>> >>> >
>> >>> > Can you check logs and see whether you can find anything
>> >>> > relevant?
>> >>>
>> >>> From glancing at the system logs, nothing out of the ordinary
>> >>> occurred. However I'll start another rsync and take a closer look.
>> >>> It will take a few days.
>> >>>
>> >>> >
>> >>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>> >>> > <hawk at tbi.univie.ac.at> wrote:
>> >>> >
>> >>> > Hi,
>> >>> >
>> >>> > I'm attaching a shortened version, since the whole client
>> >>> > mount log is about 5.8GB. It includes the initial mount
>> >>> > messages and the last two minutes of log entries.
>> >>> >
>> >>> > It ends very anticlimactically without an obvious error. Is
>> >>> > there anything specific I should be looking for?
>> >>> >
>> >>> >
>> >>> > Normally I look at the logs around disconnect messages to find
>> >>> > out the reason. But as you said, sometimes one can see just
>> >>> > disconnect messages without any reason. That normally points to
>> >>> > a reason for the disconnect in the network rather than a
>> >>> > GlusterFS-initiated disconnect.
>> >>>
>> >>> The rsync source is serving our homes currently, so there are NFS
>> >>> connections 24/7. There don't seem to be any network-related
>> >>> interruptions
>> >>>
>> >>> Can you set diagnostics.client-log-level and
>> >>> diagnostics.brick-log-level to TRACE and check the logs on both
>> >>> ends of the connection - client and brick? To reduce the log size,
>> >>> I would suggest logrotating the existing logs and starting with
>> >>> fresh logs when you are about to begin, so that only relevant
>> >>> logs are captured. Also, can you take an strace of the client and
>> >>> brick processes using:
>> >>>
>> >>> strace -o <outputfile> -ff -v -p <pid>
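>> >>>
>> >>> (where -o with -ff writes one trace file per traced process as
>> >>> <outputfile>.<pid>, -v prints structures unabbreviated, and -p
>> >>> attaches to the given PID)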
>> >>>
>> >>> Attach both logs and the strace output. Let's trace through what
>> >>> the syscalls on the socket return and then decide whether to
>> >>> inspect a tcpdump or not. If you don't want to repeat the tests
>> >>> again, please capture a tcpdump too (on both ends of the
>> >>> connection) and send them to us.
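>> >>>
>> >>> A sketch of the corresponding log-level commands, assuming the
>> >>> volume is named "home" as in the info output further down:
>> >>>
>> >>> gluster volume set home diagnostics.client-log-level TRACE
>> >>> gluster volume set home diagnostics.brick-log-level TRACE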
>> >>>
>> >>>
>> >>> - a co-worker would be here faster than I could check
>> >>> the logs if the connection to home were broken ;-)
>> >>> Due to this problem the three gluster machines are reduced to
>> >>> testing only, so there is nothing else running.
>> >>>
>> >>>
>> >>> >
>> >>> > Cheers
>> >>> > Richard
>> >>> >
>> >>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>> >>> > > Normally client logs will give a clue on why the
>> >>> > > disconnections are happening (ping-timeout, wrong port, etc.).
>> >>> > > Can you look into the client logs to figure out what's
>> >>> > > happening? If you can't find anything, can you send across
>> >>> > > the client logs?
>> >>> > >
>> >>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>> >>> > > <hawk at tbi.univie.ac.at> wrote:
>> >>> > >
>> >>> > > Hi Gluster Community,
>> >>> > >
>> >>> > > I have problems with a glusterfs 'Transport endpoint not
>> >>> > > connected' connection abort during file transfers that I can
>> >>> > > replicate (all the time now) but cannot pinpoint why it is
>> >>> > > happening.
>> >>> > >
>> >>> > > The volume is set up in replica 3 mode and accessed with the
>> >>> > > fuse gluster client. Both client and server are running CentOS
>> >>> > > and the supplied 3.12.11 version of gluster.
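>> >>> > >
>> >>> > > (For context, a typical fuse mount of such a volume; the
>> >>> > > server name is taken from the volume info below, the mount
>> >>> > > point is a placeholder:
>> >>> > >
>> >>> > > mount -t glusterfs sphere-four:/home /mnt/home
>> >>> > > )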
>> >>> > >
>> >>> > > The connection abort happens at different times during rsync
>> >>> > > but occurs every time I try to sync all our files (1.1TB) to
>> >>> > > the empty volume.
>> >>> > >
>> >>> > > On both the client and server side I don't find errors in the
>> >>> > > gluster log files. rsync logs the obvious transfer problem.
>> >>> > > The only log that shows anything related is the server brick
>> >>> > > log, which states that the connection is shutting down:
>> >>> > >
>> >>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
>> >>> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
>> >>> > > connection from
>> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > > [2018-08-18 22:40:35.502620] W
>> >>> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>> >>> > > [2018-08-18 22:40:35.502692] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.502719] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
>> >>> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
>> >>> > > connection
>> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > >
>> >>> > > Since I've been running another replica 3 setup for oVirt for
>> >>> > > a long time now, which is completely stable, I thought at
>> >>> > > first that I had made a mistake setting different options.
>> >>> > > However, even when I reset those options I'm able to reproduce
>> >>> > > the connection problem (a sketch of the reset command follows
>> >>> > > the option list below).
>> >>> > >
>> >>> > > The unoptimized volume setup looks like this:
>> >>> > >
>> >>> > > Volume Name: home
>> >>> > > Type: Replicate
>> >>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>> >>> > > Status: Started
>> >>> > > Snapshot Count: 0
>> >>> > > Number of Bricks: 1 x 3 = 3
>> >>> > > Transport-type: tcp
>> >>> > > Bricks:
>> >>> > > Brick1: sphere-four:/srv/gluster_home/brick
>> >>> > > Brick2: sphere-five:/srv/gluster_home/brick
>> >>> > > Brick3: sphere-six:/srv/gluster_home/brick
>> >>> > > Options Reconfigured:
>> >>> > > nfs.disable: on
>> >>> > > transport.address-family: inet
>> >>> > > cluster.quorum-type: auto
>> >>> > > cluster.server-quorum-type: server
>> >>> > > cluster.server-quorum-ratio: 50%
>> >>> > >
>> >>> > >
>> >>> > > The following additional options were used before:
>> >>> > >
>> >>> > > performance.cache-size: 5GB
>> >>> > > client.event-threads: 4
>> >>> > > server.event-threads: 4
>> >>> > > cluster.lookup-optimize: on
>> >>> > > features.cache-invalidation: on
>> >>> > > performance.stat-prefetch: on
>> >>> > > performance.cache-invalidation: on
>> >>> > > network.inode-lru-limit: 50000
>> >>> > > features.cache-invalidation-timeout: 600
>> >>> > > performance.md-cache-timeout: 600
>> >>> > > performance.parallel-readdir: on
>> >>> > >
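>> >>> > > (As mentioned above, options like these can be reverted one
>> >>> > > by one with the stock CLI; a sketch using one of them as an
>> >>> > > example:
>> >>> > >
>> >>> > > gluster volume reset home performance.parallel-readdir
>> >>> > > )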
>> >>> > >
>> >>> > > In this case the gluster servers and also the client are
>> >>> > > using a bonded network device running in adaptive load
>> >>> > > balancing mode.
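>> >>> > >
>> >>> > > (Illustrative only: adaptive load balancing is bonding mode
>> >>> > > balance-alb; with iproute2 such a device can be created with
>> >>> > > "ip link add bond0 type bond mode balance-alb".)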
>> >>> > >
>> >>> > > I've tried using the debug option for the client mount. But
>> >>> > > except for a ~0.5TB log file I didn't get information that
>> >>> > > seems helpful to me.
>> >>> > >
>> >>> > > Transferring just a couple of GB works without problems.
>> >>> > >
>> >>> > > It may very well be that I'm already blind to the obvious,
>> >>> > > but after many long-running tests I can't find the crux in
>> >>> > > the setup.
>> >>> > >
>> >>> > > Does anyone have an idea how to approach this problem in a
>> >>> > > way that sheds some useful information?
>> >>> > >
>> >>> > > Any help is highly appreciated!
>> >>> > > Cheers
>> >>> > > Richard
>> >>> > >
>> >>> > > --
>> >>> > > /dev/null
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > _______________________________________________
>> >>> > > Gluster-users mailing list
>> >>> > > Gluster-users at gluster.org
>> >>> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>> >>> > >
>> >>> > >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > /dev/null
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> /dev/null
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> Gluster-users at gluster.org
>> >> https://lists.gluster.org/mailman/listinfo/gluster-users
>> >>
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>> >
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>