[Gluster-users] gluster connection interrupted during transfer
Richard Neuboeck
hawk at tbi.univie.ac.at
Wed Nov 21 09:22:17 UTC 2018
Hi Vijay,
this is an update on the 8 tests I've run so far. In short, all is well.
I followed your advice and created statedumps every 3 hours. Four tests
ran with the default volume options; the last four ran with all the
performance optimizations I could find to increase small-file performance.
During each run the dump file size grew from ~100KB at the beginning of the
mount to ~1GB, reflecting the memory footprint of the gluster process.
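In case someone wants to reproduce this: a sketch of how such periodic
dumps can be triggered (SIGUSR1 makes a gluster process write a
statedump to /var/run/gluster; the cron file and pgrep pattern below
are just an illustration and need to match the actual mount):

    # /etc/cron.d/gluster-statedump: dump the fuse client state every 3 hours
    0 */3 * * * root kill -USR1 $(pgrep -f 'glusterfs.*home')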
Since every test ran without interruption, the memory leak seems to be
fixed in 3.12.14-1.el7.x86_64 on CentOS 7.
Thanks again for your help.
Cheers
Richard
On 15.10.18 10:48, Richard Neuboeck wrote:
> Hi Vijay,
>
> sorry it took so long. I've upgraded the gluster server and client to
> the latest packages 3.12.14-1.el7.x86_64 available in CentOS.
>
> Incredibly, my first test after the update worked perfectly! I'll do
> another couple of rsyncs, maybe apply the performance improvements again,
> and do statedumps all the way.
>
> I'll report back if there are any more problems or if they are resolved.
>
> Thanks for the help so far!
> Cheers
> Richard
>
>
> On 25.09.18 00:39, Vijay Bellur wrote:
>> Hello Richard,
>>
>> Thank you for the logs.
>>
>> I am wondering if this could be a different memory leak than the one
>> addressed in the bug. Would it be possible for you to obtain a
>> statedump of the client so that we can understand the memory allocation
>> pattern better? Details about gathering a statedump can be found at [1].
>> Please ensure that /var/run/gluster is present before triggering a
>> statedump.
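>>
>> For reference, a minimal sketch of the two usual ways to trigger one
>> (volume name and PID are placeholders; as mentioned, /var/run/gluster
>> must exist first):
>>
>>     gluster volume statedump <volname>      # brick processes, server side
>>     kill -USR1 <pid-of-glusterfs-client>    # fuse client, client side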
>>
>> Regards,
>> Vijay
>>
>> [1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
>>
>>
>> On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck
>> <hawk at tbi.univie.ac.at> wrote:
>>
>> Hi again,
>>
>> in my limited (non-full-time programmer) understanding it's a memory
>> leak in the gluster fuse client.
>>
>> Should I reopen the mentioned bug report or open a new one? Or would
>> the community prefer an entirely different approach?
>>
>> Thanks
>> Richard
>>
>> On 13.09.18 10:07, Richard Neuboeck wrote:
>> > Hi,
>> >
>> > I've created excerpts from the brick and client logs +/- 1 minute
>> > around the kill event. Still, the logs are ~400-500MB, so I will put
>> > them somewhere to download, since I have no idea what I should be
>> > looking for and skimming them didn't reveal obvious problems to me.
>> >
>> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
>> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
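>> >
>> > (In case it helps others: since every gluster log line starts with a
>> > timestamp, such excerpts can be cut with a sketch like the following;
>> > the times are only an example around the kill event:
>> >
>> > awk '/^\[2018-08-18 22:39/,/^\[2018-08-18 22:42/' brick.log > excerpt.log
>> > )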
>> >
>> > I was pointed in the direction of the following bug report:
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
>> > It sounds right but seems to have been addressed already.
>> >
>> > If there is anything I can do to help solve this problem please let
>> > me know. Thanks for your help!
>> >
>> > Cheers
>> > Richard
>> >
>> >
>> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> >> Hi,
>> >>
>> >> since I feared that the logs would fill up the partition (again) I
>> >> checked the systems daily and finally found the reason. The glusterfs
>> >> process on the client runs out of memory and gets killed by the OOM
>> >> killer after about four days. Since rsync runs for a couple of days
>> >> longer before it ends, I never checked the whole time frame in the
>> >> system logs and never stumbled upon the OOM message.
>> >>
>> >> Running out of memory on a 128GB RAM system even with a DB occupying
>> >> ~40% of that is kind of strange though. Might there be a leak?
>> >>
>> >> But this would explain the erratic behavior I've experienced over the
>> >> last 1.5 years while trying to work with our homes on glusterfs.
>> >>
>> >> Here is the kernel log message for the killed glusterfs process.
>> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
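>> >>
>> >> (On systemd-based systems such kills can be found with a generic
>> >> sketch like:
>> >>
>> >> journalctl -k | grep -i 'killed process'
>> >> )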
>> >>
>> >> I'm checking the brick and client trace logs. But those are
>> >> respectively 1TB and 2TB in size, so searching in them takes a
>> >> while. I'll be creating gists for both logs around the time when
>> >> the process died.
>> >>
>> >> As soon as I have more details I'll post them.
>> >>
>> >> Here you can see a graphical representation of the memory usage
>> >> of this system: https://imgur.com/a/4BINtfr
>> >>
>> >> Cheers
>> >> Richard
>> >>
>> >>
>> >>
>> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>> >>>
>> >>>
>> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> >>> <hawk at tbi.univie.ac.at> wrote:
>> >>>
>> >>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>> >>> > +Mohit. +Milind
>> >>> >
>> >>> > @Mohit/Milind,
>> >>> >
>> >>> > Can you check logs and see whether you can find anything
>> >>> > relevant?
>> >>>
>> >>> From glancing at the system logs, nothing out of the ordinary
>> >>> occurred. However I'll start another rsync and take a closer look.
>> >>> It will take a few days.
>> >>>
>> >>> >
>> >>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>> >>> > <hawk at tbi.univie.ac.at> wrote:
>> >>> >
>> >>> > Hi,
>> >>> >
>> >>> > I'm attaching a shortened version, since the whole client
>> >>> > mount log is about 5.8GB. It includes the initial mount
>> >>> > messages and the last two minutes of log entries.
>> >>> >
>> >>> > It ends very anticlimactically without an obvious error. Is
>> >>> > there anything specific I should be looking for?
>> >>> >
>> >>> >
>> >>> > Normally I look at the logs around disconnect messages to find
>> >>> > out the reason. But as you said, sometimes one can see just
>> >>> > disconnect messages without any reason. That normally points to
>> >>> > a reason for the disconnect in the network rather than a
>> >>> > GlusterFS-initiated disconnect.
>> >>>
>> >>> The rsync source is serving our homes currently, so there are NFS
>> >>> connections 24/7. There don't seem to be any network-related
>> >>> interruptions
>> >>>
>> >>> Can you set diagnostics.client-log-level and
>> >>> diagnostics.brick-log-level to TRACE and check the logs on both
>> >>> ends of the connection - client and brick? To reduce the log size,
>> >>> I would suggest logrotating the existing logs and starting with
>> >>> fresh logs when you are about to begin, so that only relevant
>> >>> logs are captured. Also, can you take an strace of the client and
>> >>> brick processes using:
>> >>>
>> >>> strace -o <outputfile> -ff -v -p <pid>
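>> >>>
>> >>> (where -o with -ff writes one trace file per traced process as
>> >>> <outputfile>.<pid>, -v prints structures unabbreviated, and -p
>> >>> attaches to the given PID)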
>> >>>
>> >>> Attach both logs and the strace output. Let's trace through what
>> >>> the syscalls on the socket return and then decide whether to
>> >>> inspect a tcpdump or not. If you don't want to repeat the tests
>> >>> again, please capture a tcpdump too (on both ends of the
>> >>> connection) and send them to us.
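>> >>>
>> >>> A sketch of the corresponding log-level commands, assuming the
>> >>> volume is named "home" as in the info output further down:
>> >>>
>> >>> gluster volume set home diagnostics.client-log-level TRACE
>> >>> gluster volume set home diagnostics.brick-log-level TRACE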
>> >>>
>> >>>
>> >>> - a co-worker would be here faster than I could check
>> >>> the logs if the connection to home were broken ;-)
>> >>> Due to this problem the three gluster machines are reduced to
>> >>> testing only, so there is nothing else running.
>> >>>
>> >>>
>> >>> >
>> >>> > Cheers
>> >>> > Richard
>> >>> >
>> >>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>> >>> > > Normally client logs will give a clue on why the
>> >>> > > disconnections are happening (ping-timeout, wrong port, etc.).
>> >>> > > Can you look into the client logs to figure out what's
>> >>> > > happening? If you can't find anything, can you send across
>> >>> > > the client logs?
>> >>> > >
>> >>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>> >>> > > <hawk at tbi.univie.ac.at> wrote:
>> >>> > >
>> >>> > > Hi Gluster Community,
>> >>> > >
>> >>> > > I have problems with a glusterfs 'Transport endpoint not
>> >>> > > connected' connection abort during file transfers that I can
>> >>> > > replicate (all the time now) but cannot pinpoint why it is
>> >>> > > happening.
>> >>> > >
>> >>> > > The volume is set up in replica 3 mode and accessed with the
>> >>> > > fuse gluster client. Both client and server are running CentOS
>> >>> > > and the supplied 3.12.11 version of gluster.
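>> >>> > >
>> >>> > > (For context, a typical fuse mount of such a volume; the
>> >>> > > server name is taken from the volume info below, the mount
>> >>> > > point is a placeholder:
>> >>> > >
>> >>> > > mount -t glusterfs sphere-four:/home /mnt/home
>> >>> > > )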
>> >>> > >
>> >>> > > The connection abort happens at different times during rsync
>> >>> > > but occurs every time I try to sync all our files (1.1TB) to
>> >>> > > the empty volume.
>> >>> > >
>> >>> > > On both the client and server side I don't find errors in the
>> >>> > > gluster log files. rsync logs the obvious transfer problem.
>> >>> > > The only log that shows anything related is the server brick
>> >>> > > log, which states that the connection is shutting down:
>> >>> > >
>> >>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
>> >>> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
>> >>> > > connection from
>> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > > [2018-08-18 22:40:35.502620] W
>> >>> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>> >>> > > [2018-08-18 22:40:35.502692] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.502719] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> >>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
>> >>> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
>> >>> > > connection
>> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > >
>> >>> > > Since I've been running another replica 3 setup for oVirt for
>> >>> > > a long time now, which is completely stable, I thought at
>> >>> > > first that I had made a mistake setting different options.
>> >>> > > However, even when I reset those options I'm able to reproduce
>> >>> > > the connection problem (a sketch of the reset command follows
>> >>> > > the option list below).
>> >>> > >
>> >>> > > The unoptimized volume setup looks like this:
>> >>> > >
>> >>> > > Volume Name: home
>> >>> > > Type: Replicate
>> >>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>> >>> > > Status: Started
>> >>> > > Snapshot Count: 0
>> >>> > > Number of Bricks: 1 x 3 = 3
>> >>> > > Transport-type: tcp
>> >>> > > Bricks:
>> >>> > > Brick1: sphere-four:/srv/gluster_home/brick
>> >>> > > Brick2: sphere-five:/srv/gluster_home/brick
>> >>> > > Brick3: sphere-six:/srv/gluster_home/brick
>> >>> > > Options Reconfigured:
>> >>> > > nfs.disable: on
>> >>> > > transport.address-family: inet
>> >>> > > cluster.quorum-type: auto
>> >>> > > cluster.server-quorum-type: server
>> >>> > > cluster.server-quorum-ratio: 50%
>> >>> > >
>> >>> > >
>> >>> > > The following additional options were used before:
>> >>> > >
>> >>> > > performance.cache-size: 5GB
>> >>> > > client.event-threads: 4
>> >>> > > server.event-threads: 4
>> >>> > > cluster.lookup-optimize: on
>> >>> > > features.cache-invalidation: on
>> >>> > > performance.stat-prefetch: on
>> >>> > > performance.cache-invalidation: on
>> >>> > > network.inode-lru-limit: 50000
>> >>> > > features.cache-invalidation-timeout: 600
>> >>> > > performance.md-cache-timeout: 600
>> >>> > > performance.parallel-readdir: on
>> >>> > >
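>> >>> > > (As mentioned above, options like these can be reverted one
>> >>> > > by one with the stock CLI; a sketch using one of them as an
>> >>> > > example:
>> >>> > >
>> >>> > > gluster volume reset home performance.parallel-readdir
>> >>> > > )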
>> >>> > >
>> >>> > > In this case the gluster servers and also the client are
>> >>> > > using a bonded network device running in adaptive load
>> >>> > > balancing mode.
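>> >>> > >
>> >>> > > (Illustrative only: adaptive load balancing is bonding mode
>> >>> > > balance-alb; with iproute2 such a device can be created with
>> >>> > > "ip link add bond0 type bond mode balance-alb".)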
>> >>> > >
>> >>> > > I've tried using the debug option for the client mount. But
>> >>> > > except for a ~0.5TB log file I didn't get information that
>> >>> > > seems helpful to me.
>> >>> > >
>> >>> > > Transferring just a couple of GB works without problems.
>> >>> > >
>> >>> > > It may very well be that I'm already blind to the obvious,
>> >>> > > but after many long-running tests I can't find the crux in
>> >>> > > the setup.
>> >>> > >
>> >>> > > Does anyone have an idea how to approach this problem in a
>> >>> > > way that sheds some useful information?
>> >>> > >
>> >>> > > Any help is highly appreciated!
>> >>> > > Cheers
>> >>> > > Richard
>> >>> > >
>> >>> > > --
>> >>> > > /dev/null
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > _______________________________________________
>> >>> > > Gluster-users mailing list
>> >>> > > Gluster-users at gluster.org
>> >>> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>> >>> > >
>> >>> > >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > /dev/null
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>> --
>> >>> /dev/null
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> Gluster-users at gluster.org
>> >> https://lists.gluster.org/mailman/listinfo/gluster-users
>> >>
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>> >
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>