[Gluster-users] Transport endpoint is not connected failures in 5.3 under high I/O load
brandon at thinkhuge.net
Mon Mar 18 16:46:09 UTC 2019
Hello list,
We are seeing critical failures under load on CentOS 7 with glusterfs 5.3: our
servers lose their local mount point with the error "Transport endpoint is not
connected".
Not sure if it is related, but the logs are full of the following message:
[2019-03-18 14:00:02.656876] E [MSGID: 101191]
[event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch
handler
We operate multiple separate distributed glusterfs clusters of about 6-8 nodes
each. Our two biggest and most I/O-active clusters are both hitting this issue.
We use glusterfs as a unified file system backing pureftpd backup services for
a VPS service. We have a relatively small backup window over the weekend, when
all our servers back up at the same time. When backups start early on Saturday,
they generate a sustained, massive amount of FTP upload I/O for around 48 hours
while the compressed backup files are uploaded. Our London 8-node cluster, for
example, currently receives about 45 TB of uploads in ~48 hours.
We also have some smaller issues with directory listings under this load, but
the setup had been working for a couple of years on 3.x. Since we updated
recently, all servers randomly lose their glusterfs mount with the "Transport
endpoint is not connected" error.
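When a client hits this error, the mount point is left stale until it is
cleaned up and remounted. As a rough sketch of the recovery we do by hand
(volume and mount-point names are from our setup below; the exact commands are
our assumption of the usual procedure, not something from any official doc):

```shell
# The FUSE client process has died, so a plain umount often fails;
# a lazy unmount detaches the stale mount point.
umount -l /home/volbackups

# Remount the volume from one of the cluster nodes.
mount -t glusterfs lonbaknode3.domain.net:/volbackups /home/volbackups

# Verify the mount is back before resuming uploads.
mount | grep volbackups
```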
Our glusterfs servers are mostly identical with small variations: Supermicro
boards with E3 CPUs, 16 GB RAM, and LSI RAID10 HDD arrays (with and without
BBU). Arrays range from 4 to 16 SATA3 HDDs per node depending on server age.
Firmware is kept up to date, and we run the latest LSI-compiled driver. The
newer 16-drive backup servers also have 2 x 1 Gbit LACP-teamed interfaces.
[root@lonbaknode3 ~]# uname -r
3.10.0-957.5.1.el7.x86_64
[root@lonbaknode3 ~]# rpm -qa |grep gluster
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-libs-5.3-2.el7.x86_64
glusterfs-api-5.3-2.el7.x86_64
glusterfs-5.3-2.el7.x86_64
glusterfs-cli-5.3-2.el7.x86_64
glusterfs-client-xlators-5.3-2.el7.x86_64
glusterfs-server-5.3-2.el7.x86_64
glusterfs-fuse-5.3-2.el7.x86_64
[root@lonbaknode3 ~]#
[root@lonbaknode3 ~]# gluster volume info all
Volume Name: volbackups
Type: Distribute
Volume ID: 32bf4fe9-5450-49f8-b6aa-05471d3bdffa
Status: Started
Snapshot Count: 0
Number of Bricks: 8
Transport-type: tcp
Bricks:
Brick1: lonbaknode3.domain.net:/lvbackups/brick
Brick2: lonbaknode4.domain.net:/lvbackups/brick
Brick3: lonbaknode5.domain.net:/lvbackups/brick
Brick4: lonbaknode6.domain.net:/lvbackups/brick
Brick5: lonbaknode7.domain.net:/lvbackups/brick
Brick6: lonbaknode8.domain.net:/lvbackups/brick
Brick7: lonbaknode9.domain.net:/lvbackups/brick
Brick8: lonbaknode10.domain.net:/lvbackups/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.min-free-disk: 1%
performance.cache-size: 8GB
performance.cache-max-file-size: 128MB
diagnostics.brick-log-level: WARNING
diagnostics.brick-sys-log-level: WARNING
client.event-threads: 3
performance.client-io-threads: on
performance.io-thread-count: 24
network.inode-lru-limit: 1048576
performance.parallel-readdir: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
[root@lonbaknode3 ~]#
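For the record, options like the above are applied one at a time with the
gluster CLI; for example (a sketch using our volume name):

```shell
# Set an option on the running volume; most client-side options take
# effect without a remount.
gluster volume set volbackups client.event-threads 3

# Read a single option back to confirm it took.
gluster volume get volbackups client.event-threads
```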
Mount output shows the following:
lonbaknode3.domain.net:/volbackups on /home/volbackups type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
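For completeness, the client mount comes from an fstab entry along these lines
(a sketch; the _netdev and backup-volfile-servers options are our assumption of
a sensible setup, not necessarily what is in place today):

```shell
# /etc/fstab entry: mount the volume at boot once the network is up,
# with fallback volfile servers in case lonbaknode3 is down.
lonbaknode3.domain.net:/volbackups  /home/volbackups  glusterfs  defaults,_netdev,backup-volfile-servers=lonbaknode4.domain.net:lonbaknode5.domain.net  0 0
```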
If you notice anything missing or misconfigured in our volume or mount settings
above, please let us know. We're still learning glusterfs; I tried searching
for recommended performance settings, but it's not always clear which settings
apply or help for our workload.
I have just found this post that looks like the same issue:
https://lists.gluster.org/pipermail/gluster-users/2019-March/035958.html
We have not yet tried the suggested "performance.write-behind: off", but we
will do so if that is recommended.
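If we do try it, my understanding is it would be applied like this (a sketch
using our volume name):

```shell
# Disable write-behind on the volume; connected clients should pick up
# the change without a remount.
gluster volume set volbackups performance.write-behind off

# Confirm the new value.
gluster volume get volbackups performance.write-behind
```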
Could someone knowledgeable advise on these issues?
If any more information is needed, let us know.