[Gluster-users] gluster fails under heavy array job load
harry mangalam
harry.mangalam at uci.edu
Fri Dec 13 17:50:27 UTC 2013
Bug 1043009 Submitted
On Thursday, December 12, 2013 11:46:03 PM Anand Avati wrote:
> Please provide the full client and server logs (in a bug report). The
> snippets give some hints, but are not very meaningful without the full
> context/history since mount time (they have after-the-fact symptoms, but
> not the part which shows the reason why the disconnects happened).
>
> Even before looking into the full logs here are some quick observations:
>
> - write-behind-window-size = 1024MB seems *excessively* high. Please set
> this to 1MB (default) and check if the stability improves.
>
> - I see RDMA is enabled on the volume. Are you mounting clients through
> RDMA? If so, for the purpose of diagnostics can you mount through TCP and
> check the stability improves? If you are using RDMA with such a high
> write-behind-window-size, spurious ping-timeouts are an almost certainty
> during heavy writes. The RDMA driver has limited flow control, and setting
> such a high window-size can easily congest all the RDMA buffers resulting
> in spurious ping-timeouts and disconnections.
>
> Avati
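
For anyone following the thread, here is roughly what those two suggestions
translate to on our setup. This is a sketch only: the volume name 'gl' and
server 'bs1' come from the volume info further down, and '/gl' as the client
mount point is an assumption.

    # on any of the gluster servers: drop write-behind back to the 1MB default
    gluster volume set gl performance.write-behind-window-size 1MB

    # on a client: remount over TCP for the diagnostic run.  On a tcp,rdma
    # volume the plain volume name should mount over TCP, while appending
    # '.rdma' to the volume name selects the RDMA transport.
    umount /gl
    mount -t glusterfs bs1:/gl /gl          # TCP
    # mount -t glusterfs bs1:/gl.rdma /gl   # RDMA, for comparison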
>
> On Thu, Dec 12, 2013 at 5:03 PM, harry mangalam
> <harry.mangalam at uci.edu> wrote:
> > Hi All,
> >
> > (Gluster Volume Details at bottom)
> >
> >
> >
> > I've posted some of this previously, but even after various upgrades,
> > attempted fixes, etc, it remains a problem.
> >
> >
> >
> >
> >
> > Short version: Our gluster fs (~340TB) provides scratch space for a
> > ~5,000-core academic compute cluster.
> >
> > Much of our load is streaming IO from a lot of genomics work, and that
> > is the load under which we saw this latest failure.
> >
> > Under heavy batch load, especially array jobs, where there might be
> > several 64-core nodes doing I/O against the 4 servers / 8 bricks, we often get job
> > failures that have the following profile:
> >
> >
> >
> > Client POV:
> >
> > Here is a sampling of the client logs (/var/log/glusterfs/gl.log) for all
> > compute nodes that indicated interaction with the user's files:
> >
> > <http://pastie.org/8548781>
> >
> >
> >
> > Here are some client Info logs that seem fairly serious:
> >
> > <http://pastie.org/8548785>
> >
> >
> >
> > The errors that referenced this user were gathered from all the nodes that
> > were running his code (in compute*) and agglomerated with:
> >
> >
> >
> > cut -f2,3 -d']' compute* |cut -f1 -dP | sort | uniq -c | sort -gr
> >
> >
> >
> > and placed here to show the profile of errors that his run generated.
> >
> > <http://pastie.org/8548796>
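
For clarity, that pipeline does roughly the following, assuming the usual
client log line format ('[timestamp] W [file.c:line:func] 0-gl-client-N:
message ...'):

    # compute*        : per-node excerpts of /var/log/glusterfs/gl.log
    # cut -f2,3 -d']' : split on ']' and keep fields 2-3, i.e. drop the
    #                   filename/timestamp prefix, keep severity + message
    # cut -f1 -dP     : truncate each message at the first 'P', which drops the
    #                   trailing per-file detail so identical errors aggregate
    # sort | uniq -c  : collapse identical messages and count them
    # sort -gr        : most frequent messages first
    cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr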
> >
> >
> >
> > so 71 of them were:
> >
> > W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote
> > operation failed: Transport endpoint is not connected.
> >
> > etc
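
One cross-check worth doing on the same compute* excerpts (a sketch; the
'0-gl-client-N' translator names are taken from lines like the one above) is
whether those disconnects are spread over all 8 bricks or piled onto one or two:

    # count 'Transport endpoint is not connected' warnings per brick client
    grep 'Transport endpoint is not connected' compute* |
        grep -o '0-gl-client-[0-9]*' |
        sort | uniq -c | sort -gr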
> >
> >
> >
> > We've seen this before and previously discounted it because it seemed to
> > have been related to the problem of spurious NFS-related bugs, but now I'm
> > wondering whether it's a real problem.
> >
> > Also note the 'remote operation failed: Stale file handle.' warnings.
> >
> >
> >
> > There were no Errors logged per se, though some of the W's looked fairly
> > nasty, like the 'dht_layout_dir_mismatch' warnings.
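
If those dht_layout_dir_mismatch warnings turn out to be more than noise, one
relatively low-impact thing to try on a pure Distribute volume like this is a
fix-layout pass, which rewrites directory layout xattrs without migrating any
file data (a sketch, run from any of the gluster servers):

    gluster volume rebalance gl fix-layout start
    gluster volume rebalance gl status    # watch progress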
> >
> >
> >
> > From the server side, however, during the same period, there were:
> >
> > 0 Warnings about this user's files
> >
> > 0 Errors
> >
> > 458 Info lines
> >
> > of which only 1 line was not a 'cleanup' line like this:
> >
> > ---
> >
> > 10.2.7.11:[2013-12-12 21:22:01.064289] I
> > [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on
> > /path/to/file
> >
> > ---
> >
> > it was:
> >
> > ---
> >
> > 10.2.7.14:[2013-12-12 21:00:35.209015] I
> > [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server:
> > 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030
> > (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
> >
> > ---
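
For what it's worth, that trusted.glusterfs.dht layout xattr can be inspected
directly on the bricks (a sketch, run on the servers; the brick paths /raid1
and /raid2 come from the volume info below, and I'm assuming the volume path
maps straight onto the brick root):

    # dump the dht layout xattr for the directory in question on each brick
    getfattr -n trusted.glusterfs.dht -e hex /raid1/bio/tdlong/RNAseqIII
    getfattr -n trusted.glusterfs.dht -e hex /raid2/bio/tdlong/RNAseqIII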
> >
> >
> >
> > We're losing about 10% of these kinds of array jobs because of this, which is
> > just not supportable.
> >
> >
> >
> >
> >
> >
> >
> > Gluster details
> >
> >
> >
> > Servers and clients run gluster 3.4.0-8.el6 on CentOS 6.4, over QDR IB
> > (IPoIB), through 2 Mellanox switches and 1 Voltaire switch, with Mellanox cards.
> >
> >
> >
> > $ gluster volume info
> >
> > Volume Name: gl
> >
> > Type: Distribute
> >
> > Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
> >
> > Status: Started
> >
> > Number of Bricks: 8
> >
> > Transport-type: tcp,rdma
> >
> > Bricks:
> >
> > Brick1: bs2:/raid1
> >
> > Brick2: bs2:/raid2
> >
> > Brick3: bs3:/raid1
> >
> > Brick4: bs3:/raid2
> >
> > Brick5: bs4:/raid1
> >
> > Brick6: bs4:/raid2
> >
> > Brick7: bs1:/raid1
> >
> > Brick8: bs1:/raid2
> >
> > Options Reconfigured:
> >
> > performance.write-behind-window-size: 1024MB
> >
> > performance.flush-behind: on
> >
> > performance.cache-size: 268435456
> >
> > nfs.disable: on
> >
> > performance.io-cache: on
> >
> > performance.quick-read: on
> >
> > performance.io-thread-count: 64
> >
> > auth.allow: 10.2.*.*,10.1.*.*
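
As an aside, any one of the reconfigured options above can be put back to its
built-in default with 'gluster volume reset' rather than by setting an explicit
value; a sketch, using the write-behind option Avati flagged:

    # revert a single tunable to its default
    gluster volume reset gl performance.write-behind-window-size
    # or revert every reconfigured option at once
    # gluster volume reset gl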
> >
> >
> >
> >
> >
> > 'gluster volume status gl detail':
> >
> > <http://pastie.org/8548826>
> >
> >
> >
> > ---
> >
> > Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> >
> > [m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> >
> > 415 South Circle View Dr, Irvine, CA, 92697 [shipping]
> >
> > MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> >
> > ---
> >
> >
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://supercolony.gluster.org/mailman/listinfo/gluster-users
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---