[Gluster-users] Glusterfs 3.3 rapidly generating write errors under heavy load.
harry mangalam
harry.mangalam at uci.edu
Fri Apr 12 21:51:56 UTC 2013
As I've posted previously <http://goo.gl/DLplt> with increasing frequency, our
academic cluster's glusterfs volume (340TB over 4 nodes, 2 bricks each, details at
bottom) is generating unacceptable errors under heavy load (which is the norm
for the cluster). We use the SGE scheduler and it looks like gluster cannot
keep up under heavy write load (as is the case with array jobs), or at least
the kind of load that we're putting it under. Comments welcome.
The user who has been most affected writes this:
[[..it is the same issue that I've been seeing for a few days. I've been able
to get access to up to 800 cores in the last week, which enables a high write
load. These programs are also attempting to buffer the output by storing to
large internal string streams before writing.
A different script, which was based only on command-line manipulations of
files (gzip, zcat, cut, and paste), had similar issues. I re-wrote those
operations to be done in one fell swoop in C++, and it ran through just
fine.]]
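(For concreteness, here is a minimal sketch of the "buffer, then write once"
pattern the user describes; this is my own illustration in C++, not the user's
actual code, and the output path and record loop are made up:)

  // Accumulate all output in an in-memory string stream, then write it to the
  // gluster-mounted file in one shot instead of many small appends.
  #include <fstream>
  #include <sstream>

  int main() {
      std::ostringstream buf;                 // large in-memory buffer

      // hypothetical work loop: append one output record per iteration
      for (int i = 0; i < 1000000; ++i) {
          buf << "record " << i << '\n';
      }

      // one big write to a (hypothetical) file on the gluster mount
      std::ofstream out("/gl/scratch/example_output.txt", std::ios::binary);
      out << buf.str();
      return 0;                               // out flushes/closes at scope exit
  }

The point is simply that each job touches its output file once, rather than
hitting the volume with thousands of small writes.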
For example, over the last 5 days the volume has generated errors (' E ' log
lines) in these numbers:
biostor1 - 58 Errors (biostorX is the node; raid[12] are the bricks)
raid1 - 2 Errors
raid2 - 56 Errors
biostor2 - 13532 Errors
raid1 - 10384 Errors
raid2 - 3148 Errors
biostor3 - 35 Errors
raid1 - 6 Errors
raid2 - 29 Errors
biostor4 - 98 Errors
raid1 - 27 Errors
raid2 - 71 Errors
================================================================
on biostor1, the errors were distributed like this (stripping the particulars):
# errs  source location (file:line:function)
44 [posix.c:358:posix_setattr]
8 [posix.c:823:posix_mknod]
2 [posix.c:1730:posix_create]
1 [server.c:176:server_submit_reply]
1 [rpcsvc.c:1080:rpcsvc_submit_generic]
1 [posix.c:857:posix_mknod]
1 [posix-helpers.c:685:posix_handle_pair]
Examples:
44 x [2013-04-11 20:43:28.811049] E [posix.c:358:posix_setattr] 0-gl-posix:
setattr (lstat) on /raid2/.glusterfs/9b/03/9b036627-864b-403a-8681-
e4b1ad1a0da6 failed: No such file or directory
(occurring in clumps - all 44 happened in one minute.)
8 x [2013-04-11 21:36:34.665924] E [posix.c:823:posix_mknod] 0-gl-posix: mknod
on
/raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.14
failed: File exists
(7 within 2m)
================================================================
on biostor2, the errors were distributed like this (stripping the particulars):
# errs  source location (file:line:function)
7558 [posix.c:1852:posix_open]
3136 [posix.c:823:posix_mknod]
2819 [posix.c:223:posix_stat]
8 [posix.c:183:posix_lookup]
4 [posix.c:1730:posix_create]
2 [posix.c:857:posix_mknod]
Examples:
7558 x [2013-04-11 20:30:12.080860] E [posix.c:1852:posix_open] 0-gl-posix:
open on /raid1/.glusterfs/ba/03/ba035b25-ac26-451e-a1ec-9fd9262ce9a3: No such
file or directory
(all in ~13m)
3136 x [2013-04-11 14:44:49.185916] E [posix.c:823:posix_mknod] 0-gl-posix:
mknod on
/raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.3
failed: File exists
(all in the same 13m as above - of these, all but 17 were referencing the
SAME SGE array file:
/raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.22)
2819 x [2013-04-11 20:30:16.469462] E [posix.c:223:posix_stat] 0-gl-posix:
lstat on /raid1/.glusterfs/2c/54/2c545e08-a523-4502-bc1a-817e0368a04c failed:
No such file or directory
(all in the same 13m as above)
================================================================
on biostor3, the errors were distributed like this (stripping the particulars):
# errs  source location (file:line:function)
17 [server-helpers.c:763:server_alloc_frame]
15 [posix.c:823:posix_mknod]
3 [posix.c:1730:posix_create]
Examples:
17 x [2013-04-08 14:22:28.835606] E [server-helpers.c:763:server_alloc_frame]
(-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x327220a5b3]
(-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x327220a443]
(-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8)
[0x7fc6a9836558]))) 0-server: invalid argument: conn
(in 2 batches within 1s each)
15 x [2013-04-10 11:30:44.453916] E [posix.c:823:posix_mknod] 0-gl-posix:
mknod on
/raid1/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms/esm.500000.18
failed: File exists
(9 in ~5m, 6 in ~3m; see also above; these are SGE array jobs so
they're being generated quite fast.)
================================================================
on biostor4, the errors were distributed like this (stripping the particulars):
# errs  source location (file:line:function)
50 [server-helpers.c:763:server_alloc_frame]
26 [posix.c:823:posix_mknod]
8 [posix.c:857:posix_mknod]
8 [posix-helpers.c:685:posix_handle_pair]
2 [server.c:176:server_submit_reply]
2 [rpcsvc.c:1080:rpcsvc_submit_generic]
1 [posix.c:865:posix_mknod]
1 [posix.c:183:posix_lookup]
Examples:
50 x [2013-04-08 13:36:42.286009] E [server-helpers.c:763:server_alloc_frame]
(-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x39b200a5b3]
(-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x39b200a443]
(-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8)
[0x7f42e695e558]))) 0-server: invalid argument: conn
(in 2 groups of 3 and 47, each group occurring within 1s)
26 x [2013-04-11 10:00:47.609499] E [posix.c:823:posix_mknod] 0-gl-posix:
mknod on /raid1/bio/tdlong/yeast2/data/bam/YEE_0000_00_00_00__.bam failed:
File exists
(2 groups of 6 and 15, each group occurring within 1s)
================================================================
Gluster configuration info:
$ gluster volume info gl
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*
================================================================
$ gluster volume status gl detail
Status of volume: gl
----------------------------------------------------------------
Brick : Brick bs2:/raid1
Port : 24009
Online : Y
Pid : 2904
File System : xfs
Device : /dev/sdc
Mount Options : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size : 256
Disk Space Free : 28.2TB
Total Disk Space : 43.7TB
Inode Count : 9374964096
Free Inodes : 9372045017
------------------------------------------------------------------------------
Brick : Brick bs2:/raid2
Port : 24011
Online : Y
Pid : 2910
File System : xfs
Device : /dev/sdd
Mount Options : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size : 256
Disk Space Free : 27.2TB
Total Disk Space : 40.9TB
Inode Count : 8789028864
Free Inodes : 8786101538
------------------------------------------------------------------------------
Brick : Brick bs3:/raid1
Port : 24009
Online : Y
Pid : 2876
File System : xfs
Device : /dev/sdc
Mount Options : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size : 256
Disk Space Free : 28.5TB
Total Disk Space : 43.7TB
Inode Count : 9374964096
Free Inodes : 9372035932
------------------------------------------------------------------------------
Brick : Brick bs3:/raid2
Port : 24011
Online : Y
Pid : 2881
File System : xfs
Device : /dev/sdd
Mount Options : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size : 256
Disk Space Free : 25.0TB
Total Disk Space : 40.9TB
Inode Count : 8789028864
Free Inodes : 8786099214
------------------------------------------------------------------------------
Brick : Brick bs4:/raid1
Port : 24009
Online : Y
Pid : 2955
File System : xfs
Device : /dev/sdc
Mount Options : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size : 256
Disk Space Free : 28.0TB
Total Disk Space : 43.7TB
Inode Count : 9374964096
Free Inodes : 9372034051
------------------------------------------------------------------------------
Brick : Brick bs4:/raid2
Port : 24011
Online : Y
Pid : 2961
File System : xfs
Device : /dev/sdd
Mount Options : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size : 256
Disk Space Free : 24.1TB
Total Disk Space : 40.9TB
Inode Count : 8789028864
Free Inodes : 8786101010
------------------------------------------------------------------------------
Brick : Brick bs1:/raid1
Port : 24013
Online : Y
Pid : 3043
File System : xfs
Device : /dev/sdc
Mount Options : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size : 256
Disk Space Free : 29.1TB
Total Disk Space : 43.7TB
Inode Count : 9374964096
Free Inodes : 9372036362
------------------------------------------------------------------------------
Brick : Brick bs1:/raid2
Port : 24015
Online : Y
Pid : 3049
File System : xfs
Device : /dev/sdd
Mount Options : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size : 256
Disk Space Free : 25.9TB
Total Disk Space : 40.9TB
Inode Count : 8789028864
Free Inodes : 8786101382
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
"A Message From a Dying Veteran" <http://goo.gl/tTHdo>