[Gluster-users] Glusterfs 3.3 rapidly generating write errors under heavy load.

harry mangalam harry.mangalam at uci.edu
Fri Apr 12 21:51:56 UTC 2013


As I've posted previously <http://goo.gl/DLplt>, and with increasing frequency, our 
academic cluster's glusterfs (340TB over 4 nodes, 2 bricks each, details at 
bottom) is generating unacceptable errors under heavy load (which is the norm 
for the cluster). We use the SGE scheduler, and it looks like gluster cannot 
keep up under heavy write load (as is the case with array jobs), or at least 
the kind of load that we're putting it under.  Comments welcome.

The user who has mostly been affected writes this:
[[..it is the same issue that I've been seeing for a few days.  I've been able 
to get access to up to 800 cores in the last week, which enables a high write 
load.  These programs are also attempting to buffer the output by storing to 
large internal string streams before writing.

A different script, which was based only on command-line manipulations of 
files (gzip, zcat, cut, and paste), had similar issues.  I re-wrote those 
operations to be done in one fell swoop in C++, and it ran through just 
fine.]]
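
For what it's worth, the write pattern he describes boils down to something 
like the following.  This is my own minimal C++ sketch of "buffer everything 
in a string stream, then write it out in one operation"; the output path is a 
hypothetical location on the gluster mount, not his actual code:

    // Minimal sketch (not the user's code): accumulate all output in an
    // in-memory string stream, then push it to the gluster mount with a
    // single large write instead of many small appends.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::ostringstream buf;                    // large in-memory buffer
        for (int i = 0; i < 1000000; ++i)          // stand-in for real results
            buf << i << '\t' << i * i << '\n';

        // hypothetical path on the glusterfs mount
        std::ofstream out("/gl/scratch/esm.out", std::ios::binary);
        if (!out) { std::cerr << "open failed\n"; return 1; }

        const std::string s = buf.str();
        out.write(s.data(), static_cast<std::streamsize>(s.size()));  // one big write
        return out.good() ? 0 : 1;
    }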

For example, in the last 5 days it has generated errors (' E ') in these 
numbers (a rough sketch of the counting step follows the per-node totals):

biostor1 -    58 Errors  (biostorX is the node; raid[12] are the bricks)
   raid1 -     2 Errors
   raid2 -    56 Errors

biostor2 - 13532 Errors
   raid1 - 10384 Errors 
   raid2 -  3148 Errors

biostor3 -    35 Errors
   raid1 -     6 Errors
   raid2 -    29 Errors

biostor4 -    98 Errors
   raid1 -    27 Errors
   raid2 -    71 Errors
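
As mentioned above, here is a rough sketch of the counting step (mine, not the 
script actually used): pull the ' E ' lines out of a brick log and tally them 
per [file.c:line:function].  The brick log path in the comment is an 
assumption about a typical 3.3 layout.

    // Read a glusterfs brick log on stdin and count ' E ' lines per source
    // location, e.g. " E [posix.c:358:posix_setattr]".
    //   g++ -O2 -std=c++11 -o tally tally.cpp
    //   ./tally < /var/log/glusterfs/bricks/raid1.log   (assumed log location)
    #include <iostream>
    #include <map>
    #include <regex>
    #include <string>

    int main() {
        std::map<std::string, long> counts;
        std::regex loc(R"( E \[([^\]]+)\])");   // captures "file.c:line:function"
        std::string line;
        std::smatch m;
        while (std::getline(std::cin, line))
            if (std::regex_search(line, m, loc))
                ++counts[m[1].str()];
        // print "count [location]", sorted by location name
        for (const auto& kv : counts)
            std::cout << kv.second << "\t[" << kv.first << "]\n";
        return 0;
    }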
   
   
================================================================
on biostor1, the errors were distributed like this (stripping the particulars):

 # errs  source location (file:line:function)
     44 [posix.c:358:posix_setattr]
      8 [posix.c:823:posix_mknod]
      2 [posix.c:1730:posix_create]
      1 [server.c:176:server_submit_reply]
      1 [rpcsvc.c:1080:rpcsvc_submit_generic]
      1 [posix.c:857:posix_mknod]
      1 [posix-helpers.c:685:posix_handle_pair]

Examples:

44 x [2013-04-11 20:43:28.811049] E [posix.c:358:posix_setattr] 0-gl-posix: 
setattr (lstat) on /raid2/.glusterfs/9b/03/9b036627-864b-403a-8681-
e4b1ad1a0da6 failed: No such file or directory

             (occurring in clumps - all 44 happened in one minute.)

8 x [2013-04-11 21:36:34.665924] E [posix.c:823:posix_mknod] 0-gl-posix: mknod 
on 
/raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.14 
failed: File exists
             (7 within 2m)
             
================================================================
on biostor2, the errors were distributed like this (stripping the particulars):

 # errs  source location (file:line:function)
   7558 [posix.c:1852:posix_open]
   3136 [posix.c:823:posix_mknod]
   2819 [posix.c:223:posix_stat]
      8 [posix.c:183:posix_lookup]
      4 [posix.c:1730:posix_create]
      2 [posix.c:857:posix_mknod]
      
Examples:

7558 x [2013-04-11 20:30:12.080860] E [posix.c:1852:posix_open] 0-gl-posix: 
open on /raid1/.glusterfs/ba/03/ba035b25-ac26-451e-a1ec-9fd9262ce9a3: No such 
file or directory
  (all in  ~13m)

3136 x [2013-04-11 14:44:49.185916] E [posix.c:823:posix_mknod] 0-gl-posix: 
mknod on 
/raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.3 
failed: File exists
  (all in the same 13m as above - of these, all but 17 were referencing the 
SAME SGE array file:
  /raid2/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms_collected/esm.500000.22)


2819 x [2013-04-11 20:30:16.469462] E [posix.c:223:posix_stat] 0-gl-posix: 
lstat on /raid1/.glusterfs/2c/54/2c545e08-a523-4502-bc1a-817e0368a04c failed: 
No such file or directory
  (all in the same 13m as above)


================================================================
on biostor3, the errors were distributed like this (stripping the particulars):

 # errs  source location (file:line:function)
     17 [server-helpers.c:763:server_alloc_frame]                                         
     15 [posix.c:823:posix_mknod]
      3 [posix.c:1730:posix_create]

Examples:

17 x [2013-04-08 14:22:28.835606] E [server-helpers.c:763:server_alloc_frame] 
  (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x327220a5b3] 
  (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x327220a443] 
  (-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8) [0x7fc6a9836558]))) 
  0-server: invalid argument: conn
       (in 2 batches within 1s each)

15 x [2013-04-10 11:30:44.453916] E [posix.c:823:posix_mknod] 0-gl-posix: 
mknod on 
/raid1/bio/krthornt/WTCCC/explore_Jan2013/control_vs_control/esm/more_perms/esm.500000.18 
failed: File exists
       (9 in ~5m, 6 in ~3m; see also above; these are SGE array jobs, so the 
files are being created quite fast.)
       
================================================================

on biostor4, the errors were distributed like this (stripping the particulars):

 # errs  source location (file:line:function)
     50 [server-helpers.c:763:server_alloc_frame]                                          
     26 [posix.c:823:posix_mknod]
      8 [posix.c:857:posix_mknod]
      8 [posix-helpers.c:685:posix_handle_pair]
      2 [server.c:176:server_submit_reply]
      2 [rpcsvc.c:1080:rpcsvc_submit_generic]
      1 [posix.c:865:posix_mknod]
      1 [posix.c:183:posix_lookup]


Examples:

50 x [2013-04-08 13:36:42.286009] E [server-helpers.c:763:server_alloc_frame] 
  (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x39b200a5b3] 
  (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293) [0x39b200a443] 
  (-->/usr/lib64/glusterfs/3.3.0/xlator/protocol/server.so(server_lookup+0xb8) [0x7f42e695e558]))) 
  0-server: invalid argument: conn
   (in 2 groups of 3 and 47, each group occurring within 1s)

26 x [2013-04-11 10:00:47.609499] E [posix.c:823:posix_mknod] 0-gl-posix: 
mknod on /raid1/bio/tdlong/yeast2/data/bam/YEE_0000_00_00_00__.bam failed: 
File exists 
    (2 groups of 6 and 15, each occurring within 1s)
    
================================================================
   
   
Gluster configuration info:
   
$ gluster volume info gl 
   
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

================================================================

$ gluster volume status gl detail
Status of volume: gl
----------------------------------------------------------------
Brick                : Brick bs2:/raid1    
Port                 : 24009               
Online               : Y                   
Pid                  : 2904                
File System          : xfs                 
Device               : /dev/sdc            
Mount Options        : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 28.2TB              
Total Disk Space     : 43.7TB              
Inode Count          : 9374964096          
Free Inodes          : 9372045017          
------------------------------------------------------------------------------
Brick                : Brick bs2:/raid2    
Port                 : 24011               
Online               : Y                   
Pid                  : 2910                
File System          : xfs                 
Device               : /dev/sdd            
Mount Options        : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 27.2TB              
Total Disk Space     : 40.9TB              
Inode Count          : 8789028864          
Free Inodes          : 8786101538          
------------------------------------------------------------------------------
Brick                : Brick bs3:/raid1    
Port                 : 24009               
Online               : Y                   
Pid                  : 2876                
File System          : xfs                 
Device               : /dev/sdc            
Mount Options        : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 28.5TB              
Total Disk Space     : 43.7TB              
Inode Count          : 9374964096          
Free Inodes          : 9372035932          
------------------------------------------------------------------------------
Brick                : Brick bs3:/raid2    
Port                 : 24011               
Online               : Y                   
Pid                  : 2881                
File System          : xfs                 
Device               : /dev/sdd            
Mount Options        : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 25.0TB              
Total Disk Space     : 40.9TB              
Inode Count          : 8789028864          
Free Inodes          : 8786099214          
------------------------------------------------------------------------------
Brick                : Brick bs4:/raid1    
Port                 : 24009               
Online               : Y                   
Pid                  : 2955                
File System          : xfs                 
Device               : /dev/sdc            
Mount Options        : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 28.0TB              
Total Disk Space     : 43.7TB              
Inode Count          : 9374964096          
Free Inodes          : 9372034051          
------------------------------------------------------------------------------
Brick                : Brick bs4:/raid2    
Port                 : 24011               
Online               : Y                   
Pid                  : 2961                
File System          : xfs                 
Device               : /dev/sdd            
Mount Options        : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 24.1TB              
Total Disk Space     : 40.9TB              
Inode Count          : 8789028864          
Free Inodes          : 8786101010          
------------------------------------------------------------------------------
Brick                : Brick bs1:/raid1    
Port                 : 24013               
Online               : Y                   
Pid                  : 3043                
File System          : xfs                 
Device               : /dev/sdc            
Mount Options        : rw,noatime,sunit=512,swidth=8192,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 29.1TB              
Total Disk Space     : 43.7TB              
Inode Count          : 9374964096          
Free Inodes          : 9372036362          
------------------------------------------------------------------------------
Brick                : Brick bs1:/raid2    
Port                 : 24015               
Online               : Y                   
Pid                  : 3049                
File System          : xfs                 
Device               : /dev/sdd            
Mount Options        : rw,noatime,sunit=512,swidth=7680,allocsize=32m
Inode Size           : 256                 
Disk Space Free      : 25.9TB              
Total Disk Space     : 40.9TB              
Inode Count          : 8789028864          
Free Inodes          : 8786101382          
 

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
"A Message From a Dying Veteran" <http://goo.gl/tTHdo>



