[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env

harry mangalam harry.mangalam at uci.edu
Fri Mar 22 17:51:54 UTC 2013


We have a ~2500-core academic cluster under saturating use.
The main data store is a 4-node/8-brick/340TB/QDR IB gluster 3.3
filesystem.  All servers are 8xOpteron/32GB systems with 3ware 9750 SAS
controllers.  They all run SL6.2 and are stable, with load holding steady at
about 2.

gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

 
Many of our users run large array jobs under SGE, and especially during those
runs where there is LOTS of IO, we will VERY occasionally (20 times since last
June, according to the brick logs) see errors like the one quoted below, each
of which causes that particular element of the array job to fail.
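
For anyone who wants to pull the same kind of count from their own brick logs, here is a minimal sketch. It assumes the log line layout seen in the brick log excerpts in this message ([date time] level [file.c:line:function] message) and counts only the posix_create "setting gfid on ... failed" errors. The function name and the per-month grouping are my own choices for illustration, not anything gluster provides.

```python
import re
import sys
from collections import Counter

# Assumed brick-log line shape, inferred from the excerpts in this message:
#   [YYYY-MM-DD HH:MM:SS.usec] L [file.c:line:function] 0-gl-xxx: message
LOG_LINE = re.compile(
    r"^\[(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+)\] "
    r"(?P<level>[A-Z]) \[(?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def count_gfid_failures(lines):
    """Count 'setting gfid on ... failed' errors, grouped by month (YYYY-MM)."""
    per_month = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # wrapped or unrelated line; skip it
        is_error = m.group("level") == "E"
        from_create = "posix_create" in m.group("src")
        if is_error and from_create and "setting gfid on" in m.group("msg"):
            per_month[m.group("date")[:7]] += 1
    return per_month


if __name__ == "__main__":
    # Usage (path is an example): python count_gfid.py /var/log/glusterfs/bricks/raid1.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for month, n in sorted(count_gfid_failures(fh).items()):
                print(path, month, n)
```

Rotated or compressed logs would need to be fed in separately; the sketch reads plain-text files only.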

Sometimes such a failure is acceptable, but often the next job depends on all
elements of the array job completing correctly. At any rate, from the fs POV
they should all complete.

The rarity of this error, its type, and where it occurs suggest that it might
be a hash collision..?  I can't find it registered as a bug in the gluster
bugzilla, so I am asking here whether others have seen this and how it might
be addressed.

=========================================================================
> The error below being reported by Grid Engine says:
> 
> user "root" 03/21/2013 15:29:23 [507:26777]: error: can't open output file
> "/gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103":
> Permission denied
> 03/21/2013 15:29:23 [400:25458]: wait3
=========================================================================

Looking through all the server logs
(/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this
error, but the brick logs yield this set of lines referencing that file at the
correct time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W
[posix-handle.c:461:posix_handle_hard] 0-gl-posix: link
/raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103
-> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed
(File exists)

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E
[posix.c:1730:posix_create] 0-gl-posix: setting gfid on
/raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103
failed

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I
[server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN
/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103
(5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I
[server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN
/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103
(5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
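
The first warning shows the brick trying to hard-link the new file to a handle under .glusterfs whose path is built from the gfid (first two hex characters, next two, then the full gfid). A minimal sketch of how one might check, on the brick server itself, whether that handle already exists and whether it points at the same inode as the file. The directory layout is inferred from the log line above, and the helper names are hypothetical, not part of gluster.

```python
import os
import uuid


def gfid_handle_path(brick_root, gfid):
    """Build <brick>/.glusterfs/<aa>/<bb>/<gfid>, the layout seen in the log."""
    g = str(uuid.UUID(gfid))  # validates and normalises the gfid string
    return os.path.join(brick_root, ".glusterfs", g[0:2], g[2:4], g)


def compare_with_handle(brick_root, file_path, gfid):
    """Report whether the file and its gfid handle exist and share an inode."""
    handle = gfid_handle_path(brick_root, gfid)
    result = {
        "handle": handle,
        "handle_exists": os.path.lexists(handle),
        "file_exists": os.path.lexists(file_path),
        "same_inode": None,  # stays None unless both paths exist
    }
    if result["handle_exists"] and result["file_exists"]:
        h, f = os.lstat(handle), os.lstat(file_path)
        result["same_inode"] = (h.st_dev, h.st_ino) == (f.st_dev, f.st_ino)
        result["file_nlink"] = f.st_nlink
    return result
```

For the failing file this would be called with the brick root /raid1, the full brick-side path from the log, and the gfid 5a0e87a6-e35d-4368-841e-b45802fecc4e. A handle that exists but does not share an inode with the file would be consistent with the "File exists" failure, though that is only one possible reading.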

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---



