[Gluster-users] Redux: Stale NFS file handle

harry mangalam harry.mangalam at uci.edu
Fri Jan 4 01:56:10 UTC 2013


From roughly an hour's googling and reading, it looks like this (not uncommon)
bug/warning/error has not necessarily been associated with data loss, but we
are finding that our gluster fs is interrupting our cluster jobs with
'Stale NFS file handle' warnings like this (on the client):

[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-
gl-client-0: remote operation failed: Stale NFS file handle. Path: 
/bio/krthornt/build_div/yak/line06_CY08A/prinses (3b0aa7b2-bf7f-4b27-
b515-32e94b1206e3)

(plus 7 more like it, with timestamps all within well under 1s of each other).
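To see how often these warnings occur and which paths they hit, something like the following sketch could tally them from the client log. The log location and the exact message format are assumptions based on the excerpt above, not anything confirmed in this thread:

```python
import re
from collections import Counter

# Matches lines like:
#   [...] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-0:
#   remote operation failed: Stale NFS file handle. Path: /bio/... (uuid)
ESTALE_RE = re.compile(
    r"W \[.*lookup_cbk\].*Stale NFS file handle\. Path:\s*(\S+)")

def count_stale_paths(lines):
    """Tally affected paths so recurring directories stand out."""
    hits = Counter()
    for line in lines:
        m = ESTALE_RE.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

# Hypothetical usage -- the client log path is an assumption:
# with open("/var/log/glusterfs/bio.log") as f:
#     print(count_stale_paths(f).most_common(10))
```

If one directory dominates the counts, that would support the observation below that the errors cluster around big array jobs reading heavily from a particular dir.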

The dir mentioned existed before the job was asked to read from it, and shortly
after the SGE job failed I checked that the glusterfs (/bio) was still mounted
and that the dir was still readable and writable.

We are getting these errors infrequently but fairly regularly (a couple of times
a week, usually during a big array job that heavily reads from a particular
dir), and I haven't seen any resolution of the fault beyond the error text
being reworded.  I know it's not necessarily an NFS problem, but I haven't seen
a fix from the gluster folks.
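As noted in the quoted replies below, the "NFS" in the message is incidental: it is just libc's strerror() text for errno ESTALE, which gluster reuses to mean "this file reference is stale", even on the native client. A quick demonstration:

```python
import errno
import os

# The warning text is simply the C library's string for ESTALE; no NFS is
# actually involved when the gluster native client reports it.
msg = os.strerror(errno.ESTALE)

# On Linux this typically prints ESTALE's number and "Stale file handle"
# (older glibc versions said "Stale NFS file handle", hence the confusion).
print(errno.ESTALE, msg)
```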

Our glusterfs on this system is set up like this (over QDR IB, tcp over IPoIB):

$ gluster volume info gl
 
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB

and otherwise appears to be happy.  

We were having a low-level problem with the RAID servers, where this LSI/3ware 
error was temporally close (~2m) to the gluster error:

LSI 3DM2 alert -- host: biostor4.oit.uci.edu
Jan 03, 2013 03:32:09PM - Controller 6
ERROR - Drive timeout detected: encl=1, slot=3

This error seemed to be related to construction around our data center and the
dust it raised.  We have had tens of these LSI/3ware errors with no related
gluster errors or apparent problems with the RAIDs.  No drives were ejected
from the RAIDs and the errors did not repeat.  3ware explains:
<http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html>
==============================
009h Drive timeout detected

The 3ware RAID controller has a sophisticated recovery mechanism to handle 
various types of failures of a disk drive. One such possible failure of a disk 
drive is a failure of a command that is pending from the 3ware RAID controller 
to complete within a reasonable amount of time. If the 3ware RAID controller 
detects this condition, it notifies the user, prior to entering the recovery 
phase, by displaying this AEN.

Possible causes of APORT time-outs include a bad or intermittent disk drive, 
power cable or interface cable.
================================
We've checked into this and it doesn't seem to be related, but I thought I'd 
bring it up.
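One way to check whether the two kinds of events really do coincide would be to compare timestamps across the two logs. This is my own sketch, not anything from the thread, and it assumes both logs record the same timezone (the gluster log format and the 3DM2 alert format are copied from the excerpts above):

```python
from datetime import datetime

def within(gluster_ts, tw_ts, minutes=5):
    """True if a gluster log timestamp and a 3DM2 alert timestamp
    fall within `minutes` of each other (same-timezone assumption)."""
    # gluster: "2013-01-03 12:30:59.149230" (drop microseconds, brackets)
    g = datetime.strptime(gluster_ts.strip("[] ").split(".")[0],
                          "%Y-%m-%d %H:%M:%S")
    # 3ware 3DM2: "Jan 03, 2013 03:32:09PM"
    t = datetime.strptime(tw_ts, "%b %d, %Y %I:%M:%S%p")
    return abs((g - t).total_seconds()) <= minutes * 60
```

Running every gluster ESTALE warning against every 3ware AEN this way would show whether the ~2 minute proximity above is a pattern or a one-off coincidence.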

hjm





On Thursday, August 23, 2012 09:54:13 PM Joe Julian wrote:
> *Bug 832694* <https://bugzilla.redhat.com/show_bug.cgi?id=832694>
> -ESTALE error text should be reworded
> 
> On 08/23/2012 09:50 PM, Kaushal M wrote:
> > The "Stale NFS file handle" message is the default string given by
> > strerror() for errno ESTALE.
> > Gluster uses ESTALE as errno to indicate that the file being referred
> > to no longer exists, ie. the reference is stale.
> > 
> > - Kaushal
> > 
> > On Fri, Aug 24, 2012 at 7:03 AM, Jules Wang <lancelotds at 163.com> wrote:
> >     Hi, Jon ,
> >     
> >         I also met the same issue, and reported a
> >     
> >     bug (https://bugzilla.redhat.com/show_bug.cgi?id=851381).
> >     
> >         In the bug report, I share a simple way to reproduce the bug.
> >         Have fun.
> >     
> >     Best Regards.
> >     Jules Wang.
> >     
> >     At 2012-08-23 23:02:34, "Bùi Hùng Việt" <buihungviet at gmail.com> wrote:
> >         Hi Jon,
> >         I have no answer for you. Just want to share with you guys
> >         that I met the same issue with this message. In my gluster
> >         system, Gluster client log files have a lot of these messages.
> >         I tried to ask and found nothing on the Web. Amazingly,
> >         Gluster have been running for long time :)
> >         
> >         On Thu, Aug 23, 2012 at 8:43 PM, Jon Tegner <tegner at renget.se> wrote:
> >             Hi, I'm a bit curious of error messages of the type
> >             "remote operation failed: Stale NFS file handle". All
> >             clients using the file system use Gluster Native Client,
> >             so why should stale nfs file handle be reported?
> >             
> >             Regards,
> >             
> >             /jon
> >             
> >             
> >             
> >             _______________________________________________
> >             Gluster-users mailing list
> >             Gluster-users at gluster.org
> >             http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
> >     
> > 

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)



