[Gluster-users] Files present on the backend but have become invisible from clients

Burnash, James jburnash at knight.com
Thu Jun 23 16:38:52 UTC 2011


Hi Jeff.

Well, it took me 3 reads (1 out loud) to process what was going on in your explanation - but then it all made sense :-)

As to your question about xattributes on those 4 directories at the end of your message:

fs18/g01/pfs-ro1-client-1
getfattr -d -e hex -m - /export/read-only/g01
getfattr: Removing leading '/' from absolute path names
# file: export/read-only/g01
trusted.afr.pfs-ro1-client-0=0x000000000600000800000000
trusted.afr.pfs-ro1-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000333333303ffffffb
trusted.glusterfs.test=0x776f726b696e6700

fs15/g01/pfs-ro1-client-21
getfattr -d -e hex -m - /export/read-only/g01
getfattr: Removing leading '/' from absolute path names
# file: export/read-only/g01
trusted.afr.pfs-ro1-client-20=0x000000000200000000000000
trusted.afr.pfs-ro1-client-21=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000004cccccc859999993
trusted.glusterfs.test=0x776f726b696e6700

fs18/g02/pfs-ro1-client-3
getfattr -d -e hex -m - /export/read-only/g02
getfattr: Removing leading '/' from absolute path names
# file: export/read-only/g02
trusted.afr.pfs-ro1-client-2=0x000000004500000400000000
trusted.afr.pfs-ro1-client-3=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003ffffffc4cccccc7
trusted.glusterfs.test=0x776f726b696e6700

fs15/g02/pfs-ro1-client-23
getfattr -d -e hex -m - /export/read-only/g02
getfattr: Removing leading '/' from absolute path names
# file: export/read-only/g02
trusted.afr.pfs-ro1-client-22=0x000000000200000000000000
trusted.afr.pfs-ro1-client-23=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000599999946666665f
trusted.glusterfs.test=0x776f726b696e6700
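
For anyone following along, here is a minimal Python sketch of how the trusted.afr values above can be split into fields. It assumes each 12-byte value holds three 32-bit big-endian counters in the order data, metadata, entry; that layout is an assumption on my part and is not stated anywhere in this thread. The helper name split_afr is made up for illustration.

```python
import struct

def split_afr(hex_value):
    """Split a 12-byte trusted.afr value into three 32-bit counters.

    Assumption: fields are big-endian and ordered data, metadata, entry.
    """
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# Values copied from the getfattr output above.
xattrs = {
    "fs18 g01 client-0": "0x000000000600000800000000",
    "fs18 g01 client-1": "0x000000000000000000000000",
    "fs15 g01 client-20": "0x000000000200000000000000",
    "fs18 g02 client-2": "0x000000004500000400000000",
    "fs15 g02 client-22": "0x000000000200000000000000",
}

for name, value in xattrs.items():
    fields = split_afr(value)
    # Show only the fields that are non-zero, in hex as they appear on disk.
    nonzero = {key: hex(val) for key, val in fields.items() if val}
    print(name, nonzero or "all zero")

# trusted.glusterfs.test decodes as a NUL-terminated ASCII string.
print(bytes.fromhex("776f726b696e6700").rstrip(b"\x00").decode("ascii"))
```

Under that assumed layout, every non-zero value in the dumps sits in the middle (metadata) field, and the test xattr simply spells "working".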

Does that help at all?

James Burnash
Unix Engineer
Knight Capital Group


-----Original Message-----
From: Jeff Darcy [mailto:jdarcy at redhat.com]
Sent: Wednesday, June 22, 2011 4:41 PM
To: Burnash, James
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Files present on the backend but have become invisible from clients

On 06/22/2011 02:44 PM, Burnash, James wrote:
> g01/pfs-ro1-client-0=0x000000000000000000000000 jc1letgfs17
> g01/pfs-ro1-client-0=0x000000000600000800000000 jc1letgfs18
> g01/pfs-ro1-client-20=0x000000000000000000000000 jc1letgfs14
> g01/pfs-ro1-client-20=0x000000000200000000000000 jc1letgfs15
> g02/pfs-ro1-client-2=0x000000000000000000000000 jc1letgfs17
> g02/pfs-ro1-client-2=0x000000004500000400000000 jc1letgfs18
> g02/pfs-ro1-client-22=0x000000000000000000000000 jc1letgfs14
> g02/pfs-ro1-client-22=0x000000000200000000000000 jc1letgfs15
>
> Would anybody have any insights as to what is going on here? I'm
> seeing attributes in my sleep these days ... that cannot be good!

When I look at this, a few things occur to me. First, those are some pretty big metadata-change numbers. For g02 on fs18, 0x45000004 is actually 0x04000045 - about 67M - after byte swapping. The other thing that seems strange is that it always seems to be the second member of a replica pair "indicting" the first. Lastly, if you don't see any non-zero xattrs besides those above then this is not a normal split-brain situation. It might be a more exotic kind, based on *missing* xattrs. Here's the sequence:

* Lookup for '/' on client-0 returns zero opcounts for both client-0
  and client-1

* Lookup for '/' on client-1 returns a non-zero opcount for client-0 and
  *no xattr at all* (i.e. missing) for client-1

* Client-1 is declared "ignorant" in afr_sh_build_pending_matrix

* Client-0's value for client-1 is incremented, again in
  afr_sh_build_pending_matrix

* Client-0 and client-1 are both marked "wise" in afr_sh_mark_sources

* Client-0 and client-1 both receive wisdom=0 in afr_sh_compute_wisdom

* No wisdom=1 nodes are found in afr_sh_wise_nodes_conflict

* The "Unable to self-heal" messages come from afr_sh_metadata_fix
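
The sequence above can be walked through with a toy model. This is only a sketch: the function names echo the ones mentioned in the list, but the rules inside them (what makes a node "wise", how wisdom is assigned) are a simplified guess at the behaviour described here and are not taken from the GlusterFS source.

```python
# Toy model of the steps above. NOT the GlusterFS code: all rules here are
# assumptions made for illustration.

def build_pending_matrix(replies):
    """replies[i][j] is node i's count for node j, or None if the xattr is missing."""
    n = len(replies)
    matrix = [[replies[i][j] or 0 for j in range(n)] for i in range(n)]
    # Assumed rule: a node with no xattr about itself is "ignorant" ...
    ignorant = [i for i in range(n) if replies[i][i] is None]
    # ... and every other node's count against it is incremented.
    for i in ignorant:
        for j in range(n):
            if j != i:
                matrix[j][i] += 1
    return matrix, ignorant

def is_wise(matrix, i):
    """Assumed rule: a wise node accuses some other node but not itself."""
    others = [j for j in range(len(matrix)) if j != i]
    return matrix[i][i] == 0 and any(matrix[i][j] for j in others)

def compute_wisdom(matrix):
    """Assumed rule: wisdom is 1 only if no other wise node accuses this node."""
    n = len(matrix)
    wise = [is_wise(matrix, i) for i in range(n)]
    wisdom = []
    for i in range(n):
        accused = any(wise[j] and matrix[j][i] for j in range(n) if j != i)
        wisdom.append(1 if wise[i] and not accused else 0)
    return wise, wisdom

replies = [
    [0, 0],               # client-0: zero counts for both
    [0x45000004, None],   # client-1: non-zero for client-0, nothing for itself
]
matrix, ignorant = build_pending_matrix(replies)
wise, wisdom = compute_wisdom(matrix)
print("matrix:", matrix)
print("ignorant:", ignorant, "wise:", wise, "wisdom:", wisdom)
if not any(wisdom):
    print("no wisdom=1 node, so no source can be chosen")
```

In this model the increment is what turns a one-sided accusation into a mutual one: each node now accuses the other, both count as wise, and neither ends up with wisdom=1.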

So, what I'd do is check whether the following xattrs are missing:

        fs18/g01/pfs-ro1-client-1
        fs15/g01/pfs-ro1-client-21
        fs18/g02/pfs-ro1-client-3
        fs15/g02/pfs-ro1-client-23

There might be other "false split-brain" scenarios lurking in this code. I don't fully understand it, but it seems to go through a lot of paths that might not have been fully tested.
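
As a quick check of the byte-swap arithmetic earlier in this message, here is a short Python sketch. The helper name byteswap32 is hypothetical; the input values are the middle 32-bit fields of the non-zero xattrs quoted above.

```python
import struct

def byteswap32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]

# Middle 32-bit field of each non-zero xattr quoted above.
samples = [
    ("g02 on fs18", 0x45000004),
    ("g01 on fs18", 0x06000008),
    ("g01/g02 on fs15", 0x02000000),
]

for label, raw in samples:
    swapped = byteswap32(raw)
    print("%s: 0x%08x -> 0x%08x = %d" % (label, raw, swapped, swapped))
```

For g02 on fs18 this gives 0x04000045, which is 67108933 in decimal, matching the "about 67M" figure.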




