[Gluster-users] [SPAM?] Re: strange error hangs any access to gluster mount

Jeff Darcy jdarcy at redhat.com
Tue Apr 5 20:38:29 UTC 2011


On 04/05/2011 03:08 PM, Burnash, James wrote:
> Hi Jeff.
> 
> Thanks again for the help - it's much appreciated.
> 
> I tried suggestion (1), and as you suspected - no joy.
> 
> For suggestions (2) and (3) - I haven't done (2) yet, because I don't
> have any available empty bricks to migrate the data from g04. I have
> plenty of space on existing bricks, but I don't think (from the docs)
> that is the intention of this activity, and I'd hate to lose data, as
> I'm already somewhat in the dog house.

You mean you haven't done (3) yet, right?  (2) shouldn't require any
extra space.

> Can you think of any workaround for this problem? And what would be
> the effect of just implementing step (2)?

The effect should be to create a hash space that has no gaps or
overlaps, which will keep DHT in its normal state instead of trying (and
apparently failing) to self-heal. In the normal state, if DHT fails to
find a file where the hash says it should be, it will find it elsewhere
and create a linkfile on a file-by-file basis. This might be slightly
inefficient until a rebalance is done, but with the top-level xattrs
manually repaired the rebalance might actually succeed. Alternatively,
the setfattr might fail, in which case we have something much simpler to
diagnose on those local filesystems.
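
For what it's worth, the value in suggestion (2) below decodes cleanly.
Here's a small sketch (Python, purely illustrative) of how I read it; the
split into four big-endian 32-bit fields, with the last two being the
start and stop of the range, is my understanding of the on-disk layout
format rather than something I've re-checked against your build:

  import struct

  # the value from suggestion (2), minus the leading "0x"
  value = bytes.fromhex("0000000100000000d999998ce6666657")
  cnt, typ, start, stop = struct.unpack(">4I", value)
  print(hex(start), hex(stop))   # 0xd999998c 0xe6666657 -- exactly the hole

  # consistency check: the numbers fit a 20-subvolume layout (an inference,
  # not something visible in the logs), with g04 holding the slice at
  # index 17 counting from zero
  chunk = 0xffffffff // 20       # size of each subvolume's slice
  assert start == 17 * chunk and stop == 18 * chunk - 1

So the proposed range is exactly the gap described in my earlier message,
which is why I'd expect the result to have no gaps or overlaps.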

I can't say there's no risk of data loss with any of these approaches,
I'm afraid, since we're in such a weird state and don't know how we got
there. What I can say is that, from my knowledge of the code, setting
that xattr manually shouldn't carry any greater risk of data loss than
removing them did. The worst outcome that seems likely is that it
doesn't help and we end up exactly where we are already.
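
If it would help to see where things stand, here's a rough sketch (again
Python, just illustrative) for checking that the per-brick ranges tile the
whole 32-bit hash space, once you've collected the trusted.glusterfs.dht
values from the top of each brick (e.g. with "getfattr -e hex -n
trusted.glusterfs.dht <brick-dir>" on each server). The field-layout
assumption is the same as in the sketch above, and the dictionary entries
are placeholders apart from the g04 value:

  import struct

  def rng(hexval):
      # assumption: four big-endian 32-bit fields, last two are start/stop
      _, _, start, stop = struct.unpack(">4I", bytes.fromhex(hexval[2:]))
      return start, stop

  layouts = {
      # "g01": "0x0000000100000000................",  # one entry per subvolume
      "g04": "0x0000000100000000d999998ce6666657",
  }

  spans = sorted(rng(v) for v in layouts.values())
  prev_stop = -1
  for start, stop in spans:
      if start != prev_stop + 1:
          print("gap or overlap before %08x" % start)
      prev_stop = stop
  if prev_stop != 0xffffffff:
      print("hash space not covered up to ffffffff")

Running that before and after the setfattr should make it obvious whether
the repair closed the hole or left things exactly as they were.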

> -----Original Message-----
> From: Jeff Darcy
> Now it looks like g04 on gfs17/gfs18 has no DHT xattrs at all,
> leaving a hole from d999998c to e6666657. From the log, the
> "background meta-data self-heal" messages are probably related to
> that, though the failure messages about non-blocking inodelks (line
> 713) and possible split brain (e.g. line 777) still seem a bit odd.
> There are also some messages about timeouts (e.g. line 851) that are
> probably unrelated but might be worth investigating. I can suggest a
> few possible courses of action:
> 
> (1) Do a "getfattr -n trusted.distribute.fix.layout" on the root
> (from the client side) to force the layouts to be recalculated. This
> is the same hook that's used by the first part of the rebalance code,
> but only does this one part on one directory. OTOH, it's also the
> same thing the self-heal should have done, so I kind of expect it
> will fail (harmlessly) in the same way.
> 
> (2) Manually set the xattr on gfs{17,18}:/.../g04 to the "correct" 
> value, like so:
> 
> setfattr -n trusted.glusterfs.dht -v 0x0000000100000000d999998ce6666657 g04
> 
> (3) Migrate the data off that volume to others, remove/nuke/rebuild
> it, then add it back in a pristine state and rebalance.
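
P.S. For anyone following along: the reason the hole matters is that DHT
hashes each file name into that same 32-bit space and sends the lookup to
whichever subvolume owns the matching range. A toy sketch of that idea,
with crc32 standing in for Gluster's actual name hash (it is not the same
function) and made-up ranges apart from the hole itself:

  import zlib

  # illustrative layout with the d999998c-e6666657 hole left by g04
  layout = [
      (0x00000000, 0xd999998b, "some other brick"),
      # nothing owns d999998c - e6666657 while g04 has no DHT xattr
      (0xe6666658, 0xffffffff, "some other brick"),
  ]

  def hashed_subvol(name):
      h = zlib.crc32(name.encode()) & 0xffffffff   # stand-in hash, not DHT's
      for start, stop, subvol in layout:
          if start <= h <= stop:
              return subvol
      return None   # name hashes into the hole: no owning subvolume

  print(hashed_subvol("README"))

Names that happen to hash into the hole have no owning subvolume, which is
presumably what keeps kicking off the self-heal attempts mentioned above.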



