[Gluster-users] strange error hangs any access to gluster mount

Burnash, James jburnash at knight.com
Mon Apr 18 13:23:11 UTC 2011

Good Morning guys.

After taking the actions suggested by Jeff and confirmed by Amar, it looks like all of my maps are clean (though I am still having problems, which may or may not be related).

Here are the maps after all the corrections were done:

loop_check 'cd /export/read-only; for brick_root in $(ls -d g*); do echo -n $HOSTNAME; getfattr -d -e hex -n trusted.glusterfs.dht $brick_root |  xargs echo | sed -e "s/trusted.glusterfs.dht=0x0000000100000000/ /" -e "s/# file:/ /" | awk "{ print \$3,\$2,\$1 }"; done' jc1letgfs{14,15,17,18} | tee ro-brick-attrs.out.041811
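
For anyone reading the raw getfattr output instead: the sed expression above strips a fixed 0x0000000100000000 prefix, and what remains is sixteen hex digits, the first eight being the start of the range and the last eight the end. A minimal Python sketch of that split follows. The function name decode_dht_xattr is made up for illustration, and the leading sixteen digits are treated as an opaque header because their meaning is not discussed in this thread.

```python
def decode_dht_xattr(value):
    """Split a hex trusted.glusterfs.dht value into (header, start, end).

    Assumes the 32-hex-digit form seen in this thread: a 16-digit header
    (kept as an opaque string here) followed by two 8-digit range bounds.
    """
    digits = value[2:] if value.lower().startswith("0x") else value
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits, got %d" % len(digits))
    header = digits[:16]
    start = int(digits[16:24], 16)   # first hash value the brick claims
    end = int(digits[24:32], 16)     # last hash value the brick claims
    return header, start, end


header, start, end = decode_dht_xattr("0x0000000100000000000000000ccccccb")
print(header, "%08x" % start, "%08x" % end)
```

The sample value is the g07 entry on jc1letgfs17 with the stripped prefix put back.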

sort -k 2,3 ro-brick-attrs.out.041811
jc1letgfs17 000000000ccccccb g07
jc1letgfs18 000000000ccccccb g07
jc1letgfs17 0ccccccc19999997 g08
jc1letgfs18 0ccccccc19999997 g08
jc1letgfs17 1999999826666663 g09
jc1letgfs18 1999999826666663 g09
jc1letgfs17 266666643333332f g10
jc1letgfs18 266666643333332f g10
jc1letgfs17 333333303ffffffb g01
jc1letgfs18 333333303ffffffb g01
jc1letgfs17 3ffffffc4cccccc7 g02
jc1letgfs18 3ffffffc4cccccc7 g02
jc1letgfs14 4cccccc859999993 g01
jc1letgfs15 4cccccc859999993 g01
jc1letgfs14 599999946666665f g02
jc1letgfs15 599999946666665f g02
jc1letgfs14 666666607333332b g03
jc1letgfs15 666666607333332b g03
jc1letgfs14 7333332c7ffffff7 g04
jc1letgfs15 7333332c7ffffff7 g04
jc1letgfs14 7ffffff88cccccc3 g05
jc1letgfs15 7ffffff88cccccc3 g05
jc1letgfs14 8cccccc49999998f g06
jc1letgfs15 8cccccc49999998f g06
jc1letgfs14 99999990a666665b g07
jc1letgfs15 99999990a666665b g07
jc1letgfs14 a666665cb3333327 g08
jc1letgfs15 a666665cb3333327 g08
jc1letgfs14 b3333328bffffff3 g09
jc1letgfs15 b3333328bffffff3 g09
jc1letgfs14 bffffff4ccccccbf g10
jc1letgfs15 bffffff4ccccccbf g10
jc1letgfs17 ccccccc0d999998b g03
jc1letgfs18 ccccccc0d999998b g03
jc1letgfs17 d999998ce6666657 g04
jc1letgfs18 d999998ce6666657 g04
jc1letgfs17 e6666658f3333323 g05
jc1letgfs18 e6666658f3333323 g05
jc1letgfs17 f3333324ffffffff g06
jc1letgfs18 f3333324ffffffff g06

How do these look to you? I don’t see a gap or hole, and since each mirrored pair claims the same range (and only that range) on both bricks, it all looks OK to me. Can one or both of you confirm that my layout now looks clean?
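
The eyeball check can also be done mechanically. What follows is a rough Python sketch, not a Gluster tool: it reads rows in the "host range brick" form shown above, walks the ranges in order, and reports any gap or overlap in the 32-bit hash space. The names parse_layout and check_layout are made up for illustration, and the replicas=2 default is an assumption based on the mirrored pairs in this setup.

```python
def parse_layout(lines):
    """Map (start, end) -> set of (host, brick) from 'host range brick' rows."""
    layout = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or len(fields[1]) != 16:
            continue  # skip anything that is not a data row
        host, hexrange, brick = fields
        start = int(hexrange[:8], 16)
        end = int(hexrange[8:], 16)
        layout.setdefault((start, end), set()).add((host, brick))
    return layout


def check_layout(layout, replicas=2):
    """Return a list of problem descriptions; an empty list means clean."""
    problems = []
    next_free = 0  # lowest hash value not yet covered by any range
    for start, end in sorted(layout):
        owners = layout[(start, end)]
        if len(owners) != replicas:
            problems.append("%08x-%08x claimed by %d bricks, expected %d"
                            % (start, end, len(owners), replicas))
        if start > next_free:
            problems.append("gap %08x-%08x" % (next_free, start - 1))
        elif start < next_free:
            problems.append("overlap %08x-%08x"
                            % (start, min(end, next_free - 1)))
        next_free = max(next_free, end + 1)
    if next_free != 1 << 32:
        problems.append("gap %08x-ffffffff" % next_free)
    return problems
```

Fed the contents of ro-brick-attrs.out.041811, for example with check_layout(parse_layout(open("ro-brick-attrs.out.041811"))), a clean layout should come back as an empty list.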


Thanks Jeff. That at least gives me a shot at figuring out some similar problems.

It's possible that in the course of bringing up the mirrors initially I futzed something up. I'll have to check the read-write servers as well.

OK, that definitely shows a problem.  Here's the whole map of which
nodes are claiming which ranges:

00000000 0ccccccb: g07 on gfs17/gfs18
0ccccccc 19999997: g08 on gfs17/gfs18
19999998 26666663: g09 on gfs17/gfs18
26666664 3333332f: g10 on gfs17/gfs18
33333330 3ffffffb: g01 on gfs17/gfs18
3ffffffc 4cccccc7: g02 on gfs17/gfs18
4cccccc8 59999993: g01 on gfs14/gfs14
59999994 6666665f: g02 on gfs14/gfs14
66666660 7333332b: g03 on gfs14/gfs14
7333332c 7ffffff7: g04 on gfs14/gfs14
7ffffff8 8cccccc3: g05 on gfs14/gfs14
8cccccc4 9999998f: g06 on gfs14/gfs14
99999990 a666665b: g07 on gfs14/gfs14
a666665c b3333327: g08 on gfs14/gfs14
b3333328 b333332e: g09 on gfs14/gfs14
b333332f bffffff3: g09 on gfs14/gfs14
                  *** AND g04 on gfs17/18
bffffff4 ccccccbf: g10 on gfs14/gfs14
                  *** AND g04 on gfs17/18
ccccccc0 ccccccc7: g03 on gfs17/gfs18
                  *** AND g04 on gfs17/18
ccccccc8 d999998b: g03 on gfs17/gfs18
d999998c e6666657: *** GAP ***
e6666658 f3333323: g05 on gfs17/gfs18
f3333324 ffffffff: g06 on gfs17/gfs18

I know this all seems like numerology, but bear with me.  Note that all
of the problems seem to involve g04 on gfs17/gfs18 claiming the wrong
range, and that the range it's claiming is almost exactly twice the size
of all the other ranges.  In fact, it's the range it would have been
assigned if there had been ten nodes instead of twenty.  For example, if
that filesystem had been restored to an earlier state on gfs17/gfs18,
and then self-healed in the wrong direction (self-mangled?) you would
get exactly this set of symptoms.  I'm not saying that's what happened;
it's just a way to illustrate what these values mean and the
consequences of their being out of sync with each other.
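
The arithmetic behind that observation can be checked in a few lines of Python. This is only a sketch: it assumes the 32-bit hash space is divided by plain integer division with the last slot absorbing the remainder, which reproduces the values in the map above but is not confirmed here as the actual Gluster algorithm. The helper name even_split is made up for illustration.

```python
SPACE = 1 << 32  # size of the 32-bit hash space


def even_split(n):
    """Split the hash space into n ranges; the last one takes the remainder."""
    chunk = SPACE // n
    return [(i * chunk, SPACE - 1 if i == n - 1 else (i + 1) * chunk - 1)
            for i in range(n)]


twenty = even_split(20)
ten = even_split(10)

# Union of the three ranges marked "AND g04 on gfs17/18" in the map above.
g04_claimed = (0xb333332f, 0xccccccc7)
# The range shown as a gap, which g04 should have claimed.
g04_missing = (0xd999998c, 0xe6666657)

print("%08x" % (SPACE // 20))        # 0ccccccc, size of a 20-way range
print("%08x" % (SPACE // 10))        # 19999999, size of a 10-way range
print(ten.index(g04_claimed))        # 7: a valid slot in a 10-way split
print(twenty.index(g04_missing))     # 17: the slot left empty in the 20-way split
```

The range g04 claims is exactly one slot of a ten-way split, and 0x19999999 is one more than twice 0x0ccccccc, which is why it looks "almost exactly" double the size of its neighbours.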

So, why only one client?  Since you're reporting values on the servers,
I'd guess it's because only that client has remounted.  The others are
probably still operating from cached (and apparently correct) layout
information.  This is a very precarious state, I'd have to say.  You
*might* be able to fix this by fixing the xattr values on that one
filesystem, but I really can't recommend trying that without some input
from Gluster themselves.

