[Gluster-devel] self heal makes client loop indefinitely

Vincent Régnard vregnard at tbs-internet.com
Wed Oct 17 10:07:50 UTC 2007


Hi all,

I have an HA configuration: 3 nodes with replication *:3 on the server 
side. There are 3 servers and 1 client, the client running on a host 
that is also a gluster server. AFR and unify are on the server side, 
exactly as described in the wiki tutorials (Kritical's). I launch my 3 
servers, and they communicate and establish all connections with each 
other correctly. But when I start the client and mount the volume, the 
server the client connects to reports inconsistencies: self heal 
reports a mismatch, and self healing seems to be started, or at least 
requested, for these files on all servers. The client then enters an 
infinite loop, the load average rises above 60, and this client/server 
host becomes unavailable.
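
For reference, the server-side layering follows roughly this shape (a 
simplified sketch only: the hostnames, paths and volume names below are 
placeholders rather than my real spec files, and the namespace AFR 
(afr-ns) is built the same way as the data AFR):

volume posix-local
  type storage/posix
  option directory /export/data
end-volume

volume brick-remote1
  type protocol/client
  option transport-type tcp/client
  option remote-host server2              # placeholder hostname
  option remote-subvolume posix-local
end-volume

volume brick-remote2
  type protocol/client
  option transport-type tcp/client
  option remote-host server3              # placeholder hostname
  option remote-subvolume posix-local
end-volume

volume afr-data
  type cluster/afr
  option replicate *:3                    # keep 3 copies of everything
  subvolumes posix-local brick-remote1 brick-remote2
end-volume

volume unify0
  type cluster/unify
  option namespace afr-ns                 # AFR over the namespace bricks
  option scheduler rr
  subvolumes afr-data
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes unify0
  option auth.ip.unify0.allow *
end-volume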

GlusterFS 1.3.5 (gcc 3.3, glibc 2.3.6; Linux kernel 2.6.16.latest)
FUSE 2.7.0-glfs4

Server log (where the client connects to):

2007-10-17 09:18:21 W [unify-self-heal.c:261:unify_sh_readdir_cbk] 
tbs-clust-unify-or2: found mismatch in mode/uid/gid for .qmail-srv376cc
2007-10-17 09:18:21 W [unify-self-heal.c:261:unify_sh_readdir_cbk] 
tbs-clust-unify-or2: found mismatch in mode/uid/gid for 
.qmail-srv376cc-accept-default

Client log:

2007-10-17 09:17:54 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:17:54 D [fuse-bridge.c:346:fuse_entry_cbk] glusterfs-fuse: 
ENTRY => 3424340
2007-10-17 09:17:54 D [inode.c:559:__create_inode] fuse/inode: create 
inode(3424340)
2007-10-17 09:17:54 D [inode.c:351:__active_inode] fuse/inode: 
activating inode(3424340), lru=0/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode] fuse/inode: 
passivating inode(3424340), lru=1/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode] fuse/inode: destroy 
inode(0) [@0x8092c78]
2007-10-17 09:17:54 D [inode.c:351:__active_inode] fuse/inode: 
activating inode(3424340), lru=0/1024
2007-10-17 09:17:54 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 3424340/http2 (/home/http2)
2007-10-17 09:18:19 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:18:21 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:18:44 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)

[...] loops for ages

2007-10-17 09:41:16 W [client-protocol.c:210:call_bail] 
tbs-clust-server-or2: activating bail-out. pending frames = 53. last 
sent = 2007-10-17 09:38:20. last received = 2007-10-17 09:17:54 
transport-timeout = 108
2007-10-17 09:41:16 C [client-protocol.c:218:call_bail] 
tbs-clust-server-or2: bailing transport
2007-10-17 09:41:16 W [client-protocol.c:4324:client_protocol_cleanup] 
tbs-clust-server-or2: cleaning up state in transport
object 0x8090e78
2007-10-17 09:41:16 D [tcp.c:131:cont_hand] tcp: forcing poll/read/write 
to break on blocked socket (if any)
2007-10-17 09:41:16 W [client-protocol.c:4376:client_protocol_cleanup] 
tbs-clust-server-or2: forced unwinding frame type(0)
op(34) reply=@0x80986b8
2007-10-17 09:41:16 D [fuse-bridge.c:378:fuse_entry_cbk] glusterfs-fuse: 
ERR => -1 (107)
2007-10-17 09:41:16 D [inode.c:308:__destroy_inode] fuse/inode: destroy 
inode(0) [@0x8093088]
[...]
2007-10-17 09:45:01 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:45:01 D [fuse-bridge.c:346:fuse_entry_cbk] glusterfs-fuse: 
ENTRY => 3424340
2007-10-17 09:45:01 D [inode.c:559:__create_inode] fuse/inode: create 
inode(3424340)
2007-10-17 09:45:01 D [inode.c:351:__active_inode] fuse/inode: 
activating inode(3424340), lru=0/1024
2007-10-17 09:45:01 D [inode.c:381:__passive_inode] fuse/inode: 
passivating inode(3424340), lru=1/1024
2007-10-17 09:45:01 D [inode.c:308:__destroy_inode] fuse/inode: destroy 
inode(0) [@0x80930f0]
2007-10-17 09:45:01 D [inode.c:351:__active_inode] fuse/inode: 
activating inode(3424340), lru=0/1024
2007-10-17 09:45:01 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 3424340/http2 (/home/http2)
2007-10-17 09:45:08 D [client-protocol.c:4301:client_protocol_reconnect] 
tbs-clust-server-or2: breaking reconnect chain
2007-10-17 09:45:26 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:45:27 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:45:51 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)
2007-10-17 09:46:16 D [fuse-bridge.c:423:fuse_lookup] glusterfs-fuse: 
LOOKUP 1/home (/home)

-> and loops again as before


Logs from the other two servers:

2007-10-17 09:17:54 D [inode.c:351:__active_inode] tbs-clust-dor3/inode: 
activating inode(9568336), lru=2/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode] 
tbs-clust-dor3/inode: passivating inode(9568336), lru=3/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode] 
tbs-clust-dor3/inode: destroy inode(0) [@0xb7c00df0]
2007-10-17 09:17:54 D [inode.c:351:__active_inode] 
tbs-clust-nsor3/inode: activating inode(9666831), lru=2/1024
2007-10-17 09:17:54 D [inode.c:381:__passive_inode] 
tbs-clust-nsor3/inode: passivating inode(9666831), lru=3/1024
2007-10-17 09:17:54 D [inode.c:308:__destroy_inode] 
tbs-clust-nsor3/inode: destroy inode(0) [@0xb7a01150]


NB: the directory where the mismatches are found contains a large 
number of files (a few tens of thousands). When I run ls in that 
directory (on the glusterfs mount point), it takes ages to return. 
There are also many symlinks in that directory (about 20% of the 
entries).
I saw a stat-prefetch translator that I don't have at present; could it 
help in this situation? Is it post-1.3.5 functionality? (1.3.6?)
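
If that translator does exist in a newer release, I suppose it would be 
loaded on the client side roughly like this (just a guess at the spec, 
since I don't have the translator here; the cache-seconds tunable is an 
assumption on my part):

volume remote
  type protocol/client
  option transport-type tcp/client
  option remote-host clust-server         # placeholder hostname
  option remote-subvolume unify0
end-volume

volume sp
  type performance/stat-prefetch
  option cache-seconds 2                  # assumed tunable name
  subvolumes remote
end-volume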

-- 
Vincent Régnard
vregnard-at-tbs-internet-dot-com
TBS-internet.com
027 630 5902




