[Gluster-devel] GlusterFS Volume Failure

Craig Carl craig at gluster.com
Tue Jun 1 08:51:54 UTC 2010


The engineering team will need some details:

Gluster version?
OS details for the clients and servers.
Hardware details for the clients and servers.
Client volume file.
Why was 10.1.1.2 already down, and how was it brought down?

Also, this type of question will probably get a better response on the gluster-users list. Could you subscribe there and repost your email with the details I've asked for? You can subscribe to gluster-users here: http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



Thanks, 

Craig 

-- 
Craig Carl
Sales Engineer; Gluster, Inc.

From: "Philippe Muller" <philippe.muller at gmail.com> 
To: gluster-devel at nongnu.org 
Sent: Tuesday, June 1, 2010 1:39:09 AM 
Subject: [Gluster-devel] GlusterFS Volume Failure 

Hi, 


Last night, we had some trouble with a GlusterFS mount. It's a replicate volume, and the 10.1.1.2 host was already down. The files on the volume weren't readable until I manually restarted the GlusterFS instance.
We'd like to understand what happened on this volume, especially the "Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting." message. I can't figure out why the GlusterFS instance couldn't talk to itself.
Please help us.


This log is from 10.1.1.1 itself:



[2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:09:32] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:11:55] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:14:29] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting. 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731899: STAT() /masterspool => -1 (Transport endpoint is not connected) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731898: LOOKUP() / => -1 (Transport endpoint is not connected) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731897: ERR => -1 (Transport endpoint is not connected) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:37. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731896: LOOKUP() / => -1 (Transport endpoint is not connected) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame OPEN(10) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:858:fuse_fd_cbk] glusterfs-fuse: 7731894: OPEN() /cell/common/bootstrap => -1 (Transport endpoint is not connected) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731895: FSTAT() /masterspool/messages => -1 (File descriptor in bad state) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800 
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731893: FSTAT() /cell/common/bootstrap => -1 (File descriptor in bad state) 
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800 
[2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout = 1800 
[2010-06-01 00:16:05] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:56. frame-timeout = 1800 
[2010-06-01 00:16:05] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731901: LOOKUP() / => -1 (Transport endpoint is not connected) 
[2010-06-01 00:16:25] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:46:19. frame-timeout = 1800 
[2010-06-01 00:16:25] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731902: ERR => -1 (Transport endpoint is not connected) 
[..] 
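
As a reading aid for the call_bail lines above, here is a minimal sketch (assuming Python 3 and the log format exactly as shown) that computes how long each frame had been waiting when it was bailed out. The function name bail_delays and the regular expression are illustrative assumptions, not part of GlusterFS. On the excerpt, the frames were sent around 23:45 and bailed out just over 1800 seconds later, which suggests (but does not prove) that the brick on 10.1.1.1 stopped answering at about 23:45, before the first ping timeout shown here.

```python
import re
from datetime import datetime

# Pattern for the call_bail lines in the excerpt above (an assumption based
# only on the lines shown; other GlusterFS versions may log differently).
BAIL_RE = re.compile(
    r"\[(?P<logged>[\d-]+ [\d:]+)\] E \[client-protocol\.c:\d+:call_bail\] "
    r"(?P<volume>\S+): bailing out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>[\d-]+ [\d:]+)\. frame-timeout = (?P<timeout>\d+)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def bail_delays(lines):
    """Return (operation, seconds waited, frame-timeout) for each call_bail line."""
    results = []
    for line in lines:
        m = BAIL_RE.search(line)
        if m is None:
            continue  # not a call_bail line (e.g. a fuse-bridge warning)
        logged = datetime.strptime(m["logged"], TS_FORMAT)
        sent = datetime.strptime(m["sent"], TS_FORMAT)
        waited = int((logged - sent).total_seconds())
        results.append((m["op"], waited, int(m["timeout"])))
    return results


# Three lines copied from the log excerpt above.
sample = [
    "[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: "
    "bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout = 1800",
    "[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: "
    "7731899: STAT() /masterspool => -1 (Transport endpoint is not connected)",
    "[2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster: "
    "bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout = 1800",
]

for op, waited, timeout in bail_delays(sample):
    print(f"{op}: waited {waited}s (frame-timeout {timeout}s)")
```

Running the same function over the full client log would show the earliest "frame sent" time, which is a reasonable place to start looking in the server-side log.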


Here is our configuration:



volume posix
  type storage/posix
  option directory /data/sge
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow 10.*.*.*
  subvolumes brick
end-volume

volume brick-qmaster
  type protocol/client
  option transport-type tcp
  option remote-host 10.1.1.1
  option remote-subvolume brick
end-volume

volume brick-shadow
  type protocol/client
  option transport-type tcp
  option remote-host 10.1.1.2
  option remote-subvolume brick
end-volume

volume sge-replicate
  type cluster/replicate
  subvolumes brick-qmaster brick-shadow
end-volume





Philippe Muller 

_______________________________________________ 
Gluster-devel mailing list 
Gluster-devel at nongnu.org 
http://lists.nongnu.org/mailman/listinfo/gluster-devel 