[Gluster-devel] GlusterFS Volume Failure

Philippe Muller philippe.muller at gmail.com
Tue Jun 1 08:39:09 UTC 2010


Hi,

Last night, we had trouble with a GlusterFS mount. It is a replicate
volume, and the 10.1.1.2 host was already down. Files on the volume
weren't readable until I manually restarted the GlusterFS instance.
We'd like to understand what happened on this volume, especially the
"Server 10.1.1.1:6996 has not responded in the last 42 seconds,
disconnecting." message. I can't figure out why the GlusterFS instance
couldn't talk to itself.
Please help us.

This log is from 10.1.1.1 itself:

[2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:09:32] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:11:55] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:14:29] E [client-protocol.c:415:client_ping_timer_expired]
brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
seconds, disconnecting.
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout =
1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731899: STAT() /masterspool => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:39. frame-timeout
= 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731898: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame STATFS(13) frame sent = 2010-05-31 23:45:39. frame-timeout
= 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse:
7731897: ERR => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:37. frame-timeout
= 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731896: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame OPEN(10) frame sent = 2010-05-31 23:45:34. frame-timeout =
1800
[2010-06-01 00:15:44] W [fuse-bridge.c:858:fuse_fd_cbk] glusterfs-fuse:
7731894: OPEN() /cell/common/bootstrap => -1 (Transport endpoint is not
connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:35. frame-timeout
= 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731895: FSTAT() /masterspool/messages => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:34. frame-timeout
= 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731893: FSTAT() /cell/common/bootstrap => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout =
1800
[2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout =
1800
[2010-06-01 00:16:05] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:56. frame-timeout
= 1800
[2010-06-01 00:16:05] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
7731901: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:16:25] E [client-protocol.c:313:call_bail] brick-qmaster:
bailing out frame STATFS(13) frame sent = 2010-05-31 23:46:19. frame-timeout
= 1800
[2010-06-01 00:16:25] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse:
7731902: ERR => -1 (Transport endpoint is not connected)
[..]
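
To make the timing in these entries easier to read, here is a rough Python
sketch that measures how long each frame was outstanding before call_bail
gave up on it. It is not part of GlusterFS: the function name bailed_frames
and the regular expression are my own assumptions, written against the log
format shown above only, and it re-joins the lines that the mail client
wrapped.

```python
import re
from datetime import datetime

# Assumed pattern for call_bail entries, based only on the log excerpt above.
BAIL_RE = re.compile(
    r"\[(?P<logged>[\d-]+ [\d:]+)\] E \[client-protocol\.c:\d+:call_bail\] "
    r"(?P<volume>\S+): bailing out frame (?P<op>\w+)\(\d+\) "
    r"frame sent = (?P<sent>[\d-]+ [\d:]+)\. frame-timeout = (?P<timeout>\d+)"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def bailed_frames(log_text):
    """Yield (operation, seconds outstanding, frame-timeout) per bailed frame."""
    # Collapse all whitespace so entries wrapped across lines become one line.
    flat = " ".join(log_text.split())
    for match in BAIL_RE.finditer(flat):
        logged = datetime.strptime(match["logged"], TIME_FORMAT)
        sent = datetime.strptime(match["sent"], TIME_FORMAT)
        outstanding = int((logged - sent).total_seconds())
        yield match["op"], outstanding, int(match["timeout"])
```

On the first entry above (STAT sent at 23:45:43, bailed out at 00:15:44) this
gives 1801 seconds, so the frames seem to have been dropped just after the
1800 second frame-timeout expired rather than by the 42 second ping timer.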

Here is our configuration:

volume posix
    type storage/posix
    option directory /data/sge
end-volume

volume locks
    type features/locks
    subvolumes posix
end-volume

volume brick
    type performance/io-threads
    option thread-count 8
    subvolumes locks
end-volume

volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.brick.allow 10.*.*.*
    subvolumes brick
end-volume

volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume brick-shadow
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.2
    option remote-subvolume brick
end-volume

volume sge-replicate
    type cluster/replicate
    subvolumes brick-qmaster brick-shadow
end-volume
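
To show which hosts the replicate volume depends on, here is a small Python
sketch that reads a volfile like the one above and lists each protocol/client
volume with its remote-host. The helper names parse_volfile and
client_targets are hypothetical, and the parser only handles the simple
"volume / type / option / subvolumes / end-volume" layout used in this
configuration, not the full volfile grammar.

```python
def parse_volfile(text):
    """Return {volume name: {"type", "options", "subvolumes"}} for a volfile."""
    volumes = {}
    current = None
    for raw_line in text.splitlines():
        words = raw_line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        if words[0] == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "end-volume":
            current = None
        elif current is not None:
            if words[0] == "type":
                current["type"] = words[1]
            elif words[0] == "option":
                # "option <key> <value...>"
                current["options"][words[1]] = " ".join(words[2:])
            elif words[0] == "subvolumes":
                current["subvolumes"] = words[1:]
    return volumes


def client_targets(volumes):
    """Map each protocol/client volume to the remote-host it connects to."""
    return {
        name: volume["options"].get("remote-host")
        for name, volume in volumes.items()
        if volume["type"] == "protocol/client"
    }
```

Applied to this configuration, client_targets reports brick-qmaster pointing
at 10.1.1.1 and brick-shadow at 10.1.1.2, so with 10.1.1.2 down the only
remaining replica is the one reached through the local server on 10.1.1.1.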



Philippe Muller