[Gluster-devel] GlusterFS Volume Failure

Philippe Muller philippe.muller at gmail.com
Tue Jun 1 09:14:46 UTC 2010


Sorry, I forgot to include this information:
- The configuration I gave you contains both the server and the client part,
  as used on the 10.1.1.1 host (and on 10.1.1.2)
- The instance mounts the volume itself; we run it this way (see the restart
  sketch just after this list):
    /usr/local/sbin/glusterfs --log-file=/var/log/glusterfs/sge.log \
        --volfile=/usr/local/etc/glusterfs/sge.vol \
        --pid-file=/var/run/glusterfs-sge.pid /mnt/sge
- The clients use only the client part of this configuration
- Linux kernel: 2.6.32.7
- GlusterFS version: 3.0.3 (built from source)
- Volume role: sharing configuration and master spool data between the
  GridEngine masters and clients (the master writes data, the clients read it)
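
For completeness, here is roughly how we stop and restart the instance by
hand (a sketch of a hypothetical wrapper script; only the paths come from the
command above, the kill/umount/remount sequence is just standard practice):

#!/bin/sh
# Sketch of a manual (re)start of the client instance.
# Paths match the mount command above; the wrapper itself is illustrative.
GLUSTERFS=/usr/local/sbin/glusterfs
VOLFILE=/usr/local/etc/glusterfs/sge.vol
PIDFILE=/var/run/glusterfs-sge.pid
LOGFILE=/var/log/glusterfs/sge.log
MNT=/mnt/sge

# Stop: kill the running glusterfs process, then unmount the FUSE mount.
[ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")"
umount "$MNT" 2>/dev/null

# Start: the same invocation we use at boot.
"$GLUSTERFS" --log-file="$LOGFILE" --volfile="$VOLFILE" \
    --pid-file="$PIDFILE" "$MNT"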

Originally, we had two hosts replicating the volume data (10.1.1.1 and
10.1.1.2). Last week, we had to change the second host's IP address.
When we updated the clients' configurations to use the new address, we got a
lot of "I/O error" messages when reading files. Since this volume is critical
for us, we chose to give up the redundancy and quickly bring the service back
to normal (i.e. we shut 10.1.1.2 down).
Then we got the incident described in my last e-mail.
So far we haven't lost any data. However, given these events, I'm somewhat
afraid that we might.
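
When we eventually bring a second replica back, our plan is to force a
self-heal by walking the whole mount from a client; as far as I understand
it, AFR in 3.0.x heals a file when it is accessed. A minimal sketch:

# Run on a client once both bricks are up again. stat'ing every file
# should trigger AFR self-heal on access (my understanding of 3.0.x).
find /mnt/sge -print0 | xargs -0 stat >/dev/null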

Has anyone already used GlusterFS to store GridEngine configuration/master
spool data?

For reference, the client volume file (note that both bricks currently point
at 10.1.1.1, since 10.1.1.2 is shut down):

volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume brick-shadow
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume sge-replicate
    type cluster/replicate
    subvolumes brick-qmaster brick-shadow
end-volume
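
While we investigate the 42-second disconnects, we are also thinking about
raising the client-side timeouts. A sketch only: I am assuming ping-timeout
and frame-timeout are the protocol/client options behind the 42 and 1800
second values shown in the log.

volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
    # Assumed tunables (names inferred from the log's defaults of 42 and 1800):
    # ping-timeout  - seconds before the ping timer disconnects
    # frame-timeout - seconds before pending frames are bailed out
    option ping-timeout 60
    option frame-timeout 1800
end-volume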


Regards,

Philippe Muller



On Tue, Jun 1, 2010 at 10:51 AM, Craig Carl <craig at gluster.com> wrote:

> The engineering team will need some details:
>
> - Gluster version
> - OS details for the clients and servers
> - Hardware details for the clients and servers
> - Client volume file
> - Why was 10.1.1.2 already down, and how was it brought down?
>
> Also, this type of question will probably get a better response on the
> gluster-users list; could you subscribe there and repost your email with the
> details I've asked for? You can subscribe to Gluster-users here:
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>
> Thanks,
>
> Craig
>
> --
> Craig Carl
> Sales Engineer; Gluster, Inc.
>
> ------------------------------
> From: "Philippe Muller" <philippe.muller at gmail.com>
> To: gluster-devel at nongnu.org
> Sent: Tuesday, June 1, 2010 1:39:09 AM
> Subject: [Gluster-devel] GlusterFS Volume Failure
>
>
> Hi,
>
> Last night, we ran into trouble with a GlusterFS mount. It's a replicated
> volume, and the 10.1.1.2 host was already down. The volume's files weren't
> readable until I manually restarted the GlusterFS instance.
> We'd like to understand what happened on this volume, especially the
> "Server 10.1.1.1:6996 has not responded in the last 42 seconds,
> disconnecting." message. I can't figure out why the GlusterFS instance
> couldn't talk to itself.
> Please help us.
>
> This log is from 10.1.1.1 itself:
>
> [2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:09:32] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:11:55] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:14:29] E [client-protocol.c:415:client_ping_timer_expired]
> brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42
> seconds, disconnecting.
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout =
> 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731899: STAT() /masterspool => -1 (Transport endpoint is not connected)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:39. frame-timeout
> = 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731898: LOOKUP() / => -1 (Transport endpoint is not connected)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame STATFS(13) frame sent = 2010-05-31 23:45:39. frame-timeout
> = 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:2352:fuse_statfs_cbk]
> glusterfs-fuse: 7731897: ERR => -1 (Transport endpoint is not connected)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:37. frame-timeout
> = 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731896: LOOKUP() / => -1 (Transport endpoint is not connected)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame OPEN(10) frame sent = 2010-05-31 23:45:34. frame-timeout =
> 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:858:fuse_fd_cbk] glusterfs-fuse:
> 7731894: OPEN() /cell/common/bootstrap => -1 (Transport endpoint is not
> connected)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:35. frame-timeout
> = 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731895: FSTAT() /masterspool/messages => -1 (File descriptor in bad state)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:34. frame-timeout
> = 1800
> [2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731893: FSTAT() /cell/common/bootstrap => -1 (File descriptor in bad state)
> [2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout =
> 1800
> [2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout =
> 1800
> [2010-06-01 00:16:05] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:56. frame-timeout
> = 1800
> [2010-06-01 00:16:05] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse:
> 7731901: LOOKUP() / => -1 (Transport endpoint is not connected)
> [2010-06-01 00:16:25] E [client-protocol.c:313:call_bail] brick-qmaster:
> bailing out frame STATFS(13) frame sent = 2010-05-31 23:46:19. frame-timeout
> = 1800
> [2010-06-01 00:16:25] W [fuse-bridge.c:2352:fuse_statfs_cbk]
> glusterfs-fuse: 7731902: ERR => -1 (Transport endpoint is not connected)
> [..]
>
> Here is our configuration:
>
> volume posix
>     type storage/posix
>     option directory /data/sge
> end-volume
>
> volume locks
>     type features/locks
>     subvolumes posix
> end-volume
>
> volume brick
>     type performance/io-threads
>     option thread-count 8
>     subvolumes locks
> end-volume
>
> volume server
>     type protocol/server
>     option transport-type tcp
>     option auth.addr.brick.allow 10.*.*.*
>     subvolumes brick
> end-volume
>
> volume brick-qmaster
>     type protocol/client
>     option transport-type tcp
>     option remote-host 10.1.1.1
>     option remote-subvolume brick
> end-volume
>
> volume brick-shadow
>     type protocol/client
>     option transport-type tcp
>     option remote-host 10.1.1.2
>     option remote-subvolume brick
> end-volume
>
> volume sge-replicate
>     type cluster/replicate
>     subvolumes brick-qmaster brick-shadow
> end-volume
>
>
>
> Philippe Muller
>