[Gluster-devel] Server Side AFR gets transport endpoint is not connected

Thu Aug 28 05:03:11 UTC 2008

On Thu, Aug 28, 2008 at 12:45 AM, James E Warner <jwarner6 at csc.com> wrote:
>
> Hi,
>
> I'm currently testing gluster to see if I can make it work for our HA
> filesystem needs.  And in initial testing things seem to be very good
> especially with client side AFR performing replication to our server nodes.
> However, we would like to keep our client network free of replication
> traffic so I set up server side afr with three storage bricks replicating
> data between themselves and round robin DNS for the node failover.  The
> round robin DNS is working and the failover between the nodes is kind of
> working, but if I pull the network cable on the currently active server
> (the host that the glusterfs client is connected to) the next filesystem
> operation (such as ls /mnt/glusterfs) fails with a "transport endpoint is
> not connected" error.  Similarly, if I have a large copy operation in
> progress the copy will exit with a failure. All of the operations after
> that work fine and netstat shows that the node has failed over to the next
> server in the list, but by that point the current file system operation
> has failed.  Anyway, this leads me to a few questions:
>
> 0.  Do my config files look OK or does it look like I've configured this
> thing incorrectly? :)
> 1.  Is this the expected behavior or is this a bug?  From reading the
> mailing list I had the impression that on failure the operation would be
> tried on the remaining IPs that were cached in the client's list, so I was
> surprised that the operation failed and I think that it is probably a bug,
> but I could see an argument for how this might be considered normal
> operation.

That is the expected behavior.

>
> 2.  If this is expected behavior is there any plan to change the behavior
> in the future or is server side AFR always expected to work this way?  I've
> seen references to round robin dns being an interim measure on the mailing
> list, so I'm not sure if there is another translator in the works or not.
> If there is something in the works is that available in the current
> glusterfs 1.4 snapshot releases or is that planned for a much later
> version?

Yes, we plan to bring in an HA translator, which will make this work.

>
> 3.  Can you think of any option that I might have missed that would correct
> the problem and allow the currently running file operation to succeed
> during a failover?
>
> 4.  Once again if this is as designed can you explain the reason that it
> works this way?  As I said I really expected it to transparently failover
> in much the same way that client side afr seems to, so I was surprised that
> it didn't.

If AFR is on the client side, it maintains a separate connection to each
of its subvolumes, so if one node fails it still has connections to the
other subvolumes. However, if AFR is on the server side and the server the
client is connected to goes down, AFR cannot do anything about it. Once the
HA xlator is in the picture, it will sit on the client and take care of a
seamless transition when the connection fails.
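
To make the difference concrete, here is a rough toy model in Python. It is
not GlusterFS code and uses no GlusterFS API: the Server,
ClientSideReplication and SingleConnectionClient classes are hypothetical
names invented for this illustration. It only sketches why an in-flight
operation survives when the client holds a connection to every subvolume,
and why it fails once when the client holds a single connection that has to
be re-established.

```python
class Server:
    """Toy stand-in for a storage server (hypothetical, not a GlusterFS API)."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def listdir(self):
        if not self.up:
            raise ConnectionError("transport endpoint is not connected")
        return ["file-a", "file-b"]


class ClientSideReplication:
    """Models AFR on the client: one connection per subvolume."""

    def __init__(self, servers):
        self.servers = list(servers)

    def listdir(self):
        # Try each subvolume in turn; the call succeeds if any one answers.
        for server in self.servers:
            try:
                return server.listdir()
            except ConnectionError:
                continue
        raise ConnectionError("all subvolumes are down")


class SingleConnectionClient:
    """Models AFR on the server: the client talks to one server at a time."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.active = 0

    def listdir(self):
        try:
            return self.servers[self.active].listdir()
        except ConnectionError:
            # Move to the next server for later calls, but the current
            # operation has already failed and the error reaches the caller.
            self.active = (self.active + 1) % len(self.servers)
            raise
```

With the first server marked down, ClientSideReplication.listdir() still
returns a listing, while SingleConnectionClient.listdir() raises once and
then succeeds on the next call, which matches the behaviour described in
the original report.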

>
> Since I hope that this is a bug, the configuration files and the relevant
> sections of the client log are below.  I have used this configuration on
> the gluster 1.3.11 version and the latest snapshot from August 27, 2008.
>
> Client Log Snippet:
> ================
>
> 2008-08-27 12:53:34 D [fuse-bridge.c:839:fuse_err_cbk] glusterfs-fuse: 62: