[Gluster-devel] High availability question...

Mon May 26 11:50:59 UTC 2008

Hello. My name is Víctor, and I would like to ask about some tests I have
been doing with glusterfs.
We are a bio-search company, and we are thinking of using glusterfs to
develop one of our projects.

I am running the tests with two servers and one client, using an AFR
(cluster/afr) setup defined in the client-side configuration.
I have read in the glusterfs documentation that this
sort of setup should provide high availability in case one
server goes down.

My configuration files are:

*Client configuration file (CENTRAL). CLIENT*

volume sargasv0
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.1.60
  option remote-port 6996
  option remote-subvolume v0
end-volume

volume shedirv4
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.1.61
  option remote-port 6996
  option remote-subvolume v4
end-volume

volume mirror0
  type cluster/afr
  subvolumes sargasv0 shedirv4
end-volume

*Server configuration file (SARGAS). SERVER1*

volume v0
  type storage/posix
  option directory /tmp/export0
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option listen-port 6996
  option auth.ip.v0.allow *
  subvolumes v0
end-volume

*Server configuration file (SHEDIR). SERVER2*

volume v4
  type storage/posix
  option directory /tmp/export4
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option listen-port 6996
  option auth.ip.v4.allow *
  subvolumes v4
end-volume

I ran some tests playing a movie file (*.avi) stored in
my glusterfs-mounted directory with Totem Movie Player. Further tests
with VLC media player gave identical results.
I am running Ubuntu 7.10.

Once the glusterfs setup was running with the two servers and
the client, I copied the avi file from my home directory to the
glusterfs mount on the client. The file was copied correctly and it
was replicated to both servers.
I started the test in debug mode. When I unplugged one of the servers, I
could keep watching the video after a pause of about 20 to 30 seconds
while the client switched over to the remaining active server.
So far this is high availability. But when I plugged the server I had
previously detached back in and unplugged the other one, I got the
following error: "could not read from resource", along with the
following lines in the debug log.

2008-05-23 12:45:31 D [client-protocol.c:4750:client_protocol_reconnect] sargasv0: attempting reconnect
2008-05-23 12:45:31 D [tcp-client.c:77:tcp_connect] sargasv0: socket fd = 6
2008-05-23 12:45:31 D [tcp-client.c:107:tcp_connect] sargasv0: finalized on port `1023'
2008-05-23 12:45:31 D [common-utils.c:179:gf_resolve_ip] resolver: DNS cache not present, freshly probing hostname: 192.168.1.60
2008-05-23 12:45:31 D [common-utils.c:204:gf_resolve_ip] resolver: returning IP:192.168.1.60[0] for hostname: 192.168.1.60
2008-05-23 12:45:31 D [common-utils.c:212:gf_resolve_ip] resolver: flushing DNS cache
2008-05-23 12:45:31 D [tcp-client.c:161:tcp_connect] sargasv0: connect on 6 in progress (non-blocking)
2008-05-23 12:45:31 D [tcp-client.c:198:tcp_connect] sargasv0: connection on 6 still in progress - try later
2008-05-23 12:45:35 W [client-protocol.c:205:call_bail] shedirv4: activating bail-out. pending frames = 1. last sent = 2008-05-23 12:44:52. last received = 2008-05-23 12:44:52 transport-timeout = 42
2008-05-23 12:45:35 C [client-protocol.c:212:call_bail] shedirv4: bailing transport
2008-05-23 12:45:35 D [tcp.c:137:cont_hand] tcp: forcing poll/read/write to break on blocked socket (if any)
2008-05-23 12:45:35 W [client-protocol.c:4777:client_protocol_cleanup] shedirv4: cleaning up state in transport object 0x808bd90
2008-05-23 12:45:35 E [client-protocol.c:4827:client_protocol_cleanup] shedirv4: forced unwinding frame type(1) op(13) reply=@0xb6a00468
2008-05-23 12:45:35 E [client-protocol.c:3193:client_readv_cbk] shedirv4: no proper reply from server, returning ENOTCONN
2008-05-23 12:45:35 D [afr.c:2248:afr_readv_cbk] mirror0: reading from child 2
2008-05-23 12:45:35 E [afr.c:2262:afr_readv_cbk] mirror0: (path=/dc4.avi child=shedirv4) op_ret=-1 op_errno=107
2008-05-23 12:45:35 E [fuse-bridge.c:1551:fuse_readv_cbk] glusterfs-fuse: 182438: READ => -1 (107)
2008-05-23 12:45:35 D [tcp.c:87:tcp_disconnect] shedirv4: connection disconnected
2008-05-23 12:45:35 D [afr.c:5939:notify] mirror0: GF_EVENT_CHILD_DOWN from shedirv4
2008-05-23 12:45:35 D [fuse-bridge.c:1577:fuse_readv] glusterfs-fuse: 182439: READ (0xb6c01420, size=4096, offset=172892160)
2008-05-23 12:45:35 E [fuse-bridge.c:1551:fuse_readv_cbk] glusterfs-fuse: 182439: READ => -1 (107)
2008-05-23 12:45:35 D [fuse-bridge.c:1577:fuse_readv] glusterfs-fuse: 182440: READ (0xb6c01420, size=4096, offset=172892160)
2008-05-23 12:45:35 E [fuse-bridge.c:1551:fuse_readv_cbk] glusterfs-fuse: 182440: READ => -1 (107)
...
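
For reference, the 107 in these log lines can be decoded with a few lines
of Python. This is only a small sketch: `describe_errno` is a hypothetical
helper name, and errno numbers are platform dependent (on Linux, 107 is
ENOTCONN, which matches the ENOTCONN message from client_readv_cbk).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ENOTCONN".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 107 is the value reported as op_errno and in "READ => -1 (107)".
    name, message = describe_errno(107)
    print(name, "-", message)
```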

In this case, I had to close the file and play it again. Then glusterfs
found the file on the active server and played it without problems.
But if I repeat the test, plugging in the server that was previously
unplugged and unplugging the one that was active, the same error
appears and the film stops again.

Therefore, the very first time a server goes down, it is possible to
keep the file open and continue watching the video, but the second and
subsequent attempts result in a read error, and it is necessary to
re-open the file...
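
The manual workaround described above (close the file and open it again)
could be sketched in Python roughly as follows. This is only an
illustration under assumptions, not a fix inside glusterfs:
`ReopeningReader`, `read_at`, `retries` and `delay` are hypothetical names,
and a media player such as Totem or VLC would not do this on its own.

```python
import errno
import os
import time


class ReopeningReader:
    """Read from a file, re-opening it when a read fails with ENOTCONN."""

    def __init__(self, path, retries=3, delay=1.0):
        self.path = path
        self.retries = retries  # extra attempts after the first failure
        self.delay = delay      # seconds to wait before re-opening
        self.fd = None          # opened lazily on the first read

    def _read(self, size, offset):
        # Open (or re-open) the file if no descriptor is currently held.
        if self.fd is None:
            self.fd = os.open(self.path, os.O_RDONLY)
        return os.pread(self.fd, size, offset)

    def close(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None

    def read_at(self, size, offset):
        for attempt in range(self.retries + 1):
            try:
                return self._read(size, offset)
            except OSError as exc:
                # Only ENOTCONN (errno 107 in the log) triggers a re-open.
                if exc.errno != errno.ENOTCONN or attempt == self.retries:
                    raise
                self.close()  # drop the stale descriptor
                time.sleep(self.delay)
```

A caller would keep track of its own offset and call `read_at(4096, offset)`
in a loop, so that a re-open continues from the same position in the file.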

Is there a way to avoid this read error, so that I can keep my file
open and continue watching the movie when the switch to the
active server happens a second time or later?

Thank you for your help.