[Gluster-devel] Re: afr :2 HA setup question

Matthias Albert gluster at linux4experts.de
Tue Sep 11 06:15:04 UTC 2007


Hi August,

I can confirm your problem with your setup. I've also got a 4-server glusterfsd
setup running 1.3.1, and some glusterfs clients with fuse glfs3.

One of these 4 servers had a hardware failure and was no longer
reachable -> the side effect was that none of my glusterfs clients
could write anything to the mounted glusterfs share. I've built a new
test machine and replaced the old one with it. Probably this week
I'll have more time for playing and testing with glusterfs (also with
some performance translators).

I will test the "option transport-timeout X" and see what happens if
I take one of the servers off the net.
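
As far as I understand, the option goes into the protocol/client volumes of
the client spec, e.g. something like this (the value of 10 seconds is just a
guess for testing, not a recommendation):

volume brick-ds-local
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.16.128
    option remote-subvolume brick-ds
    option transport-timeout 10 # seconds before calls to an unreachable server time out
end-volume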

Regards,

   Matthias

August R. Wohlt wrote:
> Hi all -
>
> After combing through the archives, I found the transport-timeout
> option mentioned by avati. Is this described in the wiki docs
> anywhere? I thought I had read through every page, but don't recall
> seeing it. The e-mail from avati mentioned that it was described in
> "doc/translator-options.txt" but this file does not appear in my
> glusterfs-1.3.1 tarball.
>
> In any case, for those who have similar issues, making transport
> timeout much smaller is your friend :-)
>
> Many Thanks!!
> :august
>
> On 9/10/07, August R. Wohlt <glusterfs at isidore.net> wrote:
>   
>> Hi devs et al,
>>
>> After many hours of sublimation, I was able to condense my previous hanging
>> issue down to this simplest case.
>>
>> To summarize: I have two physical machines, each afr'ing a directory to the
>> other. Both are running glusterfs(d) 1.3.1 with glfs3 fuse. iptables is suspended
>> during these tests. Spec files are below.
>>
>> The four situations:
>>
>> 1) If I start up both machines and start up glusterfsd on both machines, I
>> can mount either one from the other and view its files as expected.
>>
>> 2) If I start up only one machine and glusterfsd, I can mount that
>> glusterfsd brick from the same machine and use it (ie edit the files) while
>> it tries to connect to the 2nd machine in the background. When I bring up
>> the 2nd machine, it connects and afrs as expected. Compare this to #4).
>>
>> 3) If I start up both machines and glusterfsd on both, mount each other's
>> bricks, verify I can see the files and then kill glusterfsd on one of them,
>> I can still use and view files on the other one while it tries to reconnect
>> in the background to the glusterfsd that was killed. When it comes back up
>> everything continues as expected.
>>
>> 4) But, if I start up both machines with glusterfsd on both, mount either
>> brick and view the files, and then bring down the other machine (ie not kill
>> glusterfsd, but bring down the whole machine suddenly, or pull the ethernet
>> cable), I can no longer see any files on the remaining machine. It just
>> hangs until the machine that is down comes back up, and then it continues on
>> its merry way.
>>
>> This is presumably not the expected behavior, since it is not the behavior in
>> 2) and 3). It is only after both machines have started up and then one
>> of them goes away that I see this problem. Obviously, however, this is the
>> very situation that calls for an HA setup in the real world. When one server
>> goes offline suddenly, you want to be able to keep on using the one that is still up.
>>
>> Here is the simplest spec file configuration that exhibits this problem:
>>
>> Simple server configuration:
>>
>> volume brick-ds
>>     type storage/posix
>>     option directory /.brick-ds
>> end-volume
>>
>> volume brick-ds-afr
>>     type storage/posix
>>     option directory /.brick-ds-afr
>> end-volume
>>
>> volume server
>>     type protocol/server
>>     option transport-type tcp/server
>>     option bind-address 192.168.16.128 # 192.168.16.1 on the other server
>>     subvolumes brick-ds brick-ds-afr
>>     option auth.ip.brick-ds.allow 192.168.16.*
>>     option auth.ip.brick-ds-afr.allow 192.168.16.*
>> end-volume
>>
>>
>> Client configuration:
>>
>> volume brick-ds-local
>>     type protocol/client
>>     option transport-type tcp/client
>>     option remote-host 192.168.16.128 # 192.168.16.1 on the other machine
>>     option remote-subvolume brick-ds
>> end-volume
>>
>> volume brick-ds-remote
>>     type protocol/client
>>     option transport-type tcp/client
>>     option remote-host 192.168.16.1 # 192.168.16.128 on the other machine
>>     option remote-subvolume brick-ds-afr
>> end-volume
>>
>> volume brick-ds-afr
>>     type cluster/afr
>>     subvolumes brick-ds-local brick-ds-remote
>>     option replicate *:2
>> end-volume
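>>
>> For completeness, each machine mounts its client spec with the glusterfs
>> client binary, roughly like this (the spec file path and mount point are
>> just examples):
>>
>>     glusterfs -f /etc/glusterfs/client.vol /mnt/glusterfs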
>>
>> These are both stock CentOS/RHEL 5 machines. You can demonstrate the
>> behavior by rebooting one machine, pulling out the ethernet cable, or
>> sending the route out into space (ie route add -host 192.168.16.1
>> some_disconnected_device). Everything is frozen until the connection
>> returns, and once it does, things keep working again.
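>>
>> For example, to blackhole the peer from the surviving machine and then
>> restore it (dummy0 is just a stand-in for any disconnected device):
>>
>>     route add -host 192.168.16.1 dev dummy0   # mount hangs from here on
>>     route del -host 192.168.16.1 dev dummy0   # mount recovers once the route is back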
>>
>> Because of this problem, any kind of HA / unify setup will not work for me
>> when one of the nodes fails.
>>
>> Can someone else verify this behavior? If there is some part of the logs /
>> strace / gdb output you'd like to see, just let me know. I'd really like to
>> use glusterfs in an HA setup, but don't see how with this behavior.
>>
>> Thanks in advance!!
>> :august
>>
>>
>> On 9/7/07, August R. Wohlt <glusterfs at isidore.net> wrote:
>>     
>>> Hi all -
>>>
>>> I have a setup based on this:
>>>
>>> http://www.gluster.org/docs/index.php/GlusterFS_High_Availability_Storage_with_GlusterFS
>>>
>>> but with only 2 machines. Effectively just a mirror (glusterfsd
>>> configuration below). 1.3.1 client and server.
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>   
