[Gluster-users] Gluster replication after node failure

Tue Nov 17 11:35:57 UTC 2009

Hi guys (and girls)!

I've been testing Gluster a bit over the last few days and ran into a
problem. I'm planning on using two storage nodes running replication on
the server side and a number of clients (in my proof of concept just
one) that use the Gluster client to create a mount. On that client I'll
run backup software.

When everything is up and running it's fine and all works as I want,
both with client-side replication and server-side replication. But when
I kill a server during creation of a large file and then bring the
server up again, the problems start.

As I found in the docs, the resync of a storage server is only triggered
after an ls or any other action that does a stat. So far so good.
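
To make clear what I mean by "an action that does a stat", here is a
minimal sketch in Python of a pass that stats everything under a mount.
The function name stat_tree and the mount point used in the note below
are placeholders of mine, not something from the docs, and I am only
assuming that an lstat on each path is enough to trigger the resync.

```python
import os


def stat_tree(mount_point):
    """lstat the mount point and every directory and file below it.

    Returns (ok_count, failed_paths). Assumption: a plain lstat per
    path is what triggers the resync; this is not a Gluster API.
    """
    ok = 0
    failed = []

    def lstat(path):
        nonlocal ok
        try:
            os.lstat(path)  # lstat, so broken symlinks do not raise
            ok += 1
        except OSError:
            failed.append(path)

    lstat(mount_point)
    # os.walk silently skips directories it cannot list
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            lstat(os.path.join(dirpath, name))
    return ok, failed
```

On the client this would be called as stat_tree("/mnt/glusterfs"), where
/mnt/glusterfs stands in for wherever the volume is mounted. Note that
if a stat blocks the way I describe further down, this pass blocks too.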

In my client setup (running replication on the server) I have:

#volume ha
#    type testing/cluster/ha
#    subvolumes server1 server2
#end-volume

where server1 and server2 are the volumes exported by the servers (their
replicated volumes). I was expecting that if I do an ls on that client
and it hits a server which doesn't have the "right" data, it would go on
to the next server until it finds a current one. This is not happening;
the ls just sits there waiting.
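
To spell out the behaviour I was expecting, here is a small model in
Python. It is purely illustrative: the Replica class, its version
counter and pick_current are made up by me and say nothing about how
the ha translator really decides which server is current.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    reachable: bool
    version: int  # hypothetical per-file change counter


def pick_current(replicas):
    """Return the first reachable replica with the newest version.

    Unreachable or stale replicas are skipped; returns None when no
    replica is reachable at all.
    """
    reachable = [r for r in replicas if r.reachable]
    if not reachable:
        return None
    newest = max(r.version for r in reachable)
    for replica in reachable:
        if replica.version == newest:
            return replica
    return None
```

With server1 holding a stale copy (say version 3) and server2 the
current one (version 5), I would expect the client to skip server1 and
answer from server2 instead of waiting.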

Even THAT would be OK for me (though it becomes a bit of a timeout
problem when a backup server is waiting for its data this long), but the
ls never becomes fast again. It's like something is hanging until I
remove that file.

Is this by design or am I missing something?

My scenario would be:

Backups are written to disk and replicated, all at night. But during one
night we have a network failure and site2 goes offline. When it comes
back online, I can live with it not being resynced automatically from a
journal (this is not possible as far as I understand), but when I then
try to restore certain files from a client in a third site, it hangs at
every stat, no matter which server I use, even with the ha module. This
will break the backup system, I'm afraid.

Also, because site2 won't get in sync by itself, what would happen if we
lose the network again and then try to restore from site2?

I'll test a bit further and see how it works, but it would be good to
get some more info on this hanging. :) Gluster sure looks very
interesting: it's easy to set up and achieves much of what more complex
solutions would bring, without the rocket-science engineering. :)

Thanks!

-- 
Robin Vleij
robin at swip.net