[Gluster-users] Recovering out of sync nodes from input/output error

Wed Apr 11 11:00:47 UTC 2012

Hello,

We use gluster in our production environment and recently encountered an
unrecoverable error, which we could resolve only by deleting the existing
volume and the locally stored files and recreating everything from scratch.

I am now working with a test environment that closely mirrors production,
and I can reproduce the problem every time.
We run websites on two servers that use gluster for shared files.
We also use DNS round robin to balance requests (this is a key element
of the scenario).

Setup: Two servers running Gentoo 2.0.3 kernel 3.0.6, glusterfs 3.2.5
Gluster commands:
gluster volume create vol-replication replica 2 transport tcp \
    10.0.2.14:/local 10.0.2.15:/local
gluster volume start vol-replication
gluster volume set vol-replication network.ping-timeout 1
node1 (10.0.2.14): mount -t glusterfs 10.0.2.14:/vol-replication /a
node2 (10.0.2.15): mount -t glusterfs 10.0.2.15:/vol-replication /a

Now assume that connectivity between the two nodes has failed, but they can
still be accessed from the outside world and files can be written on them
through Apache.
Request 1 -> 10.0.2.14 -> creates file howareyou
Request 2 -> 10.0.2.15 -> creates file hello
At some point, connectivity between the two nodes recovers and disaster
strikes:
ls /a
ls: cannot access /a: Input/output error

Simulation follows:
step 1
node1:
iptables -I INPUT 1 -s 10.0.2.15 -j DROP   # simulate connectivity loss
touch /a/howareyou

node2:
touch /a/hello

step 2
node1:
iptables -D INPUT 1   # restore connectivity
ls /a
ls: cannot access /a: Input/output error

node2:
ls /a
ls: cannot access /a: Input/output error
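
For convenience, the node1 half of the simulation above can be sketched as
a small script. This is only a sketch under assumptions: the run helper,
the step1_node1/step2_node1 function names and the DRY_RUN, PEER and MOUNT
variables are hypothetical names introduced here for illustration. By
default it only prints the commands; set DRY_RUN=0 (as root, on node1) to
actually execute them.

```shell
#!/bin/sh
# Sketch of the node1 side of the reproduction steps.
# PEER and MOUNT come from the setup above; DRY_RUN is a hypothetical switch.
PEER="${PEER:-10.0.2.15}"
MOUNT="${MOUNT:-/a}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# step 1: drop traffic from the peer, then write a file through the mount.
step1_node1() {
    run iptables -I INPUT 1 -s "$PEER" -j DROP
    run touch "$MOUNT/howareyou"
}

# step 2: remove the DROP rule again and list the mount point.
step2_node1() {
    run iptables -D INPUT 1
    run ls "$MOUNT"
}

step1_node1
if [ "$DRY_RUN" != "1" ]; then
    # node2 has to create its own file while the nodes are disconnected.
    printf 'Now run "touch %s/hello" on node2, then press Enter: ' "$MOUNT"
    read -r answer
fi
step2_node1
```

Note that "iptables -D INPUT 1" deletes whatever rule is first in the INPUT
chain, so the sketch assumes no other rule is inserted at position 1 between
the two steps.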

The only way to recover was to delete the offending files. This was easy
in the test environment because only two files were involved, but in
production we had many more, and I managed to recover only after deleting
the gluster volume and the local content, including the local storage
directory itself! Nothing else I tried (stopping the volume, recreating
the volume, emptying the local storage directory, remounting, restarting
gluster) worked.

Any hint on how one could recover from this sort of situation?
Thank you.