[Gluster-users] Manual rsync before self-heal to prevent repaired server hanging

Fri Oct 7 11:15:56 UTC 2011

Hello All,
I have distributed-replicated volumes (created with the CLI) spread over 
several servers.  One of the servers in the cluster has been down for 
two weeks due to hardware problems and I am now ready to put it back 
into service.  The problem is that the files on it are now very 
different to the files on its GlusterFS replica; a lot of data has been 
added to the GlusterFS volumes in the past two weeks, and several users 
have deleted or modified a lot of files as well.  Therefore, I am 
wondering if it would be better to manually synchronise the files on the 
off-line server with the files on the live server before attempting a 
GlusterFS self-heal on the volumes.  I know how to synchronise xattrs 
using rsync, but I would like to find out if this procedure is safe 
before going ahead.  My main worry is that GlusterFS replication might 
rely on there being differences between the xattrs on replicated pairs 
in normal operation, and that making the xattrs the same would break 
replication.  Can anyone tell me if it is safe to manually rsync a pair 
of replicated servers while one of them is off line?
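
A minimal sketch of what such a manual synchronisation could look like, assuming the standard rsync options -a, -H, -A, -X, --delete and --dry-run.  The host name and brick paths are hypothetical, and the sketch says nothing about whether copying the xattrs is safe for GlusterFS replication.  It only builds the command, and it defaults to a dry run so that nothing is changed.  Copying trusted.* attributes with -X typically needs root at both ends.

```python
import shlex


def build_rsync_command(src, dest, dry_run=True):
    """Build an rsync command that mirrors src onto dest, including xattrs."""
    cmd = [
        "rsync",
        "-a",        # archive mode: recurse, keep permissions, times, owners, symlinks
        "-H",        # preserve hard links
        "-A",        # preserve ACLs
        "-X",        # preserve extended attributes
        "--delete",  # remove files from dest that no longer exist on src
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without changing it
    cmd += [src, dest]
    return cmd


if __name__ == "__main__":
    # Hypothetical host and brick paths, for illustration only.
    cmd = build_rsync_command("live-server:/export/brick1/", "/export/brick1/")
    print(shlex.join(cmd))
```

Reviewing the dry-run output first shows how many files would be copied or deleted before anything on the repaired server is touched.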

There is another side to this story that may or may not be relevant.  
The hardware vendor doesn't think there is anything wrong with the 
server that keeps hanging.  Instead, they think that GlusterFS causes 
the server to hang when a lot of file synchronisation by GlusterFS 
self-healing is going on.  I'm not sure whether or not to believe this, but 
the suspicion has come about because the server hangs every time it 
comes back into service following the replacement of a piece of hardware 
(and there is not much left of the original server inside now).  The 
live (and supposedly non-faulty) server has also hung on a few occasions 
during a large GlusterFS self-heal operation (i.e. one involving a lot of 
files), and the vendor is understandably unhappy about the prospect of 
taking that one apart as well.  Both servers produce ext4-related kernel 
errors just before they hang.  They have both been upgraded to CentOS 
5.7 since the trouble began, and GlusterFS on all servers has been 
upgraded from 3.2.3 to 3.2.4.  The vendor suggested manually 
synchronising the two servers with rsync before starting glusterd on the 
server that has been repaired.  I have been trying to break the server 
with rsync and various stress-testing utilities without success for the 
past couple of days, so the vendor's view is that rsync is safe, but a 
large amount of continuous GlusterFS file synchronisation is not.  I 
would be happy to use the rsync approach if it keeps the servers 
running, as long as it doesn't ruin my xattrs.
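
One way to check whether a copy has altered the xattrs would be to compare them file by file on the two servers.  A minimal sketch, assuming Linux and Python 3 (os.listxattr and os.getxattr are only available on Linux) and that it runs as root so that trusted.* attributes are visible.  The function names are hypothetical.

```python
import os


def read_xattrs(path):
    """Return {name: value} for every extended attribute visible on path."""
    return {
        name: os.getxattr(path, name, follow_symlinks=False)
        for name in os.listxattr(path, follow_symlinks=False)
    }


def diff_xattrs(a, b):
    """Return the sorted names whose values differ or that exist on one side only."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

For example, diff_xattrs(read_xattrs(original), read_xattrs(copy)) returns an empty list when both files carry identical attributes, where original and copy stand for the same file on each brick.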

Any comments or suggestions would be much appreciated.
Regards
Dan Bretherton.