[Gluster-users] Re. Replication Issue

snmpget at gmail.com snmpget at gmail.com
Fri May 18 19:23:53 UTC 2012


Hi Dan,

Thank you very much for your comprehensive explanation of using rsync to sync
GlusterFS servers. I have not had an opportunity to test that solution because
my customer decided to give up on GlusterFS. I will test it in my lab.
Thanks,
Jimmy,

On 16 May 2012 16:45, Dan Bretherton <d.a.bretherton at reading.ac.uk> wrote:

> Hi Glusterfs Users!
>>
>> I have got one replicated volume with two bricks:
>>
>> s1 ~ # gluster volume info
>>
>> Volume Name: data-ns
>> Type: Replicate
>> Status: Started
>> Number of Bricks: 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: s1:/mnt/gluster/data-ns
>> Brick2: s2:/mnt/gluster/data-ns
>> Options Reconfigured:
>> performance.cache-refresh-timeout: 1
>> performance.io-thread-count: 32
>> auth.allow: 10.*
>> performance.cache-size: 1073741824
>>
>>
>> There are 5 clients which have mounted the volume from the s1 server.
>>
>> We had a hardware failure on the s2 box; it was down for about one week.
>> During that time all read and write operations went to s1.
>> Now I would like to synchronize all files to s2, which is operable again. I
>> have started the GlusterFS server on it and triggered the self-heal process
>> ("find" with "stat" on the GlusterFS mount from the s2 box).
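>> The command I ran was roughly the following (the client mount point
>> /mnt/data-ns is only an example):
>>
>> # walk the whole volume through a GlusterFS client mount and stat every
>> # file, which makes GlusterFS self-heal each file it touches
>> find /mnt/data-ns -noleaf -print0 | xargs --null stat > /dev/null
>>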
>> During the replication process I saw some very strange behaviour from
>> GlusterFS.
>> Some of the clients tried to fetch lots of files from the s2 server, but
>> those files either did not exist there yet or had a size of 0 bytes.
>>
>> This caused a lot of "disk wait" on the web servers (the clients which have
>> mounted the volume from s1), and eventually 503 HTTP responses were sent.
>>
>> My question is: how can I avoid serving files from the s2 box until all
>> files have been replicated correctly from the s1 server?
>>
>> I have installed GlusterFS 3.2.6-1 from the Debian repository.
>>
>> Thank you very much in advance,
>> Jimmy,
>>
>
> Dear Jimmy,
> I have had problems re-synchronising out of date servers myself.  I posted
> the following query last year.
>
> http://gluster.org/pipermail/gluster-users/2011-October/008933.html
>
> In my case I was mainly worried about the self-heal process causing
> excessive load, which I suspected of causing my fairly low-specification
> servers to hang.  Following that posting I received some advice off-list
> concerning the use of rsync to re-synchronise out-of-date servers that have
> been offline for repairs for a long period of time.  I was advised that it
> is safe to use rsync, provided that the -X or --xattrs option is used to
> preserve extended attributes, and that it is also necessary to use the
> --delete option in order to delete files that were deleted from the live
> server (a sketch of the commands I use follows below).  When I do this I
> disable the glusterd service while the rsync is taking place, although I
> have not been advised that this is essential.  It is possible that files on
> the live server may be modified while the rsync is in progress, so I always
> follow up with a targeted self-heal in order to bring the repaired server
> fully up to date.  The targeted self-heal procedure is described in the
> following Gluster Community article.
>
> http://community.gluster.org/a/howto-targeted-self-heal-repairing-less-than-the-whole-volume/
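>
> To illustrate, the rsync step I use looks roughly like the sketch below.  The
> brick paths are taken from your volume info, the init script name assumes the
> Debian glusterfs-server package, and the commands have to run as root so that
> the trusted.* extended attributes can be copied (rsync 3.x is needed on both
> ends for --xattrs).
>
> # on the repaired server (s2): stop the GlusterFS daemons before copying
> /etc/init.d/glusterfs-server stop
>
> # pull the brick contents from the live server, preserving extended
> # attributes (-X) and deleting files that were deleted on the live server
> rsync -aX --delete s1:/mnt/gluster/data-ns/ /mnt/gluster/data-ns/
>
> # restart the daemons when the copy has finished, then follow up with a
> # targeted self-heal to catch files modified during the rsync
> /etc/init.d/glusterfs-server start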
>
> When the resynchronisation process is complete I have noticed that the
> volume of data in replicated bricks can differ by up to 100MB.  I find this
> a bit worrying, but I haven't had time to find out exactly which files are
> on these bricks and why the volume of data reported by df differs on the
> two servers.
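>
> A quick way to narrow that down is to compare per-directory usage of the two
> bricks directly on the servers, along these lines (brick path as above):
>
> # run on each server and compare the two outputs, for example with diff
> du -s /mnt/gluster/data-ns/* | sort -k2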
>
> The problem with the rsync approach is that it can take a very long time
> if there are a large number of files to synchronise, probably because rsync
> is single threaded.  I recently had one rsync going for two weeks and it
> still didn't finish, and I discovered that the bricks in question had more
> than 2.5 million files.  I couldn't wait any longer to bring my repaired
> server back into service so I killed the rsync and started glusterd, and I
> then ran a targeted self-heal on the unsynchronised bricks to continue the
> resynchronisation.  That is still going on now, but I am not seeing
> excessive load and haven't noticed any replication errors (but I haven't
> got the time to check thoroughly). This might be because most of the file
> transfer has already taken place or because most of the files in these
> particular bricks are small.
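>
> For reference, the targeted self-heal mentioned above is essentially the same
> find/stat walk restricted to the part of the volume that still needs healing,
> run against a client mount (the mount point and directory here are only
> examples):
>
> # stat only the directory trees that still need healing, via a client mount
> find /mnt/data-ns/some-unsynchronised-dir -noleaf -print0 | xargs --null stat > /dev/null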
>
> My conclusion from this experience is that if a server goes down for a
> long time and becomes significantly out of date, it is best to use rsync
> (with glusterd disabled) to do as much of the file transfer as possible.
>  Once that has been done, the GlusterFS self heal mechanism can finish off
> the resynchronisation without any problematic side effects.  I will follow
> that procedure next time and report any other problems or observations.
>
> -Dan.
>
>
>
>