[Gluster-users] Disperse volume recovery and healing

Thu Mar 15 07:09:05 UTC 2018

Hi Victor,

On Wed, Mar 14, 2018 at 12:30 AM, Victor T <hero_of_nothing_1 at hotmail.com>
wrote:

> I have a question about how disperse volumes handle brick failure. I'm
> running version 3.10.10 on all systems. If I have a disperse volume in a
> 4+2 configuration with 6 servers each serving 1 brick, and maintenance
> needs to be performed on all systems, are there any general steps that need
> to be taken to ensure data is not lost or service interrupted? For example,
> can I just reboot each system sequentially after making sure the
> service is running on all servers before rebooting the next system? Or is
> there a need to force/wait for a heal after each brick comes back online?
> If I have two bricks down for multiple days and then bring them back in, is
> there a need to issue a heal or something like a rebalance before rebooting
> the other servers? There's lots of documentation about other volume types,
> but it seems information specific to dispersed volumes is a bit hard to
> find. Thanks a bunch.
>

On a 4+2 configuration you can bring down up to 2 bricks simultaneously
for maintenance. However, if something then happens to one of the remaining
4 bricks, the volume will stop working. So in this case I would recommend
not having more than one server down for maintenance at the same time,
unless the downtime is very short.
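
To make the arithmetic explicit, here is a minimal sketch of the margin
left during maintenance. The function name and constants are only
illustrative, they are not part of Gluster:

```python
# Illustrative only: a 4+2 disperse volume has 4 data and 2 redundancy bricks.
DATA_BRICKS = 4
REDUNDANCY_BRICKS = 2


def spare_failures(bricks_down, redundancy=REDUNDANCY_BRICKS):
    """Return how many more bricks can fail before the volume stops working.

    A negative result means the volume is already unavailable.
    """
    return redundancy - bricks_down


if __name__ == "__main__":
    for down in range(DATA_BRICKS + REDUNDANCY_BRICKS + 1):
        margin = spare_failures(down)
        state = "available" if margin >= 0 else "unavailable"
        print(f"{down} brick(s) down: volume {state}, margin {margin}")
```

With one server down the margin is 1, so an unexpected failure elsewhere
is still survivable. With two servers down the margin is 0.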

Once the stopped servers come back up, you need to wait until all files
are healed before proceeding with the next server. Failing to do so means
that some files could end up with more than 2 non-healthy versions, which
will make them inaccessible until enough healthy versions are available
again.
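
As a rough illustration of why skipping the heal is dangerous, the
following sketch models a single file on a 4+2 volume. It is a simplified
model with made-up names, not how Gluster tracks file state internally:

```python
def file_is_readable(stale_bricks, offline_bricks, total_bricks=6, data_bricks=4):
    """Simplified model: a file is readable while at least `data_bricks`
    bricks are both online and hold an up-to-date copy of their part."""
    unusable = set(stale_bricks) | set(offline_bricks)
    return total_bricks - len(unusable) >= data_bricks


if __name__ == "__main__":
    # Brick 1 was rebooted while the file was modified, so its copy is stale.
    # Brick 2 is now rebooted without waiting for the heal: still readable.
    print(file_is_readable(stale_bricks={1}, offline_bricks={2}))
    # The file was modified again while brick 2 was down, so bricks 1 and 2
    # are both stale. Rebooting brick 3 now leaves only 3 good copies.
    print(file_is_readable(stale_bricks={1, 2}, offline_bricks={3}))
```

In the second case only one server is down, but the file is still
inaccessible because two other bricks have not been healed yet.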

Self-heal should be triggered automatically once the bricks come online.
However, there is a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1547662)
that can cause delays in the self-heal process. This bug should be fixed
in the next version. In the meantime, you can force self-heal to make
progress by running "gluster volume heal <volname>" each time it seems to
have stopped.

Once the output of "gluster volume heal <volname> info" reports 0 pending
files on all bricks, you can proceed with the maintenance of the next
server.
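
If you want to script the wait, something like the sketch below could
work. It assumes that "gluster volume heal <volname> info" prints one
"Number of entries: N" line per brick, and that a brick which cannot be
queried shows a non-numeric value there. Please check this against the
output of your version. The function names and the polling interval are
my own choices:

```python
import re
import subprocess
import time

# Assumed output format: one "Number of entries: <value>" line per brick.
ENTRY_LINE = re.compile(r"^Number of entries:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def pending_entries(heal_info_output):
    """Sum the per-brick entry counts from the heal info output.

    Returns None if no counts are found or if any count is not a number
    (for example when a brick could not be queried).
    """
    counts = ENTRY_LINE.findall(heal_info_output)
    if not counts or not all(c.isdigit() for c in counts):
        return None
    return sum(int(c) for c in counts)


def wait_for_heal(volname, poll_seconds=60):
    """Block until heal info reports 0 pending entries on all bricks."""
    previous = None
    while True:
        result = subprocess.run(
            ["gluster", "volume", "heal", volname, "info"],
            capture_output=True, text=True, check=True,
        )
        pending = pending_entries(result.stdout)
        if pending == 0:
            return
        # If the count did not go down since the last poll, self-heal may
        # have stalled, so trigger it again.
        if pending is not None and previous is not None and pending >= previous:
            subprocess.run(["gluster", "volume", "heal", volname], check=False)
        previous = pending
        time.sleep(poll_seconds)
```

Calling wait_for_heal("myvol") after each server comes back (with your
own volume name instead of "myvol") would return only once it is safe to
continue with the next server.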

There is no need to do any rebalance for bricks that were down. Rebalance
is basically only needed when the volume is expanded with more bricks.

Xavi
