[Gluster-users] Disperse volume recovery and healing

Fri Mar 16 06:46:52 UTC 2018

On Fri, Mar 16, 2018 at 4:57 AM, Victor T <hero_of_nothing_1 at hotmail.com>
wrote:

> Xavi, does that mean that even if every node were rebooted one at a time
> without issuing a heal in between, the volume would have no issues after
> running gluster volume heal [volname] once all bricks are back online?
>

No. After bringing up one brick and before stopping the next one, you need
to be sure that there are no damaged files. You shouldn't reboot a node if
"gluster volume heal <volname> info" shows damaged files.

The command "gluster volume heal <volname>" is only a tool to force heal to
progress (until the bug is fixed).
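
For anyone scripting the rolling maintenance, the check described above could
be sketched roughly as follows. This is only an illustration, not an official
tool: the function names are made up for this example, and the parser assumes
that "gluster volume heal <volname> info" prints a "Brick <host>:<path>" line
followed by a "Number of entries: <n>" line for each brick, which may differ
between versions. A brick whose count cannot be read as a number is treated
as not safe.

```python
import re
import subprocess
import time

# Assumed output format of "gluster volume heal <volname> info":
#   Brick server1:/data/brick
#   Status: Connected
#   Number of entries: 0
BRICK_RE = re.compile(r"^Brick\s+(\S+)\s*$")
COUNT_RE = re.compile(r"^Number of entries:\s*(\S+)\s*$")


def pending_entries(heal_info_output):
    """Map each brick to its pending entry count (None if it is unknown)."""
    counts = {}
    brick = None
    for raw in heal_info_output.splitlines():
        line = raw.strip()
        match = BRICK_RE.match(line)
        if match:
            brick = match.group(1)
            counts[brick] = None  # unknown until a count line is seen
            continue
        match = COUNT_RE.match(line)
        if match and brick is not None:
            value = match.group(1)
            # A non-numeric value is kept as None, i.e. "not known to be clean".
            counts[brick] = int(value) if value.isdigit() else None
    return counts


def safe_to_reboot(heal_info_output):
    """True only if at least one brick is listed and all report 0 entries."""
    counts = pending_entries(heal_info_output)
    return bool(counts) and all(c == 0 for c in counts.values())


def wait_until_healed(volname, poll_seconds=60):
    """Poll heal info; nudge the heal each round until nothing is pending."""
    while True:
        info = subprocess.run(
            ["gluster", "volume", "heal", volname, "info"],
            capture_output=True, text=True, check=True,
        ).stdout
        if safe_to_reboot(info):
            return
        # Workaround from this thread: force the heal to make progress.
        subprocess.run(["gluster", "volume", "heal", volname], check=False)
        time.sleep(poll_seconds)
```

The idea would be to call wait_until_healed("myvol") (volume name is a
placeholder) after a rebooted server is back, and only then take down the
next one.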

Xavi

>
> ------------------------------
> *From:* Xavi Hernandez <jahernan at redhat.com>
> *Sent:* Thursday, March 15, 2018 12:09:05 AM
> *To:* Victor T
> *Cc:* gluster-users at gluster.org
> *Subject:* Re: [Gluster-users] Disperse volume recovery and healing
>
> Hi Victor,
>
> On Wed, Mar 14, 2018 at 12:30 AM, Victor T <hero_of_nothing_1 at hotmail.com>
> wrote:
>
> I have a question about how disperse volumes handle brick failure. I'm
> running version 3.10.10 on all systems. If I have a disperse volume in a
> 4+2 configuration with 6 servers each serving 1 brick, and maintenance
> needs to be performed on all systems, are there any general steps that need
> to be taken to ensure data is not lost or service interrupted? For example,
> can I just reboot each system sequentially after making sure the
> service is running on all servers before rebooting the next system? Or is
> there a need to force/wait for a heal after each brick comes back online?
> If I have two bricks down for multiple days and then bring them back in, is
> there a need to issue a heal or something like a rebalance before rebooting
> the other servers? There's lots of documentation about other volume types,
> but it seems information specific to dispersed volumes is a bit hard to
> find. Thanks a bunch.
>
>
> On a 4+2 configuration you could bring down up to 2 bricks simultaneously
> for maintenance. However if something happens to one of the remaining 4
> bricks, the volume would stop working. So in this case I would recommend
> not having more than one server down for maintenance at the same time
> unless the downtime is very short.
>
> Once the stopped servers come back up again, you need to wait until all
> files are healed before proceeding with the next server. Failing to do so
> means that some files could have more than 2 non-healthy versions, which
> will make the file inaccessible until enough healthy versions are available
> again.
>
> Self-heal should be automatically triggered once the bricks come online,
> however there was a bug
> (https://bugzilla.redhat.com/show_bug.cgi?id=1547662) that could cause
> delays in the self-heal process. This bug should be fixed in the next
> version. In the meantime you can force self-heal to progress by issuing
> "gluster volume heal <volname>" commands each time it seems to have
> stopped.
>
> Once the output of "gluster volume heal <volname> info" reports 0 pending
> files on all bricks, you can proceed with the maintenance of the next
> server.
>
> No need to do any rebalance for down bricks. Rebalance is basically needed
> when a volume is expanded with more bricks.
>
> Xavi
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>