[Gluster-users] recovery from reboot time?

Tue Mar 26 12:40:08 UTC 2019

After almost a week of apparently doing nothing, the brick failed. We 
were then able to stop and restart glusterd and start a manual heal.

Interestingly, when the heal started, the estimated time to completion 
was about 21 days, but as it worked through the 300,000-odd entries it 
sped up to the point where it completed in 2 days.
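
For anyone wanting to reproduce that kind of estimate, here is a rough 
sketch of the arithmetic. It assumes you sample the pending-entry count 
yourself at intervals (for example from the output of "gluster volume 
heal <VOLNAME> statistics heal-count"); the function name and the sample 
numbers are hypothetical and are not taken from our actual heal.

```python
def heal_eta_days(samples):
    """Estimate days remaining from (elapsed_seconds, pending_entries) samples.

    Only the two most recent samples are used, so the estimate follows the
    current heal rate rather than the average rate since the heal started.
    Returns None when no progress can be measured.
    """
    (t0, n0), (t1, n1) = samples[-2], samples[-1]
    healed = n0 - n1
    if t1 <= t0 or healed <= 0:
        return None  # no measurable progress between the two samples
    rate = healed / (t1 - t0)   # entries healed per second
    return n1 / rate / 86400    # seconds remaining, converted to days


# Hypothetical samples: a slow first hour, then a much faster hour later on.
early = [(0, 300_000), (3_600, 299_400)]        # 600 entries in one hour
later = [(86_400, 200_000), (90_000, 190_000)]  # 10,000 entries in one hour

print(heal_eta_days(early))
print(heal_eta_days(later))
```

Because the rate is recomputed from the latest interval, an early 
estimate of weeks can shrink to days once the heal speeds up, which 
matches what we saw.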

Now I have 2 gfids that refuse to heal.

We have also been looking at converting these systems to RHEL and buying 
support from RH but it seems that the sales arm is not interested in 
calling people back.

On 3/20/19 1:39 AM, Amar Tumballi Suryanarayan wrote:
> Two things happen after a reboot.
>
> 1. glusterd (management layer) does a sanity check of its volumes, 
> checks whether anything changed while it was down, and tries to 
> correct its state.
>   - This is fine as long as the number of volumes or nodes is 
> small (small meaning < 100).
>
> 2. If it is a replicate or disperse volume, the self-heal daemon 
> checks whether any self-heals are pending.
>   - This does an 'index' crawl to check which files actually changed 
> while one of the bricks/nodes was down.
>   - If this list is big, it can sometimes take some time.
>
> But 'Days/weeks/months' is not expected or observed behavior. Are there 
> any messages in the log file? If not, can you run 'strace -f' on the pid 
> that is consuming the most CPU? (A 1-minute strace sample is good enough.)
>
> -Amar
>
>
> On Wed, Mar 20, 2019 at 2:05 AM Alvin Starr <alvin at netvel.net> wrote:
>
>     We have a simple replicated volume with one 17TB brick on each node.
>
>     There are something like 35M files and directories on the volume.
>
>     One of the servers rebooted and is now "doing something".
>
>     It kind of looks like it's doing some kind of sanity check with the
>     node that did not reboot, but it's hard to say, and it looks like
>     it may run for hours/days/months....
>
>     Will Gluster take a long time to resync lots of little files?
>
>
>     -- 
>     Alvin Starr                   ||   land:  (905)513-7688
>     Netvel Inc.                   ||   Cell:  (416)806-0133
>     alvin at netvel.net              ||
>
>     _______________________________________________
>     Gluster-users mailing list
>     Gluster-users at gluster.org
>     https://lists.gluster.org/mailman/listinfo/gluster-users
>
>
>
> -- 
> Amar Tumballi (amarts)

-- 
Alvin Starr                   ||   land:  (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin at netvel.net              ||
