[Gluster-users] Is rebalance in progress or not?

Sun Mar 15 16:17:57 UTC 2020

On March 15, 2020 12:16:51 PM GMT+02:00, Alexander Iliev <ailiev+gluster at mamul.org> wrote:
>On 3/15/20 11:07 AM, Strahil Nikolov wrote:
>> On March 15, 2020 11:50:32 AM GMT+02:00, Alexander Iliev <ailiev+gluster at mamul.org> wrote:
>>> Hi list,
>>>
>>> I was having some issues with one of my Gluster nodes so I ended up
>>> re-installing it. Now I want to re-add the bricks for my main volume and
>>> I'm having the following issue - when I try to add the bricks I get:
>>>
>>>> # gluster volume add-brick store1 replica 3 <bricks ...>
>>>> volume add-brick: failed: Pre Validation failed on 172.31.35.132. Volume name store1 rebalance is in progress. Please retry after completion
>>>
>>> But then if I get the rebalance status I get:
>>>
>>>> # gluster volume rebalance store1 status
>>>> volume rebalance: store1: failed: Rebalance not started for volume store1.
>>>
>>> And if I try to start the rebalancing I get:
>>>
>>>> # gluster volume rebalance store1 start
>>>> volume rebalance: store1: failed: Rebalance on store1 is already started
>>>
>>> Looking at the logs of the first node, when I try to start the rebalance
>>> operation I see this:
>>>
>>>> [2020-03-15 09:41:31.883651] E [MSGID: 106276] [glusterd-rpc-ops.c:1200:__glusterd_stage_op_cbk] 0-management: Received stage RJT from uuid: 9476b8bb-d7ee-489a-b083-875805343e67
>>>
>>> On the second node the logs are showing stuff that indicates that a
>>> rebalance operation is indeed in progress:
>>>
>>>> [2020-03-15 09:47:34.190042] I [MSGID: 109081] [dht-common.c:5868:dht_setxattr] 0-store1-dht: fixing the layout of /redacted
>>>> [2020-03-15 09:47:34.775691] I [dht-rebalance.c:3285:gf_defrag_process_dir] 0-store1-dht: migrate data called on /redacted
>>>> [2020-03-15 09:47:36.019403] I [dht-rebalance.c:3480:gf_defrag_process_dir] 0-store1-dht: Migration operation on dir /redacted took 1.24 secs
>>>
>>>
>>> Some background on what led to this situation:
>>>
>>> The volume was originally a replica 3 distributed replicated volume on
>>> three nodes. In order to detach the faulty node I lowered the replica
>>> count to 2 and removed the bricks from that node from the volume. I
>>> cleaned up the storage (formatted the bricks and cleaned the
>>> trusted.gfid and trusted.glusterfs.volume-id extended attributes) and
>>> purged the gluster packages from the system, then I re-installed the
>>> gluster packages and did a `gluster peer probe` from another node.
>>>
>>> I'm running Gluster 6.6 on CentOS 7.7 on all nodes.
>>>
>>> I feel stuck at this point, so any guidance will be greatly
>>> appreciated.
>>>
>>> Thanks!
>>>
>>> Best regards,
>> 
>> Hey Alex,
>> 
>> Did you try going to the second node (the one that thinks the rebalance is running) and stopping the rebalance?
>> 
>> gluster volume rebalance VOLNAME stop
>> 
>> Then add the new brick (and increase the replica count) and, after the heal is over, rebalance again.
>
>Hey Strahil,
>
>Thanks for the suggestion, I just tried it, but unfortunately the result
>is pretty much the same - when I try to stop the rebalance on the second
>node it reports that no rebalance is in progress:
>
> > # gluster volume rebalance store1 stop
> > volume rebalance: store1: failed: Rebalance not started for volume store1.
>
>> 
>> Best Regards,
>> Strahil Nikolov
>> 
>
>Best regards,
>--
>alexander iliev

Hey Alex,

I'm not sure if the command has a 'force' flag, but if it does, it is worth trying.

gluster volume rebalance store1 stop force

Sadly, as the second node thinks the rebalance is running, I'm not sure if a 'start force' (to convince both nodes that the rebalance is running) and then a 'stop' will have the expected effect.
Sadly, this situation is hard to reproduce.
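
A rough sketch of the 'start force' then 'stop' idea above, in case it helps. It is untested against a real cluster, and there is no guarantee it clears the stale state. The try_step helper and the GLUSTER and VOL variables are made up for illustration; only the rebalance subcommands themselves (start force, stop, status) are real gluster CLI.

```shell
#!/bin/sh
# Sketch only: run one rebalance subcommand at a time and report the result.
# GLUSTER and VOL are illustrative variables, not part of gluster itself.
GLUSTER="${GLUSTER:-gluster}"
VOL="${VOL:-store1}"

try_step() {
    # Usage: try_step <rebalance subcommand words...>
    # e.g.   try_step start force
    echo "== gluster volume rebalance $VOL $*"
    if "$GLUSTER" volume rebalance "$VOL" "$@"; then
        echo "ok: $*"
    else
        echo "failed: $*"
        return 1
    fi
}

# Intended sequence (left commented out so nothing runs by accident):
#   try_step start force && try_step stop
#   try_step status
```

Running the steps one by one keeps the output of each command visible, which is also what a bug report would need if the sequence does not help.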

In any case, a bug report should be opened.
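
For the bug report, something like the following could gather the basic state from each node into one file. This is only a sketch: the collect_report function, the GLUSTER and VOL variables, and the output file name are hypothetical, and the developers may well ask for more (for example the glusterd and rebalance logs).

```shell
#!/bin/sh
# Sketch: save the output of a few standard gluster commands to one file.
# GLUSTER and VOL are illustrative variables, not part of gluster itself.
GLUSTER="${GLUSTER:-gluster}"
VOL="${VOL:-store1}"

collect_report() {
    # $1: output file (any path you like)
    out="$1"
    : > "$out"
    for args in "--version" "peer status" "volume info $VOL" \
                "volume status $VOL" "volume rebalance $VOL status"; do
        echo "### gluster $args" >> "$out"
        # Word splitting of $args is intentional. Keep going on failure:
        # the error messages are exactly what the report should contain.
        # shellcheck disable=SC2086
        "$GLUSTER" $args >> "$out" 2>&1 || echo "(exit status $?)" >> "$out"
    done
}

# Example (commented out): collect_report "/tmp/store1-report.txt"
```

Running it on both the first and the second node would show side by side where their views of the rebalance state differ.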

Keep in mind that I do not have a distributed volume, so everything above is pure speculation.

Based on my experience, a gluster upgrade can fix odd situations like that, but it could also make things worse. So for now avoid any upgrades until a dev confirms it is safe to do so.

Best Regards,
Strahil Nikolov