[Bugs] [Bug 1464110] [Scale] : Rebalance ETA (towards the end) may be inaccurate, even on a moderately large data set.

Thu Jun 22 12:52:00 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1464110

Nithya Balachandran <nbalacha at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
             Blocks|1417151                     |

--- Comment #1 from Nithya Balachandran <nbalacha at redhat.com> ---
+++ This bug was initially created as a clone of Bug #1457731 +++

Description:
------------

Added bricks to a distributed-replicate volume, then ran rebalance.

These are the rebalance ETAs at different points in time:

[T4 > T3 > T2 > T1]

**At time T1**

[root at gqas014 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            63949         9.8GB        295287             0             0          in progress        0:34:57
                                 server2            64644         9.9GB        300745             0             0          in progress        0:34:57
Estimated time left for rebalance to complete :        0:00:38
volume rebalance: butcher: success

**At time T2**

[root at server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            64010         9.8GB        295597             0             0          in progress        0:34:58
                                 server2            64705         9.9GB        300918             0             0          in progress        0:34:58
Estimated time left for rebalance to complete :        0:01:09

**At time T3**

[root at server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            68057        10.0GB        313569             0             0          in progress        0:36:46
                                 server2            68904        10.2GB        319823             0             0          in progress        0:36:46
Estimated time left for rebalance to complete :        0:00:09
volume rebalance: butcher: success
[root at server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            68110        10.0GB        313882             0             0          in progress        0:36:48
                                 server2            68958        10.2GB        319948             0             0          in progress        0:36:48
Estimated time left for rebalance to complete :        0:01:10
volume rebalance: butcher: success

**At time T4** // When it finally completed :

[root at server1 ~]# gluster v rebalance butcher status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost            74885       104.4GB        345001             0             0            completed        1:12:32
                                 server2            74658        10.5GB        345747             0             0            completed        0:39:54
volume rebalance: butcher: success
[root at server1 ~]#
[root at server1 ~]#

So at T1, the reported ETA for completion is 38 seconds.

At T2, it suddenly increased to slightly more than a minute.

The same thing happens at T3.

So, basically, the ETA keeps looping for a while: it starts at about 1:10
(1 minute 10 seconds), counts down to 0, and then starts at 1:10 again.

This continued for another half an hour, after which the rebalance finally
completed (you can see the time difference in the run time column across the
intervals).

##NUM_FILES##
[root at gqac011 gluster-mount]# find . -mindepth 1 -type f | wc -l

352120

--- Additional comment from Nithya Balachandran on 2017-06-22 06:38:54 EDT ---

RCA:

The rebalance process calculates the file count once at the beginning and then
uses the value throughout.

If files are created during the rebalance, the number of files scanned may end
up exceeding the initially estimated number of files. In that case, rebalance
used to just increment the estimate by 10K and continue. Based on the scan rate
in the setup on which the bug was filed, 10K files works out to about 1 min 10 s.
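
As a rough check of that figure, the sketch below derives the scan rate from
the localhost rows of the status output in this report (scanned count and run
time at T1 and T3) and computes how long scanning another 10K files would
take. The helper names are made up for illustration and this is not GlusterFS
code; it assumes the ETA is simply the remaining file count divided by the
average scan rate so far.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'h:m:s' run time string, as printed by the status output, to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def seconds_to_hms(seconds: float) -> str:
    """Format a number of seconds as h:mm:ss, rounded to the nearest second."""
    total = int(round(seconds))
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


def bump_eta_seconds(scanned: int, run_time: str, bump: int = 10_000) -> float:
    """Time to scan `bump` more files at the average scan rate so far (assumed model)."""
    rate = scanned / hms_to_seconds(run_time)  # files per second
    return bump / rate


# (scanned, run time) taken from the localhost rows at T1 and T3
samples = [(295287, "0:34:57"), (313569, "0:36:46")]
for scanned, run_time in samples:
    print(run_time, seconds_to_hms(bump_eta_seconds(scanned, run_time)))
```

Both samples give a scan rate of roughly 140 files per second, so a 10K bump
corresponds to about 70 seconds, which is in line with the 0:01:09 and 0:01:10
values seen in the status output.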

Now the rebalance process periodically updates the file count. However, this
does not necessarily make the estimates more accurate, since newly added files
may not be processed if their parent directories have already been processed.
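
The looping ETA described in this RCA can be reproduced with a toy model. The
sketch below is only an illustration under stated assumptions, not the
GlusterFS implementation: the class name, the method names, and the exact
bump rule (add 10K to the estimate whenever the scanned count reaches it) are
hypothetical, and the scan rate is held constant at 140 files per second.

```python
BUMP = 10_000  # increment mentioned in the RCA


class EtaEstimator:
    """Toy model of the ETA behaviour described above (hypothetical, not GlusterFS code)."""

    def __init__(self, estimated_total: int) -> None:
        # File count calculated once at the start of the rebalance
        self.estimated_total = estimated_total

    def refresh_total(self, new_total: int) -> None:
        """Model of the periodic update: replace the estimate with a fresh count."""
        self.estimated_total = new_total

    def eta_seconds(self, scanned: int, elapsed_seconds: float) -> float:
        # Assumed old behaviour: once the scan reaches the estimate,
        # push the estimate out by 10K and carry on.
        while scanned >= self.estimated_total:
            self.estimated_total += BUMP
        rate = scanned / elapsed_seconds  # average files per second so far
        return (self.estimated_total - scanned) / rate


# Simulate a scan at a constant 140 files/s with a stale initial estimate.
rate = 140
estimator = EtaEstimator(estimated_total=300_000)
etas = [
    estimator.eta_seconds(scanned=rate * t, elapsed_seconds=t)
    for t in range(2100, 2400, 10)
]
for t, eta in zip(range(2100, 2400, 10), etas):
    print(f"t={t}s  ETA={eta:5.1f}s")
```

In this model the ETA counts down towards zero, jumps back up to at most
10000 / 140 (about 71 seconds), and repeats for as long as files keep being
scanned beyond the stale estimate. Calling `refresh_total` with a newer count
moves the target, but the result is only as good as the new count.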
