[Bugs] [Bug 1464110] [Scale] : Rebalance ETA (towards the end) may be inaccurate, even on a moderately large data set.
bugzilla at redhat.com
Thu Jun 22 12:52:00 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1464110
Nithya Balachandran <nbalacha at redhat.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
             Blocks|1417151                     |
--- Comment #1 from Nithya Balachandran <nbalacha at redhat.com> ---
+++ This bug was initially created as a clone of Bug #1457731 +++
Description:
------------
Added bricks to a distributed-replicate volume, then ran rebalance.
These are the rebalance ETAs at different points in time:
[T4 > T3 > T2 > T1]
**At time T1**
[root@gqas014 ~]# gluster v rebalance butcher status
Node       Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost  63949             9.8GB    295287   0         0        in progress  0:34:57
server2    64644             9.9GB    300745   0         0        in progress  0:34:57
Estimated time left for rebalance to complete : 0:00:38
volume rebalance: butcher: success
**At time T2**
[root@server1 ~]# gluster v rebalance butcher status
Node       Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost  64010             9.8GB    295597   0         0        in progress  0:34:58
server2    64705             9.9GB    300918   0         0        in progress  0:34:58
Estimated time left for rebalance to complete : 0:01:09
**At time T3**
[root@server1 ~]# gluster v rebalance butcher status
Node       Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost  68057             10.0GB   313569   0         0        in progress  0:36:46
server2    68904             10.2GB   319823   0         0        in progress  0:36:46
Estimated time left for rebalance to complete : 0:00:09
volume rebalance: butcher: success
[root@server1 ~]# gluster v rebalance butcher status
Node       Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost  68110             10.0GB   313882   0         0        in progress  0:36:48
server2    68958             10.2GB   319948   0         0        in progress  0:36:48
Estimated time left for rebalance to complete : 0:01:10
volume rebalance: butcher: success
**At time T4** (when it finally completed)
[root@server1 ~]# gluster v rebalance butcher status
Node       Rebalanced-files  size     scanned  failures  skipped  status       run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost  74885             104.4GB  345001   0         0        completed    1:12:32
server2    74658             10.5GB   345747   0         0        completed    0:39:54
volume rebalance: butcher: success
So at T1, the reported ETA for completion is 38 seconds.
At T2, it suddenly increases to slightly more than a minute.
The same thing happens at T3.
Basically, the ETA keeps looping for a while: it starts at about 1:10, counts
down to 0, and then starts at 1:10 again.
This continued for another half an hour, after which the rebalance finally
completed (see the difference in the run time column across the intervals).
##NUM_FILES##
[root@gqac011 gluster-mount]# find . -mindepth 1 -type f | wc -l
352120
--- Additional comment from Nithya Balachandran on 2017-06-22 06:38:54 EDT ---
RCA:
The rebalance process calculates the file count once at the beginning and then
uses the value throughout.
If files are created during the rebalance, the number of files scanned may end
up exceeding the initially estimated number of files. In that case, rebalance
used to just increment the estimate by 10K and continue. Based on the scan rate
in the setup on which the bug was filed, that works out to 1 min 10 s.
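As a rough check of the 1 min 10 s figure, here is a minimal sketch using the T1 numbers reported for localhost above. It assumes the ETA is simply the remaining file count divided by the average scan rate so far; the function names are made up for illustration and are not taken from the glusterfs code.

```python
# Illustrative only: assumes ETA = remaining files / average scan rate.
# Function names are hypothetical, not glusterfs identifiers.

BUMP = 10_000  # files added to the estimate by the old code, per the RCA


def hms_to_seconds(hms: str) -> int:
    """Convert an 'h:m:s' run time from the status output to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def scan_rate(files_scanned: int, run_time: str) -> float:
    """Average number of files scanned per second so far."""
    return files_scanned / hms_to_seconds(run_time)


def eta_after_bump(files_scanned: int, run_time: str, bump: int = BUMP) -> float:
    """Seconds shown as ETA right after the estimate is raised by `bump` files."""
    return bump / scan_rate(files_scanned, run_time)


if __name__ == "__main__":
    # T1, localhost: 295287 files scanned after 0:34:57.
    rate = scan_rate(295287, "0:34:57")
    print(f"scan rate: {rate:.1f} files/s")
    print(f"ETA after a 10K bump: {eta_after_bump(295287, '0:34:57'):.0f} s")
```

With these inputs the scan rate comes out at roughly 141 files/s, so 10K files take about 71 s, which is in line with the ~1:10 value the ETA kept resetting to.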
Now the rebalance process will periodically update the file count. However,
this need not make the estimates more accurate, as newly added files may not
be processed if their parent directories have already been processed.