[Gluster-users] Rebalance state stuck or corrupted

Anh Vo vtqanh at gmail.com
Wed May 23 19:33:41 UTC 2018


We have had a rebalance operation running for a few days. After a couple
of days the rebalance status said "failed". We stopped the rebalance by
running gluster volume rebalance gv0 stop, and the rebalance log indicates
that gluster did try to stop it. However, when we now try to stop the volume
or restart the rebalance, it says a rebalance operation is in progress
and the volume can't be stopped. I tried restarting all the glusterfs-server
services (we're using Gluster 3.8.15 on Ubuntu), but that did not help.

user@gfs-vm000:~$ sudo gluster volume stop gv0
Stopping volume will make its data inaccessible. Do you want to continue?
(y/n) y
volume stop: gv0: failed: Staging failed on gfs-vm001. Error: rebalance
session is in progress for the volume 'gv0'
Staging failed on gfs-vm017. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm011. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm006. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm003. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm004. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on 10.0.13.9. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm014. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm013. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm002. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm016. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm007. Error: rebalance session is in progress for
the volume 'gv0'
Staging failed on gfs-vm010. Error: rebalance session is in progress for
the volume 'gv0'
user@gfs-vm000:~$ sudo gluster volume rebalance gv0 stop
volume rebalance: gv0: failed: Rebalance not started.
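To see at a glance which peers still report the stale rebalance session, the "Staging failed" output can be parsed into a peer list. This is only a rough sketch: the function name and the sample text are mine, and it assumes the CLI output has been captured to a string in the format shown above (it does not call gluster itself).

```python
import re

# One "Staging failed on <peer>. Error: <message>" entry; the message runs
# until the next entry or the end of the text.
ENTRY = re.compile(
    r"Staging failed on (\S+)\. Error: (.*?)(?= Staging failed on |$)"
)

def staging_failures(cli_output):
    """Return {peer: error message} for each staging failure in the output."""
    # The mail client wrapped long lines, so collapse all whitespace first.
    flat = " ".join(cli_output.split())
    return {peer: msg for peer, msg in ENTRY.findall(flat)}

# Two entries copied from the output above (hypothetical capture to a string).
sample = """volume stop: gv0: failed: Staging failed on gfs-vm001. Error: rebalance
session is in progress for the volume 'gv0'
Staging failed on 10.0.13.9. Error: rebalance session is in progress for
the volume 'gv0'
"""

failures = staging_failures(sample)
for peer in sorted(failures):
    print(peer, "->", failures[peer])
```

Comparing the resulting peer names against the full peer list would show whether every node rejects the stop or only some of them.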

Tail of gv0-rebalance.log:

[2018-05-23 17:32:55.262168] I [MSGID: 109029]
[dht-rebalance.c:4260:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-05-23 17:32:55.262221] I [MSGID: 109028]
[dht-rebalance.c:4079:gf_defrag_status_get] 0-glusterfs: Rebalance is
stopped. Time taken is 749380.00 secs
[2018-05-23 17:32:55.262234] I [MSGID: 109028]
[dht-rebalance.c:4083:gf_defrag_status_get] 0-glusterfs: Files migrated:
821417, size: 25797609415002, lookups: 1162021, failures: 0, skipped: 1814
[2018-05-23 17:32:55.777149] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-50724.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:32:55.782048] W [dht-rebalance.c:2826:gf_defrag_process_dir]
0-gv0-dht: Found error from gf_defrag_get_entry
[2018-05-23 17:32:55.782358] E [MSGID: 109111]
[dht-rebalance.c:3123:gf_defrag_fix_layout] 0-gv0-dht:
gf_defrag_process_dir failed for directory:
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115106] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl
[2018-05-23 17:32:56.115586] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill
[2018-05-23 17:32:56.115849] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for /pnrsy/v-zhli2/generated/ende_with_teacher/model
[2018-05-23 17:32:56.116141] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for /pnrsy/v-zhli2/generated/ende_with_teacher
[2018-05-23 17:32:56.116237] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for /pnrsy/v-zhli2/generated
[2018-05-23 17:32:56.116393] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for /pnrsy/v-zhli2
[2018-05-23 17:32:56.116625] E [MSGID: 109016]
[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-gv0-dht: Fix layout failed
for /pnrsy
[2018-05-23 17:32:56.129836] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 7
[2018-05-23 17:32:56.130072] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 8
[2018-05-23 17:32:56.130567] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 9
[2018-05-23 17:32:56.131273] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 10
[2018-05-23 17:32:56.131492] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 11
[2018-05-23 17:32:56.131578] I [dht-rebalance.c:2246:gf_defrag_task] 0-DHT:
Thread wokeup. defrag->current_thread_count: 12
[2018-05-23 17:33:09.164419] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-142510.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-5
[2018-05-23 17:33:09.386106] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-344803.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-2
[2018-05-23 17:33:12.463711] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-217794.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-1
[2018-05-23 17:33:21.525221] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-198211.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:33:28.644220] I [MSGID: 109022]
[dht-rebalance.c:1703:dht_migrate_file] 0-gv0-dht: completed migration of
/pnrsy/v-zhli2/generated/ende_with_teacher/model/translate_ende_wmt32k_distill/transformer_nat-transformer_nat_base_v1-id016_lr0.1_4000_reg5.0_neighbor_hinge0.5_exp_distill_2.0_no_average_kl/model.ckpt-44350.data-00002-of-00003
from subvolume gv0-replicate-0 to gv0-replicate-3
[2018-05-23 17:33:28.647136] I [MSGID: 109028]
[dht-rebalance.c:4079:gf_defrag_status_get] 0-gv0-dht: Rebalance is failed.
Time taken is 749413.00 secs
[2018-05-23 17:33:28.647162] I [MSGID: 109028]
[dht-rebalance.c:4083:gf_defrag_status_get] 0-gv0-dht: Files migrated:
821423, size: 25803971060106, lookups: 1162021, failures: 9, skipped: 1814
[2018-05-23 17:33:28.660680] W [glusterfsd.c:1327:cleanup_and_exit]
(-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7ff2df1e46ba]
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55c8f9c89545]
-->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55c8f9c893b4] ) 0-:
received signum (15), shutting down
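The two status snapshots in this log (one right after the stop command, one just before shutdown) can be compared with a small script. Again only a sketch under assumptions: the function name and dictionary keys are made up for illustration, and it relies on the "Files migrated: ..." message format exactly as it appears in the excerpt above.

```python
import re

# Counters as printed by the gf_defrag_status_get messages in the log above.
STATUS = re.compile(
    r"Files migrated: (\d+), size: (\d+), lookups: (\d+), "
    r"failures: (\d+), skipped: (\d+)"
)
KEYS = ("migrated", "size", "lookups", "failures", "skipped")

def status_snapshots(log_text):
    """Return one dict of counters per status message, in log order."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [dict(zip(KEYS, map(int, m))) for m in STATUS.findall(flat)]

# The two status messages from the excerpt, wrapped as in the mail.
sample = """[dht-rebalance.c:4083:gf_defrag_status_get] 0-glusterfs: Files migrated:
821417, size: 25797609415002, lookups: 1162021, failures: 0, skipped: 1814
[dht-rebalance.c:4083:gf_defrag_status_get] 0-gv0-dht: Files migrated:
821423, size: 25803971060106, lookups: 1162021, failures: 9, skipped: 1814
"""

first, last = status_snapshots(sample)
for key in KEYS:
    print(key, last[key] - first[key])
```

On these two snapshots the differences are 6 more files migrated and 9 more failures, with lookups and skipped unchanged, so migrations were still completing after the stop command was received.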