[Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Ashish Pandey
aspandey at redhat.com
Wed Oct 10 09:02:36 UTC 2018
----- Original Message -----
From: "Mauro Tridici" <mauro.tridici at cmcc.it>
To: "Ashish Pandey" <aspandey at redhat.com>
Cc: "gluster-users" <gluster-users at gluster.org>
Sent: Wednesday, October 10, 2018 1:48:28 PM
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Hi Ashish,
sorry for the late reply.
You can find below the outputs you need.
REBALANCE STATUS:
[root@s01 ~]# gluster volume rebalance tier2 status
Node       Rebalanced-files     size  scanned  failures  skipped       status  run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost            653403   36.4TB  2774605         0    63604  in progress          103:58:49
s02-stg              558292   20.4TB  1856726         0    34295  in progress          103:58:50
s03-stg              560233   20.2TB  1873208         0    34182  in progress          103:58:50
s04-stg                   0   0Bytes        0         0        0       failed            0:00:37
s05-stg                   0   0Bytes        0         0        0    completed           48:33:03
s06-stg                   0   0Bytes        0         0        0    completed           48:33:02
Estimated time left for rebalance to complete : 1330:02:50
volume rebalance: tier2: success
GLUSTER VOLUME STATUS:
[root@s01 ~]# gluster volume status
Status of volume: tier2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick s01-stg:/gluster/mnt1/brick           49152     0          Y       3498
Brick s02-stg:/gluster/mnt1/brick           49152     0          Y       3520
Brick s03-stg:/gluster/mnt1/brick           49152     0          Y       3517
Brick s01-stg:/gluster/mnt2/brick           49153     0          Y       3506
Brick s02-stg:/gluster/mnt2/brick           49153     0          Y       3527
Brick s03-stg:/gluster/mnt2/brick           49153     0          Y       3524
Brick s01-stg:/gluster/mnt3/brick           49154     0          Y       3514
Brick s02-stg:/gluster/mnt3/brick           49154     0          Y       3535
Brick s03-stg:/gluster/mnt3/brick           49154     0          Y       3532
Brick s01-stg:/gluster/mnt4/brick           49155     0          Y       3523
Brick s02-stg:/gluster/mnt4/brick           49155     0          Y       3543
Brick s03-stg:/gluster/mnt4/brick           49155     0          Y       3541
Brick s01-stg:/gluster/mnt5/brick           49156     0          Y       3532
Brick s02-stg:/gluster/mnt5/brick           49156     0          Y       3553
Brick s03-stg:/gluster/mnt5/brick           49156     0          Y       3551
Brick s01-stg:/gluster/mnt6/brick           49157     0          Y       3542
Brick s02-stg:/gluster/mnt6/brick           49157     0          Y       3562
Brick s03-stg:/gluster/mnt6/brick           49157     0          Y       3559
Brick s01-stg:/gluster/mnt7/brick           49158     0          Y       3551
Brick s02-stg:/gluster/mnt7/brick           49158     0          Y       3571
Brick s03-stg:/gluster/mnt7/brick           49158     0          Y       3568
Brick s01-stg:/gluster/mnt8/brick           49159     0          Y       3560
Brick s02-stg:/gluster/mnt8/brick           49159     0          Y       3580
Brick s03-stg:/gluster/mnt8/brick           49159     0          Y       3577
Brick s01-stg:/gluster/mnt9/brick           49160     0          Y       3569
Brick s02-stg:/gluster/mnt9/brick           49160     0          Y       3589
Brick s03-stg:/gluster/mnt9/brick           49160     0          Y       3586
Brick s01-stg:/gluster/mnt10/brick          49161     0          Y       3578
Brick s02-stg:/gluster/mnt10/brick          49161     0          Y       3597
Brick s03-stg:/gluster/mnt10/brick          49161     0          Y       3595
Brick s01-stg:/gluster/mnt11/brick          49162     0          Y       3587
Brick s02-stg:/gluster/mnt11/brick          49162     0          Y       3607
Brick s03-stg:/gluster/mnt11/brick          49162     0          Y       3604
Brick s01-stg:/gluster/mnt12/brick          49163     0          Y       3595
Brick s02-stg:/gluster/mnt12/brick          49163     0          Y       3616
Brick s03-stg:/gluster/mnt12/brick          49163     0          Y       3612
Brick s04-stg:/gluster/mnt1/brick           49152     0          Y       3408
Brick s05-stg:/gluster/mnt1/brick           49152     0          Y       3447
Brick s06-stg:/gluster/mnt1/brick           49152     0          Y       3393
Brick s04-stg:/gluster/mnt2/brick           49153     0          Y       3416
Brick s05-stg:/gluster/mnt2/brick           49153     0          Y       3454
Brick s06-stg:/gluster/mnt2/brick           49153     0          Y       3402
Brick s04-stg:/gluster/mnt3/brick           49154     0          Y       3424
Brick s05-stg:/gluster/mnt3/brick           49154     0          Y       3462
Brick s06-stg:/gluster/mnt3/brick           49154     0          Y       3410
Brick s04-stg:/gluster/mnt4/brick           49155     0          Y       3431
Brick s05-stg:/gluster/mnt4/brick           49155     0          Y       3470
Brick s06-stg:/gluster/mnt4/brick           49155     0          Y       3418
Brick s04-stg:/gluster/mnt5/brick           49156     0          Y       3440
Brick s05-stg:/gluster/mnt5/brick           49156     0          Y       3479
Brick s06-stg:/gluster/mnt5/brick           49156     0          Y       3427
Brick s04-stg:/gluster/mnt6/brick           49157     0          Y       3449
Brick s05-stg:/gluster/mnt6/brick           49157     0          Y       3489
Brick s06-stg:/gluster/mnt6/brick           49157     0          Y       3437
Brick s04-stg:/gluster/mnt7/brick           49158     0          Y       3459
Brick s05-stg:/gluster/mnt7/brick           49158     0          Y       3497
Brick s06-stg:/gluster/mnt7/brick           49158     0          Y       3445
Brick s04-stg:/gluster/mnt8/brick           49159     0          Y       3469
Brick s05-stg:/gluster/mnt8/brick           49159     0          Y       3507
Brick s06-stg:/gluster/mnt8/brick           49159     0          Y       3454
Brick s04-stg:/gluster/mnt9/brick           49160     0          Y       3480
Brick s05-stg:/gluster/mnt9/brick           49160     0          Y       3516
Brick s06-stg:/gluster/mnt9/brick           49160     0          Y       3463
Brick s04-stg:/gluster/mnt10/brick          49161     0          Y       3491
Brick s05-stg:/gluster/mnt10/brick          49161     0          Y       3525
Brick s06-stg:/gluster/mnt10/brick          49161     0          Y       3473
Brick s04-stg:/gluster/mnt11/brick          49162     0          Y       3500
Brick s05-stg:/gluster/mnt11/brick          49162     0          Y       3535
Brick s06-stg:/gluster/mnt11/brick          49162     0          Y       3483
Brick s04-stg:/gluster/mnt12/brick          49163     0          Y       3509
Brick s05-stg:/gluster/mnt12/brick          49163     0          Y       3544
Brick s06-stg:/gluster/mnt12/brick          49163     0          Y       3489
Self-heal Daemon on localhost               N/A       N/A        Y       3124
Quota Daemon on localhost                   N/A       N/A        Y       3307
Bitrot Daemon on localhost                  N/A       N/A        Y       3370
Scrubber Daemon on localhost                N/A       N/A        Y       3464
Self-heal Daemon on s02-stg                 N/A       N/A        Y       3275
Quota Daemon on s02-stg                     N/A       N/A        Y       3353
Bitrot Daemon on s02-stg                    N/A       N/A        Y       3393
Scrubber Daemon on s02-stg                  N/A       N/A        Y       3507
Self-heal Daemon on s03-stg                 N/A       N/A        Y       3099
Quota Daemon on s03-stg                     N/A       N/A        Y       3303
Bitrot Daemon on s03-stg                    N/A       N/A        Y       3378
Scrubber Daemon on s03-stg                  N/A       N/A        Y       3448
Self-heal Daemon on s04-stg                 N/A       N/A        Y       3078
Quota Daemon on s04-stg                     N/A       N/A        Y       3244
Bitrot Daemon on s04-stg                    N/A       N/A        Y       3288
Scrubber Daemon on s04-stg                  N/A       N/A        Y       3360
Self-heal Daemon on s06-stg                 N/A       N/A        Y       3047
Quota Daemon on s06-stg                     N/A       N/A        Y       3231
Bitrot Daemon on s06-stg                    N/A       N/A        Y       3287
Scrubber Daemon on s06-stg                  N/A       N/A        Y       3343
Self-heal Daemon on s05-stg                 N/A       N/A        Y       3185
Quota Daemon on s05-stg                     N/A       N/A        Y       3267
Bitrot Daemon on s05-stg                    N/A       N/A        Y       3325
Scrubber Daemon on s05-stg                  N/A       N/A        Y       3377
Task Status of Volume tier2
------------------------------------------------------------------------------
Task : Rebalance
ID : c43e3600-79bc-4d06-801b-0d756ec980e8
Status : in progress
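The errors in the s04 rebalance log quoted further down in this thread name subvolumes such as tier2-disperse-7 rather than bricks. To map those names back to bricks, one can group the brick list above into disperse sets. A minimal sketch, assuming a disperse count of 6 (4 data + 2 redundancy, inferred from the "need 4" in that log, not confirmed) and that `gluster volume status` lists bricks in the same order as `gluster volume info`; both assumptions should be checked with `gluster volume info tier2` before trusting the mapping. The helper names are hypothetical, not part of any Gluster tool.

```python
import re

# One brick row of `gluster volume status`, e.g.
# "Brick s04-stg:/gluster/mnt1/brick  49152  0  Y  3408"
# Offline bricks show "N/A N/A N N/A", which this pattern also accepts.
BRICK_ROW = re.compile(r"^Brick\s+(\S+)\s+(\S+)\s+(\S+)\s+([YN])\s+(\S+)")


def parse_bricks(status_text):
    """Return [(brick, online), ...] in the order the bricks are listed."""
    bricks = []
    for line in status_text.splitlines():
        m = BRICK_ROW.match(line.strip())
        if m:
            bricks.append((m.group(1), m.group(4) == "Y"))
    return bricks


def group_subvolumes(bricks, volname, disperse_count=6):
    """Group consecutive bricks into <volname>-disperse-N sets.

    ASSUMPTIONS (verify with `gluster volume info`): disperse count 6 and
    bricks listed in volume-info order.
    """
    groups = {}
    for i, (brick, online) in enumerate(bricks):
        name = f"{volname}-disperse-{i // disperse_count}"
        groups.setdefault(name, []).append((brick, online))
    return groups


def offline(bricks):
    """Bricks whose Online column is not 'Y'."""
    return [brick for brick, online in bricks if not online]
```

Under those assumptions, the 72 bricks above form 12 sets: tier2-disperse-0 to -5 on s01/s02/s03 and tier2-disperse-6 to -11 on s04/s05/s06, with tier2-disperse-6 being mnt1 and mnt2 of the three new servers.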
GLUSTER V HEAL TIER2 INFO:
[root@s03 ~]# gluster v heal tier2 info
Brick s01-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s01-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
Brick s02-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
Brick s03-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt1/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt2/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt3/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt4/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt5/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt6/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt7/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt8/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt9/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt10/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt11/brick
Status: Connected
Number of entries: 0
Brick s04-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
Brick s05-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
Brick s06-stg:/gluster/mnt12/brick
Status: Connected
Number of entries: 0
At the moment, the file system is not mounted.
If it is not dangerous, I can mount it and run some tests.
Question: some users need to access the Gluster storage. In your opinion, can I give them access to the storage while the rebalance is running?
If it is not possible, no problem. It is only a simple question.
>>>> From the EC point of view, heal info shows that the volume is healthy.
You can mount and use the volume without any issue even while the rebalance is running.
If I/O on the mount point is heavy, it may slow down the rebalance process, but it will still work
without any issue (this is expected).
So go ahead and start using the volume.
Thanks,
Mauro
On 9 Oct 2018, at 18:39, Ashish Pandey < aspandey at redhat.com > wrote:
Hi Mauro,
I looked into the getfattr output you provided and found that the versions on the root of the bricks are all fine.
I would recommend waiting for the rebalance to complete.
Please keep posting the output of the following:
1 - re-balance status
2 - gluster volume status
3 - gluster v heal <volname> info
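Collecting those three outputs periodically is easy to script. A minimal sketch, assuming Python 3.7+ on a node of the trusted pool with the gluster CLI in PATH; the helper names and file naming are arbitrary choices, not anything Gluster provides. `gluster volume heal tier2 info` is the unabbreviated form of `gluster v heal tier2 info`, and `gluster volume status tier2` limits the output to this volume.

```python
import datetime
import subprocess
from pathlib import Path


def status_commands(volname):
    """The three outputs requested above, as argv lists keyed by a label."""
    return {
        "rebalance-status": ["gluster", "volume", "rebalance", volname, "status"],
        "volume-status": ["gluster", "volume", "status", volname],
        "heal-info": ["gluster", "volume", "heal", volname, "info"],
    }


def collect(volname, outdir="."):
    """Run each command once and save stdout+stderr to a timestamped file.

    Assumes it runs on a node of the trusted pool with the gluster CLI in PATH.
    Returns the list of files written.
    """
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(outdir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for label, argv in status_commands(volname).items():
        result = subprocess.run(argv, capture_output=True, text=True)
        path = target / f"{volname}-{label}-{stamp}.txt"
        path.write_text(result.stdout + result.stderr)
        written.append(path)
    return written
```

Calling `collect("tier2", "/root/tier2-reports")` (directory name hypothetical) from cron or by hand would leave three files per run that can be attached to the thread.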
Are you able to access the files/dirs from the mount point?
Let's try to find the issues one by one.
---
Ashish
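Ashish's check of the versions on the brick roots can be approximated by parsing the getfattr dumps and comparing one extended attribute across the bricks of a disperse set. This is only a sketch of the idea, not necessarily how he checked it: it assumes the `# file:` / `name=0x...` layout that `getfattr -d -e hex` prints, and uses trusted.ec.version (the disperse translator's version xattr) as the example attribute. Whether a given mismatch matters is for the EC developers to judge; the helper names are hypothetical.

```python
def parse_getfattr(text):
    """Parse `getfattr -d -e hex` output into {path: {xattr: value}}."""
    result, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# file: "):
            current = line[len("# file: "):]
            result[current] = {}
        elif current is not None and "=" in line and not line.startswith("#"):
            key, value = line.split("=", 1)
            result[current][key] = value
    return result


def compare_xattr(bricks, name):
    """Group bricks by the value of one xattr.

    bricks: {brick_label: {xattr: value}}. More than one key in the result
    means the bricks disagree; the None key collects bricks missing the xattr.
    """
    groups = {}
    for brick, xattrs in bricks.items():
        groups.setdefault(xattrs.get(name), []).append(brick)
    return groups
```

Because every node prints the same `gluster/mntN/brick` paths, the per-node dumps have to be keyed by node (for example "s04-stg:/gluster/mnt1/brick") before the six bricks of one set are compared.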
----- Original Message -----
From: "Mauro Tridici" < mauro.tridici at cmcc.it >
To: "Ashish Pandey" < aspandey at redhat.com >
Cc: "gluster-users" < gluster-users at gluster.org >
Sent: Monday, October 8, 2018 3:33:21 PM
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Hi Ashish,
the rebalance is still running. It has moved about 49.3 TB of an estimated 78 TB.
The initial amount of data stored on the s01, s02 and s03 servers was about 156 TB, so I expect about half of it (78 TB) to be moved to the 3 new servers (s04, s05 and s06).
[root@s01 status]# gluster volume rebalance tier2 status
Node       Rebalanced-files     size  scanned  failures  skipped       status  run time in h:m:s
---------  ----------------  -------  -------  --------  -------  -----------  -----------------
localhost            553955   20.3TB  2356786         0    61942  in progress           57:42:14
s02-stg              293175   14.8TB   976960         0    30489  in progress           57:42:15
s03-stg              293758   14.2TB   990518         0    30464  in progress           57:42:15
s04-stg                   0   0Bytes        0         0        0       failed            0:00:37
s05-stg                   0   0Bytes        0         0        0    completed           48:33:03
s06-stg                   0   0Bytes        0         0        0    completed           48:33:02
Estimated time left for rebalance to complete : 981:23:02
volume rebalance: tier2: success
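The 49.3 TB figure is the sum of the size column for localhost (s01), s02-stg and s03-stg. A minimal sketch that parses the status table and compares the total with Mauro's 78 TB estimate; it assumes 1024-based units in the size column (only relevant if rows mix units; only TB and Bytes appear here) and treats the estimate, not anything Gluster reports, as 100%. The helper names are hypothetical.

```python
import re

# Size units as printed in the "size" column; 1024-based (assumption).
UNITS = {"Bytes": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3,
         "TB": 1024**4, "PB": 1024**5}

ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<files>\d+)\s+"
    r"(?P<size>[\d.]+)(?P<unit>Bytes|KB|MB|GB|TB|PB)\s+"
    r"(?P<scanned>\d+)\s+(?P<failures>\d+)\s+(?P<skipped>\d+)\s+"
    r"(?P<status>.+?)\s+(?P<runtime>\d+:\d{2}:\d{2})$"
)


def parse_rebalance_status(text):
    """Parse data rows of `gluster volume rebalance <vol> status`; the
    header, separator and summary lines do not match and are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "node": m["node"],
                "files": int(m["files"]),
                "size_tb": float(m["size"]) * UNITS[m["unit"]] / UNITS["TB"],
                "failures": int(m["failures"]),
                "status": m["status"],
                "runtime": m["runtime"],
            })
    return rows


def progress(rows, expected_tb):
    """Total size migrated so far (TB) and its share of an expected total."""
    moved = sum(r["size_tb"] for r in rows)
    return moved, moved / expected_tb


def failed_nodes(rows):
    """Nodes whose rebalance process reports 'failed'."""
    return [r["node"] for r in rows if r["status"] == "failed"]
```

On the table above this gives about 49.3 TB, roughly 63% of 78 TB, and flags s04-stg as failed. Applied to the 10 October table at the top of the thread, the same sum is 77.0 TB.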
You will find the requested outputs attached.
Thank you,
Mauro
<blockquote>
On 8 Oct 2018, at 11:44, Ashish Pandey < aspandey at redhat.com > wrote:
Hi Mauro,
What is the status of the rebalance now?
Could you please provide the output of the following command for all the bricks:
getfattr -m. -d -e hex <root path of the brick>
You have to go to every node and run the above command for every brick on that node.
Example: on s01
getfattr -m. -d -e hex /gluster/mnt1/brick
Keep the output from each node in a separate file so that it is easy to analyze.
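Running that command by hand for 12 bricks on each of 6 nodes is tedious. A minimal sketch that loops over the brick roots and keeps one file per node, as requested; it assumes the /gluster/mnt1/brick ... /gluster/mnt12/brick layout shown in the volume status output, that it runs as root (trusted.* xattrs are only visible to root), and that getfattr from the attr package is installed. The helper names and output file name are hypothetical.

```python
import socket
import subprocess


def brick_roots(count=12):
    """Brick roots as laid out on this cluster: /gluster/mnt1/brick ... mnt<count>."""
    return [f"/gluster/mnt{i}/brick" for i in range(1, count + 1)]


def dump_brick_xattrs(outfile=None, count=12):
    """Run `getfattr -m. -d -e hex` on every local brick root and write the
    combined output for this node into a single file (one file per node)."""
    outfile = outfile or f"{socket.gethostname()}-brick-xattrs.txt"
    with open(outfile, "w") as fh:
        for root in brick_roots(count):
            result = subprocess.run(
                ["getfattr", "-m.", "-d", "-e", "hex", root],
                capture_output=True, text=True,
            )
            fh.write(result.stdout)
            if result.returncode != 0:
                # Keep failures visible in the file instead of silently skipping.
                fh.write(f"# getfattr failed on {root}: {result.stderr.strip()}\n")
            fh.write("\n")
    return outfile
```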
---
Ashish
----- Original Message -----
From: "Mauro Tridici" < mauro.tridici at cmcc.it >
To: "Nithya Balachandran" < nbalacha at redhat.com >
Cc: "gluster-users" < gluster-users at gluster.org >
Sent: Monday, October 8, 2018 2:27:35 PM
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version
Hi Nithya,
thank you, my answers are inline.
<blockquote>
On 8 Oct 2018, at 10:43, Nithya Balachandran < nbalacha at redhat.com > wrote:
Hi Mauro,
Yes, a rebalance consists of 2 operations for every directory:
1. Fix the layout for the new volume config (newly added or removed bricks)
2. Migrate files to their new hashed subvols based on the new layout
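Step 1 also explains why roughly half of the data is expected to move here. DHT hashes each file name into a 32-bit space and each subvolume owns one range of it; a file migrates when its hash lands in a range now owned by a different subvolume. A simplified model, assuming equal-sized ranges (real layouts may be weighted by brick size), uniformly distributed name hashes, and 6 to 12 disperse subvolumes as this volume appears to have: if the new ranges are handed out so that they overlap the old ones as much as possible, half of the hash space keeps its owner, whereas a naive in-order assignment would keep only 1/12. The greedy assignment below is an illustration, not Gluster's actual layout code.

```python
from fractions import Fraction

ZERO = Fraction(0)


def equal_ranges(n):
    """Split the (normalised) hash space [0, 1) into n equal contiguous ranges."""
    return [(Fraction(i, n), Fraction(i + 1, n)) for i in range(n)]


def overlap(a, b):
    """Length of the intersection of two half-open ranges."""
    return max(min(a[1], b[1]) - max(a[0], b[0]), ZERO)


def naive_assignment(subvols):
    """Give range j to the j-th subvolume, ignoring the old layout."""
    return {s: j for j, s in enumerate(subvols)}


def greedy_max_overlap(old_layout, subvols, new_ranges):
    """Hand out new ranges largest-overlap-first, one range per subvolume."""
    empty = (ZERO, ZERO)
    pairs = sorted(
        ((overlap(old_layout.get(s, empty), r), s, j)
         for s in subvols for j, r in enumerate(new_ranges)),
        reverse=True,
    )
    assignment, used = {}, set()
    for _, s, j in pairs:
        if s not in assignment and j not in used:
            assignment[s] = j
            used.add(j)
    return assignment


def fraction_kept(old_layout, new_ranges, assignment):
    """Share of the hash space whose owning subvolume does not change."""
    return sum((overlap(old_layout[s], new_ranges[j])
                for s, j in assignment.items() if s in old_layout), ZERO)
```

With 12 subvolumes of which the first 6 held the old layout, the greedy assignment keeps 1/2 of the hash space in place, which matches the "half of 156 TB" estimate made later in the thread.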
Are you running a rebalance because you added new bricks to the volume? According to an earlier email, you have already run a fix-layout.
Yes, we added new bricks to the volume and we had already run a fix-layout before.
On s04, please check the rebalance log file to see why the rebalance failed.
On s04, the rebalance failed after the following errors (no errors appear before these lines):
[2018-10-06 00:13:37.359634] I [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize] 0-tier2-dht: Found anomalies in / (gfid = 00000000-0000-0000-0000-000000000001). Holes=2 overlaps=0
[2018-10-06 00:13:37.362424] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.362504] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.362525] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.363105] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.363163] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.363180] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.364920] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.364969] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.364985] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.366864] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.366912] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.366926] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.374818] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.374866] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.374879] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:37.406076] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:37.406145] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:37.406183] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.039835] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.039911] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-11, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.039944] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.039958] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.040441] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.040480] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-7, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.040518] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.040534] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.061789] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.061830] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-9, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.061859] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.061873] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.062283] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.062323] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-8, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.062353] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.062367] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.064613] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.064655] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-6, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.064685] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.064700] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:51.064727] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:51.064766] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-10, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:13:51.064794] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:51.064815] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.695948] I [dht-rebalance.c:4512:gf_defrag_start_crawl] 0-tier2-dht: gf_defrag_start_crawl using commit hash 3720343841
[2018-10-06 00:13:53.696837] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.696906] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.696924] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.697549] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.697599] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.697620] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.704120] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.704262] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.704342] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.707260] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.707312] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.707329] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:13:53.718301] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:13:53.718350] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:53.718367] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:13:55.626130] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:13:55.626207] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:13:55.626228] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:13:55.626231] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-tier2-dht: fixing the layout of /
[2018-10-06 00:13:55.862374] I [dht-rebalance.c:5063:gf_defrag_status_get] 0-glusterfs: TIME: Estimated total time to complete (size)= 0 seconds, seconds left = 0
[2018-10-06 00:13:55.862440] I [MSGID: 109028] [dht-rebalance.c:5143:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 20.00 secs
[2018-10-06 00:13:55.862460] I [MSGID: 109028] [dht-rebalance.c:5147:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2018-10-06 00:14:12.476927] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477020] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-11, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477077] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477094] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.477644] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477695] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-7, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477726] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477740] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.477853] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.477894] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-8, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.477923] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.477937] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.486862] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.486902] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-6, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.486929] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.486944] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.493872] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.493912] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-10, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.493939] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.493954] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.494560] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.494598] E [MSGID: 109006] [dht-selfheal.c:673:dht_selfheal_dir_xattr_cbk] 0-tier2-dht: layout setxattr failed on tier2-disperse-9, path:/ gfid:00000000-0000-0000-0000-000000000001 [Input/output error]
[2018-10-06 00:14:12.494624] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-06 00:14:12.494640] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-06 00:14:12.795320] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.795366] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.795796] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.795834] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.804770] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.804803] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.804811] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.804850] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.808500] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.808563] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.812431] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-06 00:14:12.812468] E [MSGID: 109039] [dht-common.c:3113:dht_find_local_subvol_cbk] 0-tier2-dht: getxattr err for dir [Input/output error]
[2018-10-06 00:14:12.812497] E [MSGID: 0] [dht-rebalance.c:4336:dht_get_local_subvols_and_nodeuuids] 0-tier2-dht: local subvolume determination failed with error: 5 [Input/output error]
[2018-10-06 00:14:12.812700] I [MSGID: 109028] [dht-rebalance.c:5143:gf_defrag_status_get] 0-tier2-dht: Rebalance is failed. Time taken is 37.00 secs
[2018-10-06 00:14:12.812720] I [MSGID: 109028] [dht-rebalance.c:5147:gf_defrag_status_get] 0-tier2-dht: Files migrated: 0, size: 0, lookups: 0, failures: 0, skipped: 0
[2018-10-06 00:14:12.812870] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7efe75d18e25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5623973d64b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5623973d632b] ) 0-: received signum (15), shutting down
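Read in order, the s04 log lines above appear to form a chain. The disperse subvolumes report that no bricks are reachable. DHT then fails to read or set the layout xattrs on /. Next, dht_get_local_subvols_and_nodeuuids fails with error 5 (EIO). Finally, the rebalance process shuts down after 37 seconds.

To follow a chain like this through a long log, a minimal Python sketch can split each line into fields. The regular expression is inferred only from the sample lines in this thread, not from a documented glusterfs log format. parse_line and errors_by_function are hypothetical names.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional

# Pattern inferred from the sample lines in this thread (an assumption,
# not a documented format):
#   [timestamp] LEVEL [MSGID: n] [file.c:line:function] xlator: message
# The [MSGID: n] part is optional, as in the gf_defrag_start_crawl line.
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<xlator>\S+?):\s+(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str             # single letter: I, W, E, ...
    msgid: Optional[str]   # None when the line has no [MSGID: n]
    function: str          # e.g. ec_child_select
    xlator: str            # e.g. 0-tier2-disperse-10
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the fields of one log line, or None if the line does not fit
    the pattern (for example the cleanup_and_exit backtrace line)."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["ts"], m["level"], m["msgid"], m["func"],
                    m["xlator"], m["msg"])


def errors_by_function(lines: Iterable[str]) -> Dict[str, int]:
    """Count E-level entries per source function."""
    counts: Dict[str, int] = {}
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "E":
            counts[entry.function] = counts.get(entry.function, 0) + 1
    return counts
```

For example, errors_by_function(open("tier2-rebalance.log")) would give a per-function count of E-level entries for the attached log.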
Regards,
Mauro
<blockquote>
Regards,
Nithya
On 8 October 2018 at 13:22, Mauro Tridici <mauro.tridici at cmcc.it> wrote:
<blockquote>
Hi All,
For your information, this is the current rebalance status:
[root at s01 ~]# gluster volume rebalance tier2 status
Node       Rebalanced-files      size    scanned  failures  skipped       status  run time in h:m:s
---------  ----------------  --------  ---------  --------  -------  -----------  -----------------
localhost            551922    20.3TB    2349397         0    61849  in progress           55:25:38
s02-stg              287631    13.2TB     959954         0    30262  in progress           55:25:39
s03-stg              288523    12.7TB     973111         0    30220  in progress           55:25:39
s04-stg                   0    0Bytes          0         0        0       failed            0:00:37
s05-stg                   0    0Bytes          0         0        0    completed           48:33:03
s06-stg                   0    0Bytes          0         0        0    completed           48:33:02
Estimated time left for rebalance to complete : 1023:49:56
volume rebalance: tier2: success
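Because the status output is whitespace-separated, per-node progress can be totalled mechanically. The sketch below assumes the column order shown in the table: node, rebalanced files, size, scanned, failures, skipped, status, run time. It treats everything between the skipped count and the run time as the status, so that "in progress" stays one value. parse_status and summarize are hypothetical helpers, and SAMPLE copies the rows above.

```python
from typing import Dict, List, Tuple


def parse_status(text: str) -> List[Dict[str, object]]:
    """Parse rows of 'gluster volume rebalance <vol> status' output.

    Assumes the column order shown in this thread. Header, separator and
    summary lines are skipped because their second token is not a number.
    """
    rows: List[Dict[str, object]] = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8 or not parts[1].isdigit():
            continue
        rows.append({
            "node": parts[0],
            "files": int(parts[1]),
            "size": parts[2],
            "scanned": int(parts[3]),
            "failures": int(parts[4]),
            "skipped": int(parts[5]),
            # status may be one word ("failed") or two ("in progress")
            "status": " ".join(parts[6:-1]),
            "runtime": parts[-1],
        })
    return rows


def summarize(rows: List[Dict[str, object]]) -> Tuple[int, List[str]]:
    """Total rebalanced files and the list of nodes whose status is failed."""
    total = sum(int(r["files"]) for r in rows)
    failed = [str(r["node"]) for r in rows if r["status"] == "failed"]
    return total, failed


SAMPLE = """\
localhost 551922 20.3TB 2349397 0 61849 in progress 55:25:38
s02-stg 287631 13.2TB 959954 0 30262 in progress 55:25:39
s03-stg 288523 12.7TB 973111 0 30220 in progress 55:25:39
s04-stg 0 0Bytes 0 0 0 failed 0:00:37
s05-stg 0 0Bytes 0 0 0 completed 48:33:03
s06-stg 0 0Bytes 0 0 0 completed 48:33:02
"""

if __name__ == "__main__":
    total, failed = summarize(parse_status(SAMPLE))
    print(f"rebalanced files: {total}, failed nodes: {failed}")
    # rebalanced files: 1128076, failed nodes: ['s04-stg']
```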
Rebalance is migrating files onto the s05 and s06 servers, and onto s04 too (although s04 is marked as failed).
The s05 and s06 rebalance tasks have completed.
Questions:
1) It seems that rebalance is moving files but also fixing the layout. Is this normal?
2) When the rebalance completes, what do we need to do before returning the Gluster storage to the users? Do we have to launch the rebalance again in order to involve the s04 server too, or run a fix-layout to fix any errors on s04?
Thank you very much,
Mauro
<blockquote>
On 7 October 2018 at 10:29, Mauro Tridici <mauro.tridici at cmcc.it> wrote:
Hi All,
Here are some important updates about the issue described below.
After rebalance failed on all the servers, I decided to:
- stop gluster volume
- reboot the servers
- start gluster volume
- change some gluster volume options
- start the rebalance again
After reading some threads on the gluster-users mailing list, I changed the options listed below:
BEFORE CHANGE:
gluster volume set tier2 network.ping-timeout 02
gluster volume set all cluster.brick-multiplex on
gluster volume set tier2 cluster.server-quorum-ratio 51%
gluster volume set tier2 cluster.server-quorum-type server
gluster volume set tier2 cluster.quorum-type auto
AFTER CHANGE:
gluster volume set tier2 network.ping-timeout 42
gluster volume set all cluster.brick-multiplex off
gluster volume set tier2 cluster.server-quorum-ratio none
gluster volume set tier2 cluster.server-quorum-type none
gluster volume set tier2 cluster.quorum-type none
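The BEFORE and AFTER command lists can be compared option by option. The sketch below assumes every line has the form gluster volume set <volume|all> <option> <value>. parse_set_commands and diff_settings are hypothetical names. The script only parses the text and does not call gluster.

```python
import shlex
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (volume name or "all", option name)


def parse_set_commands(text: str) -> Dict[Key, str]:
    """Collect 'gluster volume set <volume|all> <option> <value>' lines."""
    settings: Dict[Key, str] = {}
    for line in text.splitlines():
        tokens = shlex.split(line)
        if len(tokens) == 6 and tokens[:3] == ["gluster", "volume", "set"]:
            scope, option, value = tokens[3:]
            settings[(scope, option)] = value
    return settings


def diff_settings(before: Dict[Key, str], after: Dict[Key, str]
                  ) -> Dict[Key, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (old, new)} for every option whose value changed."""
    return {key: (before.get(key), after.get(key))
            for key in sorted(before.keys() | after.keys())
            if before.get(key) != after.get(key)}


BEFORE = """\
gluster volume set tier2 network.ping-timeout 02
gluster volume set all cluster.brick-multiplex on
gluster volume set tier2 cluster.server-quorum-ratio 51%
gluster volume set tier2 cluster.server-quorum-type server
gluster volume set tier2 cluster.quorum-type auto
"""

AFTER = """\
gluster volume set tier2 network.ping-timeout 42
gluster volume set all cluster.brick-multiplex off
gluster volume set tier2 cluster.server-quorum-ratio none
gluster volume set tier2 cluster.server-quorum-type none
gluster volume set tier2 cluster.quorum-type none
"""

changes = diff_settings(parse_set_commands(BEFORE), parse_set_commands(AFTER))
for (scope, option), (old, new) in changes.items():
    print(f"{scope:5} {option:30} {old} -> {new}")
```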
As a result, the rebalance started moving data from the s01, s02 and s03 servers to the s05 and s06 servers (the newly added ones), but it failed on the s04 server after 37 seconds.
The rebalance is still running and moving data as you can see from the output:
[root at s01 ~]# gluster volume rebalance tier2 status
Node       Rebalanced-files      size    scanned  failures  skipped       status  run time in h:m:s
---------  ----------------  --------  ---------  --------  -------  -----------  -----------------
localhost            286680    12.6TB    1217960         0    43343  in progress           32:10:24
s02-stg              126291    12.4TB     413077         0    21932  in progress           32:10:25
s03-stg              126516    11.9TB     433014         0    21870  in progress           32:10:25
s04-stg                   0    0Bytes          0         0        0       failed            0:00:37
s05-stg                   0    0Bytes          0         0        0  in progress           32:10:25
s06-stg                   0    0Bytes          0         0        0  in progress           32:10:25
Estimated time left for rebalance to complete : 624:47:48
volume rebalance: tier2: success
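To compare this 7 October snapshot with the 8 October one quoted earlier, the run times first need to be converted to seconds. The minimal sketch below assumes that the hours field can exceed 24 (as in the 1023:49:56 estimate) and that seconds may be unpadded (as in the fix-layout time 12:11:6). hms_to_seconds and files_per_hour are hypothetical helpers.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a rebalance run time such as '32:10:24' to seconds.

    Hours are not limited to 24 and seconds may be unpadded ('12:11:6')."""
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def files_per_hour(files_before: int, runtime_before: str,
                   files_after: int, runtime_after: str) -> float:
    """Average migration rate of one node between two status snapshots
    taken during the same rebalance run."""
    elapsed = hms_to_seconds(runtime_after) - hms_to_seconds(runtime_before)
    if elapsed <= 0:
        raise ValueError("the second snapshot must be later than the first")
    return (files_after - files_before) * 3600 / elapsed


# localhost (s01): 7 October snapshot vs the 8 October snapshot
print(round(files_per_hour(286680, "32:10:24", 551922, "55:25:38")))  # 11406
```

For localhost (s01), this gives roughly 11,400 files per hour between the two snapshots.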
When the rebalance completes, we plan to re-launch it to try to involve the s04 server as well.
Do you have any idea what happened in the situation described in my previous message, and why the rebalance is now running even though it does not involve the s04 server?
Attached is the complete tier2-rebalance.log file from the s04 server.
Thank you very much for your help,
Mauro
<tier2-rebalance.log.gz>
<blockquote>
On 6 October 2018 at 02:01, Mauro Tridici <mauro.tridici at cmcc.it> wrote:
Hi All,
Since we need to restore the Gluster storage as soon as possible, we decided to accept that a few files might be lost and to go ahead.
So we deleted all the brick contents on servers s04, s05 and s06.
As planned some days ago, we executed the following commands:
gluster peer detach s04
gluster peer detach s05
gluster peer detach s06
gluster peer probe s04
gluster peer probe s05
gluster peer probe s06
gluster volume add-brick tier2 s04-stg:/gluster/mnt1/brick s05-stg:/gluster/mnt1/brick s06-stg:/gluster/mnt1/brick s04-stg:/gluster/mnt2/brick s05-stg:/gluster/mnt2/brick s06-stg:/gluster/mnt2/brick s04-stg:/gluster/mnt3/brick s05-stg:/gluster/mnt3/brick s06-stg:/gluster/mnt3/brick s04-stg:/gluster/mnt4/brick s05-stg:/gluster/mnt4/brick s06-stg:/gluster/mnt4/brick s04-stg:/gluster/mnt5/brick s05-stg:/gluster/mnt5/brick s06-stg:/gluster/mnt5/brick s04-stg:/gluster/mnt6/brick s05-stg:/gluster/mnt6/brick s06-stg:/gluster/mnt6/brick s04-stg:/gluster/mnt7/brick s05-stg:/gluster/mnt7/brick s06-stg:/gluster/mnt7/brick s04-stg:/gluster/mnt8/brick s05-stg:/gluster/mnt8/brick s06-stg:/gluster/mnt8/brick s04-stg:/gluster/mnt9/brick s05-stg:/gluster/mnt9/brick s06-stg:/gluster/mnt9/brick s04-stg:/gluster/mnt10/brick s05-stg:/gluster/mnt10/brick s06-stg:/gluster/mnt10/brick s04-stg:/gluster/mnt11/brick s05-stg:/gluster/mnt11/brick s06-stg:/gluster/mnt11/brick s04-stg:/gluster/mnt12/brick s05-stg:/gluster/mnt12/brick s06-stg:/gluster/mnt12/brick force
gluster volume rebalance tier2 fix-layout start
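The add-brick argument list above follows a regular pattern: for each mount point mnt1 to mnt12, it adds one brick on each of s04, s05 and s06. That order decides which bricks end up in the same disperse set. The sketch below rebuilds the same command string and groups the resulting 72-brick list. It assumes, as is usual for distributed-disperse volumes, that each run of 6 consecutive bricks forms one 4 + 2 set and that the sets are named tier2-disperse-0 to tier2-disperse-11 in order.

Under that assumption, tier2-disperse-6 to tier2-disperse-11 (the subvolumes named in the rebalance errors) are exactly the sets built from s04, s05 and s06, with 2 bricks on each server. A set stays usable only while at least 4 of its 6 bricks are reachable. brick_list, disperse_sets and reachable are hypothetical helpers. The script prints the command but never runs it.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def brick_list(servers: Iterable[str], mounts: Iterable[int]) -> List[str]:
    """Bricks ordered mount by mount, one per server within each mount,
    which is the order used in the add-brick command above."""
    names = list(servers)  # materialise so it can be reused for every mount
    return [f"{s}:/gluster/mnt{m}/brick" for m in mounts for s in names]


def disperse_sets(bricks: List[str], set_size: int = 6) -> Dict[str, Counter]:
    """ASSUMED grouping: every run of set_size consecutive bricks forms one
    disperse set, named tier2-disperse-0, -1, ... in order. Each set maps
    to a Counter of how many of its bricks sit on each server."""
    if len(bricks) % set_size:
        raise ValueError("brick count is not a multiple of the set size")
    return {
        f"tier2-disperse-{i // set_size}":
            Counter(b.split(":")[0] for b in bricks[i:i + set_size])
        for i in range(0, len(bricks), set_size)
    }


def reachable(bricks_per_server: Counter, online: Set[str]) -> int:
    """Number of a set's bricks that sit on servers listed in `online`."""
    return sum(n for server, n in bricks_per_server.items() if server in online)


new_bricks = brick_list(["s04-stg", "s05-stg", "s06-stg"], range(1, 13))
command = "gluster volume add-brick tier2 " + " ".join(new_bricks) + " force"
print(command)  # printed for review only; nothing is executed

# Brick1..Brick72: the original 36 bricks, then the 36 added ones.
all_bricks = brick_list(["s01-stg", "s02-stg", "s03-stg"], range(1, 13)) + new_bricks
sets = disperse_sets(all_bricks)
print(sets["tier2-disperse-6"])
# A 4 + 2 set needs at least 4 of its 6 bricks reachable.
print(reachable(sets["tier2-disperse-6"], {"s05-stg", "s06-stg"}) >= 4)
```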
Everything seemed to be fine, and the fix-layout completed.
[root at s01 ~]# gluster volume rebalance tier2 status
Node                     status  run time in h:m:s
---------  --------------------  -----------------
localhost  fix-layout completed            12:11:6
s02-stg    fix-layout completed           12:11:18
s03-stg    fix-layout completed           12:11:12
s04-stg    fix-layout completed           12:11:20
s05-stg    fix-layout completed           12:11:14
s06-stg    fix-layout completed           12:10:47
volume rebalance: tier2: success
[root at s01 ~]# gluster volume info
Volume Name: tier2
Type: Distributed-Disperse
Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: s01-stg:/gluster/mnt1/brick
Brick2: s02-stg:/gluster/mnt1/brick
Brick3: s03-stg:/gluster/mnt1/brick
Brick4: s01-stg:/gluster/mnt2/brick
Brick5: s02-stg:/gluster/mnt2/brick
Brick6: s03-stg:/gluster/mnt2/brick
Brick7: s01-stg:/gluster/mnt3/brick
Brick8: s02-stg:/gluster/mnt3/brick
Brick9: s03-stg:/gluster/mnt3/brick
Brick10: s01-stg:/gluster/mnt4/brick
Brick11: s02-stg:/gluster/mnt4/brick
Brick12: s03-stg:/gluster/mnt4/brick
Brick13: s01-stg:/gluster/mnt5/brick
Brick14: s02-stg:/gluster/mnt5/brick
Brick15: s03-stg:/gluster/mnt5/brick
Brick16: s01-stg:/gluster/mnt6/brick
Brick17: s02-stg:/gluster/mnt6/brick
Brick18: s03-stg:/gluster/mnt6/brick
Brick19: s01-stg:/gluster/mnt7/brick
Brick20: s02-stg:/gluster/mnt7/brick
Brick21: s03-stg:/gluster/mnt7/brick
Brick22: s01-stg:/gluster/mnt8/brick
Brick23: s02-stg:/gluster/mnt8/brick
Brick24: s03-stg:/gluster/mnt8/brick
Brick25: s01-stg:/gluster/mnt9/brick
Brick26: s02-stg:/gluster/mnt9/brick
Brick27: s03-stg:/gluster/mnt9/brick
Brick28: s01-stg:/gluster/mnt10/brick
Brick29: s02-stg:/gluster/mnt10/brick
Brick30: s03-stg:/gluster/mnt10/brick
Brick31: s01-stg:/gluster/mnt11/brick
Brick32: s02-stg:/gluster/mnt11/brick
Brick33: s03-stg:/gluster/mnt11/brick
Brick34: s01-stg:/gluster/mnt12/brick
Brick35: s02-stg:/gluster/mnt12/brick
Brick36: s03-stg:/gluster/mnt12/brick
Brick37: s04-stg:/gluster/mnt1/brick
Brick38: s05-stg:/gluster/mnt1/brick
Brick39: s06-stg:/gluster/mnt1/brick
Brick40: s04-stg:/gluster/mnt2/brick
Brick41: s05-stg:/gluster/mnt2/brick
Brick42: s06-stg:/gluster/mnt2/brick
Brick43: s04-stg:/gluster/mnt3/brick
Brick44: s05-stg:/gluster/mnt3/brick
Brick45: s06-stg:/gluster/mnt3/brick
Brick46: s04-stg:/gluster/mnt4/brick
Brick47: s05-stg:/gluster/mnt4/brick
Brick48: s06-stg:/gluster/mnt4/brick
Brick49: s04-stg:/gluster/mnt5/brick
Brick50: s05-stg:/gluster/mnt5/brick
Brick51: s06-stg:/gluster/mnt5/brick
Brick52: s04-stg:/gluster/mnt6/brick
Brick53: s05-stg:/gluster/mnt6/brick
Brick54: s06-stg:/gluster/mnt6/brick
Brick55: s04-stg:/gluster/mnt7/brick
Brick56: s05-stg:/gluster/mnt7/brick
Brick57: s06-stg:/gluster/mnt7/brick
Brick58: s04-stg:/gluster/mnt8/brick
Brick59: s05-stg:/gluster/mnt8/brick
Brick60: s06-stg:/gluster/mnt8/brick
Brick61: s04-stg:/gluster/mnt9/brick
Brick62: s05-stg:/gluster/mnt9/brick
Brick63: s06-stg:/gluster/mnt9/brick
Brick64: s04-stg:/gluster/mnt10/brick
Brick65: s05-stg:/gluster/mnt10/brick
Brick66: s06-stg:/gluster/mnt10/brick
Brick67: s04-stg:/gluster/mnt11/brick
Brick68: s05-stg:/gluster/mnt11/brick
Brick69: s06-stg:/gluster/mnt11/brick
Brick70: s04-stg:/gluster/mnt12/brick
Brick71: s05-stg:/gluster/mnt12/brick
Brick72: s06-stg:/gluster/mnt12/brick
Options Reconfigured:
network.ping-timeout: 42
features.scrub: Active
features.bitrot: on
features.inode-quota: on
features.quota: on
performance.client-io-threads: on
cluster.min-free-disk: 10
cluster.quorum-type: none
transport.address-family: inet
nfs.disable: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
performance.parallel-readdir: off
cluster.readdir-optimize: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.io-cache: off
disperse.cpu-extensions: auto
performance.io-thread-count: 16
features.quota-deem-statfs: on
features.default-soft-limit: 90
cluster.server-quorum-type: none
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.brick-multiplex: off
cluster.server-quorum-ratio: 51%
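Checking these reconfigured options against the AFTER CHANGE commands shows whether each setting took effect. In this output, cluster.server-quorum-ratio still reads 51%, although the AFTER CHANGE list sets it to none. The sketch below only flags such differences and does not explain them. parse_options and mismatches are hypothetical helpers, and REPORTED is an excerpt of the options above.

```python
from typing import Dict, Optional, Tuple


def parse_options(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines such as those under 'Options Reconfigured'."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skips headings like 'Options Reconfigured:'
            options[key.strip()] = value.strip()
    return options


def mismatches(intended: Dict[str, str], reported: Dict[str, str]
               ) -> Dict[str, Tuple[str, Optional[str]]]:
    """Options whose reported value differs from the value that was set."""
    return {k: (v, reported.get(k)) for k, v in intended.items()
            if reported.get(k) != v}


# Values from the AFTER CHANGE commands.
INTENDED = {
    "network.ping-timeout": "42",
    "cluster.brick-multiplex": "off",
    "cluster.server-quorum-ratio": "none",
    "cluster.server-quorum-type": "none",
    "cluster.quorum-type": "none",
}

# Excerpt of the 'gluster volume info' output above.
REPORTED = parse_options("""\
network.ping-timeout: 42
cluster.quorum-type: none
cluster.server-quorum-type: none
cluster.brick-multiplex: off
cluster.server-quorum-ratio: 51%
""")

print(mismatches(INTENDED, REPORTED))
# {'cluster.server-quorum-ratio': ('none', '51%')}
```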
The last step was supposed to be the data rebalance across the servers, but the rebalance failed almost immediately with many errors like the following:
[2018-10-05 23:48:38.644978] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-tier2-client-70: Server lk version = 1
[2018-10-05 23:48:44.735323] I [dht-rebalance.c:4512:gf_defrag_start_crawl] 0-tier2-dht: gf_defrag_start_crawl using commit hash 3720331860
[2018-10-05 23:48:44.736205] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-7: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736266] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-7: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736282] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-7: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736377] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-8: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736436] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-8: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736459] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-8: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736460] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-10: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736537] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-9: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736571] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-10: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736574] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736604] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-9: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736604] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-10: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.736827] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-11: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.736887] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-11: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.736904] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-11: Failed to update version and size [Input/output error]
[2018-10-05 23:48:44.740337] W [MSGID: 122040] [ec-common.c:1097:ec_prepare_update_cbk] 0-tier2-disperse-6: Failed to get size and version [Input/output error]
[2018-10-05 23:48:44.740381] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-6: Insufficient available children for this request (have 0, need 4)
[2018-10-05 23:48:44.740394] E [MSGID: 122037] [ec-common.c:2040:ec_update_size_version_done] 0-tier2-disperse-6: Failed to update version and size [Input/output error]
[2018-10-05 23:48:50.066103] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-tier2-dht: fixing the layout of /
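Every ec_child_select error above reports "have 0, need 4". In other words, from this rebalance process none of the 6 bricks in the set was available, while 4 were required (the data-brick count of a 4 + 2 set). The affected sets are tier2-disperse-6 to tier2-disperse-11.

The sketch below extracts these counts per subvolume from a log, assuming the message text stays exactly as shown. CHILD_RE and insufficient_children are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Text of the ec_child_select error as it appears in this log.
CHILD_RE = re.compile(
    r"\] 0-(?P<subvol>[\w-]+): Insufficient available children for this "
    r"request \(have (?P<have>\d+), need (?P<need>\d+)\)"
)


def _numeric_suffix(name: str) -> Tuple[int, str]:
    """Sort key so tier2-disperse-6 comes before tier2-disperse-10."""
    tail = name.rsplit("-", 1)[-1]
    return (int(tail) if tail.isdigit() else -1, name)


def insufficient_children(lines: Iterable[str]) -> Dict[str, Tuple[int, int, int]]:
    """Map subvolume -> (occurrences, last 'have', last 'need')."""
    seen: Counter = Counter()
    last: Dict[str, Tuple[int, int]] = {}
    for line in lines:
        m = CHILD_RE.search(line)
        if m:
            seen[m["subvol"]] += 1
            last[m["subvol"]] = (int(m["have"]), int(m["need"]))
    return {s: (seen[s], *last[s]) for s in sorted(seen, key=_numeric_suffix)}
```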
Attached you can find the first logs captured during the rebalance run.
In your opinion, is there a way to restore the Gluster storage, or has all the data been lost?
Thank you in advance,
Mauro
<rebalance_log.txt>
<blockquote>
On 4 October 2018 at 15:31, Mauro Tridici <mauro.tridici at cmcc.it> wrote:
Hi Nithya,
Thank you very much.
This is the current “gluster volume info” output after removing the bricks (and after the peer detach command).
[root at s01 ~]# gluster volume info
Volume Name: tier2
Type: Distributed-Disperse
Volume ID: a28d88c5-3295-4e35-98d4-210b3af9358c
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x (4 + 2) = 36
Transport-type: tcp
Bricks:
Brick1: s01-stg:/gluster/mnt1/brick
Brick2: s02-stg:/gluster/mnt1/brick
Brick3: s03-stg:/gluster/mnt1/brick
Brick4: s01-stg:/gluster/mnt2/brick
Brick5: s02-stg:/gluster/mnt2/brick
Brick6: s03-stg:/gluster/mnt2/brick
Brick7: s01-stg:/gluster/mnt3/brick
Brick8: s02-stg:/gluster/mnt3/brick
Brick9: s03-stg:/gluster/mnt3/brick
Brick10: s01-stg:/gluster/mnt4/brick
Brick11: s02-stg:/gluster/mnt4/brick
Brick12: s03-stg:/gluster/mnt4/brick
Brick13: s01-stg:/gluster/mnt5/brick
Brick14: s02-stg:/gluster/mnt5/brick
Brick15: s03-stg:/gluster/mnt5/brick
Brick16: s01-stg:/gluster/mnt6/brick
Brick17: s02-stg:/gluster/mnt6/brick
Brick18: s03-stg:/gluster/mnt6/brick
Brick19: s01-stg:/gluster/mnt7/brick
Brick20: s02-stg:/gluster/mnt7/brick
Brick21: s03-stg:/gluster/mnt7/brick
Brick22: s01-stg:/gluster/mnt8/brick
Brick23: s02-stg:/gluster/mnt8/brick
Brick24: s03-stg:/gluster/mnt8/brick
Brick25: s01-stg:/gluster/mnt9/brick
Brick26: s02-stg:/gluster/mnt9/brick
Brick27: s03-stg:/gluster/mnt9/brick
Brick28: s01-stg:/gluster/mnt10/brick
Brick29: s02-stg:/gluster/mnt10/brick
Brick30: s03-stg:/gluster/mnt10/brick
Brick31: s01-stg:/gluster/mnt11/brick
Brick32: s02-stg:/gluster/mnt11/brick
Brick33: s03-stg:/gluster/mnt11/brick
Brick34: s01-stg:/gluster/mnt12/brick
Brick35: s02-stg:/gluster/mnt12/brick
Brick36: s03-stg:/gluster/mnt12/brick
Options Reconfigured:
network.ping-timeout: 0
features.scrub: Active
features.bitrot: on
features.inode-quota: on
features.quota: on
performance.client-io-threads: on
cluster.min-free-disk: 10
cluster.quorum-type: auto
transport.address-family: inet
nfs.disable: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
performance.parallel-readdir: off
cluster.readdir-optimize: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.io-cache: off
disperse.cpu-extensions: auto
performance.io-thread-count: 16
features.quota-deem-statfs: on
features.default-soft-limit: 90
cluster.server-quorum-type: server
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.brick-multiplex: on
cluster.server-quorum-ratio: 51%
Regards,
Mauro
<blockquote>
On 4 October 2018 at 15:22, Nithya Balachandran <nbalacha at redhat.com> wrote:
Hi Mauro,
The files on s04 and s05 can be deleted safely as long as those bricks have been removed from the volume and their brick processes are not running.
.glusterfs/indices/xattrop/xattrop-* are links to files that need to be healed.
.glusterfs/quarantine/stub-00000000-0000-0000-0000-000000000008 links to files that bitrot (if enabled) says are corrupted (none in this case).
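The description above can be turned into a quick per-brick check. The sketch below lists the entries in .glusterfs/indices/xattrop and .glusterfs/quarantine. It assumes (this is not stated in the thread) that each directory holds a base file, xattrop-* or stub-*, and that the remaining entries are the links to individual files. entries_except and brick_index_report are hypothetical helpers. The sketch only reads directory listings and modifies nothing.

```python
import os
from typing import Dict, List


def entries_except(dirpath: str, base_prefix: str) -> List[str]:
    """Names in dirpath that do not start with base_prefix ([] if missing)."""
    if not os.path.isdir(dirpath):
        return []
    return sorted(n for n in os.listdir(dirpath) if not n.startswith(base_prefix))


def brick_index_report(brick_root: str) -> Dict[str, List[str]]:
    """List the per-file entries of the two index directories of one brick."""
    glusterfs = os.path.join(brick_root, ".glusterfs")
    return {
        # assumed: entries other than the xattrop-* base file mark pending heals
        "pending_heal": entries_except(
            os.path.join(glusterfs, "indices", "xattrop"), "xattrop-"),
        # assumed: entries other than the stub-* file mark bitrot-flagged files
        "bitrot_flagged": entries_except(
            os.path.join(glusterfs, "quarantine"), "stub-"),
    }
```

For example, brick_index_report("/gluster/mnt1/brick") on s06 would list any such entries for that brick.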
I will get back to you on s06. Can you please provide the output of gluster volume info again?
Regards,
Nithya
On 4 October 2018 at 13:47, Mauro Tridici <mauro.tridici at cmcc.it> wrote:
<blockquote>
Dear Ashish, Dear Nithya,
I’m writing this message only to summarize and simplify the information about the “not migrated” files left on the removed bricks of servers s04, s05 and s06.
Attached are 3 files (1 per server) containing the lists of “not migrated” files and the related brick numbers.
In particular:
* s04 and s05 bricks contain only not migrated files in the hidden directories “/gluster/mnt#/brick/.glusterfs” (I could delete them, couldn’t I?)
* s06 bricks contain:
  * not migrated files in the hidden directories “/gluster/mnt#/brick/.glusterfs”;
  * not migrated files with size equal to 0;
  * not migrated files with size greater than 0.
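The three categories listed for s06 can be produced mechanically from a removed brick. The sketch below assumes it runs against a brick directory whose brick process is stopped. It walks the tree with os.walk, which by default does not descend into symlinked directories, and uses lstat so that symlinks are not followed. It then sorts files into the .glusterfs, zero-size and non-empty groups. classify_brick is a hypothetical helper.

```python
import os
from typing import Dict, List


def classify_brick(brick_root: str) -> Dict[str, List[str]]:
    """Sort the files under a brick into three groups (paths relative to it)."""
    groups: Dict[str, List[str]] = {"glusterfs": [], "empty": [], "nonempty": []}
    meta = ".glusterfs"
    for dirpath, _dirnames, filenames in os.walk(brick_root):
        rel_dir = os.path.relpath(dirpath, brick_root)
        in_meta = rel_dir == meta or rel_dir.startswith(meta + os.sep)
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, brick_root)
            if in_meta:
                groups["glusterfs"].append(rel)
            elif os.lstat(path).st_size == 0:  # lstat: do not follow symlinks
                groups["empty"].append(rel)
            else:
                groups["nonempty"].append(rel)
    for names in groups.values():
        names.sort()
    return groups
```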
I thought it would be useful to collect and summarize this information to simplify your analysis.
Thank you very much,
Mauro
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
</blockquote>
</blockquote>