[Bugs] [Bug 1264520] volume rebalance start is successful but status returns failed status

bugzilla at redhat.com
Fri Aug 19 05:18:15 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1264520

Nithya Balachandran <nbalacha at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |nbalacha at redhat.com
           Assignee|bugs at gluster.org            |nbalacha at redhat.com
              Flags|                            |needinfo?(shelsucker at hotmai
                   |                            |l.com)



--- Comment #10 from Nithya Balachandran <nbalacha at redhat.com> ---
My apologies for the extremely delayed response.

I went through the code. The glusterd process generates the volfiles based
on the information stored in /var/lib/glusterd/, and it looks like something
might be wrong there.

glusterd uses the information in the /var/lib/glusterd/vols/<volname>/bricks
directory to generate the client info portion of the client volfiles (this
includes any fuse client, the rebalance process, etc.).

For example, I have a volume called loop with 3 bricks.

Volume Name: loop
Type: Distribute
Volume ID: 68b941df-b656-4950-bcfa-bdd940b774a7
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 192.168.122.9:/bricks/brick2/b2
Brick2: 192.168.122.9:/bricks/brick1/b2
Brick3: 192.168.122.8:/bricks/brick2/b2
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
diagnostics.client-log-level: INFO


If I check the brick info stored in /var/lib/glusterd/vols/loop/bricks, I see 

-rw------- 1 root root 179 Aug 17 13:39 192.168.122.8:-bricks-brick2-b2
-rw------- 1 root root 175 Aug 17 13:39 192.168.122.9:-bricks-brick1-b2
-rw------- 1 root root 175 Aug 17 13:39 192.168.122.9:-bricks-brick2-b2


These files contain the information which is used to generate the volfiles.


[root@nb-rhs3-srv1 bricks]# cat 192.168.122.9:-bricks-brick2-b2
hostname=192.168.122.9
path=/bricks/brick2/b2
real_path=/bricks/brick2/b2
listen-port=0
rdma.listen-port=0
decommissioned=0
brick-id=loop-client-0   <--- client 0
mount_dir=/b2
snap-status=0


[root@nb-rhs3-srv1 bricks]# cat 192.168.122.9:-bricks-brick1-b2
hostname=192.168.122.9
path=/bricks/brick1/b2
real_path=/bricks/brick1/b2
listen-port=0
rdma.listen-port=0
decommissioned=0
brick-id=loop-client-1    <--- client 1
mount_dir=/b2
snap-status=0

[root@nb-rhs3-srv1 bricks]# cat 192.168.122.8:-bricks-brick2-b2
hostname=192.168.122.8
path=/bricks/brick2/b2
real_path=/bricks/brick2/b2
listen-port=49152
rdma.listen-port=0
decommissioned=0
brick-id=loop-client-2   <--- client 2
mount_dir=/b2
snap-status=0
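
The brick-id stored in each of these files becomes the name of a
protocol/client xlator in the generated client volfiles. As a rough,
hand-written sketch (not copied from this setup), the entry generated for
brick-id loop-client-0 above would look something like:

volume loop-client-0
    type protocol/client
    option transport-type tcp
    option remote-host 192.168.122.9
    option remote-subvolume /bricks/brick2/b2
end-volume

If two brick files carried the same brick-id, two client entries in the
generated volfile would end up with the same name.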


It sounds like the files in /var/lib/glusterd/vols/data/bricks for the
original 6 bricks have somehow ended up with the same brick-id.

We do not know why this could have happened. If you have any steps to reproduce
the issue, please let us know.

Can you please send across the contents of /var/lib/glusterd/vols/data on
the server so we can confirm this theory?
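
If it is easier, the brick-id lines alone should be enough to confirm or
rule this out. Something along these lines (just a sketch; the volume name
"data" and path are inferred from the volume info, so adjust if your layout
differs) will show whether any id repeats:

# grep -H '^brick-id=' /var/lib/glusterd/vols/data/bricks/*
# grep -h '^brick-id=' /var/lib/glusterd/vols/data/bricks/* | sort | uniq -c

Any count greater than 1 in the second command's output would mean two or
more brick files share the same brick-id.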

If this is the case, this problem will show up every time the volfiles are
generated (if you were to change an option or add/remove bricks, for
example). You will need to edit the files and correct the brick-ids in the
same order as listed in the gluster volume info (a quick way to verify the
result is sketched after the brick list below).


Brick1: gls-safran1:/gluster/bricks/brick1/data   <-- data-client-0
Brick2: gls-safran1:/gluster/bricks/brick2/data   <-- data-client-1
Brick3: gls-safran1:/gluster/bricks/brick3/data   <-- data-client-2
Brick4: gls-safran1:/gluster/bricks/brick4/data   <-- data-client-3
Brick5: gls-safran1:/gluster/bricks/brick5/data   <-- data-client-4
Brick6: gls-safran1:/gluster/bricks/brick6/data   <-- data-client-5
Brick7: gls-safran1:/gluster/bricks/brick7/data   <-- data-client-6
Brick8: gls-safran1:/gluster/bricks/brick8/data   <-- data-client-7
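
Once the files are edited, something like the following (again only a
sketch, under the same path assumption as above) can be used to double-check
that each brick path now carries the expected brick-id:

# grep -H -e '^path=' -e '^brick-id=' /var/lib/glusterd/vols/data/bricks/*

The path and brick-id pairs printed for each file should line up with the
mapping above.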

Please let me know if you have any questions.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

