[Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help

Fri Apr 28 09:04:40 UTC 2017

Dear Community,

I call for your wisdom, as it appears that googling for keywords doesn't help much.

I have a glusterfs volume with replica count 2, and I tried to perform the online upgrade procedure described in the docs (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/). Upgrading the first replica went almost fine; the only problem was that the self-heal procedure refused to complete until I commented out all IPv6 entries in /etc/hosts.
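
For illustration, commenting out the IPv6 entries can be sketched as below. This is a minimal sketch rather than the exact commands used: the function name comment_ipv6_hosts is made up for this example, and it assumes that any non-comment line whose first field contains a colon is an IPv6 entry. It prints the result instead of editing the file in place, so the output can be reviewed first.

```shell
# Hypothetical helper: print a hosts file with its IPv6 entries commented out.
# Assumption: a line is an IPv6 entry when its first field contains a colon.
comment_ipv6_hosts() {
    awk '
        # Leave existing comment lines untouched.
        /^[ \t]*#/ { print; next }
        # Prefix IPv6 entries with "#".
        $1 ~ /:/   { print "#" $0; next }
        # Everything else (IPv4 entries, blank lines) passes through.
        { print }
    ' "$1"
}
```

Usage would be along the lines of `comment_ipv6_hosts /etc/hosts > /tmp/hosts.new`, followed by a manual check before replacing the real file.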

Being sure that everything would work on the second replica much the same as it had on the first, I proceeded with the upgrade on the second replica (sst2). All of a sudden, it reported that it didn't see the first replica at all. The state before the upgrade was:

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49152     0          Y       3482 
Brick sst2:/var/glusterfs                   49152     0          Y       29863
NFS Server on localhost                     2049      0          Y       25175
Self-heal Daemon on localhost               N/A       N/A        Y       25283
NFS Server on sst0                          N/A       N/A        N       N/A
Self-heal Daemon on sst0                    N/A       N/A        Y       4827
NFS Server on sst1                          N/A       N/A        N       N/A
Self-heal Daemon on sst1                    N/A       N/A        Y       15009

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer in Cluster (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

sst2# gluster volume heal gv0 info
Brick sst0:/var/glusterfs
Number of entries: 0

Brick sst2:/var/glusterfs
Number of entries: 0

After the upgrade, it looked like this:

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A  
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on localhost                     N/A       N/A        N       N/A  

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Peer Rejected (Connected)

Probably my biggest mistake: at that point I googled, found this article (https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/) and followed its advice, removing everything under /var/lib/glusterd on sst2 except the glusterd.info file. As a result, the node predictably lost all information about the volume.
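
A minimal sketch of that cleanup step follows, with one precaution added that is not part of what was done above: the directory is archived before anything is deleted. The function name reset_glusterd_state and the example backup path are made up for this illustration. It assumes glusterd is already stopped, that the backup file lives outside the state directory, and that find supports -mindepth and -maxdepth (GNU and BSD find do).

```shell
# Hypothetical helper: archive a glusterd state directory, then remove
# everything in it except glusterd.info.
# Assumptions: glusterd is stopped; the backup path is outside the directory.
reset_glusterd_state() {
    state_dir=$1    # e.g. /var/lib/glusterd
    backup=$2       # e.g. /root/glusterd-backup.tar.gz (made-up path)

    # Archive first; stop if the backup cannot be written.
    tar -czf "$backup" -C "$state_dir" . || return 1

    # Delete every top-level entry except glusterd.info.
    find "$state_dir" -mindepth 1 -maxdepth 1 ! -name glusterd.info \
        -exec rm -rf {} +
}
```

With such an archive at hand, the volume definitions under vols/ could be restored or inspected later instead of being copied over from another node.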

sst2# gluster volume status
No volumes present

sst2# gluster peer status
Number of Peers: 2

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Accepted peer request (Connected)

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Accepted peer request (Connected)

Okay, I thought, this might be high time to re-add the brick. Not that easy, Jack:

sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
volume add-brick: failed: Operation failed

The reason seemed natural: sst0 still knows that there was a replica on sst2. What should I do then? At this point, I tried to recover the volume information on sst2 by taking it offline and copying all the volume info from sst0. Of course, it wasn't enough to just copy it as is, so I modified /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for the remote brick (sst0) and listen-port=49152 for the local brick (sst2). Unfortunately, it didn't help much. The final state I've reached is as follows:

sst2# gluster peer status
Number of Peers: 2

Hostname: sst1
Uuid: 5a2198de-f536-4328-a278-7f746f276e35
State: Sent and Received peer request (Connected)

Hostname: sst0
Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
State: Sent and Received peer request (Connected)

sst2# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
cluster.self-heal-daemon: enable
performance.readdir-ahead: on
storage.owner-uid: 1000
storage.owner-gid: 1000

sst2# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A  
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on localhost                     N/A       N/A        N       N/A  

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

Meanwhile, on sst0:

sst0# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sst0:/var/glusterfs
Brick2: sst2:/var/glusterfs
Options Reconfigured:
storage.owner-gid: 1000
storage.owner-uid: 1000
performance.readdir-ahead: on
cluster.self-heal-daemon: enable

sst0 ~ # gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sst0:/var/glusterfs                   49152     0          Y       31263
NFS Server on localhost                     N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       31254

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks
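
The peer states quoted throughout this message can be pulled out of the `gluster peer status` output mechanically, which makes it easier to compare them between nodes and between steps. The sketch below is only an illustration: the function name peer_states is made up, and it assumes the output keeps the "Hostname:" / "State:" layout shown above.

```shell
# Hypothetical helper: reduce `gluster peer status` output (read from stdin)
# to one "hostname<TAB>state" line per peer.
peer_states() {
    awk -F': ' '
        /^Hostname: / { host = $2 }              # remember the current peer
        /^State: /    { print host "\t" $2 }     # emit it with its state
    '
}
```

It would be used as `gluster peer status | peer_states` on each node.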

Any ideas on how to bring sst2 back to normal are appreciated. As a last resort, I can schedule downtime, back up the data, delete the volume and start all over, but I would like to know if there is a shorter path. Thank you very much in advance.

-- 
Best Regards,

Seva Gluschenko