[Bugs] [Bug 1687051] gluster volume heal failed when online upgrading from 3.12 to 5.x and when rolling back online upgrade from 4.1.4 to 3.12.15

bugzilla at redhat.com bugzilla at redhat.com
Sun Mar 24 03:55:36 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1687051



--- Comment #52 from Amgad <amgad.saleh at nokia.com> ---

Hi Sanju:

I did more testing to take a closer look, and here is a finer-grained
description of the behavior:

0) Starting with a 3-replica setup: gfs-1, gfs-2, and gfs-3new, all on 3.12.15

1) Replication and the "gluster volume heal <vol>" command were always
successful during the online upgrade from 3.12.15 to 5.5, at every step on
all three nodes.
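
   For reference, the kind of per-step check this refers to is roughly the
following (a minimal sketch; the volume names are the ones used in step 5
below):

        # run from any server after each node is upgraded; both the heal
        # trigger and the info query succeeded at every step of the upgrade
        for v in glustervol1 glustervol2 glustervol3; do
            gluster volume heal $v
            gluster volume heal $v info | grep "Number of entries"
        done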

2) While rolling back one node (gfs-1) to 3.12.15, I added 128 files to one
volume; the files were replicated between the gfs-2 and gfs-3new servers.
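
   (A minimal sketch of how the 128 test files were generated from a client
mount; the mount point /mnt/glustervol3 here is a hypothetical example, not
necessarily the path actually used.)

        # create test_file.0 .. test_file.127 on the mounted volume; with
        # gfs-1 mid-rollback they replicate only to gfs-2 and gfs-3new
        for i in $(seq 0 127); do
            echo "test data $i" > /mnt/glustervol3/test_file.$i
        done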

3) When the rollback of gfs-1 to 3.12.15 was complete (gfs-2 and gfs-3new
still on 5.5), the files did not replicate to gfs-1, and the "gluster volume
heal <vol>" command failed. NO bricks were offline (see the brick-status
sketch after the output below).

   "gluster volume heal <vol> info" showed "Number of entries: 129" (128 files
and a directory) on the bricks on gfs-2 and gfs-3new.

   ** Heal never succeeded, even after rebooting gfs-1.

        [root@gfs-1 ~]# gluster volume heal glustervol3 info
        Brick 10.76.153.206:/mnt/data3/3                ==> gfs-1
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3                ==> gfs-2
        /test_file.0 
        / 
        /test_file.1 
        /test_file.2 
        .......
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        Brick 10.76.153.207:/mnt/data3/3                ==> gfs-3new
        /test_file.0 
        / 
        /test_file.1 
        /test_file.2 
        /test_file.3 
        /test_file.4 
        .....
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        [root@gfs-1 ~]#
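
   For context, the "no bricks were offline" observation can be confirmed
with checks along these lines (a sketch; glustershd.log is the default
self-heal daemon log location):

        # confirm every brick and the Self-heal Daemon are listed as online
        gluster volume status glustervol3

        # look for self-heal daemon errors on each server
        tail -n 100 /var/log/glusterfs/glustershd.log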

4) When gfs-2 was rolled back to 3.12.15 (gfs-1 now on 3.12.15 and gfs-3new
still on 5.5), the moment glusterd started on gfs-2, replication and heal
started, and the "Number of entries:" counts dropped to 0 within about 8
seconds (a simple way to watch this is sketched after the output below).

        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        / - Possibly undergoing heal

        /test_file.1 
        /test_file.2 
        /test_file.3 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        /test_file.4 
        /test_file.5 
        /test_file.6 
        /test_file.7 
        /test_file.8 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 125
        ==============
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        /test_file.68 
        /test_file.69 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 61

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        /test_file.76 
        /test_file.77 
        /test_file.78 
        ..
        /test_file.122 
        /test_file.123 
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 53
        ==============
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        Status: Connected
        Number of entries: 1

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        Status: Connected
        Number of entries: 1
        ================
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.207:/mnt/data3/3
        Status: Connected
        Number of entries: 0
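
   A simple way to watch the pending-heal counts drain, as in the snapshots
above (a sketch; it can be run on any of the servers):

        # poll the per-brick pending-heal counts once per second
        watch -n 1 'gluster volume heal glustervol3 info | grep -E "Brick|Number of entries"'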

5) Although heal started once gfs-2 was rolled back to 3.12.15 (two nodes now
on 3.12.15), the "gluster volume heal <vol>" command remained continuously
unsuccessful. No bricks were offline (see the log-check sketch after the
output below).

        [root@gfs-1 ~]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
        Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        You have new mail in /var/spool/mail/root
        [root@gfs-1 ~]#
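
   The "Commit failed on 10.76.153.207" message points at gfs-3new, the node
still on 5.5. A sketch of where the details end up, assuming the default
GlusterFS log locations:

        # on 10.76.153.207 (gfs-3new): glusterd handles the commit phase of
        # the heal command, and glfsheal-<vol>.log records the heal launcher
        grep ' E ' /var/log/glusterfs/glusterd.log | tail -n 20
        tail -n 50 /var/log/glusterfs/glfsheal-glustervol3.log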

6) When gfs-3new was rolled back as well (all three servers on 3.12.15), the
"gluster volume heal <vol>" command was successful.

Conclusions: 
        - Heal does not happen while one server is rolled back to 3.12.15 and
          the other two are still on 5.5; the "gluster volume heal <vol>"
          command is not successful either.
        - Heal starts once two servers are rolled back to 3.12.15.
        - The "gluster volume heal <vol>" command is not successful until all
          servers are rolled back to 3.12.15.
