[Bugs] [Bug 1687051] gluster volume heal failed when online upgrading from 3.12 to 5.x and when rolling back online upgrade from 4.1.4 to 3.12.15

bugzilla at redhat.com bugzilla at redhat.com
Sun Mar 24 03:55:36 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1687051



--- Comment #52 from Amgad <amgad.saleh at nokia.com> ---

Hi Sanju:

I did more testing to take a closer look, and here is a finer-grained
description of the behavior:

0) Starting with a 3-replica setup: gfs-1, gfs-2, and gfs-3new, all on 3.12.15

1) Replication and the "gluster volume heal <vol>" command were always
successful during the online upgrade from 3.12.15 to 5.5, at every step on
all three nodes.
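
   For reference, the kind of per-step check this refers to is roughly the
following (a minimal sketch; the volume names are the ones used in step 5
below):

        # run from any server after each node is upgraded; both the heal
        # trigger and the info query succeeded at every step of the upgrade
        for v in glustervol1 glustervol2 glustervol3; do
            gluster volume heal $v
            gluster volume heal $v info | grep "Number of entries"
        done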

2) While rolling back one node (gfs-1) to 3.12.15, I added 128 files to one
volume; the files were replicated between the gfs-2 and gfs-3new servers.
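
   (A minimal sketch of how the 128 test files were generated from a client
mount; the mount point /mnt/glustervol3 here is a hypothetical example, not
necessarily the path actually used.)

        # create test_file.0 .. test_file.127 on the mounted volume; with
        # gfs-1 mid-rollback they replicate only to gfs-2 and gfs-3new
        for i in $(seq 0 127); do
            echo "test data $i" > /mnt/glustervol3/test_file.$i
        done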

3) When the rollback of gfs-1 to 3.12.15 was complete (gfs-2 and gfs-3new
still on 5.5), the files did not replicate to gfs-1, and the "gluster volume
heal <vol>" command failed. NO bricks were offline (see the brick-status
sketch after the output below).

   "gluster volume heal <vol> info" showed "Number of entries: 129" (128 files
and a directory) on the bricks on gfs-2 and gfs-3new.

   ** Heal never succeeded, even after rebooting gfs-1.

        [root@gfs-1 ~]# gluster volume heal glustervol3 info
        Brick 10.76.153.206:/mnt/data3/3                ==> gfs-1
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3                ==> gfs-2
        /test_file.0 
        / 
        /test_file.1 
        /test_file.2 
        .......
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        Brick 10.76.153.207:/mnt/data3/3                ==> gfs-3new
        /test_file.0 
        / 
        /test_file.1 
        /test_file.2 
        /test_file.3 
        /test_file.4 
        .....
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        [root@gfs-1 ~]#
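
   For context, the "no bricks were offline" observation can be confirmed
with checks along these lines (a sketch; glustershd.log is the default
self-heal daemon log location):

        # confirm every brick and the Self-heal Daemon are listed as online
        gluster volume status glustervol3

        # look for self-heal daemon errors on each server
        tail -n 100 /var/log/glusterfs/glustershd.log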

4) When gfs-2 was rolled back to 3.12.15 (gfs-1 now on 3.12.15 and gfs-3new
still on 5.5), the moment glusterd started on gfs-2, replication and heal
started, and the "Number of entries:" counts dropped to 0 within about 8
seconds (a simple way to watch this is sketched after the output below).

        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        / - Possibly undergoing heal

        /test_file.1 
        /test_file.2 
        /test_file.3 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 129

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        /test_file.4 
        /test_file.5 
        /test_file.6 
        /test_file.7 
        /test_file.8 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 125
        ==============
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        /test_file.68 
        /test_file.69 
        ..
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 61

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        /test_file.76 
        /test_file.77 
        /test_file.78 
        ..
        /test_file.122 
        /test_file.123 
        /test_file.124 
        /test_file.125 
        /test_file.126 
        /test_file.127 
        Status: Connected
        Number of entries: 53
        ==============
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        /test_file.0 
        Status: Connected
        Number of entries: 1

        Brick 10.76.153.207:/mnt/data3/3
        /test_file.0 
        Status: Connected
        Number of entries: 1
        ================
        Brick 10.76.153.206:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.213:/mnt/data3/3
        Status: Connected
        Number of entries: 0

        Brick 10.76.153.207:/mnt/data3/3
        Status: Connected
        Number of entries: 0
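
   A simple way to watch the pending-heal counts drain, as in the snapshots
above (a sketch; it can be run on any of the servers):

        # poll the per-brick pending-heal counts once per second
        watch -n 1 'gluster volume heal glustervol3 info | grep -E "Brick|Number of entries"'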

5) Although heal started once gfs-2 was rolled back to 3.12.15 (two nodes now
on 3.12.15), the "gluster volume heal <vol>" command remained continuously
unsuccessful. No bricks were offline (see the log-check sketch after the
output below).

        [root@gfs-1 ~]# for i in glustervol1 glustervol2 glustervol3; do gluster volume heal $i; done
        Launching heal operation to perform index self heal on volume glustervol1 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        Launching heal operation to perform index self heal on volume glustervol2 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        Launching heal operation to perform index self heal on volume glustervol3 has been unsuccessful:
        Commit failed on 10.76.153.207. Please check log file for details.
        You have new mail in /var/spool/mail/root
        [root@gfs-1 ~]#
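
   The "Commit failed on 10.76.153.207" message points at gfs-3new, the node
still on 5.5. A sketch of where the details end up, assuming the default
GlusterFS log locations:

        # on 10.76.153.207 (gfs-3new): glusterd handles the commit phase of
        # the heal command, and glfsheal-<vol>.log records the heal launcher
        grep ' E ' /var/log/glusterfs/glusterd.log | tail -n 20
        tail -n 50 /var/log/glusterfs/glfsheal-glustervol3.log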

6) When gfs-3new was rolled back as well (all three servers on 3.12.15), the
"gluster volume heal <vol>" command was successful.

Conclusions: 
        - Heal does not happen while one server is rolled back to 3.12.15 and
          the other two are still on 5.5; the "gluster volume heal <vol>"
          command is not successful either.
        - Heal starts once two servers are rolled back to 3.12.15.
        - The "gluster volume heal <vol>" command is not successful until all
          servers are rolled back to 3.12.15.
