[Bugs] [Bug 1807384] New: [AFR] Heal not happening after disk cleanup and self-heal of files/dirs is done to simulate disk replacement

bugzilla at redhat.com bugzilla at redhat.com
Wed Feb 26 09:06:38 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1807384

            Bug ID: 1807384
           Summary: [AFR] Heal not happening after disk cleanup and
                    self-heal of files/dirs is done to simulate disk
                    replacement
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: replicate
          Keywords: Regression
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: kiyer at redhat.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:
While running patch [1], which performs the steps listed under "Steps to
Reproduce" below, it was observed that the arequal checksums of the bricks
differed, as shown below:
################################################################################
Checksum of the brick on which the data is removed
################################################################################
arequal-checksum -p /mnt/vol0/testvol_replicated_brick2 -i .glusterfs -i
.landfill -i .trashcan

Entry counts
Regular files   : 14
Directories     : 3
Symbolic links  : 0
Other           : 0
Total           : 17

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : c4a0e0fd92dba41dc446cc3b33287983
Directories     : 300002e01
Symbolic links  : 0
Other           : 0
Total           : e62cc5a1f3f39f
################################################################################
Checksum of the brick where data wasn't removed
################################################################################
arequal-checksum -p /mnt/vol0/testvol_replicated_brick1 -i .glusterfs -i
.landfill -i .trashcan

Entry counts
Regular files   : 16500
Directories     : 11
Symbolic links  : 0
Other           : 0
Total           : 16511

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 6b72772e37d757ad53453c4aafed344c
Directories     : 301002f01
Symbolic links  : 0
Other           : 0
Total           : 38374b67993a4ce0
################################################################################
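
For reference, a minimal sketch of how the two brick checksums can be compared
automatically, assuming arequal-checksum is invoked locally on each brick path
via subprocess (the actual test collects these over SSH through glusto-tests
helpers; the brick paths below are the ones from this run and are illustrative):

import re
import subprocess

def brick_arequal_totals(brick_path):
    """Run arequal-checksum on a brick and return its two 'Total' lines
    (entry-count total and checksum total)."""
    out = subprocess.check_output(
        ["arequal-checksum", "-p", brick_path,
         "-i", ".glusterfs", "-i", ".landfill", "-i", ".trashcan"],
        text=True)
    return re.findall(r"^Total\s*:\s*(\S+)", out, flags=re.MULTILINE)

# Brick paths taken from the output above (illustrative).
good = brick_arequal_totals("/mnt/vol0/testvol_replicated_brick1")
wiped = brick_arequal_totals("/mnt/vol0/testvol_replicated_brick2")
print("brick1 totals:", good)
print("brick2 totals:", wiped)
print("match" if good == wiped else "MISMATCH: wiped brick was not healed")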


This means that heal wasn't completing on the node where the data was removed.
However, when heal info was checked before comparing the checksums, it showed
no pending entries on any of the bricks (a parsing sketch follows the output):
################################################################################
2020-02-25 12:15:34,416 INFO (run) root at 172.19.2.161 (cp): gluster volume heal
testvol_replicated info --xml
2020-02-25 12:15:34,416 DEBUG (_get_ssh_connection) Retrieved connection from
cache: root at 172.19.2.161
2020-02-25 12:15:34,618 INFO (_log_results) RETCODE (root at 172.19.2.161): 0
2020-02-25 12:15:34,619 DEBUG (_log_results) STDOUT (root at 172.19.2.161)...
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <healInfo>
    <bricks>
      <brick hostUuid="112835ce-16ed-43e1-a758-c104c78ff782">
        <name>172.19.2.161:/mnt/vol0/testvol_replicated_brick0</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
      <brick hostUuid="3fdae765-7a1f-4ae5-99c1-ea7b24768554">
        <name>172.19.2.153:/mnt/vol0/testvol_replicated_brick1</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
      <brick hostUuid="a3877a65-2963-423c-8e9f-95ceb07f907d">
        <name>172.19.2.164:/mnt/vol0/testvol_replicated_brick2</name>
        <status>Connected</status>
        <numberOfEntries>0</numberOfEntries>
      </brick>
    </bricks>
  </healInfo>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
</cliOutput>
################################################################################
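
A minimal sketch of how the heal info XML above can be parsed to get the
pending entry count per brick, assuming the gluster CLI is invoked locally
(the test runs it over SSH; the volume name is the one from this run):

import subprocess
import xml.etree.ElementTree as ET

def heal_pending_entries(volname):
    """Return {brick: numberOfEntries} from `gluster volume heal <vol> info --xml`."""
    xml_out = subprocess.check_output(
        ["gluster", "volume", "heal", volname, "info", "--xml"], text=True)
    root = ET.fromstring(xml_out)
    pending = {}
    for brick in root.findall("./healInfo/bricks/brick"):
        entries = brick.findtext("numberOfEntries", default="")
        pending[brick.findtext("name")] = int(entries) if entries.isdigit() else entries
    return pending

# All three bricks report 0 here, even though the wiped brick is missing
# ~16k files compared to its replica, which is the inconsistency reported.
print(heal_pending_entries("testvol_replicated"))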


Version-Release number of selected component (if applicable):
glusterfs 20200220.a0e0890

How reproducible:
2/2

Steps to Reproduce:
- Create a replicated or distributed-replicated volume
- Create a directory on the mount point and write files/dirs into it
- Create another set of files (1K files)
- While the creation of files/dirs is in progress, kill one brick
- Remove the contents of the killed brick to simulate disk replacement
  (see the sketch after this list)
- While the I/O is still in progress, restart glusterd on the nodes where
  disk replacement was simulated to bring the bricks back online
- Start volume heal
- Wait for the I/O to complete
- Verify whether the files are self-healed
- Calculate the arequal checksums of the mount point and all the bricks
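
A minimal sketch of the disk-replacement simulation portion of the steps above,
assuming it is run directly on the node hosting the brick being wiped (the
volume name, brick path, and pkill pattern are illustrative; the actual test
drives these steps remotely via glusto-tests):

import os
import shutil
import subprocess

VOLNAME = "testvol_replicated"                  # assumption: volume under test
BRICK = "/mnt/vol0/testvol_replicated_brick2"   # assumption: brick being "replaced"

# 1. Kill the brick's glusterfsd process (the pkill pattern is illustrative;
#    the brick path appears on the glusterfsd command line).
subprocess.run(["pkill", "-f", "glusterfsd.*" + os.path.basename(BRICK)])

# 2. Wipe the brick's contents, including .glusterfs, to simulate an empty
#    replacement disk.
for entry in os.listdir(BRICK):
    path = os.path.join(BRICK, entry)
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    else:
        os.remove(path)

# 3. Restart glusterd on this node so the brick process is respawned and the
#    brick comes back online.
subprocess.run(["systemctl", "restart", "glusterd"], check=True)

# 4. Trigger index heal and check its progress.
subprocess.run(["gluster", "volume", "heal", VOLNAME], check=True)
subprocess.run(["gluster", "volume", "heal", VOLNAME, "info"], check=True)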

Actual results:
Arequal checksums are different for replicated volumes and are not consistent
for distributed-replicated volumes.

Expected results:
Arequal checksums should be the same in the case of replicated volumes and
should be consistent in the case of distributed-replicated volumes.

Additional info:
This issue wasn't observed in gluster 6.0 builds.

Reference links:
[1] https://review.gluster.org/#/c/glusto-tests/+/20378/
[2]
https://ci.centos.org/job/gluster_glusto-patch-check/2053/artifact/glustomain.log
