[Bugs] [Bug 1811373] New: glusterd crashes healing disperse volumes on arm
bugzilla at redhat.com
Sun Mar 8 04:09:47 UTC 2020
https://bugzilla.redhat.com/show_bug.cgi?id=1811373
Bug ID: 1811373
Summary: glusterd crashes healing disperse volumes on arm
Product: GlusterFS
Version: 7
Hardware: armv7l
OS: Linux
Status: NEW
Component: glusterd
Assignee: bugs at gluster.org
Reporter: foxxz.net at gmail.com
CC: bugs at gluster.org
Target Milestone: ---
Classification: Community
Created attachment 1668387
--> https://bugzilla.redhat.com/attachment.cgi?id=1668387&action=edit
Excerpts from several gluster logs
Description of problem:
The gluster brick process on an ARM node that needs healing will crash (almost
always) within seconds after it starts and connects to the other cluster
members. Tested under Ubuntu 18 with gluster v7 and v4 on ODROID HC2 units,
and under Raspbian with gluster v5 on a Raspberry Pi 3.
Version-Release number of selected component (if applicable):
gluster 7.2, but the problem has also been reproduced on v4 and v5.
How reproducible:
Reliably reproducible
Steps to Reproduce:
1. Create a disperse volume on a cluster with 3 or more members/bricks and
enable healing.
2. Have a client mount the volume and begin writing files to it.
3. Reboot a cluster member during client operations.
4. The cluster member rejoins the cluster and attempts to heal.
5. glusterd on that member typically crashes seconds to minutes after startup;
in rare cases it takes longer. (A command sketch of these steps follows below.)
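A rough command sketch of the reproduction, assuming host names
gluster1..gluster12 and the brick path shown in the volume info below; the
mount point and the dd write loop are stand-ins for the actual client
workload, not the exact commands used:
# On one node: create, start, and enable healing on an 8+4 disperse volume
gluster volume create bigdisp disperse-data 8 redundancy 4 \
    $(for i in $(seq 1 12); do echo gluster$i:/exports/sda/brick1/bigdisp; done)
gluster volume start bigdisp
gluster volume heal bigdisp enable
# On a client: mount the volume and keep writing files to it
mount -t glusterfs gluster1:/bigdisp /mnt/bigdisp
while true; do dd if=/dev/urandom of=/mnt/bigdisp/f.$RANDOM bs=1M count=100; done
# While the client is writing: reboot one cluster member, e.g. gluster3
ssh gluster3 reboot
# After it comes back up, the affected brick shows online briefly, then offline
gluster volume status bigdisp
gluster volume heal bigdisp info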
Actual results:
gluster volume status
shows the affected brick online briefly and then offline after it crashes. The
self-heal daemon shows as online. The brick is never able to heal and rejoin
the cluster.
Expected results:
The brick should come online and sync up.
Additional info:
The same test has been run on x86 hardware, which does not exhibit the crash.
I am willing to make this testbed available to developers to help debug this
issue. It is a 12-node system composed of ODROID HC2 units with a 4 TB drive
attached to each unit.
Volume Name: bigdisp
Type: Disperse
Volume ID: 56fa5de3-36d5-45ec-9789-88d8aae02275
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (8 + 4) = 12
Transport-type: tcp
Bricks:
Brick1: gluster1:/exports/sda/brick1/bigdisp
Brick2: gluster2:/exports/sda/brick1/bigdisp
Brick3: gluster3:/exports/sda/brick1/bigdisp
Brick4: gluster4:/exports/sda/brick1/bigdisp
Brick5: gluster5:/exports/sda/brick1/bigdisp
Brick6: gluster6:/exports/sda/brick1/bigdisp
Brick7: gluster7:/exports/sda/brick1/bigdisp
Brick8: gluster8:/exports/sda/brick1/bigdisp
Brick9: gluster9:/exports/sda/brick1/bigdisp
Brick10: gluster10:/exports/sda/brick1/bigdisp
Brick11: gluster11:/exports/sda/brick1/bigdisp
Brick12: gluster12:/exports/sda/brick1/bigdisp
Options Reconfigured:
disperse.shd-max-threads: 4
client.event-threads: 8
cluster.disperse-self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
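For reference, the non-default options above would presumably have been
applied with commands along these lines (an assumption reconstructed from the
option list, not a record of the exact commands run):
gluster volume set bigdisp disperse.shd-max-threads 4
gluster volume set bigdisp client.event-threads 8
gluster volume set bigdisp cluster.disperse-self-heal-daemon enable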
Status of volume: bigdisp
Gluster process                              TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster1:/exports/sda/brick1/bigdisp   49152     0          Y       4632
Brick gluster2:/exports/sda/brick1/bigdisp   49152     0          Y       3115
Brick gluster3:/exports/sda/brick1/bigdisp   N/A       N/A        N       N/A
Brick gluster4:/exports/sda/brick1/bigdisp   49152     0          Y       2728
Brick gluster5:/exports/sda/brick1/bigdisp   49152     0          Y       3072
Brick gluster6:/exports/sda/brick1/bigdisp   49152     0          Y       2549
Brick gluster7:/exports/sda/brick1/bigdisp   49152     0          Y       16848
Brick gluster8:/exports/sda/brick1/bigdisp   49152     0          Y       16740
Brick gluster9:/exports/sda/brick1/bigdisp   49152     0          Y       2619
Brick gluster10:/exports/sda/brick1/bigdisp  49152     0          Y       2677
Brick gluster11:/exports/sda/brick1/bigdisp  49152     0          Y       3023
Brick gluster12:/exports/sda/brick1/bigdisp  49153     0          Y       2440
Self-heal Daemon on localhost                N/A       N/A        Y       4653
Self-heal Daemon on gluster3                 N/A       N/A        Y       7620
Self-heal Daemon on gluster10                N/A       N/A        Y       2698
Self-heal Daemon on gluster7                 N/A       N/A        Y       16869
Self-heal Daemon on gluster8                 N/A       N/A        Y       16761
Self-heal Daemon on gluster12                N/A       N/A        Y       2461
Self-heal Daemon on gluster9                 N/A       N/A        Y       2640
Self-heal Daemon on gluster2                 N/A       N/A        Y       3136
Self-heal Daemon on gluster5                 N/A       N/A        Y       3093
Self-heal Daemon on gluster4                 N/A       N/A        Y       2749
Self-heal Daemon on gluster6                 N/A       N/A        Y       2570
Self-heal Daemon on gluster11                N/A       N/A        Y       3044
Task Status of Volume bigdisp
------------------------------------------------------------------------------
There are no active volume tasks