[Bugs] [Bug 1796609] New: Random glusterfsd crashes

bugzilla at redhat.com bugzilla at redhat.com
Thu Jan 30 18:43:46 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1796609

            Bug ID: 1796609
           Summary: Random glusterfsd crashes
           Product: GlusterFS
           Version: 7
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: core
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: gagnon.pierluc at gmail.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Created attachment 1656556
  --> https://bugzilla.redhat.com/attachment.cgi?id=1656556&action=edit
Crash log

Description of problem:
The gluster volume becomes inaccessible ("Transport endpoint is not connected" on the
client mount), which seems to be caused by the glusterfsd brick process crashing.
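When the mount ends up in that "Transport endpoint is not connected" state, it can
normally be recovered by remounting; a rough sketch, assuming the volume is mounted at
/mnt/gfs (that mount point is only an example path, not necessarily my actual one):

~> sudo umount -l /mnt/gfs
~> sudo mount -t glusterfs localhost:/gfs /mnt/gfs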

Version-Release number of selected component (if applicable):
7.2

How reproducible:
Happens regularly (about once a day), but I cannot figure out how to trigger
it.


Additional info:

apport dump: https://drive.google.com/open?id=1zElM6I6HNE7V_WU_SQH5-emlPpdcRd6e
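For reference, a backtrace can be pulled out of that dump with apport-unpack and gdb,
roughly like this (the .crash filename and paths below are assumptions based on a stock
Ubuntu setup, and the glusterfs dbgsym packages need to be installed for readable symbols):

~> sudo apport-unpack /var/crash/_usr_sbin_glusterfsd.0.crash /tmp/glusterfsd-crash
~> gdb -batch -ex "thread apply all bt full" /usr/sbin/glusterfsd /tmp/glusterfsd-crash/CoreDump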

mnt-{volume}.log attached (my volume is called 'gfs')

My cluster is composed of 3 nodes:
mars: 192.168.4.132
venus: 192.168.5.196
saturn: 192.168.4.146

Each node has 2 bricks, with the replica count set to 3 (so a 2 x 3 layout).

All bricks are on xfs, except for the 2 bricks on mars, which are on a single
zfs volume (the crashes are not limited to mars, so this does not look zfs-specific).
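For context, the current layout corresponds to roughly the following create command
(reconstructed from the 'gluster volume info' output below; the volume may well have been
grown with add-brick instead, so this is only illustrative):

~> sudo gluster volume create gfs replica 3 \
     saturn:/gluster/bricks/1/brick venus:/gluster/bricks/2/brick mars:/gluster/bricks/3/brick \
     venus:/gluster/bricks/5/brick saturn:/gluster/bricks/6/brick mars:/gluster/bricks/4/brick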

Extra:

~> sudo gluster peer status
Number of Peers: 2

Hostname: mars
Uuid: 53e473df-d8e9-4d0d-b753-ccfff5c5097c
State: Peer in Cluster (Connected)

Hostname: venus.sarbakaninc.local
Uuid: 4aa987f2-924b-4a2c-b441-ff1b0b1cbb86
State: Peer in Cluster (Connected)
Other names:
venus.sarbakaninc.local
venus


~> sudo gluster volume info

Volume Name: gfs
Type: Distributed-Replicate
Volume ID: 3f451b61-e48b-4be4-92ed-e509271d0284
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: saturn:/gluster/bricks/1/brick
Brick2: venus:/gluster/bricks/2/brick
Brick3: mars:/gluster/bricks/3/brick
Brick4: venus:/gluster/bricks/5/brick
Brick5: saturn:/gluster/bricks/6/brick
Brick6: mars:/gluster/bricks/4/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
server.event-threads: 4
changelog.changelog: on
geo-replication.ignore-pid-check: off
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
auth.allow: 192.168.5.222,192.168.5.196,192.168.4.132,192.168.4.133,192.168.5.195,192.168.4.146,192.168.5.55
performance.cache-size: 1GB
cluster.enable-shared-storage: disable



~> sudo gluster volume status
Status of volume: gfs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick saturn:/gluster/bricks/1/brick        49152     0          Y       21533
Brick venus:/gluster/bricks/2/brick         49152     0          Y       4590 
Brick mars:/gluster/bricks/3/brick          49154     0          Y       30419
Brick venus:/gluster/bricks/5/brick         49153     0          Y       4591 
Brick saturn:/gluster/bricks/6/brick        49153     0          Y       21534
Brick mars:/gluster/bricks/4/brick          49155     0          Y       30447
Self-heal Daemon on localhost               N/A       N/A        Y       21564
Self-heal Daemon on venus.sarbakaninc.local N/A       N/A        Y       4610 
Self-heal Daemon on mars                    N/A       N/A        Y       3640 

Task Status of Volume gfs
------------------------------------------------------------------------------
There are no active volume tasks
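For reference, when one of the brick processes dies, the standard way to bring it back
online (so the status looks like the above again) is a force start of the volume:

~> sudo gluster volume start gfs force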


