[Bugs] [Bug 1592258] New: Gluster brick will not come back up online
bugzilla at redhat.com
Mon Jun 18 09:28:19 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1592258
Bug ID: 1592258
Summary: Gluster brick will not come back up online
Product: GlusterFS
Version: 3.12
Component: glusterd
Severity: medium
Assignee: bugs at gluster.org
Reporter: stefan.luteijn at kpn.com
CC: bugs at gluster.org
Created attachment 1452579
--> https://bugzilla.redhat.com/attachment.cgi?id=1452579&action=edit
Dumps of the bricks of the volume that has the brick refusing to come back
online.
Description of problem:
Every now and then gluster bricks go offline on our cluster. Most of the time
we can bring them back by running gluster volume start <gluster_vol> force.
Occasionally, however, we get a brick which refuses to start again even when we
force start it. Stopping and then starting the volume does bring all bricks
back online in all cases; the exact commands are sketched below.
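For reference, a minimal sketch of the commands involved; <gluster_vol> is a
placeholder and the comments describe how offline bricks typically show up for
us:

# See which bricks are offline (TCP Port shows N/A, Online shows N)
gluster volume status <gluster_vol>

# Usual fix: force start the volume, which respawns missing brick processes
gluster volume start <gluster_vol> force

# Workaround that has always worked for us (briefly interrupts the volume)
gluster volume stop <gluster_vol>
gluster volume start <gluster_vol>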
Version-Release number of selected component (if applicable):
glusterfs 3.12.9
How reproducible:
Not reliably. Run gluster with replication on (in our setup with ~50 volumes)
and every 1 or 2 days a few bricks will go offline. About 1 in 5 of those
bricks does not come back online when we force start it, and such a brick seems
to be assigned a port number already in use by another brick on that node.
Steps to Reproduce:
1. Have an offline brick of a live gluster volume
2. Run gluster volume start <gluster_vol> force
3. Confirm the brick of the gluster volume is still offline
Actual results:
Gluster brick stays offline after running gluster volume start <gluster_vol>
force
Expected results:
Gluster brick comes back online after running gluster volume start
<gluster_vol> force
Additional info:
We run a setup where we create replicated volumes using three gluster nodes.
We also run a DR site with another three gluster nodes, for which we use the
data replication option of gluster to keep them in sync with the main site.
We have noticed that in our environment this happens exclusively when
replication is on. So far, every time we have a brick that does not want to
come back online, it appears to be assigned a port number which has already
been assigned to a different brick. However, I don't know at which point in
time a port number gets assigned to a brick, so this might be a false flag. In
any case, below are the configs of the two bricks that share the same port
number (a sketch of how we spot such collisions follows after them):
172.16.0.4:-var-lib-heketi-mounts-vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d-brick:listen-port=49187
uuid=19a9c113-41ba-411c-ad76-c7b8fdbe14f2
hostname=172.16.0.4
path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_08f4fd811dc92bbbfbf1872b7a49c67d/brick
real_path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_08f4fd811dc92bbbfbf1872b7a49c67d/brick
listen-port=49187
rdma.listen-port=0
decommissioned=0
brick-id=vol_ade97766557f27313661681852eebdf0-client-2
mount_dir=/brick
snap-status=0
brick-fsid=65204
172.16.0.4:-var-lib-heketi-mounts-vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_847d3127b26fbea7aa55fd24f46042e4-brick:listen-port=49187
uuid=19a9c113-41ba-411c-ad76-c7b8fdbe14f2
hostname=172.16.0.4
path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_847d3127b26fbea7aa55fd24f46042e4/brick
real_path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_847d3127b26fbea7aa55fd24f46042e4/brick
listen-port=49187
rdma.listen-port=0
decommissioned=0
brick-id=vol_c0d4d479b43ff3d00601b04d25eff60e-client-2
mount_dir=/brick
snap-status=0
brick-fsid=65104
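For completeness, this is roughly how we find such collisions. A minimal
sketch, assuming the default glusterd working directory (/var/lib/glusterd),
which is where the brick info files quoted above live; port 49187 is the
clashing port from the two entries above:

# Count listen-port values assigned to more than one brick info file on this
# node (the two entries above were found this way)
grep -h '^listen-port=' /var/lib/glusterd/vols/*/bricks/* | sort | uniq -cd

# List the brick info files that share a given port (49187 in our case)
grep -l '^listen-port=49187' /var/lib/glusterd/vols/*/bricks/*

# Check which process, if any, is actually bound to that port
ss -tlnp | grep ':49187'

In the failing cases the port recorded for the offline brick appears to already
be bound by the other brick's glusterfsd process, which matches the two entries
above.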
Used gluster packages:
glusterfs.x86_64                     3.12.9-1.el7   @centos-gluster312
glusterfs-api.x86_64                 3.12.9-1.el7   @centos-gluster312
glusterfs-cli.x86_64                 3.12.9-1.el7   @centos-gluster312
glusterfs-client-xlators.x86_64      3.12.9-1.el7   @centos-gluster312
glusterfs-fuse.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-geo-replication.x86_64     3.12.9-1.el7   @centos-gluster312
glusterfs-libs.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-rdma.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-server.x86_64              3.12.9-1.el7   @centos-gluster312
Available Packages
glusterfs-api-devel.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-coreutils.x86_64 0.2.0-1.el7 centos-gluster312
glusterfs-devel.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-events.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-extra-xlators.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-gnfs.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-resource-agents.noarch 3.12.9-1.el7 centos-gluster312
xfs options:
xfs_info /dev/mapper/vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d
meta-data=/dev/mapper/vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d isize=512 agcount=8, agsize=65472 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=523776, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
uname -r
4.11.9-coreos
cat /etc/issue
\S
Kernel \r on an \m
10:46$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
udev devtmpfs 7,7G 0 7,7G 0% /dev
tmpfs tmpfs 1,6G 9,7M 1,6G 1% /run
/dev/sda1 ext4 220G 149G 60G 72% /
tmpfs tmpfs 7,7G 369M 7,3G 5% /dev/shm
tmpfs tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs tmpfs 7,7G 0 7,7G 0% /sys/fs/cgroup
/dev/loop1 squashfs 158M 158M 0 100% /snap/mailspring/209
/dev/loop0 squashfs 94M 94M 0 100% /snap/slack/6
/dev/loop4 squashfs 157M 157M 0 100% /snap/mailspring/216
/dev/loop3 squashfs 87M 87M 0 100% /snap/core/4486
/dev/loop6 squashfs 94M 94M 0 100% /snap/slack/5
/dev/loop2 squashfs 144M 144M 0 100% /snap/slack/7
/dev/loop5 squashfs 87M 87M 0 100% /snap/core/4571
/dev/loop7 squashfs 158M 158M 0 100% /snap/mailspring/202
/dev/loop8 squashfs 87M 87M 0 100% /snap/core/4650
tmpfs tmpfs 1,6G 64K 1,6G 1% /run/user/1000
Attached are the brick dump files of the volume.