[Bugs] [Bug 1592258] New: Gluster brick will not come back up online
bugzilla at redhat.com
Mon Jun 18 09:28:19 UTC 2018
https://bugzilla.redhat.com/show_bug.cgi?id=1592258
Bug ID: 1592258
Summary: Gluster brick will not come back up online
Product: GlusterFS
Version: 3.12
Component: glusterd
Severity: medium
Assignee: bugs at gluster.org
Reporter: stefan.luteijn at kpn.com
CC: bugs at gluster.org
Created attachment 1452579
--> https://bugzilla.redhat.com/attachment.cgi?id=1452579&action=edit
Dumps of the bricks of the volume that has the brick refusing to come back
online.
Description of problem:
Every now and then gluster bricks go offline on our cluster. Most of the time
we can bring them back by running gluster volume start <gluster_vol> force.
Occasionally, however, we get a brick which refuses to start again even when we
force start it. Stopping and then starting the volume does bring all bricks
back online in all cases; the exact commands are sketched below.
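For reference, a minimal sketch of the commands involved; <gluster_vol> is a
placeholder and the comments describe how offline bricks typically show up for
us:

# See which bricks are offline (TCP Port shows N/A, Online shows N)
gluster volume status <gluster_vol>

# Usual fix: force start the volume, which respawns missing brick processes
gluster volume start <gluster_vol> force

# Workaround that has always worked for us (briefly interrupts the volume)
gluster volume stop <gluster_vol>
gluster volume start <gluster_vol>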
Version-Release number of selected component (if applicable):
glusterfs 3.12.9
How reproducible:
Not reliably. Run gluster with replication on (in our setup with ~50 volumes)
and every 1 or 2 days a few bricks will go offline. About 1 in 5 of those
bricks does not come back online when we force start it, and such a brick seems
to be assigned a port number already in use by another brick on that node.
Steps to Reproduce:
1. Have an offline brick of a live gluster volume
2. Run gluster volume start <gluster_vol> force
3. Confirm the brick of the gluster volume is still offline
Actual results:
Gluster brick stays offline after running gluster volume start <gluster_vol>
force
Expected results:
Gluster brick comes back online after running gluster volume start
<gluster_vol> force
Additional info:
We run a setup where we create replicated volumes using three gluster nodes.
We also run a DR site with another three gluster nodes, for which we use the
data replication option of gluster to keep them in sync with the main site.
We have noticed that in our environment this happens exclusively when
replication is on. So far, every time we have a brick that does not want to
come back online, it appears to be assigned a port number which has already
been assigned to a different brick. However, I don't know at which point in
time a port number gets assigned to a brick, so this might be a false flag. In
any case, below are the configs of the two bricks that share the same port
number (a sketch of how we spot such collisions follows after them):
172.16.0.4:-var-lib-heketi-mounts-vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d-brick:listen-port=49187
uuid=19a9c113-41ba-411c-ad76-c7b8fdbe14f2
hostname=172.16.0.4
path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_08f4fd811dc92bbbfbf1872b7a49c67d/brick
real_path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_08f4fd811dc92bbbfbf1872b7a49c67d/brick
listen-port=49187
rdma.listen-port=0
decommissioned=0
brick-id=vol_ade97766557f27313661681852eebdf0-client-2
mount_dir=/brick
snap-status=0
brick-fsid=65204
172.16.0.4:-var-lib-heketi-mounts-vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_847d3127b26fbea7aa55fd24f46042e4-brick:listen-port=49187
uuid=19a9c113-41ba-411c-ad76-c7b8fdbe14f2
hostname=172.16.0.4
path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_847d3127b26fbea7aa55fd24f46042e4/brick
real_path=/var/lib/heketi/mounts/vg_ff6390856febea2c9ec2a7fb7d0c1ff9/brick_847d3127b26fbea7aa55fd24f46042e4/brick
listen-port=49187
rdma.listen-port=0
decommissioned=0
brick-id=vol_c0d4d479b43ff3d00601b04d25eff60e-client-2
mount_dir=/brick
snap-status=0
brick-fsid=65104
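For completeness, this is roughly how we find such collisions. A minimal
sketch, assuming the default glusterd working directory (/var/lib/glusterd),
which is where the brick info files quoted above live; port 49187 is the
clashing port from the two entries above:

# Count listen-port values assigned to more than one brick info file on this
# node (the two entries above were found this way)
grep -h '^listen-port=' /var/lib/glusterd/vols/*/bricks/* | sort | uniq -cd

# List the brick info files that share a given port (49187 in our case)
grep -l '^listen-port=49187' /var/lib/glusterd/vols/*/bricks/*

# Check which process, if any, is actually bound to that port
ss -tlnp | grep ':49187'

In the failing cases the port recorded for the offline brick appears to already
be bound by the other brick's glusterfsd process, which matches the two entries
above.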
Used gluster packages:
glusterfs.x86_64                     3.12.9-1.el7   @centos-gluster312
glusterfs-api.x86_64                 3.12.9-1.el7   @centos-gluster312
glusterfs-cli.x86_64                 3.12.9-1.el7   @centos-gluster312
glusterfs-client-xlators.x86_64      3.12.9-1.el7   @centos-gluster312
glusterfs-fuse.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-geo-replication.x86_64     3.12.9-1.el7   @centos-gluster312
glusterfs-libs.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-rdma.x86_64                3.12.9-1.el7   @centos-gluster312
glusterfs-server.x86_64              3.12.9-1.el7   @centos-gluster312
Available Packages
glusterfs-api-devel.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-coreutils.x86_64 0.2.0-1.el7 centos-gluster312
glusterfs-devel.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-events.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-extra-xlators.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-gnfs.x86_64 3.12.9-1.el7 centos-gluster312
glusterfs-resource-agents.noarch 3.12.9-1.el7 centos-gluster312
xfs options:
xfs_info /dev/mapper/vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d
meta-data=/dev/mapper/vg_ff6390856febea2c9ec2a7fb7d0c1ff9-brick_08f4fd811dc92bbbfbf1872b7a49c67d isize=512 agcount=8, agsize=65472 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=523776, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
uname -r
4.11.9-coreos
cat /etc/issue
\S
Kernel \r on an \m
10:46$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
udev devtmpfs 7,7G 0 7,7G 0% /dev
tmpfs tmpfs 1,6G 9,7M 1,6G 1% /run
/dev/sda1 ext4 220G 149G 60G 72% /
tmpfs tmpfs 7,7G 369M 7,3G 5% /dev/shm
tmpfs tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs tmpfs 7,7G 0 7,7G 0% /sys/fs/cgroup
/dev/loop1 squashfs 158M 158M 0 100% /snap/mailspring/209
/dev/loop0 squashfs 94M 94M 0 100% /snap/slack/6
/dev/loop4 squashfs 157M 157M 0 100% /snap/mailspring/216
/dev/loop3 squashfs 87M 87M 0 100% /snap/core/4486
/dev/loop6 squashfs 94M 94M 0 100% /snap/slack/5
/dev/loop2 squashfs 144M 144M 0 100% /snap/slack/7
/dev/loop5 squashfs 87M 87M 0 100% /snap/core/4571
/dev/loop7 squashfs 158M 158M 0 100% /snap/mailspring/202
/dev/loop8 squashfs 87M 87M 0 100% /snap/core/4650
tmpfs tmpfs 1,6G 64K 1,6G 1% /run/user/1000
Attached are the brick dump files of the volume.