[Bugs] [Bug 1749322] New: glustershd cannot decide healed_sinks and skips the repair, so some entries linger in volume heal info
bugzilla at redhat.com
Thu Sep 5 11:18:23 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1749322
Bug ID: 1749322
Summary: glustershd cannot decide healed_sinks and skips the
repair, so some entries linger in volume heal info
Product: GlusterFS
Version: mainline
OS: Linux
Status: ASSIGNED
Component: replicate
Severity: high
Assignee: ksubrahm at redhat.com
Reporter: ksubrahm at redhat.com
CC: bugs at gluster.org, nchilaka at redhat.com,
rhs-bugs at redhat.com, shujun.huang at nokia-sbell.com,
storage-qa-internal at redhat.com,
zz.sh.cynthia at gmail.com
Blocks: 1740968
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1740968 +++
Description of problem:
[root@mn-0:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/db/upgrade
Status: Connected
Number of entries: 1
Brick mn-1.local:/mnt/bricks/services/brick
/db/upgrade
Status: Connected
Number of entries: 1
Brick dbm-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0
These entries keep showing up in the "gluster v heal info" output. From the
glustershd log, nothing is actually repaired each time glustershd processes
this entry; from the gdb session below, shd cannot determine the healed_sinks,
so nothing is done in each round of repair.
[Env info]
Three bricks: mn-0, mn-1, dbm-0
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# gluster v info services
Volume Name: services
Type: Replicate
Volume ID: 062748ce-0876-46f6-9936-d9ff3a2b110a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/services/brick
Brick2: mn-1.local:/mnt/bricks/services/brick
Brick3: dbm-0.local:/mnt/bricks/services/brick
Options Reconfigured:
cluster.heal-timeout: 60
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
client.ssl: on
server.ssl: on
ssl.private-key: /var/opt/nokia/certs/glusterfs/glusterfs.key
ssl.own-cert: /var/opt/nokia/certs/glusterfs/glusterfs.pem
ssl.ca-list: /var/opt/nokia/certs/glusterfs/glusterfs.ca
cluster.server-quorum-ratio: 51%
[Debug info]
[root@mn-0:/mnt/bricks/services/brick/db]
# getfattr -m . -d -e hex upgrade/
# file: upgrade/
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000015
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000003
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
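For readability, each trusted.afr.<volname>-client-N value above is a 12-byte
changelog: three big-endian 32-bit pending counters, in the order
data / metadata / entry. A minimal standalone decoder (my own helper, not
GlusterFS code) applied to the values above:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode a trusted.afr.* value as printed by "getfattr -e hex":
 * "0x" + 24 hex digits = three big-endian 32-bit pending counters
 * (data, metadata, entry). Standalone helper, not GlusterFS code. */
static void decode_afr_xattr(const char *name, const char *hex)
{
    uint32_t counter[3] = {0};

    if (strncmp(hex, "0x", 2) == 0)
        hex += 2;
    if (strlen(hex) < 24) {
        fprintf(stderr, "%s: value too short\n", name);
        return;
    }
    for (int i = 0; i < 3; i++) {
        char buf[9] = {0};
        memcpy(buf, hex + i * 8, 8);
        counter[i] = (uint32_t)strtoul(buf, NULL, 16);
    }
    printf("%s: data=%u metadata=%u entry=%u\n",
           name, counter[0], counter[1], counter[2]);
}

int main(void)
{
    /* Values observed on the three bricks above. */
    decode_afr_xattr("mn-0  trusted.afr.services-client-1",
                     "0x000000000000000000000015");
    decode_afr_xattr("mn-1  trusted.afr.services-client-0",
                     "0x000000000000000000000003");
    decode_afr_xattr("dbm-0 trusted.afr.services-client-1",
                     "0x000000000000000000000000");
    return 0;
}

Decoded: mn-0 blames mn-1 (client-1) with 0x15 = 21 pending entry operations,
mn-1 blames mn-0 (client-0) with 3 pending entry operations, and dbm-0
(client-2) blames nobody, i.e. the two accusations point at each other.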
gdb attached to the mn-0 glustershd process:
Thread 14 "glustershdheal" hit Breakpoint 10, __afr_selfheal_entry_prepare
(frame=frame@entry=0x7f54840321e0, this=this@entry=0x7f548c016980,
inode=<optimized out>, locked_on=locked_on@entry=0x7f545effc780
"\001\001\001dT\177", sources=sources@entry=0x7f545effc7c0 "",
sinks=sinks@entry=0x7f545effc7b0 "", healed_sinks=<optimized out>,
replies=<optimized out>, source_p=<optimized out>, pflag=<optimized out>)
at afr-self-heal-entry.c:546
546 in afr-self-heal-entry.c
(gdb) print heald_sinks[0]
No symbol "heald_sinks" in current context.
(gdb) print healed_sinks[0]
value has been optimized out
(gdb) print source
$12 = 2
(gdb) print sinks[0]
$13 = 0 '\000'
(gdb) print sinks[1]
$14 = 0 '\000'
(gdb) print sinks[2]
$15 = 0 '\000'
(gdb) print locked_on[0]
$16 = 1 '\001'
(gdb) print locked_on[1]
$17 = 1 '\001'
(gdb) print locked_on[2]
$18 = 1 '\001'
According to the code in __afr_selfheal_entry, healed_sinks is all zero in
every heal attempt, so the check
"if (AFR_COUNT(healed_sinks, priv->child_count) == 0)" takes the goto unlock
path and that round of heal is skipped; as a result /db/upgrade keeps showing
up in the "volume heal info" output. It seems the current glustershd code does
not handle this kind of situation, and an entry that lingers in heal info
forever is far from ideal.
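For illustration, a minimal standalone model of that skip (assumptions:
AFR_COUNT is re-declared here with the counting semantics described above,
i.e. it counts the non-zero slots of a per-brick array, and the arrays carry
the values seen or inferred in the gdb session; this is not the actual
__afr_selfheal_entry code):

#include <stdio.h>

/* Local stand-in for AFR_COUNT: count the non-zero slots of a
 * per-brick array (same counting semantics as described above). */
#define AFR_COUNT(array, max)                 \
    ({                                        \
        int __i, __res = 0;                   \
        for (__i = 0; __i < (max); __i++)     \
            if ((array)[__i])                 \
                __res++;                      \
        __res;                                \
    })

int main(void)
{
    int child_count = 3;                 /* mn-0, mn-1, dbm-0 */

    /* Values seen/inferred in the gdb session above: dbm-0 (client-2)
     * is the lone source, and no brick is marked as a sink. */
    char sources[3]      = {0, 0, 1};
    char sinks[3]        = {0, 0, 0};
    char healed_sinks[3] = {0, 0, 0};    /* reported as all zero */

    printf("sources=%d sinks=%d healed_sinks=%d\n",
           AFR_COUNT(sources, child_count),
           AFR_COUNT(sinks, child_count),
           AFR_COUNT(healed_sinks, child_count));

    if (AFR_COUNT(healed_sinks, child_count) == 0)
        printf("healed_sinks is empty -> this heal round is skipped "
               "and /db/upgrade stays in heal info\n");
    return 0;
}

This matches the behaviour described above: the condition is hit on every heal
attempt, so the entry never leaves heal info.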
Any idea how to improve this?
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1740968
[Bug 1740968] glustershd cannot decide healed_sinks and skips the repair, so
some entries linger in volume heal info
--
You are receiving this mail because:
You are on the CC list for the bug.