[Gluster-users] when there is a dangling entry (without gfid) in only one brick dir, glusterfs heal info keeps showing the entry and glustershd cannot actually remove it from the brick

Zhou, Cynthia (NSB - CN/Hangzhou) cynthia.zhou at nokia-sbell.com
Wed Oct 10 08:52:33 UTC 2018


Hi glusterfs experts,
I have hit a problem in my test bed (3 bricks on 3 SN nodes): "/" always shows up in the `gluster v heal info` output. In my ftest + reboot-sn-nodes-randomly test, the heal info output keeps showing the entry "/" even after hours, and touching or listing files under /mnt/mstate does not make it go away.

[root at sn-0:/mnt/bricks/mstate/brick]
# gluster v heal mstate info
Brick sn-0.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1

Brick sn-2.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1

Brick sn-1.local:/mnt/bricks/mstate/brick
/
Status: Connected
Number of entries: 1


From the SN glustershd.log I find the following messages:
[2018-10-10 08:13:00.005250] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-mstate-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from mstate-client-0
[2018-10-10 08:13:00.006077] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-mstate-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2018-10-10 08:13:00.011599] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-mstate-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2018-10-10 08:16:28.722059] W [MSGID: 108015] [afr-self-heal-entry.c:47:afr_selfheal_entry_delete] 0-mstate-replicate-0: expunging dir 00000000-0000-0000-0000-000000000001/fstest_76f272545249be5d71359f06962e069b (00000000-0000-0000-0000-000000000000) on mstate-client-0
[2018-10-10 08:16:28.722975] W [MSGID: 114031] [client-rpc-fops.c:670:client3_3_rmdir_cbk] 0-mstate-client-0: remote operation failed [No such file or directory]


When I check the environment I find that fstest_76f272545249be5d71359f06962e069b exists only on the sn-0 brick, and its getfattr output is empty!

[root at sn-0:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex fstest_76f272545249be5d71359f06962e069b    //return is empty output
[root at sn-0:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-1=0x000000000000000000000000
trusted.afr.mstate-client-2=0x0000000000000000000002a7
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff
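For reference, each trusted.afr.* value above is three big-endian 32-bit pending counters (data, metadata, entry changelog). A small sketch to decode them (assuming the standard AFR changelog layout; the helper name is mine):

```python
def decode_afr_changelog(hexval: str):
    """Decode a trusted.afr.* xattr value into its three big-endian
    32-bit pending counters: (data, metadata, entry)."""
    raw = bytes.fromhex(hexval[2:] if hexval.startswith("0x") else hexval)
    return tuple(int.from_bytes(raw[i:i + 4], "big") for i in (0, 4, 8))

# sn-0's view of mstate-client-2 from the dump above:
print(decode_afr_changelog("0x0000000000000000000002a7"))  # -> (0, 0, 679)
```

So sn-0 appears to be accumulating entry-changelog blame against client-2 (679 pending entry operations), which would fit "/" never leaving the heal info output.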

[root at sn-1:/root]
# stat /mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b
stat: cannot stat '/mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b': No such file or directory
[root at sn-1:/root]
[root at sn-1:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-0=0x000000000000000000000006
trusted.afr.mstate-client-1=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff

[root at sn-2:/mnt/bricks/mstate/brick]
# stat /mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b
stat: cannot stat '/mnt/bricks/mstate/brick/fstest_76f272545249be5d71359f06962e069b': No such file or directory
[root at sn-2:/mnt/bricks/mstate/brick]

[root at sn-2:/mnt/bricks/mstate/brick]
# getfattr -m . -d -e hex .
# file: .
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.mstate-client-0=0x000000000000000000000006
trusted.afr.mstate-client-2=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xbf7aad0e4ce44196aa9b0a33928ea2ff


I think the entry fstest_76f272545249be5d71359f06962e069b should either be assigned a gfid or be removed. The glustershd.log above shows clearly that glustershd on sn-0 tries to remove this dangling entry but hits an error. When I debug with gdb I find that in this case the entry is never assigned a gfid: the source parameter passed to __afr_selfheal_heal_dirent is 1, so replies[source].op_ret == -1, and no gfid can be assigned to it. My question is: since there is no fstest_76f272545249be5d71359f06962e069b on sn-1 and sn-2, removing this dangling entry on sn-0 cannot go through syncop_rmdir. I would like your opinion on this issue, thanks!

Thread 12 "glustershdheal" hit Breakpoint 1, __afr_selfheal_heal_dirent (frame=0x7f5aec009350, this=0x7f5b1001d8d0, fd=0x7f5b0800c8a0,
    name=0x7f5b08059db0 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001e80, source=1, sources=0x7f5af8fefb20 "",
    healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:172
172 afr-self-heal-entry.c: No such file or directory.
(gdb) print name
$17 = 0x7f5b08059db0 "fstest_76f272545249be5d71359f06962e069b"
(gdb) print source
$18 = 1
(gdb) print replies[0].op_ret
$19 = 0
(gdb) print replies[1].op_ret
$20 = -1
(gdb) print replies[2].op_ret
$21 = -1
(gdb) print replies[1].op_errno
$22 = 2
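The branch we end up in can be sketched like this (a simplified Python model of __afr_selfheal_heal_dirent's decision, not the actual C; names are mine):

```python
ENOENT = 2

def heal_dirent_action(source: int, replies: list) -> str:
    """Simplified decision: if lookup of the entry failed on the source
    brick with ENOENT, the entry is expunged from the sinks
    (afr_selfheal_entry_delete); otherwise it is (re)created on the
    sinks with the gfid seen on the source."""
    src = replies[source]
    if src["op_ret"] == -1 and src["op_errno"] == ENOENT:
        return "expunge"          # delete the stale dirent on the sinks
    return "recreate-with-source-gfid"

# Values from the gdb session above: source == 1, replies[1] failed
# with ENOENT, so the expunge path is taken -- but the dirent to
# expunge has a null gfid, and the subsequent rmdir fails.
print(heal_dirent_action(1, [
    {"op_ret": 0,  "op_errno": 0},   # client-0: entry exists (no gfid)
    {"op_ret": -1, "op_errno": 2},   # client-1: ENOENT
    {"op_ret": -1, "op_errno": 2},   # client-2: ENOENT
]))  # -> expunge
```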


When I set a breakpoint on afr_selfheal_entry_delete:
Thread 12 "glustershdheal" hit Breakpoint 1, afr_selfheal_entry_delete (this=0x7f5b1001d8d0, dir=0x7f5b100847f0,
    name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, child=0, replies=0x7f5af8fef190) at afr-self-heal-entry.c:24
24  afr-self-heal-entry.c: No such file or directory.
(gdb) print uuid_utoa(inode->gfid)
$1 = 0x7f5aec0022a0 "00000000-0000-0000-0000-", '0' <repeats 12 times>
(gdb) print name
$2 = 0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b"
(gdb) bt
#0  afr_selfheal_entry_delete (this=0x7f5b1001d8d0, dir=0x7f5b100847f0, name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b",
    inode=0x7f5aec001650, child=0, replies=0x7f5af8fef190) at afr-self-heal-entry.c:24
#1  0x00007f5b141e517c in __afr_selfheal_heal_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510,
    name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, source=1, sources=0x7f5af8fefb20 "",
    healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:201
#2  0x00007f5b141e59ab in __afr_selfheal_entry_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510,
    name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", inode=0x7f5aec001650, source=1, sources=0x7f5af8fefb20 "",
    healed_sinks=0x7f5af8fefae0 "\001", locked_on=0x7f5af8fefac0 "\001\001\001\370Z\177", replies=0x7f5af8fef190) at afr-self-heal-entry.c:383
#3  0x00007f5b141e63ec in afr_selfheal_entry_dirent (frame=0x7f5aec21f470, this=0x7f5b1001d8d0, fd=0x7f5b10004510,
    name=0x7f5b08059510 "fstest_76f272545249be5d71359f06962e069b", parent_idx_inode=0x0, subvol=0x7f5b10016f70, full_crawl=_gf_true)
    at afr-self-heal-entry.c:610
#4  0x00007f5b141e6a1a in afr_selfheal_entry_do_subvol (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, child=0) at afr-self-heal-entry.c:742
#5  0x00007f5b141e7207 in afr_selfheal_entry_do (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, source=1, sources=0x7f5af8ff07f0 "",
    healed_sinks=0x7f5af8ff07b0 "\001") at afr-self-heal-entry.c:908
#6  0x00007f5b141e7846 in __afr_selfheal_entry (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, fd=0x7f5b10004510, locked_on=0x7f5af8ff0900 "\001\001\001[")
    at afr-self-heal-entry.c:1002
#7  0x00007f5b141e7d4a in afr_selfheal_entry (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, inode=0x7f5b100847f0) at afr-self-heal-entry.c:1112
#8  0x00007f5b141df3aa in afr_selfheal_do (frame=0x7f5b100011c0, this=0x7f5b1001d8d0, gfid=0x7f5af8ff0b00 "") at afr-self-heal-common.c:2534
#9  0x00007f5b141df4a0 in afr_selfheal (this=0x7f5b1001d8d0, gfid=0x7f5af8ff0b00 "") at afr-self-heal-common.c:2575
#10 0x00007f5b141eadec in afr_shd_selfheal (healer=0x7f5b10084c30, child=0, gfid=0x7f5af8ff0b00 "") at afr-self-heald.c:343
#11 0x00007f5b141eb19b in afr_shd_index_heal (subvol=0x7f5b10016f70, entry=0x7f5b100012f0, parent=0x7f5af8ff0dc0, data=0x7f5b10084c30)
    at afr-self-heald.c:440
#12 0x00007f5b1a682ed3 in syncop_mt_dir_scan (frame=0x7f5b100b89e0, subvol=0x7f5b10016f70, loc=0x7f5af8ff0dc0, pid=-6, data=0x7f5b10084c30,
    fn=0x7f5b141eb04c <afr_shd_index_heal>, xdata=0x7f5b100b88d0, max_jobs=1, max_qlen=1024) at syncop-utils.c:407
#13 0x00007f5b141eb445 in afr_shd_index_sweep (healer=0x7f5b10084c30, vgfid=0x7f5b14213790 "glusterfs.xattrop_index_gfid") at afr-self-heald.c:494
#14 0x00007f5b141eb524 in afr_shd_index_sweep_all (healer=0x7f5b10084c30) at afr-self-heald.c:517
#15 0x00007f5b141eb827 in afr_shd_index_healer (data=0x7f5b10084c30) at afr-self-heald.c:597
#16 0x00007f5b193cd5da in start_thread () from /lib64/libpthread.so.0
#17 0x00007f5b18ca3cbf in clone () from /lib64/libc.so.6
(gdb) quit
A debugging session is active.

      Inferior 1 [process 2230] will be detached.
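To spot such dangling entries on a brick, I use a rough helper like the following (my assumption: any top-level brick entry without a trusted.gfid xattr is dangling; the xattr name is parameterized so the idea can be tried with user.* xattrs without root):

```python
import os

def find_dangling(brick_dir: str, gfid_xattr: str = "trusted.gfid"):
    """Return top-level brick entries that lack the gfid xattr."""
    dangling = []
    for name in os.listdir(brick_dir):
        if name == ".glusterfs":
            continue  # gluster's internal gfid namespace
        path = os.path.join(brick_dir, name)
        if gfid_xattr not in os.listxattr(path):
            dangling.append(name)
    return dangling
```

On sn-0 this reports fstest_76f272545249be5d71359f06962e069b. The only manual workaround I can see is to rmdir it directly on the brick (since it never got a gfid, there should be nothing under .glusterfs to clean up), but I would prefer glustershd to handle this case itself.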