[Bugs] [Bug 1748205] New: null gfid entries can not be healed

bugzilla at redhat.com bugzilla at redhat.com
Tue Sep 3 07:32:59 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1748205

            Bug ID: 1748205
           Summary: null gfid entries can not be healed
           Product: GlusterFS
           Version: 4.1
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: selfheal
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: zz.sh.cynthia at gmail.com
                CC: bugs at gluster.org
  Target Milestone: ---
    Classification: Community



Description of problem:

Some entries cannot be healed because their gfid is empty (null).
Version-Release number of selected component (if applicable):

3.12.15
# gluster v info services

Volume Name: services
Type: Replicate
Volume ID: 32b6bb97-4d0a-4096-9cfa-4cf0385bed31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 169.254.0.31:/mnt/bricks/services/brick
Brick2: 169.254.0.49:/mnt/bricks/services/brick
Options Reconfigured:
performance.client-io-threads: off
server.allow-insecure: on
network.ping-timeout: 42
cluster.consistent-metadata: on
cluster.favorite-child-policy: mtime
cluster.server-quorum-type: none
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
How reproducible:


Steps to Reproduce:
1. Start I/O on one glusterfs client node.
2. Hard reboot all 3 storage nodes (sn-0 and sn-1 each have a brick, sn-2 is quorum).
3. Sometimes this problem appears.

Actual results:


Expected results:


Additional info:

1> "/" keeps showing up in the output of "gluster v heal services info"; it
seems glustershd cannot finish healing "/" of the services volume. When I check
the glustershd log on the sn-0 node, the output quoted below is printed repeatedly.
2> There is one entry, fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t,
that exists only in the "/mnt/bricks/services/brick" directory on the sn-1 node
(it does not exist on sn-0), and it has no xattrs at all.
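A quick way to look for other entries in the same state is to walk the brick directory and check each name for a trusted.gfid xattr. The following is only a minimal sketch: the function name entries_missing_gfid and the injectable read_xattr parameter are made up for illustration, it only looks at the top level of the directory, and reading trusted.* xattrs needs root on the brick node.

```python
import errno
import os


def entries_missing_gfid(directory, read_xattr=None):
    """Return names directly under `directory` that lack a trusted.gfid xattr.

    `read_xattr` is a hypothetical hook (defaults to os.getxattr) so the
    check can be exercised without root or a real brick.
    """
    if read_xattr is None:
        read_xattr = os.getxattr  # Linux only
    missing = []
    for name in sorted(os.listdir(directory)):
        if name == ".glusterfs":
            continue  # internal housekeeping directory, not a user entry
        path = os.path.join(directory, name)
        try:
            read_xattr(path, "trusted.gfid", follow_symlinks=False)
        except OSError as exc:
            if exc.errno == errno.ENODATA:
                # Same errno as the "[No data available]" in the brick log.
                missing.append(name)
            else:
                raise
    return missing
```

Run as root on sn-1, entries_missing_gfid("/mnt/bricks/services/brick") would be expected to list the fstest_..._symlink_00_t entry, assuming the brick is still in the state shown in this report.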


[question]:
1> I checked the glusterfs heal-related code and did not find much difference
between glusterfs 3.12.15 (which we are using) and the latest version, 6.5. Is
this a known issue? Do you think it also exists in the latest version?
2> In this case "/" on sn-0 accuses sn-1, and the sn-0 shd tries to remove this
entry from sn-1 but fails. Is this failure the cause of the issue?
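For reference, the "accuse" direction can be read from the trusted.afr.* values in the getfattr output of "." on each brick. An AFR changelog xattr is 12 bytes, which are three 32-bit big-endian pending counters in the order data, metadata, entry. A minimal sketch of decoding them (the function name decode_afr_changelog is made up for illustration; the hex values are the ones from this report):

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* hex value into its three pending counters."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# trusted.afr.services-client-1 as stored on sn-0 (written here as the
# three 4-byte fields: data, metadata, entry).
SN0_ABOUT_SN1 = "0x" "00000000" "00000000" "0000010a"
# trusted.afr.services-client-0 as stored on sn-1.
SN1_ABOUT_SN0 = "0x" "00000000" "00000000" "00000000"

print(decode_afr_changelog(SN0_ABOUT_SN1))  # {'data': 0, 'metadata': 0, 'entry': 266}
print(decode_afr_changelog(SN1_ABOUT_SN0))  # {'data': 0, 'metadata': 0, 'entry': 0}
```

So only the entry counter is non-zero, and only on sn-0: sn-0 records pending entry operations against sn-1 for "/", while sn-1 records nothing against sn-0.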



[glustershd log on sn-0]:
[2019-09-03 07:10:50.003265] I [MSGID: 108026]
[afr-self-heald.c:432:afr_shd_index_heal] 0-services-replicate-0: got entry:
00000000-0000-0000-0000-000000000001 from services-client-0
[2019-09-03 07:10:50.003476] I [MSGID: 108026]
[afr-self-heald.c:341:afr_shd_selfheal] 0-services-replicate-0: entry: path /,
gfid: 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.006066] I [MSGID: 108026]
[afr-self-heal-entry.c:893:afr_selfheal_entry_do] 0-services-replicate-0:
performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.017819] W [MSGID: 108015]
[afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-services-replicate-0:
expunging file
00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
(00000000-0000-0000-0000-000000000000) on services-client-1
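To pick out such records from a longer glustershd log, something like the sketch below could be used. It is only an illustration under assumptions: the regular expressions are derived from the log lines quoted in this report, not from any gluster log format specification, and the helper names (join_wrapped, parse_record) are made up. Because the records here are wrapped by the mail client, the wrapped pieces are joined first.

```python
import re

RECORD_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} ")
RECORD = re.compile(
    r"\[(?P<time>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[MSGID: (?P<msgid>\d+)\] "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): (?P<message>.*)"
)
GFID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")
NULL_GFID = "00000000-0000-0000-0000-000000000000"


def join_wrapped(lines):
    """Merge mail-wrapped continuation lines into one string per record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)  # a timestamp starts a new record
        else:
            records[-1] += " " + line
    return records


def parse_record(record):
    """Return the fields of one log record, or None if it does not match."""
    match = RECORD.match(record)
    if match is None:
        return None
    fields = match.groupdict()
    fields["gfids"] = GFID.findall(fields["message"])
    fields["has_null_gfid"] = NULL_GFID in fields["gfids"]
    return fields


# The "expunging file" record quoted above, wrapped as in this mail.
sample = [
    "[2019-09-03 07:10:50.017819] W [MSGID: 108015]",
    "[afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-services-replicate-0:",
    "expunging file",
    "00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t",
    "(00000000-0000-0000-0000-000000000000) on services-client-1",
]
for record in join_wrapped(sample):
    parsed = parse_record(record)
    print(parsed["msgid"], parsed["func"], parsed["has_null_gfid"])
```

The has_null_gfid flag is set when the all-zero gfid appears in the message, which is the case for the expunge attempt on services-client-1.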


[root at SN-0(RCP-1234) /mnt/bricks/services/brick]
# gluster v heal services info
Brick 169.254.0.31:/mnt/bricks/services/brick
/ 
Status: Connected
Number of entries: 1

Brick 169.254.0.49:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

[root at SN-0(RCP-1234) /mnt/bricks/services/brick]
# ls -l
total 92
drwxr-xr-x  9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------  2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---  2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 db
drwx------  5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
drwx------  8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+ 2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+ 2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--  2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--  2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x  3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------  2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root at SN-0(RCP-1234) /mnt/bricks/services/brick]

[root at SN-0(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x00000000000000000000010a
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31

[root at SN-0(RCP-1234) /mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
00000000-0000-0000-0000-000000000001 
xattrop-7006a00e-edbc-4e0c-862b-0c58b2974487
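The name in the xattrop index directory is the gfid of the entry that needs heal, in canonical UUID form, while getfattr prints the same gfid as plain hex. The sketch below converts one into the other and builds the usual .glusterfs handle path (two levels of two-hex-digit directories, then the full gfid). The helper names gfid_to_uuid and gfid_handle_path are made up for illustration.

```python
import os
import uuid


def gfid_to_uuid(hex_value):
    """Convert a getfattr-style hex gfid into canonical UUID text."""
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    return str(uuid.UUID(hex=digits))


def gfid_handle_path(brick_root, gfid):
    """Build the .glusterfs handle path for a gfid in canonical form."""
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


root_gfid = gfid_to_uuid("0x" + "0" * 31 + "1")  # trusted.gfid of "." above
print(root_gfid)  # 00000000-0000-0000-0000-000000000001
print(gfid_handle_path("/mnt/bricks/services/brick", root_gfid))

# The same conversion applied to trusted.glusterfs.volume-id gives the
# Volume ID printed by "gluster v info services".
print(gfid_to_uuid("0x32b6bb974d0a40969cfa4cf0385bed31"))
```

So the index entry 00000000-0000-0000-0000-000000000001 is "/" itself. An entry whose gfid is all zeros has no usable handle path. A regular file on a brick normally has a hard link at its handle path, which is consistent with the link count of 2 on LMN-0_recover_flag and the "Links: 1" that stat reports for the fstest entry on sn-1.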

/////////////////////////////////////////////////
[root at SN-1(RCP-1234) /root]
# cd /mnt/bricks/services/brick/
[root at SN-1(RCP-1234) /mnt/bricks/services/brick]
# ls -la
total 108
drwxr-xr-x+  22 root                    root                    4096 Sep  3 14:56 .
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:00 ..
drwxr-xr-x    9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------    2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---    2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 db
drwx------    5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
-rw-r--r--    1 root                    root                       0 Sep  3 14:33 fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
drw-------  263 root                    root                    4096 Sep  2 14:23 .glusterfs
drwx------    8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+   2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+   2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--    2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--    2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x    3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------    2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root at SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
[root at SN-1(RCP-1234) /mnt/bricks/services/brick]
# stat fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t 
  File: fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fd71h/64881d    Inode: 8767        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-09-03 14:33:48.468224170 +0800
Modify: 2019-09-03 14:33:48.468224170 +0800
Change: 2019-09-03 14:33:48.468224170 +0800
 Birth: 2019-09-03 14:33:48.468224170 +0800
[root at SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31

[root at SN-1(RCP-1234) /mnt/bricks/services/brick]
In the sn-1 services brick process log, the following errors are printed:

[2019-09-03 07:20:51.018870] E [MSGID: 113002] [posix.c:362:posix_lookup]
0-services-posix: buf->ia_gfid is null for
/mnt/bricks/services/brick/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
[No data available]
[2019-09-03 07:20:51.018910] W [MSGID: 115005]
[server-resolve.c:70:resolve_gfid_entry_cbk] 0-services-server:
00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t:
failed to resolve (No data available) [No data available]

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.