[Bugs] [Bug 1420184] New: [Remove-brick] Hardlink migration fails with " lookup failed (No such file or directory)" error messages in rebalance logs

bugzilla at redhat.com
Wed Feb 8 04:24:35 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1420184

            Bug ID: 1420184
           Summary: [Remove-brick] Hardlink migration fails with "lookup
                    failed (No such file or directory)" error messages in
                    rebalance logs
           Product: GlusterFS
           Version: 3.8
         Component: distribute
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: nbalacha at redhat.com
                CC: amukherj at redhat.com, bugs at gluster.org,
                    tdesala at redhat.com
        Depends On: 1415761
            Blocks: 1409474, 1419855



+++ This bug was initially created as a clone of Bug #1415761 +++

+++ This bug was initially created as a clone of Bug #1409474 +++

Description of problem:
=======================
If the dataset contains hardlinks, rebalance fails to migrate a few of them
during a remove-brick operation. The rebalance logs show the following lookup
failures:

[2017-01-02 06:41:06.277232] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or
directory)
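
To see at a glance which subvolumes the failures point at, messages like the
ones above can be tallied per subvolume. This is a minimal sketch, assuming the
log text has been saved to a file. The function name is illustrative and not
part of the report, and grep -o is a common extension (GNU, BSD) rather than
strict POSIX.

```shell
# summarize_lookup_failures LOGFILE (hypothetical helper, not from the report)
# Prints "<subvolume> <count>" for every "lookup failed on <subvolume>"
# message. Newlines are turned into spaces first, so entries that were
# wrapped across several lines (as in this mail) are still matched.
summarize_lookup_failures() {
    tr '\n' ' ' < "$1" |
        grep -o 'lookup failed on [^ ]*' |
        awk '{ count[$4]++ } END { for (s in count) print s, count[s] }' |
        sort
}

# Example (the log path is an assumption):
#   summarize_lookup_failures /var/log/glusterfs/distrep-rebalance.log
```

For the eleven entries quoted above, this would report 4 failures on
distrep-replicate-0 and 7 on distrep-replicate-2.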


How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a Distributed-Replicate volume and start it.
2) FUSE mount the volume and create a dataset with a large number of hardlinks,
for example:
for i in {1..20000};do touch f$i;done
for i in {1..20000};do ln f$i fl$i;done
3) Start remove-brick operation to trigger rebalance.

For a few of the hardlinks, rebalance fails because of lookup failures.
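
The dataset from step 2 can be wrapped in a small helper for repeated runs.
This is a minimal sketch: the function name and its arguments are illustrative
and not part of the report, and it uses only touch and ln, in the same order as
the two loops above.

```shell
# make_hardlink_dataset DIR COUNT (hypothetical helper, not from the report)
# Creates f1..fCOUNT in DIR, then one hardlink flN for every fN,
# in the same order as the two loops in step 2.
make_hardlink_dataset() (
    dir=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        touch "$dir/f$i"
        i=$((i + 1))
    done
    i=1
    while [ "$i" -le "$count" ]; do
        ln "$dir/f$i" "$dir/fl$i"
        i=$((i + 1))
    done
)

# Example (the mount point is an assumption):
#   make_hardlink_dataset /mnt/distrep 20000
```

Step 3 would then typically be started from the gluster CLI with something
like "gluster volume remove-brick VOLNAME BRICK... start", where VOLNAME and
BRICK are placeholders for the volume and the bricks being removed.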

Actual results:
===============
Hardlink migration is failing during remove-brick operation

Expected results:
=================
Hardlinks should be migrated without any errors/issues during remove-brick


Additional info:
================
After the rebalance failures, a few original files and their hardlinks are
still present on the decommissioned bricks, so a commit will result in the
loss of those files.




--- Additional comment from Prasad Desala on 2017-01-02 06:07:16 EST ---

The output snippets above (lookup errors in the rebalance logs and the ll
listing from the decommissioned bricks) were taken from different nodes.

Outputs from node server1:
===============================

[2017-01-02 06:41:06.277232] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4013: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.510761] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4027: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.541836] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4028: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:06.947640] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4037: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:07.360477] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl4047: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:44.231718] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl3284: lookup failed on distrep-replicate-2 (No such file or
directory)
[2017-01-02 06:41:49.990234] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1578: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:50.217159] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1590: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:51.594092] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1595: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:51.873224] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1598: lookup failed on distrep-replicate-0 (No such file or
directory)
[2017-01-02 06:41:58.151533] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-distrep-dht: Migrate file
failed:/fl1586: lookup failed on distrep-replicate-2 (No such file or
directory)

[root at node1 ~]# ll /bricks/brick2/b2/* | grep -i rw
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4013
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4027
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4028
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4037
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4038
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/f4047
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5746
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5759
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5828
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5839
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/f5841
-rw-r--r--. 3 root root 0 Jan  2 11:17 /bricks/brick2/b2/f8016
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4013
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4027
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4028
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4037
-rw-r--r--. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4038
-rw-rw----. 3 root root 0 Jan  2 11:13 /bricks/brick2/b2/fl4047
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5746
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5759
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5828
-rw-r--r--. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5839
-rw-rw----. 3 root root 0 Jan  2 11:15 /bricks/brick2/b2/fl5841
-rw-r--r--. 3 root root 0 Jan  2 11:17 /bricks/brick2/b2/fl8016
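
A listing like the one above can also be produced with find, which is easier
to script as a check before committing the remove-brick. This is a minimal
sketch resting on assumptions that are not from the report: the function name
is illustrative, the brick's internal .glusterfs directory is skipped, and
files whose mode is exactly 1000 (---------T, the mode DHT link-to files
normally have) are treated as non-data files, much as the grep -i rw filter
does above.

```shell
# list_brick_data_files BRICK_DIR (hypothetical helper, not from the report)
# Prints every regular file under BRICK_DIR, skipping the .glusterfs
# directory and files with mode exactly 1000 (assumed to be link-to files).
list_brick_data_files() (
    brick=${1%/}    # drop a trailing slash so the -path pattern matches
    find "$brick" -path "$brick/.glusterfs" -prune -o \
        -type f ! -perm 1000 -print | sort
)

# Example, using the brick path from the listing above:
#   list_brick_data_files /bricks/brick2/b2
```

Any path printed for a decommissioned brick is a file that still has data
there. An empty result suggests that nothing was left behind on that brick.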


Rebalance log errors:
=====================
[2017-01-03 06:40:15.885029] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl5769: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.047939] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl5770: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:16.178511] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl5776: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:40:17.786372] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl6450: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:40:18.483995] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl6466: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:41:19.202179] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl7536: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:41:19.690604] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl7551: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:06.334415] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9913: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.452281] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9920: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.472840] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9922: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:06.781910] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9938: lookup failed on newdr-replicate-3 (No such file or directory)
[2017-01-03 06:42:06.800052] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9940: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:37.065830] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9563: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.321748] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9564: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.350976] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9566: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:37.372147] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl9567: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:42:56.941938] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11382: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:42:57.075788] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11383: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:41.016772] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl12808: lookup failed on newdr-replicate-1 (No such file or directory)
[2017-01-03 06:43:52.374158] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11814: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.860047] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11820: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:52.963148] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11821: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:43:53.189461] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl11836: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.132674] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl13827: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:44:49.141978] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl13834: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.011654] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl15846: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.450021] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl15860: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.458259] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl15872: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:45:39.610044] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl15875: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.056754] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl17948: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.240254] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl17960: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.249345] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl17966: lookup failed on newdr-replicate-2 (No such file or directory)
[2017-01-03 06:46:33.561897] E [MSGID: 109023]
[dht-rebalance.c:1378:dht_migrate_file] 0-newdr-dht: Migrate file
failed:/fl17977: lookup failed on newdr-replicate-2 (No such file or directory)

--- Additional comment from Nithya Balachandran on 2017-01-23 11:45:32 EST ---

RCA:

The remove-brick operation will migrate files with hardlinks (unlike a regular
rebalance). The following steps are performed:

1. dht_setxattr (key = GF_XATTR_FILE_MIGRATE_KEY) sets the target/hashed
subvolume for a migrate file operation in local->rebalance.target_node.

2. For a hardlink, dht_migrate_file () will use the hashed subvol of the first
link to be migrated as the hashed subvolume. This might not match the value in
local->rebalance.target_node for the other links.

3. dht_migrate_file returns 0 if __is_file_migratable () /
__check_file_has_hardlink returns -2 (indicating that the file is a hardlink).

4. rebalance_task_completion updates the cached subvol in inode_ctx with the
value of local->rebalance.target_node. This is incorrect and causes the lookup
failures for successive hardlink lookups as the file does not exist on that
subvol.


Solution:

Do not call dht_layout_preset in rebalance_task_completion. The layout will be
set by the syncop_lookup that dht_migrate_file calls after a successful file
migration.

--- Additional comment from Nithya Balachandran on 2017-01-23 11:57:09 EST ---

Upstream patch: 
https://review.gluster.org/#/c/16457/1

--- Additional comment from Worker Ant on 2017-01-30 01:18:19 EST ---

REVIEW: https://review.gluster.org/16457 (cluster/dht: Don't update layout in
rebalance_task_completion) posted (#3) for review on master by N Balachandran
(nbalacha at redhat.com)

--- Additional comment from Worker Ant on 2017-02-06 02:24:27 EST ---

COMMIT: https://review.gluster.org/16457 committed in master by Raghavendra G
(rgowdapp at redhat.com) 
------
commit ddf05f3d1e39cc920251c809e9ba42fe42b2c5f2
Author: N Balachandran <nbalacha at redhat.com>
Date:   Mon Jan 23 22:19:01 2017 +0530

    cluster/dht: Don't update layout in rebalance_task_completion

    Updating the layout in the dht inode_ctx in
    rebalance_task_completion after the file is migrated
    is erroneous in case of files with hardlinks.
    This step can be skipped as the layout will be set
    in the syncop_lookup call post the migration in
    dht_migrate_file.

    Change-Id: I24ac798a919585d91a117d6a207e6a31b88486c6
    BUG: 1415761
    Signed-off-by: N Balachandran <nbalacha at redhat.com>
    Reviewed-on: https://review.gluster.org/16457
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp at redhat.com>
    Reviewed-by: Susant Palai <spalai at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1409474
[Bug 1409474] [Remove-brick] Hardlink migration fails with "lookup failed
(No such file or directory)" error messages in rebalance logs
https://bugzilla.redhat.com/show_bug.cgi?id=1415761
[Bug 1415761] [Remove-brick] Hardlink migration fails with "lookup failed
(No such file or directory)" error messages in rebalance logs
https://bugzilla.redhat.com/show_bug.cgi?id=1419855
[Bug 1419855] [Remove-brick] Hardlink migration fails with "lookup failed
(No such file or directory)" error messages in rebalance logs