[Bugs] [Bug 1468186] New: [Geo-rep]: entry failed to sync to slave with ENOENT error

Thu Jul 6 08:49:08 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1468186

            Bug ID: 1468186
           Summary: [Geo-rep]: entry failed to sync to slave with ENOENT
                    error
           Product: Red Hat Gluster Storage
           Version: 3.3
         Component: geo-replication
          Severity: high
          Assignee: avishwan at redhat.com
          Reporter: rhinduja at redhat.com
        QA Contact: rhinduja at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com,
                    khiremat at redhat.com, rhinduja at redhat.com,
                    rhs-bugs at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1467718

+++ This bug was initially created as a clone of Bug #1467718 +++

Description of problem:
While running iozone, bonnie, and smallfile workloads on the master, an entry
failed to sync to the slave with ENOENT (the parent directory did not exist on
the slave).

The error looks like this:

[2017-06-16 14:54:26.1849] E [master(/gluster/brick1/brick):785:log_failures]
_GMaster: ENTRY FAILED: ({'uid': 0, 'gfid':
'4d16fd49-591d-4088-8f87-e75c081ca2f9', 'gid': 0, 'mode': 33152, 'entry':
'.gfid/abe8c2f6-210b-4ac3-8c05-a84d44c3b5b1/dovecot.index', 'op': 'MKNOD'}, 2) 
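To read the failure tuple above: the trailing 2 is the errno reported for the
operation, and 'mode' is the raw st_mode of the entry. A minimal Python sketch
(standard library only; the tuple is copied from the log line, and the variable
names are illustrative, not taken from the geo-rep code) decodes both:

```python
import errno
import stat

# Failure tuple copied from the log line above: (entry dict, errno).
failure = ({'uid': 0, 'gfid': '4d16fd49-591d-4088-8f87-e75c081ca2f9',
            'gid': 0, 'mode': 33152,
            'entry': '.gfid/abe8c2f6-210b-4ac3-8c05-a84d44c3b5b1/dovecot.index',
            'op': 'MKNOD'}, 2)

entry, err = failure

# Symbolic name of the errno value.
err_name = errno.errorcode[err]

# Split st_mode into file type and permission bits.
mode = entry['mode']
is_regular = stat.S_ISREG(mode)
perm_bits = oct(stat.S_IMODE(mode))

# The entry path has the form .gfid/<parent gfid>/<basename>.
_, parent_gfid, basename = entry['entry'].split('/', 2)

print(err_name, is_regular, perm_bits, parent_gfid, basename)
```

This decodes to ENOENT for a regular file with permissions 0600, named
dovecot.index, under the parent gfid abe8c2f6-210b-4ac3-8c05-a84d44c3b5b1.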

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Saw only once

Steps to Reproduce:
1. Set up geo-rep and run iozone, bonnie, and smallfile workloads on the master

Actual results:
Entry failure error with ENOENT

Expected results:
Entry failures should not happen

Additional info:

--- Additional comment from Kotresh HR on 2017-07-04 15:39:32 EDT ---

Analysis:
On one particular subvolume, but not on all subvolumes, the changelog records
an RMDIR followed by a MKDIR with the same gfid and pargfid/bname, as shown
below.

    E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 RMDIR \
    9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2
    E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 MKDIR 16877 0 0 \
    9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2

While processing this changelog, geo-rep treats the RMDIR as successful and
does a recursive rmdir on the slave, even though the directory still exists on
the master. Later entry creations under this directory that hashed to that
particular subvolume then failed with ENOENT.
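The pattern described in this analysis can be spotted in a changelog with a
short script. The following is only a sketch: the field layout is inferred from
the two records quoted above (with the line continuations joined), the
function name find_rmdir_mkdir_pairs is hypothetical, and this is not how
geo-rep itself parses changelogs.

```python
from urllib.parse import unquote


def find_rmdir_mkdir_pairs(records):
    """Return (gfid, pargfid/bname) for each RMDIR later followed by a
    MKDIR with the same gfid and the same pargfid/bname."""
    pending = set()  # (gfid, path) keys with an unmatched RMDIR
    pairs = []
    for line in records:
        fields = line.split()
        # Only entry records ("E ...") with at least gfid, op and path.
        if len(fields) < 4 or fields[0] != 'E':
            continue
        gfid, op = fields[1], fields[2]
        # The last field is pargfid%2Fbname; decode %2F to "/".
        path = unquote(fields[-1])
        key = (gfid, path)
        if op == 'RMDIR':
            pending.add(key)
        elif op == 'MKDIR' and key in pending:
            pending.discard(key)
            pairs.append(key)
    return pairs


# The two records from the analysis above, one record per line.
records = [
    'E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 RMDIR '
    '9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2',
    'E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 MKDIR 16877 0 0 '
    '9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2',
]
pairs = find_rmdir_mkdir_pairs(records)
print(pairs)
```

For the quoted records this reports a single pair for the directory fedora2.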

--- Additional comment from Worker Ant on 2017-07-04 15:43:30 EDT ---

REVIEW: https://review.gluster.org/17695 (geo-rep: Fix entry failure because
parent dir doesn't exist) posted (#1) for review on master by Kotresh HR
(khiremat at redhat.com)

--- Additional comment from Kotresh HR on 2017-07-04 15:45:16 EDT ---

Cause:
    The RMDIR-MKDIR pair is recorded in the changelog when the
    directory removal succeeds on the cached subvolume but
    fails on one of the hashed subvolumes for some reason
    (for example, it may be down). In this case, the directory is
    re-created on the cached subvolume, which is recorded as a
    MKDIR in the changelog.

Solution:
    While processing an RMDIR, geo-replication should stat the gfid
    on the master and should not delete the directory if it is
    still present.
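The check proposed here can be sketched as follows. This is not the code from
the review linked above; it is an illustration under stated assumptions: the
master volume is mounted so that a path of the form .gfid/<gfid> resolves (the
same path style as in the log entry), the function name should_rmdir_on_slave
is hypothetical, and treating ESTALE like ENOENT is an assumption.

```python
import errno
import os
import shutil
import tempfile


def should_rmdir_on_slave(master_mount, gfid):
    """Return True only if the gfid no longer resolves on the master."""
    try:
        os.lstat(os.path.join(master_mount, '.gfid', gfid))
    except OSError as e:
        # Assumption: ENOENT or ESTALE means the directory is really gone.
        if e.errno in (errno.ENOENT, errno.ESTALE):
            return True
        raise  # any other error: do not guess
    return False  # still present on master: skip the delete


# Demo: a temporary directory stands in for the master mount.
gfid = '61c67a2e-07f2-45a9-95cf-d8f16a5e9c36'
mount = tempfile.mkdtemp()
try:
    os.makedirs(os.path.join(mount, '.gfid', gfid))
    present_result = should_rmdir_on_slave(mount, gfid)  # dir exists
    os.rmdir(os.path.join(mount, '.gfid', gfid))
    gone_result = should_rmdir_on_slave(mount, gfid)     # dir removed
finally:
    shutil.rmtree(mount)

print(present_result, gone_result)
```

With such a guard, the RMDIR from an RMDIR-MKDIR pair is skipped while the
directory still exists on the master, so later entries under it still find
their parent on the slave.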

--- Additional comment from Worker Ant on 2017-07-05 11:44:27 EDT ---

COMMIT: https://review.gluster.org/17695 committed in master by Aravinda VK
(avishwan at redhat.com) 
------
commit b25bf64f3a3520a96ad557daa4903c0ceba96d72
Author: Kotresh HR <khiremat at redhat.com>
Date:   Tue Jul 4 08:46:06 2017 -0400

    geo-rep: Fix entry failure because parent dir doesn't exist

    In a distributed volume on master, it can so happen that
    the RMDIR followed by MKDIR is recorded in changelog on
    a particular subvolume with same gfid and pargfid/bname
    but not on all subvolumes as below.

    E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 RMDIR \
    9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2
    E 61c67a2e-07f2-45a9-95cf-d8f16a5e9c36 MKDIR 16877 0 0 \
    9cc51be8-91c3-4ef4-8ae3-17596fcfed40%2Ffedora2

    While processing this changelog, geo-rep thinks RMDIR is
    successful and does recursive rmdir on slave. But in the
    master the directory still exists. This could lead to
    data discrepancy between master and slave.

    Cause:
    RMDIR-MKDIR pair gets recorded so in changelog when the
    directory removal is successful on cached subvolume and
    failed in one of hashed subvol for some reason
    (may be down). In this case, the directory is re-created
    on cached subvol which gets recorded as MKDIR again in
    changelog.

    Solution:
    So while processing RMDIR geo-replication should stat on
    master with gfid and should not delete it if it's present.

    Change-Id: If5da1d6462eb4d9ebe2e88b3a70cc454411a133e
    BUG: 1467718
    Signed-off-by: Kotresh HR <khiremat at redhat.com>
    Reviewed-on: https://review.gluster.org/17695
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan at redhat.com>

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1467718
[Bug 1467718] [Geo-rep]: entry failed to sync to slave with ENOENT error