[Bugs] [Bug 1211863] RFE: Support in md-cache to use upcall notifications to invalidate its cache

Thu Oct 20 07:08:02 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1211863


--- Comment #123 from Worker Ant <bugzilla-bot at gluster.org> ---
COMMIT: http://review.gluster.org/15398 committed in master by Pranith Kumar
Karampuri (pkarampu at redhat.com) 
------
commit 8d8eded58cd5431a7000a70337444b828cb400d8
Author: Poornima G <pgurusid at redhat.com>
Date:   Sun Sep 4 08:27:47 2016 +0530

    md-cache, afr: Reduce the window of stale read

    Problem:
    Consider a replica setup where one mount writes data to a file and
    another mount reads it. In afr, read operations are not transaction
    based: a brick (the read subvolume) is chosen as part of lookup or
    other operations, and reads are always wound only to that read
    subvolume, even if a write from a different client has failed on
    this brick. The stale read continues until there is a lookup or a
    write operation from the mount point. Currently this is not a major
    issue, as a lookup is issued before every read and switches the read
    subvolume to a correct one. But with the plan of increasing the
    md-cache timeout to 600s, the stale read problem will be more
    pronounced: a stale read can continue for 600s (or more if cascaded
    with readdirp), as there will be no lookups.

    Solution:
    Afr doesn't have a built-in solution for stale reads that does not
    affect performance. The solution that came up was to use upcall: when
    a file on any brick is marked bad for the first time, upcall sends a
    notification to all the clients that had recently accessed the file.
    The solution has 2 parts:
    - Identifying when a file is marked bad, on any of the bricks,
      for the first time
    - Client side actions on receiving the notifications

    Identifying when a file is marked bad on any of the bricks for the first time:
    ------------------------------------------------------------------------------
    The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
    xattrs - the afr dirty bit and the afr pending xattrs.
       The dirty xattr is set to 1 before every write, and is unset if the
    write succeeds. In certain scenarios the dirty xattr can be 0 and the
    file can still be a bad copy. Hence the dirty xattr is not tracked.
       The pending xattr is set on the good copy, indicating which other
    bricks have a bad copy. Still, it is not as simple as notifying whenever
    any of the pending xattrs changes: that could lead to a flood of
    notifications if the other brick is completely down or consistently
    failing. Hence it is important to notify only once, the first time a
    good copy is marked bad.
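
    As a rough illustration of the notify-once rule above, the sketch below
    models it in Python. It is not the upcall code: the class, the method
    names, the per-file flag and the explicit reset after a heal are all
    assumptions made for illustration, and the xattr key strings are only
    example values.

```python
from typing import Dict, Set

# Example key only; treated here as "the dirty xattr" for illustration.
DIRTY_KEY = "trusted.afr.dirty"


class PendingTracker:
    """Hypothetical brick-side model of 'notify once per bad marking'."""

    def __init__(self) -> None:
        # Files for which a notification has already been sent.
        self._flagged: Set[str] = set()

    def on_xattrop(self, gfid: str, xattrs: Dict[str, int]) -> bool:
        """Return True if clients should be notified for this xattrop."""
        # The dirty xattr is deliberately ignored; only pending xattrs count.
        pending = {k: v for k, v in xattrs.items() if k != DIRTY_KEY}
        if not any(value != 0 for value in pending.values()):
            return False
        if gfid in self._flagged:
            # Already notified: avoids a flood while the other brick stays down.
            return False
        self._flagged.add(gfid)
        return True

    def on_heal_complete(self, gfid: str) -> None:
        # Assumption: once the file is healed, a later failure notifies again.
        self._flagged.discard(gfid)
```

    With this model, repeated failing writes to the same file produce a
    single notification until the flag is cleared.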

    Client side actions on receiving a pending xattr change notification:
    ---------------------------------------------------------------------
    md-cache will invalidate the cache of that file, so that a further lookup
    is passed down to afr and hence updates the read subvolume. Invalidating
    only in md-cache is not enough; consider the following order of
    operations:
    - pending xattr invalidation - invalidate md-cache
    - readdirp on the bad read subvolume - fill md-cache
    - lookup (served from md-cache)
    - read - wound to the old read subvol.
    Hence, along with invalidating md-cache, it is very important to reset the
    read subvolume for that file, in afr.
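
    The sequence above can be replayed with a toy model of a single mount.
    This is a sketch under stated assumptions, not the md-cache or afr code:
    the Client class, its methods and the brick names are hypothetical, and
    "refresh" stands in for whatever afr does to pick a good copy.

```python
from typing import Dict, Optional


class Client:
    """Toy model of one mount: md-cache on top of afr (names hypothetical)."""

    def __init__(self, good_brick: str, stale_brick: str) -> None:
        self.good_brick = good_brick
        # Read subvolume chosen by an earlier lookup; it has since gone stale.
        self.read_subvol: Optional[str] = stale_brick
        self.md_cache: Dict[str, str] = {"file": "cached-stat"}

    def _afr_refresh(self) -> None:
        # Stand-in for afr re-evaluating the copies and picking a good one.
        self.read_subvol = self.good_brick

    def on_notification(self, reset_read_subvol: bool) -> None:
        self.md_cache.pop("file", None)  # invalidate md-cache
        if reset_read_subvol:
            self.read_subvol = None  # force afr to choose again

    def readdirp(self) -> None:
        # readdirp returns stat data, which repopulates md-cache.
        self.md_cache["file"] = "stat-from-readdirp"

    def lookup(self) -> None:
        if "file" in self.md_cache:
            return  # served from md-cache; afr never sees the lookup
        self._afr_refresh()
        self.md_cache["file"] = "stat-from-lookup"

    def read(self) -> str:
        if self.read_subvol is None:
            self._afr_refresh()
        assert self.read_subvol is not None
        return self.read_subvol


def replay(reset_read_subvol: bool) -> str:
    """Run notification, readdirp, lookup, read; return the brick read from."""
    client = Client(good_brick="brick-0", stale_brick="brick-1")
    client.on_notification(reset_read_subvol)
    client.readdirp()
    client.lookup()
    return client.read()
```

    In this model, replay(False) still reads from the stale brick because the
    lookup is absorbed by md-cache, while replay(True) reads from the good
    one.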

    Design Credit: Anuradha Talur, Ravishankar N

    1. xattrop doesn't carry any info saying whether it is a pre-op or post-op.
    2. A pre-op xattrop will have a 0 value for all pending xattrs;
       the cbk of the pre-op xattrop carries the on-disk xattr value.
       A non-zero value indicates that healing is required.
    3. A post-op xattrop will have a non-zero value for any of the
       pending xattrs if the fop failed on any of the bricks.
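
    Since the request carries no pre-op/post-op flag, the values themselves
    have to be inspected. A minimal sketch of the two checks implied by the
    notes above follows; the function names and the dict-of-integers
    representation of the pending xattrs are assumptions for illustration.

```python
from typing import Dict


def signals_failed_fop(request_pending: Dict[str, int]) -> bool:
    """Note 3: a request carrying any non-zero pending value means the fop
    failed on some brick. An all-zero request is not flagged (note 2)."""
    return any(value != 0 for value in request_pending.values())


def heal_required(ondisk_pending: Dict[str, int]) -> bool:
    """Note 2: the callback of a pre-op xattrop returns the on-disk values;
    any non-zero value means the file needs healing."""
    return any(value != 0 for value in ondisk_pending.values())
```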

    Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
    BUG: 1211863
    Signed-off-by: Poornima G <pgurusid at redhat.com>
    Reviewed-on: http://review.gluster.org/15398
    Reviewed-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    Tested-by: Pranith Kumar Karampuri <pkarampu at redhat.com>
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
