[Bugs] [Bug 1350867] New: RFE: FEATURE: Lock revocation for features/locks xlator

bugzilla at redhat.com bugzilla at redhat.com
Tue Jun 28 14:31:43 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1350867

            Bug ID: 1350867
           Summary: RFE: FEATURE: Lock revocation for features/locks
                    xlator
           Product: GlusterFS
           Version: mainline
         Component: locks
          Keywords: FutureFeature, Triaged
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: bugs at gluster.org, rgowdapp at redhat.com, rwareing at fb.com
        Depends On: 1301401



+++ This bug was initially created as a clone of Bug #1301401 +++

Description of problem:
Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause cluster instability
and eventual complete unavailability due to failures in releasing entry/inode
locks in a timely manner.

Classic symptoms of this are increased brick (and/or gNFSd) memory usage due to
the high number of (lock request) frames piling up in the processes.  The
failure-mode results in bricks eventually slowing down to a crawl due to
swapping, or OOMing due to complete memory exhaustion; during this period the
entire cluster can begin to fail.  End-users will experience this as hangs on
the filesystem, first in a specific region of the file-system and ultimately
the entire filesystem as the offending brick begins to turn into a zombie (i.e.
not quite dead, but not quite alive either).

Currently, these situations must be handled by an administrator detecting &
intervening via the "clear-locks" CLI command.  Unfortunately this doesn't
scale for large numbers of clusters, and it depends on the correct (external)
detection of the locks piling up (for which there is little signal other than
state dumps).
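
For reference, the manual workflow today looks roughly like the following (a
sketch only; the volume name "patchy" and the file path are illustrative, and
statedump files normally land under /var/run/gluster by default):

# Take a brick statedump and look for blocked lock requests piling up:
gluster volume statedump patchy
grep -c BLOCKED /var/run/gluster/*.dump.*

# Then clear the offending locks by hand, e.g. blocked inode locks on a file:
gluster volume clear-locks patchy /testfile kind blocked inode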

This patch introduces two features to remedy this situation:

1. Monkey-unlocking - This is a feature targeted at developers (only!) to help
track down crashes due to stale locks, and to prove the utility of the lock
revocation feature.  It does this by silently dropping 1% of unlock requests,
simulating bugs or mis-behaving clients.

The feature is activated via:
features.locks-monkey-unlocking <on/off>

You'll see the message
"[<timestamp>] W [inodelk.c:653:pl_inode_setlk] 0-groot-locks: MONKEY LOCKING
(forcing stuck lock)!" in the logs indicating a request has been dropped.

2. Lock revocation - Once enabled, this feature will revoke a contended lock
based on the amount of time the lock has been held, the number of other lock
requests waiting on the lock to be freed, or some combination of both.
Clients which lose their locks will be notified by receiving EAGAIN (sent
back to their callback function).

The feature is activated via these options:
features.locks-revocation-secs <integer; 0 to disable>
features.locks-revocation-clear-all [on/off]
features.locks-revocation-max-blocked <integer>

Recommended settings: 1800 seconds for a time-based timeout (give clients the
benefit of the doubt).  Choosing a max-blocked value requires some
experimentation depending on your workload, but values in the hundreds to low
thousands are generally reasonable (it's normal for many tens of locks to be
taken out when files are being written at high throughput).
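
As a sketch, applying the recommended settings to a volume would look like the
following (the volume name "patchy" and the max-blocked value of 500 are
illustrative; tune max-blocked to your workload as described above):

gluster volume set patchy features.locks-revocation-secs 1800
gluster volume set patchy features.locks-revocation-max-blocked 500

# Optionally clear all locks on the inode (granted and blocked) when the
# revocation threshold is hit:
gluster volume set patchy features.locks-revocation-clear-all on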

Version-Release number of selected component (if applicable):
A clean patch-set is provided for GlusterFS v3.7.6; v3.6 patches are available
upon request.

How reproducible:
- Without using monkey-unlocking these situations are extremely difficult to
reproduce.
- 100% by turning on monkey-unlocking; a crash bug was immediately detected
using this feature (and a fix is included with this patch: changes to
xlators/features/locks/src/clear.c).

Steps to Reproduce:
First you will need TWO fuse mounts for this repro.  Call them /mnt/patchy1 &
/mnt/patchy2.

1. Enable monkey unlocking on the volume:
gluster vol set patchy features.locks-monkey-unlocking on

2. From the "patchy1" mount, use dd or some other utility to begin writing to a
file; eventually the dd will hang due to the dropped unlock requests.  This now
simulates the broken client.  Run:

for i in {1..1000}; do dd if=/dev/zero of=/mnt/patchy1/testfile bs=1k count=10; done

...this will eventually hang as the unlock request has been lost.

3. Go to another window, set up the mount "patchy2" @ /mnt/patchy2, and observe
that 'echo "hello" >> /mnt/patchy2/testfile' will hang due to the inability of
the client to take out the required lock.

4. Next, re-start the test this time enabling lock revocation; use a timeout of
2-5 seconds for testing: 'gluster vol set patchy features.locks-revocation-secs
<2-5>'

5. Wait 2-5 seconds before executing step 3 above this time.  Observe that this
time the access to the file succeeds, and the writes on patchy1 unblock until
they hit another failed unlock request due to "monkey-unlocking".  (A scripted
version of steps 4-5 is sketched below.)
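
A scripted version of the revocation run in steps 4-5 might look roughly like
this (a sketch; the volume name, mount points, and the 5-second timeout are the
test values used above):

gluster vol set patchy features.locks-monkey-unlocking on
gluster vol set patchy features.locks-revocation-secs 5

# "Broken" client: writes from the first mount, with some unlocks silently
# dropped by monkey-unlocking.
for i in {1..1000}; do dd if=/dev/zero of=/mnt/patchy1/testfile bs=1k count=10; done &

# Second client: after the revocation timeout this append should no longer hang.
sleep 5
echo "hello" >> /mnt/patchy2/testfile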

Actual results:
n/a

Expected results:
n/a

Additional info:

--- Additional comment from  on 2016-01-24 19:37 EST ---

Prove test for lock revocation feature.

--- Additional comment from Vijay Bellur on 2016-06-27 13:47:46 EDT ---

REVIEW: http://review.gluster.org/14816 (features/locks: Add lock revocation
functionality to posix locks translator) posted (#1) for review on master by
Pranith Kumar Karampuri (pkarampu at redhat.com)

--- Additional comment from Vijay Bellur on 2016-06-27 13:47:49 EDT ---

REVIEW: http://review.gluster.org/14817 (Revert "tests: remove tests for
clear-locks") posted (#1) for review on master by Pranith Kumar Karampuri
(pkarampu at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1301401
[Bug 1301401] RFE: FEATURE: Lock revocation for features/locks xlator

