[Gluster-devel] Feature: Automagic lock-revocation for features/locks xlator (v3.7.x)

Raghavendra G raghavendra at gluster.com
Tue Feb 16 04:00:57 UTC 2016


On Sat, Feb 13, 2016 at 12:38 AM, Richard Wareing <rwareing at fb.com> wrote:

> Hey,
>
> Sorry for the late reply but I missed this e-mail.  With respect to
> identifying locking domains, we use the same logic that GlusterFS
> itself uses to identify the domains, which is just a simple string
> comparison if I'm not mistaken.  System process (SHD/Rebalance) locking
> domains are treated identically to any others; this is specifically critical
> for things like DHT healing, as this locking domain is used both in userland
> and by SHDs (you cannot disable DHT healing).
>

We cannot disable DHT healing altogether. But we _can_ identify whether
healing is done by a mount process (on behalf of the application) or by a
rebalance process. All internal processes (rebalance, shd, quotad, etc.) have
a negative value in frame->root->pid (as opposed to a positive value for a
fop request from a mount process). I agree with you that just by looking at
the domain we cannot figure out whether a lock request is from an internal
process or a mount process. But, with the help of frame->root->pid, we can.
By choosing to flush locks from the rebalance process (instead of locks from
the mount process), I think we can reduce the scenarios where the application
sees errors. Of course we'll see more rebalance failures, but that is a
trade-off we perhaps have to live with. Just a thought :).
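
To make that concrete, here is a minimal (hypothetical) sketch of how such a
check could look inside the locks xlator; the helper name is made up, and the
only fact taken from the code itself is that internal daemons carry a
negative frame->root->pid:

/* Hypothetical helper (not from any actual patch): returns true when a
 * lock request originated from an internal process (rebalance, shd,
 * quotad, ...) rather than from an application via a mount.  Assumes the
 * usual glusterfs headers (glusterfs.h, stack.h) are available. */
static gf_boolean_t
pl_is_internal_request (call_frame_t *frame)
{
        /* Internal daemons set a negative pid on the call root; requests
         * made on behalf of applications carry a positive pid. */
        return (frame->root->pid < 0) ? _gf_true : _gf_false;
}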



>  To illustrate this, consider the case where an SHD holds a lock to do a
> DHT heal but can't because of GFID split-brain... a user comes along and
> hammers that directory attempting to get a lock... you can pretty much kiss
> your cluster good-bye after that :).
>
> With this in mind, we explicitly choose not to respect system process
> (SHD/rebalance) locks any more than a user lock request, as they can be just
> as likely (if not more so) than a user to cause a system to fall over (see
> example above).  Although this might seem unwise at first, I'd put forth
> that having clusters fall over catastrophically pushes far worse decisions
> on operators, such as re-kicking random bricks or entire clusters in
> desperate attempts at freeing locks (the CLI is often unable to free the
> locks in our experience) or stopping runaway memory consumption due to
> frames piling up on the bricks.  To date, we haven't even observed a single
> instance of data corruption (and we've been looking for it!) due to this
> feature.
>
> We've even used it on clusters that were on the verge of falling
> over: we enable revocation and the entire system stabilizes almost
> instantly (it's really like magic when you see it :) ).
>
> Hope this helps!
>
> Richard
>
>
> ------------------------------
> *From:* raghavendra.hg at gmail.com [raghavendra.hg at gmail.com] on behalf of
> Raghavendra G [raghavendra at gluster.com]
> *Sent:* Tuesday, January 26, 2016 9:49 PM
> *To:* Raghavendra Gowdappa
> *Cc:* Richard Wareing; Gluster Devel
> *Subject:* Re: [Gluster-devel] Feature: Automagic lock-revocation for
> features/locks xlator (v3.7.x)
>
>
>
> On Mon, Jan 25, 2016 at 10:39 AM, Raghavendra Gowdappa <
> rgowdapp at redhat.com> wrote:
>
>>
>>
>> ----- Original Message -----
>> > From: "Richard Wareing" <rwareing at fb.com>
>> > To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> > Cc: gluster-devel at gluster.org
>> > Sent: Monday, January 25, 2016 8:17:11 AM
>> > Subject: Re: [Gluster-devel] Feature: Automagic lock-revocation for
>> features/locks xlator (v3.7.x)
>> >
>> > Yup, per domain would be useful; the patch itself currently honors
>> > domains as well, so locks in a different domain will not be touched
>> > during revocation.
>> >
>> > In our case we actually prefer to pull the plug on SHD/DHT domains to
>> > ensure clients do not hang; this is important for DHT self-heals, which
>> > cannot be disabled via any option. We've found in most cases that once
>> > we reap the lock, another properly behaving client comes along and
>> > completes the DHT heal properly.
>>
>> Flushing DHT's waiting locks can affect application continuity too.
>> Though locks requested by the rebalance process can be flushed to a certain
>> extent without applications noticing any failures, there is no guarantee
>> that locks requested in DHT_LAYOUT_HEAL_DOMAIN and DHT_FILE_MIGRATE_DOMAIN
>> are issued only by the rebalance process.
>
>
> I missed this point in my previous mail. Now I remember that we can use
> frame->root->pid (being negative) to identify internal processes. Was this
> the approach you followed to identify locks from the rebalance process?
>
>
>> These two domains are used for locks that synchronize among and between
>> rebalance process(es) and client(s). So, there is an equal probability that
>> these locks are requests from clients, and hence applications can see
>> some file operations failing.
>>
>> In case of pulling the plug on DHT_LAYOUT_HEAL_DOMAIN, dentry operations
>> that depend on layout can fail. These operations can include create, link,
>> unlink, symlink, mknod, mkdir, and rename for files/directories within the
>> directory on which the lock request failed.
>>
>> In case of pulling the plug on DHT_FILE_MIGRATE_DOMAIN, renames of
>> immediate subdirectories/files can fail.
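
Just to illustrate the combination being discussed here (domain string plus
pid sign), a hedged sketch, not any actual patch, of a flush decision could
look like this; the helper and parameter names are made up:

/* Hypothetical helper: a lock is a flush candidate only when its domain
 * (matched by a plain string compare, as the locks xlator already does)
 * is one of the two DHT synchronization domains passed in, and the request
 * came from an internal process (negative pid).  The domain names would be
 * the DHT_LAYOUT_HEAL_DOMAIN / DHT_FILE_MIGRATE_DOMAIN macros mentioned
 * above.  Assumes the usual glusterfs headers plus <string.h>. */
static gf_boolean_t
pl_is_flushable_dht_lock (call_frame_t *frame, const char *volume,
                          const char *layout_dom, const char *migrate_dom)
{
        if (strcmp (volume, layout_dom) != 0 &&
            strcmp (volume, migrate_dom) != 0)
                return _gf_false;   /* some other domain: leave it alone */

        /* Only locks requested by internal processes are candidates. */
        return (frame->root->pid < 0) ? _gf_true : _gf_false;
}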
>>
>>
>> >
>> > Richard
>> >
>> >
>> > Sent from my iPhone
>> >
>> > On Jan 24, 2016, at 6:42 PM, Pranith Kumar Karampuri <
>> pkarampu at redhat.com >
>> > wrote:
>> >
>> >
>> >
>> >
>> >
>> >
>> > On 01/25/2016 02:17 AM, Richard Wareing wrote:
>> >
>> >
>> >
>> > Hello all,
>> >
>> > Just gave a talk at SCaLE 14x today and I mentioned our new locks
>> revocation
>> > feature which has had a significant impact on our GFS cluster
>> reliability.
>> > As such I wanted to share the patch with the community, so here's the
>> > bugzilla report:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1301401
>> >
>> > =====
>> > Summary:
>> > Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause cluster
>> instability
>> > and eventual complete unavailability due to failures in releasing
>> > entry/inode locks in a timely manner.
>> >
>> > Classic symptoms of this are increased brick (and/or gNFSd) memory usage
>> > due to the high number of (lock request) frames piling up in the processes. The
>> > failure-mode results in bricks eventually slowing down to a crawl due to
>> > swapping, or OOMing due to complete memory exhaustion; during this
>> period
>> > the entire cluster can begin to fail. End-users will experience this as
>> > hangs on the filesystem, first in a specific region of the file-system
>> and
>> > ultimately the entire filesystem as the offending brick begins to turn
>> into
>> > a zombie (i.e. not quite dead, but not quite alive either).
>> >
>> > Currently, these situations must be handled by an administrator
>> detecting &
>> > intervening via the "clear-locks" CLI command. Unfortunately this
>> doesn't
>> > scale for large numbers of clusters, and it depends on the correct
>> > (external) detection of the locks piling up (for which there is little
>> > signal other than state dumps).
>> >
>> > This patch introduces two features to remedy this situation:
>> >
>> > 1. Monkey-unlocking - This is a feature targeted at developers (only!) to
>> > help track down crashes due to stale locks, and to prove the utility of
>> > the lock revocation feature. It does this by silently dropping 1% of
>> > unlock requests, simulating bugs or mis-behaving clients.
>> >
>> > The feature is activated via:
>> > features.locks-monkey-unlocking <on/off>
>> >
>> > You'll see the message
>> > "[<timestamp>] W [inodelk.c:653:pl_inode_setlk] 0-groot-locks: MONKEY
>> LOCKING
>> > (forcing stuck lock)!" ... in the logs indicating a request has been
>> > dropped.
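
(For the curious: a hedged sketch of how such fault injection could be
wired into the unlock path; the helper name and the way the caller swallows
the request are assumptions for illustration, not the actual patch.)

/* Hypothetical helper: decide whether to silently drop this unlock
 * request.  Drops roughly 1 in 100 requests when monkey-unlocking is
 * enabled, simulating a client that never releases its lock.  Requires
 * <stdlib.h> for rand(); the real feature also logs
 * "MONKEY LOCKING (forcing stuck lock)!" at warning level. */
static gf_boolean_t
pl_monkey_drop_unlock (gf_boolean_t monkey_unlocking)
{
        if (monkey_unlocking && (rand () % 100) == 0)
                return _gf_true;   /* caller swallows the unlock request */

        return _gf_false;
}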
>> >
>> > 2. Lock revocation - Once enabled, this feature will revoke a *contended*
>> > lock (i.e. if nobody else asks for the lock, we will not revoke it) based
>> > either on the amount of time the lock has been held, on how many other
>> > lock requests are waiting on the lock to be freed, or on some combination
>> > of both. Clients which are losing their locks will be notified by
>> > receiving EAGAIN (sent back to their callback function).
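
(A hedged sketch of what handling that might look like on the client side;
the callback name is hypothetical and this is not part of the patch.)

/* Hypothetical inodelk callback fragment: a lock lost to revocation comes
 * back as EAGAIN, so the caller should treat the lock as gone rather than
 * retry blindly.  Assumes the usual glusterfs headers plus <errno.h>. */
int32_t
my_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                int32_t op_ret, int32_t op_errno, dict_t *xdata)
{
        if (op_ret < 0 && op_errno == EAGAIN)
                gf_log (this->name, GF_LOG_WARNING,
                        "inodelk failed with EAGAIN; the brick may have "
                        "revoked a contended lock");

        STACK_UNWIND_STRICT (inodelk, frame, op_ret, op_errno, xdata);
        return 0;
}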
>> >
>> > The feature is activated via these options:
>> > features.locks-revocation-secs <integer; 0 to disable>
>> > features.locks-revocation-clear-all [on/off]
>> > features.locks-revocation-max-blocked <integer>
>> >
>> > Recommended settings are: 1800 seconds for a time-based timeout (give
>> > clients the benefit of the doubt). Choosing a max-blocked value requires
>> > some experimentation depending on your workload, but generally values of
>> > hundreds to low thousands work (it's normal for many tens of locks to be
>> > taken out when files are being written at high throughput).
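
For example (the volume name and the max-blocked value here are only
illustrative), enabling the feature with those recommendations would look
something like:

gluster volume set myvol features.locks-revocation-secs 1800
gluster volume set myvol features.locks-revocation-max-blocked 1000
gluster volume set myvol features.locks-revocation-clear-all off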
>> >
>> > I really like this feature. One question though: self-heal and rebalance
>> > domain locks are active until the self-heal/rebalance is complete, which
>> > can take more than 30 minutes if the files are in TBs. I will try to see
>> > what we can do to handle these without increasing revocation-secs too
>> > much. Maybe we can come up with per-domain revocation timeouts. Comments
>> > are welcome.
>> >
>> > Pranith
>> >
>> >
>> >
>> >
>> > =====
>> >
>> > The patch supplied will apply cleanly to the v3.7.6 release tag, and
>> > probably to any 3.7.x release & master (the posix locks xlator is rarely
>> > touched).
>> >
>> > Richard
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Gluster-devel mailing list Gluster-devel at gluster.org
>> > http://www.gluster.org/mailman/listinfo/gluster-devel
>> >
>> >
>> > _______________________________________________
>> > Gluster-devel mailing list
>> > Gluster-devel at gluster.org
>> > http://www.gluster.org/mailman/listinfo/gluster-devel
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G

