[Gluster-devel] Regarding doing away with refkeeper in locks xlator

Krutika Dhananjay kdhananj at redhat.com
Thu Jun 5 17:46:33 UTC 2014



----- Original Message -----


From: "Anand Avati" <aavati at redhat.com> 
To: "Krutika Dhananjay" <kdhananj at redhat.com>, "Pranith Kumar Karampuri" <pkarampu at redhat.com> 
Cc: gluster-devel at gluster.org 
Sent: Thursday, June 5, 2014 12:38:41 PM 
Subject: Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator 

On 6/4/14, 9:43 PM, Krutika Dhananjay wrote: 
> 
> 
> ------------------------------------------------------------------------ 
> 
> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com> 
> *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Anand Avati" 
> <aavati at redhat.com> 
> *Cc: *gluster-devel at gluster.org 
> *Sent: *Wednesday, June 4, 2014 12:23:59 PM 
> *Subject: *Re: [Gluster-devel] Regarding doing away with refkeeper 
> in locks xlator 
> 
> 
> On 06/04/2014 12:02 PM, Pranith Kumar Karampuri wrote: 
> 
> 
> On 06/04/2014 11:37 AM, Krutika Dhananjay wrote: 
> 
> Hi, 
> 
> Recently there was a crash in locks translator (BZ 1103347, 
> BZ 1097102) with the following backtrace: 
> (gdb) bt 
> #0 uuid_unpack (in=0x8 <Address 0x8 out of bounds>, 
> uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44 
> #1 0x00007feeba9e19d6 in uuid_unparse_x (uu=<value 
> optimized out>, out=0x2350fc0 
> "081bbc7a-7551-44ac-85c7-aad5e2633db9", 
> fmt=0x7feebaa08e00 
> "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at 
> ../../contrib/uuid/unparse.c:55 
> #2 0x00007feeba9be837 in uuid_utoa (uuid=0x8 <Address 0x8 
> out of bounds>) at common-utils.c:2138 
> #3 0x00007feeb06e8a58 in pl_inodelk_log_cleanup 
> (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396 
> #4 pl_inodelk_client_cleanup (this=0x230d910, 
> ctx=0x7fee700f0c60) at inodelk.c:428 
> #5 0x00007feeb06ddf3a in pl_client_disconnect_cbk 
> (this=0x230d910, client=<value optimized out>) at posix.c:2550 
> #6 0x00007feeba9fa2dd in gf_client_disconnect 
> (client=0x27724a0) at client_t.c:368 
> #7 0x00007feeab77ed48 in server_connection_cleanup 
> (this=0x2316390, client=0x27724a0, flags=<value optimized 
> out>) at server-helpers.c:354 
> #8 0x00007feeab77ae2c in server_rpc_notify (rpc=<value 
> optimized out>, xl=0x2316390, event=<value optimized out>, 
> data=0x2bf51c0) at server.c:527 
> #9 0x00007feeba775155 in rpcsvc_handle_disconnect 
> (svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720 
> #10 0x00007feeba776c30 in rpcsvc_notify (trans=0x2bf51c0, 
> mydata=<value optimized out>, event=<value optimized out>, 
> data=0x2bf51c0) at rpcsvc.c:758 
> #11 0x00007feeba778638 in rpc_transport_notify (this=<value 
> optimized out>, event=<value optimized out>, data=<value 
> optimized out>) at rpc-transport.c:512 
> #12 0x00007feeb115e971 in socket_event_poll_err (fd=<value 
> optimized out>, idx=<value optimized out>, data=0x2bf51c0, 
> poll_in=<value optimized out>, poll_out=0, 
> poll_err=0) at socket.c:1071 
> #13 socket_event_handler (fd=<value optimized out>, 
> idx=<value optimized out>, data=0x2bf51c0, poll_in=<value 
> optimized out>, poll_out=0, poll_err=0) at socket.c:2240 
> #14 0x00007feeba9fc6a7 in event_dispatch_epoll_handler 
> (event_pool=0x22e2d00) at event-epoll.c:384 
> #15 event_dispatch_epoll (event_pool=0x22e2d00) at 
> event-epoll.c:445 
> #16 0x0000000000407e93 in main (argc=19, 
> argv=0x7fffea6c7f88) at glusterfsd.c:2023 
> (gdb) f 4 
> #4 pl_inodelk_client_cleanup (this=0x230d910, 
> ctx=0x7fee700f0c60) at inodelk.c:428 
> 428 pl_inodelk_log_cleanup (l); 
> (gdb) p l->pl_inode->refkeeper 
> $1 = (inode_t *) 0x0 
> (gdb) 
> 
> pl_inode->refkeeper was found to be NULL even though there 
> were blocked inodelks in a certain domain of the inode; 
> when the epoll thread dereferenced it in the cleanup 
> codepath, the process crashed. 
> 
> On inspecting the code (for want of a consistent 
> reproducer), three things were found: 
> 
> 1. The function where the crash happens 
> (pl_inodelk_log_cleanup()) makes an attempt to resolve the 
> inode to a path, as can be seen below. But the way inode_path() 
> works is to first construct the path from the given 
> inode's ancestry and place it in the buffer provided; if 
> all else fails, the gfid of the inode is placed there in a fixed 
> format ("<gfid:%s>"). 
> This eliminates the need for the statements on lines 4 
> through 7 below, thereby "preventing" the dereference of 
> pl_inode->refkeeper. 
> Now, although this change prevents the crash 
> altogether, it still does not fix the race that led to 
> pl_inode->refkeeper becoming NULL, and comes at the cost of 
> printing "(null)" in the log message on line 9 every 
> time pl_inode->refkeeper is found to be NULL, rendering the 
> logged messages somewhat useless. 
> 
> <code> 
> 0 pl_inode = lock->pl_inode; 
> 1 
> 2 inode_path (pl_inode->refkeeper, NULL, &path); 
> 3 
> 4 if (path) 
> 5         file = path; 
> 6 else 
> 7         file = uuid_utoa (pl_inode->refkeeper->gfid); 
> 8 
> 9 gf_log (THIS->name, GF_LOG_WARNING, 
> 10         "releasing lock on %s held by " 
> 11         "{client=%p, pid=%"PRId64" lk-owner=%s}", 
> 12         file, lock->client, (uint64_t) lock->client_pid, 
> 13         lkowner_utoa (&lock->owner)); 
> </code> 
> 
> I think this logging code is from the days when the gfid handle 
> concept was not there, so inode_path() wasn't returning <gfid:gfid-str> in 
> cases where the path is not present in the dentries. I believe the 
> else block can be deleted safely now. 
> 
> Pranith 
> 
> Posted patch http://review.gluster.org/#/c/7981/1 for review, to fix 
> problem #1 above. 
> 
> -Krutika 
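
For reference, a minimal sketch of what the else-block removal might look like (the actual change is in the patch linked above; this assumes inode_path() itself falls back to the "<gfid:%s>" form):

<code>
pl_inode = lock->pl_inode;

/* inode_path() builds the path from the inode's ancestry, and falls
   back to "<gfid:%s>" when no path can be constructed. */
inode_path (pl_inode->refkeeper, NULL, &path);

gf_log (THIS->name, GF_LOG_WARNING,
        "releasing lock on %s held by "
        "{client=%p, pid=%"PRId64" lk-owner=%s}",
        path, lock->client, (uint64_t) lock->client_pid,
        lkowner_utoa (&lock->owner));
</code>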
> 
> 
> 2. There is at least one codepath found that can lead to 
> this crash: 
> Imagine an inode on which an inodelk operation is 
> attempted by a client and is successfully granted too. 
> Between the time the lock was granted and 
> pl_update_refkeeper() was called by this thread, the client 
> could send a DISCONNECT event, 
> causing the cleanup codepath to be executed, where the epoll 
> thread crashes on dereferencing pl_inode->refkeeper, which is 
> STILL NULL at this point. 
> 
> Besides, there are still places in locks xlator where 
> the refkeeper is NOT updated whenever the lists are modified 
> - for instance in the cleanup codepath from a DISCONNECT. 
> 
> 3. Also, pl_update_refkeeper() does not seem to take 
> blocked locks on the inode into account in the 
> __pl_inode_is_empty() check, when it should: there could 
> still be cases 
> where the granted list is empty but not the 
> blocked list at the time of updating the refkeeper, in which 
> case pl_inode must still hold a ref on the inode. 
> 
> Avati, 
> There is one more issue: pl_forget is cleaning 
> up blocked/granted locks, and it is not handling the new 
> 'client_list'. But if we fix the inode ref/unref like Krutika 
> proposed, pl_forget should never be called while there is a 
> granted/blocked lock, so maybe we should delete that code as well. 
> Please feel free to comment. 
> 
> Pranith 
> 
> 
> Proposed solution to 2/3: 
> 
> 1. Do away with refkeeper in pl_inode_t altogether. 
> 2. Let every lock object (inodelk/entrylk/posix_lock) have 
> an inode_t * member to act as a placeholder for the 
> associated inode object on which it is locking. 
> 3. Let each lock object hold a ref on the inode at the time 
> of its creation and unref the inode during its destruction. 
> 
> Please let me know what you think of the above. 
> 
> -Krutika 

> To summarize, the real "problems" are: 
> 
> - Deref of pl_inode->refkeeper as an inode_t in the cleanup logger. It 
> is an internal artifact of how pl_update_refkeeper() works, and nobody 
> else is supposed to "touch" it. 


For this, the solution I have in mind is to: 
a. have a placeholder for the gfid in the pl_inode_t object, 
b. remember the gfid of the inode at the time of pl_inode_t creation in pl_ctx_get(), and 
c. print pl_inode->gfid in the log message in pl_{inodelk,entrylk}_log_cleanup(). 
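
A rough sketch of (a)-(c), assuming a new gfid member in pl_inode_t (names are illustrative, not final):

<code>
/* (b) at pl_inode_t creation time, remember the gfid */
uuid_copy (pl_inode->gfid, inode->gfid);

/* (c) in pl_{inodelk,entrylk}_log_cleanup(), log the stored gfid
   instead of dereferencing pl_inode->refkeeper */
gf_log (THIS->name, GF_LOG_WARNING,
        "releasing lock on %s held by "
        "{client=%p, pid=%"PRId64" lk-owner=%s}",
        uuid_utoa (pl_inode->gfid), lock->client,
        (uint64_t) lock->client_pid, lkowner_utoa (&lock->owner));
</code>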

> - Missing call to pl_update_refkeeper() in client_disconnect_cbk(). This 
> can result in a potential leak of inode ref (not a leak if the same 
> inode gets any locking activity by another client in the future). 

As far as updating the refkeeper during DISCONNECT is concerned, Pranith and I did have discussions at length, but could 
not find a single safe place in the cleanup functions to call pl_update_refkeeper() that does not lead to illegal memory access, 
whether before or after the call to __pl_inodelk_unref() in the third for loop. 

Case #1 explained: 
================== 

<snip> 
list_for_each_entry_safe() { 
        ... 
        ... 
        pl_update_refkeeper (inode); 
        pthread_mutex_lock (&pl_inode->mutex); 
        __pl_inodelk_unref (l); 
        pthread_mutex_unlock (&pl_inode->mutex); 
} 
</snip> 

This causes the last unref in pl_update_refkeeper() to implicitly call pl_forget(), where the pl_inode_t object is destroyed while it is 
still needed: pl_inode->mutex must be acquired before unrefing the lock object. 

Case #2 explained: 
================== 

<snip> 
list_for_each_entry_safe() { 
        ... 
        ... 
        pthread_mutex_lock (&pl_inode->mutex); 
        __pl_inodelk_unref (l); 
        pthread_mutex_unlock (&pl_inode->mutex); 
        pl_update_refkeeper (inode); 
} 
</snip> 

After the first for loop is processed, the domain's lists could have been emptied. At this point, a second thread could well come along and update the refkeeper, 
triggering the last unref on the inode and thereby a call to pl_forget() (where the pl_inode_t is freed), after which the DISCONNECT thread would perform illegal 
memory access in its own call to pl_update_refkeeper() when its turn comes. 

So the solution Pranith and I came up with involves making sure that the inode_t object stays alive for as long as there is at least one lock on it (a rough sketch follows the list below): 

1. refkeeper will no longer be used for inodelks and entrylks (but will continue to be used for fcntl locks). 
2. Each pl_inode_t object will also host an inode_t member, which is initialised during a call to pl_inode_get() in pl_common_{inodelk,entrylk}(). 
3. Every time a lock is granted/blocked, pl_inode->inode is ref'd (in pl_inode_setlk(), after successful locking). 
4. Every time a lock is unlocked, pl_inode->inode is unref'd. 
5. In the disconnect codepath, as part of "released" list processing, the inodes associated with the client are unref'd in the same loop, right after every lock object is unref'd. 
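
A rough sketch of steps 2-4 above (illustrative, not the actual patch):

<code>
/* step 2: in pl_inode_get(), called from pl_common_{inodelk,entrylk}(),
   remember the inode in the pl_inode_t */
if (!pl_inode->inode)
        pl_inode->inode = inode_ref (inode);

/* step 3: in pl_inode_setlk(), after the lock is granted or blocked */
inode_ref (pl_inode->inode);

/* step 4 (and step 5, in the "released" list processing): drop the
   matching ref right after the lock object is unref'd */
inode_unref (pl_inode->inode);
</code>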

With all this, there is one remaining problem: a lock object can be found in both the blocked and the granted list in a transient state, when there is a race between the unlocking thread 
and the disconnecting thread. This is best explained with the following example: 
<example> 
Consider an inode I on which there is a granted lock G and a blocked lock B, from clients C1 and C2 respectively. With this approach, at this point the number of refs taken 
by the locks xlator on I would be 2. 
C1 unlocks G, at which point I's refcount becomes 1. 
Now, as part of granting other blocked locks in the unlock codepath, B is put in the granted list. 
Now B's client C2 sends a DISCONNECT event, which would cause I to be unref'd again. This being the last unref, pl_forget() is called on I, causing its pl_inode to be destroyed. 
At this point, the unlocking thread could try to lock pl_inode->mutex in grant_blocked_{inode,entry}_locks() (whereas pl_inode is now already destroyed), leading to a crash. 
</example> 

The problem described in the example can be fixed by making sure grant_blocked_{inode,entry}_locks() refs the inode first thing, and unrefs it just before returning. 
This would fix the illegal memory access. 
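
A hedged sketch of that fix (the signature and surrounding logic are illustrative):

<code>
void
grant_blocked_inode_locks (xlator_t *this, pl_inode_t *pl_inode,
                           pl_dom_list_t *dom)
{
        /* Take a ref first thing, so pl_inode (whose lifetime is tied
           to the inode via pl_forget()) cannot be destroyed under us. */
        inode_t *inode = inode_ref (pl_inode->inode);

        pthread_mutex_lock (&pl_inode->mutex);
        {
                /* ... move grantable blocked locks to the granted list
                   and unwind them ... */
        }
        pthread_mutex_unlock (&pl_inode->mutex);

        /* Drop the ref just before returning. */
        inode_unref (inode);
}
</code>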


> Nice to have (not necessary): 
> 
> - Include blocked locks in pl_inode_is_empty(), for the sake of 
> symmetry. I don't see how this will make a difference though, given the 
> goal of pl_update_refkeeper (which is to ensure inode->ref > 0 while a 
> user's locks exist). 

Done. 
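
For reference, a sketch of what the extended check might look like (member names are illustrative):

<code>
static int
__pl_inode_is_empty (pl_inode_t *pl_inode)
{
        pl_dom_list_t *dom = NULL;

        if (!list_empty (&pl_inode->ext_list))
                return 0;

        list_for_each_entry (dom, &pl_inode->dom_list, inode_list) {
                /* consider blocked locks too, not just granted ones */
                if (!list_empty (&dom->entrylk_list) ||
                    !list_empty (&dom->blocked_entrylks) ||
                    !list_empty (&dom->inodelk_list) ||
                    !list_empty (&dom->blocked_inodelks))
                        return 0;
        }

        return 1;
}
</code>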

> Avati 

