[Gluster-devel] Regarding doing away with refkeeper in locks xlator

Thu Jun 5 06:54:18 UTC 2014

On 6/3/14, 11:07 PM, Krutika Dhananjay wrote:
> Hi,
>
> Recently there was a crash in locks translator (BZ 1103347, BZ 1097102)
> with the following backtrace:
> (gdb) bt
> #0  uuid_unpack (in=0x8 <Address 0x8 out of bounds>, uu=0x7fffea6c6a60)
> at ../../contrib/uuid/unpack.c:44
> #1  0x00007feeba9e19d6 in uuid_unparse_x (uu=<value optimized out>,
> out=0x2350fc0 "081bbc7a-7551-44ac-85c7-aad5e2633db9",
>      fmt=0x7feebaa08e00
> "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at
> ../../contrib/uuid/unparse.c:55
> #2  0x00007feeba9be837 in uuid_utoa (uuid=0x8 <Address 0x8 out of
> bounds>) at common-utils.c:2138
> #3  0x00007feeb06e8a58 in pl_inodelk_log_cleanup (this=0x230d910,
> ctx=0x7fee700f0c60) at inodelk.c:396
> #4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at
> inodelk.c:428
> #5  0x00007feeb06ddf3a in pl_client_disconnect_cbk (this=0x230d910,
> client=<value optimized out>) at posix.c:2550
> #6  0x00007feeba9fa2dd in gf_client_disconnect (client=0x27724a0) at
> client_t.c:368
> #7  0x00007feeab77ed48 in server_connection_cleanup (this=0x2316390,
> client=0x27724a0, flags=<value optimized out>) at server-helpers.c:354
> #8  0x00007feeab77ae2c in server_rpc_notify (rpc=<value optimized out>,
> xl=0x2316390, event=<value optimized out>, data=0x2bf51c0) at server.c:527
> #9  0x00007feeba775155 in rpcsvc_handle_disconnect (svc=0x2325980,
> trans=0x2bf51c0) at rpcsvc.c:720
> #10 0x00007feeba776c30 in rpcsvc_notify (trans=0x2bf51c0, mydata=<value
> optimized out>, event=<value optimized out>, data=0x2bf51c0) at rpcsvc.c:758
> #11 0x00007feeba778638 in rpc_transport_notify (this=<value optimized
> out>, event=<value optimized out>, data=<value optimized out>) at
> rpc-transport.c:512
> #12 0x00007feeb115e971 in socket_event_poll_err (fd=<value optimized
> out>, idx=<value optimized out>, data=0x2bf51c0, poll_in=<value
> optimized out>, poll_out=0,
>      poll_err=0) at socket.c:1071
> #13 socket_event_handler (fd=<value optimized out>, idx=<value optimized
> out>, data=0x2bf51c0, poll_in=<value optimized out>, poll_out=0,
> poll_err=0) at socket.c:2240
> #14 0x00007feeba9fc6a7 in event_dispatch_epoll_handler
> (event_pool=0x22e2d00) at event-epoll.c:384
> #15 event_dispatch_epoll (event_pool=0x22e2d00) at event-epoll.c:445
> #16 0x0000000000407e93 in main (argc=19, argv=0x7fffea6c7f88) at
> glusterfsd.c:2023
> (gdb) f 4
> #4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at
> inodelk.c:428
> 428                pl_inodelk_log_cleanup (l);
> (gdb) p l->pl_inode->refkeeper
> $1 = (inode_t *) 0x0
> (gdb)
>
> pl_inode->refkeeper was found to be NULL even when there were some
> blocked inodelks in a certain domain of the inode,
> which when dereferenced by the epoll thread in the cleanup codepath led
> to a crash.
>
> On inspecting the code (for want of a consistent reproducer), three
> things were found:
>
> 1. The function where the crash happens (pl_inodelk_log_cleanup()),
> makes an attempt to resolve the inode to path as can be seen below. But
> the way inode_path() itself
>      works is to first construct the path based on the given inode's
> ancestry and place it in the buffer provided. And if all else fails, the
> gfid of the inode is placed in a certain format ("<gfid:%s>").
>      This eliminates the need for statements from line 4 through 7
> below, thereby "preventing" dereferencing of pl_inode->refkeeper.
>      Now, although this change prevents the crash altogether, it still
> does not fix the race that led to pl_inode->refkeeper becoming NULL, and
> comes at the cost of
>      printing "(null)" in the log message on line 9 every time
> pl_inode->refkeeper is found to be NULL, rendering the logged messages
> somewhat useless.
>
> <code>
>    0         pl_inode = lock->pl_inode;
>    1
>    2         inode_path (pl_inode->refkeeper, NULL, &path);
>    3
>    4         if (path)
>    5                 file = path;
>    6         else
>    7                 file = uuid_utoa (pl_inode->refkeeper->gfid);
>    8
>    9         gf_log (THIS->name, GF_LOG_WARNING,
>   10                 "releasing lock on %s held by "
>   11                 "{client=%p, pid=%"PRId64" lk-owner=%s}",
>   12                 file, lock->client, (uint64_t) lock->client_pid,
>   13                 lkowner_utoa (&lock->owner));
> <\code>
>
> 2. There is at least one codepath found that can lead to this crash:
>      Imagine an inode on which an inodelk operation is attempted by a
> client and is successfully granted too.
>     Now, between the time the lock was granted and pl_update_refkeeper()
> was called by this thread, the client could send a DISCONNECT event,
>     causing cleanup codepath to be executed, where the epoll thread
> crashes on dereferencing pl_inode->refkeeper which is STILL NULL at this
> point.
>
>     Besides, there are still places in locks xlator where the refkeeper
> is NOT updated whenever the lists are modified - for instance in the
> cleanup codepath from a DISCONNECT.
>
> 3. Also, pl_update_refkeeper() seems to be not taking into account
> blocked locks on the inode in the __pl_inode_is_empty() check, when it
> should, as there could still be cases
>      where the granted list could be empty but not the blocked list at
> the time of udpating the refkeeper, in which case pl_inode must still
> take ref on the inode.
>
> Proposed solution to 2/3:
>
> 1. Do away with refkeeper in pl_inode_t altogether.
> 2. Let every lock object (inodelk/entryllk/posix_lock) have an inode_t *
> member to act as a placeholder for the associated inode object on which
> it is locking.
> 3. Let each lock object hold a ref on the inode at the time of its
> creation and unref the inode during its destruction.
>
>   Please let me know what you think of the above.

The problem with the proposed solution is that, there will inevitably be 
calls to inode_ref() and/or inode_ref() while pl_inode->mutex is held. 
The idea is to avoid double locks (inode_ref() and unref() are locked 
operations), as it can bite us back with deadlocks when another thread 
holds the locks in the reverse order.

In fact, the reason refkeeper was created in the current form is for 
this exact reason - refkeeper update happens with a single lock at a 
time. And it is idempotent, so if it is missed getting called somewhere, 
you can add a call to it without fear of calling one too many times.

To address your questions:

- It is a good idea to include blocked_locks lists in 
__pl_inode_is_empty() for sake of symmetry, though I can't think of case 
where this can be a problem yet.

- We should call pl_update_refkeeper() in client_disconnect_cbk() on 
each of the inodes.

- Referring pl_inode->refkeeper as an inode_t (in the log_cleanup) is an 
abuse of interface. Refkeeper is supposed to be an opaque object, 
completely internal to pl_update_refkeeper() function and not be 
deref'ed by anybody else.

Thanks