[Gluster-devel] RFC on fix to bug #802414

Tue May 22 07:11:36 UTC 2012

<in continuation from our chat>

The PARENT_DOWN_HANDLED approach will take us backwards from the current 
state where we are resiliant to frame losses and other class of bugs 
(i.e, if a frame loss happens on either server or client, it only 
results in prevented graph cleanup but the graph switch still happens).

The root "cause" here is that we are giving up on a very important and 
fundamental principle of immutability on the fd object. The real 
solution here is to never modify fd->inode. Instead we must bring about 
a more native fd "migration" than just re-opening an existing fd on the 
new graph.

Think of the inode migration analogy. The handle coming from FUSE (the 
address of the object) is a "hint". Usually the hint is right, if the 
object in the address belongs to the latest graph. If not, using the 
GFID we resolve a new inode on the latest graph and use it.

In case of FD we can do something similar, except there are not GFIDs 
(which should not be a problem). We need to make the handle coming from 
FUSE (the address of fd_t) just a hint. If the 
fd->inode->table->xl->graph is the latest, then the hint was a HIT. If 
the graph was not the latest, we look for a previous migration 
attempt+result in the "base" (original) fd's context. If that does not 
exist or is not fresh (on the latest graph) then we do a new fd 
creation, open on new graph, fd_unref the old cached result in the fd 
context of the "base fd" and keep ref to this new result. All this must 
happen from fuse_resolve_fd(). The setting of the latest fd and updation 
of the latest fd pointer happens under the scope of the base_fd->lock() 
which gives it a very clear and unambiguous scope which was missing with 
the old scheme.

[The next step will be to nuke the fd->inode swapping in fuse_create_cbk]

Avati

On 05/21/2012 10:26 PM, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>> From: "Pranith Kumar Karampuri"<pkarampu at redhat.com>
>> To: "Anand Avati"<aavati at redhat.com>
>> Cc: "Vijay Bellur"<vbellur at redhat.com>, "Amar Tumballi"<atumball at redhat.com>, "Krishnan Parthasarathi"
>> <kparthas at redhat.com>, "Raghavendra Gowdappa"<rgowdapp at redhat.com>
>> Sent: Tuesday, May 22, 2012 8:42:58 AM
>> Subject: Re: RFC on fix to bug #802414
>>
>> Dude,
>>      We have already put logs yesterday in LOCK and UNLOCK and saw
>>      that the&fd->inode->lock address changed from LOCK to UNLOCK.
>
> Yes, even I too believe that the hang is because of fd->inode swap in fuse_migrate_fd and not the one in fuse_create_cbk. We could clearly see in the log files following race:
> fuse-mig-thr: acquires fd->inode->lock for swapping fd->inode (this was a naive fix - hold lock on inode in old graph - to the race-condition caused by swapping fd->inode, which didn't work)
>
> poll-thr: tries to acquire fd->inode->lock (inode is old_inode present in old-graph) in afr_local_cleanup
> fuse-mig-thr: swaps fd->inode and releases lock on old_inode->lock
> poll-thr: gets woken up from lock call on old_inode->lock.
> poll-thr: does its work, but while unlocking, uses fd->inode where inode belongs to new graph.
>
> we had logs printing lock address before and after acquisition of lock and we could clearly see that lock address changed after acquiring lock in afr_local_cleanup.
>
>>
>>>> "The hang in fuse_migrate_fd is _before_ the inode swap performed
>>>> there."
>> All the fds are opened on the same file. So all fds in the fd
>> migration point to same inode. The race is hit by nth fd, (n+1)th fd
>> hangs. We have seen that afr_local_cleanup was doing fd_unref, and
>> LOCK(fd->inode->lock) was done with one address then by the time
>> UNLOCK(fd->inode->lock) is done the address changed. So the next fd
>> that has to migrate hung because the prev inode lock is not
>> unlocked.
>>
>> If after nth fd introduces the race a _cbk comes in epoll thread on
>> (n+1)th fd which tries to LOCK(fd->inode->lock) epoll thread will
>> hang.
>> Which is my theory for the hang we observed on Saturday.
>>
>> Pranith.
>> ----- Original Message -----
>> From: "Anand Avati"<aavati at redhat.com>
>> To: "Raghavendra Gowdappa"<rgowdapp at redhat.com>
>> Cc: "Vijay Bellur"<vbellur at redhat.com>, "Amar Tumballi"
>> <atumball at redhat.com>, "Krishnan Parthasarathi"
>> <kparthas at redhat.com>, "Pranith Kumar Karampuri"
>> <pkarampu at redhat.com>
>> Sent: Tuesday, May 22, 2012 2:09:33 AM
>> Subject: Re: RFC on fix to bug #802414
>>
>> On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
>>> Avati,
>>>
>>> fuse_migrate_fd (running in reader thread - rdthr) assigns new
>>> inode to fd, once it looks up inode in new graph. But this
>>> assignment can race with code that accesses fd->inode->lock
>>> executing in poll-thread (pthr) as follows
>>>
>>> pthr: LOCK (fd->inode->lock); (inode in old graph)
>>> rdthr: fd->inode = inode (resolved in new graph)
>>> pthr: UNLOCK (fd->inode->lock); (inode in new graph)
>>>
>>
>> The way I see it (the backtrace output in the other mail), the swap
>> happening in fuse_create_cbk() must be the one causing lock/unlock to
>> land on different inode objects. The hang in fuse_migrate_fd is
>> _before_
>> the inode swap performed there. Can you put some logs in
>> fuse_create_cbk()'s inode swap code and confirm this?
>>
>>
>>> Now, any lock operations on inode in old graph will block. Thanks
>>> to pranith for pointing to this race-condition.
>>>
>>> The problem here is we don't have a single lock that can
>>> synchronize assignment "fd->inode = inode" and other locking
>>> attempts on fd->inode->lock. So, we are thinking that instead of
>>> trying to synchronize, eliminate the parallel accesses altogether.
>>> This can be done by splitting fd migration into two tasks.
>>>
>>> 1. Actions on old graph (like fsync to flush writes to disk)
>>> 2. Actions in new graph (lookup, open)
>>>
>>> We can send PARENT_DOWN when,
>>> 1. Task 1 is complete.
>>> 2. No fop sent by fuse is pending.
>>>
>>> on receiving PARENT_DOWN, protocol/client will shutdown transports.
>>> As part of transport cleanup, all pending frames are unwound and
>>> protocol/client will notify its parents with PARENT_DOWN_HANDLED
>>> event. Each of the translator will pass this event to its parents
>>> once it is convinced that there are no pending fops started by it
>>> (like background self-heal, reads as part of read-ahead etc). Once
>>> fuse receives PARENT_DOWN_HANDLED, it is guaranteed that there
>>> will be no replies that will be racing with migration (note that
>>> migration is done using syncops). At this point in time, it is
>>> safe to start Task 2 (which associates fd with an inode in new
>>> graph).
>>>
>>> Also note that reader thread will not do other operations till it
>>> completes both tasks.
>>>
>>> As far as the implementation of this patch goes, major work is in
>>> translators like read-ahead, afr, dht to provide the guarantee
>>> required to send PARENT_DOWN_HANDLED event to their parents.
>>>
>>> Please let me know your thoughts on this.
>>>
>>
>> All the above steps might not apply if it is caused by the swap in
>> fuse_create_cbk(). Let's confirm that first.
>>
>> Avati
>>