[Gluster-devel] races in dict_foreach() causing crashes in tier-file-creat.t

Pranith Kumar Karampuri pkarampu at redhat.com
Fri Mar 11 16:25:41 UTC 2016



On 03/11/2016 06:25 PM, Jeff Darcy wrote:
>> Raghavendra G and I discussed about this problem and the right way to
>> fix it is to take a copy(without dict_foreach) of the dictionary in
>> dict_foreach inside a lock and then loop over the local dictionary. I am
>> worried about the performance implication of this

I think what I forgot to say was that the copy will happen inside a lock 
dict already has.

> I'm worried about the correctness implications.  Any such copy will have
> to do the equivalent of dict_foreach even if it doesn't call the function
> of that name, and will be subject to the same races.
>
>> Also included Xavi, who earlier said we need to change dict.c but it is
>> a bigger change. May be the time has come? I would love to gather all
>> your inputs and implement a better version of dict if we need one.
> There are three solutions I can think of.
>
> (1) Have tier use STACK_WIND_COOKIE to pass the address of a lock down
> both paths, so the two can synchronize between themselves.
>
> (2) Have tier issue the lookup down the two paths *serially* instead
> of in parallel, so there's no contention on the dictionary.  This is
> probably most complicated *and* worst for performance, but I include
> it for the sake of completeness.

Tier does send lookups serially, which fail on the hashed subvolumes of 
dhts. Both of them trigger lookup_everywhere which is executed in epoll 
threads, thus the they are executed in parallel. So (1) won't be simple 
enough, in the sense it has to remember it in local etc.

>
> (3) Enhance dict_t with a gf_lock_t that can be used to serialize
> access.  We don't have to use the lock in every invocation of
> dict_foreach (though we should probably investigate that).  For
> now, we can just use it in the code paths we know are contending.

dict already has a lock. The performance implication I was talking about 
was that we will allow syscalls like getxattr in posix which happen 
inside these locks if we execute dict_foreach inside this lock, which is 
costly. That is the reason Du and I wanted to do the copy of dict inside 
locks and then use this dictionary to carry out dict_foreach without any 
locks so that we don't have to be inside locks for too long. the copy 
shouldn't use dict_foreach for this reason :-). But I forgot to mention 
about the lock. I am wondering if anyone has better solution than this.

>
> I suspect Xavi was suggesting (3) and I concur that it's the best
> long-term solution.
Xavi was mentioning that dict_copy_with_ref is too costly, which is 
true, if we make this change it will be even more costly :-(.

Pranith


More information about the Gluster-devel mailing list