[Gluster-devel] races in dict_foreach() causing crashes in tier-file-creat.t

Mon Mar 14 05:38:22 UTC 2016


On 03/11/2016 10:16 PM, Jeff Darcy wrote:
>> Tier does send lookups serially, which fail on the hashed subvolumes of
>> dhts. Both of them trigger lookup_everywhere which is executed in epoll
>> threads, thus they are executed in parallel.
> According to your earlier description, items are being deleted by EC
> (i.e. the cold tier) while AFR (i.e. the hot tier) is trying to access
> the same dictionary.  That sounds pretty parallel across the two.  It
> doesn't matter, though, because I think we agree that this solution is
> too messy anyway.
>
>>> (3) Enhance dict_t with a gf_lock_t that can be used to serialize
>>> access.  We don't have to use the lock in every invocation of
>>> dict_foreach (though we should probably investigate that).  For
>>> now, we can just use it in the code paths we know are contending.
>> dict already has a lock.
> Yes, we have a lock which is used in get/set/add/delete - but not in
> dict_foreach for the reasons you mention.  I should have been clearer
> that I was suggesting a *different* lock that's only used in this
> case.  Manually locking with the lock we already have might not work
> due to recursive locking, but the lock ordering with a separate
> higher-level lock is pretty simple and it won't affect any other uses.
I didn't quite get it. Could you elaborate, please? The race is
between 1) dict_set() and 2) dict_foreach().

Pranith
>
>> Xavi was mentioning that dict_copy_with_ref is too costly, which is
>> true, if we make this change it will be even more costly :-(.
> There are probably MVCC-ish approaches that could be both safe and
> performant, but they'd be quite complicated to implement.
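
For reference, the separate higher-level lock discussed above could look
roughly like the sketch below. This is only an illustration under stated
assumptions, not GlusterFS code: toy_dict_t, outer_lock and all toy_*
functions are hypothetical stand-ins, and pthread mutexes are used in place
of gf_lock_t. As described in the thread, the inner lock is taken by set but
not by foreach; the outer lock is taken only in the two contending paths and
always before the inner one, so the lock ordering stays simple.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_KEYS 1000

/* Toy stand-in for a dict; NOT the real dict_t layout. */
typedef struct toy_pair {
    char key[32];
    int value;
    struct toy_pair *next;
} toy_pair_t;

typedef struct {
    pthread_mutex_t lock;       /* inner lock, taken by set only */
    pthread_mutex_t outer_lock; /* hypothetical higher-level lock */
    toy_pair_t *head;
} toy_dict_t;

typedef int (*toy_foreach_fn)(const char *key, int value, void *data);

static void toy_dict_init(toy_dict_t *d) {
    pthread_mutex_init(&d->lock, NULL);
    pthread_mutex_init(&d->outer_lock, NULL);
    d->head = NULL;
}

/* Insert or update a key under the inner lock. */
static int toy_dict_set(toy_dict_t *d, const char *key, int value) {
    int ret = 0;
    pthread_mutex_lock(&d->lock);
    toy_pair_t *p = d->head;
    while (p && strcmp(p->key, key) != 0)
        p = p->next;
    if (p) {
        p->value = value;
    } else if ((p = calloc(1, sizeof(*p))) != NULL) {
        strncpy(p->key, key, sizeof(p->key) - 1);
        p->value = value;
        p->next = d->head;
        d->head = p;
    } else {
        ret = -1;
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Walk all pairs WITHOUT the inner lock; unsafe against a concurrent set. */
static int toy_dict_foreach(toy_dict_t *d, toy_foreach_fn fn, void *data) {
    int visited = 0;
    for (toy_pair_t *p = d->head; p; p = p->next) {
        if (fn(p->key, p->value, data) != 0)
            return -1;
        visited++;
    }
    return visited;
}

/* Contending paths only: take the outer lock first, then do the work. */
static int toy_dict_set_serialized(toy_dict_t *d, const char *key, int value) {
    pthread_mutex_lock(&d->outer_lock);
    int ret = toy_dict_set(d, key, value);
    pthread_mutex_unlock(&d->outer_lock);
    return ret;
}

static int toy_dict_foreach_serialized(toy_dict_t *d, toy_foreach_fn fn,
                                       void *data) {
    pthread_mutex_lock(&d->outer_lock);
    int ret = toy_dict_foreach(d, fn, data);
    pthread_mutex_unlock(&d->outer_lock);
    return ret;
}

static int toy_count_cb(const char *key, int value, void *data) {
    (void)key;
    (void)value;
    ++*(long *)data;
    return 0;
}

static void *toy_writer(void *arg) {
    char key[32];
    for (int i = 0; i < TOY_KEYS; i++) {
        snprintf(key, sizeof(key), "key-%d", i);
        toy_dict_set_serialized(arg, key, i);
    }
    return NULL;
}

static void *toy_reader(void *arg) {
    for (int i = 0; i < 100; i++) {
        long seen = 0;
        int visited = toy_dict_foreach_serialized(arg, toy_count_cb, &seen);
        assert(visited == seen); /* each walk sees a consistent list */
    }
    return NULL;
}

/* Run one writer and one reader in parallel; return the final pair count. */
int toy_demo(void) {
    toy_dict_t d;
    pthread_t w, r;
    long seen = 0;
    toy_dict_init(&d);
    int rc = pthread_create(&w, NULL, toy_writer, &d);
    assert(rc == 0);
    rc = pthread_create(&r, NULL, toy_reader, &d);
    assert(rc == 0);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    toy_dict_foreach(&d, toy_count_cb, &seen);
    while (d.head) {
        toy_pair_t *next = d.head->next;
        free(d.head);
        d.head = next;
    }
    return (int)seen;
}
```

Calling toy_demo() runs both threads to completion and returns the number of
pairs left in the toy dict. Code paths that do not contend would keep using
the plain set/foreach and never touch the outer lock, which is why the extra
lock should not affect other users.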