[Gluster-devel] races in dict_foreach() causing crashes in tier-file-creat.t

Fri Mar 11 16:46:16 UTC 2016

> Tier does send lookups serially, which fail on the hashed subvolumes of
> dhts. Both of them trigger lookup_everywhere which is executed in epoll
> threads, thus they are executed in parallel.

According to your earlier description, items are being deleted by EC
(i.e. the cold tier) while AFR (i.e. the hot tier) is trying to access
the same dictionary.  That sounds pretty parallel across the two.  It
doesn't matter, though, because I think we agree that this solution is
too messy anyway.

> > (3) Enhance dict_t with a gf_lock_t that can be used to serialize
> > access.  We don't have to use the lock in every invocation of
> > dict_foreach (though we should probably investigate that).  For
> > now, we can just use it in the code paths we know are contending.
> 
> dict already has a lock.

Yes, we have a lock which is used in get/set/add/delete - but not in
dict_foreach for the reasons you mention.  I should have been clearer
that I was suggesting a *different* lock that's only used in this
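case.  Manually locking with the lock we already have might not work,
because a dict_foreach callback that calls get/set on the same dict
would try to take that lock recursively.  With a separate higher-level
lock the ordering is simple (higher-level lock first, then the existing
one), and it won't affect any other uses.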
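
To make the lock ordering concrete, here is a minimal sketch under
stated assumptions.  It uses a toy dictionary, not the real dict_t:
toy_dict_t, iter_lock and every toy_* function are hypothetical names
invented for illustration, and pthread mutexes stand in for gf_lock_t.
Only the contending paths take the extra lock; the plain set/del calls
keep using the existing one.

```c
#include <pthread.h>
#include <string.h>

#define TOY_MAX 8

/* Toy stand-in for dict_t; hypothetical, not the GlusterFS structure. */
typedef struct toy_dict {
    pthread_mutex_t lock;      /* existing lock: taken by set/del */
    pthread_mutex_t iter_lock; /* extra higher-level lock (assumption) */
    const char *keys[TOY_MAX]; /* borrowed pointers, owned by the caller */
    int values[TOY_MAX];
    int count;
} toy_dict_t;

typedef int (*toy_cb_t)(toy_dict_t *d, const char *key, int value, void *data);

void toy_dict_init(toy_dict_t *d)
{
    pthread_mutex_init(&d->lock, NULL);
    pthread_mutex_init(&d->iter_lock, NULL);
    d->count = 0;
}

/* Low-level set: takes only the existing lock. Returns -1 if full. */
int toy_dict_set(toy_dict_t *d, const char *key, int value)
{
    int i, ret = -1;
    pthread_mutex_lock(&d->lock);
    for (i = 0; i < d->count; i++)
        if (strcmp(d->keys[i], key) == 0)
            break;
    if (i < TOY_MAX) {
        d->keys[i] = key;
        d->values[i] = value;
        if (i == d->count)
            d->count++;
        ret = 0;
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Low-level delete: takes only the existing lock. */
void toy_dict_del(toy_dict_t *d, const char *key)
{
    pthread_mutex_lock(&d->lock);
    for (int i = 0; i < d->count; i++) {
        if (strcmp(d->keys[i], key) == 0) {
            d->count--;
            d->keys[i] = d->keys[d->count];
            d->values[i] = d->values[d->count];
            break;
        }
    }
    pthread_mutex_unlock(&d->lock);
}

/* Unlocked iteration, so callbacks are free to call toy_dict_set(). */
int toy_dict_foreach(toy_dict_t *d, toy_cb_t fn, void *data)
{
    for (int i = 0; i < d->count; i++) {
        int ret = fn(d, d->keys[i], d->values[i], data);
        if (ret < 0)
            return ret;
    }
    return 0;
}

/* Contending path A: iterate while holding the higher-level lock. */
int toy_locked_foreach(toy_dict_t *d, toy_cb_t fn, void *data)
{
    pthread_mutex_lock(&d->iter_lock);
    int ret = toy_dict_foreach(d, fn, data);
    pthread_mutex_unlock(&d->iter_lock);
    return ret;
}

/* Contending path B: delete, lock order is iter_lock then lock. */
void toy_locked_del(toy_dict_t *d, const char *key)
{
    pthread_mutex_lock(&d->iter_lock);
    toy_dict_del(d, key);
    pthread_mutex_unlock(&d->iter_lock);
}

/* Example callbacks: sum the values, or double each value in place. */
int toy_sum_cb(toy_dict_t *d, const char *key, int value, void *data)
{
    (void)d; (void)key;
    *(int *)data += value;
    return 0;
}

int toy_double_cb(toy_dict_t *d, const char *key, int value, void *data)
{
    (void)data;
    return toy_dict_set(d, key, value * 2); /* takes d->lock, no recursion */
}
```

A callback may still call toy_dict_set() because the iteration holds
only iter_lock, never the inner lock.  A callback must not call
toy_locked_del() or toy_locked_foreach() on the same dictionary, since
that would re-take iter_lock and deadlock.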

> Xavi was mentioning that dict_copy_with_ref is too costly, which is
> true; if we make this change it will be even more costly :-(.

There are probably MVCC-ish approaches that could be both safe and
performant, but they'd be quite complicated to implement.
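
For illustration only, one very simple shape an MVCC-ish scheme could
take is copy-on-write snapshots: readers pin an immutable version and
iterate it without holding any lock, while writers build a new version
and publish it.  This is a sketch under stated assumptions, not code
from GlusterFS and not a concrete proposal from this thread: snap_t,
vdict_t and the vdict_* functions are hypothetical names, and a pthread
mutex stands in for gf_lock_t.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define SNAP_MAX 8

/* One immutable version of the key/value set (hypothetical type). */
typedef struct snap {
    int refs;                   /* protected by vdict_t.lock */
    int count;
    const char *keys[SNAP_MAX]; /* borrowed pointers, owned by the caller */
    int values[SNAP_MAX];
} snap_t;

typedef struct vdict {
    pthread_mutex_t lock;       /* guards `current` and all snap_t.refs */
    snap_t *current;
} vdict_t;

int vdict_init(vdict_t *d)
{
    d->current = calloc(1, sizeof(snap_t));
    if (!d->current)
        return -1;
    d->current->refs = 1;       /* reference held by the dict itself */
    return pthread_mutex_init(&d->lock, NULL);
}

/* Reader: pin the current version; it never changes afterwards. */
snap_t *vdict_acquire(vdict_t *d)
{
    pthread_mutex_lock(&d->lock);
    snap_t *s = d->current;
    s->refs++;
    pthread_mutex_unlock(&d->lock);
    return s;
}

/* Reader: drop a pinned version, freeing it if nobody else uses it. */
void vdict_release(vdict_t *d, snap_t *s)
{
    pthread_mutex_lock(&d->lock);
    int last = (--s->refs == 0);
    pthread_mutex_unlock(&d->lock);
    if (last)
        free(s);
}

/* Writer: copy the current version, set one key, publish the copy. */
int vdict_set(vdict_t *d, const char *key, int value)
{
    snap_t *next = malloc(sizeof(*next));
    if (!next)
        return -1;
    pthread_mutex_lock(&d->lock);
    snap_t *old = d->current;
    *next = *old;
    next->refs = 1;
    int i;
    for (i = 0; i < next->count; i++)
        if (strcmp(next->keys[i], key) == 0)
            break;
    if (i == SNAP_MAX) {        /* full and key not present */
        pthread_mutex_unlock(&d->lock);
        free(next);
        return -1;
    }
    next->keys[i] = key;
    next->values[i] = value;
    if (i == next->count)
        next->count++;
    d->current = next;
    int last = (--old->refs == 0);
    pthread_mutex_unlock(&d->lock);
    if (last)
        free(old);
    return 0;
}

/* Writer: copy everything except `key`, then publish the copy. */
int vdict_del(vdict_t *d, const char *key)
{
    snap_t *next = malloc(sizeof(*next));
    if (!next)
        return -1;
    pthread_mutex_lock(&d->lock);
    snap_t *old = d->current;
    next->refs = 1;
    next->count = 0;
    for (int i = 0; i < old->count; i++) {
        if (strcmp(old->keys[i], key) == 0)
            continue;
        next->keys[next->count] = old->keys[i];
        next->values[next->count] = old->values[i];
        next->count++;
    }
    d->current = next;
    int last = (--old->refs == 0);
    pthread_mutex_unlock(&d->lock);
    if (last)
        free(old);
    return 0;
}
```

Note that this naive version copies the whole set on every write, so
it only moves the copy cost from readers to writers.  Avoiding that
would need something like structural sharing between versions, which
is where the real complexity comes in.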