[Gluster-devel] spurious failure with sparse-file-heal.t test

Krishnan Parthasarathi kparthas at redhat.com
Sat Jun 6 05:06:29 UTC 2015



----- Original Message -----
> 
> 
> ----- Original Message -----
> > 
> > > This seems to happen because of race between STACK_RESET and stack
> > > statedump. Still thinking how to fix it without taking locks around
> > > writing to file.
> > 
> > Why should we still keep the stack being reset as part of the pending pool
> > of frames? Even if we had to (I can't guess why), when we remove it we
> > should do the following to prevent gf_proc_dump_pending_frames from
> > crashing.
> > 
> > ...
> > 
> > call_frame_t *toreset = NULL;
> > 
> > /* Detach the frame list under the pool lock so that a concurrent
> >  * statedump never sees a half-destroyed list. */
> > LOCK (&stack->pool->lock);
> > {
> >         toreset = stack->frames;
> >         stack->frames = NULL;
> > }
> > UNLOCK (&stack->pool->lock);
> > 
> > ...
> > 
> > Now, perform all operations that are done on stack->frames on toreset
> > instead. Thoughts?

Here is a patch that does more than what is suggested in the snippet above:
http://review.gluster.com/11095. The patch makes the stack use a struct list_head
for its frames. That makes frame manipulation simpler and easier to reason about,
especially since most of us are familiar with struct list_head. Additionally, the
patch fixes the race you pointed out between STACK_RESET and
gf_proc_dump_pending_frames. It does this by making STACK_RESET take
call_pool->lock, though not for long: only to splice out the list of frames that
needs to be destroyed. I checked that sparse-file-heal.t passes both on my laptop
(nearly irrelevant) and on Jenkins. Hope that fixes this regression failure. Let
me know what you think.
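
For anyone following along, here is a minimal sketch of the splice-under-lock
idea. This is not the code from http://review.gluster.com/11095; the struct
names, the pthread mutex, and the frame_destroy helper below are simplified
stand-ins for the real call_stack_t/call_pool_t types and the LOCK/UNLOCK
macros, and list.h stands in for GlusterFS's struct list_head helpers:

/* Sketch: reset a stack by splicing its frames onto a private
 * list under the pool lock, then destroying them lock-free. */

#include <pthread.h>
#include <stdlib.h>
#include "list.h"                /* struct list_head helpers (assumed) */

struct frame {
        struct list_head list;   /* linkage on the owning stack's frame list */
        /* per-frame state elided */
};

struct pool {
        pthread_mutex_t lock;    /* also taken by the statedump walker */
};

struct stack {
        struct list_head frames; /* all frames of this call stack */
        struct pool     *pool;
};

static void
frame_destroy (struct frame *frame)
{
        free (frame);            /* teardown runs outside pool->lock */
}

static void
stack_reset (struct stack *stack)
{
        struct list_head doomed;
        struct frame *frame = NULL;
        struct frame *tmp = NULL;

        INIT_LIST_HEAD (&doomed);

        /* Brief critical section: the O(1) splice leaves stack->frames
         * empty but always well-formed, so a concurrent
         * gf_proc_dump_pending_frames (which takes the same lock)
         * never walks freed frames. */
        pthread_mutex_lock (&stack->pool->lock);
        {
                list_splice_init (&stack->frames, &doomed);
        }
        pthread_mutex_unlock (&stack->pool->lock);

        /* Expensive per-frame destruction happens without the lock
         * held, on the now-private list. */
        list_for_each_entry_safe (frame, tmp, &doomed, list)
                frame_destroy (frame);
}

The point of the splice is that the statedump path only ever observes either
the full frame list or an empty one, while the actual frame teardown never
blocks dumpers.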

