[Gluster-devel] bug-857330/normal.t failure
Justin Clift
justin at gluster.org
Thu May 22 12:54:00 UTC 2014
On 22/05/2014, at 1:34 PM, Kaushal M wrote:
> Thanks Justin, I found the problem. The VM can be deleted now.
Done. :)
> It turns out there was more than enough time for the rebalance to complete, but we hit a race that caused a command to fail.
>
> The particular test that failed waits for the rebalance to finish. It does this by running a 'gluster volume rebalance <> status' command and checking the result. The EXPECT_WITHIN function re-runs this command until we get a match, the command fails, or the timeout expires.
>
> For a rebalance status command, glusterd sends a request to the rebalance process (as a brick_op) to get the latest stats. It had done the same in this case as well. But while glusterd was waiting for the reply, the rebalance completed and the process stopped itself. This closed the rpc connection between glusterd and the rebalance process, which caused all pending requests to be unwound as failures, which in turn led to the command failing.
>
> I cannot think of a way to avoid this race from within glusterd. For this particular test, we could avoid using the 'rebalance status' command by directly checking the rebalance process's state via its pid etc. (sketched below). I don't particularly approve of this approach, as I think I used the 'rebalance status' command for a reason. But I currently cannot recall the reason, and if I cannot come up with it soon, I wouldn't mind changing the test to avoid 'rebalance status'.
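(For anyone following along: the wait Kaushal describes presumably looks
something like the following in bug-857330/normal.t, using the test
harness's EXPECT_WITHIN helper. The exact field extraction, timeout
variable, and pid file path below are my guesses, not lines copied from
the test.)

    # Current approach: poll 'rebalance status' until it reports
    # "completed", the command fails, or the timeout expires.
    function rebalance_status_field {
        # Row/field extraction is a guess at the CLI output format.
        $CLI volume rebalance $1 status | awk '/localhost/ {print $7}'
    }
    EXPECT_WITHIN $REBALANCE_TIMEOUT "completed" rebalance_status_field $V0

    # Pid-based alternative: watch the rebalance daemon directly and
    # wait for it to exit. The pid file location is an assumption.
    function rebalance_running {
        local pidfile="/var/lib/glusterd/vols/$1/rebalance.pid"
        if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            echo "Y"
        else
            echo "N"
        fi
    }
    EXPECT_WITHIN $REBALANCE_TIMEOUT "N" rebalance_running $V0

The pid-based variant sidesteps glusterd entirely, which is why it
avoids the race, and also why Kaushal is wary of it.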
Hmmm, is it the kind of thing where the "rebalance status" command
should retry if its connection gets closed by a just-completed
rebalance (as happened here)?
Or would that not work as well?
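At the test level the retry could be as simple as a wrapper that
re-issues the status query once when it fails, on the theory that the
first failure was the disconnect from a just-completed rebalance. A
sketch (the one-second pause and single retry are arbitrary choices of
mine):

    # Retry the status query once on failure; a failure caused by the
    # rebalance process stopping between the two attempts should turn
    # into a clean "completed" on the second try.
    function rebalance_status_with_retry {
        local vol=$1
        if ! $CLI volume rebalance $vol status; then
            sleep 1
            $CLI volume rebalance $vol status
        fi
    }

Retrying inside glusterd itself would be cleaner, but as you note the
disconnect unwinds the pending request as a failure, so glusterd would
need to distinguish "connection closed because rebalance finished" from
a real error before resending.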
+ Justin
--
Open Source and Standards @ Red Hat
twitter.com/realjustinclift