[Gluster-devel] netbsd regression update : cdc.t

Krishnan Parthasarathi kparthas at redhat.com
Wed May 6 02:57:50 UTC 2015



----- Original Message -----
> On Mon, May 04, 2015 at 09:20:45AM +0530, Atin Mukherjee wrote:
> > I see the following log from the brick process:
> > 
> > [2015-05-04 03:43:50.309769] E [socket.c:823:__socket_server_bind]
> > 4-tcp.patchy-server: binding to  failed: Address already in use
> 
> This happens before the failing test 52 (volume stop), on test 51, which is
> the volume reset network.compression operation.
> 
> At that time the volume is already started, with the brick process running.
> volume reset network.compression causes the brick process to be started
> again. But since the previous brick process was not terminated, it still
> holds the port and the new process fails to start.
> 
> As a result we have a volume started with its only brick not running.
> It seems volume stop waits for the missing brick to come online, and
> that is why we fail.

[Correction(s) in root cause analysis for posterity]
The brick process is _not_ restarted. The graph change detection algorithm
is not capable of handling the addition of a translator to the server-side graph.
As a result, the brick process calls init() on all of its existing translators,
most importantly the server translator in this case. As part of the server translator's
init(), the listening socket is bound (again). That is what the log messages say too.
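To make the failure mode concrete, here is a minimal standalone C sketch (not the
actual socket.c code; the port number is arbitrary) showing why a second bind() on a
port that is still held by an earlier listener fails with EADDRINUSE, which is exactly
what the brick log reports:

/*
 * Standalone illustration (hypothetical, not gluster code): the first
 * bind() succeeds; re-running the same "init" while the first socket is
 * still open fails with "Address already in use".
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int bind_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        /* mirrors the "binding to ... failed: Address already in use" log */
        fprintf(stderr, "binding to port %u failed: %s\n",
                (unsigned)port, strerror(errno));
        close(fd);
        return -1;
    }
    listen(fd, 10);
    return fd;
}

int main(void)
{
    int first  = bind_listener(49152); /* first init(): succeeds           */
    int second = bind_listener(49152); /* re-run of init(): EADDRINUSE     */

    if (first >= 0)
        close(first);
    return (second >= 0) ? 0 : 1;
}

The brick process is in the same situation: the listening socket created by the first
init() is never closed, so the second run of the server translator's init() hits
EADDRINUSE.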

> 
> The patch below is enough to work around the problem: first stop the
> volume before doing volume reset network.compression.
> 
> Questions:
> 1) is it expected that volume reset network.compression restarts
>    the bricks?

It _doesn't_. See above.

> 2) shall we consider it a bug that volume stop waits for bricks that
>    are down? I think we should.

volume-stop has two important phases in its execution. In the first phase,
glusterd sends an RPC asking the brick to kill itself after performing a cleanup.
Subsequently, glusterd issues a kill(2) system call if the brick process hasn't
killed itself (yet). In this case, glusterd issues the RPC, but the brick process doesn't
'process' it. Since the stack frames were corrupted (as gdb claimed), we couldn't analyse
the cause further. At this point, we suspect that the brick process's poll(2) thread is blocked
for some reason.
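For illustration only, here is a simplified sketch of that two-phase flow. The helper
functions are stand-ins, not glusterd's actual API, and the graceful request is modelled
as SIGTERM purely to keep the sketch self-contained; in glusterd it is an RPC to the brick.

#include <sys/types.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for glusterd's "clean up and exit" request (hypothetical). */
static int request_brick_terminate(pid_t brick_pid)
{
    return kill(brick_pid, SIGTERM);
}

/* Is the brick process still alive? kill(pid, 0) only checks existence. */
static bool brick_is_running(pid_t brick_pid)
{
    return kill(brick_pid, 0) == 0;
}

int stop_brick(pid_t brick_pid)
{
    int waited = 0;

    /* Phase 1: ask the brick to perform cleanup and kill itself. */
    request_brick_terminate(brick_pid);
    while (brick_is_running(brick_pid) && waited < 30) {
        sleep(1);
        waited++;
    }

    /* Phase 2: if the brick never processed the request (e.g. because
     * its poll thread is blocked), fall back to kill(2). */
    if (brick_is_running(brick_pid))
        return kill(brick_pid, SIGKILL);

    return 0;
}

The failure seen here corresponds to the brick never reaching the cleanup path of
phase 1, so the stop operation ends up waiting on a brick that will not respond.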

We need help in getting gdb to work with proper stack frames; it is mostly due to my lack
of *BSD knowledge. With the fix for cdc.t we avoid the immediate regression failure, but we
don't know the real underlying issue yet. Any help would be appreciated.



