[Gluster-users] Troubleshooting an outage with version mismatch in 3.8.x

Fri Feb 10 17:01:53 UTC 2017

Hello,

I am trying to understand an outage that we had recently when adding a 
new GlusterFS brick to our pool. The three existing nodes were each 
running 3.8.5. The new node was running 3.8.8. We didn't have any reason 
to think a point-release difference would cause problems. Within ten 
hours one of our sites experienced the following problems:

- nginx was unable to read files from GlusterFS
- the Docker container providing the nginx service became unresponsive 
to stop / start commands
- restarting the Docker service did not make it possible to stop / start 
the affected nginx containers
- ultimately a reboot of the host server was required

During the early part of the outage, the GlusterFS commands stopped 
working. As the outage proceeded, it became possible to navigate the 
files via Gluster, but nginx still could not serve them.

We experienced three outages in three days, all with similar symptoms.

- After the first outage we simply restarted the server to get the files 
to be delivered normally.
- After the second outage (22.5 hours later) we stopped the GlusterFS 
service on the new server. It was listed as "disconnected".
- After the third outage (11 hours later) we manually removed the volume 
for the affected (high volume) site.

It was only after taking this last action (removing the volume) that the 
outages stopped.

As best I can tell, the problem is the new brick, which has a 0.0.3 
difference (3.8.8 vs 3.8.5) from the other nodes in the pool.
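
To make the mismatch concrete, here is a minimal sketch of the 
comparison, assuming the version strings are collected by hand from 
each node. The hostnames are hypothetical placeholders; only the 
version numbers come from this message.

```python
from collections import Counter

# Hypothetical hostnames; the versions are the ones reported above.
node_versions = {
    "node1": "3.8.5",
    "node2": "3.8.5",
    "node3": "3.8.5",
    "node4-new": "3.8.8",
}


def parse_version(text):
    """Turn '3.8.5' into (3, 8, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_mismatches(versions):
    """Return the most common version and the nodes that differ from it."""
    parsed = {host: parse_version(v) for host, v in versions.items()}
    baseline, _ = Counter(parsed.values()).most_common(1)[0]
    differing = {host: v for host, v in parsed.items() if v != baseline}
    return baseline, differing


baseline, mismatched = find_mismatches(node_versions)
print("baseline:", ".".join(map(str, baseline)))
for host, version in mismatched.items():
    print(host, "differs:", ".".join(map(str, version)))
```

This only flags that the new node is on a different patch level. It 
says nothing about whether that difference is what caused the outages.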

Is this the expected behaviour from a point release? I would have 
thought a patch release would be fine.

Regards,
Emma