[Gluster-users] Troubleshooting an outage with version mismatch in 3.8.x
Emma Hogbin Westby
emma at humanitarianresponse.info
Fri Feb 10 17:01:53 UTC 2017
I am trying to understand an outage that we had recently when adding a
new GlusterFS brick to our pool. The three existing nodes were each
running 3.8.5. The new node was running 3.8.8. We didn't have any reason
to think a point-release difference would cause problems. Within ten
hours, one of our sites experienced the following problems:
- nginx was unable to read files from GlusterFS
- the Docker container providing the nginx service became unresponsive to
stop / start
- restarting the Docker service did not make it possible to stop / start
the affected nginx containers
- ultimately a reboot of the host server was required
During the early part of the outage, the GlusterFS commands stopped
working. As the outage proceeded, it was possible to navigate the files
via Gluster, but not serve them via nginx.
We experienced three outages in three days all with similar symptoms.
- After the first outage we simply restarted the server to get the files
to be delivered normally.
- After the second outage (22.5 hours later) we stopped the GlusterFS
service on the new server. It was listed as "disconnected".
- After the third outage (11 hours later) we manually removed the volume
for the affected (high volume) site.
It was only after taking this action that the outages stopped.
As best I can tell, the problem is the new brick, which has a 0.0.3
difference (3.8.5 vs 3.8.8) from the other nodes in the pool.
Is this the expected behaviour from a point release? I would have
thought a patch release would be fine.