[Bugs] [Bug 1709959] Gluster causing Kubernetes containers to enter crash loop with 'mkdir ... file exists' error message

Thu May 16 20:30:17 UTC 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1709959

--- Comment #8 from Jeff Bischoff <jeff.bischoff at turbonomic.com> ---
In my last comment, I asked: "...why the brick stays offline after the timeout.
After all, it is only a 'temporarily' unavailable resource. Shouldn't Gluster
be able to recover from this error without user intervention?"

To answer my own question: this appears to be a feature, not a bug. The brick
failure detection documentation at
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/brick-failure-detection/
says:

"When a brick process detects that the underlaying storage is not responding
anymore, the process will exit. There is no automated way that the brick
process gets restarted, the sysadmin will need to fix the problem with the
storage first."

It's good to at least understand why the brick isn't coming back up. However,
it seems strange to me that Gluster would choose to stop the brick process and
leave it offline in the face of an apparently transient issue. What is the best
approach to remedy this? Should I increase the timeouts... or even disable the
health checker?
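
For reference, here is a rough sketch of the two knobs I am considering, written
as a small Python helper that only builds the gluster CLI commands. The volume
name "myvol" and the helper function names are made up for illustration. The
storage.health-check-interval option comes from the brick failure detection
documentation linked above (as I read it, a value of 0 disables the health
checker), and "gluster volume start <VOLNAME> force" is the usual way to bring
an exited brick process back. I have not confirmed that either one is the right
fix for this bug.

```python
import shlex
import subprocess
from typing import List


def health_check_command(volume: str, interval_seconds: int) -> List[str]:
    """Build the CLI call that sets the brick health-check interval.

    Per the brick failure detection docs, an interval of 0 disables
    the health checker entirely.
    """
    if interval_seconds < 0:
        raise ValueError("interval_seconds must be >= 0")
    return [
        "gluster", "volume", "set", volume,
        "storage.health-check-interval", str(interval_seconds),
    ]


def restart_offline_bricks_command(volume: str) -> List[str]:
    """Build the CLI call that restarts brick processes that have exited."""
    return ["gluster", "volume", "start", volume, "force"]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Return the command as a shell-quoted string.

    The command is only executed when dry_run is False, so the default
    is safe to try on a machine without gluster installed.
    """
    line = shlex.join(cmd)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    # "myvol" is a hypothetical volume name; 60 is an arbitrary example value.
    print(run(health_check_command("myvol", 60)))
    print(run(restart_offline_bricks_command("myvol")))
```

With dry_run left at its default, the script just prints the two commands, so
they can be reviewed before being run by hand on a storage node.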
