[Gluster-devel] Sick but still "alive" nodes
Jeff Darcy
jdarcy at redhat.com
Fri Jan 25 13:43:41 UTC 2013
On 01/25/2013 07:47 AM, jayunit100 at gmail.com wrote:
> Hi guys: I just saw an issue on the HDFS mailing list that might be a
> potential problem in gluster clusters. It kind of reminds me of
> Jeff's idea of bricks as first-class objects in the API.
>
> What happens if a gluster brick is on a machine which, although still
> alive, performs poorly?
>
> Would such scenarios be detected, and if so, can the brick be
> decommissioned/ignored/moved? If not, it would be a cool feature to
> have, because I'm sure it happens from time to time.
There's nothing currently in place to detect such a condition, and of
course if we can't detect it we can't do anything about it. There are
also several cases where we might actually manage to make things worse
if we try to do this ourselves. For example, consider the case where
the slowness is caused by some short-lived contending activity. We
might well react just as that activity subsides, suspending that brick
just as another brick starts "going bad" due to similar transient
activity there. Similarly, if the system overall is truly overloaded,
suspending bricks is a bit like squeezing a water balloon - the "bulge"
just reappears elsewhere, and all we've done is reduce the total
resources available.
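To make that concrete, here's a rough sketch in Python (nothing like
this exists in Gluster today; the class, thresholds, and field names
are all invented) of the smoothing and hysteresis even a naive
detector would need just to avoid flapping on transient load:

    import time

    class BrickHealthMonitor:
        """Invented sketch: smooth per-brick latency samples with an
        EWMA and require sustained degradation before flagging a
        brick, so a short burst of contention doesn't trigger a
        suspension just as the burst subsides."""

        def __init__(self, alpha=0.2, slow_ms=50.0, sustain_secs=300):
            self.alpha = alpha            # EWMA smoothing factor
            self.slow_ms = slow_ms        # "slow" latency threshold
            self.sustain_secs = sustain_secs
            self.ewma_ms = None
            self.slow_since = None        # when sustained slowness began

        def record_sample(self, latency_ms, now=None):
            now = time.time() if now is None else now
            if self.ewma_ms is None:
                self.ewma_ms = latency_ms
            else:
                self.ewma_ms = (self.alpha * latency_ms
                                + (1 - self.alpha) * self.ewma_ms)
            if self.ewma_ms > self.slow_ms:
                if self.slow_since is None:
                    self.slow_since = now
            else:
                self.slow_since = None    # recovered; reset the clock

        def is_degraded(self, now=None):
            now = time.time() if now is None else now
            return (self.slow_since is not None
                    and now - self.slow_since >= self.sustain_secs)

Even with that kind of damping, the monitor only sees one brick at a
time, so it still can't tell transient contention from genuine
cluster-wide overload - which is exactly the water-balloon problem
above.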
I've seen problems like this with other parallel filesystems, and I'm
pretty sure I've read papers about them too. IMO the right place to
deal with such issues is at the job-scheduler or similar level, where
more of the total system state is known. What we can do is provide more
information about our part of the system state, plus levers that those
schedulers can pull when they decide that preparation or correction for a
higher-level event (that we probably don't even know about) is appropriate.
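As a strawman for what that split might look like from the scheduler's
side - and to be clear, none of these calls or fields exist in Gluster
today - imagine we exported per-brick state and left the placement
decision entirely to the thing that knows about the workload:

    from dataclasses import dataclass

    # Strawman interface: nothing here exists in GlusterFS.  The
    # division of labor is the point: Gluster reports per-brick state,
    # and the job scheduler, which knows the workload, decides.

    @dataclass
    class BrickState:
        brick: str              # e.g. "server1:/export/brick0"
        avg_latency_ms: float
        queue_depth: int

    def plan_placement(bricks, latency_budget_ms):
        """Prefer healthy bricks for a latency-sensitive job, but only
        deprioritize slow ones when that doesn't shrink capacity below
        what the job needs - otherwise we're just squeezing the
        balloon again."""
        healthy = [b for b in bricks
                   if b.avg_latency_ms <= latency_budget_ms]
        if len(healthy) >= max(1, len(bricks) // 2):
            return healthy
        return bricks           # too few healthy bricks; use them all

    if __name__ == "__main__":
        bricks = [BrickState("server1:/export/brick0", 4.2, 3),
                  BrickState("server2:/export/brick0", 180.0, 41),
                  BrickState("server3:/export/brick0", 5.1, 2)]
        for b in plan_placement(bricks, latency_budget_ms=20.0):
            print("schedule on", b.brick)

The interesting part is that plan_placement() can consult things we
never see - job deadlines, queue state on other resources - and only
pulls the lever when the trade-off actually pays.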