[Gluster-devel] Blocking client feature request

Geoff Kassel gkassel at users.sourceforge.net
Sun Jan 18 05:38:36 UTC 2009


Hi,

> Do you have logs/cores which can help us?

I'll try to produce some for you soon. I've been busy trying to stabilise the 
affected production systems during our high-demand period, so this will have 
to wait until it's safe to incur a deliberate outage.

(The configuration I'm running now, while prone to fewer crashes, is not one I 
intend to keep long-term, as only one daemon is in use across two client 
machines, and without any performance translators. So I have to wait until 
after peak time to debug the performance-enhanced, one-daemon-per-server 
configuration I normally run and want working again.)

Oh - one thing I have noticed since upgrading from 1.3 TLA 646 is a large 
number of (for want of a better word) 'unhealable' files - files that I know 
were previously present on at least one dataspace block, but are now present 
only in the namespace block.
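
(In case it helps with reproducing this: I've been spotting these orphans 
with a quick Python sketch along the lines of the below. The block paths are 
placeholders for my actual backend directories.)

#!/usr/bin/env python
# Rough sketch: list files that exist under the namespace block but have
# no copy on any dataspace block. All paths below are placeholders.
import os

NAMESPACE = "/export/gluster-ns"          # namespace block (placeholder)
DATASPACES = ["/export/gluster-ds1",      # dataspace blocks (placeholders)
              "/export/gluster-ds2"]

for dirpath, _dirs, filenames in os.walk(NAMESPACE):
    rel = os.path.relpath(dirpath, NAMESPACE)
    for name in filenames:
        relfile = os.path.join(rel, name)
        # Flag files with no copy on any dataspace block.
        if not any(os.path.exists(os.path.join(ds, relfile))
                   for ds in DATASPACES):
            print(relfile)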

I mention this because there seems to be some correlation between deleting 
these files and the time between crashes increasing. It doesn't seem to be as 
clear-cut as 'self-heal is causing the crash', since processes accessing the 
affected files through the GlusterFS export don't cause a crash right there 
and then.

It just seems to increase the risk of a crash over time. Perhaps it's some 
sort of resource leak in self-heal?

Anyway, hopefully the logs - when I can safely produce them - will pin down 
the true cause.

> Given the fact that there is a reasonably high demand for it, I think
> we should be adding this support as an option in our protocol. There
> are a few challenges with the current design (like having stateful fd)
> which will need some trickery to accommodate them across reconnects.
> So it may not be implemented immediately, but maybe in 2.1.x or 2.2.x.
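
For what it's worth, here's roughly how I picture that working from the 
client side - blocking callers while disconnected and replaying the opens 
afterwards. This is purely an illustrative Python sketch of the idea; every 
name in it is invented, and I make no claim it resembles the actual protocol 
code.

import threading

class ReconnectingClient:
    # Toy model: remember open state, stall I/O while down, replay opens.
    def __init__(self, transport):
        self.transport = transport          # hypothetical server connection
        self.open_fds = {}                  # fd -> (path, flags)
        self.connected = threading.Event()
        self.connected.set()

    def open(self, path, flags):
        fd = self.transport.open(path, flags)
        self.open_fds[fd] = (path, flags)   # keep enough state to replay
        return fd

    def read(self, fd, size, offset):
        self.connected.wait()               # block instead of failing I/O
        return self.transport.read(fd, size, offset)

    def on_disconnect(self):
        self.connected.clear()              # stall new I/O until reconnect

    def on_reconnect(self):
        # Re-establish the server-side fd state before releasing callers.
        for fd, (path, flags) in self.open_fds.items():
            self.transport.reopen(fd, path, flags)
        self.connected.set()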

Thanks for considering this.

If I had a wish list for GlusterFS, this feature would be at the top of it.

Kind regards,

Geoff Kassel.

On Sat, 17 Jan 2009, Anand Avati wrote:
> >   What I've realized is that a blocking GlusterFS client would solve this
> > negative visibility problem for me while I look again at the crash
> > issues. (I've just upgraded to the latest 1.4/2.0 TLA, so my experiences
> > are relevant to the majority again. Yes, I'm still getting crashes.)
>
> Do you have logs/cores which can help us?
>
> >   That way, I'd just have to restart the GlusterFS daemon(s), and my
> > running services would block, but not have to be restarted. My clients
> > would see a lack of responsiveness for up to 20 seconds, not a five to
> > ten minute outage.
> >
> >   Is there any possibility of this feature being added to GlusterFS?
>
> Given the fact that there is a reasonably high demand for it, I think
> we should be adding this support as an option in our protocol. There
> are a few challenges with the current design (like having stateful fd)
> which will need some trickery to accommodate them across reconnects.
> So it may not be implemented immediately, but maybe in 2.1.x or 2.2.x.
>
> avati