[Gluster-devel] Re; Load balancing ...

Gareth Bult gareth at encryptec.net
Wed Apr 30 16:10:04 UTC 2008


>We have a workaround for this so that clients don't see a "blocked" 
>situation. We have a 6-node cluster, and whenever a server node comes up 
>it lists all the files on its "slice" of the gluster mount and then 
>does a find on those files on the main mounted volume. In other words, 
>each server is also a client, and does the heal process right after 
>starting up.

Well, here's the thing: unless I'm mistaken, you are relying on this to keep your files in sync .. any files that are open for update when your node comes back online will not be healed as you expect them to be (!)

For the heal to work properly, in addition to your find, you would need open files on all nodes to be closed and re-opened (!) (unless some nice patches have gone in since my last test?)

Gareth.


----- Original Message -----
From: "Mickey Mazarick" <mic at digitaltadpole.com>
To: "Gareth Bult" <gareth at encryptec.net>
Cc: gordan at bobich.net, gluster-devel at nongnu.org
Sent: Wednesday, April 30, 2008 4:58:46 PM GMT +00:00 GMT Britain, Ireland, Portugal
Subject: Re: [Gluster-devel] Re; Load balancing ...

We have a workaround for this so that clients don't see a "blocked" 
situation. We have a 6-node cluster, and whenever a server node comes up 
it lists all the files on its "slice" of the gluster mount and then 
does a find on those files on the main mounted volume. In other words, 
each server is also a client, and does the heal process right after 
starting up.
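A minimal sketch of this style of heal trigger (the function name and paths are illustrative, not from the original setup): walk the local brick directory and read one byte of each file through the client mount, so that the open path runs self-heal on every file in the slice.

```shell
#!/bin/sh
# Illustrative heal trigger: open every file from this server's local
# "slice" (brick) via the glusterfs client mount, so heal-on-open runs.
heal_slice() {
    brick="$1"   # local backend directory holding this server's slice
    mount="$2"   # client mount of the full replicated volume
    ( cd "$brick" && find . -type f ) | while read -r f; do
        # Reading one byte opens the file through the mount,
        # which is what triggers the heal-on-open logic.
        head -c1 "$mount/$f" >/dev/null 2>&1 || true
    done
}
```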

-Mic

Gareth Bult wrote:
> Ok, I'm afraid I'm not agreeing with some of the things you're saying, but that doesn't really get us anywhere.
>
> I think the bottom line is: having read or write operations block for any length of time in order to merge a node back into the cluster following an outage simply isn't "real world". "Any length of time" will vary for different operators; for me it's seconds, but for anyone running real-time systems or VoIP, for example, it might be milliseconds.
>
> My feeling is that self-heal "on open" is "nice", easy to code and maintain, neat, and I can see "how it came to be".
>
> It's not, however, practical for a commercial solution; healing needs to be a background process, not a foreground process.
>
> The only question I really have is: what sort of timescale are we looking at before GlusterFS has the kind of capabilities (in context) that one would expect of a production clustering filesystem?
> (I think we all know hash/sync would be an interim solution)
>
> (I know it's "different", but just as a point of reference, DRBD syncs in the background, as does Linux software raid .. can you imagine anyone using either if they tried to make people wait while they did a foreground sync ??)
>
> Gareth.
>
>
> ----- Original Message -----
> From: gordan at bobich.net
> To: gluster-devel at nongnu.org
> Sent: Wednesday, April 30, 2008 1:56:12 PM GMT +00:00 GMT Britain, Ireland, Portugal
> Subject: Re: [Gluster-devel] Re; Load balancing ...
>
>
>
> On Wed, 30 Apr 2008, Gareth Bult wrote:
>
>   
>>> It would certainly be beneficial in the cases when the network speed 
>>> is slow (e.g. WAN replication).
>>>       
>> So long as it's server side AFR and not client-side ... ?
>>     
>
> Sure.
>
>   
>> I'm guessing there would need to be some server side logic to ensure 
>> that local servers generated their own hashes and only exchanged the 
>> hashes over the network rather than the data ?
>>     
>
> Indeed - same as rsync does.
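The hash exchange being discussed could be sketched like this (a hedged illustration with my own block size and function names, not GlusterFS or rsync code): each side digests fixed-size blocks locally, only the digests cross the network, and data is transferred only for blocks whose digests differ.

```python
import hashlib

BLOCK = 4096  # illustrative block size

def block_hashes(data: bytes) -> list:
    """Digest each fixed-size block; only these digests cross the wire."""
    return [hashlib.sha1(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def blocks_to_send(local: bytes, remote_hashes: list) -> list:
    """Indices of blocks the remote side is missing or holds stale copies of."""
    return [i for i, h in enumerate(block_hashes(local))
            if i >= len(remote_hashes) or h != remote_hashes[i]]
```

Only the differing blocks (here, just index 1) would then be shipped over the WAN link, which is what makes the scheme attractive when the network is slow.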
>
>   
>>> Journal per se wouldn't work, because that implies fixed size and write-ahead logging.
>>> What would be required here is more like the snapshot style undo logging.
>>>       
>> A journal wouldn't work ?!
>> You mean its effectiveness would be governed by its size?
>>     
>
> Among other things. A "journal" just isn't suitable for this sort of 
> thing.
>
>   
>>> 1) Categorically establish whether each server is connected and up to date
>>> for the file being checked, and only log if the server has disconnected.
>>> This involves overhead.
>>>       
>> Surely you would log anyway, as there could easily be latency between an 
>> actual "down" and one's ability to detect it .. in which case detecting 
>> whether a server has disconnected is a moot point.
>>     
>
> Not really. A connected client/server will have a live/working TCP 
> connection open. Read locks don't matter, as they can be served locally, 
> but when a write occurs, the file gets locked. If a remote machine doesn't 
> ack the lock, and/or its TCP connection resets, then it's safe to assume 
> that it's not connected.
>
>   
>> In terms of the 
>> overhead of logging, I guess this would be a decision for the sysadmin 
>> concerned, whether the overhead of logging to a journal was worthwhile 
>> .vs. the potential issues involved in recovering from an outage?
>>     
>
> That complicates things further, then. You'd essentially have asynchronous 
> logging/replication. At that point you pretty much have to log all writes 
> all the time. That means potentially huge space and speed overheads.
>
>   
>> From my point of view, if journaling halved my write performance (which 
>> it wouldn't) I wouldn't even have to think about it.
>>     
>
> Actually, saving an undo log a la snapshots, which is what would be 
> required, _WOULD_ halve your write performance on all surviving servers if 
> one server was out. If multiple servers were out, you could probably work 
> around some of this by merging/splitting the undo logs for the various 
> machines, so your write performance would generally be around 1/2 of 
> standard, but wouldn't end up degrading to 1/(n+1), where n is the number 
> of failed servers for which the logging needs to be done.
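To make that arithmetic concrete (a toy model of the claim above, not measured numbers): with a separate undo log per failed server, every write is amplified once per log, while a merged log caps the amplification at one extra write regardless of how many servers are down.

```python
def per_server_log_factor(failed: int) -> float:
    """Naive scheme: one data write plus one undo-log write per failed server,
    so throughput degrades to 1/(n+1)."""
    return 1 / (failed + 1)

def merged_log_factor(failed: int) -> float:
    """Merged scheme: one data write plus at most one shared undo-log write,
    so throughput stays at roughly 1/2 of normal."""
    return 1.0 if failed == 0 else 0.5
```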
>
>   
>>> The problem that arises then is that the fast(er) resyncs on small changes
>>> come at the cost of massive slowdown in operation when you have multiple
>>> downed servers. As the number of servers grows, this rapidly stops being a
>>> workable solution.
>>>       
>> Ok, I don't know about anyone else, but my setups all rely on 
>> consistency rather than peaks and troughs. I'd far rather run a journal 
>> at half potential speed, and have everything run at that speed all the 
>> time .. than occasionally have to stop the entire setup while the system 
>> recovers, or essentially wait for 5-10 minutes while the system re-syncs 
>> after a node is reloaded.
>>     
>
> There may be a way to address the issue of halting the rest of the cluster 
> during the sync, though. A read lock on a syncing file shouldn't stop other 
> read locks. Of course, it will block writes while the file syncs and the 
> reading app finishes its operation.
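The locking behaviour described here can be sketched with a simple readers-writer lock (my own illustration, not GlusterFS internals): concurrent readers share the lock freely, while a writer must wait until all readers have drained, and new readers block while the writer holds it.

```python
import threading

class ReadersWriterLock:
    """Many concurrent readers; a writer waits until no reader holds the lock."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        # Readers never wait on each other, only on an active writer
        # (who holds self._cond while writing).
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        self._cond.acquire()
        while self._readers:
            self._cond.wait()  # drain existing readers before writing

    def release_write(self):
        self._cond.release()
```

This simple version can starve writers under a constant stream of new readers, but it captures the point in the thread: syncing a file under a read lock need not block other readers, only writers.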
>
> Gordan
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>

