[Gluster-devel] Suggestions

Gordan Bobic gordan at bobich.net
Wed Jun 8 15:22:45 UTC 2011

Hans K. Rosbach wrote:
> On Wed, 2011-06-08 at 12:34 +0100, Gordan Bobic wrote:
>> Hans K. Rosbach wrote:
>>> -SCTP support, this might not be a silver bullet but it feels
>> [...]
>>>  Features that might need glusterfs code changes:
>> [...]
>>>   -Multihoming (failover when one nic dies)
>> How is this different to what can be achieved (probably much more 
>> cleanly) with NIC bonding?
> NIC bonding is nice for a small network, but routed networks might
> benefit from this. It is not something I feel that I need myself, but
> I am sure it would be an advantage for some other users. It could
> possibly be of help in geo-replication setups, for example.

Not sure what routedness has to do with this. If you need route 
failover, that is probably best done by having an HA/cluster service 
change the routing table accordingly.
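
If it helps, here is a crude sketch of the sort of thing I mean. The 
interface names, gateways and timings are made up, and a proper HA tool 
(heartbeat/pacemaker, keepalived, etc.) would do this far more robustly:

    #!/bin/sh
    # Crude route failover sketch: if the primary gateway stops
    # answering, move the default route to the backup path.
    # All addresses and interface names below are examples only.
    PRIMARY_GW=10.0.0.1;  PRIMARY_IF=eth0
    BACKUP_GW=10.0.1.1;   BACKUP_IF=eth1

    while true; do
        if ping -c 1 -W 2 -I $PRIMARY_IF $PRIMARY_GW >/dev/null 2>&1; then
            ip route replace default via $PRIMARY_GW dev $PRIMARY_IF
        else
            ip route replace default via $BACKUP_GW dev $BACKUP_IF
        fi
        sleep 5
    done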

>>> -Ability to have the storage nodes autosync themselves.
>>>  In our setup the normal nodes have 2x1Gbit connections while the
>>>  storage boxes have 2x10Gbit connections, so having the storage
>>>  boxes use their own bandwidth and resources to sync would be nice.
>> Sounds like you want server-side rather than client-side replication. 
>> You could do this by using afr/replicate on the servers and exporting 
>> via NFS to the clients. Have failover handled as for any normal NFS server.
> We have considered this, and might decide to go down this route
> eventually; however, it seems strange that this cannot also be done
> using the native client.

Is the current NFS wheel not quite round enough for you? ;)
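
To be slightly more constructive, something along these lines ought to 
do it with the 3.x command line (the volume, brick and host names are 
made up, and you would still want a floating IP or similar for the 
clients' NFS mounts to fail over):

    # On one of the storage boxes: two-way replication between them
    gluster volume create mailvol replica 2 transport tcp \
        stor1:/export/mailvol stor2:/export/mailvol
    gluster volume start mailvol

    # On each client: plain NFSv3 mount against glusterfs's built-in
    # NFS server, via whatever address your failover solution moves
    mount -t nfs -o vers=3,proto=tcp stor-vip:/mailvol /var/spool/mail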

> The fact that each client writes to both servers is fine, but the
> fact that the clients need to do the re-sync work whenever the
> storage nodes are out of sync (because one of them rebooted, for
> example) seems strange and feels very unreliable, especially since
> this is a manual operation.

There is a plan C, though: make the servers clients as well. You can 
then have a process that does an "ls -laR" over the mount periodically, 
or upon failure, so that every file gets stat()ed and self-healed.
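
Something like this in cron on each server would do it (the mount point 
is made up; a recursive stat of the client mount is what kicks the 
self-heal off):

    # /etc/cron.d/gluster-resync: walk the locally mounted client view
    # every 30 minutes so every file gets stat()ed and self-healed
    */30 * * * * root find /mnt/gluster-local -noleaf -print0 | xargs -0 stat >/dev/null 2>&1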

>>> -An ability for the clients to subscribe to metadata updates for
>>>  a specific directory would also be nice, so that a client can cache
>>>  that folder's stats while working there and still know that it will
>>>  not miss any changes. This would perhaps increase overhead in large
>>>  clusters, but could improve performance a lot in clusters where
>>>  several nodes work in the same folder (a mail spool folder, for example).
>> You have a shared mail spool on your nodes? How do you avoid race 
>> conditions on deferred mail?
> Several nodes can deliver mail to the spool folder, and dedicated queue
> runners will pick it up and deliver it to local and/or remote hosts.
> I am not certain what race conditions you are referring to, but locking
> should make sure no more than one queue runner touches a file at a
> time. Am I missing something?

Are you sure your MTA applies locks suitably? I wouldn't bet on it; I 
would expect most of them to assume an unshared spool. Also remember 
that locking is a _major_ performance bottleneck when it comes to 
cluster file systems. Multiple nodes doing locking and r/w in the same 
directory will make performance scale inversely with node count, 
especially with the small I/O you are likely to see on a mail spool.

If there is no file locking, you will likely see non-deterministic 
multiple sending of mail, especially deferred mail. Depending on how 
your MTA generates spool file names, you may also see non-deterministic 
silent clobbering if it doesn't do parent-directory locking on file 
creation/deletion.

If there is locking, you will likely see performance degrade as you add 
more servers, due to lock contention.
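
If you want to see what your MTA actually does, attaching strace to a 
delivery or queue-runner process and watching for locking calls is 
quite revealing (the PID below is obviously just a placeholder):

    # Watch a running delivery/queue-runner process for advisory locks;
    # replace 1234 with the PID of your MTA's delivery process.
    strace -f -e trace=fcntl,flock -p 1234 2>&1 | grep -E 'F_SETLK|flock'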

