[Gluster-users] advice on optimal configuration

Wed Mar 10 07:39:32 UTC 2010

Barry,

Just to clarify: the application that caches files on GlusterFS would do so through regular mount points and not copy files directly off the backend servers, right? If so, that is fine.

Since you mentioned such a small partition, my guess is that you are using SSDs on the 128 cache nodes. Is that correct?

Since you can re-generate or retrieve files from the upstream file server seamlessly, I would recommend not using replication and instead configuring a 2X cache with a distribute-only configuration (all 128 x 80GB partitions hold unique data, roughly 10TB, instead of about 5TB with 2 replicas). If there are enough files and the application caches the files that are in demand, they will spread out nicely over the 128 nodes and give you a good load-balancing effect.
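
A rough, untested sketch of what the distribute-only client side could look like, reusing the protocol/client volume names from your config. Only four of the 128 client volumes are listed here, and the performance translators would stack on top of "distribute" exactly as they do now:

```
# Sketch only: distribute sits directly on the per-node client volumes,
# with no cluster/replicate layer in between.
volume distribute
    type cluster/distribute
    # list all 128 protocol/client volumes here; four shown for brevity
    subvolumes c001b17-1 c002b17-1 c003b17-1 c004b17-1
end-volume
```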

With replication, say with two replicas as you mentioned, each write goes to both replica servers, while reads of a given file go to one preferred server. There is no per-file load balancing. For example, if 100 clients mount a volume that is replicated across 2 servers and all of them read the same file, every read is served by the same server rather than being balanced across the 2 servers. This can be fixed by setting a preferred read server on the client, but it has to be set on each client. Also, it works only for a replica count of 2, because it does not accept a preference list of servers. With a replica count of 3, for instance, you could not have one client prefer s1, s2, s3, another prefer s2, s3, s1, the next prefer s3, s1, s2, and so on.

At some point we intend to automate some of that, but since most users use a replica count of only 2, it can be managed, apart from the work required to set preferences on each client. Again, if lots of files are being accessed, the load evens out, so this becomes less of a concern and you still get a load-balanced effect.

So in summary: reads of the same file are not balanced unless each client sets a preference. However, when many files are being accessed, it evens out and gives a load-balanced effect.
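
As an illustration of the per-client preference, assuming the setting meant here is the read-subvolume option of cluster/replicate, one client's volfile could carry something like the fragment below (untested; volume names taken from your config). A second client would name c002b17-1 instead, so that reads of the same file are split between the two servers:

```
# Sketch only: this client prefers c001b17-1 for reads of this replica pair.
volume replicate001-17
    type cluster/replicate
    # assumption: read-subvolume is the preferred-read-server setting
    option read-subvolume c001b17-1
    subvolumes c001b17-1 c002b17-1
end-volume
```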

Since you are only going to write each file once, the replicated write does not hurt performance much (a replicated write returns only after the write has completed on both replica locations).

Since you are still in the testing phase, here is what you can do: create one backend FS on each node. Create two directories in it, one called distribute and the other called something like replica<volume><replica#>, so that you can group it with a similar directory on another node for replication.

The backend subvolumes exported from the servers can be directories, so you can set up a distribute GlusterFS volume as well as the replicated GlusterFS volumes, mount both on the clients, and test both. Once you have decided to use one of them, just umount the other one, delete its directory from the backend FS, and that's it.
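
A minimal, untested sketch of a server volfile that exports two directories from the same backend FS. The /data/gluster paths, the replica001-1 directory name, and the volume names are placeholders made up for illustration, so substitute your own:

```
# Sketch only: two directories on one backend FS, exported as two bricks.
volume posix-distribute
    type storage/posix
    option directory /data/gluster/distribute      # placeholder path
end-volume

volume brick-distribute
    type features/locks
    subvolumes posix-distribute
end-volume

volume posix-replica
    type storage/posix
    option directory /data/gluster/replica001-1    # placeholder path
end-volume

volume brick-replica
    type features/locks
    subvolumes posix-replica
end-volume

volume server
    type protocol/server
    option transport-type tcp
    # wide-open access for testing only; tighten for real use
    option auth.addr.brick-distribute.allow *
    option auth.addr.brick-replica.allow *
    subvolumes brick-distribute brick-replica
end-volume
```

On the clients, the protocol/client volumes for the distribute test would then use "option remote-subvolume brick-distribute", and those for the replicated test "option remote-subvolume brick-replica".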

If you have SSDs as I assumed, you would actually be decreasing wear per cached data (if there were such a term :-) ) by not using replication.

Let me know if you have any questions on this.

Regards,
Tejas.

----- Original Message -----
From: "Barry Robison" <barry.robison at drdstudios.com>
To: gluster-users at gluster.org
Sent: Wednesday, March 10, 2010 5:28:24 AM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: [Gluster-users] advice on optimal configuration

Hello,

I have 128 physically identical blades, with 1GbE uplink per blade,
and 10GbE between chassis ( 32 blades per chassis ). Each node will
have an 80GB gluster partition. Dual quad-core Intel Xeons, 24GB RAM.

The goal is to use gluster as a cache for files used by render
applications. All files in gluster could be re-generated or retrieved
from the upstream file server.

My first volume config attempt is 64 replicated volumes with partner
pairs on different chassis.

Is replicating a performance hit? Do reads balance between replication nodes?

Would NUFA make more sense for this set-up?

Here is my config, any advice appreciated.

Thank you,
-Barry

>>>>
volume c001b17-1
    type protocol/client
    option transport-type tcp
    option remote-host c001b17
    option transport.socket.nodelay on
    option transport.remote-port 6996
    option remote-subvolume brick1
    option ping-timeout 5
end-volume
.
<snip>
.
volume c004b48-1
    type protocol/client
    option transport-type tcp
    option remote-host c004b48
    option transport.socket.nodelay on
    option transport.remote-port 6996
    option remote-subvolume brick1
    option ping-timeout 5
end-volume

volume replicate001-17
    type cluster/replicate
    subvolumes c001b17-1 c002b17-1
end-volume
.
<snip>
.
volume replicate001-48
    type cluster/replicate
    subvolumes c001b48-1 c002b48-1
end-volume

volume replicate003-17
    type cluster/replicate
    subvolumes c003b17-1 c004b17-1
end-volume
.
<snip>
.
volume replicate003-48
    type cluster/replicate
    subvolumes c003b48-1 c004b48-1
end-volume

volume distribute
    type cluster/distribute
    subvolumes replicate001-17 replicate001-18 replicate001-19
        replicate001-20 replicate001-21 replicate001-22 replicate001-23
        replicate001-24 replicate001-25 replicate001-26 replicate001-27
        replicate001-28 replicate001-29 replicate001-30 replicate001-31
        replicate001-32 replicate001-33 replicate001-34 replicate001-35
        replicate001-36 replicate001-37 replicate001-38 replicate001-39
        replicate001-40 replicate001-41 replicate001-42 replicate001-43
        replicate001-44 replicate001-45 replicate001-46 replicate001-47
        replicate001-48 replicate003-17 replicate003-18 replicate003-19
        replicate003-20 replicate003-21 replicate003-22 replicate003-23
        replicate003-24 replicate003-25 replicate003-26 replicate003-27
        replicate003-28 replicate003-29 replicate003-30 replicate003-31
        replicate003-32 replicate003-33 replicate003-34 replicate003-35
        replicate003-36 replicate003-37 replicate003-38 replicate003-39
        replicate003-40 replicate003-41 replicate003-42 replicate003-43
        replicate003-44 replicate003-45 replicate003-46 replicate003-47
        replicate003-48
end-volume

volume writebehind
    type performance/write-behind
    option cache-size 64MB
    option flush-behind on
    subvolumes distribute
end-volume

volume readahead
    type performance/read-ahead
    option page-count 4
    subvolumes writebehind
end-volume

volume iocache
    type performance/io-cache
    option cache-size 128MB
    option cache-timeout 10
    subvolumes readahead
end-volume

volume quickread
    type performance/quick-read
    option cache-timeout 1
    option max-file-size 64kB
    subvolumes iocache
end-volume

volume statprefetch
    type performance/stat-prefetch
    subvolumes quickread
end-volume