[Gluster-devel] write-behind tuning
Jordan Mendler
jmendler at ucla.edu
Wed May 14 07:15:35 UTC 2008
Hi all,
I am in the process of implementing write-behind and had some questions.
1) I've been told aggregate-size should be between 0 and 4MB. What is the
downside to making it large? In our case (a backup server) I would think the
bigger the better, since we are doing lots of consecutive/parallel rsyncs of a
mix of tons of small files and some very large files. The only downside I can
see is that small transfers would not be distributed as evenly, since a large
aggregated write would go to only one brick instead of half of the write going
to each brick. Perhaps someone can clarify.
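For concreteness, the variant I am considering would simply raise the
aggregate-size in the write-behind volume from my config below to the top of
the suggested range (the 4MB value is my own guess at a reasonable upper
bound, not something I have tested):

volume write-back
type performance/write-behind
option aggregate-size 4MB   # up from 1MB; upper end of the suggested 0-4MB range
option flush-behind on
subvolumes bricks
end-volume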
2) What does flush-behind do? What is the advantage of having it on, and what
is the advantage of leaving it off?
3) Write-behind on the client aggregates small writes into larger ones. Is
there any purpose to also doing it on the server side? If so, how is that
helpful?
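To make sure I am asking about the right thing: I assume "server side" would
mean layering write-behind over the storage volume in the server spec, roughly
like the sketch below (the volume names and export path are just placeholders,
since my real server config simply exports a volume named 'brick'):

volume posix
type storage/posix
option directory /data/export   # placeholder path, not my actual export
end-volume

volume brick
type performance/write-behind   # write-behind on the server, over the posix store
option aggregate-size 1MB
subvolumes posix
end-volume

(protocol/server and io-threads omitted for brevity)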
4) Should write-behind be done on a brick-by-brick basis on the client, or is
it fine to do it after the unify? (It seems like it would be fine, since this
would consolidate small writes before sending them to the scheduler.)
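For reference, the per-brick alternative I am asking about would look
something like this on the client (wb1/wb2 are made-up names; unify would then
take them as its subvolumes instead of brick-io1/brick-io2):

volume wb1
type performance/write-behind
option aggregate-size 1MB
subvolumes brick-io1
end-volume

volume wb2
type performance/write-behind
option aggregate-size 1MB
subvolumes brick-io2
end-volume

whereas my config below puts a single write-behind volume on top of the unify
volume instead.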
Hardware-wise, we currently have two 16x1TB hardware RAID6 servers (each is
8-core with 8GB of RAM). Each acts as both a server and a unify client. The
underlying filesystem is currently XFS on Linux, ~13TB each. The interconnect
is GigE, and eventually we will have more external clients, though for now we
are just using the servers as clients. My current client config is below.
Any other suggestions are also appreciated.
Thanks, Jordan
----
### Client config
### Import storage volumes and thread for performance
volume brick1
type protocol/client
option transport-type tcp/client
option remote-host storage-0-1
option remote-subvolume brick
end-volume
volume brick2
type protocol/client
option transport-type tcp/client
option remote-host storage-0-2
option remote-subvolume brick
end-volume
volume brick-io1
type performance/io-threads
option thread-count 8
option cache-size 4096MB
subvolumes brick1
end-volume
volume brick-io2
type performance/io-threads
option thread-count 8
option cache-size 4096MB
subvolumes brick2
end-volume
### Import namespace volumes and mirror them for redundancy
volume brick-ns1
type protocol/client
option transport-type tcp/client
option remote-host storage-0-1
option remote-subvolume brick-ns
end-volume
volume brick-ns2
type protocol/client
option transport-type tcp/client
option remote-host storage-0-2
option remote-subvolume brick-ns
end-volume
volume brick-ns
type cluster/afr
subvolumes brick-ns1 brick-ns2
end-volume
### Unify bricks into a single logical volume, and use ALU for scheduling
volume bricks
type cluster/unify
subvolumes brick-io1 brick-io2
option namespace brick-ns
# Use ALU scheduling algorithm
option scheduler alu # use the ALU scheduler
option alu.limits.min-free-disk 5%    # Don't create files on a volume with less than 5% free disk space
option alu.limits.max-open-files 10000    # Don't create files on a volume with more than 10000 files open
# When deciding where to place a file, first look at the disk-usage, then at
# read-usage, write-usage, open files, and finally the disk-speed-usage.
option alu.order disk-usage:write-usage:read-usage:open-files-usage:disk-speed-usage
option alu.disk-usage.entry-threshold 100GB    # Kick in if the discrepancy in disk-usage between volumes is more than 100GB
option alu.disk-usage.exit-threshold 50GB    # Don't stop writing to the least-used volume until the discrepancy is 50GB
option alu.open-files-usage.entry-threshold 1024    # Kick in if the discrepancy in open files is 1024
option alu.open-files-usage.exit-threshold 32    # Don't stop until 992 files have been written to the least-used volume
option alu.read-usage.entry-threshold 20%    # Kick in when the read-usage discrepancy is 20%
option alu.read-usage.exit-threshold 4%    # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
option alu.write-usage.entry-threshold 20%    # Kick in when the write-usage discrepancy is 20%
option alu.write-usage.exit-threshold 4%    # Don't stop until the discrepancy has been reduced to 16%
option alu.stat-refresh.interval 10sec    # Refresh the statistics used for decision-making every 10 seconds
# option alu.stat-refresh.num-file-create 10    # Refresh the statistics used for decision-making after creating 10 files
end-volume
volume write-back
type performance/write-behind
option aggregate-size 1MB
option flush-behind on # default is 'off'
subvolumes bricks
end-volume