[Gluster-devel] Unfair scheduling in unify/AFR

Székelyi Szabolcs cc at avaxio.hu
Tue Nov 20 12:50:35 UTC 2007


Krishna Srinivas wrote:
> In case you have not put io-threads, can you test it with that and see
> how it behaves?

I'm using io-threads on the client side. At the moment only one client
accesses a given storage brick at a time, so I thought io-threads
wouldn't help on the servers. On the client side, however, a read issued
by one thread shouldn't block the whole client (since there can be more
threads), so I loaded io-threads on the client.
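
(If you meant loading io-threads on the server side as well: as far as I
understand, it would be roughly the following on each server, layered
between the storage/posix volume and protocol/server. This is just a
sketch -- "data" stands for whatever volume the servers currently export,
and "data-iot" is a placeholder name:

volume data-iot
  type performance/io-threads
  # number of worker threads on the server
  option thread-count 4
  # "data" is the volume currently exported by the server
  subvolumes data
end-volume

protocol/server would then export data-iot instead of data, or the
clients' remote-subvolume option would have to be changed accordingly.)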

> When you copy, are the source and destination files both on the glusterfs
> mount point?

No. We are testing pure read performance, and only GlusterFS performance,
so I copy sparse files over GlusterFS into /dev/null with `dd bs=1M`.

Currently a single thread reading on its own gets about 540-580 MB/s.
What I would like to see is two threads reading two files from two
servers at the same time, at a rate of at least 540 MB/s *each*.
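
For reference, the test is basically this (the mount point and file names
below are just examples):

  # single reader
  dd if=/mnt/glusterfs/file1 of=/dev/null bs=1M

  # two concurrent readers
  dd if=/mnt/glusterfs/file1 of=/dev/null bs=1M &
  dd if=/mnt/glusterfs/file2 of=/dev/null bs=1M &
  wait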

> Can you mail the client spec file?

Sure, here it is.

### DATA

volume data-it27
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.1
  option remote-subvolume data
end-volume

volume data-it28
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.2
  option remote-subvolume data
end-volume

volume data-it29
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.3
  option remote-subvolume data
end-volume


### NAMESPACE

volume data-ns-it27
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.1
  option remote-subvolume data-ns
end-volume

volume data-ns-it28
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.2
  option remote-subvolume data-ns
end-volume

volume data-ns-it29
  type protocol/client
  option transport-type ib-verbs/client
  option remote-host 10.40.40.3
  option remote-subvolume data-ns
end-volume


### AFR

volume data-afr
     type cluster/afr
     subvolumes data-it29 data-it27 data-it28
end-volume

volume data-ns-afr
     type cluster/afr
     subvolumes data-ns-it27 data-ns-it28 data-ns-it29
end-volume


### UNIFY

volume data-unify
     type cluster/unify
     subvolumes data-afr
     option namespace data-ns-afr
     option scheduler rr
end-volume

volume ds
     type performance/io-threads
     option thread-count 8
     option cache-size 64MB
     subvolumes data-unify
end-volume

volume ds-ra
  type performance/read-ahead
  subvolumes ds
  option page-size 518kB
  option page-count 48
end-volume

Thanks,
--
Szabolcs


> On Nov 20, 2007 1:24 AM, Székelyi Szabolcs <cc at avaxio.hu> wrote:
>> Hi,
>>
>> I use a configuration with 3 servers and one client, with client-side
>> AFR/unify.
>>
>> It looks like the unify and AFR translators (with the new load-balancing
>> code) do unfair scheduling among concurrent threads.
>>
>> I tried to copy two files with two concurrent (i.e. parallel) threads,
>> and one of the threads always gets much more bandwidth than the other.
>> When the threads start to run, only one of them is actually served by
>> the GlusterFS client at a reasonable rate; the other one (almost)
>> starves. Only when the first thread finishes does the other one get
>> served.
>>
>> The order of the threads seems constant over consecutive runs.
>>
>> What's more, if a thread is started while another one is already
>> running, the newly started one can steal performance from the first.
>>
>> Which thread is preferred is determined by the remote server: a thread
>> served by a particular host always gets more bandwidth than one served
>> by another host. This is how a thread started later can steal
>> performance from an earlier one.
>>
>> Doing the same thing with two GlusterFS clients (mounting the same
>> configuration on two different directories) gives absolutely fair
>> scheduling.
>>
>> The trouble is that this way one can't benefit from AFR load-balancing.
>> We would like to exceed the physical disk speed limit by spreading reads
>> over multiple GlusterFS servers, but the reads cannot be spread this
>> way; only one server does the work at any given point in time.
>>
>> Do you have any idea what could be wrong and how to fix it?
>>
>> Thanks,
>> --
>> Szabolcs
