[Gluster-devel] Re: [Gluster-users] I/O fair share to avoid I/O bottlenecks on small clusters
gordan at bobich.net
Mon Feb 1 15:14:05 UTC 2010
Jeff Darcy wrote:
> On 01/31/2010 09:06 AM, Ran wrote:
>> You guys are talking about network I/O; I'm talking about the gluster server disk I/O.
>> The idea to shape the traffic does make sense, since the virt machine
>> servers do use the network to get to the disks (gluster),
>> but what about if there are, say, 5 KVM servers (with VPSes) all on
>> gluster - what do you do then? It's not quite fair share, since every
>> server has its own fair share and doesn't see the others.
>> Also there are other applications that use gluster, like mail etc.,
>> and I see that gluster I/O is very high very often, causing the whole
>> storage not to work.
>> It's very disturbing.
> You bring up a good set of points. Some of these problems can be
> addressed at the hypervisor (i.e. GlusterFS client) level, some can be
> addressed by GlusterFS itself, and some can be addressed only at the
> level of the local-filesystem or block-device level on the GlusterFS
> servers.
That sentence doesn't really parse for me. Part of the problem is that
Ran didn't really specify what his storage setup is (DAS in the host or
a SAN), and whether "uses up all disk I/O" is referring to it using up
all the available disk I/O on just the local virtualization host (DAS)
or whether the access pattern from one server is eating all the disk I/O
for all the other servers connected to the SAN. Obviously, one is more
pathological than the other, but without knowing the details it is
impossible to point the finger at gluster when the problem could be more
deeply rooted (e.g. a mis-optimization of the RAID array). Optimizing
file systems is a relatively complex thing and a lot of the conventional
wisdom is just plain wrong at times.
Here's an article I wrote on the subject a while back:
I'm not sure how much of this is applicable to the specific case being
discussed but I cannot help but wonder just how many (if any at all)
"enterprise grade" storage solutions take all of what is mentioned there
into account. In my experience the difference in I/O throughput can be
quite staggering, especially for random I/O.
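Before pointing the finger at gluster it is worth at least confirming
which device is actually saturated. A minimal sketch, assuming a Linux
host with the standard /proc layout (the device-name pattern below is an
assumption - adjust it for your disks):

```shell
# Which device is eating all the I/O time?
# /proc/diskstats fields: $3=device name, $4=reads completed,
# $8=writes completed, $13=milliseconds spent doing I/O.
awk '$3 ~ /^(sd|hd|vd)[a-z]$/ {
    printf "%-6s reads=%-10s writes=%-10s busy_ms=%s\n", $3, $4, $8, $13
}' /proc/diskstats
```

Sampling this twice a few seconds apart and diffing busy_ms gives a
rough utilisation figure; iostat -x (from the sysstat package) does the
same with less effort, if it is installed.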
> Unfortunately, I/O traffic shaping is still in its infancy
> compared to what's available for networking - or perhaps even "infancy"
> is too generous. As far as the I/O stack is concerned, all of the
> traffic is coming from the glusterfsd process(es) without
> differentiation, so even if the functionality to apportion I/O amongst
> tasks existed it wouldn't be usable without more information. Maybe
> some day...
I don't think this would even be useful. It sounds like a request for
more finely-grained (sub-process level!) control over disk I/O
prioritisation, without a clearly presented case that the current
functionality (ionice) is insufficient.
If you are running a glfs server in a guest VM, and that VM is consuming
all of the disk I/O available to the host, then the guest VM container
process (qemu for qemu or KVM, vmx for vmware, etc.) can be ionice-d to
lower its priority and give the other VMs more share of the disk I/O. I
haven't heard an argument yet explaining why that is not sufficient in
this case.
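Concretely, something like the following on the host - a sketch, with
the guest name "myguest" and the qemu process pattern as assumptions
(ionice is from util-linux and takes effect with the CFQ I/O scheduler):

```shell
# Hypothetical guest name - substitute your own VM's process.
PID=$(pgrep -f 'qemu.*myguest' | head -n 1)
if [ -n "$PID" ]; then
    # Best-effort class (-c 2), lowest priority within it (-n 7):
    ionice -c 2 -n 7 -p "$PID"
    # Or the idle class (-c 3): the guest only gets disk time
    # when no other process wants it:
    # ionice -c 3 -p "$PID"
fi
```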
> What you can do now at the GlusterFS level, though, is make sure that
> traffic is distributed across many servers and possibly across many
> volumes per server to take advantage of multiple physical disks and/or
> interconnects for one server. That way, a single VM will only use a
> small subset of the servers/volumes and will not starve other clients
> that are using different servers/volumes (except for network bottlenecks
> which are a separate issue). That's what the "distribute" translator is
> for, and it can be combined with replicate or stripe to provide those
> functions as well. Perhaps it would be useful to create and publish
> some up-to-date recipes for these sorts of combinations.
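For reference, a recipe of that sort is just a client-side volfile that
layers distribute over several protocol/client subvolumes. A minimal
sketch (the host names and brick name here are placeholders):

```
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host server1
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host server2
  option remote-subvolume brick
end-volume

# Hash-distribute files across both servers, so one busy VM
# image only lands on (and can only saturate) one of them.
volume dist
  type cluster/distribute
  subvolumes remote1 remote2
end-volume
```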
Hold on, you seem to be talking about something else here. You're
talking about clients not distributing their requests evenly across
servers. Is that really what the original problem was about? My
understanding of the original post was that a glfs server VM (KVM) was
consuming more than its fair share of disk I/O capacity, and that
there was a need to throttle it - which can be done by applying ionice
to the qemu container process.
Given that this has been pretty much ignored, I'm guessing that I'm
missing the point and that my understanding of the problem being
experienced is in some way incorrect. So can we have some clarification
on it, with an explanation of why ionice-ing the qemu process isn't
applicable? What other feature is required, and why exactly would it be
needed?