[Gluster-devel] glfs vs. unfsd performance figures
Gordan Bobic
gordan at bobich.net
Sat Jan 9 12:28:36 UTC 2010
Shehjar Tikoo wrote:
>>>>>> The answer to your question is: yes, it will be possible to export
>>>>>> your local file system with knfsd, and glusterfs
>>>>>> distributed-replicated volumes with the Gluster NFS translator, BUT
>>>>>> not in the first release.
>>>>>
>>>>> See comment above. Isn't that all the more reason to double check
>>>>> performance figures before even bothering?
>>>>>
>>>>> In fact, I may have just convinced myself to acquire some iozone
>>>>> performance figures. Will report later.
>>>>
>>>> OK, I couldn't get iozone to report sane results. glfs was reporting
>>>> things in the reasonable ball park I'd expect (between 7MB/s and
>>>> 110MB/s, which is what I'd expect on gigabit ethernet). NFS was
>>>> reporting figures that looked more like memory bandwidth, so I'd
>>>> guess that FS-Cache was taking over. With O_DIRECT and O_SYNC the
>>>> figures were in the 700KB/s range for NFS, which is clearly not sane
>>>> because in actual use the two seem fairly equivalent.
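>>>>
>>>> (For reference, the O_DIRECT/O_SYNC runs were along these lines -
>>>> exact flags and path are from memory, so treat this as a sketch
>>>> rather than the literal command used:)
>>>>
>>>> # sequential write (-i 0) and read (-i 1), 64MB file, 128KB records;
>>>> # -I requests O_DIRECT, -o makes writes O_SYNC; path is illustrative
>>>> iozone -i 0 -i 1 -I -o -s 64m -r 128k -f /mnt/test/iozone.tmp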
>>>>
>>>> So - I did a redneck test instead - dd 64MB of /dev/zero to a file
>>>> on the mounted partition.
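>>>>
>>>> Something like this (mount point illustrative; conv=fsync makes dd
>>>> flush at the end, and caches are dropped before the read):
>>>>
>>>> dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=64 conv=fsync
>>>> sync; echo 3 > /proc/sys/vm/drop_caches
>>>> dd if=/mnt/test/ddfile of=/dev/null bs=1M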
>>>>
>>>> On writes, NFS gets 4.4MB/s, GlusterFS (server side AFR) gets
>>>> 4.6MB/s. Pretty even.
>>>> On reads, GlusterFS gets 117MB/s and NFS gets 119MB/s (on the first
>>>> read after flushing the caches; after that it goes up to 600MB/s).
>>>> The unbuffered readings are in the same ball park, and the small
>>>> difference on reads is roughly what I'd expect considering NFS is
>>>> running over UDP and GLFS over TCP.
>>>>
>>>> So, in conclusion - there is no performance difference between them
>>>> worth speaking of. What, then, is the point in implementing a
>>>> user-space NFS handler in glusterfsd when unfsd seems to do the job
>>>> as well as glusterfsd could reasonably hope to?
>>>
>>> A single dd, which is basically sequential I/O, is something even
>>> an undergrad OS 101 project can optimize for. We, on the other hand,
>>> are aiming higher. We'll be providing much better meta-data
>>> performance, something unfsd sucks at (not without reason; I
>>> appreciate the measures it takes to ensure correctness) due to
>>> the large number of system calls it performs, much better support
>>> for concurrency in order to exploit the proliferating multi-cores,
>>> and much better parallelism for multiple NFS clients where all of
>>> them are hammering away at the server, again something unfsd does
>>> not do.
>>
>> Since you (quite rightly) say that a single sequential I/O isn't a
>> particularly valid real-world test case, I now have some performance
>> figures, and they show a similar equivalence between glfs and
>> unfsd client connections (see tests 8 and 9 below).
>>
>> The testing was done using the following method:
>>
>> make clean
>> # prime the caches for the benefit of the doubt
>> find . -type f -exec cat '{}' \; > /dev/null
>> sync
>> # the machines involved are quad core
>> time make -j8 all
>>
>> 1) pure ext3             6:40   CPU bound
>> 2) ext3                 15:15   rootfs on glfs (no cache), I/O bound
>> 3) ext3+knfsd            7:02   mostly network bound
>> 4) ext3+unfsd           16:04
>> 5) glfs                 61:54   rootfs on glfs (no cache), I/O bound
>> 6) glfs+cache           32:32   rootfs on glfs (no cache), I/O bound
>> 7) glfs+unfsd          278:30
>> 8) glfs+cache+unfsd    189:15
>> 9) glfs+cache+glfs     186:43
>>
>> Notes:
>> - Time is in minutes:seconds
>> - GlusterFS 2.0.9 was used in all cases, on RHEL 5.4, 64-bit
>> - The times are for building the RHEL 5.4 kernel
>> - noatime is used on all mounts
>> - cache means that caching was applied on the server, in the form of
>> writebehind and io-cache translators directly on top of the assembled
>> AFR bricks (a volfile sketch follows after these notes).
>> - All tests except 2, 5, and 6 were done on a Quad Core2 3.2 GHz with
>> 2GB of RAM
>> - Tests 2, 5, and 6 were done on a Phenom X4 2.8GHz with 4GB of RAM.
>> In this instance the figures are reasonably comparable
>> - In tests 2, 5, 6 rootfs (which is where gcc and other binaries are),
>> was on glfs, which caused further slow-down.
>> - In all cases except 1 (where all the files were local), the server
>> was the same PhenomX4 machine with 4GB of RAM. It was paired in AFR to
>> an Atom 330 machine in all cases where glfs was used.
>> - Gigabit network was used in all cases.
>> - The client was always connecting to a single, server assembled AFR
>> volume (so the server was proxying write requests to the slaved Atom
>> 330 machine).
>> - glfs rootfs runs without any performance translators in all cases,
>> and with --disable-direct-io=off
>> - the volume containing /usr/src, where the source code being compiled
>> resides, was always mounted without the direct-io mount parameter
>> mentioned.
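>>
>> For the cached configurations, the server-side volfile was assembled
>> roughly like this (a minimal sketch; volume names, paths and hostnames
>> are illustrative, not the exact config used):
>>
>> volume posix1
>>   type storage/posix
>>   option directory /export/brick        # local ext3-backed brick
>> end-volume
>>
>> volume atom330
>>   type protocol/client                  # the slaved Atom 330 replica
>>   option transport-type tcp
>>   option remote-host atom330
>>   option remote-subvolume brick
>> end-volume
>>
>> volume afr0
>>   type cluster/replicate                # server-side AFR: this server
>>   subvolumes posix1 atom330             # proxies writes to the slave
>>   option read-subvolume posix1          # reads served at local speed
>> end-volume
>>
>> volume wb
>>   type performance/write-behind         # the 'cache' part, stage 1
>>   subvolumes afr0
>> end-volume
>>
>> volume cache
>>   type performance/io-cache             # the 'cache' part, stage 2
>>   subvolumes wb
>> end-volume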
>>
>> Even if we ignore tests 2, 5, and 6, the results are quite concerning:
>> 1) pure ext3             6:40   CPU bound
>> 3) ext3+knfsd            7:02   mostly network bound
>> 4) ext3+unfsd           16:04
>> 7) glfs+unfsd          278:30
>> 8) glfs+cache+unfsd    189:15
>> 9) glfs+cache+glfs     186:43
>>
>> Results 1, 3 and 4 above are pretty much just the baseline for how
>> long the operation takes without any glfs involvement.
>>
>> The main point here is between results 7, 8 and 9:
>> 7) glfs+unfsd          278:30
>> 8) glfs+cache+unfsd    189:15
>> 9) glfs+cache+glfs     186:43
>>
>> Specifically, this bears on the point I was making earlier about glfs
>> vs. unfsd performance. The difference appears to be quite negligible,
>> so I'd dare say that rolling an NFS server into glusterfs will do
>> absolutely nothing for performance.
>
> ..and based on that sweeping generalization, which is itself based on
> this one particular test in one particular deployment: if your
> point is that there is no need for the NFS translator, I cannot help
> but notice that you're completely ignoring all the other points I've
> made in the previous emails.
Not at all - I accepted the one all-important point that on its own
justifies the idea, along with all the other non-performance-related
points you made. I'm just using this to show that the protocol used for
the final client connection doesn't appear to make any difference to
overall performance, in what is arguably a pretty nasty test case
(kernel build) that has deconstructed many a file system performance
claim before.
Since in-process glfs has the same performance in this context as the
external unfsd, I'm curious how rolling the NFS server into the same
process will improve performance. Don't get me wrong, I hope that it
will; I just don't see where the gain is to be made unless there is
something very sub-optimal already going on in the glfs export.
> Then, for a second let's ignore the config, to which I'll come later:
> you're extrapolating about NFS translator performance solely on the
> basis of this comparison between unfsd and glusterfs. I find this
> judgment on the NFS translator, without having tested it, unacceptable.
Again, I agree - but if you are saying that replacing the glfs protocol
with the NFS protocol on the last hop will help performance, then that
would imply that there is an improvement to be made somewhere other than
the simple choice of protocol itself. If that is the case, then I would
like to see glfs sped up by the same amount. I was expecting glfs to
have an advantage in my tests because, for one, there is one fewer
process and one fewer context switch needed.
> But, since you've put in this much-appreciated effort into running
> the tests, a few more points are in order.
>
>> So in bullet points:
>> - unfsd runs at a bit under half the speed of knfsd.
>> - glfs without the writebehind + io-cache translators runs
>> approximately 10x slower than ext3 (when backed by ext3, as in this
>> test, at least).
>> - writebehind + io-cache approximately doubles the performance. This
>> is evident both from tests 5 and 6 and from tests 7 and 8.
>> - With glfs being used for the replicated volume to be exported to
>> clients, the performance is approximately 30x lower than the nearest
>> comparable case, which is ext3+unfsd.
>> - there is no performance difference between unfsd and glfs for the
>> exported volumes.
>>
>
> ext3+unfsd is not the nearest comparable case, or even the baseline.
> Where is the logic in comparing a disk filesystem to a distributed one
> in terms of performance? The ext3+knfsd and ext3+unfsd cases are
> relevant only to the point of calibrating the test and showing the
> upper bound on performance, but not as data points against which you
> can make comparisons and say things like "approximately 30x lower".
It was only meant for calibrating the test and showing the upper bound.
I thought I'd made that clear, but I guess it wasn't explicit enough.
But as far as distributed fs performance goes, now that you mention it,
maybe I should do a DRBD+GFS comparison instead? Tempting - I might get
around to that in the next week or so - but glfs would still have the
performance disadvantage of being user-space rather than kernel-space.
There is no real way to get away from that, so the test would still not
be entirely fair or directly comparable, and again, it would only be
usable as another calibration/reference point.
> Then there is the very peculiar replication config. If I understand
> it correctly, this is what it looks like for 8 and 9:
>
> Clnt(Gl or NFS)<-Lnk(a)->(unfsd+gl client or gl srv)<-Lnk(b)->(gl srv)
>
> The primary benefit of GlusterFS is that the replication and
> distribution logic sits at the client, allowing the GlusterFS client
> to talk directly to the servers/replicas.
Actually, that would almost certainly unfairly penalize the performance,
because it would end up shifting the bottleneck directly to the client
interconnect. I have found that shifting the replication brick assembly
to the server end makes a positive difference to performance, and the
difference grows as the number of replicas goes up. Since the
interconnects are full duplex, the server can propagate the write to the
slave(s) using a channel that isn't being saturated at the same time by
the client sending the write operation data to the server.
Plus, in this test (and in the real world) there are also memory
contention issues - where can we better afford the memory for caching,
on all the clients or on one server? And then there is the matter of
comparing like with like - the main thing I wanted to compare is the
difference between nfs and glfs on the final hop to the client, all
other things being equal. Pretty much all the other results might as
well be discarded as not directly relevant.
> With the need to support NFS - keeping in mind that when I say NFS, I
> mean NFSv3, and v4 to an extent - we need an extra layer that performs
> the transformations between NFS ops/semantics and GlusterFS ops.
> Doing so generally requires a new server instance that on one side of
> the network (Lnk(a)) talks NFS, like (unfsd+gl client) above, and
> talks GlusterFS on the other side, in order to reach the GlusterFS
> backends. The point being that this extra network link, i.e. Lnk(b),
> is unavoidable in a NFS+GlusterFS deployment.
>
> However, the second link is completely avoidable for the case where
> you need two replicas with GlusterFS clients and GlusterFS servers,
> because, again, GlusterFS clients can talk to each of the replicas
> directly without an intermediate server.
Sure - but as I said above, that shifts the write bottleneck toward the
client, because the client ends up having to replicate all the writes to
multiple servers; as the number of servers goes up, the write
performance scales inversely linearly, and that's BAD.
It's the same argument as ring vs. star replication in databases. Star
replication writes scale O(n) while ring replication writes scale O(1).
What my setup essentially emulates (for 2 servers) is the equivalent of
a ring, but this is only possible with 2 servers (with 2 servers, ring
and star are the same thing). Shifting replication to the client
effectively turns it into a star, because the client in this sense
becomes a server.
With server-side replication, at the very least we go from client-side
replication's O(n) scaling to O(n-1), because one of the copies goes to
local disk, which isn't subject to interconnect contention. This won't
make much difference on huge setups (as n will be dominant), but on
smaller setups it's a worthwhile difference.
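To put rough, illustrative numbers on it (assuming ~115MB/s usable per
direction on gigabit, in line with the dd read figures above):
- client-side (star) replication with n replicas: the client uplink
carries n copies of every write, so the write ceiling is roughly
115/n MB/s (~57MB/s for n=2, ~38MB/s for n=3).
- server-side replication with n replicas: the client uplink carries one
copy (~115MB/s ceiling), one copy goes to the primary's local disk, and
the remaining n-1 copies go out on the primary's otherwise idle
transmit path.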
> In effect, what this particular config does is force glusterfs
> performance down to that of a NFS+GlusterFS deployment,
> by forcing GlusterFS to talk to at least one replica through an extra
> network link.
I'm not sure I follow. The client connects to only one of the two
servers (the primary). Primary server is doing the AFR brick assembly
and replicating the writes to the secondary server. Both bricks have
writebehind caching, which should pretty much nullify the performance
effect of writing to the slaved server (right up to the point where the
primary server runs out of CPU, which is far from occurring). The
read-subvolume is set to the primary server, so reads should always be
available at local speeds.
For the sake of the argument, I can shunt the primary<->secondary server
interconnect to a dedicated interface, but I don't think this will make
any difference since the bottleneck doesn't appear to be in the
interconnect.
> In the traditional and most used configurations of replicate,
> gluster clients do talk directly to the glusterfs backends.
> It is obvious that on those, GlusterFS replicated performance would be
> better than NFS+GlusterFS replicated performance. The question that
> then arises is: is this even a relevant comparison, let alone a fair
> one? It is clearly not, on both counts.
See above. I don't think I agree with your reasoning on what would give
better performance. But I'll make a mental note to test the client-side
replication performance just to make sure.
> That is why I believe this particular comparison of glusterfs
> and unfsd is not useful for making the kind of conclusion you're
> attempting regarding NFS in general, and the NFS translator
> in particular.
I'll reserve my answer to that part until I have tested the performance
with the replication pushed all the way to the client side. The only
gain I can see it benefiting from is the shifting of io-cache to the
client itself, but see my point earlier about caching memory usage (one
server vs. all clients). If you shift everything toward the client, the
extreme case becomes the one where every client is a server with a
complete local copy; but as I've already pointed out, the write
performance scales inversely, so even though that use case would
increase the read performance, the write performance would suffer badly.
> Now, having looked at and hacked on the unfsd source, please trust me
> when I say that the NFS translator will perform better than
> unfsd when used with glusterfs. The inflection point between unfsd
> and NFS xlator performance might not come in the first release, but
> rest assured that incremental changes will start showing differences.
I'm looking forward to seeing that. :)
Gordan