[Gluster-devel] I/O performance

Xavi Hernandez xhernandez at redhat.com
Wed Feb 6 06:57:41 UTC 2019


On Wed, Feb 6, 2019 at 7:00 AM Poornima Gurusiddaiah <pgurusid at redhat.com>
wrote:

>
>
>> On Tue, Feb 5, 2019, 10:53 PM Xavi Hernandez <xhernandez at redhat.com> wrote:
>
>> On Fri, Feb 1, 2019 at 1:51 PM Xavi Hernandez <xhernandez at redhat.com>
>> wrote:
>>
>>> On Fri, Feb 1, 2019 at 1:25 PM Poornima Gurusiddaiah <
>>> pgurusid at redhat.com> wrote:
>>>
>>>> Can the threads be categorised to do certain kinds of fops?
>>>>
>>>
>>> Could be, but creating multiple thread groups for different tasks is
>>> generally bad because many times you end up with lots of idle threads which
>>> waste resources and could increase contention. I think we should only
>>> differentiate threads if it's absolutely necessary.
>>>
>>>
>>>> Read/write fops would be affinitised to a certain set of threads, and the
>>>> metadata fops to another set. So we limit the read/write threads and not
>>>> the metadata threads? Also, if AIO is enabled in the backend, the threads
>>>> will not be blocked on disk I/O, right?
>>>>
>>>
>>> If we don't block the thread but we don't prevent more requests from going
>>> to the disk, then we'll probably have the same problem. Anyway, I'll try to
>>> run some tests with AIO to see if anything changes.
>>>
>>
>> I've run some simple tests with AIO enabled and the results are not good. A
>> simple dd takes >25% more time, and multiple parallel dd runs take 35% more
>> time to complete.
>>
>
>
> Thank you. That is strange! I had a few questions: what tests are you
> running to measure io-threads performance (not particularly AIO)? Is it dd
> from multiple clients?
>

Yes, it's a bit strange. What I see is that many threads from the thread
pool are active but using very little CPU. I also see an AIO thread for
each brick, but its CPU usage is not high either. Wait time is always 0 (I
think this is a side effect of AIO activity). However, system load grows
very high: I've seen around 50, while on the normal test without AIO it
stays around 20-25.
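
For reference, this is roughly the submit/reap pattern that Linux AIO
(libaio) gives the posix backend when AIO is enabled. It's only an
illustrative sketch (not GlusterFS code; the file name and sizes are made
up), but it shows why the submitting side accumulates no wait time while
requests can still pile up at the device: io_submit() returns immediately
and a separate thread reaps completions, much like the per-brick AIO thread
observed above.

/* build: gcc aio_sketch.c -laio   -- illustrative only, not gluster code */
#define _GNU_SOURCE                             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    int ret = io_setup(128, &ctx);              /* kernel AIO context */
    if (ret < 0) {
        fprintf(stderr, "io_setup: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("/tmp/aio-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 1 << 20))    /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 'x', 1 << 20);

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 1 << 20, 0);   /* 1 MiB write at offset 0 */

    ret = io_submit(ctx, 1, cbs);               /* returns without blocking */
    if (ret != 1) {
        fprintf(stderr, "io_submit: %s\n", strerror(-ret));
        return 1;
    }

    /* A dedicated thread would normally sit here reaping completions. */
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);
    printf("write completed: %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}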

Right now I'm running the tests on a single machine (no real network
communication) using an NVMe disk as storage. I use a single mount point.
The tests I'm running are these:

   - Single dd, 128 GiB, blocks of 1 MiB
   - 16 parallel dd, 8 GiB per dd, blocks of 1 MiB
   - fio in sequential write mode, direct I/O, blocks of 128k, 16 threads,
   8 GiB per file
   - fio in sequential read mode, direct I/O, blocks of 128k, 16 threads,
   8 GiB per file
   - fio in random write mode, direct I/O, blocks of 128k, 16 threads,
   8 GiB per file
   - fio in random read mode, direct I/O, blocks of 128k, 16 threads,
   8 GiB per file
   - smallfile create, 16 threads, 256 files per thread, 32 MiB per file
   (with one brick down, for the following test)
   - self-heal of an entire brick (from the previous smallfile test)
   - pgbench init phase with scale 100

I run all these tests for a replica 3 volume and a disperse 4+2 volume.

Xavi


> Regards,
> Poornima
>
>
>> Xavi
>>
>>
>>>> All this is based on the assumption that a large number of parallel
>>>> reads/writes makes disk performance bad, but a large number of dentry and
>>>> metadata ops does not. Is that true?
>>>>
>>>
>>> It depends. If metadata is not cached, it's as bad as a read or write
>>> since it requires a disk access (a clear example of this is the bad
>>> performance of 'ls' on a cold cache, which is basically metadata reads). In
>>> fact, cached data reads are also very fast, and data writes could go to the
>>> cache and be written back later in the background, so I think the important
>>> point is whether things are cached or not, rather than whether they are
>>> data or metadata. Since we don't have this information from the user side,
>>> it's hard to tell what's better. My opinion is that we shouldn't
>>> differentiate data and metadata requests. If metadata requests happen to be
>>> faster, then that thread will be able to handle other requests immediately,
>>> which seems good enough.
>>>
>>> However, there's one thing that I would do: I would differentiate reads
>>> (data or metadata) from writes. Writes normally come from cached
>>> information that is flushed to disk at some point, so they happen in the
>>> background. But reads tend to be in the foreground, meaning that someone
>>> (a user or application) is waiting for them. So I would give preference to
>>> reads over writes. To do so effectively, we need to not saturate the
>>> backend, otherwise when we need to send a read it will still have to wait
>>> for all pending requests to complete. If the disks are not saturated, we
>>> can have the answer to the read quite fast, and then continue processing
>>> the remaining writes.
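
A minimal sketch of that idea in plain C (not the io-threads xlator; the
queue layout, the in-flight cap and the timings are invented for
illustration): workers always drain foreground reads first and only pick up
background writes while the backend still has spare capacity, so a read
never has to wait behind a full queue of flushes.

/* build: gcc prio_pool_sketch.c -lpthread   -- illustrative only */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_INFLIGHT_WRITES 8          /* arbitrary cap for the sketch */

typedef struct req {
    int is_read;
    int id;
    struct req *next;
} req_t;

static req_t *read_q, *write_q;        /* two FIFOs instead of one */
static int inflight_writes;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void push(req_t **q, req_t *r) /* append at the tail */
{
    r->next = NULL;
    while (*q)
        q = &(*q)->next;
    *q = r;
}

static req_t *pop(req_t **q)
{
    req_t *r = *q;
    if (r)
        *q = r->next;
    return r;
}

static void submit(int is_read, int id)
{
    req_t *r = malloc(sizeof(*r));
    r->is_read = is_read;
    r->id = id;
    pthread_mutex_lock(&lock);
    push(is_read ? &read_q : &write_q, r);
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        req_t *r;
        pthread_mutex_lock(&lock);
        /* Reads first; writes only while the backend has spare capacity. */
        while (!(r = pop(&read_q)) &&
               !(inflight_writes < MAX_INFLIGHT_WRITES && (r = pop(&write_q))))
            pthread_cond_wait(&cond, &lock);
        if (!r->is_read)
            inflight_writes++;
        pthread_mutex_unlock(&lock);

        usleep(r->is_read ? 1000 : 5000);   /* stand-in for the real fop */
        printf("%-5s %d done\n", r->is_read ? "read" : "write", r->id);

        pthread_mutex_lock(&lock);
        if (!r->is_read) {
            inflight_writes--;
            pthread_cond_broadcast(&cond);  /* a write slot freed up */
        }
        pthread_mutex_unlock(&lock);
        free(r);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    for (int i = 0; i < 32; i++)
        submit(0, i);                  /* a burst of background writes... */
    submit(1, 100);                    /* ...and one foreground read */

    sleep(1);                          /* let the sketch run, then exit */
    return 0;
}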
>>>
>>> Anyway, I may be wrong, since all these things depend on too many
>>> factors. I haven't done any specific tests about this; it's more of a
>>> brainstorming exercise. As soon as I can, I would like to experiment with
>>> this and get some empirical data.
>>>
>>> Xavi
>>>
>>>
>>>> Thanks,
>>>> Poornima
>>>>
>>>>
>>>> On Fri, Feb 1, 2019, 5:34 PM Emmanuel Dreyfus <manu at netbsd.org> wrote:
>>>>
>>>>> On Thu, Jan 31, 2019 at 10:53:48PM -0800, Vijay Bellur wrote:
>>>>> > Perhaps we could throttle both aspects - number of I/O requests per
>>>>> > disk
>>>>>
>>>>> While there, it would be nice to detect and report a disk with lower
>>>>> performance than its peers: that happens sometimes when a disk is dying,
>>>>> and the last time I was hit by that performance problem, I had a hard
>>>>> time finding the culprit.
>>>>>
>>>>> --
>>>>> Emmanuel Dreyfus
>>>>> manu at netbsd.org
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>
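
On Vijay's throttling idea and Emmanuel's slow-disk detection above, a rough
sketch of how both could fit together (illustrative C only, with made-up
names, caps and thresholds; nothing like this exists in Gluster today): cap
the requests in flight per disk with a semaphore and keep a smoothed per-disk
latency, so a disk that drifts well above its peers can be reported before it
fails outright.

/* build: gcc disk_throttle_sketch.c -lpthread   -- illustrative only */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

#define NDISKS       4
#define MAX_INFLIGHT 16                 /* arbitrary per-disk cap */
#define SLOW_FACTOR  3.0                /* flag disks 3x slower than the mean */

struct disk {
    sem_t slots;                        /* throttles requests in flight */
    double avg_us;                      /* smoothed request latency */
    pthread_mutex_t lock;
};

static struct disk disks[NDISKS];

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Wrap every backend request with this pair of calls. */
static double request_begin(int d)
{
    sem_wait(&disks[d].slots);          /* blocks once the disk is saturated */
    return now_us();
}

static void request_end(int d, double start)
{
    double lat = now_us() - start;
    pthread_mutex_lock(&disks[d].lock);
    disks[d].avg_us = disks[d].avg_us == 0.0
                          ? lat
                          : 0.9 * disks[d].avg_us + 0.1 * lat;  /* EWMA */
    pthread_mutex_unlock(&disks[d].lock);
    sem_post(&disks[d].slots);
}

/* Run periodically: compare each disk against the mean of all disks. */
static void report_slow_disks(void)
{
    double sum = 0.0;
    for (int i = 0; i < NDISKS; i++)
        sum += disks[i].avg_us;
    double mean = sum / NDISKS;
    for (int i = 0; i < NDISKS; i++)
        if (mean > 0.0 && disks[i].avg_us > SLOW_FACTOR * mean)
            fprintf(stderr, "disk %d looks unhealthy: %.0f us vs %.0f us mean\n",
                    i, disks[i].avg_us, mean);
}

int main(void)
{
    for (int i = 0; i < NDISKS; i++) {
        sem_init(&disks[i].slots, 0, MAX_INFLIGHT);
        pthread_mutex_init(&disks[i].lock, NULL);
    }

    /* Example: one request against disk 0. */
    double t = request_begin(0);
    /* ... the actual read/write against the brick would go here ... */
    request_end(0, t);

    report_slow_disks();
    return 0;
}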