[Gluster-users] Gluster very poor performance when copying small files (1x (2+1) = 3, SSD)

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Mar 20 07:23:22 UTC 2018


On Tue, Mar 20, 2018 at 9:45 AM, Sam McLeod <mailinglists at smcleod.net>
wrote:

> Excellent description, thank you.
>
> With performance.write-behind-trickling-writes ON (default):
>
> ## 4k randwrite
>

> # fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
> --filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
> test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=32
> fio-3.1
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=17.3MiB/s][r=0,w=4422 IOPS][eta
> 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=42701: Tue Mar 20 15:05:23 2018
>   write: *IOPS=4443*, *BW=17.4MiB/s* (18.2MB/s)(256MiB/14748msec)
>    bw (  KiB/s): min=16384, max=19184, per=99.92%, avg=17760.45,
> stdev=602.48, samples=29
>    iops        : min= 4096, max= 4796, avg=4440.07, stdev=150.66,
> samples=29
>   cpu          : usr=4.00%, sys=18.02%, ctx=131097, majf=0, minf=7
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>      issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=17.4MiB/s (18.2MB/s), 17.4MiB/s-17.4MiB/s (18.2MB/s-18.2MB/s),
> io=256MiB (268MB), run=14748-14748msec
>
>
> ## 2k randwrite
>
> # fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
> --filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
> test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T)
> 2048B-2048B, ioengine=libaio, iodepth=32
> fio-3.1
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8624KiB/s][r=0,w=4312 IOPS][eta
> 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=42781: Tue Mar 20 15:05:57 2018
>   write: *IOPS=4439, BW=8880KiB/s* (9093kB/s)(256MiB/29522msec)
>    bw (  KiB/s): min= 6908, max= 9564, per=99.94%, avg=8874.03,
> stdev=428.92, samples=59
>    iops        : min= 3454, max= 4782, avg=4437.00, stdev=214.44,
> samples=59
>   cpu          : usr=2.43%, sys=18.18%, ctx=262222, majf=0, minf=8
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>      issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=8880KiB/s (9093kB/s), 8880KiB/s-8880KiB/s (9093kB/s-9093kB/s),
> io=256MiB (268MB), run=29522-29522msec
>
>
> With performance.write-behind-trickling-writes OFF:
>
> ## 4k randwrite - just over half the IOPS compared with having it ON.
>

Note that since the workload is random write, the writes rarely land on
contiguous offsets, so almost no aggregation is possible. There is no point
in waiting for future writes, and leaving trickling-writes on makes sense
for this workload.

A better test for measuring the impact of this option would be a
sequential write workload. I'd guess that the smaller the writes, the more
pronounced the benefit of turning this option off would be.
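
For instance, a sequential variant of your 4k test would be your exact fio
invocation above with --readwrite=write instead of randwrite (untested
here, just adapting your command):

# fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test \
  --filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=write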


>
> # fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
> --filename=test --bs=4k --iodepth=32 --size=256MB --readwrite=randwrite
> test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=32
> fio-3.1
> Starting 1 process
> Jobs: 1 (f=1): [f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta
> 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=44225: Tue Mar 20 15:11:04 2018
>   write: *IOPS=2594, BW=10.1MiB/s* (10.6MB/s)(256MiB/25259msec)
>    bw (  KiB/s): min= 2248, max=18728, per=100.00%, avg=10454.10,
> stdev=6481.14, samples=50
>    iops        : min=  562, max= 4682, avg=2613.50, stdev=1620.35,
> samples=50
>   cpu          : usr=2.29%, sys=10.09%, ctx=131141, majf=0, minf=7
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>      issued rwt: total=0,65536,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=10.1MiB/s (10.6MB/s), 10.1MiB/s-10.1MiB/s (10.6MB/s-10.6MB/s),
> io=256MiB (268MB), run=25259-25259msec
>
>
> ## 2k randwrite - no noticeable change.
>
> # fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=test
> --filename=test --bs=2k --iodepth=32 --size=256MB --readwrite=randwrite
> test: (g=0): rw=randwrite, bs=(R) 2048B-2048B, (W) 2048B-2048B, (T)
> 2048B-2048B, ioengine=libaio, iodepth=32
> fio-3.1
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=8662KiB/s][r=0,w=4331 IOPS][eta
> 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=45813: Tue Mar 20 15:12:02 2018
>   write: *IOPS=4291, BW=8583KiB/s* (8789kB/s)(256MiB/30541msec)
>    bw (  KiB/s): min= 7416, max=10264, per=99.94%, avg=8577.66,
> stdev=618.31, samples=61
>    iops        : min= 3708, max= 5132, avg=4288.84, stdev=309.15,
> samples=61
>   cpu          : usr=2.87%, sys=15.83%, ctx=262236, majf=0, minf=8
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%,
> >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
> >=64=0.0%
>      issued rwt: total=0,131072,0, short=0,0,0, dropped=0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=8583KiB/s (8789kB/s), 8583KiB/s-8583KiB/s (8789kB/s-8789kB/s),
> io=256MiB (268MB), run=30541-30541msec
>
>
> Let me know if you'd recommend any other benchmarks
> comparing performance.write-behind-trickling-writes ON/OFF (just nothing
> that'll seriously risk locking up the whole gluster cluster please!).
>
>
> --
> Sam McLeod
> Please respond via email when possible.
> https://smcleod.net
> https://twitter.com/s_mcleod
>
> On 20 Mar 2018, at 2:56 pm, Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>
>
> On Tue, Mar 20, 2018 at 8:57 AM, Sam McLeod <mailinglists at smcleod.net>
> wrote:
>
>> Hi Raghavendra,
>>
>>
>> On 20 Mar 2018, at 1:55 pm, Raghavendra Gowdappa <rgowdapp at redhat.com>
>> wrote:
>>
>> Aggregation of large numbers of small writes into large writes by
>> write-behind has been merged on master:
>> https://github.com/gluster/glusterfs/issues/364
>>
>> I'd like to know whether it helps for this use case. Note that it's not
>> part of any release yet, so you'd have to build and install from the repo.
>>
>>
>> Sounds interesting. I'm not too keen to build packages at the moment,
>> but I've added myself as a watcher on that issue on GitHub, and once
>> it's in a 3.x release I'll try it and let you know.
>>
>> Another suggestion is to run tests with turning off option
>> performance.write-behind-trickling-writes.
>>
>> # gluster volume set <volname> performance.write-behind-trickling-writes
>> off
>>
>> A word of caution though: if your files are too small, these
>> suggestions may not have much impact.
>>
>>
>> I looked for documentation on this option, but all I could really find
>> is this note in the source for write-behind.c:
>>
>> if it is enabled (which it is), do not hold back writes if there are no
>> outstanding requests.
>>
>
> Until recently this functionality, though available, couldn't be
> configured from the CLI; one could change the option only by editing the
> volume configuration file. Now it's configurable through the CLI:
>
> https://review.gluster.org/#/c/18719/
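>
> For example, on a build that carries that patch, the current value should
> be visible with (assuming your gluster CLI supports "volume get"):
>
> # gluster volume get <volname> performance.write-behind-trickling-writes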
>
>
>>
>> and a note on aggregate-size stating that
>>
>> *"aggregation won't happen if performance.write-behind-trickling-writes
>> is turned on"*
>>
>>
>> What are the potentially negative performance impacts of disabling this?
>>
>
> Even with the aggregation option turned off, write-behind can aggregate
> writes up to a size of 128KB. But to make full use of that capacity with
> small-write workloads, write-behind has to wait a while so that enough
> write requests accumulate to fill it. With trickling-writes enabled,
> write-behind still aggregates the requests already queued, but it won't
> wait for future writes. This means the xlators below write-behind can see
> writes smaller than 128KB. So, in a scenario where a small number of
> large writes is preferable to a large number of small writes, this can
> be a problem.
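>
> Roughly, the decision is something like the following sketch (purely
> illustrative, not the actual write-behind.c code; the struct and names
> are made up):
>
>     #include <stddef.h>
>
>     struct wb_conf {
>             int    trickling_writes;  /* the option discussed above */
>             size_t aggregate_size;    /* the 128KB capacity         */
>     };
>
>     /* Decide whether to wind queued writes down the xlator stack now,
>      * or keep waiting for future writes to fill the capacity. */
>     static int
>     wb_should_wind_now (struct wb_conf *conf, size_t queued_bytes,
>                         int outstanding_requests)
>     {
>             /* Trickling-writes on: flush as soon as the wire is idle,
>              * even if the 128KB capacity isn't full yet. */
>             if (conf->trickling_writes && outstanding_requests == 0)
>                     return 1;
>
>             /* Otherwise hold back until the capacity is filled. */
>             return queued_bytes >= conf->aggregate_size;
>     }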
>
>
>> --
>> Sam McLeod (protoporpoise on IRC)
>> https://smcleod.net
>> https://twitter.com/s_mcleod
>>
>> Words are my own opinions and do not necessarily represent those of
>> my employer or partners.
>>
>>
>
>