[Gluster-devel] Latency analysis of GlusterFS' network layer for pgbench

Sat Jan 26 02:36:24 UTC 2019

On Sat, Jan 26, 2019 at 8:03 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:

>
>
> On Fri, Jan 11, 2019 at 8:09 PM Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>> Here is an update on the progress so far:
>> * The client profiles attached so far show that tuple creation is
>> dominated by writes and fstats. Note that fstats are side effects of
>> writes, as writes invalidate the file's attributes in the kernel attribute
>> cache.
>> * The rest of the init phase (which is marked by msgs "setting primary
>> key" and "vacuum") is dominated by reads. The next largest sets of
>> operations are writes, followed by fstats.
>>
>> So, writes, reads and fstats are the only operations we need to optimize
>> to reduce the init time latency. As mentioned in my previous mail, I did
>> the following tunings:
>> * Enabled only write-behind, md-cache and open-behind.
>>     - write-behind was configured with a cache-size/window-size of 20MB
>>     - open-behind was configured with read-after-open yes
>>     - md-cache was loaded as a child of write-behind in the xlator graph.
>> As a parent of write-behind, responses to writes cached in write-behind
>> would invalidate stats. When md-cache is loaded as a child of write-behind,
>> this problem doesn't arise. Note that in both cases fstat passes through
>> write-behind (in the former case because md-cache has no stats). However,
>> in the latter case fstats can be served by md-cache.
>>     - md-cache used to aggressively invalidate inodes. For the purpose of
>> this test, I just commented out the inode-invalidate code in md-cache. We
>> need to fine-tune the invalidation invocation logic.
>>     - set group-metadata-cache to on, but turned off upcall
>> notifications. Since this workload accesses all its data through a single
>> mount point, there are no files shared across mounts and hence it's safe
>> to turn off invalidations.
>> * Applied fix to https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>
>> With the above set of tunings I could reduce the init time of scale 8000
>> from 16.6 hrs to 11.4 hrs, an improvement of roughly 30%.
>>
>> Since the workload is dominated by reads, we think a good read cache,
>> where reads to regions just written are served from cache, would greatly
>> improve the performance. Since the kernel page cache already provides that
>> functionality along with read-ahead (which is more intelligent and serves
>> more read patterns than Glusterfs read-ahead supports), we wanted to
>> try that. But Manoj found a bug where reads that follow writes are not
>> served from the page cache [5]. I am currently waiting for the resolution
>> of this bug. As an alternative, I can modify io-cache to serve reads from
>> the data just written. But that change has its challenges, and hence I
>> would like to get a resolution on [5] (either positive or negative) before
>> proceeding with modifications to io-cache.
>>
>> As to the rpc latency, Krutika had long ago identified that reading a
>> single rpc message involves at least 4 reads on the socket. That many
>> reads are done to identify the structure of the message on the go. The
>> reason we wanted to discover the structure of the rpc message was to
>> identify the part containing the read or write payload and make sure that
>> payload is read directly into a buffer different from the one containing
>> the rest of the rpc message. This strategy makes sure payloads are not
>> copied again when buffers are moved across caches (read-ahead, io-cache
>> etc.), and also that the rest of the rpc message can be freed even though
>> the payload outlives the rpc message (when payloads are cached). However,
>> we can experiment with an approach where we either do away with the
>> zero-copy requirement or let the entire buffer containing the rpc message
>> and payload live in the cache.
>>
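
The trade-off described above can be illustrated with a small sketch. It is
not Gluster's socket code or wire format: the body layout
([header length][header][payload]) is made up for illustration, and the names
CountingReader, frame, read_incremental and read_whole are hypothetical. Only
the 4-byte record marker (high bit = last fragment, low 31 bits = length)
follows ONC RPC record marking, which is assumed to apply here. Reads on an
in-memory stream stand in for socket reads, and short reads are ignored.

```python
import io
import struct

LAST_FRAGMENT = 0x80000000   # high bit of the record marker
LENGTH_MASK = 0x7FFFFFFF     # low 31 bits carry the fragment length


class CountingReader:
    """Wraps a byte stream and counts read calls (stand-in for socket reads)."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self.reads = 0

    def read(self, n):
        self.reads += 1
        return self._stream.read(n)


def frame(header, payload):
    """Build one record: marker, then the hypothetical [len][header][payload]."""
    body = struct.pack(">I", len(header)) + header + payload
    return struct.pack(">I", LAST_FRAGMENT | len(body)) + body


def read_incremental(reader):
    """Discover the layout step by step so the payload gets its own buffer."""
    (marker,) = struct.unpack(">I", reader.read(4))
    body_len = marker & LENGTH_MASK
    (header_len,) = struct.unpack(">I", reader.read(4))
    header = reader.read(header_len)
    # The payload lands in a separate buffer and can outlive the header.
    payload = bytearray(reader.read(body_len - 4 - header_len))
    return header, payload


def read_whole(reader):
    """Read the whole body at once; the payload is a view into that buffer."""
    (marker,) = struct.unpack(">I", reader.read(4))
    body = reader.read(marker & LENGTH_MASK)
    (header_len,) = struct.unpack(">I", body[:4])
    view = memoryview(body)
    # No copy of the payload, but the view keeps the entire body alive.
    return bytes(view[4:4 + header_len]), view[4 + header_len:]


if __name__ == "__main__":
    message = frame(b"rpc-header", b"payload-bytes")
    for reader_fn in (read_incremental, read_whole):
        reader = CountingReader(message)
        reader_fn(reader)
        print(reader_fn.__name__, "reads:", reader.reads)
```

In this toy model the incremental reader needs four reads per message and the
whole-buffer reader needs two, at the cost of the cached payload pinning the
header bytes in memory as well.
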
>> From my observations and discussions with Manoj and Xavi, this workload
>> is more sensitive to latency than to concurrency. So, I am hopeful the
>> above approaches will give positive results.
>>
>
> Manoj, Csaba and I figured out that invalidations by md-cache and Fuse
> auto-invalidations were dropping the kernel page-cache (more details in
> [5]).
>

Thanks to Miklos for the pointer on auto-invalidations.

> Changes to stats caused by writes from the same client (local writes) were
> triggering both these codepaths, dropping the cache. Since all the I/O done
> by this workload goes through the caches of a single client, the
> invalidations are not necessary, so I made code changes to fuse-bridge to
> disable auto-invalidations completely and commented out inode-invalidations
> in md-cache. Note that this doesn't regress the consistency/coherency of
> data seen in the caches, as it's a single-client use case. With these two
> changes coupled with earlier optimizations (client-io-threads=on,
> server/client-event-threads=4, md-cache as a child of write-behind in
> xlator graph, performance.md-cache-timeout=600), pgbench init of scale 8000
> on a volume with NVMe backend completed in 54m25s. This is a whopping 94%
> improvement over the time we started out with (59280s vs 3360s).
>
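
For reference, the arithmetic behind the figures quoted above can be checked
with a short sketch. The helper names are hypothetical. Note that 54m25s is
3265 s, slightly less than the 3360 s quoted in the parentheses; either value
gives an improvement of about 94% over 59280 s.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


def improvement_pct(before_s, after_s):
    """Reduction in elapsed time as a percentage of the starting time."""
    return (before_s - after_s) / before_s * 100.0


if __name__ == "__main__":
    baseline_s = 59280                      # starting time quoted in the mail
    quoted_final_s = 3360                   # final time quoted in the mail
    wallclock_final_s = to_seconds(54, 25)  # the "54m25s" reading

    for label, final_s in (("quoted", quoted_final_s),
                           ("54m25s", wallclock_final_s)):
        pct = improvement_pct(baseline_s, final_s)
        print(f"{label}: {final_s} s, improvement {pct:.1f}%")
```
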
> [root at shakthi4 ~]# gluster volume info
>
> Volume Name: nvme-r3
> Type: Replicate
> Volume ID: d1490bcc-bcf1-4e09-91e8-ab01d9781263
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-1
> Brick2: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-2
> Brick3: shakthi4:/gluster/nvme0n1/bricks/nvme-r3-3
> Options Reconfigured:
> server.event-threads: 4
> client.event-threads: 4
> diagnostics.client-log-level: INFO
> performance.md-cache-timeout: 600
> performance.io-cache: off
> performance.read-ahead: off
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
> performance.stat-prefetch: on
>
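
As a rough aid for reproducing this setup, the sketch below builds the
`gluster volume set <VOLNAME> <KEY> <VALUE>` command lines for the
performance-related options listed under "Options Reconfigured". It only
prints the commands and does not run them. The helper name
volume_set_commands is hypothetical, and the fuse-bridge and md-cache code
changes and the xlator graph ordering described in this mail are not
expressible as volume options, so they are not covered.

```python
import shlex

VOLUME = "nvme-r3"

# Performance-related options copied from the volume info output above.
OPTIONS = {
    "server.event-threads": "4",
    "client.event-threads": "4",
    "performance.md-cache-timeout": "600",
    "performance.io-cache": "off",
    "performance.read-ahead": "off",
    "performance.client-io-threads": "on",
    "performance.stat-prefetch": "on",
}


def volume_set_commands(volume, options):
    """Return one 'gluster volume set' command line per option."""
    return [
        shlex.join(["gluster", "volume", "set", volume, key, value])
        for key, value in options.items()
    ]


if __name__ == "__main__":
    for command in volume_set_commands(VOLUME, OPTIONS):
        print(command)
```
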
> I'll be concentrating on how to disable fuse-auto-invalidations without
> regressing the consistency model we've been providing so far. The
> consistency model Glusterfs has been providing so far is close-to-open
> consistency, similar to what NFS provides [6][7].
>
> But the initial thoughts are that, at least for the pgbench test case,
> there is no harm in totally disabling fuse-auto-invalidations and md-cache
> invalidations. This workload runs entirely on a single mount point, so all
> I/O goes through the caches, the caches are in sync with the state of the
> file on the backend, and invalidations are not necessary.
>
> [6] http://nfs.sourceforge.net/#faq_a8
> [7]
> https://lists.gluster.org/pipermail/gluster-users/2013-March/012805.html
>
>
>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1664934
>>
>> regards,
>> Raghavendra
>>
>> On Fri, Dec 28, 2018 at 12:44 PM Raghavendra Gowdappa <
>> rgowdapp at redhat.com> wrote:
>>
>>>
>>>
>>> On Mon, Dec 24, 2018 at 6:05 PM Raghavendra Gowdappa <
>>> rgowdapp at redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Dec 24, 2018 at 3:40 PM Sankarshan Mukhopadhyay <
>>>> sankarshan.mukhopadhyay at gmail.com> wrote:
>>>>
>>>>> [pulling the conclusions up to enable better in-line]
>>>>>
>>>>> > Conclusions:
>>>>> >
>>>>> > We should never have a volume with caching-related xlators disabled.
>>>>> The price we pay for it is too high. We need to make them work consistently
>>>>> and aggressively to avoid as many requests as we can.
>>>>>
>>>>> Are there current issues in terms of behavior which are known/observed
>>>>> when these are enabled?
>>>>>
>>>>
>>>> We did have issues with pgbench in the past, but they've been fixed.
>>>> Please refer to bz [1] for details. On 5.1, it runs successfully with all
>>>> caching-related xlators enabled. Having said that, the only performance
>>>> xlators which gave improved performance were open-behind and write-behind
>>>> [2] (write-behind had some issues, which will be fixed by [3], and we'll
>>>> have to measure performance again with the fix for [3]).
>>>>
>>>
>>> One quick update. Enabling write-behind and md-cache with the fix for [3]
>>> reduced the total time taken for the pgbench init phase by roughly 20%-25%
>>> (from 12.5 min to 9.75 min for a scale of 100). This is still a huge
>>> time, though (around 12hrs for a db of scale 8000). I'll follow up with a
>>> detailed report once my experiments are complete. I am currently trying to
>>> optimize the read path.
>>>
>>>
>>>> For some reason, read-side caching didn't improve transactions per
>>>> second. I am working on this problem currently. Note that these bugs
>>>> measure transaction phase of pgbench, but what xavi measured in his mail is
>>>> init phase. Nevertheless, evaluation of read caching (metadata/data) will
>>>> still be relevant for init phase too.
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1629589#c4
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1648781
>>>>
>>>>
>>>>> > We need to analyze client/server xlators deeper to see if we can
>>>>> avoid some delays. However optimizing something that is already at the
>>>>> microsecond level can be very hard.
>>>>>
>>>>> That is true - are there any significant gains which can be accrued by
>>>>> putting efforts here or, should this be a lower priority?
>>>>>
>>>>
>>>> The problem identified by xavi is also the one we (Manoj, Krutika,
>>>> Milind and I) had encountered in the past [4]. The solution we used was
>>>> to have multiple rpc connections between a single brick and the client.
>>>> That solution indeed fixed the bottleneck. So, there is definitely work
>>>> involved here - either to fix the single-connection model or to go with
>>>> the multiple-connection model. It's preferred to improve the single
>>>> connection and resort to multiple connections only if the bottlenecks in
>>>> a single connection are not fixable. Personally I think this is high
>>>> priority, along with having appropriate client-side caching.
>>>>
>>>> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1467614#c52
>>>>
>>>>
>>>>> > We need to determine what causes the fluctuations in brick side and
>>>>> avoid them.
>>>>> > This scenario is very similar to a smallfile/metadata workload, so
>>>>> this is probably one important cause of its bad performance.
>>>>>
>>>>> What kind of instrumentation is required to enable the determination?
>>>>>
>>>>> On Fri, Dec 21, 2018 at 1:48 PM Xavi Hernandez <xhernandez at redhat.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I've done some tracing of the latency that the network layer introduces
>>>>> in gluster. I've made the analysis as part of the pgbench performance issue
>>>>> (in particular the initialization and scaling phase), so I decided to look
>>>>> at READV for this particular workload, but I think the results can be
>>>>> extrapolated to other operations that also have small latency (cached data
>>>>> from FS for example).
>>>>> >
>>>>> > Note that measuring latencies introduces some latency. It consists
>>>>> in a call to clock_get_time() for each probe point, so the real latency
>>>>> will be a bit lower, but still proportional to these numbers.
>>>>> >
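
The probe overhead mentioned above can be estimated in isolation. The sketch
below is only an illustration in Python, not the tracing used for this
analysis: it times back-to-back clock reads with time.perf_counter_ns() and
subtracts the estimated cost from a measured interval. The function names are
hypothetical, and the estimate also includes Python loop overhead, so it is
an upper bound rather than the cost of the clock call itself.

```python
import time


def probe_overhead_ns(samples=100_000):
    """Estimate the average cost of one clock read from back-to-back reads."""
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    end = time.perf_counter_ns()
    return (end - start) / samples


def corrected_latency_ns(measured_ns, probes, overhead_ns):
    """Subtract the estimated cost of the probes from a measured interval."""
    return max(0.0, measured_ns - probes * overhead_ns)


if __name__ == "__main__":
    overhead = probe_overhead_ns()
    print(f"estimated cost per probe: {overhead:.0f} ns")
    # Example: an interval measured at 20 us that contains two probe points.
    print(f"corrected: {corrected_latency_ns(20_000, 2, overhead):.0f} ns")
```
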
>>>>>
>>>>> [snip]