[Bugs] [Bug 1674364] New: glusterfs-fuse client not benefiting from page cache on read after write
bugzilla at redhat.com
Mon Feb 11 06:38:49 UTC 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1674364
Bug ID: 1674364
Summary: glusterfs-fuse client not benefiting from page cache
on read after write
Product: GlusterFS
Version: 6
Hardware: x86_64
OS: Linux
Status: NEW
Component: fuse
Keywords: Performance
Severity: high
Assignee: bugs at gluster.org
Reporter: rgowdapp at redhat.com
CC: bugs at gluster.org
Depends On: 1664934
Blocks: 1670710, 1672818 (glusterfs-6.0)
Target Milestone: ---
Classification: Community
+++ This bug was initially created as a clone of Bug #1664934 +++
Description of problem:
On a simple single-brick distribute volume, I'm running tests to validate the
glusterfs-fuse client's use of the page cache. The tests indicate that a read
following a write is served from the brick, not from the client cache. In
contrast, a second read gets data from the client cache.
Version-Release number of selected component (if applicable):
glusterfs-*5.2-1.el7.x86_64
kernel-3.10.0-957.el7.x86_64 (RHEL 7.6)
How reproducible:
Consistently
Steps to Reproduce:
1. Use fio to create a data set that would fit easily in the page cache. My
client has 128 GB RAM; I'll create a 64 GB data set:
fio --name=initialwrite --ioengine=sync --rw=write \
--direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
--directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
--filesize=16g --size=16g --numjobs=4
2. Run an fio read test that reads the data set from step 1, without
invalidating the page cache:
fio --name=readtest --ioengine=sync --rw=read --invalidate=0 \
--direct=0 --bs=128k --directory=/mnt/glustervol/ \
--filename_format=f.\$jobnum.\$filenum --filesize=16g \
--size=16g --numjobs=4
Read throughput is much lower than it would be if reading from page cache:
READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
(68.7GB), run=114171-114419msec
Reads are going over the 10GbE network as shown in (edited) sar output:
05:01:04 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s
05:01:06 AM em1 755946.26 40546.26 1116287.75 3987.24 0.00
[There is some read amplification here: the application is getting lower
throughput than what the client is reading over the network. More on that later]
3. Run the read test in step 2 again. This time read throughput is really high,
indicating read from cache, rather than over the network:
READ: bw=14.8GiB/s (15.9GB/s), 3783MiB/s-4270MiB/s (3967MB/s-4477MB/s),
io=64.0GiB (68.7GB), run=3837-4331msec
Expected results:
The read test in step 2 should be reading from page cache, and should be giving
throughput close to what we get in step 3.
Additional Info:
gluster volume info:
Volume Name: perfvol
Type: Distribute
Volume ID: 7033539b-0331-44b1-96cf-46ddc6ee2255
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 172.16.70.128:/mnt/rhs_brick1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
--- Additional comment from Manoj Pillai on 2019-01-10 05:43:53 UTC ---
(In reply to Manoj Pillai from comment #0)
[...]
> 1. use fio to create a data set that would fit easily in the page cache. My
> client has 128 GB RAM; I'll create a 64 GB data set:
>
> fio --name=initialwrite --ioengine=sync --rw=write \
> --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
> --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
> --filesize=16g --size=16g --numjobs=4
>
Memory usage on the client while the write test is running:
<excerpt>
# sar -r 5
Linux 3.10.0-957.el7.x86_64 (c09-h08-r630.rdu.openstack.engineering.redhat.com) 01/10/2019 _x86_64_ (56 CPU)
05:35:36 AM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached  kbcommit  %commit  kbactive   kbinact  kbdirty
05:35:41 AM  126671972    4937712      3.75          0   2974352    256704     0.18   1878020   1147776       36
05:35:46 AM  126671972    4937712      3.75          0   2974352    256704     0.18   1878020   1147776       36
05:35:51 AM  126666904    4942780      3.76          0   2974324    259900     0.19   1879948   1147772       16
05:35:56 AM  126665820    4943864      3.76          0   2974348    261300     0.19   1880304   1147776       24
05:36:01 AM  126663136    4946548      3.76          0   2974348    356356     0.25   1881500   1147772       20
05:36:06 AM  126663028    4946656      3.76          0   2974348    356356     0.25   1881540   1147772       20
05:36:11 AM  126664444    4945240      3.76          0   2974388    356356     0.25   1880648   1147788       32
05:36:16 AM  126174984    5434700      4.13          0   3449508    930284     0.66   1892912   1622536       32
05:36:21 AM  120539884   11069800      8.41          0   9076076    930284     0.66   1893784   7247852       32
05:36:26 AM  114979592   16630092     12.64          0  14620932    930284     0.66   1893796  12793472       32
05:36:31 AM  109392488   22217196     16.88          0  20192112    930284     0.66   1893796  18365764       32
05:36:36 AM  104113900   27495784     20.89          0  25457272    930284     0.66   1895152  23630336       32
05:36:41 AM   98713688   32895996     25.00          0  30842800    930284     0.66   1895156  29015400       32
05:36:46 AM   93355560   38254124     29.07          0  36190264    930688     0.66   1897548  34361664       32
05:36:51 AM   87640900   43968784     33.41          0  41885972    930688     0.66   1897556  40057860       32
05:36:56 AM   81903068   49706616     37.77          0  47626388    930688     0.66   1897004  45798848        0
05:37:01 AM   76209860   55399824     42.09          0  53303272    930688     0.66   1897004  51475716        0
05:37:06 AM   70540340   61069344     46.40          0  58956264    930688     0.66   1897004  57128836        0
05:37:11 AM   64872776   66736908     50.71          0  64609648    930688     0.66   1897000  62782624        0
05:37:16 AM   59376144   72233540     54.88          0  70096880    930688     0.66   1897368  68270084        0
05:37:21 AM   71333376   60276308     45.80          0  58169584    356740     0.25   1891388  56342848        0
05:37:26 AM  126653336    4956348      3.77          0   2974476    356740     0.25   1891392   1148348        0
05:37:31 AM  126654360    4955324      3.77          0   2974388    356740     0.25   1891380   1147784        0
05:37:36 AM  126654376    4955308      3.77          0   2974388    356740     0.25   1891380   1147784        0
05:37:41 AM  126654376    4955308      3.77          0   2974388    356740     0.25   1891380   1147784        0
</excerpt>
So as the write test progresses, kbcached steadily increases. But it looks like
the cached data is subsequently dropped.
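
The sar figures can be checked against the size of the data set. The sketch
below takes three kbcached samples from the excerpt above (before the write, at
the peak, and after the drop) and converts the growth to GiB. It assumes sar
reports these values in units of 1024 bytes; the variable names are
illustrative only.

```python
# kbcached samples taken from the sar excerpt above.
# Assumption: sar's "kb" columns are in units of 1024 bytes (KiB).
KBCACHED_BEFORE = 2974388    # 05:36:11, before the cache starts to grow
KBCACHED_PEAK = 70096880     # 05:37:16, highest value in the excerpt
KBCACHED_AFTER = 2974476     # 05:37:26, after the cached data is dropped

KIB_PER_GIB = 1024 * 1024

grown_gib = (KBCACHED_PEAK - KBCACHED_BEFORE) / KIB_PER_GIB
left_gib = (KBCACHED_AFTER - KBCACHED_BEFORE) / KIB_PER_GIB

print(f"page cache growth during write: {grown_gib:.1f} GiB")
print(f"page cache retained afterwards: {left_gib:.4f} GiB")
```

Under that assumption the growth comes to about 64 GiB, which matches the size
of the data set, and essentially none of it is left 10 seconds after the peak.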
--- Additional comment from Manoj Pillai on 2019-01-10 05:52:14 UTC ---
When I run the same sequence of tests on an XFS file system on the server, I
get expected results: both step 2. and step 3. of comment #0 report high read
throughput (15+GiB/s) indicating data is read from the page cache.
--- Additional comment from Manoj Pillai on 2019-01-10 11:01:23 UTC ---
(In reply to Manoj Pillai from comment #0)
[...]
>
> Read throughput is much lower than it would be if reading from page cache:
> READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
> (68.7GB), run=114171-114419msec
>
> Reads are going over the 10GbE network as shown in (edited) sar output:
> 05:01:04 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s
> 05:01:06 AM em1 755946.26 40546.26 1116287.75 3987.24 0.00
>
> [There is some read amplification here: application is getting lower
> throughput than what client is reading over the n/w. More on that later]
>
This turned out to be primarily read-ahead related. Opened a new bug for it:
https://bugzilla.redhat.com/show_bug.cgi?id=1665029.
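
As a rough cross-check of the amplification in the quoted figures, the sketch
below compares the network receive rate from the sar sample with the throughput
fio reported. It assumes that sar's rxkB/s counts 1024-byte units and that the
single sar sample is representative of the whole run; the variable names are
illustrative only.

```python
# Rough estimate of read amplification from the figures quoted above.
# Assumption: sar reports rxkB/s in units of 1024 bytes.
RX_KB_PER_SEC = 1116287.75    # em1 rxkB/s from the sar sample
FIO_READ_MIB_PER_SEC = 573    # aggregate read bandwidth reported by fio

network_mib_per_sec = RX_KB_PER_SEC / 1024
amplification = network_mib_per_sec / FIO_READ_MIB_PER_SEC

print(f"network receive rate: {network_mib_per_sec:.0f} MiB/s")
print(f"read amplification:   {amplification:.2f}x")
```

Under these assumptions the client pulls roughly 1.9 times as much data over
the network as the application receives.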
--- Additional comment from Raghavendra G on 2019-01-23 13:04:54 UTC ---
From preliminary tests I see two reasons for this:
1. inode-invalidations triggered by md-cache
2. Fuse auto invalidations
With a hacky fix removing both of the above, I can see read after write being
served from kernel page-cache. I'll update the bug with more details discussing
validity/limitations with the above two approaches later.
--- Additional comment from Manoj Pillai on 2019-01-24 04:43:40 UTC ---
(In reply to Raghavendra G from comment #4)
> From preliminary tests I see two reasons for this:
> 1. inode-invalidations triggered by md-cache
> 2. Fuse auto invalidations
Trying with kernel NFS, another distributed fs solution, I see that the cache
is retained at the end of the write test, and both read-after-write and
read-after-read are served from the page cache.
In principle, if kNFS can do it, FUSE should be able to do it. I think :D.
--- Additional comment from Worker Ant on 2019-01-29 03:15:45 UTC ---
REVIEW: https://review.gluster.org/22109 (mount/fuse: expose
fuse-auto-invalidation as a mount option) posted (#1) for review on master by
Raghavendra G
--- Additional comment from Raghavendra G on 2019-01-30 05:41:39 UTC ---
(In reply to Manoj Pillai from comment #5)
> (In reply to Raghavendra G from comment #4)
> > From preliminary tests I see two reasons for this:
> > 1. inode-invalidations triggered by md-cache
> > 2. Fuse auto invalidations
>
> Trying with kernel NFS, another distributed fs solution. I see that cache is
> retained at the end of the write test, and both read-after-write and
> read-after-read are served from the page cache.
>
> In principle, if kNFS can do it, FUSE should be able to do it. I think :D.
kNFS and FUSE have different invalidation policies.
* kNFS provides close-to-open consistency. To quote from their FAQ [1]:
"Linux implements close-to-open cache consistency by comparing the results of a
GETATTR operation done just after the file is closed to the results of a
GETATTR operation done when the file is next opened. If the results are the
same, the client will assume its data cache is still valid; otherwise, the
cache is purged."
For the workload used in this bz, the file is not changed between close and
open. Hence the two stat values fetched, at close and at open, match and the
page cache is retained.
* FUSE auto-invalidation compares the times in the cached stat with the values
returned by the underlying filesystem implementation in every codepath where
stat is fetched. This means the comparison happens in the lookup, (f)stat,
(f)setattr, etc. codepaths. Since (f)stat and lookup can happen asynchronously
and concurrently with respect to writes, they end up detecting a delta between
the two stat values, resulting in a cache purge. Please note that the
consistency offered by FUSE is stronger than close-to-open consistency: it
provides close-to-open consistency along with consistency in codepaths like
lookup, fstat, etc.
We have the following options:
* disable auto-invalidation and use a custom-designed glusterfs invalidation
policy. The invalidation policy can be the same as NFS close-to-open
consistency or something stronger.
* check whether the current form of auto-invalidation (though stricter)
provides any useful benefits over close-to-open consistency. If not, change
FUSE auto-invalidation to close-to-open consistency.
[1] http://nfs.sourceforge.net/#faq_a8
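
The difference between the two policies can be illustrated with a toy model.
This is only a sketch of the behaviour described above, not the kernel's or
glusterfs's actual logic: ToyClient, its fields and pages_cached_after_reopen
are hypothetical names, mtime is reduced to a counter, and the "concurrent"
stat is simply placed between two writes.

```python
from dataclasses import dataclass


@dataclass
class ToyClient:
    """Toy model of one file on a client; not the kernel's real logic."""
    policy: str              # "close-to-open" or "auto-invalidate"
    server_mtime: int = 0    # mtime as the backing filesystem reports it
    cached_mtime: int = 0    # mtime the client remembered last
    cached_pages: int = 0    # pages held in the client's page cache

    def _revalidate(self) -> None:
        # Purge cached pages if the remembered mtime no longer matches.
        if self.cached_mtime != self.server_mtime:
            self.cached_pages = 0
        self.cached_mtime = self.server_mtime

    def open(self) -> None:
        # Both policies compare attributes when the file is opened.
        self._revalidate()

    def write(self, pages: int) -> None:
        # A local write populates the cache and bumps mtime on the server.
        self.cached_pages += pages
        self.server_mtime += 1

    def stat(self) -> None:
        # Only auto-invalidation compares attributes on every stat fetch.
        if self.policy == "auto-invalidate":
            self._revalidate()

    def close(self) -> None:
        # Close-to-open remembers the attributes seen just after close.
        if self.policy == "close-to-open":
            self.cached_mtime = self.server_mtime


def pages_cached_after_reopen(policy: str) -> int:
    client = ToyClient(policy)
    client.open()
    client.write(4)
    client.stat()    # stands in for a concurrent (f)stat or lookup
    client.write(4)
    client.close()
    client.open()    # reopen for the read test
    return client.cached_pages


for policy in ("close-to-open", "auto-invalidate"):
    print(policy, pages_cached_after_reopen(policy))
```

In this model the close-to-open client still holds all 8 written pages when the
file is reopened, while the auto-invalidating client holds none: every attribute
fetch sees an mtime changed by the client's own writes and purges the cache.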
--- Additional comment from Raghavendra G on 2019-01-30 05:45:23 UTC ---
Miklos,
It would be helpful if you can comment on comment #7.
regards,
Raghavendra
--- Additional comment from Raghavendra G on 2019-01-30 05:59:06 UTC ---
Note that a lease based invalidation policy would be a complete solution, but
it will take some time to implement that and get it working in Glusterfs.
--- Additional comment from Worker Ant on 2019-02-02 03:08:22 UTC ---
REVIEW: https://review.gluster.org/22109 (mount/fuse: expose auto-invalidation
as a mount option) merged (#13) on master by Amar Tumballi
--- Additional comment from Miklos Szeredi on 2019-02-04 09:53:18 UTC ---
The underlying problem is that auto invalidate cannot differentiate local and
remote modification based on mtime alone.
What NFS apparently does is refresh attributes immediately after a write (not
sure how often it does this, I guess not after each individual write).
Maybe FUSE should do this if auto invalidation is enabled, but if the
filesystem can do its own invalidation, possibly based on better information
than c/mtime, then that seems to be a better option.
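
A minimal sketch of this point, under the simplifying assumption that mtime is
a counter and exactly one write happens: looks_stale and purge_on_next_stat are
hypothetical helpers, not FUSE or NFS code. Without a refresh after the write,
a local and a remote modification look identical to the client; with a refresh
after its own write, only the remote modification still shows a delta.

```python
def looks_stale(cached_mtime: int, fetched_mtime: int) -> bool:
    # The only comparison available when mtime is the sole signal.
    return cached_mtime != fetched_mtime


def purge_on_next_stat(writer: str, refresh_after_write: bool) -> bool:
    cached_mtime = server_mtime = 100
    server_mtime += 1                  # one write lands on the server
    if writer == "local" and refresh_after_write:
        cached_mtime = server_mtime    # client re-reads attrs after its own write
    return looks_stale(cached_mtime, server_mtime)


for refresh in (False, True):
    for writer in ("local", "remote"):
        print(f"refresh_after_write={refresh} writer={writer}: "
              f"purge={purge_on_next_stat(writer, refresh)}")
```

The model ignores the case where a remote write races with the local one, which
is one reason better information than c/mtime would be preferable.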
--- Additional comment from Worker Ant on 2019-02-08 12:14:58 UTC ---
REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to
--auto-invalidation in mount script) posted (#1) for review on master by
Raghavendra G
--- Additional comment from Worker Ant on 2019-02-09 18:41:54 UTC ---
REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to
--auto-invalidation in mount script) merged (#2) on master by Raghavendra G
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1664934
[Bug 1664934] glusterfs-fuse client not benefiting from page cache on read
after write
https://bugzilla.redhat.com/show_bug.cgi?id=1670710
[Bug 1670710] glusterfs-fuse client not benefiting from page cache on read
after write
https://bugzilla.redhat.com/show_bug.cgi?id=1672818
[Bug 1672818] GlusterFS 6.0 tracker