[Bugs] [Bug 1674364] New: glusterfs-fuse client not benefiting from page cache on read after write

bugzilla at redhat.com bugzilla at redhat.com
Mon Feb 11 06:38:49 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1674364

            Bug ID: 1674364
           Summary: glusterfs-fuse client not benefiting from page cache
                    on read after write
           Product: GlusterFS
           Version: 6
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: fuse
          Keywords: Performance
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: rgowdapp at redhat.com
                CC: bugs at gluster.org
        Depends On: 1664934
            Blocks: 1670710, 1672818 (glusterfs-6.0)
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1664934 +++

Description of problem:
On a simple single brick distribute volume, I'm running tests to validate
glusterfs-fuse client's use of page cache. The tests are indicating that a read
following a write is reading from the brick, not from client cache. In
contrast, a 2nd read gets data from the client cache.

Version-Release number of selected component (if applicable):

glusterfs-*5.2-1.el7.x86_64
kernel-3.10.0-957.el7.x86_64 (RHEL 7.6)

How reproducible:

Consistently

Steps to Reproduce:
1. use fio to create a data set that would fit easily in the page cache. My
client has 128 GB RAM; I'll create a 64 GB data set:

fio --name=initialwrite --ioengine=sync --rw=write \
--direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
--directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
--filesize=16g --size=16g --numjobs=4

2. run an fio read test that reads the data set from step 1, without
invalidating the page cache:

fio --name=readtest --ioengine=sync --rw=read --invalidate=0 \
--direct=0 --bs=128k --directory=/mnt/glustervol/ \
--filename_format=f.\$jobnum.\$filenum --filesize=16g \
--size=16g --numjobs=4

Read throughput is much lower than it would be if reading from page cache:
READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
(68.7GB), run=114171-114419msec

Reads are going over the 10GbE network as shown in (edited) sar output:
05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00

[There is some read amplification here: the application is getting lower
throughput than what the client is reading over the network. More on that later.]

3. Run the read test in step 2 again. This time read throughput is really high,
indicating read from cache, rather than over the network:
READ: bw=14.8GiB/s (15.9GB/s), 3783MiB/s-4270MiB/s (3967MB/s-4477MB/s),
io=64.0GiB (68.7GB), run=3837-4331msec
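A quick way to confirm whether a read run is being served from the page cache is to watch the kernel's Cached counter (the same kbcached value sar reports) around the I/O. A minimal Linux-specific sketch, with illustrative file names (not part of the reproducer above):

```python
# Sketch: compare the kernel's page-cache size before and after writing and
# reading back a file, to see whether the data stayed cached.
# Linux-specific; parses /proc/meminfo. Path and sizes are illustrative.
import os

def cached_kb():
    """Return the 'Cached' value from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Cached:"):
                return int(line.split()[1])
    raise RuntimeError("Cached: not found in /proc/meminfo")

def cache_delta(path, size=64 * 1024 * 1024):
    """Write then read back a file, returning page-cache growth in kB."""
    before = cached_kb()
    with open(path, "wb") as f:
        f.write(os.urandom(size))
        f.flush()
        os.fsync(f.fileno())          # like fio's --end_fsync=1
    with open(path, "rb") as f:       # buffered read, like --direct=0
        while f.read(1 << 20):
            pass
    return cached_kb() - before

if __name__ == "__main__":
    print("page cache changed by %d kB" % cache_delta("/tmp/cachetest.dat"))
```

Running this with `path` on the fuse mount between steps 1 and 2 shows whether the written data survives in the client's cache; on the buggy versions the Cached counter collapses back once the writer closes the files, matching the sar excerpt below.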


Expected results:

The read test in step 2 should be reading from page cache, and should be giving
throughput close to what we get in step 3.

Additional Info:

gluster volume info:

Volume Name: perfvol
Type: Distribute
Volume ID: 7033539b-0331-44b1-96cf-46ddc6ee2255
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 172.16.70.128:/mnt/rhs_brick1
Options Reconfigured:
transport.address-family: inet
nfs.disable: on

--- Additional comment from Manoj Pillai on 2019-01-10 05:43:53 UTC ---

(In reply to Manoj Pillai from comment #0)
[...]
> 1. use fio to create a data set that would fit easily in the page cache. My
> client has 128 GB RAM; I'll create a 64 GB data set:
> 
> fio --name=initialwrite --ioengine=sync --rw=write \
> --direct=0 --create_on_open=1 --end_fsync=1 --bs=128k \
> --directory=/mnt/glustervol/ --filename_format=f.\$jobnum.\$filenum \
> --filesize=16g --size=16g --numjobs=4
> 

Memory usage on the client while the write test is running:

<excerpt>
# sar -r 5
Linux 3.10.0-957.el7.x86_64 (c09-h08-r630.rdu.openstack.engineering.redhat.com)
        01/10/2019      _x86_64_ (56 CPU)

05:35:36 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit  
%commit  kbactive   kbinact   kbdirty
05:35:41 AM 126671972   4937712      3.75         0   2974352    256704     
0.18   1878020   1147776        36
05:35:46 AM 126671972   4937712      3.75         0   2974352    256704     
0.18   1878020   1147776        36
05:35:51 AM 126666904   4942780      3.76         0   2974324    259900     
0.19   1879948   1147772        16
05:35:56 AM 126665820   4943864      3.76         0   2974348    261300     
0.19   1880304   1147776        24
05:36:01 AM 126663136   4946548      3.76         0   2974348    356356     
0.25   1881500   1147772        20
05:36:06 AM 126663028   4946656      3.76         0   2974348    356356     
0.25   1881540   1147772        20
05:36:11 AM 126664444   4945240      3.76         0   2974388    356356     
0.25   1880648   1147788        32
05:36:16 AM 126174984   5434700      4.13         0   3449508    930284     
0.66   1892912   1622536        32
05:36:21 AM 120539884  11069800      8.41         0   9076076    930284     
0.66   1893784   7247852        32
05:36:26 AM 114979592  16630092     12.64         0  14620932    930284     
0.66   1893796  12793472        32
05:36:31 AM 109392488  22217196     16.88         0  20192112    930284     
0.66   1893796  18365764        32
05:36:36 AM 104113900  27495784     20.89         0  25457272    930284     
0.66   1895152  23630336        32
05:36:41 AM  98713688  32895996     25.00         0  30842800    930284     
0.66   1895156  29015400        32
05:36:46 AM  93355560  38254124     29.07         0  36190264    930688     
0.66   1897548  34361664        32
05:36:51 AM  87640900  43968784     33.41         0  41885972    930688     
0.66   1897556  40057860        32
05:36:56 AM  81903068  49706616     37.77         0  47626388    930688     
0.66   1897004  45798848         0
05:37:01 AM  76209860  55399824     42.09         0  53303272    930688     
0.66   1897004  51475716         0
05:37:06 AM  70540340  61069344     46.40         0  58956264    930688     
0.66   1897004  57128836         0
05:37:11 AM  64872776  66736908     50.71         0  64609648    930688     
0.66   1897000  62782624         0
05:37:16 AM  59376144  72233540     54.88         0  70096880    930688     
0.66   1897368  68270084         0
05:37:21 AM  71333376  60276308     45.80         0  58169584    356740     
0.25   1891388  56342848         0
05:37:26 AM 126653336   4956348      3.77         0   2974476    356740     
0.25   1891392   1148348         0
05:37:31 AM 126654360   4955324      3.77         0   2974388    356740     
0.25   1891380   1147784         0
05:37:36 AM 126654376   4955308      3.77         0   2974388    356740     
0.25   1891380   1147784         0
05:37:41 AM 126654376   4955308      3.77         0   2974388    356740     
0.25   1891380   1147784         0
</excerpt>

So as the write test progresses, kbcached steadily increases. But it looks like
the cached data is dropped once the write test completes.

--- Additional comment from Manoj Pillai on 2019-01-10 05:52:14 UTC ---

When I run the same sequence of tests on an XFS file system on the server, I
get expected results: both step 2. and step 3. of comment #0 report high read
throughput (15+GiB/s) indicating data is read from the page cache.

--- Additional comment from Manoj Pillai on 2019-01-10 11:01:23 UTC ---

(In reply to Manoj Pillai from comment #0)
[...]
> 
> Read throughput is much lower than it would be if reading from page cache:
> READ: bw=573MiB/s (601MB/s), 143MiB/s-144MiB/s (150MB/s-150MB/s), io=64.0GiB
> (68.7GB), run=114171-114419msec
> 
> Reads are going over the 10GbE network as shown in (edited) sar output:
> 05:01:04 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s 
> 05:01:06 AM       em1 755946.26  40546.26 1116287.75   3987.24      0.00
> 
> [There is some read amplification here: application is getting lower
> throughput than what client is reading over the n/w. More on that later]    
> 

This turned out to be primarily read-ahead related. Opened a new bug for it:
https://bugzilla.redhat.com/show_bug.cgi?id=1665029.

--- Additional comment from Raghavendra G on 2019-01-23 13:04:54 UTC ---

From preliminary tests I see two reasons for this:
1. inode-invalidations triggered by md-cache
2. Fuse auto invalidations

With a hacky fix removing both of the above, I can see read-after-write being
served from the kernel page cache. I'll update the bug with more details
discussing the validity/limitations of the two approaches later.

--- Additional comment from Manoj Pillai on 2019-01-24 04:43:40 UTC ---

(In reply to Raghavendra G from comment #4)
> From preliminary tests I see two reasons for this:
> 1. inode-invalidations triggered by md-cache
> 2. Fuse auto invalidations

Trying with kernel NFS, another distributed fs solution. I see that cache is
retained at the end of the write test, and both read-after-write and
read-after-read are served from the page cache.

In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

--- Additional comment from Worker Ant on 2019-01-29 03:15:45 UTC ---

REVIEW: https://review.gluster.org/22109 (mount/fuse: expose
fuse-auto-invalidation as a mount option) posted (#1) for review on master by
Raghavendra G

--- Additional comment from Raghavendra G on 2019-01-30 05:41:39 UTC ---

(In reply to Manoj Pillai from comment #5)
> (In reply to Raghavendra G from comment #4)
> > From preliminary tests I see two reasons for this:
> > 1. inode-invalidations triggered by md-cache
> > 2. Fuse auto invalidations
> 
> Trying with kernel NFS, another distributed fs solution. I see that cache is
> retained at the end of the write test, and both read-after-write and
> read-after-read are served from the page cache.
> 
> In principle, if kNFS can do it, FUSE should be able to do it. I think :D.

kNFS and FUSE have different invalidation policies.

* kNFS provides close-to-open consistency. To quote from their FAQ [1]

"Linux implements close-to-open cache consistency by comparing the results of a
GETATTR operation done just after the file is closed to the results of a
GETATTR operation done when the file is next opened. If the results are the
same, the client will assume its data cache is still valid; otherwise, the
cache is purged."

For the workload used in this bz, the file is not changed between close and
open. Hence the two stat values fetched - at close and at open - match, and the
page cache is retained.
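The close-to-open check described above can be sketched as follows. This is illustrative pseudologic, not actual NFS or GlusterFS code; the class and field names are assumptions:

```python
# Sketch of NFS-style close-to-open cache validation: remember the stat
# taken when the file is closed, compare it with a fresh stat at the next
# open, and purge cached pages only if they differ. Names are illustrative.
from collections import namedtuple

Stat = namedtuple("Stat", ["mtime", "ctime", "size"])

class CachedFile:
    def __init__(self):
        self.stat_at_close = None
        self.cache_valid = False

    def on_close(self, stat):
        # Remember the attributes observed when the file was closed.
        self.stat_at_close = stat
        self.cache_valid = True

    def on_open(self, fresh_stat):
        # Close-to-open check: if attributes are unchanged since close,
        # keep the cached pages; otherwise purge them.
        if self.stat_at_close != fresh_stat:
            self.cache_valid = False   # file changed remotely: purge
        return self.cache_valid

f = CachedFile()
f.on_close(Stat(mtime=100, ctime=100, size=4096))
print(f.on_open(Stat(mtime=100, ctime=100, size=4096)))  # unchanged -> True
print(f.on_open(Stat(mtime=150, ctime=150, size=8192)))  # changed -> False
```

The key point for this bug is that the comparison happens only at open, so stats fetched concurrently with writes (as in the fio workload here) never get a chance to purge the cache.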

* FUSE auto-invalidation compares the cached stat times with the values
obtained from the underlying filesystem implementation on every codepath where
stat is fetched. This means the comparison happens in the lookup, (f)stat,
(f)setattr etc. codepaths. Since (f)stat and lookup can happen asynchronously
and concurrently with writes, they'll end up detecting a delta between the two
stat values, resulting in a cache purge. Please note that the consistency
offered by FUSE is stronger than close-to-open consistency: it provides
close-to-open consistency along with consistency in codepaths like lookup,
fstat etc.

We have the following options:

* Disable auto-invalidation and use a custom glusterfs invalidation policy.
That policy can be the same as NFS close-to-open consistency, or something
stronger.
* Check whether the current form of auto-invalidation (though stricter)
provides any useful benefits over close-to-open consistency. If not, change
FUSE auto-invalidation to close-to-open consistency.

[1] http://nfs.sourceforge.net/#faq_a8

--- Additional comment from Raghavendra G on 2019-01-30 05:45:23 UTC ---

Miklos,

It would be helpful if you could comment on comment #7.

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2019-01-30 05:59:06 UTC ---

Note that a lease-based invalidation policy would be a complete solution, but
it will take some time to implement it and get it working in Glusterfs.

--- Additional comment from Worker Ant on 2019-02-02 03:08:22 UTC ---

REVIEW: https://review.gluster.org/22109 (mount/fuse: expose auto-invalidation
as a mount option) merged (#13) on master by Amar Tumballi

--- Additional comment from Miklos Szeredi on 2019-02-04 09:53:18 UTC ---

The underlying problem is that auto-invalidation cannot differentiate local
from remote modification based on mtime alone.

What NFS apparently does is refresh attributes immediately after a write (not
sure how often it does this; I guess not after each individual write).

FUSE maybe should do this if auto-invalidation is enabled, but if the
filesystem can do its own invalidation, possibly based on better information
than c/mtime, then that seems to be a better option.
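The ambiguity can be sketched like this: a local write bumps mtime on the server exactly as a remote write would, so a subsequent stat comparison purges pages the client itself just wrote. Refreshing the cached attributes right after the local write keeps the two in sync. Illustrative pseudologic only; all names are assumptions:

```python
# Sketch: an invalidation policy keyed on mtime alone cannot tell a local
# write from a remote one. Refreshing cached attributes immediately after
# a local write (as NFS apparently does) avoids the spurious purge.
# Names are illustrative, not FUSE/NFS code.

class AttrCache:
    def __init__(self, mtime):
        self.cached_mtime = mtime
        self.pages_valid = True

    def compare_and_maybe_purge(self, fs_mtime):
        # Auto-invalidation-style check: any mtime delta purges pages,
        # whether the change was local or remote.
        if fs_mtime != self.cached_mtime:
            self.pages_valid = False
            self.cached_mtime = fs_mtime
        return self.pages_valid

    def local_write(self, new_mtime, refresh_attrs):
        # A local write bumps mtime on the server. If we refresh the
        # cached attributes right away, a later stat sees no delta.
        if refresh_attrs:
            self.cached_mtime = new_mtime
        return new_mtime

# Without the refresh, a stat after a local write purges our own pages:
c = AttrCache(mtime=100)
m = c.local_write(new_mtime=101, refresh_attrs=False)
print(c.compare_and_maybe_purge(m))  # False: cache purged spuriously

# With the refresh, the cache survives the local write:
c = AttrCache(mtime=100)
m = c.local_write(new_mtime=101, refresh_attrs=True)
print(c.compare_and_maybe_purge(m))  # True: cache retained
```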

--- Additional comment from Worker Ant on 2019-02-08 12:14:58 UTC ---

REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to
--auto-invalidation in mount script) posted (#1) for review on master by
Raghavendra G

--- Additional comment from Worker Ant on 2019-02-09 18:41:54 UTC ---

REVIEW: https://review.gluster.org/22178 (mount/fuse: fix bug related to
--auto-invalidation in mount script) merged (#2) on master by Raghavendra G


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1664934
[Bug 1664934] glusterfs-fuse client not benefiting from page cache on read
after write
https://bugzilla.redhat.com/show_bug.cgi?id=1670710
[Bug 1670710] glusterfs-fuse client not benefiting from page cache on read
after write
https://bugzilla.redhat.com/show_bug.cgi?id=1672818
[Bug 1672818] GlusterFS 6.0 tracker