[Bugs] [Bug 1426044] New: read-ahead not working if open-behind is turned on

bugzilla at redhat.com bugzilla at redhat.com
Thu Feb 23 05:52:19 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1426044

            Bug ID: 1426044
           Summary: read-ahead not working if open-behind is turned on
           Product: Red Hat Gluster Storage
           Version: 3.2
         Component: read-ahead
          Keywords: Triaged
          Severity: medium
          Priority: medium
          Assignee: rgowdapp at redhat.com
          Reporter: csaba at redhat.com
        QA Contact: rhinduja at redhat.com
                CC: bengland at redhat.com, bugs at gluster.org,
                    hchen at redhat.com, mpillai at redhat.com,
                    pgurusid at redhat.com, ppai at redhat.com,
                    rgowdapp at redhat.com, rhs-bugs at redhat.com,
                    rkavunga at redhat.com, sasundar at redhat.com,
                    smohan at redhat.com
        Depends On: 1084508, 1393419



+++ This bug was initially created as a clone of Bug #1393419 +++

+++ This bug was initially created as a clone of Bug #1084508 +++

Description of problem:

open-behind xlator is turned on by default when creating a new volume. This
appears to prevent read-ahead from working.

Version-Release number of selected component (if applicable):

release-3.4 branch.

How reproducible:


Steps to Reproduce:

1. create a volume called vol4
[root@bd-vm ~]# mkdir /test/vol4
[root@bd-vm ~]# gluster volume create vol4 bd-vm:/test/vol4 force
volume create: vol4: success: please start the volume to access data
[root@bd-vm ~]# gluster volume start vol4
volume start: vol4: success
[root@bd-vm ~]# gluster volume info vol4

Volume Name: vol4
Type: Distribute
Volume ID: 85af878b-0119-4f99-b01f-caf4577cb4d4
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: bd-vm:/test/vol4

2. mount the volume

[root@bd-vm ~]# mkdir /mnt4
[root@bd-vm ~]# mount -t glusterfs localhost:/vol4 /mnt4

3. write a 4GB file (= RAM size)
[root@bd-vm fio]# dd if=/dev/zero of=/mnt4/4g bs=1M count=4K
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0355 s, 186 MB/s

4. first read, with read-ahead-page-count = 1: got throughput 99.7 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 1
volume set: success
[root@bd-vm ~]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 43.0906 s, 99.7 MB/s

5. second read, with read-ahead-page-count = 16: got throughput 107 MB/s, not
much difference

[root@bd-vm ~]# gluster volume set vol4 performance.read-ahead-page-count 16
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 40.1117 s, 107 MB/s

6. third read, with read-ahead-page-count = 16 and open-behind off: got
throughput 269 MB/s

[root@bd-vm ~]# gluster volume set vol4 performance.open-behind off
volume set: success
[root@bd-vm fio]# dd if=/mnt4/4g bs=1M of=/dev/null
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 15.982 s, 269 MB/s


Actual results:

read-ahead has no impact on sequential read and re-read

Expected results:

read-ahead should improve sequential re-read

Additional info:

I built gluster from git source as of Mar 25, 2014, branch release-3.4.

--- Additional comment from Ben England on 2014-05-30 06:34:51 EDT ---

To assess priority, how many folks are using the open-behind volume option?
The open-behind translator is an optimization for small-file workloads,
correct? Has anyone measured performance with open-behind on vs. off? Does it
help?

--- Additional comment from Raghavendra G on 2016-01-20 02:49:43 EST ---

@Poornima/Anuradha,

Can you take a look at this bug?

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2016-04-26 00:47:56 EDT ---

I think the issue is open-behind's use of anonymous fds. See the following
option in open-behind:

        { .key  = {"read-after-open"},
          .type = GF_OPTION_TYPE_BOOL,
          .default_value = "no",
          .description = "read is sent only after actual open happens and real "
                         "fd is obtained, instead of doing on anonymous fd "
                         "(similar to write)",
        },

The read-ahead cache is per-fd, stored in the fd's context. If open-behind
uses anonymous fds for reads, the read is never sent on the fd that
read-ahead saw during the application's open, so there is no read-ahead
cache.
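
A minimal, self-contained sketch of the per-fd mechanism described above. The
names (ra_ctx_t, ra_open, ra_read) are illustrative stand-ins for read-ahead's
use of fd_ctx_set()/fd_ctx_get(), not actual xlator code:

/* illustrative simulation: the cache lives in the context of the fd it
 * was created for, so a read on any other fd cannot use it */
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t bytes_cached; } ra_ctx_t;

#define MAX_FDS 16
static ra_ctx_t *fd_ctx[MAX_FDS];        /* one context slot per fd */

/* open handler: attach a fresh read-ahead context to this fd */
static void ra_open(int fd) { fd_ctx[fd] = calloc(1, sizeof(ra_ctx_t)); }

/* read handler: only an fd that has a context can be served from cache */
static const char *ra_read(int fd) {
    return fd_ctx[fd] ? "read-ahead cache available"
                      : "no fd context, no read-ahead";
}

int main(void) {
    int app_fd  = 3;     /* fd returned to the application by open() */
    int anon_fd = 7;     /* anonymous fd that open-behind reads on */

    ra_open(app_fd);     /* read-ahead sees the application's open */
    printf("app fd:  %s\n", ra_read(app_fd));    /* cache path */
    printf("anon fd: %s\n", ra_read(anon_fd));   /* cache bypassed */
    return 0;
}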

Can you retry the tests with the open-behind option "read-after-open" set to
"yes"?

[root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
volume set: success
[root@unused glusterfs]# gluster volume info

Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
Status: Created
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export-2/dist-rep1
Brick2: booradley:/home/export-2/dist-rep2
Brick3: booradley:/home/export-2/dist-rep3
Brick4: booradley:/home/export-2/dist-rep4
Options Reconfigured:
performance.read-after-open: on
performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 09:27:15 EDT ---

(In reply to Raghavendra G from comment #3)
> I think the issue is open-behind's use of anonymous fds. See the following
> option in open-behind:
> 
>         { .key  = {"read-after-open"},
>           .type = GF_OPTION_TYPE_BOOL,
>           .default_value = "no",
>           .description = "read is sent only after actual open happens and real "
>                          "fd is obtained, instead of doing on anonymous fd "
>                          "(similar to write)",
>         },
> 
> The read-ahead cache is per-fd, stored in the fd's context. If open-behind
> uses anonymous fds for reads, the read is never sent on the fd that
> read-ahead saw during the application's open, so there is no read-ahead
> cache.

This RCA is not valid. During a read request, the fd is stored in the local,
and in the response path the cache is stored on that fd from the local. So,
even though open-behind sends the read on an anonymous fd, read-ahead stores
the cache in the fd passed to the application/kernel.
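
A small sketch of the wind/unwind pattern being described; local_t and the
function names are illustrative, not read-ahead's actual code:

#include <stdio.h>

typedef struct { int app_fd; } local_t;    /* per-call ("local") state */

/* response path: the cache is attached to the fd saved in local,
 * not to whatever fd the read actually went out on */
static void ra_readv_cbk(local_t *local, int wire_fd) {
    printf("read issued on fd %d, cache stored on fd %d\n",
           wire_fd, local->app_fd);
}

/* request path: remember the application's fd in local before winding
 * the read down; open-behind may substitute an anonymous fd below us */
static void ra_readv(int app_fd) {
    local_t local = { .app_fd = app_fd };
    int anon_fd = -2;                /* stand-in for an anonymous fd */
    ra_readv_cbk(&local, anon_fd);   /* unwind with the saved local */
}

int main(void) {
    ra_readv(3);
    return 0;
}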

> 
> Can you retry the tests with the open-behind option "read-after-open" set
> to "yes"?
> 
> [root@unused glusterfs]# gluster volume set dist-rep performance.read-after-open on
> volume set: success
> [root@unused glusterfs]# gluster volume info
>  
> Volume Name: dist-rep
> Type: Distributed-Replicate
> Volume ID: 201492ff-9eb8-48f9-a647-59b89853e3d3
> Status: Created
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: booradley:/home/export-2/dist-rep1
> Brick2: booradley:/home/export-2/dist-rep2
> Brick3: booradley:/home/export-2/dist-rep3
> Brick4: booradley:/home/export-2/dist-rep4
> Options Reconfigured:
> performance.read-after-open: on
> performance.readdir-ahead: on

--- Additional comment from Raghavendra G on 2016-04-26 14:27:44 EDT ---

(In reply to Raghavendra G from comment #4)
> (In reply to Raghavendra G from comment #3)
> > I think the issue is open-behind's use of anonymous fds. See the
> > following option in open-behind:
> > 
> >         { .key  = {"read-after-open"},
> >           .type = GF_OPTION_TYPE_BOOL,
> >           .default_value = "no",
> >           .description = "read is sent only after actual open happens and real "
> >                          "fd is obtained, instead of doing on anonymous fd "
> >                          "(similar to write)",
> >         },
> > 
> > The read-ahead cache is per-fd, stored in the fd's context. If
> > open-behind uses anonymous fds for reads, the read is never sent on the
> > fd that read-ahead saw during the application's open, so there is no
> > read-ahead cache.
> 
> This RCA is not valid. During a read request, the fd is stored in the
> local, and in the response path the cache is stored on that fd from the
> local. So, even though open-behind sends the read on an anonymous fd,
> read-ahead stores the cache in the fd passed to the application/kernel.

Well, the core of the RCA - read-ahead is disabled because open-behind uses
anonymous fds - is still valid :). What was wrong was the mechanism through
which read-ahead gets bypassed. In our current configuration read-ahead is
loaded below open-behind, so with "read-after-open" turned off, read-ahead
never receives an open. Without an open, read-ahead doesn't create a context
in the fd, and that context is where all its cache is stored.

There are two solutions to this problem:
1. Load read-ahead as an ancestor of open-behind, so that read-ahead sees the
open sent by the application before open-behind intercepts it (sketched
below).
2. Turn the "read-after-open" option on, so that open-behind performs a real
open.
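
As a sketch of what solution 1 would mean for the client-side graph: volgen
would generate read-ahead above open-behind, so the application's open passes
through read-ahead before open-behind can absorb it. Volume names and the
vol4-io-cache subvolume below are illustrative, not actual volgen output:

volume vol4-open-behind
    type performance/open-behind
    subvolumes vol4-io-cache
end-volume

volume vol4-read-ahead
    type performance/read-ahead
    subvolumes vol4-open-behind
end-volume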

--- Additional comment from Worker Ant on 2016-11-09 09:05:37 EST ---

REVIEW: http://review.gluster.org/15811 (glusterd/volgen: Changing the order of
read-ahead xlator) posted (#1) for review on master by mohammed rafi kc
(rkavunga at redhat.com)

--- Additional comment from Prashanth Pai on 2016-11-22 07:56:29 EST ---

I don't know if this is relevant, but just a heads up: I remember the option
"root-squash" was dependent on or tied to the option "read-after-open". If you
plan to change "read-after-open", please take a look at the code that handles
"server.root-squash" too.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1084508
[Bug 1084508] read-ahead not working if open-behind is turned on
https://bugzilla.redhat.com/show_bug.cgi?id=1393419
[Bug 1393419] read-ahead not working if open-behind is turned on