[Gluster-users] Questions about gluster/fuse, page cache, and coherence

Sat Mar 23 20:50:10 UTC 2013

Please find answers below -

On Mon, Mar 18, 2013 at 12:03 AM, nlxswig <nlxswig at 126.com> wrote:

> Good questions,
> Why is there no reply?
>
> At 2011-08-16 04:53:50,"Patrick J. LoPresti" <lopresti at gmail.com> wrote:
> >(FUSE developers:  Although my questions are specifically about
> >Gluster, I suspect most of the answers have more to do with FUSE, so I
> >figure this is on-topic for your list.  If I figured wrong, I
> >apologize.)
> >
> >I have done quite a bit of searching looking for answers to these
> >questions, and I just cannot find them...
> >
> >I think I understand how the Linux page cache works for an ordinary
> >local (non-FUSE) partition.  Specifically:
> >
> >1) When my application calls read(), it reads from the page cache.  If
> >the page(s) are not resident, the kernel puts my application to sleep
> >and gets busy reading them from disk.
> >
> >2) When my application calls write(), it writes to the page cache.
> >The kernel will -- eventually, when it feels like it -- flush those
> >dirty pages to disk.
> >
> >3) When my application calls mmap(), page cache pages are mapped into
> >my process's address space, allowing me to create a dirty page or read
> >a page by accessing memory.
> >
> >4) When the kernel reads a page, it might decide to read some other
> >pages, depending on the underlying block device's read-ahead
> >parameters.  I can control these via "blockdev".  On the write side, I
> >can exercise some control with various VM parameters (dirty_ratio
> >etc).  I can also use calls like fsync() and posix_fadvise() to exert
> >some control over page cache management at the application level.
> >
> >
> >My question is pretty simple.  If you had to re-write the above four
> >points for a Gluster file system, what would they look like?  If it
> >matters, I am specifically interested in Gluster 3.2.2 on Suse Linux
> >Enterprise Server 11 SP1 (Linux 2.6.32.43 + whatever Suse does to
> >their kernels).
> >
> >Does Gluster use the page cache on read()?  On write()?  If so, how
> >does it ensure coherency between clients?  If not, how does mmap()
> >work (or does it not work)?
>
Gluster, like any FUSE filesystem, does not use the page cache directly.
It serves read/write requests by either reading from or writing to
/dev/fuse. The read/write implementations of the /dev/fuse "device"
perform the copy. Where they copy to/from depends on whether the file was
opened with O_DIRECT and/or whether "direct_io" was enabled on the open
file. For "normal" IO, the copy happens to/from the page cache. For
O_DIRECT or "direct_io", the page cache is bypassed completely, but care is
taken to flush the copy of the data in the page cache -- as a best-effort
attempt -- to give a consistent "view" of the file between two
applications (on the SAME mount point ONLY) that open the file in
different modes (O_DIRECT and otherwise).

As long as all the mounts are using the "direct_io" mount option, coherency
between mounts is really in the hands of the filesystem (like gluster), as
FUSE is acting as a pure pass-through. On the other hand, if "normal" IO
is happening, utilizing the page cache, then re-reads can always get served
directly from the page-cache without the filesystem (like gluster) even
knowing that a read() request was issued by a process. The filesystem could
however use the reverse invalidation calls to invalidate the pages in all
mounts if a write is happening from elsewhere (the co-ordination needs to
happen in the filesystem, FUSE only provides the invalidation primitives)
-- Gluster does NOT do this yet.

There is also a flag in the FUSE open() operation that indicates whether or
not to keep the page cache of the file. By default gluster asks FUSE to
purge the page cache in open(). This gives you close-to-open consistency:
if an open() from a process is performed strictly after a close() from any
other process, even on a different machine, then you are guaranteed to see
all the content written by that other process. This is very similar to the
consistency offered by the NFS (v3) client in Linux.
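
As a rough illustration of the close-to-open pattern described above, here
is a minimal Python sketch. The function names and the temporary path are
hypothetical; on a real setup the writer and the reader would run on
different clients against the same gluster volume, and the reader's open()
would have to be ordered after the writer's close() by some external means.

```python
import os
import tempfile


def write_then_close(path: str, data: bytes) -> None:
    """Writer side: write the data, then close() the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        # Under close-to-open consistency, close() is the point after
        # which a later open() elsewhere is expected to see the data.
        os.close(fd)


def open_then_read(path: str) -> bytes:
    """Reader side: open() strictly after the writer's close(), then read."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical path; a local temp dir stands in for a gluster mount.
    demo = os.path.join(tempfile.mkdtemp(), "c2o-demo.txt")
    write_then_close(demo, b"hello from the writer\n")
    print(open_then_read(demo))
```

Run locally this only shows the ordering; it says nothing about what two
clients with the file open at the same time would observe.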

In summary, this means that by default you get close-to-open consistency
with gluster, but if you require strict consistency between two
applications on different clients which have the file open at the same
time, then you need BOTH a and b:

a. Either the apps open with O_DIRECT, or glusterfs is mounted with
--enable-direct-io, to keep the page cache out of the way of consistency.

b. Either the apps open with O_DSYNC (or O_SYNC), or write-behind is
disabled in the gluster volume configuration.
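
The application-side half of conditions a and b could look roughly like the
following Python sketch (Linux only). The helper names and the 4096-byte
block size are assumptions made for illustration; O_DIRECT generally needs
block-aligned buffers, lengths, and offsets, and some filesystems reject
the flag at open(), so the sketch falls back to O_DSYNC alone in that case.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment for O_DIRECT buffers, sizes and offsets


def open_strict(path: str, use_o_direct: bool = True) -> int:
    """Open for writing with O_DSYNC, plus O_DIRECT when it is accepted."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_DSYNC
    if use_o_direct:
        try:
            return os.open(path, flags | os.O_DIRECT, 0o644)
        except OSError:
            pass  # O_DIRECT rejected here; fall back to O_DSYNC only
    return os.open(path, flags, 0o644)


def write_block(fd: int, payload: bytes) -> int:
    """Write one zero-padded block at offset 0 from a page-aligned buffer."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        buf.write(payload)
        return os.pwrite(fd, buf, 0)  # always writes the full BLOCK bytes
    finally:
        buf.close()


if __name__ == "__main__":
    # Hypothetical path; a local temp dir stands in for a gluster mount.
    demo = os.path.join(tempfile.mkdtemp(), "strict-demo.bin")
    fd = open_strict(demo)
    print(write_block(fd, b"record"))
    os.close(fd)
```

Readers that need the same guarantee would open with O_DIRECT as well, so
that their reads are not served from a stale page cache.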

With respect to mmap(): getting strict consistency between the "shared"
mapped regions of two applications on different machines is pretty much
impossible. The filesystem/kernel is notified only the first time an app
attempts to "write" to the mapped region, via a page fault; once that first
write has marked the page dirty, nobody is notified when the app modifies
other parts of that page. There are four combinations - private vs shared,
and mmap on a "direct_io" file vs a "normal" file.

shared and direct_io - not even supported (fails with ENODEV)

shared and normal - unless you call msync(), data is not flushed to the
server (i.e., other client mounts cannot receive it when they ask for that
region's data).

private (either direct_io or normal) - works, but in gluster you are not
guaranteed to see modifications by another client in a region that is
already mapped and has been accessed once (this can be made to work, more
or less, if the proper reverse invalidation wiring is done in the
distributed filesystem)
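
For the "shared and normal" case, a minimal Python sketch of writing
through a shared mapping and then forcing the flush might look like this.
The function name is hypothetical; mmap.flush() is Python's wrapper around
msync(). It only covers the writer's side and does not make other clients'
already-mapped pages see the change.

```python
import mmap
import os
import tempfile


def update_shared_mapping(path: str, offset: int, data: bytes) -> None:
    """Modify a file through a MAP_SHARED mapping, then msync() the change."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file, so the file must not be empty.
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[offset:offset + len(data)] = data  # dirties the page in memory
            m.flush()  # msync(): push the dirty page out to the filesystem


if __name__ == "__main__":
    # Hypothetical path; a local temp dir stands in for a gluster mount.
    demo = os.path.join(tempfile.mkdtemp(), "mmap-demo.bin")
    with open(demo, "wb") as f:
        f.write(b"\0" * mmap.PAGESIZE)
    update_shared_mapping(demo, 10, b"xyz")
    with open(demo, "rb") as f:
        print(f.read()[10:13])
```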

> >What read-ahead will the kernel use?  Does posix_fadvise(...,
> >POSIX_FADV_WILLNEED) have any effect on a Gluster file system?
>
Read-ahead (and posix_fadvise) kicks in only if reads are going through
the page cache. So you should not be mounting with direct-io mode enabled
or opening with O_DIRECT.
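
A small Python sketch of issuing the hint on a normally opened (buffered)
file follows. The function name is hypothetical, and the call is only
advisory: whether any read-ahead actually happens is up to the kernel and
the filesystem underneath.

```python
import os
import tempfile


def open_with_prefetch_hint(path: str) -> int:
    """Open for buffered reads and hint that the whole file will be needed."""
    fd = os.open(path, os.O_RDONLY)  # no O_DIRECT, so the page cache is used
    try:
        # offset=0, len=0 covers the file from the start to its end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    except OSError:
        os.close(fd)
        raise
    return fd


if __name__ == "__main__":
    # Hypothetical path; a local temp dir stands in for a gluster mount.
    demo = os.path.join(tempfile.mkdtemp(), "fadvise-demo.txt")
    with open(demo, "wb") as f:
        f.write(b"hello world")
    fd = open_with_prefetch_hint(demo)
    print(os.read(fd, 5))
    os.close(fd)
```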

Thanks,
Avati