[Gluster-devel] Update of work on fixing POSIX compliance issues in Glusterfs

Tue Oct 2 02:10:33 UTC 2018

All,

There have been issues related to POSIX compliance especially while running
Database workloads <https://bugzilla.redhat.com/show_bug.cgi?id=1512691> on
Glusterfs. Recently we've worked on fixing some of them. This mail is an
update on that effort.

The issues themselves can be classified into the following categories:

   - rename atomicity. When rename (src, dst) is done with dst already
   present, at no point in time should access to dst (like open, stat, chmod
   etc) fail. However, since the rename itself changes the association of
   dst-path from dst-inode to src-inode, inode based operations like open,
   stat etc that have already completed resolution of dst-path into dst-inode
   will end up not finding the dst-inode after rename, causing them to fail.
   VFS provides a workaround for this by doing the resolution of the path
   once again, provided the operation fails with ESTALE. There were some
   issues associated with this:
      - Glusterfs in some codepaths returned ENOENT even when the operation
      is on an inode and hence VFS didn't retry the resolution. Much of the
      discussion around this topic can be found at this mail thread
      <https://www.spinics.net/lists/gluster-devel/msg18981.html>. This
      issue has been fixed by various patches:
      <http://review.gluster.org/r/I2e752ca60dd8af1b989dd1d29c7b002ee58440b4>
      <http://review.gluster.org/r/I8d07d2ebb5a0da6c3ea478317442cb42f1797a4b>
      <http://review.gluster.org/r/Ia07e3cece404811703c8cfbac9b402ca5fe98c1e>
      - VFS retries exactly once. So, when the retry also fails with ESTALE,
      VFS gives up and syscalls like open fail. We've hit this class of
      issues in bugs like this one
      <https://bugzilla.redhat.com/show_bug.cgi?id=1543279>. The current
      understanding is that real world workloads won't hit this race and
      hence a single retry is enough. NFS relies on the same VFS mechanism
      and NFS developers say they've not hit bugs of this kind in real
      workloads.
      - DHT in rename codepaths acquires locks on src and dst inodes. If a
      parallel rename overwrote dst-inode, this locking used to fail and
      with it the rename operation. The issue is tracked and fixed as part
      of this bug <https://bugzilla.redhat.com/show_bug.cgi?id=1543279>.
   - Quorum imposition by afr in open fop. afr imposes Quorum on fd
   based operations, but not on open. This means operations can fail on a
   valid fd due to lack of Quorum. This is not fixed yet and is tracked in
   this bug <https://bugzilla.redhat.com/show_bug.cgi?id=1634664>.
   - Operations on a valid fd failing after the file was deleted by
   rename/unlink.
      - Fuse-bridge used to randomly pick fds in fstat codepath as earlier
      versions of fuse api didn't provide filehandle as argument of Getattr
      request. This resulted in fstat failures when the file was deleted
      through rename/unlink after it had been successfully opened. This is
      fixed in this patch
      <http://review.gluster.org/r/I67eebbf5407ca725ed111fbda4181ead10d03f6d>
      and this patch
      <http://review.gluster.org/r/I88dd29b3607cd2594eee9d72a1637b5346c8d49c>.
      - performance/open-behind fakes an open call. Due to bugs in
      rename/unlink codepath, it couldn't open the file before the file was
      deleted due to rename or unlink. Fixed by this patch
      <https://review.gluster.org/#/c/glusterfs/+/20428/>.
   - Stale (meta)data cached by various performance xlators
      - md-cache used to cache stale fstat. Fixed by this patch
      <http://review.gluster.org/r/Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77>.
      - write-behind did not provide correct stat in rename cbk when writes
      on src were cached in write-behind. Fixed by this patch
      <http://review.gluster.org/r/Ic9f2adf8edd0b58ebaf661f3a8d0ca086bc63111>.
      - write-behind did not provide correct stat in readdirp response.
      Fixed by this patch
      <http://review.gluster.org/r/I12d167bf450648baa64be1cbe1ca0fddf5379521>.
      - Ordering of operations done on different fds by write-behind. It
      considered operations on different fds as independent. So an fstat
      done after a write had completed, when the two operations were on
      different fds, didn't fetch a stat that reflected the write. This is
      fixed by this patch
      <http://review.gluster.org/r/Iee748cebb6d2a5b32f9328aff2b5b7cbf6c52c05>.
      - readdir-ahead used to provide stale stat. The issue is fixed by
      this patch
      <http://review.gluster.org/Ia27ff49a61922e88c73a1547ad8aacc9968a69df>.
      - Most of the caching xlators rely on ctime/mtime of stat to find out
      whether the current (meta)data is newer or staler than the cached
      (meta)data. However ctime/mtime provided by replica/afr is not always
      consistent as it can pick stat from any of its subvolumes. This issue
      can be solved once ctime generator
      <https://github.com/gluster/glusterfs/issues/208> becomes
      production ready and is enabled by default. Note that ctime generator
      xlator can also help in fixing issues with tar
      <https://bugzilla.redhat.com/show_bug.cgi?id=1179169>, ElasticSearch
      <https://bugzilla.redhat.com/show_bug.cgi?id=1379568> etc that rely
      on correctness of ctime. Also, I still see a rare pgbench failure even
      after all the fixes to bz 1512691 due to unreliable ctime/mtime from
      underlying xlators.
      - Though this issue
      <https://bugzilla.redhat.com/show_bug.cgi?id=1601166> is not really a
      consistency issue, it hindered performance of read-ahead as fstats
      flushed read-ahead cache. Note that fstats also have an impact on
      write-behind when reads and writes are interleaved on a file as fstats
      wait on cached-writes in write-behind. A bug
      <https://bugzilla.redhat.com/show_bug.cgi?id=1563508> has been filed
      on fuse kernel module for implementation of noatime feature so that
      fstats are not issued during reads.
   - AMQP needed flock -w to work. Tracked as part of this issue
   <https://github.com/gluster/glusterfs/issues/465>.
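
The ESTALE retry described under rename atomicity above can be modelled in
a few lines. This is only a toy Python sketch, not kernel or Glusterfs code:
ToyFS, vfs_open and demo are hypothetical names invented for illustration,
and the "race" callback simply stands in for a rename that lands between
path resolution and the inode based open.

```python
import errno


class ToyFS:
    """Toy filesystem: a path -> inode-number table plus a set of live inodes."""

    def __init__(self, inode_errno=errno.ESTALE):
        self.dentries = {}      # path -> inode number
        self.inodes = set()     # inode numbers that still exist
        self.next_ino = 1
        # errno returned when an inode based operation finds no inode.
        self.inode_errno = inode_errno

    def create(self, path):
        ino = self.next_ino
        self.next_ino += 1
        self.inodes.add(ino)
        self.dentries[path] = ino
        return ino

    def rename(self, src, dst):
        # dst-path now points to src-inode; the old dst-inode goes away.
        old = self.dentries.get(dst)
        self.dentries[dst] = self.dentries.pop(src)
        if old is not None:
            self.inodes.discard(old)

    def resolve(self, path):
        if path not in self.dentries:
            raise OSError(errno.ENOENT, "path missing")
        return self.dentries[path]

    def open_inode(self, ino):
        if ino not in self.inodes:
            raise OSError(self.inode_errno, "inode missing")
        return ino


def vfs_open(fs, path, race=None):
    """Resolve path, then open the inode; redo resolution once on ESTALE."""
    for attempt in range(2):
        ino = fs.resolve(path)
        if attempt == 0 and race is not None:
            race()  # a rename sneaks in between resolution and open
        try:
            return fs.open_inode(ino)
        except OSError as err:
            # Only ESTALE triggers the retry, and only one retry is made.
            if err.errno != errno.ESTALE or attempt == 1:
                raise


def demo(inode_errno):
    """Return True if open(dst) survives a racing rename, else the errno name."""
    fs = ToyFS(inode_errno)
    fs.create("dst")
    src_ino = fs.create("src")
    try:
        opened = vfs_open(fs, "dst", race=lambda: fs.rename("src", "dst"))
        return opened == src_ino
    except OSError as err:
        return errno.errorcode[err.errno]
```

In this model demo(errno.ESTALE) succeeds because the second resolution
finds src-inode behind dst-path, while demo(errno.ENOENT) fails outright
since ENOENT never triggers the retry.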

The issues listed above are either fixed or work is in progress to fix
them. There are still more issues which have not been worked on yet and
we'll provide updates on them in the future. Some of the prominent known
issues (the list is not exhaustive) are:

   - Missing dentries
   <https://bugzilla.redhat.com/show_bug.cgi?id=1563848> when
   performance.parallel-readdir is enabled. Note that it's a cache coherence
   issue; the dentries and files are still intact on the backend.
   - Evaluate and initiate discussion on how to propagate errors
   encountered during commit of cached writes, to application. A wider
   discussion (across different filesystems) on this topic is found at:
   https://lwn.net/Articles/752063/. Thanks to @csabahenk for pointing out
   this discussion.
   - Sanitize the stack to return ESTALE for inode missing and ENOENT for
   path missing. For example, storage/posix sometimes returns ENOENT for
   scenarios where gfid handles are missing, even though the correct error
   is ESTALE. Failing to return ESTALE can throw off the retry logic in VFS.
   An open failing with ENOENT is wrong as open is a gfid based operation.
   An easy fix would be to have fuse-bridge convert all ENOENT errors to
   ESTALE in _all_ inode based fop responses. Currently it's done only in
   open(dir) codepath. This has to be extended to other codepaths too.
   - Lookup and rename in DHT are not atomic. rename is a compound
   operation in DHT which involves some hardlinking and in the rename window
   both src and dst are visible as hardlinks to each other. If lookup samples
   src or dst in this window, it'll perceive the file to have hardlinks.
   - Stale dentries of src in inode-tables (of fuse, protocol/server) after
   successful rename of src to dst. This can be caused by a lookup on src
   racing with rename. This issue is not very different from the issue of
   caching xlators needing a way of identifying which among the two
   (meta)data is latest. ctime generator xlator can be used here to compare
   ctime of parent directory as recorded in itable with that in the lookup
   response and make sure only the latest dentry is linked into inode table.
      - Note that stale dentries can cause corruption in applications like
      SAS, pgbench that rely on the pattern of creating a tmp file, writing
      to it and renaming it to the file to be consumed by another thread.
      Since src resolves to dst inode due to stale dentries having same stat
      of dst, the dst file ends up corrupted as writes of next cycle end up
      on the file being consumed for previous cycle. So, this is an
      important issue to be fixed.
   - There are a few bugs on SAS
      - issues with fcntl locking
      <https://bugzilla.redhat.com/show_bug.cgi?id=1630735>.
      - From my limited conversation with people who use/work on SAS, it
      seems to rely on fsync as a checkpoint after which the changes by one
      job should be visible to other jobs which could be running on
      different mounts on a different machine. This means fsync on one
      mount should update caches of other mounts too with updated data.
      This functionality is currently missing in Glusterfs.
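
For reference, the "create a tmp file, write to it and rename it" pattern
mentioned in the stale dentries item above looks roughly like the following
Python sketch. It is not taken from SAS or pgbench; publish, consume and
run_cycles are hypothetical names, and reusing the same tmp name on every
cycle is an assumption made here because that is the case where a stale
dentry for src would send the next cycle's writes to the inode now behind
dst.

```python
import os
import tempfile


def publish(path, data):
    """Write data to a tmp file next to path, then rename it over path."""
    tmp_path = path + ".tmp"   # same tmp name reused every cycle (assumption)
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the data out before publishing it
    # rename(2) semantics: readers see either the old or the new file.
    os.replace(tmp_path, path)


def consume(path):
    """Read the currently published file in full."""
    with open(path, "rb") as published:
        return published.read()


def run_cycles(count):
    """Publish count versions into a scratch directory; return what was read."""
    seen = []
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "result.dat")
        for cycle in range(count):
            publish(target, b"cycle-%d" % cycle)
            seen.append(consume(target))
    return seen
```

On a filesystem with correct rename semantics every consume returns exactly
the data of the latest publish. Running the same loop with writer and
reader on a Glusterfs mount is one way to probe for the corruption
described above.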

regards,
Raghavendra