[Bugs] [Bug 1433829] New: Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel

Mon Mar 20 03:24:24 UTC 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1433829

            Bug ID: 1433829
           Summary: Nodeid changed due to write-behind option chagned
                    online will lead to un-expected umount by kernel
           Product: GlusterFS
           Version: mainline
         Component: write-behind
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: george.lian at nokia.com
                CC: bugs at gluster.org



Description of problem:

when we created some bind mount on the location which mounted by glusterfs, 
if we change the performance.write-behind option online, the bind mount will be
un-expected disappear, and the issue is permanent!
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1)    mkdir -p /mnt/log/node1/test1/test2; mkdir -p /mnt/log/node2/test1/test2
2)    mkdir -p /mnt/test1/test2
3)    mount –bind /mnt/log/node1/test1 /mnt/test1
4)    mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2  
5)    mount | cut -d " " -f 3|xargs stat
6)    gluster volume set log performance.write-behind on/off    check
/mnt/test1/test2 , could found this bind mount get lost


Actual results:
all mount point should be not lost.

Expected results:
bind mount get lost!

Additional info:

some investigation information by us FYI:)

We’ve consulted the linux kernel side guys, their explanations for this “bind
mount lost” issue is when kernel side do lookup or stat(), it will find the
nodeid has changed and trigger fuse_dentry_revalidate, and
fuse_dentry_relivalidate() fails for some dentry and invalidates it which leads
to unmounting.

   88.912262 |   0)    stat-437    |               |               
fuse_dentry_revalidate [fuse]() {
<...>
   88.912264 |   0)    stat-437    |               |                 
fuse_simple_request [fuse]() {
<...>
   88.921383 |   0)    stat-437    | # 9119.255 us |                  } /*
fuse_simple_request [fuse] */
   88.921383 |   0)    stat-437    |   0.093 us    |                  dput();
   88.921384 |   0)    stat-437    |               |                 
fuse_queue_forget [fuse]() {
<...>
   88.921427 |   0)    stat-437    | + 42.737 us   |                  }
   88.921427 |   0)    stat-437    | # 9164.967 us |                } /*
fuse_dentry_revalidate [fuse] */
   88.921427 |   0)    stat-437    |               |               
d_invalidate() {
   88.921427 |   0)    stat-437    |   0.048 us    |                 
_raw_spin_lock();
   88.921428 |   0)    stat-437    |               |                  d_walk()
{
   88.921428 |   0)    stat-437    |   0.040 us    |                   
_raw_spin_lock();
   88.921428 |   0)    stat-437    |   0.046 us    |                   
detach_and_collect();
   88.921429 |   0)    stat-437    |   0.723 us    |                  }
   88.921429 |   0)    stat-437    |               |                 
__detach_mounts() {

fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the failure
is clearly caused by  (outarg.nodeid != get_node_id(inode)) check:

7078187a795f8 (Miklos Szeredi     2014-12-12 09:49:05 +0100  230)              
if (!ret) {
6314efee3cfee (Miklos Szeredi     2013-10-01 16:41:22 +0200  231)              
        fi = get_fuse_inode(inode);
9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  232)              
        if (outarg.nodeid != get_node_id(inode)) {
07e77dca8a1f1 (Miklos Szeredi     2010-12-07 20:16:56 +0100  233)              
                fuse_queue_forget(fc, forget, outarg.nodeid, 1);
e2a6b95236eba (Miklos Szeredi     2013-09-05 11:44:43 +0200  234)              
                goto invalid;
9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  235)              
        }

The latest  kernel is doing the following in fs/fuse/dir.c:

            if (outarg.nodeid != get_node_id(inode)) {
                fuse_queue_forget(fc, forget, outarg.nodeid, 1);
                goto invalid;
            }

invalid:
    ret = 0;
    goto out;
}

When We check the glusterfs code, I find that when execute command “gluster
volume set <volume> performance.write-behind on/off”, the glusterfs client side
graph will be re-built. And all xlators will be re-initialized , 
So I am wondering if this will lead to node id change and cause kernel side do
the umount, because I find in function fuse_graph_setup, inode_table_new is
called, does it mean all inode will be changed?
If so, we need consider the following solutions
 1) is there some way to notify kernel side in graceful way so that kernel side
can do update the nodeid which stored before?
 2) is there some way let itable only initialized once, that's mean, we only
initialized while fuse_graph_setup is first called, and reused the itable
address, while new graph is created and fuse_graph_setup called due to options
changed such like performance.write_behind.
 3) is there some way let ino(which showed in Annex 1)  not changed due to
graph rebuild, I mean not through address, such like the mapping of the gfid,
then even if the itable rebuild, glusterfs have the same nodeid which stored in
kernel if solution 1 and solution 2 can't be used?
 4) any other solutions?

I would like to know your opinion about this issue and our solutions, and your
reply will be appreciated!

Annex 1:  inode mapping with the address of node stored in itable.
fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
{
        inode_t  *inode = NULL;
        xlator_t *active_subvol = NULL;

        if (ino == 1) {
                active_subvol = fuse_active_subvol (fuse);
                if (active_subvol)
                        inode = active_subvol->itable->root;
        } else {
                inode = (inode_t *) (unsigned long) ino;
                inode_ref (inode);
        }

        return inode;
}

Annex 2:  kernel code changed for this issue, which mean for some kernel there
is no un-expect umount issue, but has DOS-attack issue, and while the latest
kernel will have the un-expect umount issue due to fuse_graph_setup rebuild.
1) no un-expected umount issue but has DOS-attack issue.
kernel is doing the following in fs/fuse/dir.c:

            if (outarg.nodeid != get_node_id(inode)) {
                fuse_queue_forget(fc, forget, outarg.nodeid, 1);
                goto invalid;
            }

invalid:
    ret = 0;
    if (check_submounts_and_drop(entry) != 0)
        ret = 1;
    goto out;
}


The above check_submounts_and_drop() call was added in September 2013:

commit 46ea1562da792a94ee8391f7aed882eb7771e18c
Author: Anand Avati <avati at redhat.com>
Date:   Thu Sep 5 11:44:44 2013 +0200

    fuse: drop dentry on failed revalidate

    Drop a subtree when we find that it has moved or been delated.  This can be
    done as long as there are no submounts under this location.

    If the directory was moved and we come across the same directory in a
    future lookup it will be reconnected by d_materialise_unique().

    Signed-off-by: Anand Avati <avati at redhat.com>
    Signed-off-by: Miklos Szeredi <mszeredi at suse.cz>
    Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 25c6cfe98801..0e6961aae6c0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -259,6 +259,8 @@ out:

 invalid:
        ret = 0;
+       if (check_submounts_and_drop(entry) != 0)
+               ret = 1;
        goto out;
 }

Ie. this version of fuse specifically avoided invalidating the dentry if there
were directories mounted under it.
(The check_submounts_and_drop() function is part of general vfs code, not fuse
specific.)

Then, in October 2013 there was a wider kernel change which caused invalid
mounts to be deleted more aggressively
for security reasons (elimination of certain kind of DOS attacks):

commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
Author: Eric W. Biederman <ebiederman at twitter.com>
Date:   Tue Oct 1 18:33:48 2013 -0700

    vfs: Lazily remove mounts on unlinked files and directories.

    With the introduction of mount namespaces and bind mounts it became
    possible to access files and directories that on some paths are mount
    points but are not mount points on other paths.  It is very confusing
    when rm -rf somedir returns -EBUSY simply because somedir is mounted
    somewhere else.  With the addition of user namespaces allowing
    unprivileged mounts this condition has gone from annoying to allowing
    a DOS attack on other users in the system.

    The possibility for mischief is removed by updating the vfs to support
    rename, unlink and rmdir on a dentry that is a mountpoint and by
    lazily unmounting mountpoints on deleted dentries.


The above commit changed check_submounts_and_drop() function so that it will
not return non-zero error status any more if the entry is a mount point (or has
mount points under it).
Instead it simply unmounts those entries unconditionally.


Finally, that call was removed from fuse code because it was not doing anything
anymore:

commit 9b053f3207e8887fed88162a339fdd4001abcdb2
Author: Eric W. Biederman <ebiederm at xmission.com>
Date:   Thu Feb 13 09:34:30 2014 -0800

    vfs: Remove unnecessary calls of check_submounts_and_drop

    Now that check_submounts_and_drop can not fail and is called from
    d_invalidate there is no longer a need to call check_submounts_and_drom
    from filesystem d_revalidate methods so remove it.

    Reviewed-by: Miklos Szeredi <miklos at szeredi.hu>
    Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
    Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index de1d84af9f7c..820efd74ca9f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -274,9 +274,6 @@ out:


2) has un-expected umount issue.
 invalid:
        ret = 0;
-
-       if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
-               ret = 1;
    goto out;
 }

Later, the check_submounts_and_drop() function was merged to d_invalidate() and
ceased to exist.

So, the unmounting does not happen in CentOS because it is using quite old and
more relaxed kernel where filesystem code lets the mount exists even if
glusterfs node id has changed.
But newer kernels released during the last 3 years are using stricter mount
handling policy which causes the unmounting (and most likely will not go back).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.