[Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel

Mon Mar 20 13:35:08 UTC 2017

+csaba

----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, Gluster-devel at gluster.org, "Tero Venetjoki
> (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou at nokia.com>
> Sent: Monday, March 20, 2017 7:02:50 PM
> Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount
> by kernel
> 
> 
> 
> ----- Original Message -----
> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, Gluster-devel at gluster.org
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > "Cynthia Zhou (Nokia - CN/Hangzhou)"
> > <cynthia.zhou at nokia.com>, "Tero Venetjoki (Nokia - FI/Espoo)"
> > <tero.venetjoki at nokia.com>
> > Sent: Monday, March 20, 2017 8:44:30 AM
> > Subject: Nodeid changed due to write-behind option chagned online will lead
> > to un-expected umount by kernel
> > 
> > 
> > Hi, GlusterFS expert,
> > 
> > In our latest test, We found an issue potentially related to glusterfs.
> > When
> > I execute “gluster volume set <volume> performance.write-behind on/off”,
> > some bind mounts will get lost, this issue is permanent!
> > Test steps:
> > a)	mkdir -p /mnt/log/node1/test1/test2; mkdir -p /mnt/log/node2/test1/test2
> > b)	mkdir -p /mnt/test1/test2
> > c)	mount –bind /mnt/log/node1/test1 /mnt/test1
> > d)	mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2
> > e)	mount | cut -d " " -f 3|xargs stat
> > f)	gluster volume set log performance.write-behind on/off    check
> > /mnt/test1/test2 , could found this bind mount get lost
> > 
> > We’ve consulted the linux kernel side guys, their explanations for this
> > “bind
> > mount lost” issue is when kernel side do lookup or stat(), it will find the
> > nodeid has changed and trigger fuse_dentry_revalidate, and
> > fuse_dentry_relivalidate() fails for some dentry and invalidates it which
> > leads to unmounting.
> > 
> >    88.912262 |   0)    stat-437    |               |
> >    fuse_dentry_revalidate [fuse]() {
> > <...>
> >    88.912264 |   0)    stat-437    |               |
> >    fuse_simple_request [fuse]() {
> > <...>
> >    88.921383 |   0)    stat-437    | # 9119.255 us |                  } /*
> >    fuse_simple_request [fuse] */
> >    88.921383 |   0)    stat-437    |   0.093 us    |
> >    dput();
> >    88.921384 |   0)    stat-437    |               |
> >    fuse_queue_forget [fuse]() {
> > <...>
> >    88.921427 |   0)    stat-437    | + 42.737 us   |                  }
> >    88.921427 |   0)    stat-437    | # 9164.967 us |                } /*
> >    fuse_dentry_revalidate [fuse] */
> >    88.921427 |   0)    stat-437    |               |
> >    d_invalidate() {
> >    88.921427 |   0)    stat-437    |   0.048 us    |
> >    _raw_spin_lock();
> >    88.921428 |   0)    stat-437    |               |
> >    d_walk() {
> >    88.921428 |   0)    stat-437    |   0.040 us    |
> >    _raw_spin_lock();
> >    88.921428 |   0)    stat-437    |   0.046 us    |
> >    detach_and_collect();
> >    88.921429 |   0)    stat-437    |   0.723 us    |                  }
> >    88.921429 |   0)    stat-437    |               |
> >    __detach_mounts() {
> > 
> > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the
> > failure is clearly caused by  (outarg.nodeid != get_node_id(inode)) check:
> > 
> > 7078187a795f8 (Miklos Szeredi     2014-12-12 09:49:05 +0100  230)
> > if (!ret) {
> > 6314efee3cfee (Miklos Szeredi     2013-10-01 16:41:22 +0200  231)
> > fi = get_fuse_inode(inode);
> > 9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  232)
> > if (outarg.nodeid != get_node_id(inode)) {
> > 07e77dca8a1f1 (Miklos Szeredi     2010-12-07 20:16:56 +0100  233)
> > fuse_queue_forget(fc, forget, outarg.nodeid,
> > 1);
> > e2a6b95236eba (Miklos Szeredi     2013-09-05 11:44:43 +0200  234)
> > goto invalid;
> > 9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  235)
> > }
> > 
> > The latest  kernel is doing the following in fs/fuse/dir.c:
> > 
> > 			if (outarg.nodeid != get_node_id(inode)) {
> > 				fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > 				goto invalid;
> > 			}
> > 
> > invalid:
> > 	ret = 0;
> > 	goto out;
> > }
> > 
> > When We check the glusterfs code, I find that when execute command “gluster
> > volume set <volume> performance.write-behind on/off”, the glusterfs client
> > side graph will be re-built. And all xlators will be re-initialized ,
> > So I am wondering if this will lead to node id change and cause kernel side
> > do the umount, because I find in function fuse_graph_setup, inode_table_new
> > is called, does it mean all inode will be changed?
> 
> Yes nodeids change after graph switch. Nice catch btw :).
> 
> > If so, we need consider the following solutions
> >  1) is there some way to notify kernel side in graceful way so that kernel
> >  side can do update the nodeid which stored before?
> 
> AFAIK, there is no such way.
> 
> >  2) is there some way let itable only initialized once, that's mean, we
> >  only
> >  initialized while fuse_graph_setup is first called, and reused the itable
> >  address, while new graph is created and fuse_graph_setup called due to
> >  options changed such like performance.write_behind.
> 
> We cannot use the inodes from old graph, because per xlator state from
> xlators of new graph is not present in them. So, we build new inodes and let
> interested xlators in the new graph store the state. This is done by sending
> a lookup call for each nodeid the kernel is using (on demand basis. We don't
> do this if kernel doesn't do an operation on the inodes it has cached).
> 
> The other possible approach is to build the state in old inodes instead of
> creating new inodes (that way we don't change nodeid). But how scalable is
> that approach is the question (what happens if there are too many graph
> switches and functions like inode_ctx_get/set slow down due to too many
> xlator states). Also, this is just a thought and I've to think more on this.
> Even if we do this, it will be a more disruptive change and needs more
> testing to be accepted as a solution.
> 
> >  3) is there some way let ino(which showed in Annex 1)  not changed due to
> >  graph rebuild, I mean not through address, such like the mapping of the
> >  gfid, then even if the itable rebuild, glusterfs have the same nodeid
> >  which
> >  stored in kernel if solution 1 and solution 2 can't be used?
> 
> Using plain gfids as nodeids would be the best solution. But kernel allows
> only 64bit nodeids and gfids are bigger. Mapping requires one more
> data-structure (say hash-table etc), that brings more overhead and I guess
> we wanted to keep things simple (by just using address of inode object as
> nodeid - Annex 1).
> 
> >  4) any other solutions?
> 
> I can't think of an easy simple solution now. Will update if I can find
> something.
> 
> > 
> > I would like to know your opinion about this issue and our solutions, and
> > your reply will be appreciated!
> > 
> > Annex 1:  inode mapping with the address of node stored in itable.
> > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
> > {
> >         inode_t  *inode = NULL;
> >         xlator_t *active_subvol = NULL;
> > 
> >         if (ino == 1) {
> >                 active_subvol = fuse_active_subvol (fuse);
> >                 if (active_subvol)
> >                         inode = active_subvol->itable->root;
> >         } else {
> >                 inode = (inode_t *) (unsigned long) ino;
> >                 inode_ref (inode);
> >         }
> > 
> >         return inode;
> > }
> > 
> > Annex 2:  kernel code changed for this issue, which mean for some kernel
> > there is no un-expect umount issue, but has DOS-attack issue, and while the
> > latest kernel will have the un-expect umount issue due to fuse_graph_setup
> > rebuild.
> > 1) no un-expected umount issue but has DOS-attack issue.
> > kernel is doing the following in fs/fuse/dir.c:
> > 
> > 			if (outarg.nodeid != get_node_id(inode)) {
> > 				fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > 				goto invalid;
> > 			}
> > 
> > invalid:
> > 	ret = 0;
> > 	if (check_submounts_and_drop(entry) != 0)
> > 		ret = 1;
> > 	goto out;
> > }
> > 
> > 
> > The above check_submounts_and_drop() call was added in September 2013:
> > 
> > commit 46ea1562da792a94ee8391f7aed882eb7771e18c
> > Author: Anand Avati <avati at redhat.com>
> > Date:   Thu Sep 5 11:44:44 2013 +0200
> > 
> >     fuse: drop dentry on failed revalidate
> > 
> >     Drop a subtree when we find that it has moved or been delated.  This
> >     can
> >     be
> >     done as long as there are no submounts under this location.
> > 
> >     If the directory was moved and we come across the same directory in a
> >     future lookup it will be reconnected by d_materialise_unique().
> > 
> >     Signed-off-by: Anand Avati <avati at redhat.com>
> >     Signed-off-by: Miklos Szeredi <mszeredi at suse.cz>
> >     Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 25c6cfe98801..0e6961aae6c0 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -259,6 +259,8 @@ out:
> > 
> >  invalid:
> >         ret = 0;
> > +       if (check_submounts_and_drop(entry) != 0)
> > +               ret = 1;
> >         goto out;
> >  }
> > 
> > Ie. this version of fuse specifically avoided invalidating the dentry if
> > there were directories mounted under it.
> > (The check_submounts_and_drop() function is part of general vfs code, not
> > fuse specific.)
> > 
> > Then, in October 2013 there was a wider kernel change which caused invalid
> > mounts to be deleted more aggressively
> > for security reasons (elimination of certain kind of DOS attacks):
> > 
> > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
> > Author: Eric W. Biederman <ebiederman at twitter.com>
> > Date:   Tue Oct 1 18:33:48 2013 -0700
> > 
> >     vfs: Lazily remove mounts on unlinked files and directories.
> > 
> >     With the introduction of mount namespaces and bind mounts it became
> >     possible to access files and directories that on some paths are mount
> >     points but are not mount points on other paths.  It is very confusing
> >     when rm -rf somedir returns -EBUSY simply because somedir is mounted
> >     somewhere else.  With the addition of user namespaces allowing
> >     unprivileged mounts this condition has gone from annoying to allowing
> >     a DOS attack on other users in the system.
> > 
> >     The possibility for mischief is removed by updating the vfs to support
> >     rename, unlink and rmdir on a dentry that is a mountpoint and by
> >     lazily unmounting mountpoints on deleted dentries.
> > 
> > 
> > The above commit changed check_submounts_and_drop() function so that it
> > will
> > not return non-zero error status any more if the entry is a mount point (or
> > has mount points under it).
> > Instead it simply unmounts those entries unconditionally.
> > 
> > 
> > Finally, that call was removed from fuse code because it was not doing
> > anything anymore:
> > 
> > commit 9b053f3207e8887fed88162a339fdd4001abcdb2
> > Author: Eric W. Biederman <ebiederm at xmission.com>
> > Date:   Thu Feb 13 09:34:30 2014 -0800
> > 
> >     vfs: Remove unnecessary calls of check_submounts_and_drop
> > 
> >     Now that check_submounts_and_drop can not fail and is called from
> >     d_invalidate there is no longer a need to call check_submounts_and_drom
> >     from filesystem d_revalidate methods so remove it.
> > 
> >     Reviewed-by: Miklos Szeredi <miklos at szeredi.hu>
> >     Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> >     Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > 
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index de1d84af9f7c..820efd74ca9f 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -274,9 +274,6 @@ out:
> > 
> > 
> > 2) has un-expected umount issue.
> >  invalid:
> >         ret = 0;
> > -
> > -       if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
> > -               ret = 1;
> > 	goto out;
> >  }
> > 
> > Later, the check_submounts_and_drop() function was merged to d_invalidate()
> > and ceased to exist.
> > 
> > So, the unmounting does not happen in CentOS because it is using quite old
> > and more relaxed kernel where filesystem code lets the mount exists even if
> > glusterfs node id has changed.
> > But newer kernels released during the last 3 years are using stricter mount
> > handling policy which causes the unmounting (and most likely will not go
> > back).
> > 
> > 
> > Best regards,
> > George
> > 
> > 
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel