[Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel

Mon Mar 20 15:14:27 UTC 2017

On a tangential note, any idea on how the change of nodeid affects caching (page, dentry, attribute cache etc) in VFS? Does it throw away the cache? If yes, after a graph switch, there might be performance drop (till the cache gets built again).

It would also be worth exploring what are the other effects of a changed nodeid for a file/directory.

----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, Gluster-devel at gluster.org, "Tero Venetjoki
> (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou at nokia.com>,
> "Csaba Henk" <chenk at redhat.com>
> Sent: Monday, March 20, 2017 7:05:08 PM
> Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount
> by kernel
> 
> +csaba
> 
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > Gluster-devel at gluster.org, "Tero Venetjoki
> > (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia -
> > CN/Hangzhou)" <cynthia.zhou at nokia.com>
> > Sent: Monday, March 20, 2017 7:02:50 PM
> > Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option
> > chagned online will lead to un-expected umount
> > by kernel
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>,
> > > Gluster-devel at gluster.org
> > > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > > "Cynthia Zhou (Nokia - CN/Hangzhou)"
> > > <cynthia.zhou at nokia.com>, "Tero Venetjoki (Nokia - FI/Espoo)"
> > > <tero.venetjoki at nokia.com>
> > > Sent: Monday, March 20, 2017 8:44:30 AM
> > > Subject: Nodeid changed due to write-behind option chagned online will
> > > lead
> > > to un-expected umount by kernel
> > > 
> > > 
> > > Hi, GlusterFS expert,
> > > 
> > > In our latest test, We found an issue potentially related to glusterfs.
> > > When
> > > I execute “gluster volume set <volume> performance.write-behind on/off”,
> > > some bind mounts will get lost, this issue is permanent!
> > > Test steps:
> > > a)	mkdir -p /mnt/log/node1/test1/test2; mkdir -p
> > > /mnt/log/node2/test1/test2
> > > b)	mkdir -p /mnt/test1/test2
> > > c)	mount –bind /mnt/log/node1/test1 /mnt/test1
> > > d)	mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2
> > > e)	mount | cut -d " " -f 3|xargs stat
> > > f)	gluster volume set log performance.write-behind on/off    check
> > > /mnt/test1/test2 , could found this bind mount get lost
> > > 
> > > We’ve consulted the linux kernel side guys, their explanations for this
> > > “bind
> > > mount lost” issue is when kernel side do lookup or stat(), it will find
> > > the
> > > nodeid has changed and trigger fuse_dentry_revalidate, and
> > > fuse_dentry_relivalidate() fails for some dentry and invalidates it which
> > > leads to unmounting.
> > > 
> > >    88.912262 |   0)    stat-437    |               |
> > >    fuse_dentry_revalidate [fuse]() {
> > > <...>
> > >    88.912264 |   0)    stat-437    |               |
> > >    fuse_simple_request [fuse]() {
> > > <...>
> > >    88.921383 |   0)    stat-437    | # 9119.255 us |                  }
> > >    /*
> > >    fuse_simple_request [fuse] */
> > >    88.921383 |   0)    stat-437    |   0.093 us    |
> > >    dput();
> > >    88.921384 |   0)    stat-437    |               |
> > >    fuse_queue_forget [fuse]() {
> > > <...>
> > >    88.921427 |   0)    stat-437    | + 42.737 us   |                  }
> > >    88.921427 |   0)    stat-437    | # 9164.967 us |                } /*
> > >    fuse_dentry_revalidate [fuse] */
> > >    88.921427 |   0)    stat-437    |               |
> > >    d_invalidate() {
> > >    88.921427 |   0)    stat-437    |   0.048 us    |
> > >    _raw_spin_lock();
> > >    88.921428 |   0)    stat-437    |               |
> > >    d_walk() {
> > >    88.921428 |   0)    stat-437    |   0.040 us    |
> > >    _raw_spin_lock();
> > >    88.921428 |   0)    stat-437    |   0.046 us    |
> > >    detach_and_collect();
> > >    88.921429 |   0)    stat-437    |   0.723 us    |                  }
> > >    88.921429 |   0)    stat-437    |               |
> > >    __detach_mounts() {
> > > 
> > > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the
> > > failure is clearly caused by  (outarg.nodeid != get_node_id(inode))
> > > check:
> > > 
> > > 7078187a795f8 (Miklos Szeredi     2014-12-12 09:49:05 +0100  230)
> > > if (!ret) {
> > > 6314efee3cfee (Miklos Szeredi     2013-10-01 16:41:22 +0200  231)
> > > fi = get_fuse_inode(inode);
> > > 9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  232)
> > > if (outarg.nodeid != get_node_id(inode)) {
> > > 07e77dca8a1f1 (Miklos Szeredi     2010-12-07 20:16:56 +0100  233)
> > > fuse_queue_forget(fc, forget, outarg.nodeid,
> > > 1);
> > > e2a6b95236eba (Miklos Szeredi     2013-09-05 11:44:43 +0200  234)
> > > goto invalid;
> > > 9e6268db496a2 (Miklos Szeredi     2005-09-09 13:10:29 -0700  235)
> > > }
> > > 
> > > The latest  kernel is doing the following in fs/fuse/dir.c:
> > > 
> > > 			if (outarg.nodeid != get_node_id(inode)) {
> > > 				fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > > 				goto invalid;
> > > 			}
> > > 
> > > invalid:
> > > 	ret = 0;
> > > 	goto out;
> > > }
> > > 
> > > When We check the glusterfs code, I find that when execute command
> > > “gluster
> > > volume set <volume> performance.write-behind on/off”, the glusterfs
> > > client
> > > side graph will be re-built. And all xlators will be re-initialized ,
> > > So I am wondering if this will lead to node id change and cause kernel
> > > side
> > > do the umount, because I find in function fuse_graph_setup,
> > > inode_table_new
> > > is called, does it mean all inode will be changed?
> > 
> > Yes nodeids change after graph switch. Nice catch btw :).
> > 
> > > If so, we need consider the following solutions
> > >  1) is there some way to notify kernel side in graceful way so that
> > >  kernel
> > >  side can do update the nodeid which stored before?
> > 
> > AFAIK, there is no such way.
> > 
> > >  2) is there some way let itable only initialized once, that's mean, we
> > >  only
> > >  initialized while fuse_graph_setup is first called, and reused the
> > >  itable
> > >  address, while new graph is created and fuse_graph_setup called due to
> > >  options changed such like performance.write_behind.
> > 
> > We cannot use the inodes from old graph, because per xlator state from
> > xlators of new graph is not present in them. So, we build new inodes and
> > let
> > interested xlators in the new graph store the state. This is done by
> > sending
> > a lookup call for each nodeid the kernel is using (on demand basis. We
> > don't
> > do this if kernel doesn't do an operation on the inodes it has cached).
> > 
> > The other possible approach is to build the state in old inodes instead of
> > creating new inodes (that way we don't change nodeid). But how scalable is
> > that approach is the question (what happens if there are too many graph
> > switches and functions like inode_ctx_get/set slow down due to too many
> > xlator states). Also, this is just a thought and I've to think more on
> > this.
> > Even if we do this, it will be a more disruptive change and needs more
> > testing to be accepted as a solution.
> > 
> > >  3) is there some way let ino(which showed in Annex 1)  not changed due
> > >  to
> > >  graph rebuild, I mean not through address, such like the mapping of the
> > >  gfid, then even if the itable rebuild, glusterfs have the same nodeid
> > >  which
> > >  stored in kernel if solution 1 and solution 2 can't be used?
> > 
> > Using plain gfids as nodeids would be the best solution. But kernel allows
> > only 64bit nodeids and gfids are bigger. Mapping requires one more
> > data-structure (say hash-table etc), that brings more overhead and I guess
> > we wanted to keep things simple (by just using address of inode object as
> > nodeid - Annex 1).
> > 
> > >  4) any other solutions?
> > 
> > I can't think of an easy simple solution now. Will update if I can find
> > something.
> > 
> > > 
> > > I would like to know your opinion about this issue and our solutions, and
> > > your reply will be appreciated!
> > > 
> > > Annex 1:  inode mapping with the address of node stored in itable.
> > > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
> > > {
> > >         inode_t  *inode = NULL;
> > >         xlator_t *active_subvol = NULL;
> > > 
> > >         if (ino == 1) {
> > >                 active_subvol = fuse_active_subvol (fuse);
> > >                 if (active_subvol)
> > >                         inode = active_subvol->itable->root;
> > >         } else {
> > >                 inode = (inode_t *) (unsigned long) ino;
> > >                 inode_ref (inode);
> > >         }
> > > 
> > >         return inode;
> > > }
> > > 
> > > Annex 2:  kernel code changed for this issue, which mean for some kernel
> > > there is no un-expect umount issue, but has DOS-attack issue, and while
> > > the
> > > latest kernel will have the un-expect umount issue due to
> > > fuse_graph_setup
> > > rebuild.
> > > 1) no un-expected umount issue but has DOS-attack issue.
> > > kernel is doing the following in fs/fuse/dir.c:
> > > 
> > > 			if (outarg.nodeid != get_node_id(inode)) {
> > > 				fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > > 				goto invalid;
> > > 			}
> > > 
> > > invalid:
> > > 	ret = 0;
> > > 	if (check_submounts_and_drop(entry) != 0)
> > > 		ret = 1;
> > > 	goto out;
> > > }
> > > 
> > > 
> > > The above check_submounts_and_drop() call was added in September 2013:
> > > 
> > > commit 46ea1562da792a94ee8391f7aed882eb7771e18c
> > > Author: Anand Avati <avati at redhat.com>
> > > Date:   Thu Sep 5 11:44:44 2013 +0200
> > > 
> > >     fuse: drop dentry on failed revalidate
> > > 
> > >     Drop a subtree when we find that it has moved or been delated.  This
> > >     can
> > >     be
> > >     done as long as there are no submounts under this location.
> > > 
> > >     If the directory was moved and we come across the same directory in a
> > >     future lookup it will be reconnected by d_materialise_unique().
> > > 
> > >     Signed-off-by: Anand Avati <avati at redhat.com>
> > >     Signed-off-by: Miklos Szeredi <mszeredi at suse.cz>
> > >     Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > > 
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index 25c6cfe98801..0e6961aae6c0 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -259,6 +259,8 @@ out:
> > > 
> > >  invalid:
> > >         ret = 0;
> > > +       if (check_submounts_and_drop(entry) != 0)
> > > +               ret = 1;
> > >         goto out;
> > >  }
> > > 
> > > Ie. this version of fuse specifically avoided invalidating the dentry if
> > > there were directories mounted under it.
> > > (The check_submounts_and_drop() function is part of general vfs code, not
> > > fuse specific.)
> > > 
> > > Then, in October 2013 there was a wider kernel change which caused
> > > invalid
> > > mounts to be deleted more aggressively
> > > for security reasons (elimination of certain kind of DOS attacks):
> > > 
> > > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
> > > Author: Eric W. Biederman <ebiederman at twitter.com>
> > > Date:   Tue Oct 1 18:33:48 2013 -0700
> > > 
> > >     vfs: Lazily remove mounts on unlinked files and directories.
> > > 
> > >     With the introduction of mount namespaces and bind mounts it became
> > >     possible to access files and directories that on some paths are mount
> > >     points but are not mount points on other paths.  It is very confusing
> > >     when rm -rf somedir returns -EBUSY simply because somedir is mounted
> > >     somewhere else.  With the addition of user namespaces allowing
> > >     unprivileged mounts this condition has gone from annoying to allowing
> > >     a DOS attack on other users in the system.
> > > 
> > >     The possibility for mischief is removed by updating the vfs to
> > >     support
> > >     rename, unlink and rmdir on a dentry that is a mountpoint and by
> > >     lazily unmounting mountpoints on deleted dentries.
> > > 
> > > 
> > > The above commit changed check_submounts_and_drop() function so that it
> > > will
> > > not return non-zero error status any more if the entry is a mount point
> > > (or
> > > has mount points under it).
> > > Instead it simply unmounts those entries unconditionally.
> > > 
> > > 
> > > Finally, that call was removed from fuse code because it was not doing
> > > anything anymore:
> > > 
> > > commit 9b053f3207e8887fed88162a339fdd4001abcdb2
> > > Author: Eric W. Biederman <ebiederm at xmission.com>
> > > Date:   Thu Feb 13 09:34:30 2014 -0800
> > > 
> > >     vfs: Remove unnecessary calls of check_submounts_and_drop
> > > 
> > >     Now that check_submounts_and_drop can not fail and is called from
> > >     d_invalidate there is no longer a need to call
> > >     check_submounts_and_drom
> > >     from filesystem d_revalidate methods so remove it.
> > > 
> > >     Reviewed-by: Miklos Szeredi <miklos at szeredi.hu>
> > >     Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> > >     Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > > 
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index de1d84af9f7c..820efd74ca9f 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -274,9 +274,6 @@ out:
> > > 
> > > 
> > > 2) has un-expected umount issue.
> > >  invalid:
> > >         ret = 0;
> > > -
> > > -       if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) !=
> > > 0)
> > > -               ret = 1;
> > > 	goto out;
> > >  }
> > > 
> > > Later, the check_submounts_and_drop() function was merged to
> > > d_invalidate()
> > > and ceased to exist.
> > > 
> > > So, the unmounting does not happen in CentOS because it is using quite
> > > old
> > > and more relaxed kernel where filesystem code lets the mount exists even
> > > if
> > > glusterfs node id has changed.
> > > But newer kernels released during the last 3 years are using stricter
> > > mount
> > > handling policy which causes the unmounting (and most likely will not go
> > > back).
> > > 
> > > 
> > > Best regards,
> > > George
> > > 
> > > 
> > >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>