[Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel
Raghavendra Gowdappa
rgowdapp at redhat.com
Mon Mar 20 15:14:27 UTC 2017
On a tangential note, any idea on how the change of nodeid affects caching (page, dentry, attribute cache etc) in VFS? Does it throw away the cache? If yes, after a graph switch, there might be performance drop (till the cache gets built again).
It would also be worth exploring what are the other effects of a changed nodeid for a file/directory.
----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, Gluster-devel at gluster.org, "Tero Venetjoki
> (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou at nokia.com>,
> "Csaba Henk" <chenk at redhat.com>
> Sent: Monday, March 20, 2017 7:05:08 PM
> Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount
> by kernel
>
> +csaba
>
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > Gluster-devel at gluster.org, "Tero Venetjoki
> > (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia -
> > CN/Hangzhou)" <cynthia.zhou at nokia.com>
> > Sent: Monday, March 20, 2017 7:02:50 PM
> > Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option
> > chagned online will lead to un-expected umount
> > by kernel
> >
> >
> >
> > ----- Original Message -----
> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>,
> > > Gluster-devel at gluster.org
> > > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > > "Cynthia Zhou (Nokia - CN/Hangzhou)"
> > > <cynthia.zhou at nokia.com>, "Tero Venetjoki (Nokia - FI/Espoo)"
> > > <tero.venetjoki at nokia.com>
> > > Sent: Monday, March 20, 2017 8:44:30 AM
> > > Subject: Nodeid changed due to write-behind option chagned online will
> > > lead
> > > to un-expected umount by kernel
> > >
> > >
> > > Hi, GlusterFS expert,
> > >
> > > In our latest test, We found an issue potentially related to glusterfs.
> > > When
> > > I execute “gluster volume set <volume> performance.write-behind on/off”,
> > > some bind mounts will get lost, this issue is permanent!
> > > Test steps:
> > > a) mkdir -p /mnt/log/node1/test1/test2; mkdir -p
> > > /mnt/log/node2/test1/test2
> > > b) mkdir -p /mnt/test1/test2
> > > c) mount –bind /mnt/log/node1/test1 /mnt/test1
> > > d) mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2
> > > e) mount | cut -d " " -f 3|xargs stat
> > > f) gluster volume set log performance.write-behind on/off check
> > > /mnt/test1/test2 , could found this bind mount get lost
> > >
> > > We’ve consulted the linux kernel side guys, their explanations for this
> > > “bind
> > > mount lost” issue is when kernel side do lookup or stat(), it will find
> > > the
> > > nodeid has changed and trigger fuse_dentry_revalidate, and
> > > fuse_dentry_relivalidate() fails for some dentry and invalidates it which
> > > leads to unmounting.
> > >
> > > 88.912262 | 0) stat-437 | |
> > > fuse_dentry_revalidate [fuse]() {
> > > <...>
> > > 88.912264 | 0) stat-437 | |
> > > fuse_simple_request [fuse]() {
> > > <...>
> > > 88.921383 | 0) stat-437 | # 9119.255 us | }
> > > /*
> > > fuse_simple_request [fuse] */
> > > 88.921383 | 0) stat-437 | 0.093 us |
> > > dput();
> > > 88.921384 | 0) stat-437 | |
> > > fuse_queue_forget [fuse]() {
> > > <...>
> > > 88.921427 | 0) stat-437 | + 42.737 us | }
> > > 88.921427 | 0) stat-437 | # 9164.967 us | } /*
> > > fuse_dentry_revalidate [fuse] */
> > > 88.921427 | 0) stat-437 | |
> > > d_invalidate() {
> > > 88.921427 | 0) stat-437 | 0.048 us |
> > > _raw_spin_lock();
> > > 88.921428 | 0) stat-437 | |
> > > d_walk() {
> > > 88.921428 | 0) stat-437 | 0.040 us |
> > > _raw_spin_lock();
> > > 88.921428 | 0) stat-437 | 0.046 us |
> > > detach_and_collect();
> > > 88.921429 | 0) stat-437 | 0.723 us | }
> > > 88.921429 | 0) stat-437 | |
> > > __detach_mounts() {
> > >
> > > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the
> > > failure is clearly caused by (outarg.nodeid != get_node_id(inode))
> > > check:
> > >
> > > 7078187a795f8 (Miklos Szeredi 2014-12-12 09:49:05 +0100 230)
> > > if (!ret) {
> > > 6314efee3cfee (Miklos Szeredi 2013-10-01 16:41:22 +0200 231)
> > > fi = get_fuse_inode(inode);
> > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 232)
> > > if (outarg.nodeid != get_node_id(inode)) {
> > > 07e77dca8a1f1 (Miklos Szeredi 2010-12-07 20:16:56 +0100 233)
> > > fuse_queue_forget(fc, forget, outarg.nodeid,
> > > 1);
> > > e2a6b95236eba (Miklos Szeredi 2013-09-05 11:44:43 +0200 234)
> > > goto invalid;
> > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 235)
> > > }
> > >
> > > The latest kernel is doing the following in fs/fuse/dir.c:
> > >
> > > if (outarg.nodeid != get_node_id(inode)) {
> > > fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > > goto invalid;
> > > }
> > >
> > > invalid:
> > > ret = 0;
> > > goto out;
> > > }
> > >
> > > When We check the glusterfs code, I find that when execute command
> > > “gluster
> > > volume set <volume> performance.write-behind on/off”, the glusterfs
> > > client
> > > side graph will be re-built. And all xlators will be re-initialized ,
> > > So I am wondering if this will lead to node id change and cause kernel
> > > side
> > > do the umount, because I find in function fuse_graph_setup,
> > > inode_table_new
> > > is called, does it mean all inode will be changed?
> >
> > Yes nodeids change after graph switch. Nice catch btw :).
> >
> > > If so, we need consider the following solutions
> > > 1) is there some way to notify kernel side in graceful way so that
> > > kernel
> > > side can do update the nodeid which stored before?
> >
> > AFAIK, there is no such way.
> >
> > > 2) is there some way let itable only initialized once, that's mean, we
> > > only
> > > initialized while fuse_graph_setup is first called, and reused the
> > > itable
> > > address, while new graph is created and fuse_graph_setup called due to
> > > options changed such like performance.write_behind.
> >
> > We cannot use the inodes from old graph, because per xlator state from
> > xlators of new graph is not present in them. So, we build new inodes and
> > let
> > interested xlators in the new graph store the state. This is done by
> > sending
> > a lookup call for each nodeid the kernel is using (on demand basis. We
> > don't
> > do this if kernel doesn't do an operation on the inodes it has cached).
> >
> > The other possible approach is to build the state in old inodes instead of
> > creating new inodes (that way we don't change nodeid). But how scalable is
> > that approach is the question (what happens if there are too many graph
> > switches and functions like inode_ctx_get/set slow down due to too many
> > xlator states). Also, this is just a thought and I've to think more on
> > this.
> > Even if we do this, it will be a more disruptive change and needs more
> > testing to be accepted as a solution.
> >
> > > 3) is there some way let ino(which showed in Annex 1) not changed due
> > > to
> > > graph rebuild, I mean not through address, such like the mapping of the
> > > gfid, then even if the itable rebuild, glusterfs have the same nodeid
> > > which
> > > stored in kernel if solution 1 and solution 2 can't be used?
> >
> > Using plain gfids as nodeids would be the best solution. But kernel allows
> > only 64bit nodeids and gfids are bigger. Mapping requires one more
> > data-structure (say hash-table etc), that brings more overhead and I guess
> > we wanted to keep things simple (by just using address of inode object as
> > nodeid - Annex 1).
> >
> > > 4) any other solutions?
> >
> > I can't think of an easy simple solution now. Will update if I can find
> > something.
> >
> > >
> > > I would like to know your opinion about this issue and our solutions, and
> > > your reply will be appreciated!
> > >
> > > Annex 1: inode mapping with the address of node stored in itable.
> > > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
> > > {
> > > inode_t *inode = NULL;
> > > xlator_t *active_subvol = NULL;
> > >
> > > if (ino == 1) {
> > > active_subvol = fuse_active_subvol (fuse);
> > > if (active_subvol)
> > > inode = active_subvol->itable->root;
> > > } else {
> > > inode = (inode_t *) (unsigned long) ino;
> > > inode_ref (inode);
> > > }
> > >
> > > return inode;
> > > }
> > >
> > > Annex 2: kernel code changed for this issue, which mean for some kernel
> > > there is no un-expect umount issue, but has DOS-attack issue, and while
> > > the
> > > latest kernel will have the un-expect umount issue due to
> > > fuse_graph_setup
> > > rebuild.
> > > 1) no un-expected umount issue but has DOS-attack issue.
> > > kernel is doing the following in fs/fuse/dir.c:
> > >
> > > if (outarg.nodeid != get_node_id(inode)) {
> > > fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > > goto invalid;
> > > }
> > >
> > > invalid:
> > > ret = 0;
> > > if (check_submounts_and_drop(entry) != 0)
> > > ret = 1;
> > > goto out;
> > > }
> > >
> > >
> > > The above check_submounts_and_drop() call was added in September 2013:
> > >
> > > commit 46ea1562da792a94ee8391f7aed882eb7771e18c
> > > Author: Anand Avati <avati at redhat.com>
> > > Date: Thu Sep 5 11:44:44 2013 +0200
> > >
> > > fuse: drop dentry on failed revalidate
> > >
> > > Drop a subtree when we find that it has moved or been delated. This
> > > can
> > > be
> > > done as long as there are no submounts under this location.
> > >
> > > If the directory was moved and we come across the same directory in a
> > > future lookup it will be reconnected by d_materialise_unique().
> > >
> > > Signed-off-by: Anand Avati <avati at redhat.com>
> > > Signed-off-by: Miklos Szeredi <mszeredi at suse.cz>
> > > Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > >
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index 25c6cfe98801..0e6961aae6c0 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -259,6 +259,8 @@ out:
> > >
> > > invalid:
> > > ret = 0;
> > > + if (check_submounts_and_drop(entry) != 0)
> > > + ret = 1;
> > > goto out;
> > > }
> > >
> > > Ie. this version of fuse specifically avoided invalidating the dentry if
> > > there were directories mounted under it.
> > > (The check_submounts_and_drop() function is part of general vfs code, not
> > > fuse specific.)
> > >
> > > Then, in October 2013 there was a wider kernel change which caused
> > > invalid
> > > mounts to be deleted more aggressively
> > > for security reasons (elimination of certain kind of DOS attacks):
> > >
> > > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
> > > Author: Eric W. Biederman <ebiederman at twitter.com>
> > > Date: Tue Oct 1 18:33:48 2013 -0700
> > >
> > > vfs: Lazily remove mounts on unlinked files and directories.
> > >
> > > With the introduction of mount namespaces and bind mounts it became
> > > possible to access files and directories that on some paths are mount
> > > points but are not mount points on other paths. It is very confusing
> > > when rm -rf somedir returns -EBUSY simply because somedir is mounted
> > > somewhere else. With the addition of user namespaces allowing
> > > unprivileged mounts this condition has gone from annoying to allowing
> > > a DOS attack on other users in the system.
> > >
> > > The possibility for mischief is removed by updating the vfs to
> > > support
> > > rename, unlink and rmdir on a dentry that is a mountpoint and by
> > > lazily unmounting mountpoints on deleted dentries.
> > >
> > >
> > > The above commit changed check_submounts_and_drop() function so that it
> > > will
> > > not return non-zero error status any more if the entry is a mount point
> > > (or
> > > has mount points under it).
> > > Instead it simply unmounts those entries unconditionally.
> > >
> > >
> > > Finally, that call was removed from fuse code because it was not doing
> > > anything anymore:
> > >
> > > commit 9b053f3207e8887fed88162a339fdd4001abcdb2
> > > Author: Eric W. Biederman <ebiederm at xmission.com>
> > > Date: Thu Feb 13 09:34:30 2014 -0800
> > >
> > > vfs: Remove unnecessary calls of check_submounts_and_drop
> > >
> > > Now that check_submounts_and_drop can not fail and is called from
> > > d_invalidate there is no longer a need to call
> > > check_submounts_and_drom
> > > from filesystem d_revalidate methods so remove it.
> > >
> > > Reviewed-by: Miklos Szeredi <miklos at szeredi.hu>
> > > Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> > > Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> > >
> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > > index de1d84af9f7c..820efd74ca9f 100644
> > > --- a/fs/fuse/dir.c
> > > +++ b/fs/fuse/dir.c
> > > @@ -274,9 +274,6 @@ out:
> > >
> > >
> > > 2) has un-expected umount issue.
> > > invalid:
> > > ret = 0;
> > > -
> > > - if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) !=
> > > 0)
> > > - ret = 1;
> > > goto out;
> > > }
> > >
> > > Later, the check_submounts_and_drop() function was merged to
> > > d_invalidate()
> > > and ceased to exist.
> > >
> > > So, the unmounting does not happen in CentOS because it is using quite
> > > old
> > > and more relaxed kernel where filesystem code lets the mount exists even
> > > if
> > > glusterfs node id has changed.
> > > But newer kernels released during the last 3 years are using stricter
> > > mount
> > > handling policy which causes the unmounting (and most likely will not go
> > > back).
> > >
> > >
> > > Best regards,
> > > George
> > >
> > >
> > >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>
More information about the Gluster-devel
mailing list