[Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel
Lian, George (Nokia - CN/Hangzhou)
george.lian at nokia.com
Wed Mar 22 07:17:59 UTC 2017
Hi, Csaba,
Could you please give some comments for this issues?
Thanks & Best Regards,
George
-----Original Message-----
From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
Sent: Monday, March 20, 2017 9:35 PM
To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
Cc: Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Gluster-devel at gluster.org; Venetjoki, Tero (Nokia - FI/Espoo) <tero.venetjoki at nokia.com>; Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou at nokia.com>; Csaba Henk <chenk at redhat.com>
Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel
+csaba
----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, Gluster-devel at gluster.org, "Tero Venetjoki
> (Nokia - FI/Espoo)" <tero.venetjoki at nokia.com>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou at nokia.com>
> Sent: Monday, March 20, 2017 7:02:50 PM
> Subject: Re: [Gluster-devel] Nodeid changed due to write-behind option chagned online will lead to un-expected umount
> by kernel
>
>
>
> ----- Original Message -----
> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, Gluster-devel at gluster.org
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > "Cynthia Zhou (Nokia - CN/Hangzhou)"
> > <cynthia.zhou at nokia.com>, "Tero Venetjoki (Nokia - FI/Espoo)"
> > <tero.venetjoki at nokia.com>
> > Sent: Monday, March 20, 2017 8:44:30 AM
> > Subject: Nodeid changed due to write-behind option chagned online will lead
> > to un-expected umount by kernel
> >
> >
> > Hi, GlusterFS expert,
> >
> > In our latest test, We found an issue potentially related to glusterfs.
> > When
> > I execute “gluster volume set <volume> performance.write-behind on/off”,
> > some bind mounts will get lost, this issue is permanent!
> > Test steps:
> > a) mkdir -p /mnt/log/node1/test1/test2; mkdir -p /mnt/log/node2/test1/test2
> > b) mkdir -p /mnt/test1/test2
> > c) mount –bind /mnt/log/node1/test1 /mnt/test1
> > d) mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2
> > e) mount | cut -d " " -f 3|xargs stat
> > f) gluster volume set log performance.write-behind on/off check
> > /mnt/test1/test2 , could found this bind mount get lost
> >
> > We’ve consulted the linux kernel side guys, their explanations for this
> > “bind
> > mount lost” issue is when kernel side do lookup or stat(), it will find the
> > nodeid has changed and trigger fuse_dentry_revalidate, and
> > fuse_dentry_relivalidate() fails for some dentry and invalidates it which
> > leads to unmounting.
> >
> > 88.912262 | 0) stat-437 | |
> > fuse_dentry_revalidate [fuse]() {
> > <...>
> > 88.912264 | 0) stat-437 | |
> > fuse_simple_request [fuse]() {
> > <...>
> > 88.921383 | 0) stat-437 | # 9119.255 us | } /*
> > fuse_simple_request [fuse] */
> > 88.921383 | 0) stat-437 | 0.093 us |
> > dput();
> > 88.921384 | 0) stat-437 | |
> > fuse_queue_forget [fuse]() {
> > <...>
> > 88.921427 | 0) stat-437 | + 42.737 us | }
> > 88.921427 | 0) stat-437 | # 9164.967 us | } /*
> > fuse_dentry_revalidate [fuse] */
> > 88.921427 | 0) stat-437 | |
> > d_invalidate() {
> > 88.921427 | 0) stat-437 | 0.048 us |
> > _raw_spin_lock();
> > 88.921428 | 0) stat-437 | |
> > d_walk() {
> > 88.921428 | 0) stat-437 | 0.040 us |
> > _raw_spin_lock();
> > 88.921428 | 0) stat-437 | 0.046 us |
> > detach_and_collect();
> > 88.921429 | 0) stat-437 | 0.723 us | }
> > 88.921429 | 0) stat-437 | |
> > __detach_mounts() {
> >
> > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the
> > failure is clearly caused by (outarg.nodeid != get_node_id(inode)) check:
> >
> > 7078187a795f8 (Miklos Szeredi 2014-12-12 09:49:05 +0100 230)
> > if (!ret) {
> > 6314efee3cfee (Miklos Szeredi 2013-10-01 16:41:22 +0200 231)
> > fi = get_fuse_inode(inode);
> > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 232)
> > if (outarg.nodeid != get_node_id(inode)) {
> > 07e77dca8a1f1 (Miklos Szeredi 2010-12-07 20:16:56 +0100 233)
> > fuse_queue_forget(fc, forget, outarg.nodeid,
> > 1);
> > e2a6b95236eba (Miklos Szeredi 2013-09-05 11:44:43 +0200 234)
> > goto invalid;
> > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 235)
> > }
> >
> > The latest kernel is doing the following in fs/fuse/dir.c:
> >
> > if (outarg.nodeid != get_node_id(inode)) {
> > fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > goto invalid;
> > }
> >
> > invalid:
> > ret = 0;
> > goto out;
> > }
> >
> > When We check the glusterfs code, I find that when execute command “gluster
> > volume set <volume> performance.write-behind on/off”, the glusterfs client
> > side graph will be re-built. And all xlators will be re-initialized ,
> > So I am wondering if this will lead to node id change and cause kernel side
> > do the umount, because I find in function fuse_graph_setup, inode_table_new
> > is called, does it mean all inode will be changed?
>
> Yes nodeids change after graph switch. Nice catch btw :).
>
> > If so, we need consider the following solutions
> > 1) is there some way to notify kernel side in graceful way so that kernel
> > side can do update the nodeid which stored before?
>
> AFAIK, there is no such way.
>
> > 2) is there some way let itable only initialized once, that's mean, we
> > only
> > initialized while fuse_graph_setup is first called, and reused the itable
> > address, while new graph is created and fuse_graph_setup called due to
> > options changed such like performance.write_behind.
>
> We cannot use the inodes from old graph, because per xlator state from
> xlators of new graph is not present in them. So, we build new inodes and let
> interested xlators in the new graph store the state. This is done by sending
> a lookup call for each nodeid the kernel is using (on demand basis. We don't
> do this if kernel doesn't do an operation on the inodes it has cached).
>
> The other possible approach is to build the state in old inodes instead of
> creating new inodes (that way we don't change nodeid). But how scalable is
> that approach is the question (what happens if there are too many graph
> switches and functions like inode_ctx_get/set slow down due to too many
> xlator states). Also, this is just a thought and I've to think more on this.
> Even if we do this, it will be a more disruptive change and needs more
> testing to be accepted as a solution.
>
> > 3) is there some way let ino(which showed in Annex 1) not changed due to
> > graph rebuild, I mean not through address, such like the mapping of the
> > gfid, then even if the itable rebuild, glusterfs have the same nodeid
> > which
> > stored in kernel if solution 1 and solution 2 can't be used?
>
> Using plain gfids as nodeids would be the best solution. But kernel allows
> only 64bit nodeids and gfids are bigger. Mapping requires one more
> data-structure (say hash-table etc), that brings more overhead and I guess
> we wanted to keep things simple (by just using address of inode object as
> nodeid - Annex 1).
>
> > 4) any other solutions?
>
> I can't think of an easy simple solution now. Will update if I can find
> something.
>
> >
> > I would like to know your opinion about this issue and our solutions, and
> > your reply will be appreciated!
> >
> > Annex 1: inode mapping with the address of node stored in itable.
> > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
> > {
> > inode_t *inode = NULL;
> > xlator_t *active_subvol = NULL;
> >
> > if (ino == 1) {
> > active_subvol = fuse_active_subvol (fuse);
> > if (active_subvol)
> > inode = active_subvol->itable->root;
> > } else {
> > inode = (inode_t *) (unsigned long) ino;
> > inode_ref (inode);
> > }
> >
> > return inode;
> > }
> >
> > Annex 2: kernel code changed for this issue, which mean for some kernel
> > there is no un-expect umount issue, but has DOS-attack issue, and while the
> > latest kernel will have the un-expect umount issue due to fuse_graph_setup
> > rebuild.
> > 1) no un-expected umount issue but has DOS-attack issue.
> > kernel is doing the following in fs/fuse/dir.c:
> >
> > if (outarg.nodeid != get_node_id(inode)) {
> > fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> > goto invalid;
> > }
> >
> > invalid:
> > ret = 0;
> > if (check_submounts_and_drop(entry) != 0)
> > ret = 1;
> > goto out;
> > }
> >
> >
> > The above check_submounts_and_drop() call was added in September 2013:
> >
> > commit 46ea1562da792a94ee8391f7aed882eb7771e18c
> > Author: Anand Avati <avati at redhat.com>
> > Date: Thu Sep 5 11:44:44 2013 +0200
> >
> > fuse: drop dentry on failed revalidate
> >
> > Drop a subtree when we find that it has moved or been delated. This
> > can
> > be
> > done as long as there are no submounts under this location.
> >
> > If the directory was moved and we come across the same directory in a
> > future lookup it will be reconnected by d_materialise_unique().
> >
> > Signed-off-by: Anand Avati <avati at redhat.com>
> > Signed-off-by: Miklos Szeredi <mszeredi at suse.cz>
> > Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index 25c6cfe98801..0e6961aae6c0 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -259,6 +259,8 @@ out:
> >
> > invalid:
> > ret = 0;
> > + if (check_submounts_and_drop(entry) != 0)
> > + ret = 1;
> > goto out;
> > }
> >
> > Ie. this version of fuse specifically avoided invalidating the dentry if
> > there were directories mounted under it.
> > (The check_submounts_and_drop() function is part of general vfs code, not
> > fuse specific.)
> >
> > Then, in October 2013 there was a wider kernel change which caused invalid
> > mounts to be deleted more aggressively
> > for security reasons (elimination of certain kind of DOS attacks):
> >
> > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
> > Author: Eric W. Biederman <ebiederman at twitter.com>
> > Date: Tue Oct 1 18:33:48 2013 -0700
> >
> > vfs: Lazily remove mounts on unlinked files and directories.
> >
> > With the introduction of mount namespaces and bind mounts it became
> > possible to access files and directories that on some paths are mount
> > points but are not mount points on other paths. It is very confusing
> > when rm -rf somedir returns -EBUSY simply because somedir is mounted
> > somewhere else. With the addition of user namespaces allowing
> > unprivileged mounts this condition has gone from annoying to allowing
> > a DOS attack on other users in the system.
> >
> > The possibility for mischief is removed by updating the vfs to support
> > rename, unlink and rmdir on a dentry that is a mountpoint and by
> > lazily unmounting mountpoints on deleted dentries.
> >
> >
> > The above commit changed check_submounts_and_drop() function so that it
> > will
> > not return non-zero error status any more if the entry is a mount point (or
> > has mount points under it).
> > Instead it simply unmounts those entries unconditionally.
> >
> >
> > Finally, that call was removed from fuse code because it was not doing
> > anything anymore:
> >
> > commit 9b053f3207e8887fed88162a339fdd4001abcdb2
> > Author: Eric W. Biederman <ebiederm at xmission.com>
> > Date: Thu Feb 13 09:34:30 2014 -0800
> >
> > vfs: Remove unnecessary calls of check_submounts_and_drop
> >
> > Now that check_submounts_and_drop can not fail and is called from
> > d_invalidate there is no longer a need to call check_submounts_and_drom
> > from filesystem d_revalidate methods so remove it.
> >
> > Reviewed-by: Miklos Szeredi <miklos at szeredi.hu>
> > Signed-off-by: "Eric W. Biederman" <ebiederm at xmission.com>
> > Signed-off-by: Al Viro <viro at zeniv.linux.org.uk>
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index de1d84af9f7c..820efd74ca9f 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -274,9 +274,6 @@ out:
> >
> >
> > 2) has un-expected umount issue.
> > invalid:
> > ret = 0;
> > -
> > - if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
> > - ret = 1;
> > goto out;
> > }
> >
> > Later, the check_submounts_and_drop() function was merged to d_invalidate()
> > and ceased to exist.
> >
> > So, the unmounting does not happen in CentOS because it is using quite old
> > and more relaxed kernel where filesystem code lets the mount exists even if
> > glusterfs node id has changed.
> > But newer kernels released during the last 3 years are using stricter mount
> > handling policy which causes the unmounting (and most likely will not go
> > back).
> >
> >
> > Best regards,
> > George
> >
> >
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
More information about the Gluster-devel
mailing list