[Gluster-devel] "du" a large count of files in a directory causes mounted glusterfs filesystem coredump

Raghavendra Gowdappa rgowdapp at redhat.com
Mon Dec 12 06:28:56 UTC 2016



----- Original Message -----
> From: "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou at nokia.com>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: Gluster-devel at gluster.org, "Carlos Chinea (Nokia - FI/Espoo)" <carlos.chinea at nokia.com>, "Kari Hautio (Nokia -
> FI/Espoo)" <kari.hautio at nokia.com>, linux-fsdevel at vger.kernel.org, "Bingxuan Zhang (Nokia - CN/Hangzhou)"
> <bingxuan.zhang at nokia.com>, "Deqian Li (Nokia - CN/Hangzhou)" <deqian.li at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)"
> <jan.zizka at nokia.com>, "Xiaohui Bao (Nokia - CN/Hangzhou)" <xiaohui.bao at nokia.com>
> Sent: Monday, December 12, 2016 10:59:14 AM
> Subject: RE: [Gluster-devel] "du" a large count files in a directory casue mounted glusterfs filesystem coredump
> 
> Hi glusterfs expert:
> 	From
> 	https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
> 	there is the following description:
> 
> when the lru limit of the inode table has been exceeded, an inode is
> removed from the inode table and eventually destroyed
> 
>     From the glusterfs source code, in the function inode_table_new there are
>     the following lines, so lru_limit is not infinite.
> 
>         /* In case FUSE is initing the inode table. */
>         if (lru_limit == 0)
>                 lru_limit = DEFAULT_INODE_MEMPOOL_ENTRIES; // 32 * 1024

That's just a reuse of the local variable lru_limit. Note that the value passed by the caller has already been stored in the new itable at:
https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/inode.c#L1582
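
As a simplified sketch of that pattern (illustrative names, not the verbatim
glusterfs source): the caller's value is stored in the table before the local
variable is adjusted, so the adjustment only affects how the memory pools are
sized, while a value of 0 ("infinite") is preserved in the table itself.

    #include <stddef.h>

    #define DEFAULT_INODE_MEMPOOL_ENTRIES (32 * 1024)

    /* illustrative stand-in for the real inode_table_t */
    struct inode_table_sketch {
            size_t lru_limit;      /* caller's value; 0 means "infinite" */
            size_t mempool_size;   /* sized from the adjusted local copy */
    };

    static void
    inode_table_init_sketch (struct inode_table_sketch *table, size_t lru_limit)
    {
            table->lru_limit = lru_limit;      /* stored before any adjustment */

            if (lru_limit == 0)                /* FUSE passes 0: infinite lru */
                    lru_limit = DEFAULT_INODE_MEMPOOL_ENTRIES;

            /* from here on, lru_limit only decides how big the pools are */
            table->mempool_size = lru_limit;
    }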

Also, as can be seen here
https://github.com/gluster/glusterfs/blob/master/xlators/mount/fuse/src/fuse-bridge.c#L5205

FUSE passes 0 as the lru-limit, which is treated as infinite.

So, I don't see an issue here.

>     Is it possible that glusterfs removes an inode from the inode table because
>     the lru limit has been reached?
>     From the backtrace pasted by George, it seems the inode table address is
>     invalid, which caused the coredump.
> 
> Best regards,
> Cynthia (周琳)
> MBB SM HETRAN SW3 MATRIX
> Storage
> Mobile: +86 (0)18657188311
> 
> -----Original Message-----
> From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> Sent: Monday, December 12, 2016 12:34 PM
> To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> Cc: Gluster-devel at gluster.org; Chinea, Carlos (Nokia - FI/Espoo)
> <carlos.chinea at nokia.com>; Hautio, Kari (Nokia - FI/Espoo)
> <kari.hautio at nokia.com>; linux-fsdevel at vger.kernel.org; Zhang, Bingxuan
> (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zhou, Cynthia (Nokia -
> CN/Hangzhou) <cynthia.zhou at nokia.com>; Li, Deqian (Nokia - CN/Hangzhou)
> <deqian.li at nokia.com>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka at nokia.com>;
> Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao at nokia.com>
> Subject: Re: [Gluster-devel] "du" a large count files in a directory casue
> mounted glusterfs filesystem coredump
> 
> 
> 
> ----- Original Message -----
> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > To: Gluster-devel at gluster.org, "Carlos Chinea (Nokia - FI/Espoo)"
> > <carlos.chinea at nokia.com>, "Kari Hautio (Nokia -
> > FI/Espoo)" <kari.hautio at nokia.com>, linux-fsdevel at vger.kernel.org
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> > "Cynthia Zhou (Nokia - CN/Hangzhou)"
> > <cynthia.zhou at nokia.com>, "Deqian Li (Nokia - CN/Hangzhou)"
> > <deqian.li at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)"
> > <jan.zizka at nokia.com>, "Xiaohui Bao (Nokia - CN/Hangzhou)"
> > <xiaohui.bao at nokia.com>
> > Sent: Friday, December 9, 2016 2:50:44 PM
> > Subject: Re: [Gluster-devel] "du" a large count files in a directory casue
> > mounted glusterfs filesystem coredump
> > 
> > Regarding the life cycle of an inode in glusterfs, as shown in
> > https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
> > it states that “An inode is removed from the inode table and eventually
> > destroyed when an unlink or rmdir operation is performed on a file/directory,
> > or the lru limit of the inode table has been exceeded.”
> > Now the default value for the inode lru limit in glusterfs is 32k.
> > When we “du” or “ls -R” a large amount of files in a directory containing
> > more than 32K files, this can easily exceed the lru limit.
> 
> The glusterfs mount process has an infinite lru limit. The reason is that
> glusterfs passes the address of the inode object as the "nodeid" (aka identifier)
> representing the inode. For all future references to the inode, the kernel just
> sends back this nodeid. So, glusterfs cannot free up the inode as long as the
> kernel remembers it. In other words, the inode table size in the mount process
> depends on the dentry-cache and inode table size in the fuse kernel module. So,
> for an inode to be freed up in the mount process:
> 1. There should not be any ongoing ops referring to the inode.
> 2. The kernel should have sent as many forgets as the number of lookups it
> has done.
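> 
> As a simplified illustration of that nodeid scheme (illustrative helper names;
> the real logic lives around fuse_ino_to_inode() in fuse-helpers.c and the
> LOOKUP/FORGET handlers in fuse-bridge.c):
> 
>     #include <stdint.h>
> 
>     typedef struct inode inode_t;   /* opaque here; defined in libglusterfs */
> 
>     /* LOOKUP reply: hand the inode's address to the kernel as the nodeid. */
>     static uint64_t
>     inode_to_nodeid_sketch (inode_t *inode)
>     {
>             return (uint64_t)(uintptr_t)inode;
>     }
> 
>     /* Every later request (including FORGET) maps the nodeid straight back.
>      * This is only safe because the mount process keeps the inode alive until
>      * the kernel has sent a FORGET for every LOOKUP it did; if the inode could
>      * be destroyed earlier (e.g. by an lru limit), this would return a
>      * dangling pointer. */
>     static inode_t *
>     nodeid_to_inode_sketch (uint64_t nodeid)
>     {
>             return (inode_t *)(uintptr_t)nodeid;
>     }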
> 
> 
> > @gluster-expert, when glusterfs destroys the inode due to the LRU limit,
> > does glusterfs notify the kernel? (from my study so far, it seems not)
> 
> No, it does not. As explained above, the mount process never destroys the inode
> as long as the kernel remembers it.
> 
> > @linux-fsdevel-expert, could you please clarify the inode recycle mechanism,
> > or the fuse-forget FOP for inodes, for us?
> > Is it possible that the kernel frees the inode (which will trigger a fuse-forget
> > to glusterfs) later than glusterfs destroys it due to the lru limit?
> 
> In the mount process the inode is never destroyed through the lru mechanism, as
> the limit is infinite.
> 
> > If that is possible, then the nodeid (which is converted from the memory address
> > in glusterfs) may be stale, and when it is passed back to glusterfs in userspace,
> > glusterfs just converts the u64 nodeid to a memory address and tries to access
> > that address, which will finally lead to an invalid access and a coredump!
> 
> That's precisely the reason why we keep an infinite lru limit for the glusterfs
> client process. Though please note that we do have a finite lru limit for the
> brick process, the nfsv3 server, etc.
> 
> regards,
> Raghavendra
> 
> > Thanks & Best Regards,
> > George
> > _____________________________________________
> > From: Lian, George (Nokia - CN/Hangzhou)
> > Sent: Friday, December 09, 2016 9:49 AM
> > To: 'Gluster-devel at gluster.org' <Gluster-devel at gluster.org>
> > Cc: Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou at nokia.com>; Bao,
> > Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao at nokia.com>; Zhang, Bingxuan
> > (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Li, Deqian (Nokia -
> > CN/Hangzhou) <deqian.li at nokia.com>
> > Subject: "du" a large count files in a directory casue mounted glusterfs
> > filesystem coredump
> > Hi, GlusterFS Expert,
> > We now have an issue when running the “du” command on a large count of
> > files/directories in a directory; in our environment there are more than
> > 150k files in the directory.
> > # df -i .
> > Filesystem Inodes IUsed IFree IUse% Mounted on
> > 169.254.0.23:/home 261888 154146 107742 59% /home
> > When we run the “du” command in this directory it very easily causes the
> > glusterfs process to coredump, and the coredump backtrace shows it is always
> > caused by the do_forget API, though the last call differs from time to time.
> > Please see the detailed backtraces at the end of this mail.
> > From my investigation, the issue may be caused by an unsafe call of the API
> > “fuse_ino_to_inode”.
> > I am just guessing that, in some unexpected case, when “fuse_ino_to_inode” is
> > called with a nodeid that came from the forget FOP, it obtains the address by
> > simply mapping the uint64 back to a memory address,
> > but that inode may already have been destroyed because “the lru limit of the
> > inode table has been exceeded” in our large-file case, so this operation may
> > not be safe,
> > and the coredump backtraces also show several different cases when the core
> > occurred.
> > Could you please share your comments on my investigation?
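> > 
> > (As a minimal sketch of the lookup/forget accounting under discussion, with
> > illustrative names rather than the actual glusterfs implementation: the
> > inode's address can only become stale after the kernel has forgotten every
> > lookup it performed, which is what the "inode->nlookup >= nlookup" assertion
> > in backtrace 2 below checks.)
> > 
> >     #include <assert.h>
> >     #include <stdint.h>
> >     #include <stdlib.h>
> > 
> >     typedef struct {
> >             uint64_t nlookup;   /* lookups the kernel still remembers */
> >     } inode_sketch_t;
> > 
> >     static inode_sketch_t *
> >     alloc_inode_sketch (void)
> >     {
> >             return calloc (1, sizeof (inode_sketch_t));
> >     }
> > 
> >     /* A LOOKUP reply hands this inode's address (the nodeid) to the kernel. */
> >     static void
> >     lookup_sketch (inode_sketch_t *inode)
> >     {
> >             inode->nlookup++;
> >     }
> > 
> >     /* FUSE FORGET / BATCH_FORGET: the kernel drops 'nlookup' references. */
> >     static void
> >     forget_sketch (inode_sketch_t *inode, uint64_t nlookup)
> >     {
> >             assert (inode->nlookup >= nlookup);
> >             inode->nlookup -= nlookup;
> > 
> >             if (inode->nlookup == 0)
> >                     free (inode);   /* only now may the address be reused */
> >     }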
> > And by the way, I have some questions:
> > 
> > 
> >     1. How does the inode number shown by the “stat” command map to the inode
> >     in glusterfs?
> > stat log
> > File: ‘log’
> > Size: 4096 Blocks: 8 IO Block: 4096 directory
> > Device: fd10h/64784d Inode: 14861593 Links: 3
> > 
> > 
> >     2. When will the system call the forget FOP, and where does the nodeid
> >     parameter come from?
> > 
> > 
> >     3. When the inode is eventually destroyed due to the lru limit and the
> >     same file is accessed (FOPed) the next time, will this inode be at the
> >     same address in the next lookup? If not, can there be a case where the
> >     forget FOP hands over an older nodeid than the one glusterfs currently
> >     has?
> > Thanks & Best Regards,
> > George
> > 
> > 
> >     1. Coredump backtrace 1
> > #0 0x00007fcd610a69e7 in __list_splice (list=0x26c350c,
> > head=0x7fcd56c23db0)
> > at list.h:121
> > #1 0x00007fcd610a6a51 in list_splice_init (list=0x26c350c,
> > head=0x7fcd56c23db0) at list.h:146
> > #2 0x00007fcd610a95c8 in inode_table_prune (table=0x26c347c) at
> > inode.c:1330
> > #3 0x00007fcd610a8a02 in inode_forget (inode=0x7fcd5001147c, nlookup=1) at
> > inode.c:977
> > #4 0x00007fcd5f151e24 in do_forget (this=0xc43590, unique=437787,
> > nodeid=140519787271292, nlookup=1) at fuse-bridge.c:637
> > #5 0x00007fcd5f151fd3 in fuse_batch_forget (this=0xc43590,
> > finh=0x7fcd50c266c0, msg=0x7fcd50c266e8) at fuse-bridge.c:676
> > #6 0x00007fcd5f168aff in fuse_thread_proc (data=0xc43590) at
> > fuse-bridge.c:4909
> > #7 0x00007fcd6080b414 in start_thread (arg=0x7fcd56c24700) at
> > pthread_create.c:333
> > #8 0x00007fcd600f7b9f in clone () at
> > ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> > (gdb) print head
> > $6 = (struct list_head *) 0x7fcd56c23db0
> > (gdb) print *head
> > $7 = {next = 0x7fcd56c23db0, prev = 0x7fcd56c23db0}
> > (gdb) print head->next
> > $8 = (struct list_head *) 0x7fcd56c23db0
> > (gdb) print list->prev
> > $9 = (struct list_head *) 0x5100000000
> > (gdb) print (list->prev)->next
> > Cannot access memory at address 0x5100000000
> > 
> > 
> >     2. Coredump backtrace 2
> > #0 __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> > #1 0x00007f612ab4a43a in __GI_abort () at abort.c:89
> > #2 0x00007f612ab41ccd in __assert_fail_base (fmt=0x7f612ac76618 "%s%s%s:%u:
> > %s%sAssertion `%s' failed.\n%n",
> > assertion=assertion at entry=0x7f612bc01ec1 "inode->nlookup >= nlookup",
> > file=file at entry=0x7f612bc01d9b "inode.c", line=line at entry=607,
> > function=function at entry=0x7f612bc02339 <__PRETTY_FUNCTION__.10128>
> > "__inode_forget") at assert.c:92
> > #3 0x00007f612ab41d82 in __GI___assert_fail (assertion=0x7f612bc01ec1
> > "inode->nlookup >= nlookup", file=0x7f612bc01d9b "inode.c", line=607,
> > function=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at
> > assert.c:101
> > #4 0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4)
> > at
> > inode.c:607
> > #5 0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at
> > inode.c:973
> > #6 0x00007f6129defdd5 in do_forget (this=0x1a895c0, unique=436589,
> > nodeid=140054991328908, nlookup=4) at fuse-bridge.c:633
> > #7 0x00007f6129defe94 in fuse_forget (this=0x1a895c0, finh=0x7f6118c28be0,
> > msg=0x7f6118c28c08) at fuse-bridge.c:652
> > #8 0x00007f6129e06ab0 in fuse_thread_proc (data=0x1a895c0) at
> > fuse-bridge.c:4905
> > #9 0x00007f612b311414 in start_thread (arg=0x7f61220d0700) at
> > pthread_create.c:333
> > #10 0x00007f612abfdb9f in clone () at
> > ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> > (gdb) f 5
> > #5 0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at
> > inode.c:973
> > 973 inode.c: No such file or directory.
> > (gdb) f 4
> > #4 0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4)
> > at
> > inode.c:607
> > 607 in inode.c
> > (gdb) print inode->nlookup
> > 
> > 
> >     3. Coredump backtrace 3
> > #0 __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> > #1 0x00007f86b0b0f43a in __GI_abort () at abort.c:89
> > #2 0x00007f86b0b06ccd in __assert_fail_base (fmt=0x7f86b0c3b618 "%s%s%s:%u:
> > %s%sAssertion `%s' failed.\n%n",
> > assertion=assertion at entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err)
> > != ESRCH || !robust",
> > file=file at entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c",
> > line=line at entry=352,
> > function=function at entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666>
> > "__pthread_mutex_lock_full") at assert.c:92
> > #3 0x00007f86b0b06d82 in __GI___assert_fail
> > (assertion=assertion at entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e,
> > __err)
> > != ESRCH || !robust",
> > file=file at entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c",
> > line=line at entry=352,
> > function=function at entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666>
> > "__pthread_mutex_lock_full") at assert.c:101
> > #4 0x00007f86b12d89da in __pthread_mutex_lock_full (mutex=0x7f86a12ffcac)
> > at
> > ../nptl/pthread_mutex_lock.c:352
> > #5 0x00007f86b1b729f1 in inode_ref (inode=0x7f86a03cefec) at inode.c:476
> > #6 0x00007f86afdafb04 in fuse_ino_to_inode (ino=140216190693356,
> > fuse=0x1a541f0) at fuse-helpers.c:390
> > #7 0x00007f86afdb4d6b in do_forget (this=0x1a541f0, unique=96369,
> > nodeid=140216190693356, nlookup=1) at fuse-bridge.c:629
> > #8 0x00007f86afdb4f84 in fuse_batch_forget (this=0x1a541f0,
> > finh=0x7f86a03b4f90, msg=0x7f86a03b4fb8) at fuse-bridge.c:674
> > #9 0x00007f86afdcbab0 in fuse_thread_proc (data=0x1a541f0) at
> > fuse-bridge.c:4905
> > #10 0x00007f86b12d6414 in start_thread (arg=0x7f86a7ac8700) at
> > pthread_create.c:333
> > #11 0x00007f86b0bc2b9f in clone () at
> > ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 

