[Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

Thu Oct 27 08:10:14 UTC 2016

Hi, Raghavendra,

Could you please give some suggestion for this issue? we try to find the clue for this issue for a long time, but it has no progress:(

Thanks & Best Regards,
George

-----Original Message-----
From: Lian, George (Nokia - CN/Hangzhou) 
Sent: Wednesday, October 19, 2016 4:40 PM
To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka at nokia.com>
Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

Hi, Raghavendra

Just now, we test it with glusterfs log with debug-level "TRACE", and let some application trigger "glusterfs" produce large log, in that case, when we set write-behind and stat-prefetch both OFF,
Tail the glusterfs log such like mnt-{VOLUME-NAME}.log, it still failed with "file truncated",

So that means if file's IO in huge amount, the issue will still be there even write-behind and stat-prefetch both OFF.

Best Regards,
George

-----Original Message-----
From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com] 
Sent: Wednesday, October 19, 2016 2:54 PM
To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka at nokia.com>
Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

----- Original Message -----
> From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang (Nokia - CN/Hangzhou)"
> <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)" <jan.zizka at nokia.com>
> Sent: Wednesday, October 19, 2016 12:05:01 PM
> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file
> 
> Hi, Raghavendra,
> 
> Thanks a lots for your quickly update!
> In my case, there are so many process(write) is writing to the syslog file,
> it do involve the writer is in the same host and writing in same mount point
> while the tail(reader) is reading it.
> 
> The bug I just guess is:
> When a writer write the data with write-behind, it call the call-back
> function " mdc_writev_cbk" and called "mdc_inode_iatt_set_validate" to
> validate the "iatt" data, but with the code I mentioned last mail, it do
> nothing.

mdc_inode_iatt_set_validate has following code

<snippet>
                if (!iatt || !iatt->ia_ctime) {
        		mdc->ia_time = 0;
                        goto unlock;
        	}
</snippet>

Which means a NULL iatt sets mdc->ia_time to 0. This results in subsequent lookup/stat calls to be NOT served from md-cache. Instead, the stat is served from backend bricks. So, I don't see an issue here.

However, one case where a NULL iatt is different from a valid iatt (which differs from the value stored in md-cache) is that the latter results in a call to inode_invalidate. This invalidation propagates to kernel and all dentry and page cache corresponding to file is purged. So, I am suspecting whether the stale stat you saw was served from kernel cache (not from glusterfs). If this is the case, having mount options "attribute-timeout=0" and "entry-timeout=0" should've helped.

I am still at loss to point out the RCA for this issue.

> And in same time, the reader(tail) read the "iatt" data, but in case of the
> cache-time is not timeout, it will return the "iatt" data without the last
> change.
> 
> Do your think it is a possible bug?
> 
> Thanks & Best Regards,
> George
> 
> -----Original Message-----
> From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> Sent: Wednesday, October 19, 2016 2:06 PM
> To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan (Nokia -
> CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague)
> <jan.zizka at nokia.com>
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
> really size of the syslog file
> 
> 
> 
> ----- Original Message -----
> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang (Nokia
> > - CN/Hangzhou)"
> > <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)"
> > <jan.zizka at nokia.com>
> > Sent: Wednesday, October 19, 2016 10:51:24 AM
> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > Hi, Raghavendra,
> > 
> > When we disable md-cache(gluster volume set log
> > performance.md-cache-timeout
> > 0),  the issue seems gone.
> > (we can't disable with " gluster volume set log performance.md-cache off"
> > why?)
> 
> Please use
> #gluster volume set log performance.stat-prefetch off
> 
> > 
> > So I double confuse that the code I abstract in last mail maybe have some
> > issue for this case.
> > Could you please share your comments?
> 
> Please find my comments below.
> 
> > 
> > Thanks & Best Regards,
> > George
> > 
> > -----Original Message-----
> > From: Lian, George (Nokia - CN/Hangzhou)
> > Sent: Friday, October 14, 2016 1:44 PM
> > To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
> > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan (Nokia
> > -
> > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague)
> > <jan.zizka at nokia.com>
> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > Hi, Raghavendra,
> > 
> > Our version of GlusterFS is 3.6.9, and I also check the newest code of main
> > branch, the function of " mdc_inode_iatt_set_validate" is almost same, from
> > the following code of this function,
> > We could see a "TODO" comments inline, does it mean if we enhance
> > write-behind feature, the "iatt" field in callback will be NULL, so that
> > inode_invalidate will not be called? So the size of file will not update
> > since "write behind" enabled ?
> > Is it the root cause for "tail" application failed with "file truncated"
> > issue ?
> > 
> >         LOCK (&mdc->lock);
> >         {
> >                 if (!iatt || !iatt->ia_ctime) {
> >                         mdc->ia_time = 0;
> >                         goto unlock;
> >                 }
> > 
> > 		/*
> > 		 * Invalidate the inode if the mtime or ctime has changed
> > 		 * and the prebuf doesn't match the value we have cached.
> > 		 * TODO: writev returns with a NULL iatt due to
> > 		 * performance/write-behind, causing invalidation on writes.
> > 		 */
> 
> The issue explained in this comment is hit only when writes are done. But, in
> your use-case only "tail" is the application running on the mount (If I am
> not wrong, the  writer is running on a different mountpoint). So, I doubt
> you are hitting this issue. But, you are saying that the issue goes away
> when write-behind/md-cache is turned off pointing to some interaction
> between md-cache and write-behind causing the issue. I need more time to
> look into this issue. Can you file a bug on this?
> 
> > 		if (IA_ISREG(inode->ia_type) &&
> > 		    ((iatt->ia_mtime != mdc->md_mtime) ||
> > 		    (iatt->ia_ctime != mdc->md_ctime)))
> > 			if (!prebuf || (prebuf->ia_ctime != mdc->md_ctime) ||
> > 			    (prebuf->ia_mtime != mdc->md_mtime))
> > 				inode_invalidate(inode);
> > 
> >                 mdc_from_iatt (mdc, iatt);
> > 
> >                 time (&mdc->ia_time);
> >         }
> > 
> > Best Regards,
> > George
> > -----Original Message-----
> > From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> > Sent: Thursday, October 13, 2016 8:58 PM
> > To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan (Nokia
> > -
> > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague)
> > <jan.zizka at nokia.com>
> > Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the
> > really size of the syslog file
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> > > To: Gluster-devel at gluster.org
> > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
> > > (Nokia
> > > - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia -
> > > CZ/Prague)"
> > > <jan.zizka at nokia.com>
> > > Sent: Thursday, October 13, 2016 2:33:53 PM
> > > Subject: [Gluster-devel] Issue about the size of fstat is less than the
> > > really size of the syslog file
> > > 
> > > Hi, Dear Expert,
> > > We have use glusterfs as a network filesystem, and syslog store in there,
> > > some clients on different host may write the syslog file via “glusterfs”
> > > mount point.
> > > Now we encounter an issue when we “tail” the syslog file, it will
> > > occasional
> > > failed with error “ file truncated ”
> > > As we study and trace with the “tail” source code, it failed with the
> > > following code:
> > > if ( S_ISREG (mode) && stats.st_size < f[i].size )
> > > {
> > > error (0, 0, _("%s: file truncated"), quotef (name));
> > > /* Assume the file was truncated to 0,
> > > and therefore output all "new" data. */
> > > xlseek (fd, 0, SEEK_SET, name);
> > > f[i].size = 0;
> > > }
> > > When stats.st_size < f[i].size, what mean the size report by fstat is
> > > less
> > > than “tail” had read, it lead to “file truncated”, we also use “strace”
> > > tools to trace the tail application, the related tail strace log as the
> > > below:
> > > nanosleep({1, 0}, NULL) = 0
> > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> > > nanosleep({1, 0}, NULL) = 0
> > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> > > nanosleep({1, 0}, NULL) = 0
> > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> > > nanosleep({1, 0}, NULL) = 0
> > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544549, ...}) = 0
> > > read(3, " Data … -"..., 8192) = 1444
> > > read(3, " Data.. "..., 8192) = 720
> > > read(3, "", 8192) = 0
> > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544789, ...}) = 0
> > > write(1, “DATA…..” ) = 2164
> > > write(2, "tail: ", 6tail: ) = 6
> > > write(2, "/mnt/log/master/syslog: file tru"..., 38/mnt/log/master/syslog:
> > > file truncated) = 38
> > > as the above strace log, tail has read 1444+720=2164 bytes,
> > > but fstat tell “tail” 192544789 – 192543105 = 1664 which less than 2164,
> > > so
> > > it lead to “tail” application “file truncated”.
> > > And if we turn off “write-behind” feature, the issue will not be
> > > reproduced
> > > any more.
> > 
> > That seems strange. There are no writes happening on the fd/inode through
> > which tail is reading/stating from. So, it seems strange that write-behind
> > is involved here. I suspect whether any of md-cache/read-ahead/io-cache is
> > causing the issue. Can you,
> > 
> > 1. Turn off md-cache, read-ahead, io-cache xlators
> > 2. mount glusterfs with --attribute-timeout=0
> > 3. set write-behind on
> > 
> > and rerun the tests? If you don't hit the issue, you can experiment by
> > turning on/off of md-cache, read-ahead and io-cache translators and see
> > what
> > are the minimal number of xlators that need to be turned off to not hit the
> > issue (with write-behind on)?
> > 
> > regards,
> > Raghavendra
> > 
> > > So we think it may be related to cache consistence issue due to
> > > performance
> > > consider, but we still have concern that:
> > > The syslog file is used only with “Append” mode, so the size of file
> > > shouldn’t be reduced, when a client read the file, why “fstat” can’t
> > > return
> > > the really size match to the cache?
> > > From current investigation, we doubt that the current implement of
> > > “glusterfs” has a bug on “fstat” when cache is on.
> > > Your comments is our highly appreciated!
> > > Thanks & Best Regards
> > > George
> > > 
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
>