[Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

Mon Oct 31 05:41:03 UTC 2016

Thanks Pranith for the update.

Patch for fixing write-behind to not send stale stat can be found at:
http://review.gluster.org/#/c/15757/1

If possible, please test it and let us know the results. Your results are very important for us to resolve the issue.

regards,
Raghavendra

----- Original Message -----
> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> To: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang at nokia.com>,
> Gluster-devel at gluster.org, "Jan Zizka (Nokia - CZ/Prague)" <jan.zizka at nokia.com>
> Sent: Monday, October 31, 2016 8:53:07 AM
> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file
> 
> Removing I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com, it is causing
> mail delivery problems for me.
> 
> George,
>          Raghavendra and I made some progress on this issue. We were in
> parallel working on another issue which is similar where elastic search
> indices are getting corrupted because of wrong stat sizes in our opinion.
> So I have been running different translator stacks in identifying the
> problematic xlators which are leading to indices corruption.
>           We found the list to be 1) Write-behind, 2) Quick-read, 3)
> Readdir-ahead. Raghavendra and I just had a chat and we are suspecting that
> lack of lookup/readdirp implementation in write-behind could be the reason
> for this problem. Similar problems may exist in other two xlators too. But
> we are working on write-behind with priority.
> 
> Our theory is this:
> If we do a 4KB write for example and it is cached in write-behind and we do
> a lookup on the file/do a readdirp on the directory with this file we send
> out wrong stat value to the kernel. There are different caches between
> kernel and gluster which may lead to fstat never coming till write-behind.
> So we need to make sure that we don't get into this situation.
> 
> Action items:
>  At the moment Raghavendra is working on a patch to implement
> lookup/readdirp in write-behind. I am going to test the same for elastic
> search. Will it be possible for you to test your application against the
> same patch and confirm that the patch fixes the problem?
> 
> 
> On Fri, Oct 28, 2016 at 12:08 PM, Pranith Kumar Karampuri <
> pkarampu at redhat.com> wrote:
> 
> > hi George,
> >        It would help if we can identify the bare minimum xlators which are
> > contributing to the issue like Raghavendra was mentioning earlier. We were
> > wondering if it is possible for you to help us in identifying the issue by
> > running the workload on a modified setup? We can suggest testing out using
> > custom volfiles so that we can slowly build the graph which could be
> > causing this issue. We would like you guys to try out this problem with
> > just posix-xlator and fuse and nothing else.
> >
> > On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) <
> > george.lian at nokia.com> wrote:
> >
> >> Hi, Raghavendra,
> >>
> >> Could you please give some suggestion for this issue? we try to find the
> >> clue for this issue for a long time, but it has no progress:(
> >>
> >> Thanks & Best Regards,
> >> George
> >>
> >> -----Original Message-----
> >> From: Lian, George (Nokia - CN/Hangzhou)
> >> Sent: Wednesday, October 19, 2016 4:40 PM
> >> To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
> >> Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
> >> I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
> >> (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
> >> CZ/Prague) <jan.zizka at nokia.com>
> >> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
> >> the really size of the syslog file
> >>
> >> Hi, Raghavendra
> >>
> >> Just now, we test it with glusterfs log with debug-level "TRACE", and let
> >> some application trigger "glusterfs" produce large log, in that case, when
> >> we set write-behind and stat-prefetch both OFF,
> >> Tail the glusterfs log such like mnt-{VOLUME-NAME}.log, it still failed
> >> with "file truncated",
> >>
> >> So that means if file's IO in huge amount, the issue will still be there
> >> even write-behind and stat-prefetch both OFF.
> >>
> >> Best Regards,
> >> George
> >>
> >> -----Original Message-----
> >> From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> >> Sent: Wednesday, October 19, 2016 2:54 PM
> >> To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> >> Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
> >> I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
> >> (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
> >> CZ/Prague) <jan.zizka at nokia.com>
> >> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
> >> the really size of the syslog file
> >>
> >>
> >>
> >> ----- Original Message -----
> >> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> >> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >> > Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> >> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
> >> (Nokia - CN/Hangzhou)"
> >> > <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)" <
> >> jan.zizka at nokia.com>
> >> > Sent: Wednesday, October 19, 2016 12:05:01 PM
> >> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
> >> the really size of the syslog file
> >> >
> >> > Hi, Raghavendra,
> >> >
> >> > Thanks a lots for your quickly update!
> >> > In my case, there are so many process(write) is writing to the syslog
> >> file,
> >> > it do involve the writer is in the same host and writing in same mount
> >> point
> >> > while the tail(reader) is reading it.
> >> >
> >> > The bug I just guess is:
> >> > When a writer write the data with write-behind, it call the call-back
> >> > function " mdc_writev_cbk" and called "mdc_inode_iatt_set_validate" to
> >> > validate the "iatt" data, but with the code I mentioned last mail, it do
> >> > nothing.
> >>
> >> mdc_inode_iatt_set_validate has following code
> >>
> >> <snippet>
> >>                 if (!iatt || !iatt->ia_ctime) {
> >>                         mdc->ia_time = 0;
> >>                         goto unlock;
> >>                 }
> >> </snippet>
> >>
> >> Which means a NULL iatt sets mdc->ia_time to 0. This results in
> >> subsequent lookup/stat calls to be NOT served from md-cache. Instead, the
> >> stat is served from backend bricks. So, I don't see an issue here.
> >>
> >> However, one case where a NULL iatt is different from a valid iatt (which
> >> differs from the value stored in md-cache) is that the latter results in a
> >> call to inode_invalidate. This invalidation propagates to kernel and all
> >> dentry and page cache corresponding to file is purged. So, I am suspecting
> >> whether the stale stat you saw was served from kernel cache (not from
> >> glusterfs). If this is the case, having mount options
> >> "attribute-timeout=0"
> >> and "entry-timeout=0" should've helped.
> >>
> >> I am still at loss to point out the RCA for this issue.
> >>
> >>
> >> > And in same time, the reader(tail) read the "iatt" data, but in case of
> >> the
> >> > cache-time is not timeout, it will return the "iatt" data without the
> >> last
> >> > change.
> >> >
> >> > Do your think it is a possible bug?
> >> >
> >> > Thanks & Best Regards,
> >> > George
> >> >
> >> > -----Original Message-----
> >> > From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> >> > Sent: Wednesday, October 19, 2016 2:06 PM
> >> > To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> >> > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> >> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
> >> (Nokia -
> >> > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague)
> >> > <jan.zizka at nokia.com>
> >> > Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
> >> the
> >> > really size of the syslog file
> >> >
> >> >
> >> >
> >> > ----- Original Message -----
> >> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> >> > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >> > > Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> >> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
> >> (Nokia
> >> > > - CN/Hangzhou)"
> >> > > <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)"
> >> > > <jan.zizka at nokia.com>
> >> > > Sent: Wednesday, October 19, 2016 10:51:24 AM
> >> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less
> >> than the
> >> > > really size of the syslog file
> >> > >
> >> > > Hi, Raghavendra,
> >> > >
> >> > > When we disable md-cache(gluster volume set log
> >> > > performance.md-cache-timeout
> >> > > 0),  the issue seems gone.
> >> > > (we can't disable with " gluster volume set log performance.md-cache
> >> off"
> >> > > why?)
> >> >
> >> > Please use
> >> > #gluster volume set log performance.stat-prefetch off
> >> >
> >> > >
> >> > > So I double confuse that the code I abstract in last mail maybe have
> >> some
> >> > > issue for this case.
> >> > > Could you please share your comments?
> >> >
> >> > Please find my comments below.
> >> >
> >> > >
> >> > > Thanks & Best Regards,
> >> > > George
> >> > >
> >> > > -----Original Message-----
> >> > > From: Lian, George (Nokia - CN/Hangzhou)
> >> > > Sent: Friday, October 14, 2016 1:44 PM
> >> > > To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
> >> > > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> >> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
> >> (Nokia
> >> > > -
> >> > > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
> >> CZ/Prague)
> >> > > <jan.zizka at nokia.com>
> >> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less
> >> than the
> >> > > really size of the syslog file
> >> > >
> >> > > Hi, Raghavendra,
> >> > >
> >> > > Our version of GlusterFS is 3.6.9, and I also check the newest code
> >> of main
> >> > > branch, the function of " mdc_inode_iatt_set_validate" is almost
> >> same, from
> >> > > the following code of this function,
> >> > > We could see a "TODO" comments inline, does it mean if we enhance
> >> > > write-behind feature, the "iatt" field in callback will be NULL, so
> >> that
> >> > > inode_invalidate will not be called? So the size of file will not
> >> update
> >> > > since "write behind" enabled ?
> >> > > Is it the root cause for "tail" application failed with "file
> >> truncated"
> >> > > issue ?
> >> > >
> >> > >         LOCK (&mdc->lock);
> >> > >         {
> >> > >                 if (!iatt || !iatt->ia_ctime) {
> >> > >                         mdc->ia_time = 0;
> >> > >                         goto unlock;
> >> > >                 }
> >> > >
> >> > >             /*
> >> > >              * Invalidate the inode if the mtime or ctime has changed
> >> > >              * and the prebuf doesn't match the value we have cached.
> >> > >              * TODO: writev returns with a NULL iatt due to
> >> > >              * performance/write-behind, causing invalidation on
> >> writes.
> >> > >              */
> >> >
> >> > The issue explained in this comment is hit only when writes are done.
> >> But, in
> >> > your use-case only "tail" is the application running on the mount (If I
> >> am
> >> > not wrong, the  writer is running on a different mountpoint). So, I
> >> doubt
> >> > you are hitting this issue. But, you are saying that the issue goes away
> >> > when write-behind/md-cache is turned off pointing to some interaction
> >> > between md-cache and write-behind causing the issue. I need more time to
> >> > look into this issue. Can you file a bug on this?
> >> >
> >> > >             if (IA_ISREG(inode->ia_type) &&
> >> > >                 ((iatt->ia_mtime != mdc->md_mtime) ||
> >> > >                 (iatt->ia_ctime != mdc->md_ctime)))
> >> > >                     if (!prebuf || (prebuf->ia_ctime !=
> >> mdc->md_ctime) ||
> >> > >                         (prebuf->ia_mtime != mdc->md_mtime))
> >> > >                             inode_invalidate(inode);
> >> > >
> >> > >                 mdc_from_iatt (mdc, iatt);
> >> > >
> >> > >                 time (&mdc->ia_time);
> >> > >         }
> >> > >
> >> > > Best Regards,
> >> > > George
> >> > > -----Original Message-----
> >> > > From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
> >> > > Sent: Thursday, October 13, 2016 8:58 PM
> >> > > To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
> >> > > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
> >> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
> >> (Nokia
> >> > > -
> >> > > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
> >> CZ/Prague)
> >> > > <jan.zizka at nokia.com>
> >> > > Subject: Re: [Gluster-devel] Issue about the size of fstat is less
> >> than the
> >> > > really size of the syslog file
> >> > >
> >> > >
> >> > >
> >> > > ----- Original Message -----
> >> > > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
> >> > > > To: Gluster-devel at gluster.org
> >> > > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
> >> > > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
> >> > > > (Nokia
> >> > > > - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia -
> >> > > > CZ/Prague)"
> >> > > > <jan.zizka at nokia.com>
> >> > > > Sent: Thursday, October 13, 2016 2:33:53 PM
> >> > > > Subject: [Gluster-devel] Issue about the size of fstat is less than
> >> the
> >> > > > really size of the syslog file
> >> > > >
> >> > > > Hi, Dear Expert,
> >> > > > We have use glusterfs as a network filesystem, and syslog store in
> >> there,
> >> > > > some clients on different host may write the syslog file via
> >> “glusterfs”
> >> > > > mount point.
> >> > > > Now we encounter an issue when we “tail” the syslog file, it will
> >> > > > occasional
> >> > > > failed with error “ file truncated ”
> >> > > > As we study and trace with the “tail” source code, it failed with
> >> the
> >> > > > following code:
> >> > > > if ( S_ISREG (mode) && stats.st_size < f[i].size )
> >> > > > {
> >> > > > error (0, 0, _("%s: file truncated"), quotef (name));
> >> > > > /* Assume the file was truncated to 0,
> >> > > > and therefore output all "new" data. */
> >> > > > xlseek (fd, 0, SEEK_SET, name);
> >> > > > f[i].size = 0;
> >> > > > }
> >> > > > When stats.st_size < f[i].size, what mean the size report by fstat
> >> is
> >> > > > less
> >> > > > than “tail” had read, it lead to “file truncated”, we also use
> >> “strace”
> >> > > > tools to trace the tail application, the related tail strace log as
> >> the
> >> > > > below:
> >> > > > nanosleep({1, 0}, NULL) = 0
> >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> >> > > > nanosleep({1, 0}, NULL) = 0
> >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> >> > > > nanosleep({1, 0}, NULL) = 0
> >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
> >> > > > nanosleep({1, 0}, NULL) = 0
> >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544549, ...}) = 0
> >> > > > read(3, " Data … -"..., 8192) = 1444
> >> > > > read(3, " Data.. "..., 8192) = 720
> >> > > > read(3, "", 8192) = 0
> >> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544789, ...}) = 0
> >> > > > write(1, “DATA…..” ) = 2164
> >> > > > write(2, "tail: ", 6tail: ) = 6
> >> > > > write(2, "/mnt/log/master/syslog: file tru"...,
> >> 38/mnt/log/master/syslog:
> >> > > > file truncated) = 38
> >> > > > as the above strace log, tail has read 1444+720=2164 bytes,
> >> > > > but fstat tell “tail” 192544789 – 192543105 = 1664 which less than
> >> 2164,
> >> > > > so
> >> > > > it lead to “tail” application “file truncated”.
> >> > > > And if we turn off “write-behind” feature, the issue will not be
> >> > > > reproduced
> >> > > > any more.
> >> > >
> >> > > That seems strange. There are no writes happening on the fd/inode
> >> through
> >> > > which tail is reading/stating from. So, it seems strange that
> >> write-behind
> >> > > is involved here. I suspect whether any of
> >> md-cache/read-ahead/io-cache is
> >> > > causing the issue. Can you,
> >> > >
> >> > > 1. Turn off md-cache, read-ahead, io-cache xlators
> >> > > 2. mount glusterfs with --attribute-timeout=0
> >> > > 3. set write-behind on
> >> > >
> >> > > and rerun the tests? If you don't hit the issue, you can experiment by
> >> > > turning on/off of md-cache, read-ahead and io-cache translators and
> >> see
> >> > > what
> >> > > are the minimal number of xlators that need to be turned off to not
> >> hit the
> >> > > issue (with write-behind on)?
> >> > >
> >> > > regards,
> >> > > Raghavendra
> >> > >
> >> > > > So we think it may be related to cache consistence issue due to
> >> > > > performance
> >> > > > consider, but we still have concern that:
> >> > > > The syslog file is used only with “Append” mode, so the size of file
> >> > > > shouldn’t be reduced, when a client read the file, why “fstat” can’t
> >> > > > return
> >> > > > the really size match to the cache?
> >> > > > From current investigation, we doubt that the current implement of
> >> > > > “glusterfs” has a bug on “fstat” when cache is on.
> >> > > > Your comments is our highly appreciated!
> >> > > > Thanks & Best Regards
> >> > > > George
> >> > > >
> >> > > > _______________________________________________
> >> > > > Gluster-devel mailing list
> >> > > > Gluster-devel at gluster.org
> >> > > > http://www.gluster.org/mailman/listinfo/gluster-devel
> >> > >
> >> >
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel at gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>
> >
> >
> >
> > --
> > Pranith
> >
> 
> 
> 
> --
> Pranith
>