[Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

Tue Nov 1 13:25:31 UTC 2016

Hi George,

You can try your tests in two scenarios:

1. Switch off following performance translators:
a. write-behind
b. quick-read
c. readdir-ahead

Once you see that the issue goes away in the above scenario, you can apply
write-behind fix [1]  and turn write-behind on.

We have a fix in quick-read at [2].

You can also try to test with the quick-read fix and turning quick-read on.
However, since your issue is stale stat I doubt quick-read is not the
culprit here (The issue with quick-read is stale data).

We see that readdir-ahead is not updating/flushing the stats stored along
with dentries when write or file modifications happen. That seem to be the
issue with readdir-ahead. For now, during both of the above two scenarios,
please keep readdir-ahead off.

[1] http://review.gluster.org/15757/2
[2] http://review.gluster.org/#/c/15759/2

regards,
Raghavendra

On Tue, Nov 1, 2016 at 7:55 AM, Pranith Kumar Karampuri <pkarampu at redhat.com
> wrote:

>
>
> On Tue, Nov 1, 2016 at 7:32 AM, Lian, George (Nokia - CN/Hangzhou) <
> george.lian at nokia.com> wrote:
>
>> Hi,
>>
>>
>>
>> I will test it with your patches and update to you when I have result.
>>
>
> hi George,
>       Please use http://review.gluster.org/#/c/15757/2 i.e. second
> version of Raghavendra's patch. I tested it and it worked fine. We are
> still trying to figure out quick-read and readdir-ahead as I type this mail.
>
>
>>
>> Thanks a lots
>>
>>
>>
>> Best Regards,
>>
>> George
>>
>>
>>
>> *From:* Pranith Kumar Karampuri [mailto:pkarampu at redhat.com]
>> *Sent:* Monday, October 31, 2016 11:23 AM
>> *To:* Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
>> *Cc:* Raghavendra Gowdappa <rgowdapp at redhat.com>; Zhang, Bingxuan (Nokia
>> - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Gluster-devel at gluster.org;
>> Zizka, Jan (Nokia - CZ/Prague) <jan.zizka at nokia.com>
>>
>> *Subject:* Re: [Gluster-devel] Issue about the size of fstat is less
>> than the really size of the syslog file
>>
>>
>>
>> Removing I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com, it is causing
>> mail delivery problems for me.
>>
>> George,
>>
>>          Raghavendra and I made some progress on this issue. We were in
>> parallel working on another issue which is similar where elastic search
>> indices are getting corrupted because of wrong stat sizes in our opinion.
>> So I have been running different translator stacks in identifying the
>> problematic xlators which are leading to indices corruption.
>>
>>           We found the list to be 1) Write-behind, 2) Quick-read, 3)
>> Readdir-ahead. Raghavendra and I just had a chat and we are suspecting that
>> lack of lookup/readdirp implementation in write-behind could be the reason
>> for this problem. Similar problems may exist in other two xlators too. But
>> we are working on write-behind with priority.
>>
>> Our theory is this:
>>
>> If we do a 4KB write for example and it is cached in write-behind and we
>> do a lookup on the file/do a readdirp on the directory with this file we
>> send out wrong stat value to the kernel. There are different caches between
>> kernel and gluster which may lead to fstat never coming till write-behind.
>> So we need to make sure that we don't get into this situation.
>>
>> Action items:
>>
>>  At the moment Raghavendra is working on a patch to implement
>> lookup/readdirp in write-behind. I am going to test the same for elastic
>> search. Will it be possible for you to test your application against the
>> same patch and confirm that the patch fixes the problem?
>>
>>
>>
>> On Fri, Oct 28, 2016 at 12:08 PM, Pranith Kumar Karampuri <
>> pkarampu at redhat.com> wrote:
>>
>> hi George,
>>
>>        It would help if we can identify the bare minimum xlators which
>> are contributing to the issue like Raghavendra was mentioning earlier. We
>> were wondering if it is possible for you to help us in identifying the
>> issue by running the workload on a modified setup? We can suggest testing
>> out using custom volfiles so that we can slowly build the graph which could
>> be causing this issue. We would like you guys to try out this problem with
>> just posix-xlator and fuse and nothing else.
>>
>>
>>
>> On Thu, Oct 27, 2016 at 1:40 PM, Lian, George (Nokia - CN/Hangzhou) <
>> george.lian at nokia.com> wrote:
>>
>> Hi, Raghavendra,
>>
>> Could you please give some suggestion for this issue? we try to find the
>> clue for this issue for a long time, but it has no progress:(
>>
>> Thanks & Best Regards,
>> George
>>
>> -----Original Message-----
>> From: Lian, George (Nokia - CN/Hangzhou)
>> Sent: Wednesday, October 19, 2016 4:40 PM
>> To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
>> Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
>> I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
>> (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
>> CZ/Prague) <jan.zizka at nokia.com>
>> Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
>> the really size of the syslog file
>>
>> Hi, Raghavendra
>>
>> Just now, we test it with glusterfs log with debug-level "TRACE", and let
>> some application trigger "glusterfs" produce large log, in that case, when
>> we set write-behind and stat-prefetch both OFF,
>> Tail the glusterfs log such like mnt-{VOLUME-NAME}.log, it still failed
>> with "file truncated",
>>
>> So that means if file's IO in huge amount, the issue will still be there
>> even write-behind and stat-prefetch both OFF.
>>
>> Best Regards,
>> George
>>
>> -----Original Message-----
>> From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
>>
>> Sent: Wednesday, October 19, 2016 2:54 PM
>> To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
>> Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS <
>> I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
>> (Nokia - CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
>> CZ/Prague) <jan.zizka at nokia.com>
>> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
>> the really size of the syslog file
>>
>>
>>
>> ----- Original Message -----
>> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
>> > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> > Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
>> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
>> (Nokia - CN/Hangzhou)"
>> > <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)" <
>> jan.zizka at nokia.com>
>> > Sent: Wednesday, October 19, 2016 12:05:01 PM
>> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
>> the really size of the syslog file
>> >
>> > Hi, Raghavendra,
>> >
>> > Thanks a lots for your quickly update!
>> > In my case, there are so many process(write) is writing to the syslog
>> file,
>> > it do involve the writer is in the same host and writing in same mount
>> point
>> > while the tail(reader) is reading it.
>> >
>> > The bug I just guess is:
>> > When a writer write the data with write-behind, it call the call-back
>> > function " mdc_writev_cbk" and called "mdc_inode_iatt_set_validate" to
>> > validate the "iatt" data, but with the code I mentioned last mail, it do
>> > nothing.
>>
>> mdc_inode_iatt_set_validate has following code
>>
>> <snippet>
>>                 if (!iatt || !iatt->ia_ctime) {
>>                         mdc->ia_time = 0;
>>                         goto unlock;
>>                 }
>> </snippet>
>>
>> Which means a NULL iatt sets mdc->ia_time to 0. This results in
>> subsequent lookup/stat calls to be NOT served from md-cache. Instead, the
>> stat is served from backend bricks. So, I don't see an issue here.
>>
>> However, one case where a NULL iatt is different from a valid iatt (which
>> differs from the value stored in md-cache) is that the latter results in a
>> call to inode_invalidate. This invalidation propagates to kernel and all
>> dentry and page cache corresponding to file is purged. So, I am suspecting
>> whether the stale stat you saw was served from kernel cache (not from
>> glusterfs). If this is the case, having mount options "attribute-timeout=0"
>> and "entry-timeout=0" should've helped.
>>
>> I am still at loss to point out the RCA for this issue.
>>
>>
>> > And in same time, the reader(tail) read the "iatt" data, but in case of
>> the
>> > cache-time is not timeout, it will return the "iatt" data without the
>> last
>> > change.
>> >
>> > Do your think it is a possible bug?
>> >
>> > Thanks & Best Regards,
>> > George
>> >
>> > -----Original Message-----
>> > From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
>> > Sent: Wednesday, October 19, 2016 2:06 PM
>> > To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
>> > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
>> > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
>> (Nokia -
>> > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia - CZ/Prague)
>> > <jan.zizka at nokia.com>
>> > Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
>> the
>> > really size of the syslog file
>> >
>> >
>> >
>> > ----- Original Message -----
>> > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
>> > > To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> > > Cc: Gluster-devel at gluster.org, "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
>> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
>> (Nokia
>> > > - CN/Hangzhou)"
>> > > <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia - CZ/Prague)"
>> > > <jan.zizka at nokia.com>
>> > > Sent: Wednesday, October 19, 2016 10:51:24 AM
>> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less
>> than the
>> > > really size of the syslog file
>> > >
>> > > Hi, Raghavendra,
>> > >
>> > > When we disable md-cache(gluster volume set log
>> > > performance.md-cache-timeout
>> > > 0),  the issue seems gone.
>> > > (we can't disable with " gluster volume set log performance.md-cache
>> off"
>> > > why?)
>> >
>> > Please use
>> > #gluster volume set log performance.stat-prefetch off
>> >
>> > >
>> > > So I double confuse that the code I abstract in last mail maybe have
>> some
>> > > issue for this case.
>> > > Could you please share your comments?
>> >
>> > Please find my comments below.
>> >
>> > >
>> > > Thanks & Best Regards,
>> > > George
>> > >
>> > > -----Original Message-----
>> > > From: Lian, George (Nokia - CN/Hangzhou)
>> > > Sent: Friday, October 14, 2016 1:44 PM
>> > > To: 'Raghavendra Gowdappa' <rgowdapp at redhat.com>
>> > > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
>> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
>> (Nokia
>> > > -
>> > > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
>> CZ/Prague)
>> > > <jan.zizka at nokia.com>
>> > > Subject: RE: [Gluster-devel] Issue about the size of fstat is less
>> than the
>> > > really size of the syslog file
>> > >
>> > > Hi, Raghavendra,
>> > >
>> > > Our version of GlusterFS is 3.6.9, and I also check the newest code
>> of main
>> > > branch, the function of " mdc_inode_iatt_set_validate" is almost
>> same, from
>> > > the following code of this function,
>> > > We could see a "TODO" comments inline, does it mean if we enhance
>> > > write-behind feature, the "iatt" field in callback will be NULL, so
>> that
>> > > inode_invalidate will not be called? So the size of file will not
>> update
>> > > since "write behind" enabled ?
>> > > Is it the root cause for "tail" application failed with "file
>> truncated"
>> > > issue ?
>> > >
>> > >         LOCK (&mdc->lock);
>> > >         {
>> > >                 if (!iatt || !iatt->ia_ctime) {
>> > >                         mdc->ia_time = 0;
>> > >                         goto unlock;
>> > >                 }
>> > >
>> > >             /*
>> > >              * Invalidate the inode if the mtime or ctime has changed
>> > >              * and the prebuf doesn't match the value we have cached.
>> > >              * TODO: writev returns with a NULL iatt due to
>> > >              * performance/write-behind, causing invalidation on
>> writes.
>> > >              */
>> >
>> > The issue explained in this comment is hit only when writes are done.
>> But, in
>> > your use-case only "tail" is the application running on the mount (If I
>> am
>> > not wrong, the  writer is running on a different mountpoint). So, I
>> doubt
>> > you are hitting this issue. But, you are saying that the issue goes away
>> > when write-behind/md-cache is turned off pointing to some interaction
>> > between md-cache and write-behind causing the issue. I need more time to
>> > look into this issue. Can you file a bug on this?
>> >
>> > >             if (IA_ISREG(inode->ia_type) &&
>> > >                 ((iatt->ia_mtime != mdc->md_mtime) ||
>> > >                 (iatt->ia_ctime != mdc->md_ctime)))
>> > >                     if (!prebuf || (prebuf->ia_ctime !=
>> mdc->md_ctime) ||
>> > >                         (prebuf->ia_mtime != mdc->md_mtime))
>> > >                             inode_invalidate(inode);
>> > >
>> > >                 mdc_from_iatt (mdc, iatt);
>> > >
>> > >                 time (&mdc->ia_time);
>> > >         }
>> > >
>> > > Best Regards,
>> > > George
>> > > -----Original Message-----
>> > > From: Raghavendra Gowdappa [mailto:rgowdapp at redhat.com]
>> > > Sent: Thursday, October 13, 2016 8:58 PM
>> > > To: Lian, George (Nokia - CN/Hangzhou) <george.lian at nokia.com>
>> > > Cc: Gluster-devel at gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
>> > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>; Zhang, Bingxuan
>> (Nokia
>> > > -
>> > > CN/Hangzhou) <bingxuan.zhang at nokia.com>; Zizka, Jan (Nokia -
>> CZ/Prague)
>> > > <jan.zizka at nokia.com>
>> > > Subject: Re: [Gluster-devel] Issue about the size of fstat is less
>> than the
>> > > really size of the syslog file
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian at nokia.com>
>> > > > To: Gluster-devel at gluster.org
>> > > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
>> > > > <I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX at internal.nsn.com>, "Bingxuan Zhang
>> > > > (Nokia
>> > > > - CN/Hangzhou)" <bingxuan.zhang at nokia.com>, "Jan Zizka (Nokia -
>> > > > CZ/Prague)"
>> > > > <jan.zizka at nokia.com>
>> > > > Sent: Thursday, October 13, 2016 2:33:53 PM
>> > > > Subject: [Gluster-devel] Issue about the size of fstat is less than
>> the
>> > > > really size of the syslog file
>> > > >
>> > > > Hi, Dear Expert,
>> > > > We have use glusterfs as a network filesystem, and syslog store in
>> there,
>> > > > some clients on different host may write the syslog file via
>> “glusterfs”
>> > > > mount point.
>> > > > Now we encounter an issue when we “tail” the syslog file, it will
>> > > > occasional
>> > > > failed with error “ file truncated ”
>> > > > As we study and trace with the “tail” source code, it failed with
>> the
>> > > > following code:
>> > > > if ( S_ISREG (mode) && stats.st_size < f[i].size )
>> > > > {
>> > > > error (0, 0, _("%s: file truncated"), quotef (name));
>> > > > /* Assume the file was truncated to 0,
>> > > > and therefore output all "new" data. */
>> > > > xlseek (fd, 0, SEEK_SET, name);
>> > > > f[i].size = 0;
>> > > > }
>> > > > When stats.st_size < f[i].size, what mean the size report by fstat
>> is
>> > > > less
>> > > > than “tail” had read, it lead to “file truncated”, we also use
>> “strace”
>> > > > tools to trace the tail application, the related tail strace log as
>> the
>> > > > below:
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544549, ...}) = 0
>> > > > read(3, " Data … -"..., 8192) = 1444
>> > > > read(3, " Data.. "..., 8192) = 720
>> > > > read(3, "", 8192) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544789, ...}) = 0
>> > > > write(1, “DATA…..” ) = 2164
>> > > > write(2, "tail: ", 6tail: ) = 6
>> > > > write(2, "/mnt/log/master/syslog: file tru"...,
>> 38/mnt/log/master/syslog:
>> > > > file truncated) = 38
>> > > > as the above strace log, tail has read 1444+720=2164 bytes,
>> > > > but fstat tell “tail” 192544789 – 192543105 = 1664 which less than
>> 2164,
>> > > > so
>> > > > it lead to “tail” application “file truncated”.
>> > > > And if we turn off “write-behind” feature, the issue will not be
>> > > > reproduced
>> > > > any more.
>> > >
>> > > That seems strange. There are no writes happening on the fd/inode
>> through
>> > > which tail is reading/stating from. So, it seems strange that
>> write-behind
>> > > is involved here. I suspect whether any of
>> md-cache/read-ahead/io-cache is
>> > > causing the issue. Can you,
>> > >
>> > > 1. Turn off md-cache, read-ahead, io-cache xlators
>> > > 2. mount glusterfs with --attribute-timeout=0
>> > > 3. set write-behind on
>> > >
>> > > and rerun the tests? If you don't hit the issue, you can experiment by
>> > > turning on/off of md-cache, read-ahead and io-cache translators and
>> see
>> > > what
>> > > are the minimal number of xlators that need to be turned off to not
>> hit the
>> > > issue (with write-behind on)?
>> > >
>> > > regards,
>> > > Raghavendra
>> > >
>> > > > So we think it may be related to cache consistence issue due to
>> > > > performance
>> > > > consider, but we still have concern that:
>> > > > The syslog file is used only with “Append” mode, so the size of file
>> > > > shouldn’t be reduced, when a client read the file, why “fstat” can’t
>> > > > return
>> > > > the really size match to the cache?
>> > > > From current investigation, we doubt that the current implement of
>> > > > “glusterfs” has a bug on “fstat” when cache is on.
>> > > > Your comments is our highly appreciated!
>> > > > Thanks & Best Regards
>> > > > George
>> > > >
>> > > > _______________________________________________
>> > > > Gluster-devel mailing list
>> > > > Gluster-devel at gluster.org
>> > > > http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >
>> >
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>> --
>>
>> Pranith
>>
>>
>>
>>
>> --
>>
>> Pranith
>>
>
>
>
> --
> Pranith
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

-- 
Raghavendra G
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20161101/1109d4ca/attachment-0001.html>