[Bugs] [Bug 1449593] New: When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.

bugzilla at redhat.com bugzilla at redhat.com
Wed May 10 10:48:06 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1449593

            Bug ID: 1449593
           Summary: When either killing or restarting a brick with
                    performance.stat-prefetch on, stat sometimes returns a
                    bad st_size value.
           Product: Red Hat Gluster Storage
           Version: 3.3
         Component: arbiter
          Keywords: Triaged
          Severity: medium
          Assignee: ravishankar at redhat.com
          Reporter: ravishankar at redhat.com
        QA Contact: ksandha at redhat.com
                CC: bugs at gluster.org, miklos.fokin at appeartv.com,
                    pkarampu at redhat.com, ravishankar at redhat.com,
                    rhs-bugs at redhat.com, storage-qa-internal at redhat.com
        Depends On: 1444892
            Blocks: 1449329



+++ This bug was initially created as a clone of Bug #1444892 +++

Description of problem:
I have an application which counts how many bytes have been written to a file
(the value returned by pwrite gets added to the sum on each iteration).
After the application writes the file it does an fdatasync, and then calls
fstat.
I kill one of the brick processes (brick1) and restart it after a few seconds,
then give it enough time to heal (checking with volume heal info and grepping
for a number of entries greater than 0).

The fstat sometimes returns a bad st_size, even though it doesn't give any
errors.
If this happens after a restart, it simply returns a size that is a bit less
than it should be (around 100 megabytes).
If this happens after killing the brick process, the size is 0 (this happened
once with a 900 megabyte file, but otherwise always with files under 100
megabytes).


Version-Release number of selected component (if applicable):
3.10.0


How reproducible:
It happens rarely (once in about 150-300 tries), but with the attached
application and script it takes 5-15 minutes to reproduce on my own computer.


Steps to Reproduce:
1. Create a volume with replica 3, of which 1 brick is an arbiter.
2. Run the application (a.out from main.cpp in my scripts, can be automated
with compile_run.sh) to create and write a file many times on the volume.
3. Run the script (test_fstat.sh) to continuously kill and restart the first
brick.
4. After a while the application should detect a bad fstat st_size and exit,
outputting the st_size value; the script will report whether the application
exited after the brick was killed or after it was restarted.


Actual results:
Sometimes fstat returns an st_size of either 0, or less than what the file size
should be.


Expected results:
fstat should always return the correct file size, or produce an error.


Additional info:
With performance.stat-prefetch turned off I ran the application and the script
for about 45 minutes, but had no luck reproducing the bug.
It usually took 5-15 minutes to get an error with the option turned on.
Also the results seem to depend on the number of iterations/writes (thus
probably the size of the file being written) in the application.
When the number in the loop's condition (number of 400KB writes on a file) is
150, the problem happens both after killing and restarting.
When the number was higher (200000) I only got bad results after restarts,
except once out of around 10 times.

- FUSE mount configuration:
-o direct-io-mode=on passed explicitly to mount

- volume configuration:
nfs.disable: on
transport.address-family: inet
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
cluster.locking-scheme: granular
cluster.granular-entry-heal: enable

--- Additional comment from  on 2017-04-26 07:04:34 EDT ---

Yesterday I got a suggestion about a review that might fix the problem, with
the changes being in performance.readdir-ahead
(https://review.gluster.org/#/c/16419/), but I did manage to reproduce the
issue with it being off (and stat-prefetch being on) as well.

--- Additional comment from  on 2017-05-04 09:02:52 EDT ---

Today I found the part of the code that appears to be responsible when the
faulty 0 size from fstat shows up after a kill.
It has only an indirect connection to stat-prefetch, but the issue seems to be
what I describe in the next paragraph: with the fix I suggest I did not get a
reproduction in 30-35 minutes of testing, after which I only hit the other
problem, where the size is just a bit lower than it should be.

In afr_fsync_cbk, when receiving data from the bricks there are two times when
the fstat data can get updated: first is the initial update, and then the one
from the brick we selected to get the final data from.
During the debugging I found that if the initial update is coming from the
arbiter the size will be 0.
If the brick we selected to get the final data from is down, we get a struct
filled with zeroes, and an error value, thus we don't get a second update.
The change with which I didn't get a 0 size was simply waiting for a
non-arbiter brick to be the first update.

--- Additional comment from Pranith Kumar K on 2017-05-05 01:21:59 EDT ---

As per the mail and patch on the gluster-devel mailing list, this seems to be a
replicate issue. Changing the component and assignee.

--- Additional comment from Ravishankar N on 2017-05-05 03:29:24 EDT ---

(In reply to miklos.fokin from comment #2)

> In afr_fsync_cbk, when receiving data from the bricks there are two times
> when the fstat data can get updated: first is the initial update, and then
> the one from the brick we selected to get the final data from.
> During the debugging I found that if the initial update is coming from the
> arbiter the size will be 0.
> If the brick we selected to get the final data from is down, we get a struct
> filled with zeroes, and an error value, thus we don't get a second update.

Hi Miklos, did you observe while debugging that in afr_fsync_cbk(), the
'read_subvol' is indeed the brick you brought down? I haven't tried your test
yet but if a brick is brought down, then the read_subvol is updated on the next
lookup/inode_refresh where the brick is not marked readable any more until heal
is complete. So, if you brought down brick1, then the call to
afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

--- Additional comment from  on 2017-05-05 04:11 EDT ---



--- Additional comment from  on 2017-05-05 04:13:45 EDT ---

(In reply to Ravishankar N from comment #4)
> (In reply to miklos.fokin from comment #2)
> 
> > In afr_fsync_cbk, when receiving data from the bricks there are two times
> > when the fstat data can get updated: first is the initial update, and then
> > the one from the brick we selected to get the final data from.
> > During the debugging I found that if the initial update is coming from the
> > arbiter the size will be 0.
> > If the brick we selected to get the final data from is down, we get a struct
> > filled with zeroes, and an error value, thus we don't get a second update.
> 
> Hi Miklos, did you  observe while debugging that in afr_fsync_cbk(), the
> 'read_subvol' is indeed the brick you brought down? I haven't tried your
> test yet but if a brick is brought down, then the read_subvol is updated on
> the next lookup/inode_refresh where the brick is not marked readable any
> more until heal is complete. So, if you brought down brick1, then the call
> to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Hello Ravishankar, what I did was put some logging into afr_fsync_cbk() to see
which code paths get taken (after doing the same in other functions and finally
narrowing it down to this one while debugging).
I also added logging to print out the post_buff in each call to the
function.

The first log was in the "if (op_ret == 0) {" branch.
Another one was in the "if (local->op_ret == -1)" branch.
Another was in the "if (child_index == read_subvol)" branch.
I am also attaching the complete diff with the logs and the fix to show you the
code, perhaps it is easier to get it from there.

When things went wrong I received 3 calls:
  1 with a mostly correct structure, apart from the number of blocks and the
size of the file, which were zero; here I got a log from the first and second
places
  1 with a completely correct structure, but here I only got a log from the
first place
  1 with a completely zero structure, and no logs from any of the branches

I am writing this from memory, since I moved on and didn't save the logs, but I
can reproduce it and send you the file in case you need it.
These findings led me to believe that since the third branch never got taken,
it was the third reply that was selected, but since it returned an error it
never got processed.

--- Additional comment from  on 2017-05-09 03:44:48 EDT ---

In my last comment I forgot to mention that for debugging I had to set the
mount's attribute-timeout to 0, so that I could get the logs that I wanted.
Thought I would mention that as it might be helpful to someone wanting to do
the same.

--- Additional comment from Ravishankar N on 2017-05-09 04:09:28 EDT ---

Hi Miklos, I was able to re-create the issue with the test case you gave; I got
a zero st_size within a couple of minutes of running it. I will be sending the
patch out soon. The fix is indeed in afr_fsync_cbk, with some modifications
that should also handle the non-zero but incorrect st_size errors that you
observed.

--- Additional comment from  on 2017-05-09 04:23:25 EDT ---

Hey, great news, thanks!


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1444892
[Bug 1444892] When either killing or restarting a brick with
performance.stat-prefetch on, stat sometimes returns a bad st_size value.
https://bugzilla.redhat.com/show_bug.cgi?id=1449329
[Bug 1449329] When either killing or restarting a brick with
performance.stat-prefetch on, stat sometimes returns a bad st_size value.