[Bugs] [Bug 1444892] New: When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.

bugzilla at redhat.com bugzilla at redhat.com
Mon Apr 24 13:45:04 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1444892

            Bug ID: 1444892
           Summary: When either killing or restarting a brick with
                    performance.stat-prefetch on, stat sometimes returns a
                    bad st_size value.
           Product: GlusterFS
           Version: 3.10
         Component: stat-prefetch
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: miklos.fokin at appeartv.com
                CC: bugs at gluster.org



Created attachment 1273617
  --> https://bugzilla.redhat.com/attachment.cgi?id=1273617&action=edit
Compressed file containing main.cpp for the test application, compile_run.sh to
compile and run it, and test_fstat.sh to kill and restart one of the brick
processes.

Description of problem:
I have an application that counts how many bytes have been written to a file
(the value returned by pwrite is added to a running sum on each iteration).
After writing the file, the application calls fdatasync and then fstat.
I kill one of the brick processes (brick1) and restart it after a few seconds,
then give it enough time to heal (checking gluster volume heal info and
grepping for a non-zero number of entries).

fstat sometimes returns a bad st_size even though it does not report any
error.
If this happens after a restart, it returns a size that is a bit less than it
should be (around 100 megabytes).
If this happens after killing the brick process, the returned size is 0 (this
happened once with a 900 megabyte file, but otherwise always with files
shorter than 100 megabytes).
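
For reference, here is a minimal sketch of the kind of write/verify loop the
application performs (my own reconstruction, not the attached main.cpp; the
mount path is a placeholder, and the 400KB write size and loop count of 150
are taken from the numbers mentioned further down in this report):

// Sketch of the reproducer loop (not the attached main.cpp; the file path is
// a placeholder, the 400KB write size and 150 iterations match the numbers
// given in this report).
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char *path = "/mnt/gluster/testfile";  // assumed FUSE mount point
    static char buf[400 * 1024];                 // one 400KB write per iteration
    std::memset(buf, 'x', sizeof(buf));

    for (;;) {
        int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) { std::perror("open"); return 1; }

        off_t written = 0;                       // sum of pwrite return values
        for (int i = 0; i < 150; ++i) {
            ssize_t n = pwrite(fd, buf, sizeof(buf), written);
            if (n < 0) { std::perror("pwrite"); return 1; }
            written += n;
        }

        if (fdatasync(fd) != 0) { std::perror("fdatasync"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }
        if (st.st_size != written) {             // the bug: bad st_size, no error
            std::fprintf(stderr, "bad st_size: got %lld, expected %lld\n",
                         (long long)st.st_size, (long long)written);
            close(fd);
            return 2;
        }
        close(fd);
    }
}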


Version-Release number of selected component (if applicable):
3.10.0


How reproducible:
It happens rarely (once in roughly 150-300 tries), but with the attached
application and script it takes between 5 and 15 minutes to reproduce on my
computer.


Steps to Reproduce:
1. Create a replica 3 volume in which one brick is an arbiter.
2. Run the application (a.out built from main.cpp; this can be automated with
compile_run.sh) to repeatedly create and write a file on the volume.
3. Run the script (test_fstat.sh) to continuously kill and restart the first
brick.
4. After a while the application should detect a bad fstat st_size and exit,
printing the st_size value; the script will report whether the application
exited after the brick was killed or after it was restarted.


Actual results:
fstat sometimes returns an st_size that is either 0 or less than the actual
file size.


Expected results:
fstat should always return the correct file size, or produce an error.


Additional info:
With performance.stat-prefetch turned off I ran the application and the script
for about 45 minutes and could not reproduce the bug.
With the option turned on, it usually took 5-15 minutes to get an error.
The results also seem to depend on the number of iterations/writes in the
application (and thus probably on the size of the file being written).
When the loop count (the number of 400KB writes to the file) is 150, the
problem occurs both after killing and after restarting the brick.
When the count was higher (200000), I only got bad results after restarts,
except for once out of around 10 runs.

- FUSE mount configuration:
-o direct-io-mode=on passed explicitly to mount

- volume configuration:
nfs.disable: on
transport.address-family: inet
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
cluster.locking-scheme: granular
cluster.granular-entry-heal: enable
