[Gluster-devel] stat() returns invalid file size when self healing

Mateusz Slupny mateusz.slupny at appeartv.com
Wed Apr 12 08:27:48 UTC 2017


Hi,

I'm observing strange behavior when accessing a glusterfs 3.10.0 volume 
through a FUSE mount: while self-healing is in progress, stat() on a file 
that I know has non-zero size and is being appended to returns 0 
(success), but with st_size set to 0.

Next week I'm planning to find a minimal reproducible example and file a 
bug report. I wasn't able to find any references to similar issues, but 
I wanted to make sure that it isn't an already known problem.

Some notes about my current setup:
- Multiple applications are writing to multiple FUSE mounts pointing to 
the same gluster volume, but only one of those applications writes to a 
given file at a time. I am only appending to files, or to be specific, 
calling pwrite() with the offset set to the file size obtained by stat() 
(see the sketch after this list). (I'm not sure whether using O_APPEND 
would change anything, but even if it did, it would only be a 
workaround, so it shouldn't matter.)
- The issue happens even if no reads are performed on those files, i.e. 
the load is no higher than usual.
- Since I call stat() only before writing, and only one node writes to a 
given file, stat() returns the invalid size even to the client that is 
writing to the file.
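
For reference, a minimal sketch of the append pattern described above 
(the function name, path handling and error handling are illustrative, 
not my actual code):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append len bytes to path at the offset reported by stat().
 * During self-heal, st.st_size is sometimes reported as 0 even though
 * the file is known to be non-empty, so the write would land at offset 0. */
static int append_record(const char *path, const void *buf, size_t len)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, buf, len, st.st_size);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}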

Steps to reproduce:
0. Have multiple processes constantly appending data to files.
1. Stop one replica.
2. Wait a few minutes.
3. Start that replica again - shd starts self healing.
4. stat() on some of the files that are being healed returns st_size 
equal to 0.
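
If it helps, here is a minimal sketch of how the symptom in step 4 can 
be observed (the mount path is a placeholder): poll stat() on a file 
that is known to be non-empty and still being appended to, and report 
whenever the size comes back as 0.

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Poll stat() on a file that is known to be non-empty and still being
 * appended to; print a line whenever st_size is reported as 0. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/gluster/some-file";

    for (;;) {
        struct stat st;
        if (stat(path, &st) == 0 && st.st_size == 0)
            printf("stat() ok but st_size == 0 for %s\n", path);
        usleep(100000); /* check every 100 ms */
    }
    return 0;
}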

Setup:
- glusterfs 3.10.0

- volume type: replicas with arbiters
Type: Distributed-Replicate
Number of Bricks: 12 x (2 + 1) = 36

- FUSE mount configuration:
-o direct-io-mode=on passed explicitly to mount

- volume configuration:
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.granular-entry-heal: enable
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
transport.address-family: inet
nfs.disable: on
cluster.locking-scheme: granular

I have already verified that the following options do not influence this 
behavior:
- cluster.data-self-heal-algorithm (all possible values)
- cluster.eager-lock
- cluster.consistent-metadata
- performance.stat-prefetch

I would greatly appreciate any hints on what may be wrong with the 
current setup, or what to focus on (or not) in a minimal reproducible 
example.

thanks and best regards,
Matt
