[GEDI] Strange data corruption issue with gluster (libgfapi) and ZFS

Tue Feb 25 14:12:27 UTC 2020

On Mon, Feb 24, 2020 at 1:35 PM Stefan Ring <stefanrin at gmail.com> wrote:
>
> What I plan to do next is look at the block ranges being written in
> the hope of finding overlaps there.

Status update:

I still have not found out what is actually causing this. I have not
found concurrent writes to overlapping file areas. But what I can say
is that by switching qemu_gluster_co_rw to the synchronous glusterfs
API (glfs_pwritev), the problem goes away.
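
For reference, a check for concurrent writes to overlapping file areas
could look roughly like the sketch below. It is only an illustration,
not the tooling actually used here: it assumes a hypothetical trace in
which every write request has been logged as a tuple
(offset, size, submit_time, complete_time), and the function name is
made up for the example.

```python
def find_concurrent_overlaps(requests):
    """Return index pairs (i, j) of requests that were in flight at the
    same time AND touch overlapping file ranges.

    Each request is a hypothetical (offset, size, submit_time,
    complete_time) tuple; this layout is an assumption, not a QEMU or
    gluster log format.
    """
    hits = []
    for i, (off_a, size_a, start_a, end_a) in enumerate(requests):
        for j in range(i + 1, len(requests)):
            off_b, size_b, start_b, end_b = requests[j]
            # Half-open ranges [off, off + size) intersect.
            ranges_overlap = off_a < off_b + size_b and off_b < off_a + size_a
            # In-flight intervals intersect.
            times_overlap = start_a < end_b and start_b < end_a
            if ranges_overlap and times_overlap:
                hits.append((i, j))
    return hits
```

An empty result from such a check corresponds to the finding above: no
two requests were in flight at the same time on overlapping ranges.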

Unfortunately, I have not yet been able to find out exactly how the
qcow2 file is grown. It looks like this happens simply as a consequence
of writing beyond the end, because, contrary to my expectations, neither
qemu_gluster_co_pwrite_zeroes nor qemu_gluster_do_truncate is ever
called. My current line of thinking is that there must be something
special about in-flight writes which grow the file.

I find many instances with the following pattern:

- current file length (= max position + size written): p
- write request n writes from (p + hole_size), thus leaving a hole
- request n+1 writes exactly hole_size, starting from p, thus completely
  filling the hole
- the two requests' in-flight times overlap
- hole_size can be almost any value (7-127)
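
The pattern above can be searched for mechanically. The following is a
minimal sketch under stated assumptions: the WriteReq record, its field
names and the function name are hypothetical, the trace is assumed to be
in submission order, and "current file length" is taken to be the
largest offset + size seen in earlier requests.

```python
from dataclasses import dataclass


@dataclass
class WriteReq:
    # Hypothetical trace record; field names are assumptions.
    offset: int    # position where the write starts
    size: int      # amount written
    start: float   # time the request was submitted
    end: float     # time the request completed


def find_hole_fill_pairs(requests, initial_len=0):
    """Return (n, n+1) index pairs where request n writes past the
    current end of file, request n+1 fills exactly the resulting hole,
    and both requests are in flight at the same time."""
    matches = []
    file_len = initial_len  # "p" in the description above
    for n in range(len(requests)):
        req = requests[n]
        if n + 1 < len(requests):
            nxt = requests[n + 1]
            hole_size = req.offset - file_len
            fills_hole = nxt.offset == file_len and nxt.size == hole_size
            in_flight_together = nxt.start < req.end and req.start < nxt.end
            if hole_size > 0 and fills_hole and in_flight_together:
                matches.append((n, n + 1))
        # Track the file length as max(position + size written).
        file_len = max(file_len, req.offset + req.size)
    return matches
```

Comparing the pairs reported by such a scan against the locations of the
data errors would show which instances of the pattern actually end in
corruption.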

I see fewer data errors than instances of this pattern, though.