[GEDI] Strange data corruption issue with gluster (libgfapi) and ZFS

Thu Feb 27 21:12:50 UTC 2020

On Tue, Feb 25, 2020 at 3:12 PM Stefan Ring <stefanrin at gmail.com> wrote:
>
> I find many instances with the following pattern:
>
> current file length (= max position + size written): p
> write request n writes from (p + hole_size), thus leaving a hole
> request n+1 writes exactly hole_size, starting from p, thus completely
> filling the hole
> The two requests' in-flight times overlap.
> hole_size can be almost any value (7-127).

Victory! I have a reproducer in the form of a plain C libgfapi client.

However, I have not been able to trigger the corruption by merely
replaying this simple pattern artificially. For now, I need to feed my
reproducer 2 GB of data that I streamed out of the qemu block driver.
I get two possible end states out of my reproducer: the correct one or
a corrupted one in which 48 KB are zeroed out. It takes no more than 10
runs to get each of them at least once. The corrupted end state is
exactly the same as the one I got from the real qemu process from which
I obtained the streamed trace. This gives me a lot of confidence in the
soundness of my reproducer.
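
For illustration only, the quoted hole-then-fill pattern can be sketched
against a plain local file with POSIX pwrite(). This is not the libgfapi
reproducer: the helper names (write_fill, hole_then_fill,
count_zero_bytes) and the fill bytes are made up for the example, the two
writes are issued one after the other rather than overlapping in flight,
and, as noted above, this simple pattern alone has not triggered the
corruption.

```c
/* Illustrative sketch: hole-then-fill write pattern on a local file.
 * Assumption: plain POSIX I/O stands in for the libgfapi calls. */
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of value fill at offset off. */
static int write_fill(int fd, off_t off, size_t len, unsigned char fill)
{
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, fill, len);
    ssize_t n = pwrite(fd, buf, len, off);
    free(buf);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Request n writes tail_size bytes at p + hole_size, leaving a hole.
 * Request n+1 then writes exactly hole_size bytes at p, filling it.
 * Here p is the current file length. Returns 0 on success. */
int hole_then_fill(int fd, off_t p, size_t hole_size, size_t tail_size)
{
    assert(hole_size > 0 && tail_size > 0);
    if (write_fill(fd, p + (off_t)hole_size, tail_size, 0xBB) != 0)
        return -1;
    return write_fill(fd, p, hole_size, 0xAA);
}

/* Count zero bytes in [start, start + len). Since every write above
 * uses a non-zero fill byte, a correct end state contains none.
 * Returns -1 on a read error. */
long count_zero_bytes(int fd, off_t start, size_t len)
{
    long zeros = 0;
    unsigned char c;
    for (size_t i = 0; i < len; i++) {
        if (pread(fd, &c, 1, start + (off_t)i) != 1)
            return -1;
        if (c == 0)
            zeros++;
    }
    return zeros;
}
```

Calling hole_then_fill(fd, p, hole_size, tail_size) on a file of length p
and then count_zero_bytes() over the written range gives a simple check
for an unexpectedly zeroed region.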

More details will follow.