[GEDI] Strange data corruption issue with gluster (libgfapi) and ZFS

Stefan Ring stefanrin at gmail.com
Thu Feb 27 22:25:36 UTC 2020

On Thu, Feb 27, 2020 at 10:12 PM Stefan Ring <stefanrin at gmail.com> wrote:
> Victory! I have a reproducer in the form of a plain C libgfapi client.
> However, I have not been able to trigger corruption by just executing
> the simple pattern in an artificial way. Currently, I need to feed my
> reproducer 2 GB of data that I streamed out of the qemu block driver.
> I get two possible end states out of my reproducer: The correct one or
> a corrupted one, where 48 KB are zeroed out. It takes no more than 10
> runs to get each of them at least once. The corrupted end state is
> exactly the same that I got from the real qemu process from where I
> obtained the streamed trace. This gives me a lot of confidence in the
> soundness of my reproducer.
> More details will follow.

Ok, so the exact sequence of activity around the corruption is this:

8700 and so on are the sequential request numbers. All of these
requests are writes. Blocks are 512 bytes.

8700: grows the file to a certain size (2134144 blocks)

<8700 retires, nothing in flight>

8701: writes 55 blocks inside the currently allocated file range, close
to the end (7 blocks short)

8702: writes 54 blocks from the end of 8701, growing the file by 47 blocks

<8702 retires, 8701 remains in flight>

8703: writes from the end of 8702, growing the file by 81 blocks

<8703 retires, 8701 remains in flight>

8704: writes 1623 blocks, also from the end of 8702, growing the file by 1542 blocks

<8701 retires>
<8704 retires>

The exact range covered by 8703 ends up zeroed out.

If 8701 retires earlier (before 8702 is issued), everything is fine.
