[Gluster-devel] relative ordering of writes to same file from two different fds

Jeff Darcy jdarcy at redhat.com
Thu Sep 22 14:02:23 UTC 2016


> I don't understand the Jeff snippet above - if they are
> non-overlapping writes to different offsets, this would never happen.

The question is not whether it *would* happen, but whether it would be
*allowed* to happen, and my point is that POSIX is often a poor guide.
Sometimes it's unreasonably strict, sometimes it's very lax.

That said, my example was kind of bad because it doesn't actually work
unless issues of durability are brought in.  Let's say that there's a
crash between the writes and the reads.  (It's not even clear when POSIX
would consider a distributed system to have crashed.  Let's just say
*everything* dies.)  While the strict write requirements apply to the
non-durable state before it's flushed, and thus affect what gets
flushed when writes overlap, it's entirely permissible for
non-overlapping writes to be flushed out of order.  This is even quite
likely if the writes are on different file descriptors.
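
Here's a minimal sketch of that scenario in C (the file name "data" is
hypothetical, and error handling is omitted).  Two non-overlapping
writes go out on separate fds with no fsync; if everything dies before
writeback, nothing requires the first write to reach disk before the
second:

    /* Sketch only: two non-overlapping writes via separate fds.
     * If the whole system dies before writeback completes, POSIX
     * does not require "AAAA" to be on disk just because "BBBB"
     * is; the flush order of the two ranges is unspecified. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd1 = open("data", O_WRONLY | O_CREAT, 0644);
        int fd2 = open("data", O_WRONLY);

        pwrite(fd1, "AAAA", 4, 0);      /* first write:  offset 0    */
        pwrite(fd2, "BBBB", 4, 4096);   /* second write: offset 4096 */

        /* ... crash here: after recovery, the second write may be
         * visible even though the first was lost ... */
        return 0;
    }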

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html

> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily
> on the conformance document to tell the user what can be expected from
> the system. It is explicitly intended that a null implementation is
> permitted.

That's my absolute favorite part of POSIX, by the way.  It amounts to
"do whatever you want" in standards language.  What this really means is
that, when the system comes back up, the results of the second write
could be available even though the first was lost.  I'm not saying it
happens.  I'm not saying it's good or useful behavior.  I'm just saying
the standard permits it.
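
As an aside, you can at least detect that situation at runtime:
sysconf() reports whether the _POSIX_SYNCHRONIZED_IO option is
supported at all (a sketch, using only the standard constant):

    /* Sketch: check whether _POSIX_SYNCHRONIZED_IO is supported.
     * Where it isn't, POSIX explicitly permits fsync() to be a
     * null implementation. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long sync_io = sysconf(_SC_SYNCHRONIZED_IO);

        if (sync_io == -1)
            printf("synchronized I/O not supported; "
                   "fsync() may be a no-op\n");
        else
            printf("synchronized I/O supported (option value %ld)\n",
                   sync_io);
        return 0;
    }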

> If they are the same offset and at the same time, then you can have
> undefined results where you might get fragments of A and fragments of
> B (where you might be able to see some odd things if the write spans
> pages/blocks).

This is where POSIX goes the other way and *over*specifies behavior.
Normal linearizability requires that an action appear to be atomic at
*some* point between issuance and completion.  However, the POSIX "after
a write" wording forces this to be at the exact moment of completion.
It's not undefined.  If two writes overlap in both space and time, the
one that completes last *must* win.  Those "odd things" you mention
might be considered non-conformance with the standard.
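
To make that concrete, here's a sketch (hypothetical file name, error
handling omitted) where two overlapping writes race from separate
threads.  Under the POSIX wording above, the read must return
whichever payload's write completed last, in full; an interleaved mix
would arguably be non-conformant:

    /* Sketch: two writes overlapping in both space and time.
     * Per the POSIX "after a write" wording, the read below must
     * see "AAAA" or "BBBB" in full -- whichever write completed
     * last -- never a mix of the two. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *writer(void *arg)
    {
        int fd = open("data", O_WRONLY | O_CREAT, 0644);
        pwrite(fd, arg, 4, 0);   /* same 4-byte range in both threads */
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        char out[5] = {0};

        pthread_create(&t1, NULL, writer, "AAAA");
        pthread_create(&t2, NULL, writer, "BBBB");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        int fd = open("data", O_RDONLY);
        pread(fd, out, 4, 0);
        printf("%s\n", out);     /* one payload in full, not a mix */
        return 0;
    }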

Fortunately, Linux is not POSIX.  Linus and others have been quite clear
on that.  As much as I've talked about formal standards here, "what you
can get away with" is the real standard.  The page-cache behavior that
local filesystems rely on is IMO a poor guide, because extending that
behavior across physical systems is difficult to do completely and
impossible to do without impacting performance.  What matters is whether
users will accept this kind of reordering.  Here's what I think:

 (1) An expectation of ordering is only valid if the order is completely
     unambiguous.

 (2) This can only be the case if there was some coordination between
     when the first write completes and when the second is issued (see
     the sketch after this list).

 (3) The coordinating entities could be on different machines, in which
     case the potential for reordering is unavoidable (short of us
     adding write-behind serialization across all clients).

 (4) If it's unavoidable in the distributed case, there's not much value
     in trying to make it airtight in the local case.
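
For concreteness, here's the kind of coordination (2) means, sketched
with two local processes and a pipe (hypothetical file name, error
handling omitted).  The second write isn't issued until the first
writer has signaled completion, so the order is unambiguous; without
that signal, "first" and "second" aren't even well defined.  If the
two writers sat on different machines, as in (3), the same handshake
would have to cross the network, which is where the reordering
potential becomes unavoidable:

    /* Sketch: write B is issued only after write A has completed
     * and its completion has been signaled through a pipe.  This
     * is the coordination that makes "A before B" meaningful. */
    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];
        pipe(pipefd);

        if (fork() == 0) {                /* child: first writer */
            int fd = open("data", O_WRONLY | O_CREAT, 0644);
            pwrite(fd, "AAAA", 4, 0);
            close(fd);
            write(pipefd[1], "done", 4);  /* signal completion */
            _exit(0);
        }

        char msg[4];
        read(pipefd[0], msg, 4);          /* wait for write A to finish */

        int fd = open("data", O_WRONLY);  /* parent: second writer */
        pwrite(fd, "BBBB", 4, 4096);
        close(fd);

        wait(NULL);
        return 0;
    }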

In other words, standards aside, I'm kind of with Raghavendra on this.
We shouldn't add this much complexity and possibly degrade performance
unless we can provide a *meaningful guarantee* to users, and this area
is already such a swamp that any user relying on particular behavior is
likely to get themselves in trouble no matter what we do.


