[Gluster-devel] relative ordering of writes to same file from two different fds

Thu Sep 22 04:07:53 UTC 2016

On 09/21/2016 08:58 PM, Jeff Darcy wrote:
>> However, my understanding is that filesystems need not maintain the relative
>> order of writes (as it received from vfs/kernel) on two different fds. Also,
>> if we have to maintain the order it might come with increased latency. The
>> increased latency can be because of having "newer" writes to wait on "older"
>> ones. This wait can fill up write-behind buffer and can eventually result in
>> a full write-behind cache and hence not able to "write-back" newer writes.
> IEEE 1003.1, 2013 edition
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
>
>> After a write() to a regular file has successfully returned:
>>
>> Any successful read() from each byte position in the file that was
>> modified by that write shall return the data specified by the write()
>> for that position until such byte positions are again modified.
>>
>> Any subsequent successful write() to the same byte position in the
>> file shall overwrite that file data.
> Note that the reference is to a *file*, not to a file *descriptor*.
> It's an application of the general POSIX assumption that time is
> simple, locking is cheap (if it's even necessary), and therefore
> time-based requirements like linearizability - what this is - are
> easy to satisfy.  I know that's not very realistic nowadays, but
> it's pretty clear: according to the standard as it's still written,
> P2's write *is* required to overwrite P1's.  Same vs. different fd
> or process/thread doesn't even come into play.
>
> Just for fun, I'll point out that the standard snippet above
> doesn't say anything about *non overlapping* writes.  Does POSIX
> allow the following?
>
>     write A
>     write B
>     read B, get new value
>     read A, get *old* value
>
> This is a non-linearizable result, which would surely violate
> some people's (notably POSIX authors') expectations, but good
> luck finding anything in that standard which actually precludes
> it.
>

I will reply to both comments here.

First, I think that all file systems will behave this way, since this is really 
a function of how the page cache and O_DIRECT work.

More broadly, this is not a promise or a hard-and-fast rule. The traditional 
approach for applications that do concurrent writes is to use either whole-file 
or byte-range locking whenever more than one thread/process is doing IO to the 
same file concurrently.

I don't understand Jeff's snippet above - if they are non-overlapping writes 
to different offsets, this would never happen.

If the writes are to the same offset and happen at different times, it would 
not happen either.
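
To make the "different times" case concrete, here is a minimal sketch in Python 
(my own illustration, not Gluster code; the file name demo.dat and the helper 
name are made up). It opens two independent fds on the same file, issues two 
writes to the same offset one after the other, and reads the range back. On a 
local Linux file system I would expect the later write to win regardless of 
which fd is used for the read; whether a distributed file system with 
write-behind gives the same answer is exactly the question in this thread.

```python
import os
import tempfile


def sequential_writes_two_fds(path):
    """Write the same byte range through two fds, one write after the other."""
    # Two independent descriptors (separate open file descriptions).
    fd1 = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fd2 = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd1, b"AAAA", 0)  # earlier write; returns before the next starts
        os.pwrite(fd2, b"BBBB", 0)  # later write to the same offset, different fd
        return os.pread(fd1, 4, 0)  # read back through the first fd
    finally:
        os.close(fd1)
        os.close(fd2)


with tempfile.TemporaryDirectory() as d:
    result = sequential_writes_two_fds(os.path.join(d, "demo.dat"))
    print(result)
```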

If they are to the same offset and at the same time, then you can get undefined 
results where you might see fragments of A and fragments of B (and you might 
see some odd things if the write spans pages/blocks).

This last case is where the normal best practice comes in to suggest using locking.
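
As a rough sketch of what I mean by byte-range locking (Python, using 
fcntl.lockf, which wraps POSIX fcntl record locks; locked_pwrite, demo and the 
file name shared.dat are hypothetical names for illustration): two processes 
write a multi-page buffer to the same offset, each holding an exclusive lock on 
the range while it writes. Note that POSIX record locks are owned by the 
process, so this serializes writers in different processes, not two fds or 
threads inside one process.

```python
import fcntl
import os
import tempfile


def locked_pwrite(fd, data, offset):
    """Write data at offset while holding an exclusive lock on that byte range."""
    # lockf(fd, cmd, len, start, whence) blocks until the range can be locked.
    fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
    try:
        return os.pwrite(fd, data, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)


def demo(path, size=1 << 16):
    """Two processes write `size` bytes at offset 0; return the final contents."""
    os.close(os.open(path, os.O_RDWR | os.O_CREAT, 0o600))
    pid = os.fork()
    payload = b"A" if pid == 0 else b"B"  # child writes A, parent writes B
    fd = os.open(path, os.O_RDWR)         # each process opens its own fd
    locked_pwrite(fd, payload * size, 0)
    os.close(fd)
    if pid == 0:
        os._exit(0)                       # child stops here
    os.waitpid(pid, 0)                    # parent waits, then inspects the file
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as d:
    data = demo(os.path.join(d, "shared.dat"))
    # Either writer may go last, but the range should hold only one of them.
    print(len(data), set(data) in ({ord("A")}, {ord("B")}))
```

Which writer ends up last is still up to scheduling; the lock only ensures the 
range is never a mix of the two.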

Ric