[Gluster-devel] relative ordering of writes to same file from two different fds

Raghavendra Talur rtalur at redhat.com
Wed Sep 21 17:28:45 UTC 2016


On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler <ricwheeler at gmail.com> wrote:

> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
>
>> Hi all,
>>
>> This mail is to figure out the behavior of writes to the same file from
>> two different fds. As Ryan says in one of the comments,
>>
>> <comment>
>>
>> I think it’s not safe. Consider this case:
>> 1. P1 writes to F1 using FD1.
>> 2. After P1's write finishes, P2 writes to the same place using FD2.
>> Since the two writes are not treated as conflicting, the order in which
>> they are sent to the underlying fs is not determined, so the final data
>> may be P1’s or P2’s.
>> These semantics differ from Linux buffered IO, which makes the second
>> write overwrite the first, that is to say, the final data is P2’s.
>> You can see this in Linux NFS (as both are network filesystems): in
>> fs/nfs/file.c:nfs_write_begin(), NFS flushes an ‘incompatible’ request
>> before another write begins. Two requests are deemed ‘incompatible’
>> when they come from two different open fds.
>> I think write-behind behaviour should stay the same as the Linux page
>> cache.
>>
>> </comment>
>>
>
> I think that how this actually would work is that both would be written to
> the same page in the page cache (if using buffered IO), so as long as
> they do not happen at the same time, you would get the second P2 copy of
> data each time.
>

I apologize if my understanding is wrong but IMO this is exactly what we do
in write-behind too. The cache is inode based and ensures that writes are
ordered irrespective of the FD used for the write.
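
To make the scenario concrete, here is a minimal sketch (my own
illustration, not Gluster code) of the case being discussed: one write
completes through FD1, then a second write to the same offset is issued
through FD2. For simplicity it uses a single process holding two
descriptors rather than two processes P1 and P2; the function name
last_writer_byte and the temporary path are made up for the example. On a
local filesystem the second write should win, which is the behaviour the
thread is asking write-behind to match.

```c
#define _XOPEN_SOURCE 700   /* for pwrite(), pread(), mkstemp() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Write 'a' through one descriptor, then 'b' through a second, independently
 * opened descriptor at the same offset. Returns the byte found at offset 0
 * afterwards, or -1 on any error. The temporary file is removed before
 * returning. (Hypothetical helper, for illustration only.) */
int last_writer_byte(void)
{
    char path[] = "/tmp/two-fd-XXXXXX";
    int fd1 = mkstemp(path);                 /* plays the role of FD1 */
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDWR);            /* FD2: a separate open() */
    if (fd2 < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }

    char out = 0;
    int ok = pwrite(fd1, "a", 1, 0) == 1     /* first write returns ... */
          && pwrite(fd2, "b", 1, 0) == 1     /* ... before the second starts */
          && pread(fd1, &out, 1, 0) == 1;    /* read back the final data */

    close(fd1);
    close(fd2);
    unlink(path);
    return ok ? (unsigned char)out : -1;
}
```

Running the same sequence against a Gluster mount with write-behind
enabled would be the interesting comparison; I have not included results
for that here.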


Here is the commit message of the change that introduced this:
-------------------------------------------------------------------------------------
write-behind: implement causal ordering and other cleanup


Rules of causal ordering implemented:

 - If request A arrives after the acknowledgement (to the app,
   i.e., STACK_UNWIND) of another request B, then request B is
   said to have 'caused' request A.

 - (corollary) Two requests, which at any point of time, are
   unacknowledged simultaneously in the system can never 'cause'
   each other (wb_inode->gen is based on this)

 - If request A is caused by request B, AND request A's region
   has an overlap with request B's region, then the fulfillment
   of request A is guaranteed to happen after the fulfillment of B.

 - FD of origin is not considered for the determination of causal
   ordering.

 - Append operation's region is considered the whole file.

 Other cleanup:

 - wb_file_t not required any more.

 - wb_local_t not required any more.

 - O_RDONLY fd's operations now go through the queue to make sure
   writes in the requested region get fulfilled be
-----------------------------------------------------------------------------------------------
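
The rules above can be sketched in a few lines of C. This is only a model
of the stated rules, not the write-behind source: struct inode_clock,
struct request and every function name below are hypothetical, and the
counter merely stands in for the idea behind wb_inode->gen. It assumes
off + len does not overflow. Note that the request carries no fd at all,
which mirrors the rule that the FD of origin is not considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state: a counter bumped on every acknowledgement. */
struct inode_clock {
    uint64_t gen;
};

/* Hypothetical request record. There is deliberately no fd field. */
struct request {
    uint64_t off;          /* start of the region */
    uint64_t len;          /* length of the region in bytes */
    bool     append;       /* an append's region is the whole file */
    uint64_t arrived_gen;  /* counter value seen when the request arrived */
    uint64_t acked_gen;    /* counter value after our ack; 0 = not acked yet */
};

/* Record a newly arrived request and the counter value it observed. */
static void request_arrive(struct inode_clock *ino, struct request *r,
                           uint64_t off, uint64_t len, bool append)
{
    r->off = off;
    r->len = len;
    r->append = append;
    r->arrived_gen = ino->gen;
    r->acked_gen = 0;
}

/* Acknowledge a request to the application and bump the counter. */
static void request_ack(struct inode_clock *ino, struct request *r)
{
    assert(r->acked_gen == 0);   /* a request is acknowledged at most once */
    r->acked_gen = ++ino->gen;
}

/* B 'caused' A if A arrived after B had already been acknowledged. */
static bool caused_by(const struct request *a, const struct request *b)
{
    return b->acked_gen != 0 && b->acked_gen <= a->arrived_gen;
}

/* True when the two regions share at least one byte. */
static bool regions_overlap(const struct request *a, const struct request *b)
{
    if (a->append || b->append)
        return true;
    return a->off < b->off + b->len && b->off < a->off + a->len;
}

/* A must be fulfilled after B only if B caused A and the regions overlap. */
static bool must_follow(const struct request *a, const struct request *b)
{
    return caused_by(a, b) && regions_overlap(a, b);
}
```

In this model, two requests that are unacknowledged at the same time never
satisfy caused_by() in either direction, which is the corollary stated in
the commit message.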

Thanks,
Raghavendra Talur


>
> Same story for using O_DIRECT - that write bypasses the page cache and
> will update the data directly.
>
> What might happen in practice though is that your applications might use
> higher level IO routines and they might buffer data internally. If that
> happens, there is no ordering that is predictable.
>
> Regards,
>
> Ric
>
>
>
>> However, my understanding is that filesystems need not maintain the
>> relative order of writes (as received from vfs/kernel) on two different
>> fds. Also, maintaining the order might come with increased latency,
>> because "newer" writes would have to wait on "older" ones. This wait can
>> fill up the write-behind cache, leaving it unable to "write-back" newer
>> writes.
>>
>> * What does POSIX say about it?
>> * How do other filesystems behave in this scenario?
>>
>>
>> Also, the current write-behind implementation has the concept of
>> "generation numbers". To quote from comment:
>>
>> <write-behind src>
>>
>>          uint64_t     gen;    /* Liability generation number. Represents
>>                                  the current 'state' of liability. Every
>>                                  new addition to the liability list bumps
>>                                  the generation number.
>>
>>                                  a newly arrived request is only required
>>                                  to perform causal checks against the entries
>>                                  in the liability list which were present
>>                                  at the time of its addition. the generation
>>                                  number at the time of its addition is stored
>>                                  in the request and used during checks.
>>
>>                                  the liability list can grow while the request
>>                                  waits in the todo list waiting for its
>>                                  dependent operations to complete. however
>>                                  it is not of the request's concern to depend
>>                                  itself on those new entries which arrived
>>                                  after it arrived (i.e, those that have a
>>                                  liability generation higher than itself)
>>                               */
>> </src>
>>
>> So, if a single thread is doing writes on two different fds, generation
>> numbers are sufficient to enforce the relative ordering. If writes are from
>> two different threads/processes, I think write-behind is not obligated to
>> maintain their order. Comments?
>>
>> [1] http://review.gluster.org/#/c/15380/
>>
>> regards,
>> Raghavendra
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>