[Gluster-devel] relative ordering of writes to same file from two different fds
Raghavendra Talur
rtalur at redhat.com
Wed Sep 21 17:28:45 UTC 2016
On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler <ricwheeler at gmail.com> wrote:
> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
>
>> Hi all,
>>
>> This mail is to figure out the expected behavior of writes to the same file
>> from two different fds. As Ryan says in one of the comments,
>>
>> <comment>
>>
>> I think it’s not safe. Consider this case:
>> 1. P1 writes to F1 using FD1.
>> 2. After P1’s write finishes, P2 writes to the same place using FD2.
>> Since the two writes do not conflict with each other at that point, the
>> order in which they are sent to the underlying fs is not determined, so the
>> final data may be P1’s or P2’s.
>> These semantics are not the same as Linux buffered IO. Linux buffered IO
>> makes the second write overwrite the first one, that is to say the final
>> data is P2’s.
>> You can see this in Linux NFS (as we are all network filesystems): in
>> fs/nfs/file.c:nfs_write_begin(), NFS flushes an ‘incompatible’ request
>> first before another write begins. Two requests are determined to be
>> ‘incompatible’ when they come from two different open fds.
>> I think write-behind behaviour should stay the same as the Linux page cache.
>>
>> </comment>
>>
>
> I think that how this actually would work is that both would be written to
> the same page in the page cache (if using buffered IO), so as long as
> they do not happen at the same time, you would get the second (P2) copy of
> the data each time.
>
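For reference, the sequential case described above can be checked from
userspace. The following is a minimal sketch, assuming a local POSIX
filesystem and a writable /tmp; the function name second_write_wins and the
temp-file template are made up for this illustration. It writes through one
fd and, only after that write has returned, writes the same range through a
second fd and reads the range back.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper for illustration only.
 * Returns 1 if a read after both writes sees the later write (via fd2),
 * 0 if it sees the earlier write (via fd1), -1 on any error. */
static int second_write_wins(void)
{
    char path[] = "/tmp/two_fd_orderXXXXXX";  /* assumed writable location */
    char buf[4] = {0};
    int result = -1;

    int fd1 = mkstemp(path);                  /* first descriptor (read/write) */
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDWR);             /* independent second descriptor */
    if (fd2 < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }

    /* The second pwrite() is issued only after the first one has returned,
     * so the two writes are never in flight at the same time. */
    if (pwrite(fd1, "P1P1", 4, 0) == 4 &&
        pwrite(fd2, "P2P2", 4, 0) == 4 &&
        pread(fd1, buf, 4, 0) == 4)
        result = (memcmp(buf, "P2P2", 4) == 0);

    close(fd2);
    close(fd1);
    unlink(path);
    return result;
}
```

On a local filesystem this is expected to return 1, i.e. the data read back
is P2's. The question in this thread is whether the same holds when
write-behind sits between the application and the underlying fs.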
I apologize if my understanding is wrong but IMO this is exactly what we do
in write-behind too. The cache is inode based and ensures that writes are
ordered irrespective of the FD used for the write.
Here is the commit message of the change that introduced this:
-------------------------------------------------------------------------------------
write-behind: implement causal ordering and other cleanup
Rules of causal ordering implemented:
- If request A arrives after the acknowledgement (to the app,
  i.e., STACK_UNWIND) of another request B, then request B is
  said to have 'caused' request A.
- (corollary) Two requests, which at any point of time, are
  unacknowledged simultaneously in the system can never 'cause'
  each other (wb_inode->gen is based on this)
- If request A is caused by request B, AND request A's region
  has an overlap with request B's region, then the fulfillment
  of request A is guaranteed to happen after the fulfillment of B.
- FD of origin is not considered for the determination of causal
  ordering.
- Append operation's region is considered the whole file.
Other cleanup:
- wb_file_t not required any more.
- wb_local_t not required any more.
- O_RDONLY fd's operations now go through the queue to make sure
  writes in the requested region get fulfilled be
-----------------------------------------------------------------------------------------------
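To make the rules above concrete, here is a rough sketch of the dependency
check they imply. It is not the write-behind code: the types and functions
(sketch_req_t, sketch_inode_t, arrive, become_liability, must_wait_for) are
hypothetical, and the way the generation stamp is assigned is an assumption
modelled on the 'gen' comment quoted later in this mail. Note that no fd
appears anywhere in the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the write-behind structures. */
typedef struct {
    uint64_t offset;  /* start of the byte range */
    uint64_t size;    /* length of the byte range */
    uint64_t gen;     /* generation stamp, set by arrive()/become_liability() */
    bool     append;  /* an append's region is treated as the whole file */
} sketch_req_t;

typedef struct {
    uint64_t gen;     /* bumped on every addition to the liability list */
} sketch_inode_t;

/* True if the two byte ranges share at least one byte. */
static bool regions_overlap(const sketch_req_t *a, const sketch_req_t *b)
{
    if (a->append || b->append)
        return true;
    if (a->size == 0 || b->size == 0)
        return false;
    return a->offset < b->offset + b->size &&
           b->offset < a->offset + a->size;
}

/* A new request arrives: remember the liability generation it saw. */
static void arrive(const sketch_inode_t *inode, sketch_req_t *req)
{
    req->gen = inode->gen;
}

/* A write is acknowledged to the app and becomes a liability:
 * bump the generation and stamp the entry with the new value (assumption). */
static void become_liability(sketch_inode_t *inode, sketch_req_t *req)
{
    inode->gen++;
    req->gen = inode->gen;
}

/* req must be fulfilled after lie only if lie was already acknowledged
 * when req arrived (causality) AND their regions overlap. */
static bool must_wait_for(const sketch_req_t *req, const sketch_req_t *lie)
{
    return lie->gen <= req->gen && regions_overlap(req, lie);
}
```

In this model, a write that arrives after an overlapping write was
acknowledged has to wait for it, while a write that was already pending when
that acknowledgement happened, or one that touches a different region, does
not.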
Thanks,
Raghavendra Talur
>
> Same story for using O_DIRECT - that write bypasses the page cache and
> will update the data directly.
>
> What might happen in practice though is that your applications might use
> higher level IO routines and they might buffer data internally. If that
> happens, there is no ordering that is predictable.
>
> Regards,
>
> Ric
>
>
>
>> However, my understanding is that filesystems need not maintain the
>> relative order of writes (as received from the vfs/kernel) on two different
>> fds. Also, maintaining that order might come with increased latency,
>> because "newer" writes would have to wait on "older" ones. This wait can
>> fill up the write-behind cache, leaving write-behind unable to
>> "write-back" newer writes.
>>
>> * What does POSIX say about it?
>> * How do other filesystems behave in this scenario?
>>
>>
>> Also, the current write-behind implementation has the concept of
>> "generation numbers". To quote from comment:
>>
>> <write-behind src>
>>
>> uint64_t gen;    /* Liability generation number. Represents
>>                     the current 'state' of liability. Every
>>                     new addition to the liability list bumps
>>                     the generation number.
>>
>>                     a newly arrived request is only required
>>                     to perform causal checks against the entries
>>                     in the liability list which were present
>>                     at the time of its addition. the generation
>>                     number at the time of its addition is stored
>>                     in the request and used during checks.
>>
>>                     the liability list can grow while the request
>>                     waits in the todo list waiting for its
>>                     dependent operations to complete. however
>>                     it is not of the request's concern to depend
>>                     itself on those new entries which arrived
>>                     after it arrived (i.e, those that have a
>>                     liability generation higher than itself)
>>                  */
>> </src>
>>
>> So, if a single thread is doing writes on two different fds, generation
>> numbers are sufficient to enforce the relative ordering. If writes are from
>> two different threads/processes, I think write-behind is not obligated to
>> maintain their order. Comments?
>>
>> [1] http://review.gluster.org/#/c/15380/
>>
>> regards,
>> Raghavendra
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>