[Gluster-devel] relative ordering of writes to same file from two different fds
Raghavendra G
raghavendra at gluster.com
Fri Sep 23 12:01:51 UTC 2016
On Wed, Sep 21, 2016 at 10:58 PM, Raghavendra Talur <rtalur at redhat.com>
wrote:
>
>
> On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler <ricwheeler at gmail.com> wrote:
>
>> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
>>
>>> Hi all,
>>>
>>> This mail is to figure out the behavior of writes to the same file from
>>> two different fds. As Ryan notes in one of the comments:
>>>
>>> <comment>
>>>
>>> I think it's not safe. In this case:
>>> 1. P1 writes to F1 using FD1.
>>> 2. After P1's write finishes, P2 writes to the same place using FD2.
>>> Since the two writes no longer conflict with each other, the order in
>>> which they are sent to the underlying fs is not determined, so the final
>>> data may be P1's or P2's.
>>> These semantics are not the same as linux buffered io: linux buffered io
>>> makes the second write cover the first one, which is to say the final
>>> data is P2's.
>>> You can see this in linux NFS (as we are all network filesystems): in
>>> fs/nfs/file.c:nfs_write_begin(), nfs flushes an 'incompatible' request
>>> before another write begins, and two requests are determined to be
>>> 'incompatible' when they come from 2 different open fds.
>>> I think write-behind behaviour should stay the same as the linux page
>>> cache.
>>>
>>> </comment>
>>>
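To make the scenario concrete, here is a minimal sketch of the case being
debated. The mount path, file name and data are made up; the two fds stand
in for P1 and P2:

<example src>
/* Two fds on the same file, writes strictly ordered in time at the
 * application level. The question: is the final on-disk data
 * guaranteed to be the second write's ("BBBB"), as it would be with
 * the linux page cache?
 */
#include <fcntl.h>
#include <unistd.h>

int
main (void)
{
        int fd1 = open ("/mnt/gluster/testfile", O_CREAT | O_WRONLY, 0644);
        int fd2 = open ("/mnt/gluster/testfile", O_WRONLY);

        /* P1: write through FD1; returns only after the ack */
        pwrite (fd1, "AAAA", 4, 0);

        /* P2: issued only after P1's write returned, through FD2 */
        pwrite (fd2, "BBBB", 4, 0);

        close (fd1);
        close (fd2);
        return 0;
}
</src>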
>>
>> I think that how this actually would work is that both would be written
>> to the same page in the page cache (when using buffered IO), so as long
>> as they do not happen at the same time, you would get P2's copy of the
>> data each time.
>>
>
> I apologize if my understanding is wrong, but IMO this is exactly what we
> do in write-behind too. The cache is inode-based and ensures that writes
> are ordered irrespective of the FD used for the write.
>
>
> Here is the commit message which brought the change:
> -------------------------------------------------------------------------
> write-behind: implement causal ordering and other cleanup
>
> Rules of causal ordering implemented:
>
> - If request A arrives after the acknowledgement (to the app,
>   i.e, STACK_UNWIND) of another request B, then request B is
>   said to have 'caused' request A.
>
By the above principle, for two write requests (p1 and p2 in the example
above) issued by _two different threads/processes_, there need _not
always_ be a 'causal' relationship (whether there is one depends purely on
the "chance" that write-behind happened to ack one or both of them before
the other arrived). So, the current write-behind is agnostic to the
ordering of p1 and p2 (when they are done by two threads). However, if p1
and p2 are issued by the same thread, there is _always_ a causal
relationship (p2 being caused by p1).
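A sketch of that rule (illustrative names only, not the actual
write-behind code):

<example src>
#include <stdint.h>

/* Request B is said to have 'caused' request A iff B was acked
 * (STACK_UNWIND'ed to the app) strictly before A arrived. Two
 * requests that are simultaneously unacknowledged can never cause
 * each other.
 */
typedef struct {
        uint64_t arrival; /* when the request reached write-behind */
        uint64_t acked;   /* when it was STACK_UNWIND'ed; 0 = not yet */
} wb_req_sketch_t;

static int
caused_by (wb_req_sketch_t *a, wb_req_sketch_t *b)
{
        return (b->acked != 0 && b->acked < a->arrival);
}
</src>

For two threads, nothing forces p2's arrival to come after p1's ack, so
caused_by() may or may not hold; for a single thread issuing blocking
writes, it always holds.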
>
> - (corollary) Two requests which are, at any point of time,
>   unacknowledged simultaneously in the system can never 'cause'
>   each other (wb_inode->gen is based on this)
>
> - If request A is caused by request B, AND request A's region
>   has an overlap with request B's region, then the fulfillment
>   of request A is guaranteed to happen after the fulfillment of B.
>
> - FD of origin is not considered for the determination of causal
>   ordering.
>
> - Append operation's region is considered the whole file.
>
> Other cleanup:
>
> - wb_file_t not required any more.
>
> - wb_local_t not required any more.
>
> - O_RDONLY fd's operations now go through the queue to make sure
>   writes in the requested region get fulfilled be
> -------------------------------------------------------------------------
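To illustrate the overlap and append rules quoted above, here is a hedged
sketch (names and struct layout are made up, not the actual write-behind
identifiers) of a region check that treats an append as covering the
whole file:

<example src>
#include <stdint.h>

typedef struct {
        uint64_t offset;
        uint64_t size;
        int      append;
} wb_region_sketch_t;

/* Two requests must be ordered only if one caused the other AND
 * their regions overlap. An append's region is the whole file, so
 * it overlaps with every other request.
 */
static int
regions_overlap (wb_region_sketch_t *r1, wb_region_sketch_t *r2)
{
        if (r1->append || r2->append)
                return 1;

        /* half-open interval test: [offset, offset + size) */
        return (r1->offset < r2->offset + r2->size) &&
               (r2->offset < r1->offset + r1->size);
}
</src>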
>
> Thanks,
> Raghavendra Talur
>
>
>>
>> Same story for using O_DIRECT - that write bypasses the page cache and
>> will update the data directly.
>>
>> What might happen in practice, though, is that your applications might
>> use higher-level IO routines that buffer data internally. If that
>> happens, there is no predictable ordering.
>>
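As an illustration of that last point, a small sketch (file name made up)
using stdio: both writes sit in user-space buffers, and the flush order,
not the fwrite() order, decides which bytes survive:

<example src>
#include <stdio.h>

int
main (void)
{
        /* both streams start at offset 0 of the same file */
        FILE *a = fopen ("/tmp/demo", "w+");
        FILE *b = fopen ("/tmp/demo", "r+");

        fwrite ("AAAA", 1, 4, a); /* sits in a's user-space buffer */
        fwrite ("BBBB", 1, 4, b); /* sits in b's user-space buffer */

        /* the flushes, not the fwrite()s, order the data: here
         * "AAAA" reaches the kernel last and is the final data
         */
        fflush (b);
        fflush (a);

        fclose (a);
        fclose (b);
        return 0;
}
</src>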
>> Regards,
>>
>> Ric
>>
>>
>>
>>> However, my understanding is that filesystems need not maintain the
>>> relative order of writes (as received from the vfs/kernel) on two
>>> different fds. Also, if we had to maintain the order, it might come with
>>> increased latency, because "newer" writes would have to wait on "older"
>>> ones. This wait can fill up the write-behind buffer, eventually leaving
>>> the write-behind cache full and unable to "write back" newer writes.
>>>
>>> * What does POSIX say about it?
>>> * How do other filesystems behave in this scenario?
>>>
>>>
>>> Also, the current write-behind implementation has the concept of
>>> "generation numbers". To quote from the comment:
>>>
>>> <write-behind src>
>>>
>>> uint64_t gen;   /* Liability generation number. Represents
>>>                    the current 'state' of liability. Every
>>>                    new addition to the liability list bumps
>>>                    the generation number.
>>>
>>>                    a newly arrived request is only required
>>>                    to perform causal checks against the
>>>                    entries in the liability list which were
>>>                    present at the time of its addition. the
>>>                    generation number at the time of its
>>>                    addition is stored in the request and used
>>>                    during checks.
>>>
>>>                    the liability list can grow while the
>>>                    request waits in the todo list waiting for
>>>                    its dependent operations to complete.
>>>                    however it is not of the request's concern
>>>                    to depend itself on those new entries which
>>>                    arrived after it arrived (i.e, those that
>>>                    have a liability generation higher than
>>>                    itself)
>>>                 */
>>> </src>
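To restate that comment as code, a sketch (illustrative names, not the
actual source):

<example src>
#include <stdint.h>

/* A request snapshots wb_inode->gen on arrival. During causal
 * checks it only considers liability entries whose generation is
 * not higher than its snapshot, i.e. entries already present when
 * it arrived.
 */
typedef struct {
        uint64_t gen; /* generation stamped when added to liability */
} wb_liability_sketch_t;

static int
must_check_against (uint64_t req_gen_snapshot,
                    wb_liability_sketch_t *entry)
{
        return (entry->gen <= req_gen_snapshot);
}
</src>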
>>>
>>> So, if a single thread is doing writes on two different fds, generation
>>> numbers are sufficient to enforce the relative ordering. If writes are from
>>> two different threads/processes, I think write-behind is not obligated to
>>> maintain their order. Comments?
>>>
>>> [1] http://review.gluster.org/#/c/15380/
>>>
>>> regards,
>>> Raghavendra
--
Raghavendra G