[Gluster-devel] patch for "limited performance for disperse volumes"

Xavier Hernandez xhernandez at datalab.es
Fri Feb 10 08:07:22 UTC 2017

Hi Raghavendra,

On 10/02/17 04:51, Raghavendra Gowdappa wrote:
> +gluster-devel
> ----- Original Message -----
>> From: "Milind Changire" <mchangir at redhat.com>
>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> Cc: "rhs-zteam" <rhs-zteam at redhat.com>
>> Sent: Thursday, February 9, 2017 11:00:18 PM
>> Subject: patch for "limited performance for disperse volumes"
>> My first comment was:
>> looks like the patch for "limited performance for disperse volumes" [1]
>> is going to be helpful for all other volume types as well; but how do
>> we guarantee ordering for writes over the same fd for the same offset
>> and length in the file?
>> Then, thinking it over a bit, and in case you missed my comment over IRC:
>> I was thinking about network multi-pathing, where RPC requests (two
>> writes) could be routed through different interfaces to the gluster
>> nodes, which might lead to a non-increasing transaction ID sequence
>> and hence to an incorrect final value if the older write is committed
>> to the same offset+length.
>> Then it dawned on me that, for blocking operations, the write() call
>> won't return until the data is safe on disk across the network, or
>> until the intermediate translators have cached it appropriately to be
>> written behind.
>> So would the patch work for two non-blocking writes originating from
>> the same thread, for the same fd and the same offset+length, being
>> routed over multi-pathing, with write #2 getting routed quicker than
>> write #1?
> To be honest, I've not considered the case of asynchronous writes from an application until now. What ordering guarantee do the OS/filesystems provide for two async writes? E.g., if there are two writes w1 and w2, when is w2 issued?
> * After the cbk of w1 is called, or
> * in parallel, just after async_write(w1) returns (the cbk of w1 has not been invoked yet)?
> What do POSIX or other standards (or the expectations of the OS) say about ordering in case 2 above?

I'm not an expert on POSIX. But I've found this [1]:

     2.9.7 Thread Interactions with Regular File Operations

     All of the following functions shall be atomic with respect to
     each other in the effects specified in POSIX.1-2008 when they
     operate on regular files or symbolic links: [...] write [...]

     If two threads each call one of these functions, each call shall
     either see all of the specified effects of the other call, or none
     of them. The requirement on the close() function shall also apply
     whenever a file descriptor is successfully closed, however caused
     (for example, as a consequence of calling close(), calling dup2(),
     or of process termination).

Not sure if this also applies to write requests issued asynchronously 
from the same thread, but that would be the worst case (if the OS 
already orders them, we won't have any problem).

As I see it, this is already satisfied by EC, because it doesn't allow 
two concurrent writes on the same region to happen at the same time. 
They can be reordered if the second one arrives before the first, but 
they are executed atomically, as POSIX requires. Not sure if AFR also 
satisfies this condition, but I think so.

From the point of view of EC it's irrelevant whether the write comes 
from the same thread or from different processes on different clients. 
They are handled in the same way.

However, one thing to be aware of (from the man page of write(2)):

     [...] among the effects that should be atomic across threads (and
     processes) are updates of the file offset. However, on Linux before
     version 3.14, this was not the case: if two processes that share an
     open file description (see open(2)) perform a write() (or
     writev(2)) at the same time, then the I/O operations were not atomic
     with respect to updating the file offset, with the result that the
     blocks of data output by the two processes might (incorrectly)
     overlap. This problem was fixed in Linux 3.14.

> [1] https://review.gluster.org/15036
>> just thinking aloud
>> --
>> Milind
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
