[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Fri Jun 16 10:20:09 UTC 2017
On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
>> <aspandey at redhat.com> wrote:
>>
>> Hi All,
>>
>> We have been facing some issues with disperse (EC) volumes.
>> We know that EC is currently not good for random IO, as it requires a
>> READ-MODIFY-WRITE fop cycle whenever an offset or offset+length falls
>> in the middle of a stripe.
>>
>> Unfortunately, this can also happen with sequential writes.
>> Consider an EC volume with configuration 4+2. The stripe size for
>> this would be 512 * 4 = 2048, i.e. 2048 bytes of user data are
>> stored in one stripe.
>> Let's say 2048 + 512 = 2560 bytes have already been written to this
>> volume, so 512 bytes sit in the second stripe.
>> Now, if a sequential write arrives at offset 2560 with a size of
>> 1 byte, we have to read the whole stripe, encode it together with
>> that 1 byte, and then write it back.
>> The next write, at offset 2561 with a size of 1 byte, will again
>> READ-MODIFY-WRITE the whole stripe. This is causing bad performance.
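>>
>> To make the arithmetic concrete, here is a minimal sketch (hypothetical
>> helper, not actual ec translator code) of why both writes land in the
>> same stripe:
>>
>>     #include <stdint.h>
>>
>>     /* For a 4+2 volume: stripe_size = 512 * 4 = 2048 bytes of user data. */
>>     static inline uint64_t stripe_start(uint64_t offset, uint64_t stripe_size)
>>     {
>>         return (offset / stripe_size) * stripe_size;
>>     }
>>
>>     /* stripe_start(2560, 2048) == 2048 and stripe_start(2561, 2048) == 2048,
>>      * and neither write is stripe aligned, so each one forces a read,
>>      * re-encode and write-back of the full stripe covering [2048, 4096). */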
>>
>> There are tools and scenarios that generate this kind of load without
>> the user being aware of it.
>> Examples: fio and zip.
>>
>> Solution:
>> One possible solution to this issue is to keep the last stripe in
>> memory.
>> That way we need not read it again, and we save a READ fop going
>> over the network.
>> Considering the above example, we would have to keep at most the
>> last 2048 bytes in memory per file. This should not be a big deal,
>> as we already keep some data like xattrs and size info in memory
>> and take decisions based on it.
>>
>> Please share your thoughts on this, and any other solution you may
>> have.
>>
>>
>> Just adding more details.
>> The stripe will be kept in memory only while the lock on the inode is
>> active.
>>
>>
>> I think that's ok.
>>
>> One thing we are yet to decide is whether we want to read the stripe
>> every time we get the lock, or only keep it after an extending write
>> is performed.
>> I am thinking that keeping the stripe in memory just after an
>> extending write is better, as it doesn't involve an extra network
>> operation.
>>
>>
>> I wouldn't read the last stripe unconditionally every time we lock
>> the inode. There's no benefit at all for random writes (in fact it's
>> worse), and a sequential write will issue the read anyway when
>> needed. The only difference is a small delay for the first operation
>> after a lock.
>>
>>
>> Yes, perfect.
>>
>>
>>
>> What I would do is keep the last stripe of every write (we could
>> consider doing it per fd), even if it's not the last stripe of the
>> file (to also optimize sequential rewrites).
>>
>>
>> Ah, good point. But if we remember it per fd, one fd's cached data
>> can be overwritten on disk by another fd, so we would also need to do
>> cache invalidation.
>>
>
> We only cache data while we hold the inodelk, so all related fd's must
> be from the same client, and we control all of its writes, so cache
> invalidation in this case is pretty easy.
>
> There is still the possibility of two fd's from the same client writing
> to the same region. To handle this we would need some range checking on
> the writes, but all of this is local, so it's easy to control.
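>
> A minimal sketch of that local range check (hypothetical helper name,
> not actual ec code):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     /* True if [off1, off1+len1) and [off2, off2+len2) overlap. */
>     static inline bool ec_ranges_overlap(uint64_t off1, uint64_t len1,
>                                          uint64_t off2, uint64_t len2)
>     {
>         return (off1 < off2 + len2) && (off2 < off1 + len1);
>     }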
>
> Anyway, this is probably not a common case, so we could start by caching
> only the last stripe of the last write, ignoring the fd.
>
>> Maybe the implementation should consider this possibility.
>> Yet to think about how to do this, but it is a good point. We should
>> consider this.
>>
>
> Maybe we could keep, in the inode, a list of cached stripes sorted by
> offset (if the maximum number of entries is small, we could leave the
> list unsorted). Each fd should store the offset of its last write.
> Cached stripes should have a ref counter, just to account for the case
> where two fd's point to the same offset.
>
> When a new write arrives, we check the offset stored in the fd and see if
> it corresponds to a sequential write. If so, we look at the inode list to
> find the cached stripe, otherwise we can release the cached stripe.
>
> We can limit the number of cached entries and release the least recently
> used when we reach some maximum.
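>
> A rough sketch of the structures this could translate to (hypothetical
> names, my own assumptions, not actual ec translator code):
>
>     #include <stdint.h>
>
>     /* One cached, decoded stripe, linked into a per-inode LRU list. */
>     struct ec_stripe_entry {
>         struct ec_stripe_entry *prev;
>         struct ec_stripe_entry *next;
>         uint64_t                offset;  /* stripe-aligned offset */
>         uint32_t                refs;    /* fd's referencing this entry */
>         uint8_t                 data[];  /* stripe_size bytes */
>     };
>
>     /* Per-fd context: where the previous write on this fd ended, so a
>      * new write at exactly this offset is treated as sequential and
>      * looks up the cached stripe covering its unaligned head. */
>     struct ec_fd_ctx {
>         uint64_t next_expected_offset;
>     };
>
> On a write, if the write offset equals next_expected_offset we search
> the inode's list for the cached stripe; otherwise we can drop the fd's
> reference. When the list exceeds some maximum, the least recently used
> entry with refs == 0 is released.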
>
Yeah, this works :-).
Ashish,
Can all of this be implemented by 3.12?
>
>
>>
>>
>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>> split into multiple 128KB blocks that are sent in parallel and not
>> necessarily processed in sequential order. This means that big block
>> sizes won't benefit much from this optimization, since they will be
>> seen as partially non-sequential writes. Anyway, the change won't
>> hurt.
>>
>>
>> In this case, as per the solution, we won't cache anything, right?
>> Because we didn't request anything from the disk. We would only keep
>> data in the cache for an unaligned write at the current EOF. At least
>> that is what I had in mind.
>>
>
> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
> write is split into 8 blocks of 128KB, none of the writes will be
> aligned, and they can be received in any order. Suppose that the first
> write happens to be at offset 128K + 1. We don't have anything cached,
> so we read the needed stripes and cache the last one. Now the next write
> is at offset 1. In this case we won't get any benefit from the previous
> write, since the stripe we need is not cached. However, from the user's
> point of view the write is sequential.
>
> It won't hurt, but it won't get the full benefit of the new caching
> mechanism.
>
> As a mitigating factor, we could consider extending the solution I've
> explained above to allow caching multiple stripes per fd. A small number
> like 8 would be enough.
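>
> To put rough numbers on that (my own arithmetic, assuming the 128KB
> split observed above and the 2048-byte stripe of a 4+2 volume):
>
>     1MB user write         = 8 chunks of 128KB
>     128KB mod 2048         = 0, so every chunk boundary is itself
>                              stripe aligned
>     start offset 1         => each chunk begins and ends inside a
>                              stripe shared with a neighbouring chunk
>
> So each 1MB user-level write touches on the order of 8 partial
> "boundary" stripes, one per 128KB chunk, which is why a per-fd cache of
> about 8 stripes should cover most of this pattern even when the chunks
> complete out of order.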
>
> Xavi
>
>
--
Pranith