[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Mon Jul 3 03:35:54 UTC 2017
Ashish, Xavi,
I think it is better to implement this change as a separate
read-after-write caching xlator that we can load between the EC and client
xlators. That way EC does not take on more functionality than necessary,
and maybe this xlator can be reused elsewhere in the stack.
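
To make the idea concrete, here is a minimal, self-contained sketch of the
caching logic such an xlator would implement (plain C; the names, block size
and structures are illustrative only, not the actual GlusterFS xlator API):

/* Illustrative sketch only -- not the real xlator API.
 * A read-after-write cache keyed by inode: it remembers the block that
 * contains the tail of the last write and serves a subsequent overlapping
 * read locally, so the head of EC's READ-MODIFY-WRITE need not go over
 * the wire. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 512                 /* assumed per-brick fragment size */

struct raw_cache {                     /* one per inode while the lock is held */
    uint64_t offset;                   /* offset of the cached block */
    size_t   len;                      /* valid bytes in buf */
    char     buf[BLOCK_SIZE];
};

/* On every write passing through: remember the block holding its tail. */
static void cache_on_write(struct raw_cache *c, uint64_t off,
                           const char *data, size_t len)
{
    uint64_t tail_block = ((off + len - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    uint64_t start      = (off > tail_block) ? off : tail_block;

    c->offset = tail_block;
    c->len    = (off + len) - tail_block;
    memcpy(c->buf + (start - tail_block), data + (start - off),
           (off + len) - start);
}

/* On a read: serve it from the cache when it falls entirely inside the
 * cached block, otherwise report a miss and let the caller hit the network. */
static int cache_read(struct raw_cache *c, uint64_t off, char *out, size_t len)
{
    if (off >= c->offset && off + len <= c->offset + c->len) {
        memcpy(out, c->buf + (off - c->offset), len);
        return 0;                      /* hit */
    }
    return -1;                         /* miss */
}

int main(void)
{
    struct raw_cache c = {0};
    char data[600], head[88];

    memset(data, 'x', sizeof(data));
    cache_on_write(&c, 0, data, 600);  /* write ends inside block [512, 1024) */

    if (cache_read(&c, 512, head, sizeof(head)) == 0)
        printf("head of the next unaligned write served from cache\n");
    return 0;
}
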
On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspandey at redhat.com> wrote:
>
> I think it should be done, as we have agreement on the basic design.
>
> ------------------------------
> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
> *Cc: *"Ashish Pandey" <aspandey at redhat.com>, "Gluster Devel" <
> gluster-devel at gluster.org>
> *Sent: *Friday, June 16, 2017 3:50:09 PM
> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>
>
>
>
> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>>
>>>
>>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
>>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>>
>>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
>>> <aspandey at redhat.com <mailto:aspandey at redhat.com>
>>> <mailto:aspandey at redhat.com <mailto:aspandey at redhat.com>>>
>>> wrote:
>>>
>>> Hi All,
>>>
>>> We have been facing some issues with disperse (EC) volumes.
>>> We know that EC is currently not good for random IO, as it requires a
>>> READ-MODIFY-WRITE fop cycle whenever offset or offset+length falls in
>>> the middle of a stripe.
>>>
>>> Unfortunately, it can also happen with sequential writes.
>>> Consider an EC volume with configuration 4+2. The stripe size for
>>> this would be 512 * 4 = 2048, that is, 2048 bytes of user data
>>> stored in one stripe.
>>> Let's say 2048 + 512 = 2560 bytes are already written on this
>>> volume; 512 bytes would be in the second stripe.
>>> Now, if a sequential write arrives at offset 2560 with a size of
>>> 1 byte, we have to read the whole stripe, encode it with the 1
>>> new byte and then write it back.
>>> The next write, at offset 2561 with a size of 1 byte, will again
>>> READ-MODIFY-WRITE the whole stripe. This causes bad
>>> performance.
>>>
>>> There are tools and scenarios that generate this kind of load
>>> without users being aware of it.
>>> Examples: fio and zip
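>>>
>>> For illustration, the alignment arithmetic behind this example can be
>>> written as a small self-contained sketch (not EC code; STRIPE and the
>>> helper name are assumptions):
>>>
>>> #include <stdio.h>
>>> #include <stdint.h>
>>>
>>> #define STRIPE 2048   /* 512 * 4 bytes of user data for a 4+2 volume */
>>>
>>> /* Print the stripe-aligned region a write touches and whether it
>>>  * forces a READ-MODIFY-WRITE cycle. */
>>> static void rmw_span(uint64_t off, uint64_t len)
>>> {
>>>     uint64_t start = (off / STRIPE) * STRIPE;                      /* round down */
>>>     uint64_t end   = ((off + len + STRIPE - 1) / STRIPE) * STRIPE; /* round up */
>>>     int      rmw   = (off % STRIPE) != 0 || ((off + len) % STRIPE) != 0;
>>>
>>>     printf("write off=%llu len=%llu -> stripes [%llu, %llu), RMW: %s\n",
>>>            (unsigned long long)off, (unsigned long long)len,
>>>            (unsigned long long)start, (unsigned long long)end,
>>>            rmw ? "yes" : "no");
>>> }
>>>
>>> int main(void)
>>> {
>>>     rmw_span(2560, 1);    /* 1-byte write inside the second stripe */
>>>     rmw_span(2561, 1);    /* next byte: the same stripe is read again */
>>>     rmw_span(2048, 2048); /* fully aligned write: no read needed */
>>>     return 0;
>>> }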
>>>
>>> Solution:
>>> One possible solution to this issue is to keep the last stripe
>>> in memory.
>>> This way, we do not need to read it again and we save a READ fop
>>> going over the network.
>>> Considering the above example, we would keep at most the last 2048
>>> bytes in memory per file. This should not be a big
>>> deal, as we already keep some data, like xattrs and size info,
>>> in memory and take decisions based on it.
>>>
>>> Please share your thoughts on this, and also any other
>>> solution you may have.
>>>
>>>
>>> Just adding more details.
>>> The stripe will be in memory only while the lock on the inode is
>>> active.
>>>
>>>
>>> I think that's ok.
>>>
>>> One
>>> thing we are yet to decide on is: do we want to read the stripe
>>> every time we get the lock, or only after an extending write is
>>> performed?
>>> I think keeping the stripe in memory just after an extending write
>>> is better, as it doesn't involve an extra network operation.
>>>
>>>
>>> I wouldn't read the last stripe unconditionally every time we lock
>>> the inode. There's no benefit at all on random writes (in fact it's
>>> worse) and a sequential write will issue the read anyway when
>>> needed. The only difference is a small delay for the first operation
>>> after a lock.
>>>
>>>
>>> Yes, perfect.
>>>
>>>
>>>
>>> What I would do is to keep the last stripe of every write (we could
>>> consider doing it per fd), even if it's not the last stripe of the
>>> file (to also optimize sequential rewrites).
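>>>
>>> A tiny sketch of that policy, with hypothetical names, just to pin it
>>> down: the cache is filled only from the write path, never eagerly when
>>> the lock is taken.
>>>
>>> /* Hypothetical policy sketch. */
>>> enum fill_source {
>>>     FILL_NONE,            /* lock acquired: read nothing               */
>>>     FILL_FROM_RMW_READ,   /* unaligned write: keep the data the         */
>>>                           /* read-modify-write had to fetch anyway      */
>>>     FILL_FROM_WRITE_TAIL  /* aligned or extending write: keep the tail  */
>>>                           /* of the data just written                   */
>>> };
>>>
>>> static enum fill_source on_lock(void)  { return FILL_NONE; }
>>>
>>> static enum fill_source on_write(int was_aligned)
>>> {
>>>     return was_aligned ? FILL_FROM_WRITE_TAIL : FILL_FROM_RMW_READ;
>>> }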
>>>
>>>
>>> Ah, good point. But if we remember it per fd, one fd's cached data can
>>> be overwritten on disk by another fd, so we also need cache
>>> invalidation.
>>>
>>
>> We only cache data if we have the inodelk, so all related fd's must be
>> from the same client, and we control all of its writes, so cache
>> invalidation in this case is pretty easy.
>>
>> There is still the possibility of two fd's from the same client
>> writing to the same region. To handle this we would need some range
>> checking on the writes, but all of this is local, so it's easy to control.
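>>
>> As a sketch (hypothetical helper), that local range check could be as
>> simple as invalidating any cached stripe that intersects an incoming
>> write from a different fd:
>>
>> #include <stdint.h>
>>
>> /* Hypothetical overlap test used to invalidate another fd's cached
>>  * stripes that intersect the incoming write's [off, off + len) range. */
>> static int ranges_overlap(uint64_t a_off, uint64_t a_len,
>>                           uint64_t b_off, uint64_t b_len)
>> {
>>     return a_off < b_off + b_len && b_off < a_off + a_len;
>> }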
>>
>> Anyway, this is probably not a common case, so we could start by caching
>> only the last stripe of the last write, ignoring the fd.
>>
>>> Maybe the implementation should consider this possibility.
>>> Yet to think about how to do this, but it is a good point. We should
>>> consider it.
>>>
>>
>> Maybe we could keep a list of cached stripes in the inode, sorted by
>> offset (if the maximum number of entries is small, we could keep the list
>> unsorted). Each fd should store the offset of its last write. Cached
>> stripes should have a ref counter to account for the case where two
>> fd's point to the same offset.
>>
>> When a new write arrives, we check the offset stored in the fd and see if
>> it corresponds to a sequential write. If so, we look at the inode list to
>> find the cached stripe, otherwise we can release the cached stripe.
>>
>> We can limit the number of cached entries and release the least recently
>> used when we reach some maximum.
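>>
>> A rough data-structure sketch of the above, with hypothetical names and
>> sizes:
>>
>> #include <stdint.h>
>>
>> #define STRIPE_SIZE 2048          /* 512 * 4 for a 4+2 volume              */
>> #define MAX_CACHED  8             /* evict least recently used above this  */
>>
>> struct cached_stripe {
>>     struct cached_stripe *next;   /* per-inode list, sorted by offset */
>>     uint64_t              offset; /* stripe-aligned offset            */
>>     int                   refs;   /* two fd's may share one stripe    */
>>     uint64_t              lru_tick;
>>     char                  data[STRIPE_SIZE];
>> };
>>
>> struct inode_cache {              /* stored in the inode context      */
>>     struct cached_stripe *stripes;
>>     int                   count;
>> };
>>
>> struct fd_cache_ctx {             /* stored in the fd context         */
>>     uint64_t last_write_end;      /* a write is sequential iff its    */
>>                                   /* offset equals last_write_end     */
>> };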
>>
>
> Yeah, this works :-).
> Ashish,
> Can all of this be implemented by 3.12?
>
>
>>
>>
>>>
>>>
>>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>>> split into multiple 128KB blocks that are sent in parallel and not
>>> necessarily processed in sequential order. This means that big
>>> block sizes won't benefit much from this optimization, since they
>>> will be seen as partially non-sequential writes. Anyway, the change
>>> won't hurt.
>>>
>>>
>>> In this case, as per the solution, we won't cache anything, right?
>>> Because we didn't request anything from the disk. We will only keep
>>> data in the cache if it is an unaligned write at the current EOF. At
>>> least that is what I had in mind.
>>>
>>
>> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
>> write is split into 8 blocks of 128KB, none of the writes will be aligned,
>> and they can be received in any order. Suppose the first write happens to
>> be at offset 128K + 1. We don't have anything cached, so we read the
>> needed stripes and cache the last one. Now the next write is at offset 1.
>> In this case we won't get any benefit from the previous write, since the
>> stripe we need is not cached. However, from the user's point of view the
>> writes are sequential.
>>
>> It won't hurt, but it won't get the full benefit of the new caching
>> mechanism.
>>
>> As a mitigating factor, we could consider extending the solution I've
>> explained above to allow caching multiple stripes per fd. A small number
>> like 8 would be enough.
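>>
>> For illustration, with a small per-fd window of cached stripes the
>> out-of-order 128KB fragments could still find the stripe they need; a
>> hypothetical lookup:
>>
>> #include <stdint.h>
>>
>> #define STRIPE_SIZE 2048
>> #define WINDOW      8             /* stripes cached per fd (assumed)  */
>>
>> struct stripe_window {
>>     uint64_t offsets[WINDOW];     /* stripe-aligned offsets of cached data */
>>     int      used;                /* number of valid entries          */
>> };
>>
>> /* Return the slot caching the stripe containing 'off', or -1 on a miss. */
>> static int window_lookup(const struct stripe_window *w, uint64_t off)
>> {
>>     uint64_t aligned = (off / STRIPE_SIZE) * STRIPE_SIZE;
>>
>>     for (int i = 0; i < w->used; i++)
>>         if (w->offsets[i] == aligned)
>>             return i;
>>     return -1;
>> }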
>>
>> Xavi
>>
>>
>>>
>>>
>>> Xavi
>>>
>>>
>>>
>>>
>>>
>>> ---
>>> Ashish
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>
>
> --
> Pranith
>
>
--
Pranith