[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Mon Jul 3 03:35:54 UTC 2017
Ashish, Xavi,
I think it is better to implement this change as a separate
read-after-write caching xlator that we can load between the EC and client
xlators. That way EC does not take on more functionality than necessary,
and maybe this xlator can be reused elsewhere in the stack.
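
To make the idea concrete, here is a minimal, self-contained sketch of the
caching logic such an xlator would implement (plain C; the names, block size
and structures are illustrative only, not the actual GlusterFS xlator API):

/* Illustrative sketch only -- not the real xlator API.
 * A read-after-write cache keyed by inode: it remembers the block that
 * contains the tail of the last write and serves a subsequent overlapping
 * read locally, so the head of EC's READ-MODIFY-WRITE need not go over
 * the wire. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 512                 /* assumed per-brick fragment size */

struct raw_cache {                     /* one per inode while the lock is held */
    uint64_t offset;                   /* offset of the cached block */
    size_t   len;                      /* valid bytes in buf */
    char     buf[BLOCK_SIZE];
};

/* On every write passing through: remember the block holding its tail. */
static void cache_on_write(struct raw_cache *c, uint64_t off,
                           const char *data, size_t len)
{
    uint64_t tail_block = ((off + len - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    uint64_t start      = (off > tail_block) ? off : tail_block;

    c->offset = tail_block;
    c->len    = (off + len) - tail_block;
    memcpy(c->buf + (start - tail_block), data + (start - off),
           (off + len) - start);
}

/* On a read: serve it from the cache when it falls entirely inside the
 * cached block, otherwise report a miss and let the caller hit the network. */
static int cache_read(struct raw_cache *c, uint64_t off, char *out, size_t len)
{
    if (off >= c->offset && off + len <= c->offset + c->len) {
        memcpy(out, c->buf + (off - c->offset), len);
        return 0;                      /* hit */
    }
    return -1;                         /* miss */
}

int main(void)
{
    struct raw_cache c = {0};
    char data[600], head[88];

    memset(data, 'x', sizeof(data));
    cache_on_write(&c, 0, data, 600);  /* write ends inside block [512, 1024) */

    if (cache_read(&c, 512, head, sizeof(head)) == 0)
        printf("head of the next unaligned write served from cache\n");
    return 0;
}
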
On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspandey at redhat.com> wrote:
>
> I think it should be done, as we have agreement on the basic design.
>
> ------------------------------
> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
> *Cc: *"Ashish Pandey" <aspandey at redhat.com>, "Gluster Devel" <
> gluster-devel at gluster.org>
> *Sent: *Friday, June 16, 2017 3:50:09 PM
> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>
>
>
>
> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>>
>>>
>>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
>>> <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote:
>>>
>>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
>>> <aspandey at redhat.com <mailto:aspandey at redhat.com>
>>> <mailto:aspandey at redhat.com <mailto:aspandey at redhat.com>>>
>>> wrote:
>>>
>>> Hi All,
>>>
>>> We have been facing some issues with disperse (EC) volumes.
>>> We know that EC is currently not good for random IO, as it requires a
>>> READ-MODIFY-WRITE fop cycle whenever offset or offset+length falls in
>>> the middle of a stripe.
>>>
>>> Unfortunately, it can also happen with sequential writes.
>>> Consider an EC volume with configuration 4+2. The stripe size for
>>> this would be 512 * 4 = 2048, that is, 2048 bytes of user data
>>> stored in one stripe.
>>> Let's say 2048 + 512 = 2560 bytes are already written on this
>>> volume; 512 bytes would be in the second stripe.
>>> Now, if a sequential write arrives at offset 2560 with a size of
>>> 1 byte, we have to read the whole stripe, encode it with the 1
>>> new byte and then write it back.
>>> The next write, at offset 2561 with a size of 1 byte, will again
>>> READ-MODIFY-WRITE the whole stripe. This causes bad
>>> performance.
>>>
>>> There are tools and scenarios that generate this kind of load
>>> without users being aware of it.
>>> Examples: fio and zip
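>>>
>>> For illustration, the alignment arithmetic behind this example can be
>>> written as a small self-contained sketch (not EC code; STRIPE and the
>>> helper name are assumptions):
>>>
>>> #include <stdio.h>
>>> #include <stdint.h>
>>>
>>> #define STRIPE 2048   /* 512 * 4 bytes of user data for a 4+2 volume */
>>>
>>> /* Print the stripe-aligned region a write touches and whether it
>>>  * forces a READ-MODIFY-WRITE cycle. */
>>> static void rmw_span(uint64_t off, uint64_t len)
>>> {
>>>     uint64_t start = (off / STRIPE) * STRIPE;                      /* round down */
>>>     uint64_t end   = ((off + len + STRIPE - 1) / STRIPE) * STRIPE; /* round up */
>>>     int      rmw   = (off % STRIPE) != 0 || ((off + len) % STRIPE) != 0;
>>>
>>>     printf("write off=%llu len=%llu -> stripes [%llu, %llu), RMW: %s\n",
>>>            (unsigned long long)off, (unsigned long long)len,
>>>            (unsigned long long)start, (unsigned long long)end,
>>>            rmw ? "yes" : "no");
>>> }
>>>
>>> int main(void)
>>> {
>>>     rmw_span(2560, 1);    /* 1-byte write inside the second stripe */
>>>     rmw_span(2561, 1);    /* next byte: the same stripe is read again */
>>>     rmw_span(2048, 2048); /* fully aligned write: no read needed */
>>>     return 0;
>>> }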
>>>
>>> Solution:
>>> One possible solution to this issue is to keep the last stripe
>>> in memory.
>>> This way, we do not need to read it again and we save a READ fop
>>> going over the network.
>>> Considering the above example, we would keep at most the last 2048
>>> bytes in memory per file. This should not be a big
>>> deal, as we already keep some data, like xattrs and size info,
>>> in memory and take decisions based on it.
>>>
>>> Please share your thoughts on this, and also any other
>>> solution you may have.
>>>
>>>
>>> Just adding more details.
>>> The stripe will be in memory only while the lock on the inode is
>>> active.
>>>
>>>
>>> I think that's ok.
>>>
>>> One
>>> thing we are yet to decide on is: do we want to read the stripe
>>> every time we get the lock, or only after an extending write is
>>> performed?
>>> I think keeping the stripe in memory just after an extending write
>>> is better, as it doesn't involve an extra network operation.
>>>
>>>
>>> I wouldn't read the last stripe unconditionally every time we lock
>>> the inode. There's no benefit at all on random writes (in fact it's
>>> worse) and a sequential write will issue the read anyway when
>>> needed. The only difference is a small delay for the first operation
>>> after a lock.
>>>
>>>
>>> Yes, perfect.
>>>
>>>
>>>
>>> What I would do is to keep the last stripe of every write (we could
>>> consider doing it per fd), even if it's not the last stripe of the
>>> file (to also optimize sequential rewrites).
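>>>
>>> A tiny sketch of that policy, with hypothetical names, just to pin it
>>> down: the cache is filled only from the write path, never eagerly when
>>> the lock is taken.
>>>
>>> /* Hypothetical policy sketch. */
>>> enum fill_source {
>>>     FILL_NONE,            /* lock acquired: read nothing               */
>>>     FILL_FROM_RMW_READ,   /* unaligned write: keep the data the         */
>>>                           /* read-modify-write had to fetch anyway      */
>>>     FILL_FROM_WRITE_TAIL  /* aligned or extending write: keep the tail  */
>>>                           /* of the data just written                   */
>>> };
>>>
>>> static enum fill_source on_lock(void)  { return FILL_NONE; }
>>>
>>> static enum fill_source on_write(int was_aligned)
>>> {
>>>     return was_aligned ? FILL_FROM_WRITE_TAIL : FILL_FROM_RMW_READ;
>>> }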
>>>
>>>
>>> Ah, good point. But if we remember it per fd, one fd's cached data can
>>> be overwritten on disk by another fd, so we also need cache
>>> invalidation.
>>>
>>
>> We only cache data if we have the inodelk, so all related fd's must be
>> from the same client, and we control all of its writes, so cache
>> invalidation in this case is pretty easy.
>>
>> There is still the possibility of two fd's from the same client
>> writing to the same region. To handle this we would need some range
>> checking on the writes, but all of this is local, so it's easy to control.
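>>
>> As a sketch (hypothetical helper), that local range check could be as
>> simple as invalidating any cached stripe that intersects an incoming
>> write from a different fd:
>>
>> #include <stdint.h>
>>
>> /* Hypothetical overlap test used to invalidate another fd's cached
>>  * stripes that intersect the incoming write's [off, off + len) range. */
>> static int ranges_overlap(uint64_t a_off, uint64_t a_len,
>>                           uint64_t b_off, uint64_t b_len)
>> {
>>     return a_off < b_off + b_len && b_off < a_off + a_len;
>> }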
>>
>> Anyway, this is probably not a common case, so we could start by caching
>> only the last stripe of the last write, ignoring the fd.
>>
>>> Maybe the implementation should consider this possibility.
>>> Yet to think about how to do this, but it is a good point. We should
>>> consider it.
>>>
>>
>> Maybe we could keep a list of cached stripes in the inode, sorted by
>> offset (if the maximum number of entries is small, we could keep the list
>> unsorted). Each fd should store the offset of its last write. Cached
>> stripes should have a ref counter to account for the case where two
>> fd's point to the same offset.
>>
>> When a new write arrives, we check the offset stored in the fd and see if
>> it corresponds to a sequential write. If so, we look at the inode list to
>> find the cached stripe, otherwise we can release the cached stripe.
>>
>> We can limit the number of cached entries and release the least recently
>> used when we reach some maximum.
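>>
>> A rough data-structure sketch of the above, with hypothetical names and
>> sizes:
>>
>> #include <stdint.h>
>>
>> #define STRIPE_SIZE 2048          /* 512 * 4 for a 4+2 volume              */
>> #define MAX_CACHED  8             /* evict least recently used above this  */
>>
>> struct cached_stripe {
>>     struct cached_stripe *next;   /* per-inode list, sorted by offset */
>>     uint64_t              offset; /* stripe-aligned offset            */
>>     int                   refs;   /* two fd's may share one stripe    */
>>     uint64_t              lru_tick;
>>     char                  data[STRIPE_SIZE];
>> };
>>
>> struct inode_cache {              /* stored in the inode context      */
>>     struct cached_stripe *stripes;
>>     int                   count;
>> };
>>
>> struct fd_cache_ctx {             /* stored in the fd context         */
>>     uint64_t last_write_end;      /* a write is sequential iff its    */
>>                                   /* offset equals last_write_end     */
>> };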
>>
>
> Yeah, this works :-).
> Ashish,
> Can all of this be implemented by 3.12?
>
>
>>
>>
>>>
>>>
>>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>>> split into multiple 128KB blocks that are sent in parallel and not
>>> necessarily processed in sequential order. This means that big
>>> block sizes won't benefit much from this optimization, since they
>>> will be seen as partially non-sequential writes. Anyway, the change
>>> won't hurt.
>>>
>>>
>>> In this case, as per the solution, we won't cache anything, right?
>>> Because we didn't request anything from the disk. We will only keep
>>> data in the cache if it is an unaligned write at the current EOF. At
>>> least that is what I had in mind.
>>>
>>
>> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
>> write is split into 8 blocks of 128KB, none of the writes will be aligned,
>> and they can be received in any order. Suppose the first write happens to
>> be at offset 128K + 1. We don't have anything cached, so we read the
>> needed stripes and cache the last one. Now the next write is at offset 1.
>> In this case we won't get any benefit from the previous write, since the
>> stripe we need is not cached. However, from the user's point of view the
>> writes are sequential.
>>
>> It won't hurt, but it won't get the full benefit of the new caching
>> mechanism.
>>
>> As a mitigating factor, we could consider extending the solution I've
>> explained above to allow caching multiple stripes per fd. A small number
>> like 8 would be enough.
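>>
>> For illustration, with a small per-fd window of cached stripes the
>> out-of-order 128KB fragments could still find the stripe they need; a
>> hypothetical lookup:
>>
>> #include <stdint.h>
>>
>> #define STRIPE_SIZE 2048
>> #define WINDOW      8             /* stripes cached per fd (assumed)  */
>>
>> struct stripe_window {
>>     uint64_t offsets[WINDOW];     /* stripe-aligned offsets of cached data */
>>     int      used;                /* number of valid entries          */
>> };
>>
>> /* Return the slot caching the stripe containing 'off', or -1 on a miss. */
>> static int window_lookup(const struct stripe_window *w, uint64_t off)
>> {
>>     uint64_t aligned = (off / STRIPE_SIZE) * STRIPE_SIZE;
>>
>>     for (int i = 0; i < w->used; i++)
>>         if (w->offsets[i] == aligned)
>>             return i;
>>     return -1;
>> }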
>>
>> Xavi
>>
>>
>>>
>>>
>>> Xavi
>>>
>>>
>>>
>>>
>>>
>>> ---
>>> Ashish
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>
>
> --
> Pranith
>
>
--
Pranith