[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Fri Jun 16 10:20:09 UTC 2017
On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey
>> <aspandey at redhat.com> wrote:
>>
>> Hi All,
>>
>> We have been facing some issues with disperse (EC) volumes.
>> We know that EC is currently not good for random IO, as it requires a
>> READ-MODIFY-WRITE fop cycle whenever an offset or offset+length falls
>> in the middle of a stripe.
>>
>> Unfortunately, this can also happen with sequential writes.
>> Consider an EC volume with configuration 4+2. The stripe size for
>> this would be 512 * 4 = 2048, i.e. 2048 bytes of user data are
>> stored in one stripe.
>> Let's say 2048 + 512 = 2560 bytes have already been written to this
>> volume, so 512 bytes sit in the second stripe.
>> Now, if a sequential write arrives at offset 2560 with a size of
>> 1 byte, we have to read the whole stripe, encode it together with
>> that 1 byte, and then write it back.
>> The next write, at offset 2561 with a size of 1 byte, will again
>> READ-MODIFY-WRITE the whole stripe. This is causing bad performance.
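>>
>> To make the arithmetic concrete, here is a minimal sketch (hypothetical
>> helper, not actual ec translator code) of why both writes land in the
>> same stripe:
>>
>>     #include <stdint.h>
>>
>>     /* For a 4+2 volume: stripe_size = 512 * 4 = 2048 bytes of user data. */
>>     static inline uint64_t stripe_start(uint64_t offset, uint64_t stripe_size)
>>     {
>>         return (offset / stripe_size) * stripe_size;
>>     }
>>
>>     /* stripe_start(2560, 2048) == 2048 and stripe_start(2561, 2048) == 2048,
>>      * and neither write is stripe aligned, so each one forces a read,
>>      * re-encode and write-back of the full stripe covering [2048, 4096). */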
>>
>> There are tools and scenarios that generate this kind of load without
>> the user being aware of it.
>> Examples: fio and zip.
>>
>> Solution:
>> One possible solution to this issue is to keep the last stripe in
>> memory.
>> That way we need not read it again, and we save a READ fop going
>> over the network.
>> Considering the above example, we would have to keep at most the
>> last 2048 bytes in memory per file. This should not be a big deal,
>> as we already keep some data like xattrs and size info in memory
>> and take decisions based on it.
>>
>> Please share your thoughts on this, and any other solution you may
>> have.
>>
>>
>> Just adding more details.
>> The stripe will be kept in memory only while the lock on the inode is
>> active.
>>
>>
>> I think that's ok.
>>
>> One thing we are yet to decide is whether we want to read the stripe
>> every time we get the lock, or only keep it after an extending write
>> is performed.
>> I am thinking that keeping the stripe in memory just after an
>> extending write is better, as it doesn't involve an extra network
>> operation.
>>
>>
>> I wouldn't read the last stripe unconditionally every time we lock
>> the inode. There's no benefit at all for random writes (in fact it's
>> worse), and a sequential write will issue the read anyway when
>> needed. The only difference is a small delay for the first operation
>> after a lock.
>>
>>
>> Yes, perfect.
>>
>>
>>
>> What I would do is keep the last stripe of every write (we could
>> consider doing it per fd), even if it's not the last stripe of the
>> file (to also optimize sequential rewrites).
>>
>>
>> Ah, good point. But if we remember it per fd, one fd's cached data
>> can be overwritten on disk by another fd, so we would also need to do
>> cache invalidation.
>>
>
> We only cache data while we hold the inodelk, so all related fd's must
> be from the same client, and we control all of its writes, so cache
> invalidation in this case is pretty easy.
>
> There is still the possibility of two fd's from the same client writing
> to the same region. To handle this we would need some range checking on
> the writes, but all of this is local, so it's easy to control.
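>
> A minimal sketch of that local range check (hypothetical helper name,
> not actual ec code):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     /* True if [off1, off1+len1) and [off2, off2+len2) overlap. */
>     static inline bool ec_ranges_overlap(uint64_t off1, uint64_t len1,
>                                          uint64_t off2, uint64_t len2)
>     {
>         return (off1 < off2 + len2) && (off2 < off1 + len1);
>     }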
>
> Anyway, this is probably not a common case, so we could start by caching
> only the last stripe of the last write, ignoring the fd.
>
>> Maybe the implementation should consider this possibility.
>> Yet to think about how to do this, but it is a good point. We should
>> consider this.
>>
>
> Maybe we could keep, in the inode, a list of cached stripes sorted by
> offset (if the maximum number of entries is small, we could leave the
> list unsorted). Each fd should store the offset of its last write.
> Cached stripes should have a ref counter, just to account for the case
> where two fd's point to the same offset.
>
> When a new write arrives, we check the offset stored in the fd and see if
> it corresponds to a sequential write. If so, we look at the inode list to
> find the cached stripe, otherwise we can release the cached stripe.
>
> We can limit the number of cached entries and release the least recently
> used when we reach some maximum.
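>
> A rough sketch of the structures this could translate to (hypothetical
> names, my own assumptions, not actual ec translator code):
>
>     #include <stdint.h>
>
>     /* One cached, decoded stripe, linked into a per-inode LRU list. */
>     struct ec_stripe_entry {
>         struct ec_stripe_entry *prev;
>         struct ec_stripe_entry *next;
>         uint64_t                offset;  /* stripe-aligned offset */
>         uint32_t                refs;    /* fd's referencing this entry */
>         uint8_t                 data[];  /* stripe_size bytes */
>     };
>
>     /* Per-fd context: where the previous write on this fd ended, so a
>      * new write at exactly this offset is treated as sequential and
>      * looks up the cached stripe covering its unaligned head. */
>     struct ec_fd_ctx {
>         uint64_t next_expected_offset;
>     };
>
> On a write, if the write offset equals next_expected_offset we search
> the inode's list for the cached stripe; otherwise we can drop the fd's
> reference. When the list exceeds some maximum, the least recently used
> entry with refs == 0 is released.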
>
Yeah, this works :-).
Ashish,
Can all of this be implemented by 3.12?
>
>
>>
>>
>> One thing I've observed is that a 'dd' with a block size of 1MB gets
>> split into multiple 128KB blocks that are sent in parallel and not
>> necessarily processed in sequential order. This means that big block
>> sizes won't benefit much from this optimization, since they will be
>> seen as partially non-sequential writes. Anyway, the change won't
>> hurt.
>>
>>
>> In this case, as per the solution, we won't cache anything, right?
>> Because we didn't request anything from the disk. We would only keep
>> data in the cache for an unaligned write at the current EOF. At least
>> that is what I had in mind.
>>
>
> Suppose we are writing multiple 1MB blocks starting at offset 1. If each
> write is split into 8 blocks of 128KB, none of the writes will be
> aligned, and they can be received in any order. Suppose that the first
> write happens to be at offset 128K + 1. We don't have anything cached,
> so we read the needed stripes and cache the last one. Now the next write
> is at offset 1. In this case we won't get any benefit from the previous
> write, since the stripe we need is not cached. However, from the user's
> point of view the write is sequential.
>
> It won't hurt, but it won't get the full benefit of the new caching
> mechanism.
>
> As a mitigating factor, we could consider extending the solution I've
> explained above to allow caching multiple stripes per fd. A small number
> like 8 would be enough.
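>
> To put rough numbers on that (my own arithmetic, assuming the 128KB
> split observed above and the 2048-byte stripe of a 4+2 volume):
>
>     1MB user write         = 8 chunks of 128KB
>     128KB mod 2048         = 0, so every chunk boundary is itself
>                              stripe aligned
>     start offset 1         => each chunk begins and ends inside a
>                              stripe shared with a neighbouring chunk
>
> So each 1MB user-level write touches on the order of 8 partial
> "boundary" stripes, one per 128KB chunk, which is why a per-fd cache of
> about 8 stripes should cover most of this pattern even when the chunks
> complete out of order.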
>
> Xavi
>
>
--
Pranith