[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Jul 5 10:16:39 UTC 2017
On Tue, Jul 4, 2017 at 1:39 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> Hi Pranith,
>
> On 03/07/17 05:35, Pranith Kumar Karampuri wrote:
>
>> Ashish, Xavi,
>> I think it is better to implement this change as a separate
>> read-after-write caching xlator which we can load between EC and the client
>> xlator. That way EC will not get a lot more functionality than necessary,
>> and maybe this xlator can be reused somewhere else in the stack if possible.
>>
>
> While this seems a good way to separate functionalities, it has a big
> problem. If we add a caching xlator between ec and *all* of its subvolumes,
> it will only be able to cache encoded data. So, when ec needs the "cached"
> data, it will need to issue a request to each of its subvolumes and compute
> the decoded data before being able to use it, so we don't avoid the
> decoding overhead.
>
> Also, if we want to make the xlator generic, it will probably cache a lot
> more data than ec really needs, increasing the memory footprint considerably
> for no real benefit.
>
> Additionally, this new xlator will need to guarantee that the cached data
> is current, so it will need its own locking logic (that would be another
> copy&paste of the existing logic in one of the current xlators) which is
> slow and difficult to maintain, or it will need to intercept and reuse
> locking calls from parent xlators, which can be quite complex since we have
> multiple xlator levels where locks can be taken, not only ec.
>
> This is a relatively simple change to make inside ec, but a very complex
> change (IMO) if we want to do it as a stand-alone xlator and be generic
> enough to be reused and work safely in other places of the stack.
>
> If we want to separate functionalities I think we should create a new
> concept of xlator which is transversal to the "traditional" xlator stack.
>
> Current xlators are linear in the sense that each one operates only at one
> place (it can be moved by reconfiguration, but once instantiated, it always
> works at the same place) and passes data to the next one.
>
> A transversal xlator (or maybe a service xlator would be a better name) would
> be one not bound to any place of the stack, but one that could be used by all
> other xlators to implement some service, like caching, multithreading,
> locking, etc. These are features that many xlators need but cannot use easily
> (or efficiently) if they are implicitly implemented at some specific place of
> the stack outside their control.
>
> The transaction framework we have already talked about could be thought of as
> one of these service xlators. Multithreading could also benefit from this
> approach because xlators would have more control over which things can be
> processed by a background thread and which cannot. There are probably other
> features that could benefit from this approach as well.
>
> In the case of brick multiplexing, if some xlators are removed from each
> stack and loaded as global services, the memory footprint will most probably
> be lower and resource usage better optimized.
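As a purely illustrative aside, a "service" interface might look something like
the sketch below. None of these types or functions exist in GlusterFS today;
the names are hypothetical and only meant to show the idea of a service that
any xlator can look up and use, independently of its position in the stack.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical sketch only: a service registered globally, usable by any
     * xlator regardless of where it sits in the graph. Here the example
     * service is a cache keyed by gfid and offset. */
    typedef struct gf_service_ops {
            int  (*cache_put) (void *svc, const char *gfid, off_t offset,
                               const void *data, size_t size);
            int  (*cache_get) (void *svc, const char *gfid, off_t offset,
                               void *data, size_t size);
            void (*cache_invalidate) (void *svc, const char *gfid);
    } gf_service_ops_t;

    /* A regular xlator (ec, for example) would look the service up by name at
     * init time, instead of depending on a caching translator being stacked
     * at a fixed position above or below it. */
    gf_service_ops_t *gf_service_lookup (const char *name);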
>
I like the service xlator approach, but I don't think we have enough time
to make it operational in the short term. Let us go with implementing this
feature in EC for now. I didn't realize the extra cost of decoding when I
thought about the separation, so I guess we will stick with the original idea.
>
> Just an idea...
>
> Xavi
>
>
>> On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>
>> I think it should be done, as we have agreement on the basic design.
>>
>> ------------------------------------------------------------------------
>> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
>> *Cc: *"Ashish Pandey" <aspandey at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
>> *Sent: *Friday, June 16, 2017 3:50:09 PM
>> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>>
>>
>>
>>
>> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>                Hi All,
>>
>>                We have been facing some issues with disperse (EC) volumes.
>>                We know that currently EC is not good for random IO, as it
>>                requires a READ-MODIFY-WRITE fop cycle whenever an offset and
>>                offset+length fall in the middle of a stripe.
>>
>>                Unfortunately, this can also happen with sequential writes.
>>                Consider an EC volume with configuration 4+2. The stripe size
>>                for this would be 512 * 4 = 2048, that is, 2048 bytes of user
>>                data stored in one stripe. Let's say 2048 + 512 = 2560 bytes
>>                are already written on this volume; 512 bytes would be in the
>>                second stripe.
>>                Now, if there is a sequential write at offset 2560 with size
>>                1 byte, we have to read the whole stripe, encode it with the
>>                1 byte and then write it back. The next write, at offset 2561
>>                with size 1 byte, will again READ-MODIFY-WRITE the whole
>>                stripe. This causes bad performance.
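To illustrate the arithmetic above, here is a minimal stand-alone C sketch (not
ec code) that checks whether a write needs a read-modify-write cycle, assuming
the 4+2 configuration with 512-byte fragments, i.e. 2048 bytes of user data per
stripe:

    #include <stdio.h>

    /* 4+2 disperse volume with 512-byte fragments: one stripe holds
     * 512 * 4 = 2048 bytes of user data. */
    #define STRIPE_SIZE 2048LL

    /* A write needs READ-MODIFY-WRITE when its start or its end is not on a
     * stripe boundary: the partial stripe(s) must be read, merged with the
     * new data, re-encoded and written back. */
    static int needs_rmw(long long offset, long long size)
    {
        return (offset % STRIPE_SIZE) != 0 ||
               ((offset + size) % STRIPE_SIZE) != 0;
    }

    int main(void)
    {
        /* The sequential 1-byte writes from the example above. */
        printf("write(2560, 1): rmw=%d\n", needs_rmw(2560, 1)); /* 1: reads stripe [2048, 4096) */
        printf("write(2561, 1): rmw=%d\n", needs_rmw(2561, 1)); /* 1: reads the same stripe again */
        printf("write(0, 2048): rmw=%d\n", needs_rmw(0, 2048)); /* 0: aligned, no read needed */
        return 0;
    }

(In real ec code a tail that lies beyond the current end of file would not
actually need to be read; the sketch ignores that detail.)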
>>
>>                There are some tools and scenarios where this kind of load
>>                occurs and users are not aware of it. Examples: fio and zip.
>>
>>                Solution:
>>                One possible solution to deal with this issue is to keep the
>>                last stripe in memory. This way we do not need to read it
>>                again, and we save a READ fop going over the network.
>>                Considering the above example, we have to keep the last 2048
>>                bytes (maximum) in memory per file. This should not be a big
>>                deal, as we already keep some data like xattrs and size info
>>                in memory and take decisions based on that.
>>
>>                Please provide your thoughts on this, and also any other
>>                solution you may have.
>>
>>
>>            Just adding more details: the stripe will be in memory only when
>>            the lock on the inode is active.
>>
>>
>> I think that's ok.
>>
>>            One thing we are yet to decide on is: do we want to read the
>>            stripe every time we get the lock, or just after an extending
>>            write is performed? I am thinking that keeping the stripe in
>>            memory just after an extending write is better, as it doesn't
>>            involve an extra network operation.
>>
>>
>>        I wouldn't read the last stripe unconditionally every time we lock
>>        the inode. There's no benefit at all on random writes (in fact it's
>>        worse) and a sequential write will issue the read anyway when needed.
>>        The only difference is a small delay for the first operation after a
>>        lock.
>>
>>
>> Yes, perfect.
>>
>>
>>
>>        What I would do is to keep the last stripe of every write (we can
>>        consider doing it per fd), even if it's not the last stripe of the
>>        file (to also optimize sequential rewrites).
>>
>>
>>        Ah! Good point. But if we remember it per fd, one fd's cached data
>>        can be overwritten on disk by another fd, so we also need to do
>>        cache invalidation.
>>
>>
>>        We only cache data if we have the inodelk, so all related fd's must
>>        be from the same client, and we control all its writes, so cache
>>        invalidation in this case is pretty easy.
>>
>>        There is the possibility of two fd's from the same client writing to
>>        the same region. To control this we would need some range checking
>>        on the writes, but all of this is local, so it's easy to handle.
>>
>>        Anyway, this is probably not a common case, so we could start by
>>        caching only the last stripe of the last write, ignoring the fd.
>>
>>            Maybe the implementation should consider this possibility. I am
>>            yet to think about how to do this, but it is a good point. We
>>            should consider it.
>>
>>
>>        Maybe we could keep a list of cached stripes sorted by offset in the
>>        inode (if the maximum number of entries is small, we could keep the
>>        list unsorted). Each fd should store the offset of its last write.
>>        Cached stripes should have a ref counter, just to account for the
>>        case where two fd's point to the same offset.
>>
>>        When a new write arrives, we check the offset stored in the fd and
>>        see if it corresponds to a sequential write. If so, we look at the
>>        inode list to find the cached stripe; otherwise we can release the
>>        cached stripe.
>>
>>        We can limit the number of cached entries and release the least
>>        recently used when we reach some maximum.
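A rough sketch in plain C (hypothetical names, not the actual ec structures) of
what this proposal implies: cached stripes kept in a small list attached to the
inode, a ref count per entry, and each fd remembering where its last write
ended:

    #include <stddef.h>
    #include <sys/types.h>

    #define STRIPE_SIZE 2048  /* user data per stripe for 4+2 with 512-byte fragments */
    #define MAX_CACHED  8     /* illustrative cap before evicting the least recently used */

    /* One decoded stripe kept in memory while the inodelk is held. */
    struct stripe_cache_entry {
            off_t offset;                    /* stripe-aligned offset of the cached data */
            char  data[STRIPE_SIZE];         /* decoded user data of that stripe */
            int   refs;                      /* two fd's writing at the same offset share one entry */
            struct stripe_cache_entry *next; /* small list hanging off the inode context */
    };

    /* Per-inode context: the list of cached stripes. */
    struct inode_stripe_cache {
            struct stripe_cache_entry *entries;
            int count;
    };

    /* Per-fd context: the offset right after the last write through this fd,
     * used to detect whether the next write is sequential. */
    struct fd_write_ctx {
            off_t next_expected_offset;
    };

    /* On a new write: if it continues where the fd left off, look up the
     * stripe containing 'offset' in the inode list and reuse it instead of
     * issuing a READ; otherwise the stripe previously cached for this fd can
     * be released (or evicted later when MAX_CACHED is reached). */
    static struct stripe_cache_entry *
    lookup_cached_stripe(struct inode_stripe_cache *cache, off_t offset)
    {
            off_t aligned = offset - (offset % STRIPE_SIZE);
            struct stripe_cache_entry *e;

            for (e = cache->entries; e != NULL; e = e->next)
                    if (e->offset == aligned)
                            return e;
            return NULL;
    }

Whether the entries hang off the inode (shared by all fd's of the client) or
off each fd only changes where the list lives; the lookup and eviction logic
stays the same.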
>>
>>
>> Yeah, this works :-).
>> Ashish,
>> Can all of this be implemented by 3.12?
>>
>>
>>
>>
>>
>>
>>        One thing I've observed is that a 'dd' with a block size of 1MB gets
>>        split into multiple 128KB blocks that are sent in parallel and not
>>        necessarily processed in sequential order. This means that big block
>>        sizes won't benefit much from this optimization, since they will be
>>        seen as partially non-sequential writes. Anyway, the change won't
>>        hurt.
>>
>>
>>            In this case, as per the solution, we won't cache anything,
>>            right? Because we didn't request anything from the disk. We will
>>            only keep data in the cache if it is an unaligned write at the
>>            current EOF. At least that is what I had in mind.
>>
>>
>>        Suppose we are writing multiple 1MB blocks starting at offset 1. If
>>        each write is split into 8 blocks of 128KB, none of the writes will
>>        be aligned, and they can be received in any order. Suppose that the
>>        first write happens to be at offset 128K + 1. We don't have anything
>>        cached, so we read the needed stripes and cache the last one. Now
>>        the next write is at offset 1. In this case we won't get any benefit
>>        from the previous write, since the stripe we need is not cached.
>>        However, from the user's point of view the write is sequential.
>>
>>        It won't hurt, but it won't get the full benefit of the new caching
>>        mechanism.
>>
>>        As a mitigating factor, we could consider extending the previous
>>        solution I've explained to allow caching multiple stripes per fd. A
>>        small number like 8 would be enough.
>>
>> Xavi
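For illustration, a small stand-alone C sketch (not gluster code) of how a 1MB
application write at offset 1 looks once it has been split into 128KB chunks,
and which 2048-byte stripes each chunk touches. Since no chunk starts or ends
on a stripe boundary, each one triggers a read-modify-write of its head and
tail stripes, and with out-of-order arrival a single cached "last stripe"
rarely matches the next chunk:

    #include <stdio.h>

    #define STRIPE_SIZE 2048LL
    #define CHUNK_SIZE  (128LL * 1024)   /* the 128KB splitting observed with dd */
    #define WRITE_SIZE  (1024LL * 1024)  /* one 1MB application-level write */

    int main(void)
    {
        long long base = 1;  /* the 1MB write starts at offset 1 */
        long long off;

        for (off = base; off < base + WRITE_SIZE; off += CHUNK_SIZE) {
            long long len = CHUNK_SIZE;

            if (off + len > base + WRITE_SIZE)
                len = base + WRITE_SIZE - off;
            /* Every chunk is unaligned, so its head and tail stripes need a
             * READ unless they happen to be cached already. */
            printf("chunk [%lld, %lld): head stripe %lld, tail stripe %lld\n",
                   off, off + len, off / STRIPE_SIZE,
                   (off + len - 1) / STRIPE_SIZE);
        }
        return 0;
    }

Note that chunk N's head stripe is chunk N-1's tail stripe, so in strict order
the cache would hit; it is only the parallel, reordered delivery that defeats a
single-entry cache, which is what caching a handful of stripes per fd mitigates.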
>>
>>
>>
>>
>> Xavi
>>
>>
>>
>>
>>
>> ---
>> Ashish
>>
>>
>>
>>                _______________________________________________
>>                Gluster-devel mailing list
>>                Gluster-devel at gluster.org
>>                http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>> --
>> Pranith
>>
>
>
--
Pranith