[Gluster-devel] Disperse volume : Sequential Writes
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Jul 5 10:16:39 UTC 2017
On Tue, Jul 4, 2017 at 1:39 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> Hi Pranith,
>
> On 03/07/17 05:35, Pranith Kumar Karampuri wrote:
>
>> Ashish, Xavi,
>> I think it is better to implement this change as a separate
>> read-after-write caching xlator which we can load between EC and the client
>> xlator. That way EC will not get a lot more functionality than necessary,
>> and maybe this xlator can be reused somewhere else in the stack if possible.
>>
>
> While this seems a good way to separate functionalities, it has a big
> problem. If we add a caching xlator between ec and *all* of its subvolumes,
> it will only be able to cache encoded data. So, when ec needs the "cached"
> data, it will need to issue a request to each of its subvolumes and compute
> the decoded data before being able to use it, so we don't avoid the
> decoding overhead.
>
> Also, if we want to make the xlator generic, it will probably cache a lot
> more data than ec really needs, increasing the memory footprint considerably
> for no real benefit.
>
> Additionally, this new xlator will need to guarantee that the cached data
> is current, so it will need its own locking logic (that would be another
> copy&paste of the existing logic in one of the current xlators) which is
> slow and difficult to maintain, or it will need to intercept and reuse
> locking calls from parent xlators, which can be quite complex since we have
> multiple xlator levels where locks can be taken, not only ec.
>
> This is a relatively simple change to make inside ec, but a very complex
> change (IMO) if we want to do it as a stand-alone xlator and be generic
> enough to be reused and work safely in other places of the stack.
>
> If we want to separate functionalities I think we should create a new
> concept of xlator which is transversal to the "traditional" xlator stack.
>
> Current xlators are linear in the sense that each one operates only at one
> place (it can be moved by reconfiguration, but once instantiated, it always
> works at the same place) and passes data to the next one.
>
> A transversal xlator (or maybe a service xlator would be a better name) would
> be one not bound to any place of the stack, but one that could be used by all
> other xlators to implement some service, like caching, multithreading,
> locking, etc. These are features that many xlators need but cannot use easily
> (or efficiently) if they are implicitly implemented at some specific place of
> the stack outside their control.
>
> The transaction framework we have already talked about could be thought of as
> one of these service xlators. Multithreading could also benefit from this
> approach because xlators would have more control over which things can be
> processed by a background thread and which cannot. There are probably other
> features that could benefit from this approach as well.
>
> In the case of brick multiplexing, if some xlators are removed from each
> stack and loaded as global services, the memory footprint will most probably
> be lower and resource usage better optimized.
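As a purely illustrative aside, a "service" interface might look something like
the sketch below. None of these types or functions exist in GlusterFS today;
the names are hypothetical and only meant to show the idea of a service that
any xlator can look up and use, independently of its position in the stack.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical sketch only: a service registered globally, usable by any
     * xlator regardless of where it sits in the graph. Here the example
     * service is a cache keyed by gfid and offset. */
    typedef struct gf_service_ops {
            int  (*cache_put) (void *svc, const char *gfid, off_t offset,
                               const void *data, size_t size);
            int  (*cache_get) (void *svc, const char *gfid, off_t offset,
                               void *data, size_t size);
            void (*cache_invalidate) (void *svc, const char *gfid);
    } gf_service_ops_t;

    /* A regular xlator (ec, for example) would look the service up by name at
     * init time, instead of depending on a caching translator being stacked
     * at a fixed position above or below it. */
    gf_service_ops_t *gf_service_lookup (const char *name);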
>
I like the service xlator approach, but I don't think we have enough time
to make it operational in the short term. Let us go with implementing this
feature in EC for now. I didn't realize the extra cost of decoding when I
thought about the separation, so I guess we will stick with the original idea.
>
> Just an idea...
>
> Xavi
>
>
>> On Fri, Jun 16, 2017 at 4:19 PM, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>
>> I think it should be done, as we have agreement on the basic design.
>>
>> ------------------------------------------------------------------------
>> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
>> *Cc: *"Ashish Pandey" <aspandey at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
>> *Sent: *Friday, June 16, 2017 3:50:09 PM
>> *Subject: *Re: [Gluster-devel] Disperse volume : Sequential Writes
>>
>>
>>
>>
>> On Fri, Jun 16, 2017 at 3:12 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>> On 16/06/17 10:51, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Fri, Jun 16, 2017 at 12:02 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>
>> On 15/06/17 11:50, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Thu, Jun 15, 2017 at 11:51 AM, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>                Hi All,
>>
>>                We have been facing some issues with disperse (EC) volumes.
>>                We know that currently EC is not good for random IO, as it
>>                requires a READ-MODIFY-WRITE fop cycle whenever an offset and
>>                offset+length fall in the middle of a stripe.
>>
>>                Unfortunately, this can also happen with sequential writes.
>>                Consider an EC volume with configuration 4+2. The stripe size
>>                for this would be 512 * 4 = 2048, that is, 2048 bytes of user
>>                data stored in one stripe. Let's say 2048 + 512 = 2560 bytes
>>                are already written on this volume; 512 bytes would be in the
>>                second stripe.
>>                Now, if there is a sequential write at offset 2560 with size
>>                1 byte, we have to read the whole stripe, encode it with the
>>                1 byte and then write it back. The next write, at offset 2561
>>                with size 1 byte, will again READ-MODIFY-WRITE the whole
>>                stripe. This causes bad performance.
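To illustrate the arithmetic above, here is a minimal stand-alone C sketch (not
ec code) that checks whether a write needs a read-modify-write cycle, assuming
the 4+2 configuration with 512-byte fragments, i.e. 2048 bytes of user data per
stripe:

    #include <stdio.h>

    /* 4+2 disperse volume with 512-byte fragments: one stripe holds
     * 512 * 4 = 2048 bytes of user data. */
    #define STRIPE_SIZE 2048LL

    /* A write needs READ-MODIFY-WRITE when its start or its end is not on a
     * stripe boundary: the partial stripe(s) must be read, merged with the
     * new data, re-encoded and written back. */
    static int needs_rmw(long long offset, long long size)
    {
        return (offset % STRIPE_SIZE) != 0 ||
               ((offset + size) % STRIPE_SIZE) != 0;
    }

    int main(void)
    {
        /* The sequential 1-byte writes from the example above. */
        printf("write(2560, 1): rmw=%d\n", needs_rmw(2560, 1)); /* 1: reads stripe [2048, 4096) */
        printf("write(2561, 1): rmw=%d\n", needs_rmw(2561, 1)); /* 1: reads the same stripe again */
        printf("write(0, 2048): rmw=%d\n", needs_rmw(0, 2048)); /* 0: aligned, no read needed */
        return 0;
    }

(In real ec code a tail that lies beyond the current end of file would not
actually need to be read; the sketch ignores that detail.)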
>>
>>                There are some tools and scenarios where this kind of load
>>                occurs and users are not aware of it. Examples: fio and zip.
>>
>>                Solution:
>>                One possible solution to deal with this issue is to keep the
>>                last stripe in memory. This way we do not need to read it
>>                again, and we save a READ fop going over the network.
>>                Considering the above example, we have to keep the last 2048
>>                bytes (maximum) in memory per file. This should not be a big
>>                deal, as we already keep some data like xattrs and size info
>>                in memory and take decisions based on that.
>>
>>                Please provide your thoughts on this, and also any other
>>                solution you may have.
>>
>>
>>            Just adding more details: the stripe will be in memory only when
>>            the lock on the inode is active.
>>
>>
>> I think that's ok.
>>
>>            One thing we are yet to decide on is: do we want to read the
>>            stripe every time we get the lock, or just after an extending
>>            write is performed? I am thinking that keeping the stripe in
>>            memory just after an extending write is better, as it doesn't
>>            involve an extra network operation.
>>
>>
>>        I wouldn't read the last stripe unconditionally every time we lock
>>        the inode. There's no benefit at all on random writes (in fact it's
>>        worse) and a sequential write will issue the read anyway when needed.
>>        The only difference is a small delay for the first operation after a
>>        lock.
>>
>>
>> Yes, perfect.
>>
>>
>>
>>        What I would do is to keep the last stripe of every write (we can
>>        consider doing it per fd), even if it's not the last stripe of the
>>        file (to also optimize sequential rewrites).
>>
>>
>>        Ah! Good point. But if we remember it per fd, one fd's cached data
>>        can be overwritten on disk by another fd, so we also need to do
>>        cache invalidation.
>>
>>
>>        We only cache data if we have the inodelk, so all related fd's must
>>        be from the same client, and we control all its writes, so cache
>>        invalidation in this case is pretty easy.
>>
>>        There is the possibility of two fd's from the same client writing to
>>        the same region. To control this we would need some range checking
>>        on the writes, but all of this is local, so it's easy to handle.
>>
>>        Anyway, this is probably not a common case, so we could start by
>>        caching only the last stripe of the last write, ignoring the fd.
>>
>>            Maybe the implementation should consider this possibility. I am
>>            yet to think about how to do this, but it is a good point. We
>>            should consider it.
>>
>>
>>        Maybe we could keep a list of cached stripes sorted by offset in the
>>        inode (if the maximum number of entries is small, we could keep the
>>        list unsorted). Each fd should store the offset of its last write.
>>        Cached stripes should have a ref counter, just to account for the
>>        case where two fd's point to the same offset.
>>
>>        When a new write arrives, we check the offset stored in the fd and
>>        see if it corresponds to a sequential write. If so, we look at the
>>        inode list to find the cached stripe; otherwise we can release the
>>        cached stripe.
>>
>>        We can limit the number of cached entries and release the least
>>        recently used when we reach some maximum.
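A rough sketch in plain C (hypothetical names, not the actual ec structures) of
what this proposal implies: cached stripes kept in a small list attached to the
inode, a ref count per entry, and each fd remembering where its last write
ended:

    #include <stddef.h>
    #include <sys/types.h>

    #define STRIPE_SIZE 2048  /* user data per stripe for 4+2 with 512-byte fragments */
    #define MAX_CACHED  8     /* illustrative cap before evicting the least recently used */

    /* One decoded stripe kept in memory while the inodelk is held. */
    struct stripe_cache_entry {
            off_t offset;                    /* stripe-aligned offset of the cached data */
            char  data[STRIPE_SIZE];         /* decoded user data of that stripe */
            int   refs;                      /* two fd's writing at the same offset share one entry */
            struct stripe_cache_entry *next; /* small list hanging off the inode context */
    };

    /* Per-inode context: the list of cached stripes. */
    struct inode_stripe_cache {
            struct stripe_cache_entry *entries;
            int count;
    };

    /* Per-fd context: the offset right after the last write through this fd,
     * used to detect whether the next write is sequential. */
    struct fd_write_ctx {
            off_t next_expected_offset;
    };

    /* On a new write: if it continues where the fd left off, look up the
     * stripe containing 'offset' in the inode list and reuse it instead of
     * issuing a READ; otherwise the stripe previously cached for this fd can
     * be released (or evicted later when MAX_CACHED is reached). */
    static struct stripe_cache_entry *
    lookup_cached_stripe(struct inode_stripe_cache *cache, off_t offset)
    {
            off_t aligned = offset - (offset % STRIPE_SIZE);
            struct stripe_cache_entry *e;

            for (e = cache->entries; e != NULL; e = e->next)
                    if (e->offset == aligned)
                            return e;
            return NULL;
    }

Whether the entries hang off the inode (shared by all fd's of the client) or
off each fd only changes where the list lives; the lookup and eviction logic
stays the same.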
>>
>>
>> Yeah, this works :-).
>> Ashish,
>> Can all of this be implemented by 3.12?
>>
>>
>>
>>
>>
>>
>>        One thing I've observed is that a 'dd' with a block size of 1MB gets
>>        split into multiple 128KB blocks that are sent in parallel and not
>>        necessarily processed in sequential order. This means that big block
>>        sizes won't benefit much from this optimization, since they will be
>>        seen as partially non-sequential writes. Anyway, the change won't
>>        hurt.
>>
>>
>>            In this case, as per the solution, we won't cache anything,
>>            right? Because we didn't request anything from the disk. We will
>>            only keep data in the cache if it is an unaligned write at the
>>            current EOF. At least that is what I had in mind.
>>
>>
>>        Suppose we are writing multiple 1MB blocks starting at offset 1. If
>>        each write is split into 8 blocks of 128KB, none of the writes will
>>        be aligned, and they can be received in any order. Suppose that the
>>        first write happens to be at offset 128K + 1. We don't have anything
>>        cached, so we read the needed stripes and cache the last one. Now
>>        the next write is at offset 1. In this case we won't get any benefit
>>        from the previous write, since the stripe we need is not cached.
>>        However, from the user's point of view the write is sequential.
>>
>>        It won't hurt, but it won't get the full benefit of the new caching
>>        mechanism.
>>
>>        As a mitigating factor, we could consider extending the previous
>>        solution I've explained to allow caching multiple stripes per fd. A
>>        small number like 8 would be enough.
>>
>> Xavi
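For illustration, a small stand-alone C sketch (not gluster code) of how a 1MB
application write at offset 1 looks once it has been split into 128KB chunks,
and which 2048-byte stripes each chunk touches. Since no chunk starts or ends
on a stripe boundary, each one triggers a read-modify-write of its head and
tail stripes, and with out-of-order arrival a single cached "last stripe"
rarely matches the next chunk:

    #include <stdio.h>

    #define STRIPE_SIZE 2048LL
    #define CHUNK_SIZE  (128LL * 1024)   /* the 128KB splitting observed with dd */
    #define WRITE_SIZE  (1024LL * 1024)  /* one 1MB application-level write */

    int main(void)
    {
        long long base = 1;  /* the 1MB write starts at offset 1 */
        long long off;

        for (off = base; off < base + WRITE_SIZE; off += CHUNK_SIZE) {
            long long len = CHUNK_SIZE;

            if (off + len > base + WRITE_SIZE)
                len = base + WRITE_SIZE - off;
            /* Every chunk is unaligned, so its head and tail stripes need a
             * READ unless they happen to be cached already. */
            printf("chunk [%lld, %lld): head stripe %lld, tail stripe %lld\n",
                   off, off + len, off / STRIPE_SIZE,
                   (off + len - 1) / STRIPE_SIZE);
        }
        return 0;
    }

Note that chunk N's head stripe is chunk N-1's tail stripe, so in strict order
the cache would hit; it is only the parallel, reordered delivery that defeats a
single-entry cache, which is what caching a handful of stripes per fd mitigates.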
>>
>>
>>
>>
>> Xavi
>>
>>
>>
>>
>>
>> ---
>> Ashish
>>
>>
>>
>>                _______________________________________________
>>                Gluster-devel mailing list
>>                Gluster-devel at gluster.org
>>                http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>>
>> --
>> Pranith
>>
>>
>>
>>
>> --
>> Pranith
>>
>
>
--
Pranith