[Gluster-devel] regarding GF_CONTENT_KEY and dht2 - perf with small files

Thu Feb 4 10:58:57 UTC 2016

On Thu, Feb 04, 2016 at 11:34:04AM +0530, Shyam wrote:
> On 02/04/2016 09:38 AM, Vijay Bellur wrote:
> >On 02/03/2016 11:34 AM, Venky Shankar wrote:
> >>On Wed, Feb 03, 2016 at 09:24:06AM -0500, Jeff Darcy wrote:
> >>>>Problem is with workloads which know the files that need to be read
> >>>>without readdir, like hyperlinks (webserver), swift objects etc. These
> >>>>are two I know of which will have this problem, which can't be improved
> >>>>because we don't have metadata, data co-located. I have been trying to
> >>>>think of a solution for past few days. Nothing good is coming up :-/
> >>>
> >>>In those cases, caching (at the MDS) would certainly help a lot.  Some
> >>>variation of the compounding infrastructure under development for Samba
> >>>etc. might also apply, since this really is a compound operation.
> 
> Compounding in this case can help, but without the cache the read still has
> to go to the DS, and with such compounding the MDS, rather than the client,
> would reach out to the DS for the information. That is another possibility,
> depending on what we decide on as the cache mechanism.
> 
> >>
> >>When a client is done modifying a file, MDS would refresh its size, mtime
> >>attributes by fetching it from the DS. As part of this refresh, DS could
> >>additionally send back the content if the file size falls in range, with
> >>MDS persisting it, sending it back for subsequent lookup calls as it does
> >>now. The content (on MDS) can be zapped once the file size crosses the
> >>defined limit.
> 
> Venky, when you say persisting, I assume on disk, is that right?

Definitely on-disk.
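
A minimal sketch of the refresh scheme quoted above, to make the flow
concrete. Everything here is hypothetical: the class and method names, the
dict standing in for the on-disk store, and the INLINE_LIMIT value (chosen
only because 64k is the quick-read default mentioned in this thread). It is
not how dht2 implements it.

```python
from dataclasses import dataclass
from typing import Optional

INLINE_LIMIT = 64 * 1024  # hypothetical; mirrors the quick-read default


@dataclass
class InodeRecord:
    size: int = 0
    mtime: float = 0.0
    inline: Optional[bytes] = None  # content persisted alongside the inode


class DataServer:
    """Stand-in for the DS: holds the authoritative file content."""

    def __init__(self):
        self.files = {}  # gfid -> (content, mtime)

    def write(self, gfid, content, mtime):
        self.files[gfid] = (content, mtime)

    def stat_with_content(self, gfid, limit):
        content, mtime = self.files[gfid]
        # Piggyback the content on the attribute refresh for small files only.
        payload = content if len(content) <= limit else None
        return len(content), mtime, payload


class MetadataServer:
    """Stand-in for the MDS: a dict plays the role of the on-disk store."""

    def __init__(self, ds, limit=INLINE_LIMIT):
        self.ds = ds
        self.limit = limit
        self.records = {}  # gfid -> InodeRecord

    def refresh(self, gfid):
        # Called once a client is done modifying the file.
        size, mtime, payload = self.ds.stat_with_content(gfid, self.limit)
        # payload is None above the limit, which zaps any stale inline copy.
        self.records[gfid] = InodeRecord(size, mtime, payload)

    def lookup(self, gfid):
        # Subsequent lookups return attributes plus inline content, if any.
        return self.records[gfid]
```

Replacing the whole record on refresh keeps the sketch short; a real store
would more likely update the attributes and the inline payload separately.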

> 
> If so, then the MDS storage size requirements would increase (based on the
> amount of file data that needs to be stored). As of now it is only inodes,
> and, as we move to a db, a record. In this case we may have *fatter* MDS
> partitions. Any comments/thoughts on that?

The MDS storage requirement does go up considerably, because there would
normally be far fewer MDS nodes than DS nodes. So, yes, the MDS does become
fat, but it's important to keep data inline with its inode to boost small
file performance (at least when the file is not under modification).
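
A back-of-the-envelope way to see the effect, assuming small files spread
evenly across nodes. The function name is made up for illustration, and any
numbers fed to it are placeholders, not measurements from a real deployment.

```python
def inline_bytes_per_node(small_files, avg_size, nodes):
    """Bytes of small-file data each node holds under an even spread."""
    return small_files * avg_size / nodes


def mds_to_ds_ratio(mds_nodes, ds_nodes):
    """How much more small-file data one MDS holds than one DS."""
    return ds_nodes / mds_nodes
```

With, say, 2 MDS nodes in front of 20 DS nodes, each MDS would carry ten
times the small-file data that a single DS does, on top of its inode records.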

> 
> As with memory, I would assume some form of eviction of data from the MDS
> is a possibility, to control space utilization here.

Maybe. Using TTLs in a key-value store might be an option. But, IIRC, a TTL
can only be set on an entire record, not on parts of a record. We'd need
to think more about this anyway.
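
One possible way around the whole-record limitation, sketched under the
assumption that inline content can live in its own record next to the inode
record. TTLStore is a toy stand-in, not the API of any particular key-value
DB, and the key layout is invented for the example.

```python
import time


class TTLStore:
    """Toy key-value store where a TTL applies to a whole record (key)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires = None if ttl is None else self.clock() + ttl
        self.data[key] = (value, expires)

    def get(self, key):
        item = self.data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and self.clock() >= expires:
            del self.data[key]  # lazy eviction on access
            return None
        return value


def store_inode(store, gfid, attrs, content, content_ttl):
    # Attributes never expire; the inline content gets its own record so
    # that its TTL does not take the attributes down with it.
    store.put(("inode", gfid), attrs)
    store.put(("data", gfid), content, ttl=content_ttl)
```

The cost of this layout is a second read per lookup when the inline data is
wanted, which would need weighing against the eviction benefit.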

> 
> >>
> >
> >I like the idea. However, the memory implications of maintaining content
> >in MDS are something to watch out for. quick-read is interested in files
> >of size 64k by default and with a reasonable number of files in that
> >range, we might end up consuming significant memory with this scheme.
> 
> Vijay, I think what Venky states is to stash the file on the local storage
> and not in memory. If it was in memory then brick process restarts would
> nuke the cache, and either we need mechanisms to rebuild/warm the cache or
> just start caching afresh.
> 
> If we were caching in memory, then yes the concern is valid, and one
> possibility is  some form of LRU for the same, to keep memory consumption in
> check.

As stated earlier, it's a persistent cache which may or may not have a layer
of in-memory cache itself. I would leave all that to the key-value DB (when
we use one) as it most probably would be doing that.

> 
> Overall I would steer away from memory for this use case, and use the disk,
> as we do not know which files to cache (well in either case, but disk offers
> us more space to possibly punt on that issue). For files where the cache is
> missing and the file is small enough, either perform async read from the
> client (gaining some overlap time with the app) or just let it be, as we
> would get the open/read anyway, but would slow things down.

Yes. Async reads for files that are missing inline data with the inode and
that satisfy the size range requirement.
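
A rough sketch of that client-side behaviour, assuming lookup returns the
file size plus the inline content (or None when it is missing). The Client
class, the mds_lookup/ds_read callables and INLINE_LIMIT are all
hypothetical names used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

INLINE_LIMIT = 64 * 1024  # hypothetical size range for inline data


class Client:
    def __init__(self, mds_lookup, ds_read, limit=INLINE_LIMIT):
        self.mds_lookup = mds_lookup  # gfid -> (size, inline or None)
        self.ds_read = ds_read        # gfid -> bytes
        self.limit = limit
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.cache = {}    # gfid -> inline content from lookup
        self.pending = {}  # gfid -> Future for an in-flight DS read

    def lookup(self, gfid):
        size, inline = self.mds_lookup(gfid)
        if inline is not None:
            self.cache[gfid] = inline
        elif size <= self.limit:
            # Inline data missing but the file is small enough: start the
            # DS read now so it overlaps with whatever the app does next.
            self.pending[gfid] = self.pool.submit(self.ds_read, gfid)
        return size

    def read(self, gfid):
        if gfid in self.cache:
            return self.cache[gfid]
        future = self.pending.pop(gfid, None)
        if future is not None:
            return future.result()  # waits only if the read is unfinished
        return self.ds_read(gfid)   # large file: ordinary open/read path
```

Files above the limit never trigger the background read, so they follow the
ordinary open/read path as before.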

> 
> >
> >-Vijay