[Gluster-devel] md-cache improvements

Michael Adam obnox at samba.org
Tue Aug 16 23:22:51 UTC 2016


Hi all,

On 2016-08-15 at 22:39 -0400, Vijay Bellur wrote:
> Hi Poornima, Dan -
> 
> Let us have a hangout/bluejeans session this week to discuss the planned
> md-cache improvements, proposed timelines and sort out open questions if
> any.

Because the initial mail creates the impression that this is
a topic that people are merely discussing, let me point out
that it has actually moved way beyond that stage already:

Poornima has been working hard on these cache improvements
since late 2015 at least. (And she has been desperately looking
for review and support since at least springtime...) See all her
patches that have finally been merged into master recently
(e.g. http://review.gluster.org/#/c/12951/ for an old one
that has just been merged)
and all the patches that she still has up for review
(e.g. http://review.gluster.org/#/c/15002/ for a big one).

These changes were mainly motivated by Samba workloads,
since the chatty, metadata-heavy SMB protocol suffers most
notably from the lack of proper caching of this metadata.
The good news is that the work recently started getting more
attention and we are seeing very, very promising performance
test results!
Full functional and regression testing is also underway.

Discussing the state of affairs in a real call
could be very useful indeed. Sometimes this can be
less awkward than using the list...

> Would 11:00 UTC on Wednesday work for everyone in the To: list?

Not on the To: list myself, but it would work for me... :-)
Although I have to admit it may really be very short notice for
some...

And since Poornima has driven the project thus far, mainly
supported by Rajesh J and R.Talur from the Gluster side for long
stretches of time, afaict, I think these three should be present
at a bare minimum.

Thanks - Michael


> On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote:
> > 
> > My comments inline.
> > 
> > Regards,
> > Poornima
> > 
> > ----- Original Message -----
> > > From: "Dan Lambright" <dlambrig at redhat.com>
> > > To: "Gluster Devel" <gluster-devel at gluster.org>
> > > Sent: Wednesday, August 10, 2016 10:35:58 PM
> > > Subject: [Gluster-devel] md-cache improvements
> > > 
> > > 
> > > There have been recurring discussions within the gluster community about building
> > > on the existing support for md-cache and upcalls to help performance for small-
> > > file workloads. In certain cases, "lookup amplification" dominates data
> > > transfers, i.e. the cumulative round-trip times of multiple LOOKUPs from the
> > > client offset the benefits of faster backend storage.
> > > 
> > > To tackle this problem, one suggestion is to more aggressively utilize
> > > md-cache to cache inodes on the client than is currently done. The inodes
> > > would be cached until they are invalidated by the server.
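To make the proposal concrete, here is a rough sketch of that caching model in Python, with made-up names (the real md-cache is a C xlator): metadata is served from the client cache until a server upcall invalidates the entry.

```python
class ClientMDCache:
    """Client caches inode metadata on LOOKUP and keeps serving it from
    cache until the server sends an upcall invalidation for that inode."""

    def __init__(self, fetch_from_server):
        self.fetch = fetch_from_server  # network round trip to the brick
        self.cache = {}                 # inode -> cached stat

    def lookup(self, inode):
        if inode in self.cache:
            return self.cache[inode]    # served locally, no round trip
        stat = self.fetch(inode)
        self.cache[inode] = stat
        return stat

    def on_upcall_invalidate(self, inode):
        # Server-driven invalidation: next lookup goes back to the server.
        self.cache.pop(inode, None)
```

This is what cuts the "lookup amplification": repeated LOOKUPs on a hot inode cost one round trip instead of many.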
> > > 
> > > Several gluster development engineers within the DHT, NFS, and Samba teams
> > > have been involved with related efforts, which have been underway for some
> > > time now. At this juncture, comments are requested from gluster developers.
> > > 
> > > (1) .. help call out where additional upcalls would be needed to invalidate
> > > stale client cache entries (in particular, need feedback from DHT/AFR
> > > areas),
> > > 
> > > (2) .. identify failure cases, when we cannot trust the contents of md-cache,
> > > e.g. when an upcall may have been dropped by the network
> > 
> > Yes, this needs to be handled.
> > It can happen only when there is a one-way disconnect, where the server cannot
> > reach the client and the notification fails. We can retry the notification until
> > the cache expiry time is reached.
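The retry-until-expiry idea could look roughly like this (a sketch with hypothetical names, not the actual Gluster upcall code); past the deadline there is no point retrying, because the client's cache entry will have timed out on its own:

```python
import time

def notify_with_retry(send_notify, deadline, clock=time.monotonic,
                      backoff=1.0, sleep=time.sleep):
    """Retry an upcall notification until it succeeds or until the moment
    (`deadline`, on `clock`'s timeline) when the client's cached entry
    would have expired anyway."""
    while clock() < deadline:
        if send_notify():          # True = client acknowledged the upcall
            return True
        # Sleep with exponential backoff, but never past the deadline.
        sleep(min(backoff, max(0.0, deadline - clock())))
        backoff *= 2
    return False                   # stale entry expires by itself
```

Injecting `clock` and `sleep` keeps the sketch testable without real waiting.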
> > 
> > > 
> > > (3) .. point out additional improvements which md-cache needs. For example,
> > > it cannot be allowed to grow unbounded.
> > 
> > This is being worked on, and will be targeted for 3.9.
> > 
> > > 
> > > Dan
> > > 
> > > ----- Original Message -----
> > > > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > > > 
> > > > List of areas where we need invalidation notification:
> > > > 1. Any changes to xattrs used by xlators to store metadata (like dht layout
> > > > xattr, afr xattrs etc).
> > 
> > Currently, md-cache will negotiate (using IPC) with the brick a list of xattrs
> > that it needs invalidations for. Other xlators can add the xattrs they are interested
> > in to the IPC. But these xlators then need to manage their own caching and process
> > the invalidation requests themselves, as md-cache sits above all cluster xlators.
> > Reference: http://review.gluster.org/#/c/15002/
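The server side of that negotiation might be sketched like this (hypothetical names; the real mechanism is the IPC fop in the patch above): each client registers the xattrs it cares about, and only changes to a registered xattr trigger an upcall to that client.

```python
class InvalidationFilter:
    """Server-side sketch: only xattrs a client registered interest in
    (e.g. via an IPC fop at mount time) trigger upcall invalidations."""

    def __init__(self):
        self.interested = {}   # client_id -> set of xattr names

    def register(self, client_id, xattrs):
        # md-cache (plus any other xlators) sends its combined list once.
        self.interested.setdefault(client_id, set()).update(xattrs)

    def clients_to_notify(self, changed_xattr):
        # On an xattr change, notify only the clients that asked for it.
        return {cid for cid, names in self.interested.items()
                if changed_xattr in names}
```

This keeps upcall traffic proportional to what clients actually cache.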
> > 
> > > > 2. Scenarios where individual xlator feels like it needs a lookup. For
> > > > example failed directory creation on non-hashed subvol in dht during mkdir.
> > > > Though dht succeeds mkdir, it would be better to not cache this inode as a
> > > > subsequent lookup will heal the directory and make things better.
> > 
> > For this, these xlators can set an indicator in the dict of
> > the fop cbk, telling md-cache not to cache. This should be fairly simple to implement.
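In Python terms the md-cache side of that indicator could look like this (the key name is made up; the real xlators would agree on a dict key):

```python
# Hypothetical dict key; the actual name would be agreed between xlators.
NO_CACHE_KEY = "glusterfs.md-cache.no-cache"

def mdcache_store(cache, inode, stat, xdata):
    """md-cache cbk sketch: skip caching when a lower xlator (e.g. DHT
    after a partially failed mkdir) flagged the reply as not cacheable,
    so the next lookup reaches the server and can heal the directory."""
    if xdata.get(NO_CACHE_KEY):
        cache.pop(inode, None)   # also drop any stale entry we had
        return
    cache[inode] = stat
```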
> > 
> > > > 3. removing of files
> > 
> > When an unlink is issued from the mount point, the cache is invalidated.
> > 
> > > > 4. writev on brick (to invalidate read cache on client)
> > 
> > writev on brick from any other client will invalidate the metadata cache on all
> > the other clients.
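A brick-side sketch of that behaviour (invented names, only to illustrate the flow): the brick remembers which clients have cached an inode's metadata and, on a writev from one of them, sends upcalls to all the others.

```python
class WriteInvalidation:
    """Brick-side sketch: track which clients cached an inode's metadata
    and, on writev from one client, notify all the others."""

    def __init__(self):
        self.cachers = {}   # inode -> set of client ids with cached mdata

    def on_lookup(self, inode, client):
        # A served lookup means this client now caches the metadata.
        self.cachers.setdefault(inode, set()).add(client)

    def on_writev(self, inode, writer):
        notify = self.cachers.get(inode, set()) - {writer}
        # The writer refreshes its own cache from the writev cbk.
        self.cachers[inode] = {writer}
        return notify            # clients that receive an upcall
```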
> > 
> > > > 
> > > > Other questions:
> > > > 5. Does md-cache has cache management? like lru or an upper limit for
> > > > cache.
> > 
> > Currently md-cache doesn't have any cache management; we will be targeting this
> > for 3.9.
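The planned bound would presumably be something LRU-shaped; a minimal sketch (the size cap is a hypothetical tunable, not an existing md-cache option):

```python
from collections import OrderedDict

class BoundedMDCache:
    """Sketch of cache management: an LRU bound so the metadata cache
    cannot grow without limit."""

    def __init__(self, max_entries=65536):
        self.max_entries = max_entries
        self.entries = OrderedDict()          # inode -> cached stat

    def put(self, inode, stat):
        self.entries[inode] = stat
        self.entries.move_to_end(inode)       # mark most recently used
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used

    def get(self, inode):
        if inode not in self.entries:
            return None
        self.entries.move_to_end(inode)       # refresh recency on hit
        return self.entries[inode]
```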
> > 
> > > > 6. Network disconnects and invalidating cache. When a network disconnect
> > > > happens we need to invalidate cache for inodes present on that brick as we
> > > > might be missing some notifications. Current approach of purging cache of
> > > > all inodes might not be optimal as it might rollback benefits of caching.
> > > > Also, please note that network disconnects are not rare events.
> > 
> > Network disconnects are handled to a minimal extent: any brick going down will
> > cause the whole of the cache to be invalidated. Invalidating only the list of
> > inodes that belong to that particular brick will need support from the
> > underlying cluster xlators.
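The finer-grained scheme would amount to tagging each cache entry with its brick, roughly like this sketch (hypothetical structure; the hard part, as noted, is getting the brick id per inode from the cluster xlators):

```python
class BrickAwareCache:
    """Sketch: tag each cached inode with the brick it lives on, so a
    disconnect purges only that brick's entries instead of everything."""

    def __init__(self):
        self.entries = {}   # inode -> (brick_id, cached stat)

    def put(self, inode, brick_id, stat):
        self.entries[inode] = (brick_id, stat)

    def on_brick_down(self, brick_id):
        # We may have missed upcalls from this brick: drop its entries.
        stale = [i for i, (b, _) in self.entries.items() if b == brick_id]
        for inode in stale:
            del self.entries[inode]
        return stale        # inodes that must be looked up again
```

Entries on healthy bricks survive, which preserves most of the caching benefit across the (not rare) disconnect events.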
> > 
> > > > 
> > > > regards,
> > > > Raghavendra
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > > 
> > 
> 