[Gluster-devel] memory cache for initgroups

Jakub Hrozek jhrozek at redhat.com
Fri Nov 7 09:13:03 UTC 2014


On Fri, Nov 07, 2014 at 09:59:32AM +0100, Niels de Vos wrote:
> On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > On Thu, 6 Nov 2014 22:02:29 +0100
> > Niels de Vos <ndevos at redhat.com> wrote:
> > 
> > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> > > > >>>
> > > > >>>>Hi,
> > > > >>>>
> > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
> > > > >>>>but there are multiple people involved from multiple timezones,
> > > > >>>>so I think a mailing list thread would be easier to track.
> > > > >>>>
> > > > >>>>Can we add another memory cache file to SSSD, that would track
> > > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
> > > > >>>>initgroups is a bit different operation than getpw{uid,nam} and
> > > > >>>>getgr{gid,nam} but what if the new memcache was only used by
> > > > >>>>the NSS responder and at the same time invalidated when
> > > > >>>>initgroups is initiated by the PAM responder to ensure the
> > > > >>>>memcache is up-to-date?
> > > > >>>
> > > > >>>Can you describe the use case before jumping into a proposed
> > > > >>>solution?
> > > > >>
> > > > >>Many getgrouplist() or initgroups() calls in a quick succession.
> > > > >>One user is GlusterFS -- I'm not quite sure what the reason is
> > > > >>there, maybe Vijay can elaborate.
> > > > >
> > > > 
> > > > The GlusterFS server invokes getgrouplist() to identify the gids
> > > > associated with the user on whose behalf an rpc request has been
> > > > sent over the wire. There is a gid caching layer in GlusterFS, and
> > > > getgrouplist() is only called on a gid cache miss. In the worst
> > > > case, getgrouplist() can be invoked for every rpc request that
> > > > GlusterFS receives, and that seems to be the case in a deployment
> > > > where we found sssd being kept busy. I am not certain about the
> > > > sequence of operations that causes the cache to be missed.
> > > > 
> > > > Adding Niels who is more familiar with the gid resolution & caching
> > > > features in GlusterFS.
> > > 
> > > Just to add some background information on the getgrouplist().
> > > GlusterFS uses several processes that can call getgrouplist():
> > > - NFS-server, a single process per system
> > > - brick, a process per exported filesystem/directory, potentially
> > > several per system
> > > 
> > >   [Here, a Gluster environment has many systems (vm/physical). Each
> > >    system normally runs the NFS-server, and a number of brick
> > > processes. The layout of the volume is important, but it is very
> > > common to have one or more distributed volumes that use multiple
> > > bricks on the same system (and many other systems).]
> > > 
> > > The need for resolving the groups of a user comes in when users belong
> > > to many groups. The RPC protocols cannot carry a huge list of groups,
> > > so the resolving can be done on the server side when the protocol hits
> > > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
> > > 
> > > When using a Gluster volume, certain operations are sent to all the
> > > bricks (e.g. some directory-related operations). I can imagine that
> > > a network share which is used by many users triggers many
> > > getgrouplist() calls in different brick processes at (almost) the
> > > same time.
> > > 
> > > For reference, the usage of getgrouplist() in the brick process can be
> > > found here:
> > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > > 
> > > The gid_resolve() function gets called when the brick process should
> > > resolve the groups (and ignore the list of groups from the protocol).
> > > It uses the gidcache functions from a private library:
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > > 
> > > The default expiry time for the gidcache is 2 seconds. Users can
> > > configure this to 30 seconds (or any other value) with:
> > > 
> > >     # gluster volume set <VOLUME> server.gid-timeout 30
> > > 
> > > 
> > > I think this should explain the use-case sufficiently, but let me know
> > > if there are any remaining questions. It might well be possible to
> > > make this code more sssd-friendly. I'm sure that we as Gluster
> > > developers are open to any suggestions.
> > 
> > 
> > TBH this looks a little bit strange; other filesystems (as well as the
> > kernel) create a credentials token when a user first authenticates and
> > keep these credentials attached to the user session for the duration.
> > Why does GlusterFS keep hammering the system, requesting the same
> > information again and again?
> 
> The GlusterFS protocol itself is very much stateless, similar to NFSv3.
> We need all the groups of the user on the server-side (brick) to let
> the backing filesystem (mostly XFS) perform the permission checking. In
> the current GlusterFS protocol, there is no user authentication. (Well,
> there has been work done on adding support for SSL, maybe that could be
> used for tracking sessions on a per-client, not user, basis.)
> 
> Just for clarity, a GlusterFS client (like a fuse-mount, or the
> samba/vfs_glusterfs module) is used by many different users. The client
> builds the connection to the volume. After that, all users with access
> to the fuse-mount or samba-share are using the same client connection.
> 
> By default the client sends a list of groups in each RPC request, and
> the server-side trusts the list the client provides. However, for
> environments where these lists are too small to hold all the groups,
> there is an option to do the group resolving on the server side. This is
> the "server.manage-gids" volume option, which acts very much like the
> "rpc.mountd --manage-gids" functionality for NFS.
> 
> > Keep in mind that the use of getgrouplist() is an inherently costly
> > operation. Even adding caches, the system cannot cache for long because
> > it needs to return updated results eventually. Only the application
> > knows when a user session terminates and/or the list needs to be
> > refreshed, so "caching" for this type of operation should be done
> > mostly on the application side.
> 
> I assume that your "application side" here is the brick process that
> runs on the same system as sssd. As mentioned above, the brick processes
> do cache the result of getgrouplist(). It may well be possible that the
> default expiry of 2 seconds is too short for many environments. But
> users can change that timeout easily with the "server.gid-timeout"
> volume option.

I guess that might be a viable option to work around the problem for the
user who initially reported it, but it also doesn't align with what I
saw in the logs: the sssd_nss logs showed 4000 initgroups requests over
two minutes from roughly 10 users.

> 
> From my understanding of this thread, we (the Gluster Community) have
> two things to do:
> 
> 1. Clearly document side-effects that can be caused by enabling the
>    "server.manage-gids" option, and suggest increasing the
>    "server.gid-timeout" value (maybe change the default?).
> 
> 2. Think about improving the GlusterFS protocol(s) and introducing some
>    kind of credentials token that is linked with the groups of a user.
>    Token expiry should invalidate the group-cache. One option would be
>    to use Kerberos like NFS (RPCSEC_GSS).
> 
> 
> Does this all make sense to others too? I'm adding gluster-devel@ to CC
> so that others can chime in and this topic won't be forgotten.
> 
> Thanks,
> Niels

And on the SSSD side, we need to think about an initgroups cache. So far
I filed ticket https://fedorahosted.org/sssd/ticket/2485 listing the two
options Simo outlined earlier.

GlusterFS is not the only project that requested faster initgroups
caching; Alexander's slapi-nis would also benefit from the new cache.
(Although with slapi-nis we also have a somewhat conflicting RFE to stop
using the NSS interfaces and talk to SSSD directly, but that's something
for us to solve.)

