[Gluster-devel] memory cache for initgroups

Fri Nov 7 08:59:32 UTC 2014

On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> On Thu, 6 Nov 2014 22:02:29 +0100
> Niels de Vos <ndevos at redhat.com> wrote:
> 
> > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> > > >>>
> > > >>>>Hi,
> > > >>>>
> > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
> > > >>>>but there are multiple people involved from multiple timezones,
> > > >>>>so I think a mailing list thread would be better trackable.
> > > >>>>
> > > >>>>Can we add another memory cache file to SSSD, that would track
> > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
> > > >>>>initgroups is a bit different operation than getpw{uid,nam} and
> > > >>>>getgr{gid,nam} but what if the new memcache was only used by
> > > >>>>the NSS responder and at the same time invalidated when
> > > >>>>initgroups is initiated by the PAM responder to ensure the
> > > >>>>memcache is up-to-date?
> > > >>>
> > > >>>Can you describe the use case before jumping into a proposed
> > > >>>solution ?
> > > >>
> > > >>Many getgrouplist() or initgroups() calls in a quick succession.
> > > >>One user is GlusterFS -- I'm not quite sure what the reason is
> > > >>there, maybe Vijay can elaborate.
> > > >
> > > 
> > > GlusterFS server invokes getgrouplist() to identify gids associated
> > > with a user on whose behalf an RPC request has been sent over the
> > > wire. There is a gid caching layer in GlusterFS and getgrouplist()
> > > does get called only if there is a gid cache miss. In the worst
> > > case, getgrouplist() can be invoked for every rpc request that
> > > GlusterFS receives and that seems to be the case in a deployment
> > > where we found that sssd was being busy. I am not certain about the
> > > sequence of operations that can cause the cache to be missed.
> > > 
> > > Adding Niels who is more familiar with the gid resolution & caching
> > > features in GlusterFS.
> > 
> > Just to add some background information on the getgrouplist().
> > GlusterFS uses several processes that can call getgrouplist():
> > - NFS-server, a single process per system
> > - brick, a process per exported filesystem/directory, potentially
> > several per system
> > 
> >   [Here, a Gluster environment has many systems (vm/physical). Each
> >    system normally runs the NFS-server, and a number of brick
> >    processes. The layout of the volume is important, but it is very
> >    common to have one or more distributed volumes that use multiple
> >    bricks on the same system (and many other systems).]
> > 
> > The need for resolving the groups of a user comes in when users belong
> > to many groups. The RPC protocols can not carry a huge list of groups,
> > so the resolving can be done on the server side when the protocol hits
> > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
> > 
> > Upon using a Gluster volume, certain operations are sent to all the
> > bricks (i.e. some directory related operations). I can imagine that
> > a network share which is used by many users triggers many
> > getgrouplist() calls in different brick processes at the (almost)
> > same time.
> > 
> > For reference, the usage of getgrouplist() in the brick process can be
> > found here:
> > -
> > https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > 
> > The gid_resolve() function gets called in case the brick process should
> > resolve the groups (and ignore the list of groups from the protocol).
> > It uses the gidcache functions from a private library:
> > -
> > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > -
> > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > 
> > The default time for the gidcache to expire is 2 seconds. Users should
> > be able to configure this to 30 seconds (or anything else) with:
> > 
> >     # gluster volume set <VOLUME> server.gid-timeout 30
> > 
> > 
> > I think this should explain the use-case sufficiently, but let me know
> > if there are any remaining questions. It might well be possible to
> > make this code more sssd friendly. I'm sure that we as Gluster
> > developers are open to any suggestions.
> 
> 
> TBH this looks a little bit strange, other filesystems (as well as the
> kernel) create a credentials token when a user first authenticates and
> keep these credentials attached to the user session for the duration.
> Why does GlusterFS keep hammering the system requesting the same
> information again and again ?

The GlusterFS protocol itself is very much stateless, similar to NFSv3.
We need all the groups of the user on the server-side (brick) to allow
the backing filesystem (mostly XFS) to perform the permission checking. In
the current GlusterFS protocol, there is no user authentication. (Well,
there has been work done on adding support for SSL, maybe that could be
used for tracking sessions on a per-client, not user, basis.)

Just for clarity, a GlusterFS client (like a fuse-mount, or the
samba/vfs_glusterfs module) is used by many different users. The client
builds the connection to the volume. After that, all users with access
to the fuse-mount or samba-share are using the same client connection.

By default the client sends a list of groups in each RPC request, and
the server-side trusts the list the client provides. However, for
environments where these lists are too small to hold all the groups,
there is an option to do the group resolving on the server side. This is
the "server.manage-gids" volume option, which acts very much like the
"rpc.mountd --manage-gids" functionality for NFS.
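
As a rough illustration of that choice, here is a minimal Python sketch.
It is not the GlusterFS implementation: the names effective_gids and
resolve_with_nss are hypothetical, and the manage_gids argument only
stands in for the "server.manage-gids" volume option.

```python
import os
import pwd


def resolve_with_nss(uid):
    """Resolve groups through NSS (and so through sssd, if configured)."""
    pw = pwd.getpwuid(uid)
    return os.getgrouplist(pw.pw_name, pw.pw_gid)


def effective_gids(uid, client_gids, manage_gids, resolve=resolve_with_nss):
    """Pick the group list used for permission checks on the brick.

    manage_gids stands in for the "server.manage-gids" volume option;
    this function and its parameters are assumptions for illustration.
    """
    if manage_gids:
        # Ignore the (possibly truncated) list carried in the RPC request
        # and resolve the groups on the server instead.
        return list(resolve(uid))
    # Default behaviour: trust the list the client put in the request.
    return list(client_gids)
```

With manage_gids enabled and no caching in front of it, every request
would end up in resolve(), which is where the load on sssd comes from.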

> Keep in mind that the use of getgrouplist() is an inherently costly
> operation. Even adding caches, the system cannot cache for long because
> it needs to return updated results eventually. Only the application
> knows when a user session terminates and/or the list needs to be
> refreshed, so "caching" for this type of operation should be done
> mostly on the application side.

I assume that your "application side" here is the brick process that
runs on the same system as sssd. As mentioned above, the brick processes
do cache the result of getgrouplist(). It may well be possible that the
default expiry of 2 seconds is too short for many environments. But
users can change that timeout easily with the "server.gid-timeout"
volume option.
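
To make the effect of the timeout concrete, here is a minimal Python
sketch of a per-user group cache with a fixed expiry. It is only an
illustration under assumptions, not the gidcache code from libglusterfs:
the GidCache class, its lookup() method and the injectable resolver and
clock are all hypothetical; only the 2 second default comes from the
description above.

```python
import os
import pwd
import time


class GidCache:
    """Per-user group-list cache with a fixed expiry (hypothetical helper)."""

    def __init__(self, timeout=2.0, resolver=None, clock=time.monotonic):
        self.timeout = timeout          # seconds, like "server.gid-timeout"
        self.resolver = resolver or self._getgrouplist
        self.clock = clock
        self._entries = {}              # uid -> (expires_at, gids)
        self.misses = 0                 # number of calls to the resolver

    @staticmethod
    def _getgrouplist(uid):
        # Goes through NSS, so this is the call that reaches sssd.
        pw = pwd.getpwuid(uid)
        return os.getgrouplist(pw.pw_name, pw.pw_gid)

    def lookup(self, uid):
        now = self.clock()
        entry = self._entries.get(uid)
        if entry is not None and now < entry[0]:
            return entry[1]             # cache hit, no resolver call
        # Missing or expired entry: resolve again and store the result.
        self.misses += 1
        gids = list(self.resolver(uid))
        self._entries[uid] = (now + self.timeout, gids)
        return gids
```

In this sketch an active user costs roughly one resolver call per timeout
period per process, so going from 2 to 30 seconds reduces the calls for
that user by about a factor of 15, at the price of group changes taking
up to 30 seconds to become visible.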

From my understanding of this thread, we (the Gluster Community) have
two things to do:

1. Clearly document side-effects that can be caused by enabling the
   "server.manage-gids" option, and suggest increasing the
   "server.gid-timeout" value (maybe change the default?).

2. Think about improving the GlusterFS protocol(s) and introduce some
   kind of credentials token that is linked with the groups of a user.
   Token expiry should invalidate the group-cache. One option would be
   to use Kerberos like NFS (RPCSEC_GSS).

Does this all make sense to others too? I'm adding gluster-devel@ to CC
so that others can chime in and this topic won't be forgotten.

Thanks,
Niels