[Gluster-devel] memory cache for initgroups

Niels de Vos ndevos at redhat.com
Fri Nov 7 10:57:27 UTC 2014


On Fri, Nov 07, 2014 at 10:13:03AM +0100, Jakub Hrozek wrote:
> On Fri, Nov 07, 2014 at 09:59:32AM +0100, Niels de Vos wrote:
> > On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > > On Thu, 6 Nov 2014 22:02:29 +0100
> > > Niels de Vos <ndevos at redhat.com> wrote:
> > > 
> > > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> > > > > >>>
> > > > > >>>>Hi,
> > > > > >>>>
> > > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
> > > > > >>>>but there are multiple people involved from multiple timezones,
> > > > > >>>>so I think a mailing list thread would be easier to track.
> > > > > >>>>
> > > > > >>>>Can we add another memory cache file to SSSD that would track
> > > > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
> > > > > >>>>initgroups is a somewhat different operation from getpw{uid,nam}
> > > > > >>>>and getgr{gid,nam}, but what if the new memcache were only used
> > > > > >>>>by the NSS responder and at the same time invalidated whenever
> > > > > >>>>initgroups is initiated by the PAM responder, to ensure the
> > > > > >>>>memcache stays up-to-date?
> > > > > >>>
> > > > > >>>Can you describe the use case before jumping into a proposed
> > > > > >>>solution?
> > > > > >>
> > > > > >>Many getgrouplist() or initgroups() calls in quick succession.
> > > > > >>One user is GlusterFS -- I'm not quite sure what the reason is
> > > > > >>there, maybe Vijay can elaborate.
> > > > > >
> > > > > 
> > > > > GlusterFS server invokes getgrouplist() to identify the gids associated
> > > > > with a user on whose behalf an rpc request has been sent over the
> > > > > wire. There is a gid caching layer in GlusterFS and getgrouplist()
> > > > > is called only when there is a gid cache miss. In the worst
> > > > > case, getgrouplist() can be invoked for every rpc request that
> > > > > GlusterFS receives, and that seems to be the case in a deployment
> > > > > where we found that sssd was kept busy. I am not certain about the
> > > > > sequence of operations that can cause the cache misses.
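> > > > > 
> > > > > For illustration, such a lookup boils down to glibc's getgrouplist(3);
> > > > > a minimal sketch (not actual GlusterFS code, the function name is made
> > > > > up and error handling is omitted):
> > > > > 
> > > > >     #include <sys/types.h>
> > > > >     #include <grp.h>
> > > > >     #include <stdio.h>
> > > > >     #include <stdlib.h>
> > > > > 
> > > > >     /* resolve all gids of "user", growing the buffer on demand */
> > > > >     static int resolve_groups(const char *user, gid_t gid)
> > > > >     {
> > > > >             int ngroups = 16;
> > > > >             int i;
> > > > >             gid_t *groups = malloc(ngroups * sizeof(gid_t));
> > > > > 
> > > > >             if (getgrouplist(user, gid, groups, &ngroups) == -1) {
> > > > >                     /* ngroups now holds the required size, retry */
> > > > >                     groups = realloc(groups, ngroups * sizeof(gid_t));
> > > > >                     getgrouplist(user, gid, groups, &ngroups);
> > > > >             }
> > > > >             for (i = 0; i < ngroups; i++)
> > > > >                     printf("gid: %d\n", (int) groups[i]);
> > > > >             free(groups);
> > > > >             return ngroups;
> > > > >     }
> > > > > 
> > > > > Every such call that is not answered from the gid cache ends up as an
> > > > > NSS request that sssd has to serve.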
> > > > > 
> > > > > Adding Niels who is more familiar with the gid resolution & caching
> > > > > features in GlusterFS.
> > > > 
> > > > Just to add some background information on the getgrouplist().
> > > > GlusterFS uses several processes that can call getgrouplist():
> > > > - NFS-server, a single process per system
> > > > - brick, a process per exported filesystem/directory, potentially
> > > > several per system
> > > > 
> > > >   [Here, a Gluster environment has many systems (vm/physical). Each
> > > >    system normally runs the NFS-server and a number of brick
> > > >    processes. The layout of the volume is important, but it is very
> > > >    common to have one or more distributed volumes that use multiple
> > > >    bricks on the same system (and many other systems).]
> > > > 
> > > > The need for resolving the groups of a user comes in when users belong
> > > > to many groups. The RPC protocols cannot carry a huge list of groups,
> > > > so the resolving can be done on the server side when the protocol hits
> > > > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
> > > > 
> > > > When using a Gluster volume, certain operations are sent to all the
> > > > bricks (e.g. some directory related operations). I can imagine that
> > > > a network share which is used by many users triggers many
> > > > getgrouplist() calls in different brick processes at (almost) the
> > > > same time.
> > > > 
> > > > For reference, the usage of getgrouplist() in the brick process can be
> > > > found here:
> > > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > > > 
> > > > The gid_resolve() function gets called when the brick process should
> > > > resolve the groups (and ignore the list of groups from the protocol).
> > > > It uses the gidcache functions from a private library:
> > > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > > > 
> > > > The default expiry time for the gidcache is 2 seconds. Users can
> > > > configure this to 30 seconds (or any other value) with:
> > > > 
> > > >     # gluster volume set <VOLUME> server.gid-timeout 30
> > > > 
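> > > > To give an idea of what the gidcache does (this is a simplified sketch,
> > > > not the actual gidcache.c code): every entry remembers when it was
> > > > filled, and a lookup only returns it while it is younger than the
> > > > configured timeout.
> > > > 
> > > >     #include <time.h>
> > > >     #include <sys/types.h>
> > > > 
> > > >     /* sketch of a timeout-based per-uid group cache entry */
> > > >     struct gid_entry {
> > > >             uid_t   uid;
> > > >             time_t  filled_at;
> > > >             int     ngroups;
> > > >             gid_t  *groups;
> > > >     };
> > > > 
> > > >     /* return the entry on a hit, NULL on a miss or when it expired */
> > > >     static struct gid_entry *cache_lookup(struct gid_entry *e,
> > > >                                           uid_t uid, time_t timeout)
> > > >     {
> > > >             if (e == NULL || e->uid != uid)
> > > >                     return NULL;
> > > >             if (timeout <= 0 || time(NULL) - e->filled_at >= timeout)
> > > >                     return NULL;   /* caller calls getgrouplist() again */
> > > >             return e;
> > > >     }
> > > > 
> > > > With a timeout of 0 every lookup is a miss, which matches the worst
> > > > case described above.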
> > > > 
> > > > I think this should explain the use-case sufficiently, but let me know
> > > > if there are any remaining questions. It might well be possible to
> > > > make this code more sssd-friendly. I'm sure that we as Gluster
> > > > developers are open to any suggestions.
> > > 
> > > 
> > > TBH this looks a little bit strange. Other filesystems (as well as the
> > > kernel) create a credentials token when a user first authenticates and
> > > keep those credentials attached to the user session for its duration.
> > > Why does GlusterFS keep hammering the system, requesting the same
> > > information again and again?
> > 
> > The GlusterFS protocol itself is very much stateless, similar to NFSv3.
> > We need all the groups of the user on the server side (brick) to let
> > the backing filesystem (mostly XFS) perform the permission checking. In
> > the current GlusterFS protocol, there is no user authentication. (Well,
> > there has been work done on adding support for SSL; maybe that could be
> > used for tracking sessions on a per-client, not per-user, basis.)
> > 
> > Just for clarity, a GlusterFS client (like a fuse-mount, or the
> > samba/vfs_glusterfs module) is used by many different users. The client
> > builds the connection to the volume. After that, all users with access
> > to the fuse-mount or samba-share are using the same client connection.
> > 
> > By default the client sends a list of groups in each RPC request, and
> > the server side trusts the list the client provides. However, for
> > environments where these lists are too small to hold all the groups,
> > there is an option to do the group resolving on the server side. This is
> > the "server.manage-gids" volume option, which acts very much like the
> > "rpc.mountd --manage-gids" functionality for NFS.
> > 
> > > Keep in mind that the use of getgrouplist() is an inherently costly
> > > operation. Even adding caches, the system cannot cache for long because
> > > it needs to return updated results eventually. Only the application
> > > knows when a user session terminates and/or the list needs to be
> > > refreshed, so "caching" for this type of operation should be done
> > > mostly on the application side.
> > 
> > I assume that your "application side" here is the brick process that
> > runs on the same system as sssd. As mentioned above, the brick processes
> > do cache the result of getgrouplist(). It may well be possible that the
> > default expiry of 2 seconds is too short for many environments. But
> > users can change that timeout easily with the "server.gid-timeout"
> > volume option.
> 
> I guess that might be a viable option to work around the problem for the
> user who initially reported it, but it also doesn't align with what I
> saw in the logs... the sssd_nss logs showed 4000 initgroups requests over
> two minutes from maybe about 10 users.

That would easily be possible if server.gid-timeout is set to 0 and the
server hosts multiple bricks for the same volume. Each brick process will
then call getgrouplist() for every RPC call it receives. I can imagine an
explosion of requests like that when some multi-threaded I/O is done. More
details on the Gluster environment and workload where this was observed
would help.

It is also possible that our gid-cache has a bug and does not apply its
timeout at all. That would be concerning and would require some
investigation on the Gluster side. But again, we will need to verify the
settings of the Gluster volume for this.
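
Something along these lines should show whether the relevant options were
changed from their defaults (assuming the usual gluster CLI):

    # gluster volume info <VOLUME> | grep -E 'manage-gids|gid-timeout'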

> 
> > 
> > From my understanding of this thread, we (the Gluster Community) have
> > two things to do:
> > 
> > 1. Clearly document side-effects that can be caused by enabling the
> >    "server.manage-gids" option, and suggest increasing the
> >    "server.gid-timeout" value (maybe change the default?).
> > 
> > 2. Think about improving the GlusterFS protocol(s) and introduce some
> >    kind of credentials token that is linked with the groups of a user.
> >    Token expiry should invalidate the group-cache. One option would be
> >    to use Kerberos like NFS (RPCSEC_GSS).
> > 
> > 
> > Does this all make sense to others too? I'm adding gluster-devel@ to CC
> > so that others can chime in and this topic won't be forgotten.
> > 
> > Thanks,
> > Niels
> 
> And on the SSSD side, we need to think about an initgroups cache. So far
> I filed ticket https://fedorahosted.org/sssd/ticket/2485 listing the two
> options Simo outlined earlier.
> 
> GlusterFS is not the only project that has requested faster initgroups
> caching; Alexander's slapi-nis would also benefit from the new cache
> (although with slapi-nis we also have a somewhat conflicting RFE to stop
> using the NSS interfaces and talk to SSSD directly, but that's something
> for us to solve).

Okay, thanks. We'll be filing bugs for our changes once we have discussed
it a little more among the Gluster developers.

Cheers,
Niels

