[Gluster-devel] [SSSD] memory cache for initgroups
Lukas Slebodnik
lslebodn at redhat.com
Fri Nov 7 09:17:30 UTC 2014
On (07/11/14 10:13), Jakub Hrozek wrote:
>On Fri, Nov 07, 2014 at 09:59:32AM +0100, Niels de Vos wrote:
>> On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
>> > On Thu, 6 Nov 2014 22:02:29 +0100
>> > Niels de Vos <ndevos at redhat.com> wrote:
>> >
>> > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
>> > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
>> > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
>> > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
>> > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
>> > > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
>> > > > >>>
>> > > > >>>>Hi,
>> > > > >>>>
>> > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
>> > > > >>>>but there are multiple people involved from multiple timezones,
>> > > > >>>>so I think a mailing list thread would be easier to track.
>> > > > >>>>
>> > > > >>>>Can we add another memory cache file to SSSD that would track
>> > > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
>> > > > >>>>initgroups is a somewhat different operation than getpw{uid,nam} and
>> > > > >>>>getgr{gid,nam}, but what if the new memcache were only used by
>> > > > >>>>the NSS responder and at the same time invalidated whenever
>> > > > >>>>initgroups is initiated by the PAM responder, to ensure the
>> > > > >>>>memcache stays up-to-date?
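>> > > > >>>>
>> > > > >>>>Just to make the idea concrete, a record in such a cache could look
>> > > > >>>>roughly like this (only a sketch; the struct and field names are
>> > > > >>>>made up, not an actual SSSD interface):
>> > > > >>>>
>> > > > >>>>  #include <stdint.h>
>> > > > >>>>  #include <sys/types.h>
>> > > > >>>>
>> > > > >>>>  /* Hypothetical initgroups memcache record, keyed by user name. */
>> > > > >>>>  struct initgr_mc_rec {
>> > > > >>>>      char     name[256];    /* user name (lookup key)       */
>> > > > >>>>      uint32_t num_groups;   /* number of gids that follow   */
>> > > > >>>>      gid_t    gids[];       /* cached initgroups() result   */
>> > > > >>>>  };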
>> > > > >>>
>> > > > >>>Can you describe the use case before jumping into a proposed
>> > > > >>>solution?
>> > > > >>
>> > > > >>Many getgrouplist() or initgroups() calls in quick succession.
>> > > > >>One user is GlusterFS -- I'm not quite sure what the reason is
>> > > > >>there; maybe Vijay can elaborate.
>> > > > >
>> > > >
>> > > > The GlusterFS server invokes getgrouplist() to identify the gids
>> > > > associated with the user on whose behalf an RPC request has been sent
>> > > > over the wire. There is a gid caching layer in GlusterFS, and
>> > > > getgrouplist() is called only on a gid cache miss. In the worst
>> > > > case, getgrouplist() can be invoked for every RPC request that
>> > > > GlusterFS receives, and that seems to be happening in a deployment
>> > > > where we found that sssd was kept busy. I am not certain which
>> > > > sequence of operations causes the cache misses.
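>> > > >
>> > > > On a cache miss, the lookup boils down to something like this
>> > > > (a simplified sketch, not the actual GlusterFS code; error
>> > > > handling is omitted):
>> > > >
>> > > >   #include <grp.h>
>> > > >   #include <stdlib.h>
>> > > >   #include <sys/types.h>
>> > > >
>> > > >   /* Resolve all gids of "user"; grows the buffer if the initial
>> > > >    * guess was too small, as getgrouplist() requires. */
>> > > >   static int resolve_gids(const char *user, gid_t primary,
>> > > >                           gid_t **out, int *count)
>> > > >   {
>> > > >       int    ngroups = 32;
>> > > >       gid_t *groups  = malloc(ngroups * sizeof(gid_t));
>> > > >
>> > > >       if (getgrouplist(user, primary, groups, &ngroups) == -1) {
>> > > >           /* too small: ngroups now holds the required count */
>> > > >           groups = realloc(groups, ngroups * sizeof(gid_t));
>> > > >           getgrouplist(user, primary, groups, &ngroups);
>> > > >       }
>> > > >       *out   = groups;
>> > > >       *count = ngroups;
>> > > >       return 0;
>> > > >   }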
>> > > >
>> > > > Adding Niels, who is more familiar with the gid resolution & caching
>> > > > features in GlusterFS.
>> > >
>> > > Just to add some background information on the getgrouplist() usage.
>> > > GlusterFS uses several processes that can call getgrouplist():
>> > > - NFS-server, a single process per system
>> > > - brick, a process per exported filesystem/directory, potentially
>> > >   several per system
>> > >
>> > > [Here, a Gluster environment has many systems (virtual or physical).
>> > > Each system normally runs the NFS-server and a number of brick
>> > > processes. The layout of the volume matters, but it is very
>> > > common to have one or more distributed volumes that use multiple
>> > > bricks on the same system (and on many other systems).]
>> > >
>> > > The need for resolving the groups of a user comes in when users belong
>> > > to many groups. The RPC protocols cannot carry a huge list of groups,
>> > > so the resolving can be done on the server side when the protocol hits
>> > > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
>> > >
>> > > When a Gluster volume is used, certain operations are sent to all the
>> > > bricks (e.g. some directory-related operations). I can imagine that
>> > > a network share used by many users triggers many
>> > > getgrouplist() calls in different brick processes at (almost) the
>> > > same time.
>> > >
>> > > For reference, the usage of getgrouplist() in the brick process can be
>> > > found here:
>> > > -
>> > > https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
>> > >
>> > > The gid_resolve() function gets called when the brick process should
>> > > resolve the groups (and ignore the list of groups from the protocol).
>> > > It uses the gidcache functions from a private library:
>> > > -
>> > > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
>> > > -
>> > > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
>> > >
>> > > The default expiry time for the gidcache is 2 seconds. Users can
>> > > configure it to 30 seconds (or any other value) with:
>> > >
>> > > # gluster volume set <VOLUME> server.gid-timeout 30
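>> > >
>> > > Conceptually, the cache check is just a timestamp comparison, roughly
>> > > like this (a simplified sketch; the struct and function names are
>> > > illustrative, the real logic is in gidcache.c linked above):
>> > >
>> > >   #include <stddef.h>
>> > >   #include <sys/types.h>
>> > >   #include <time.h>
>> > >
>> > >   /* Simplified view of one expiry-based gid cache entry. */
>> > >   struct gid_entry {
>> > >       uid_t   uid;
>> > >       time_t  deadline;   /* absolute expiry time                 */
>> > >       int     ngids;
>> > >       gid_t  *gids;       /* cached result of getgrouplist()      */
>> > >   };
>> > >
>> > >   /* Return the entry if it matches and has not expired yet; NULL
>> > >    * means a miss, which forces a fresh getgrouplist() call. */
>> > >   static struct gid_entry *lookup(struct gid_entry *e, uid_t uid)
>> > >   {
>> > >       if (e->uid == uid && time(NULL) < e->deadline)
>> > >           return e;
>> > >       return NULL;
>> > >   }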
>> > >
>> > >
>> > > I think this should explain the use-case sufficiently, but let me know
>> > > if there are any remaining questions. It might well be possible to
>> > > make this code more sssd-friendly. I'm sure that we as Gluster
>> > > developers are open to any suggestions.
>> >
>> >
>> > TBH this looks a little bit strange; other filesystems (as well as the
>> > kernel) create a credentials token when a user first authenticates and
>> > keep these credentials attached to the user session for its duration.
>> > Why does GlusterFS keep hammering the system, requesting the same
>> > information again and again?
>>
>> The GlusterFS protocol itself is very much stateless, similar to NFSv3.
>> We need all the groups of the user on the server side (brick) to let
>> the backing filesystem (mostly XFS) perform the permission checking. In
>> the current GlusterFS protocol there is no user authentication. (Well,
>> there has been work done on adding support for SSL; maybe that could be
>> used for tracking sessions on a per-client, not per-user, basis.)
>>
>> Just for clarity, a GlusterFS client (like a fuse-mount or the
>> samba/vfs_glusterfs module) is used by many different users. The client
>> builds the connection to the volume; after that, all users with access
>> to the fuse-mount or samba-share use the same client connection.
>>
>> By default the client sends a list of groups in each RPC request, and
>> the server side trusts the list the client provides. However, for
>> environments where these lists are too small to hold all the groups,
>> there is an option to do the group resolving on the server side. This is
>> the "server.manage-gids" volume option, which acts very much like the
>> "rpc.mountd --manage-gids" functionality for NFS.
>>
>> > Keep in mind that the use of getgrouplist() is an inherently costly
>> > operation. Even adding caches, the system cannot cache for long because
>> > it needs to return updated results eventually. Only the application
>> > knows when a user session terminates and/or when the list needs to be
>> > refreshed, so "caching" for this type of operation should be done
>> > mostly on the application side.
>>
>> I assume that your "application side" here is the brick process that
>> runs on the same system as sssd. As mentioned above, the brick processes
>> do cache the result of getgrouplist(). It may well be possible that the
>> default expiry of 2 seconds is too short for many environments. But
>> users can change that timeout easily with the "server.gid-timeout"
>> volume option.
>
>I guess that might be a viable option to work around the problem for the
>user who initially reported it, but it also doesn't align with what I
>saw in the logs: the sssd_nss logs showed 4000 initgroups requests over
>two minutes from roughly 10 users.
>
>>
>> From my understanding of this thread, we (the Gluster Community) have
>> two things to do:
>>
>> 1. Clearly document the side-effects that can be caused by enabling the
>>    "server.manage-gids" option, and suggest increasing the
>>    "server.gid-timeout" value (maybe change the default?).
>>
>> 2. Think about improving the GlusterFS protocol(s) and introducing some
>>    kind of credentials token that is linked to the groups of a user.
>>    Token expiry should invalidate the group-cache. One option would be
>>    to use Kerberos, as NFS does (RPCSEC_GSS).
>>
>>
>> Does this all make sense to others too? I'm adding gluster-devel@ to CC
>> so that others can chime in and this topic won't be forgotten.
>>
>> Thanks,
>> Niels
>
>And on the SSSD side, we need to think about an initgroups cache. So far
>I have filed ticket https://fedorahosted.org/sssd/ticket/2485, listing the
>two options Simo outlined earlier.
>
>GlusterFS is not the only project that has requested faster initgroups
>caching; Alexander's slapi-nis would also benefit from the new cache.
>(Although with slapi-nis we also have a somewhat conflicting RFE to stop
>using NSS interfaces and go to SSSD directly, but that's something for
>us to solve.)
The memory cache is used in the NSS responder as well :-)
LS