[Gluster-devel] memory cache for initgroups
Alexander Bokovoy
abokovoy at redhat.com
Fri Nov 7 10:13:59 UTC 2014
On Fri, 07 Nov 2014, Niels de Vos wrote:
>On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
>> On Thu, 6 Nov 2014 22:02:29 +0100
>> Niels de Vos <ndevos at redhat.com> wrote:
>>
>> > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
>> > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
>> > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
>> > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
>> > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
>> > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
>> > > >>>
>> > > >>>>Hi,
>> > > >>>>
>> > > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
>> > > >>>>but there are multiple people involved from multiple timezones,
>> > > >>>>so I think a mailing list thread would be easier to track.
>> > > >>>>
>> > > >>>>Can we add another memory cache file to SSSD, that would track
>> > > >>>>initgroups/getgrouplist results for the NSS responder? I realize
>> > > >>>>initgroups is a somewhat different operation than getpw{uid,nam} and
>> > > >>>>getgr{gid,nam} but what if the new memcache was only used by
>> > > >>>>the NSS responder and at the same time invalidated when
>> > > >>>>initgroups is initiated by the PAM responder to ensure the
>> > > >>>>memcache is up-to-date?
>> > > >>>
>> > > >>>Can you describe the use case before jumping into a proposed
>> > > >>>solution ?
>> > > >>
>> > > >>Many getgrouplist() or initgroups() calls in a quick succession.
>> > > >>One user is GlusterFS -- I'm not quite sure what the reason is
>> > > >>there, maybe Vijay can elaborate.
>> > > >
>> > >
>> > > The GlusterFS server invokes getgrouplist() to identify the gids
>> > > associated with a user on whose behalf an rpc request has been sent
>> > > over the wire. There is a gid caching layer in GlusterFS, and
>> > > getgrouplist() only gets called when there is a gid cache miss. In
>> > > the worst case, getgrouplist() can be invoked for every rpc request
>> > > that GlusterFS receives, and that seems to be the case in a
>> > > deployment where we found that sssd was kept busy. I am not certain
>> > > about the sequence of operations that can cause the cache to be
>> > > missed.
>> > >
>> > > Adding Niels who is more familiar with the gid resolution & caching
>> > > features in GlusterFS.
>> >
>> > Just to add some background information on the getgrouplist().
>> > GlusterFS uses several processes that can call getgrouplist():
>> > - NFS-server, a single process per system
>> > - brick, a process per exported filesystem/directory, potentially
>> >   several per system
>> >
>> > [Here, a Gluster environment has many systems (vm/physical). Each
>> > system normally runs the NFS-server, and a number of brick
>> > processes. The layout of the volume is important, but it is very
>> > common to have one or more distributed volumes that use multiple
>> > bricks on the same system (and many other systems).]
>> >
>> > The need for resolving the groups of a user comes in when users belong
>> > to many groups. The RPC protocols cannot carry a huge list of groups,
>> > so the resolving can be done on the server side when the protocol hits
>> > its limits (> 16 for NFS, approx. > 93 for GlusterFS).
>> >
>> > Upon using a Gluster volume, certain operations are sent to all the
>> > bricks (i.e. some directory related operations). I can imagine that
>> > a network share that is used by many users triggers many
>> > getgrouplist() calls in different brick processes at (almost) the
>> > same time.
>> >
>> > For reference, the usage of getgrouplist() in the brick process can be
>> > found here:
>> > -
>> > https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
>> >
>> > The gid_resolve() function gets called in case the brick process should
>> > resolve the groups (and ignore the list of groups from the protocol).
>> > It uses the gidcache functions from a private library:
>> > -
>> > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
>> > -
>> > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
>> >
>> > The default time for the gidcache to expire is 2 seconds. Users can
>> > configure this to 30 seconds (or any other value) with:
>> >
>> > # gluster volume set <VOLUME> server.gid-timeout 30
>> >
>> >
>> > I think this should explain the use-case sufficiently, but let me know
>> > if there are any remaining questions. It might well be possible to
>> > make this code more sssd friendly. I'm sure that we as Gluster
>> > developers are open to any suggestions.
What I can see is that the NFS xlator also calls it, in nfs_fix_groups():
https://github.com/gluster/glusterfs/blob/master/xlators/nfs/server/src/nfs-fops.c#L96
which is then used in nlm4_file_open_and_resume().
>> TBH this looks a little bit strange; other filesystems (as well as the
>> kernel) create a credentials token when a user first authenticates and
>> keep these credentials attached to the user session for its duration.
>> Why does GlusterFS keep hammering the system requesting the same
>> information again and again?
>
>The GlusterFS protocol itself is very much stateless, similar to NFSv3.
>We need all the groups of the user on the server-side (brick) to allow
>the backing filesystem (mostly XFS) to perform the permission checking. In
>the current GlusterFS protocol, there is no user authentication. (Well,
>there has been work done on adding support for SSL, maybe that could be
>used for tracking sessions on a per-client, not user, basis.)
>
>Just for clarity, a GlusterFS client (like a fuse-mount, or the
>samba/vfs_glusterfs module) is used by many different users. The client
>builds the connection to the volume. After that, all users with access
>to the fuse-mount or samba-share are using the same client connection.
>
>By default the client sends a list of groups in each RPC request, and
>the server-side trusts the list the client provides. However, for
>environments where these lists are too small to hold all the groups,
>there is an option to do the group resolving on the server side. This is
>the "server.manage-gids" volume option, which acts very much like the
>"rpc.mountd --manage-gids" functionality for NFS.
In the case of complex group memberships, fetching the group list might
take longer than the default 2 seconds, which is not that unusual for
LDAP-backed configurations with high network latency. Setting the group
membership cache to a higher threshold by default is reasonable, as is
tying it to a client connection/authentication source. After all, the
group membership of a particular user doesn't really change every two
seconds, or even every 30 seconds.
It would definitely help to have support in the GlusterFS protocol that
allows the client to hint to the server that, as a result of
authentication, the user's properties have changed, so a refresh is
needed on the server side where a gid cache is involved. That could be a
per-RPC flag; at worst, a client would force the current behavior of
dropping the gid cache contents too often. There already seems to be too
much trust placed in the GlusterFS client by the server side.
>> Keep in mind that the use of getgrouplist() is an inherently costly
>> operation. Even adding caches, the system cannot cache for long because
>> it needs to return updated results eventually. Only the application
>> knows when a user session terminates and/or the list needs to be
>> refreshed, so "caching" for this type of operation should be done
>> mostly on the application side.
>
>I assume that your "application side" here is the brick process that
>runs on the same system as sssd. As mentioned above, the brick processes
>do cache the result of getgrouplist(). It may well be possible that the
>default expiry of 2 seconds is too short for many environments. But
>users can change that timeout easily with the "server.gid-timeout"
>volume option.
>
>From my understanding of this thread, we (the Gluster Community) have
>two things to do:
>
>1. Clearly document side-effects that can be caused by enabling the
> "server.manage-gids" option, and suggest increasing the
> "server.gid-timeout" value (maybe change the default?).
>
>2. Think about improving the GlusterFS protocol(s) and introduce some
> kind of credentials token that is linked with the groups of a user.
> Token expiry should invalidate the group-cache. One option would be
> to use Kerberos like NFS (RPCSEC_GSS).
>
>
>Does this all make sense to others too? I'm adding gluster-devel@ to CC
>so that others can chime in and this topic won't be forgotten.
Yes, both of these approaches make sense. Changing the default to
something more reasonable, like several minutes, would already help
reduce contention on an nss provider -- I've encountered that nss_files
also behaves badly when too many requests come from parallel threads in
the same process; it can easily lock itself up serializing access to
/etc/passwd or /etc/group.
Perhaps you can also add a third item, like I proposed above: a per-RPC
flag that lets the client hint to the server about the use of cached
gids when the server.manage-gids option is set on the server side.
--
/ Alexander Bokovoy