[Gluster-devel] memory cache for initgroups

Niels de Vos ndevos at redhat.com
Fri Nov 7 12:43:05 UTC 2014


On Fri, Nov 07, 2014 at 12:13:59PM +0200, Alexander Bokovoy wrote:
> On Fri, 07 Nov 2014, Niels de Vos wrote:
> >On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> >>On Thu, 6 Nov 2014 22:02:29 +0100
> >>Niels de Vos <ndevos at redhat.com> wrote:
> >>
> >>> On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> >>> > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> >>> > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> >>> > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> >>> > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> >>> > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> >>> > >>>
> >>> > >>>>Hi,
> >>> > >>>>
> >>> > >>>>we had a short discussion on $SUBJECT with Simo on IRC already,
> >>> > >>>>but there are multiple people involved from multiple timezones,
> >>> > >>>>so I think a mailing-list thread would be easier to track.
> >>> > >>>>
> >>> > >>>>Can we add another memory cache file to SSSD, that would track
> >>> > >>>>initgroups/getgrouplist results for the NSS responder? I realize
> >>> > >>>>initgroups is a bit different operation than getpw{uid,nam} and
> >>> > >>>>getgr{gid,nam} but what if the new memcache was only used by
> >>> > >>>>the NSS responder and at the same time invalidated when
> >>> > >>>>initgroups is initiated by the PAM responder to ensure the
> >>> > >>>>memcache is up-to-date?
> >>> > >>>
> >>> > >>>Can you describe the use case before jumping into a proposed
> >>> > >>>solution ?
> >>> > >>
> >>> > >>Many getgrouplist() or initgroups() calls in a quick succession.
> >>> > >>One user is GlusterFS -- I'm not quite sure what the reason is
> >>> > >>there, maybe Vijay can elaborate.
> >>> > >
> >>> >
> >>> > GlusterFS server invokes getgrouplist() to identify gids associated
> >>> > with a user on whose behalf an rpc request has been sent over the
> >>> > wire. There is a gid caching layer in GlusterFS and getgrouplist()
> >>> > does get called only if there is a gid cache miss. In the worst
> >>> > case, getgrouplist() can be invoked for every rpc request that
> >>> > GlusterFS receives and that seems to be the case in a deployment
> >>> > where we found that sssd was kept very busy. I am not certain about the
> >>> > sequence of operations that can cause the cache to be missed.
> >>> >
> >>> > Adding Niels who is more familiar with the gid resolution & caching
> >>> > features in GlusterFS.
> >>>
> >>> Just to add some background information on the getgrouplist().
> >>> GlusterFS uses several processes that can call getgrouplist():
> >>> - NFS-server, a single process per system
> >>> - brick, a process per exported filesystem/directory, potentially
> >>> several per system
> >>>
> >>>   [Here, a Gluster environment has many systems (vm/physical). Each
> >>>    system normally runs the NFS-server, and a number of brick
> >>> processes. The layout of the volume is important, but it is very
> >>> common to have one or more distributed volumes that use multiple
> >>> bricks on the same system (and many other systems).]
> >>>
> >>> The need for resolving the groups of a user comes in when users belong
> >>> to many groups. The RPC protocols can not carry a huge list of groups,
> >>> so the resolving can be done on the server side when the protocol hits
> >>> its limits (> 16 for NFS, approx. > 93 for GlusterFS).
> >>>
> >>> Upon using a Gluster volume, certain operations are sent to all the
> >>> bricks (e.g. some directory-related operations). I can imagine that
> >>> a network share that is used by many users triggers many
> >>> getgrouplist() calls in different brick processes at the (almost)
> >>> same time.
> >>>
> >>> For reference, the usage of getgrouplist() in the brick process can be
> >>> found here:
> >>> -
> >>> https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> >>>
> >>> The gid_resolve() function gets called when the brick process should
> >>> resolve the groups (and ignore the list of groups from the protocol).
> >>> It uses the gidcache functions from a private library:
> >>> -
> >>> https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> >>> -
> >>> https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> >>>
> >>> The default time for the gidcache to expire is 2 seconds. Users should
> >>> be able to configure this to 30 seconds (or anything else) with:
> >>>
> >>>     # gluster volume set <VOLUME> server.gid-timeout 30
> >>>
> >>>
> >>> I think this should explain the use-case sufficiently, but let me know
> >>> if there are any remaining questions. It might well be possible to
> >>> make this code more sssd friendly. I'm sure that we as Gluster
> >>> developers are open to any suggestions.
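
To make that flow a bit easier to follow without reading the sources, a
brick does roughly the following whenever it has to resolve the groups
itself. This is a simplified sketch with invented names, not the actual
gid_resolve()/gidcache code from the links above:

    /* Simplified sketch of the brick-side group resolution described
     * above.  Names like simple_gid_cache_t are invented for this
     * example; the real code lives in gidcache.c and server-helpers.c. */
    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define GID_CACHE_TIMEOUT 2        /* current Gluster default, seconds */

    typedef struct {
        uid_t   uid;
        time_t  fetched_at;
        int     ngroups;
        gid_t  *groups;
    } simple_gid_cache_t;

    /* Return the number of groups for 'uid', refreshing the cache entry
     * when it is older than GID_CACHE_TIMEOUT seconds. */
    static int resolve_groups(simple_gid_cache_t *entry, uid_t uid)
    {
        if (entry->uid == uid && entry->groups &&
            time(NULL) - entry->fetched_at < GID_CACHE_TIMEOUT)
            return entry->ngroups;          /* cache hit, no NSS call */

        struct passwd *pw = getpwuid(uid);  /* goes through nss/sssd */
        if (!pw)
            return -1;

        int ngroups = 0;
        gid_t probe;
        /* first call only asks how many groups there are */
        getgrouplist(pw->pw_name, pw->pw_gid, &probe, &ngroups);

        gid_t *groups = calloc(ngroups, sizeof(gid_t));
        if (!groups)
            return -1;
        if (getgrouplist(pw->pw_name, pw->pw_gid, groups, &ngroups) < 0) {
            free(groups);
            return -1;
        }

        free(entry->groups);
        entry->uid        = uid;
        entry->groups     = groups;
        entry->ngroups    = ngroups;
        entry->fetched_at = time(NULL);
        return ngroups;
    }

    int main(void)
    {
        simple_gid_cache_t entry = { 0 };
        int n = resolve_groups(&entry, getuid());
        printf("resolved %d groups; repeat calls within %d seconds hit the cache\n",
               n, GID_CACHE_TIMEOUT);
        free(entry.groups);
        return 0;
    }

Every cache miss or expiry ends up in getpwuid()/getgrouplist(), and with
the 2 second default that can happen for the same uid in every brick
process on the system at almost the same time, which is the load sssd
ends up seeing.
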
> What I can see is that the NFS xlator also calls it, in nfs_fix_groups():
> https://github.com/gluster/glusterfs/blob/master/xlators/nfs/server/src/nfs-fops.c#L96
> which is then used in nlm4_file_open_and_resume().

Yes. The NFS-server is a single process that all Gluster volumes use, so
its group cache is shared more efficiently than the group cache that
gets duplicated in every brick process. Also, this code path is only
used when a client system mounts a volume over NFS; the rest of the
described functionality is pretty much the same. It is possible that
group caching is used in both the Gluster/NFS-server and the brick
processes at the same time.

> 
> >>TBH this looks a little bit strange; other filesystems (as well as the
> >>kernel) create a credentials token when a user first authenticates and
> >>keep these credentials attached to the user session for the duration.
> >>Why does GlusterFS keep hammering the system, requesting the same
> >>information again and again?
> >
> >The GlusterFS protocol itself is very much stateless, similar to NFSv3.
> >We need all the groups of the user on the server-side (brick) to allow
> >the backing filesystem (mostly XFS) to perform the permission checking. In
> >the current GlusterFS protocol, there is no user authentication. (Well,
> >there has been work done on adding support for SSL, maybe that could be
> >used for tracking sessions on a per-client, not user, basis.)
> >
> >Just for clarity, a GlusterFS client (like a fuse-mount, or the
> >samba/vfs_glusterfs module) is used by many different users. The client
> >builds the connection to the volume. After that, all users with access
> >to the fuse-mount or samba-share are using the same client connection.
> >
> >By default the client sends a list of groups in each RPC request, and
> >the server-side trusts the list the client provides. However, for
> >environments where these lists are too small to hold all the groups,
> >there is an option to do the group resolving on the server side. This is
> >the "server.manage-gids" volume option, which acts very much like the
> >"rpc.mountd --manage-gids" functionality for NFS.
> In the case of complex group membership, fetching the group list might
> take longer than the default 2 seconds, which is not that unusual for
> LDAP-backed configurations with high network latency. Setting the group
> membership cache to a higher threshold by default is reasonable, as well
> as tying it to a client connection/authentication source. After all,
> group membership for a particular user doesn't really change every two
> seconds, or even every 30 seconds.

Yes, I agree. If you have a suggestion for changing the default of 2
seconds to something more practical, we'll consider that. Any advice on
that is welcome. Even a description of the things to take into account
when a user wants to configure the timeout would be appreciated.

> It would definitely help to have a hand from the GlusterFS protocol to
> allow the client to hint to the server side that, as a result of
> authentication, the user's properties have changed, so a refresh is
> needed on the server in case the gid cache is involved. That could be a
> per-RPC flag, and at worst the client would be forcing the current
> behavior of dropping the gid cache contents too often. There already
> seems to be too much trust in the GlusterFS client by the server side.

I think RPCSEC_GSS could be used for that? When the credentials of a
user are invalidated, or have changed (?), the process verifying the
credentials (the brick process) should invalidate/refresh the group
cache too.

The GlusterFS protocol does not support RPCSEC_GSS yet, but we can
definitely think about adding that if it has an advantage here. The
protocol changes can be done relatively easily:
  - We would probably have something like a virtual xattr that a client
    can send to drop its own group cache.
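
To illustrate that idea, the trigger from the client could look roughly
like the sketch below. Nothing like this exists today; the virtual xattr
name is made up purely for the example.

    /* Hypothetical client-side trigger to drop the gid cache.  The
     * virtual xattr name below does not exist in GlusterFS; it only
     * illustrates the idea from the point above. */
    #include <stdio.h>
    #include <sys/xattr.h>

    #define GF_VXATTR_FLUSH_GIDS "glusterfs.flush-gid-cache"  /* invented */

    int main(int argc, char **argv)
    {
        const char *mountpoint = argc > 1 ? argv[1] : "/mnt/gluster";

        /* A client xlator would intercept this virtual xattr and
         * translate it into "drop the cached groups", instead of
         * storing anything on disk. */
        if (setxattr(mountpoint, GF_VXATTR_FLUSH_GIDS, "1", 1, 0) == -1) {
            perror("setxattr");
            return 1;
        }
        printf("requested gid cache flush via %s\n", mountpoint);
        return 0;
    }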

For me, detecting a change in groups on the client side would need more
thinking. Hints? The glusterfs-fuse client is all user-space, and can
easily be extended with new functionality.

I do not think NFS-clients can communicate a need for refreshing the
group list, so that might still be an issue. Gluster/NFS only supports
RPC/AUTH_UNIX and NFSv3. In the future, nfs-ganesha will mostly be used,
and I think we are more flexible there.

> >>Keep in mind that the use of getgrouplist() is an inherently costly
> >>operation. Even adding caches, the system cannot cache for long because
> >>it needs to return updated results eventually. Only the application
> >>knows when a user session terminates and/or the list needs to be
> >>refreshed, so "caching" for this type of operation should be done
> >>mostly on the application side.
> >
> >I assume that your "application side" here is the brick process that
> >runs on the same system as sssd. As mentioned above, the brick processes
> >do cache the result of getgrouplist(). It may well be possible that the
> >default expiry of 2 seconds is too short for many environments. But
> >users can change that timeout easily with the "server.gid-timeout"
> >volume option.
> >
> >From my understanding of this thread, we (the Gluster Community) have
> >two things to do:
> >
> >1. Clearly document side-effects that can be caused by enabling the
> >  "server.manage-gids" option, and suggest increasing the
> >  "server.gid-timeout" value (maybe change the default?).
> >
> >2. Think about improving the GlusterFS protocol(s) and introduce some
> >  kind of credentials token that is linked with the groups of a user.
> >  Token expiry should invalidate the group-cache. One option would be
> >  to use Kerberos like NFS (RPCSEC_GSS).
> >
> >
> >Does this all make sense to others too? I'm adding gluster-devel@ to CC
> >so that others can chime in and this topic won't be forgotten.
> Yes, both these approaches make sense. Changing the default to something
> more reasonable, like several minutes, would already help to reduce
> contention on an NSS provider -- I've seen that nss_files also behaves
> badly when too many requests come from parallel threads in the same
> process, and it can easily lock itself up serializing access to
> /etc/passwd or /etc/group.
> 
> Perhaps you can also add a third one, like I proposed above -- to allow
> a per-RPC flag that gives a hint from the client to the server about the
> use of cached gids when the server.manage-gids option is set on the
> server side.
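
To make the per-RPC hint a bit more concrete, it could look something
like the sketch below. This is not the existing GlusterFS RPC credential
layout, just an illustration of where such a flag could live and how the
server side would use it:

    /* Hypothetical per-RPC credential hint.  The structure and the flag
     * are invented for this example and are not part of the current
     * GlusterFS protocol. */
    #include <stdint.h>

    #define GF_RPC_FLAG_REFRESH_GIDS 0x1   /* invented flag bit */

    struct gf_rpc_cred_hint {
        uint32_t uid;
        uint32_t gid;
        uint32_t flags;   /* the client sets GF_RPC_FLAG_REFRESH_GIDS
                             after it (re)authenticates a user; the
                             server then drops the cached group list
                             for this uid before resolving it again */
    };

    /* Server side: decide whether the cached gids may be reused. */
    static int may_use_cached_gids(const struct gf_rpc_cred_hint *cred)
    {
        return !(cred->flags & GF_RPC_FLAG_REFRESH_GIDS);
    }

At worst a client would set the flag on every request, which degrades to
the current behavior of refreshing the groups whenever the cache expires.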

Thanks for sharing your thoughts! It really helps to get ideas and
opinions from others. It's much appreciated.

Niels

