[Gluster-devel] memory cache for initgroups

Simo Sorce ssorce at redhat.com
Fri Nov 7 13:42:50 UTC 2014


On Fri, 7 Nov 2014 09:59:32 +0100
Niels de Vos <ndevos at redhat.com> wrote:

> On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > On Thu, 6 Nov 2014 22:02:29 +0100
> > Niels de Vos <ndevos at redhat.com> wrote:
> > 
> > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> > > > >>>
> > > > >>>>Hi,
> > > > >>>>
> > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC
> > > > >>>>already, but there are multiple people involved from
> > > > >>>>multiple timezones, so I think a mailing list thread would
> > > > >>>>be easier to track.
> > > > >>>>
> > > > >>>>Can we add another memory cache file to SSSD that would
> > > > >>>>track initgroups/getgrouplist results for the NSS
> > > > >>>>responder? I realize initgroups is a somewhat different
> > > > >>>>operation than getpw{uid,nam} and getgr{gid,nam}, but what
> > > > >>>>if the new memcache were only used by the NSS responder and
> > > > >>>>at the same time invalidated when initgroups is initiated
> > > > >>>>by the PAM responder, to ensure the memcache stays
> > > > >>>>up-to-date?
> > > > >>>
> > > > >>>Can you describe the use case before jumping into a proposed
> > > > >>>solution?
> > > > >>
> > > > >>Many getgrouplist() or initgroups() calls in quick
> > > > >>succession. One user is GlusterFS -- I'm not quite sure what
> > > > >>the reason is there; maybe Vijay can elaborate.
> > > > >
> > > > 
> > > > The GlusterFS server invokes getgrouplist() to identify the
> > > > gids associated with a user on whose behalf an rpc request has
> > > > been sent over the wire. There is a gid caching layer in
> > > > GlusterFS, and getgrouplist() gets called only on a gid cache
> > > > miss. In the worst case, getgrouplist() can be invoked for
> > > > every rpc request that GlusterFS receives, and that seems to be
> > > > the case in a deployment where we found that sssd was being
> > > > kept busy. I am not certain about the sequence of operations
> > > > that causes the cache to be missed.
> > > > 
> > > > Adding Niels, who is more familiar with the gid resolution and
> > > > caching features in GlusterFS.
> > > 
> > > Just to add some background information on getgrouplist().
> > > GlusterFS uses several processes that can call getgrouplist():
> > > - NFS-server, a single process per system
> > > - brick, a process per exported filesystem/directory, potentially
> > >   several per system
> > > 
> > >   [For context: a Gluster environment has many systems
> > > (vm/physical). Each system normally runs the NFS-server and a
> > > number of brick processes. The layout of the volume is important,
> > > but it is very common to have one or more distributed volumes
> > > that use multiple bricks on the same system (and on many other
> > > systems).]
> > > 
> > > The need for resolving the groups of a user comes in when users
> > > belong to many groups. The RPC protocols cannot carry a huge
> > > list of groups, so the resolving can be done on the server side
> > > when the protocol hits its limits (> 16 for NFS, approx. > 93 for
> > > GlusterFS).
> > > 
> > > When a Gluster volume is used, certain operations are sent to all
> > > the bricks (e.g. some directory-related operations). I can
> > > imagine that a network share which is used by many users triggers
> > > many getgrouplist() calls in different brick processes at
> > > (almost) the same time.
> > > 
> > > For reference, the usage of getgrouplist() in the brick process
> > > can be found here:
> > > - https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
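> > > 
> > > In rough terms the call pattern there is: map the uid from the
> > > rpc request to a user name, then ask the C library for all the
> > > groups of that user. A simplified sketch (not the exact GlusterFS
> > > code; names are illustrative):
> > > 
> > >     #include <sys/types.h>
> > >     #include <grp.h>
> > >     #include <pwd.h>
> > >     #include <stdlib.h>
> > > 
> > >     /* Hypothetical helper, loosely following gid_resolve(). */
> > >     static int resolve_groups(uid_t uid, gid_t **out, int *nout)
> > >     {
> > >         struct passwd pw, *res = NULL;
> > >         char buf[4096];
> > >         gid_t *groups, *bigger;
> > >         int ngroups = 16;
> > > 
> > >         /* uid from the rpc request -> user name */
> > >         if (getpwuid_r(uid, &pw, buf, sizeof(buf), &res) != 0)
> > >             return -1;
> > >         if (res == NULL)   /* no such uid */
> > >             return -1;
> > > 
> > >         groups = malloc(ngroups * sizeof(gid_t));
> > >         if (groups == NULL)
> > >             return -1;
> > > 
> > >         /* getgrouplist() returns -1 and updates ngroups when the
> > >          * array is too small; retry with the bigger size. This
> > >          * is the call that ends up hitting sssd on every cache
> > >          * miss. */
> > >         if (getgrouplist(pw.pw_name, pw.pw_gid, groups,
> > >                          &ngroups) == -1) {
> > >             bigger = realloc(groups, ngroups * sizeof(gid_t));
> > >             if (bigger == NULL) {
> > >                 free(groups);
> > >                 return -1;
> > >             }
> > >             groups = bigger;
> > >             if (getgrouplist(pw.pw_name, pw.pw_gid, groups,
> > >                              &ngroups) == -1) {
> > >                 free(groups);
> > >                 return -1;
> > >             }
> > >         }
> > > 
> > >         *out = groups;
> > >         *nout = ngroups;
> > >         return 0;
> > >     }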
> > > 
> > > The gid_resolve() function gets called when the brick process
> > > should resolve the groups (and ignore the list of groups from the
> > > protocol). It uses the gidcache functions from a private library:
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > - https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > > 
> > > The default time for gidcache entries to expire is 2 seconds.
> > > Users can configure this to 30 seconds (or any other value) with:
> > > 
> > >     # gluster volume set <VOLUME> server.gid-timeout 30
> > > 
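> > > Conceptually each cache entry just carries a deadline; a minimal
> > > sketch (illustrative only, not the actual gidcache.c structures):
> > > 
> > >     #include <sys/types.h>
> > >     #include <time.h>
> > > 
> > >     /* Hypothetical, simplified cache entry. */
> > >     struct gid_entry {
> > >         uid_t   uid;
> > >         gid_t  *gids;
> > >         int     ngids;
> > >         time_t  deadline;   /* set to now + server.gid-timeout */
> > >     };
> > > 
> > >     /* An expired entry counts as a miss, which triggers another
> > >      * getgrouplist() round trip to sssd. */
> > >     static int gid_entry_valid(const struct gid_entry *e, time_t now)
> > >     {
> > >         return e->gids != NULL && now < e->deadline;
> > >     }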
> > > 
> > > I think this should explain the use case sufficiently, but let me
> > > know if there are any remaining questions. It might well be
> > > possible to make this code more sssd-friendly. I'm sure that we
> > > as Gluster developers are open to any suggestions.
> > 
> > 
> > TBH this looks a little bit strange; other filesystems (as well as
> > the kernel) create a credentials token when a user first
> > authenticates and keep these credentials attached to the user
> > session for its duration. Why does GlusterFS keep hammering the
> > system, requesting the same information again and again?
> 
> The GlusterFS protocol itself is very much stateless, similar to
> NFSv3. We need all the groups of the user on the server side (brick)
> to allow the backing filesystem (mostly XFS) to perform the
> permission checking. In the current GlusterFS protocol, there is no
> user authentication. (Well, there has been work done on adding
> support for SSL; maybe that could be used for tracking sessions on a
> per-client, not per-user, basis.)
> 
> Just for clarity, a GlusterFS client (like a fuse-mount or the
> samba/vfs_glusterfs module) is used by many different users. The
> client builds the connection to the volume. After that, all users
> with access to the fuse-mount or samba-share use the same client
> connection.
> 
> By default the client sends a list of groups in each RPC request, and
> the server side trusts the list the client provides. However, for
> environments where these lists are too small to hold all the groups,
> there is an option to do the group resolving on the server side. This
> is the "server.manage-gids" volume option, which acts very much like
> the "rpc.mountd --manage-gids" functionality for NFS.

Instead of sending a list of groups every time ... wouldn't it be
better to send a "session token" (a random 128-bit uuid) and let the
bricks use this value to associate their cached lists?

This way you can control how caching is done from the client side.
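
To make that concrete, a rough sketch of what a brick could keep (names
made up; this is not a protocol proposal, just the shape of the idea):

    #include <stdint.h>
    #include <sys/types.h>
    #include <time.h>

    /* Random 128-bit token generated by the client once per "session"
     * and sent in every rpc request instead of the full group list. */
    struct cred_token {
        uint8_t uuid[16];
    };

    /* What a brick would cache, keyed by the token: the group list is
     * resolved (or received) once and reused until the client drops
     * the session, so the lifetime is driven by the client rather
     * than by a server-side timeout. */
    struct cred_cache_entry {
        struct cred_token token;
        uid_t   uid;
        gid_t  *gids;
        int     ngids;
        time_t  established;   /* diagnostics / optional expiry */
    };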

> > Keep in mind that getgrouplist() is an inherently costly operation.
> > Even with caches, the system cannot cache for long because it needs
> > to return updated results eventually. Only the application knows
> > when a user session terminates and/or the list needs to be
> > refreshed, so "caching" for this type of operation should be done
> > mostly on the application side.
> 
> I assume that your "application side" here is the brick process that
> runs on the same system as sssd. As mentioned above, the brick
> processes do cache the result of getgrouplist(). It may well be
> possible that the default expiry of 2 seconds is too short for many
> environments. But users can change that timeout easily with the
> "server.gid-timeout" volume option.

Well, the problem is that, unless you know you have some sort of user
session, longer caches only have the effect of upsetting users whose
credentials have just changed.

The way it *should* work, at least if you want POSIX compatibility[1],
is that once a user on a client starts a session, their credentials
never change until the user logs out and logs back in, regardless of
what happens in the identity management system (or the passwd/group
files).

> From my understanding of this thread, we (the Gluster Community) have
> two things to do:
> 
> 1. Clearly document side-effects that can be caused by enabling the
>    "server.manage-gids" option, and suggest increasing the
>    "server.gid-timeout" value (maybe change the default?).
> 
> 2. Think about improving the GlusterFS protocol(s) and introducing
>    some kind of credentials token that is linked with the groups of a
>    user. Token expiry should invalidate the group cache. One option
>    would be to use Kerberos, like NFS does (RPCSEC_GSS).

Using RPCSEC_GSS is one good way to tie a user to their credentials, as
said credentials are tied to the GSS context and never change until
the context is destroyed. More generally, using a token created at
"session establishment"[2] and kept for as long as it is valid would
resolve a host of issues and make your filesystem more POSIX compliant
and predictable when it comes to access control decisions.

> Does this all make sense to others too? I'm adding gluster-devel@ to
> CC so that others can chime in and this topic won't be forgotten.

It does.

Simo.



[1] IIRC POSIX requires that the credential set established in the
kernel at login time is used, unchanged, throughout the lifetime of the
process. This is particularly important as a process may
*intentionally* drop auxiliary groups or even change its credential set
entirely (like root switching to a different uid and an arbitrary set
of gids).
You may decide this is not something you want to care about for network
access, and in fact NFS + RPCSEC_GSS does *not* do this, as it always
computes the credential set on the server side at (GSSAPI) context
establishment time. It is up to you to decide what semantics you want
to follow, but they should at least be predictable if at all possible.
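
For illustration, a minimal sketch of a process deliberately replacing
its own credential set (standard libc calls, nothing
GlusterFS-specific):

    #include <sys/types.h>
    #include <grp.h>
    #include <unistd.h>

    /* Only explicit calls like these change the credential set of a
     * running process; later changes in the identity management
     * system (or the passwd/group files) do not. */
    static int become_user(uid_t uid, gid_t gid)
    {
        if (setgroups(0, NULL) == -1)   /* drop all auxiliary groups */
            return -1;
        if (setgid(gid) == -1)
            return -1;
        if (setuid(uid) == -1)
            return -1;
        return 0;
    }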


[2] You will have to define what this means for GlusterFS; I can see
only a few constraints needed to make it useful:
- the session needs to be initiated by a specific client
- you need a way to either pass the information that a new session is
  being established or pass the credential set to the bricks
- you need to cache this session on the brick side and you cannot
  discard it at will (yes, this means state needs to be kept)*
- if a client connects randomly to multiple bricks, it means this cache
  needs to be distributed and accessible to any brick anywhere that
  needs the information
- if state cannot be kept then you have no other option but to always
  re-transmit the whole credential token, as big as it may be: the
  maximum size on a linux system would be 256K at the moment
  (1 32-bit uid + 65k 32-bit gids; rough math below).
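
  Back-of-the-envelope for that worst case (assuming Linux's current
  NGROUPS_MAX of 65536):

      1 uid of 4 bytes            =       4 bytes
      65536 gids * 4 bytes each   =  262144 bytes
      --------------------------------------------
      total                       ~  256 KiB per credential token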

* the reason you do not want to let each brick resolve the groups on
its own is that you may end up with different bricks having different
lists of groups a uid is a member of. This would lead to nasty, very
hard-to-debug access issues that admins would hate you for :)

-- 
Simo Sorce * Red Hat, Inc * New York

