[Gluster-devel] memory cache for initgroups

Niels de Vos ndevos at redhat.com
Fri Nov 7 17:48:15 UTC 2014


On Fri, Nov 07, 2014 at 08:42:50AM -0500, Simo Sorce wrote:
> On Fri, 7 Nov 2014 09:59:32 +0100
> Niels de Vos <ndevos at redhat.com> wrote:
> 
> > On Thu, Nov 06, 2014 at 05:32:53PM -0500, Simo Sorce wrote:
> > > On Thu, 6 Nov 2014 22:02:29 +0100
> > > Niels de Vos <ndevos at redhat.com> wrote:
> > > 
> > > > On Thu, Nov 06, 2014 at 11:45:18PM +0530, Vijay Bellur wrote:
> > > > > On 11/03/2014 08:12 PM, Jakub Hrozek wrote:
> > > > > >On Mon, Nov 03, 2014 at 03:41:43PM +0100, Jakub Hrozek wrote:
> > > > > >>On Mon, Nov 03, 2014 at 08:53:06AM -0500, Simo Sorce wrote:
> > > > > >>>On Mon, 3 Nov 2014 13:57:08 +0100
> > > > > >>>Jakub Hrozek <jhrozek at redhat.com> wrote:
> > > > > >>>
> > > > > >>>>Hi,
> > > > > >>>>
> > > > > >>>>we had a short discussion on $SUBJECT with Simo on IRC
> > > > > >>>>already, but there are multiple people involved from
> > > > > >>>>multiple timezones, so I think a mailing list thread would
> > > > > >>>>be easier to track.
> > > > > >>>>
> > > > > >>>>Can we add another memory cache file to SSSD that would
> > > > > >>>>track initgroups/getgrouplist results for the NSS
> > > > > >>>>responder? I realize initgroups is a somewhat different
> > > > > >>>>operation than getpw{uid,nam} and getgr{gid,nam}, but what
> > > > > >>>>if the new memcache was only used by the NSS responder and
> > > > > >>>>at the same time invalidated when initgroups is initiated
> > > > > >>>>by the PAM responder, to ensure the memcache stays up-to-date?
> > > > > >>>
> > > > > >>>Can you describe the use case before jumping into a proposed
> > > > > >>>solution?
> > > > > >>
> > > > > >>Many getgrouplist() or initgroups() calls in quick
> > > > > >>succession. One user is GlusterFS -- I'm not quite sure what
> > > > > >>the reason is there, maybe Vijay can elaborate.
> > > > > >
> > > > > 
> > > > > The GlusterFS server invokes getgrouplist() to identify the gids
> > > > > associated with a user on whose behalf an RPC request has been
> > > > > sent over the wire. There is a gid caching layer in GlusterFS,
> > > > > and getgrouplist() only gets called on a gid cache miss. In the
> > > > > worst case, getgrouplist() can be invoked for every RPC request
> > > > > that GlusterFS receives, and that seems to be the case in a
> > > > > deployment where we found sssd to be busy. I am not certain
> > > > > about the sequence of operations that causes the cache misses.
> > > > > 
> > > > > Adding Niels who is more familiar with the gid resolution &
> > > > > caching features in GlusterFS.
> > > > 
> > > > Just to add some background information on getgrouplist().
> > > > GlusterFS uses several processes that can call getgrouplist():
> > > > - NFS-server, a single process per system
> > > > - brick, a process per exported filesystem/directory, potentially
> > > >   several per system
> > > > 
> > > >   [Here, a Gluster environment has many systems (vm/physical).
> > > > Each system normally runs the NFS-server, and a number of brick
> > > > processes. The layout of the volume is important, but it is very
> > > > common to have one or more distributed volumes that use multiple
> > > > bricks on the same system (and many other systems).]
> > > > 
> > > > The need for resolving the groups of a user comes in when users
> > > > belong to many groups. The RPC protocols cannot carry a huge
> > > > list of groups, so the resolving can be done on the server side
> > > > when the protocol hits its limits (> 16 for NFS, approx. > 93 for
> > > > GlusterFS).
> > > > 
> > > > When using a Gluster volume, certain operations are sent to all
> > > > the bricks (e.g. some directory-related operations). I can
> > > > imagine that a network share which is used by many users triggers
> > > > many getgrouplist() calls in different brick processes at
> > > > (almost) the same time.
> > > > 
> > > > For reference, the usage of getgrouplist() in the brick process
> > > > can be found here:
> > > > -
> > > > https://github.com/gluster/glusterfs/blob/master/xlators/protocol/server/src/server-helpers.c#L24
> > > > 
> > > > The gid_resolve() function gets called when the brick process
> > > > should resolve the groups (and ignore the list of groups from the
> > > > protocol). It uses the gidcache functions from a private library:
> > > > -
> > > > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.h
> > > > -
> > > > https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/gidcache.c
> > > > 
> > > > The default expiry time for the gidcache is 2 seconds. Users
> > > > should be able to configure this to 30 seconds (or any other
> > > > value) with:
> > > > 
> > > >     # gluster volume set <VOLUME> server.gid-timeout 30
> > > > 
> > > > 
> > > > I think this should explain the use-case sufficiently, but let me
> > > > know if there are any remaining questions. It might well be
> > > > possible to make this code more sssd friendly. I'm sure that we
> > > > as Gluster developers are open to any suggestions.
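
For anyone not familiar with the glibc call involved, here is a minimal
sketch of how getgrouplist() is typically used. This is not the actual
brick code (which additionally stores the result in its gidcache), just
an illustration of the lookup and the resize-on-failure pattern:

    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Resolve and print all gids for a user. getgrouplist() returns -1
     * and updates ngroups when the supplied buffer is too small. */
    static int resolve_groups (const char *user)
    {
            struct passwd *pwd = getpwnam (user);
            int ngroups = 16;
            gid_t *groups;

            if (!pwd)
                    return -1;

            groups = malloc (ngroups * sizeof (gid_t));
            if (getgrouplist (user, pwd->pw_gid, groups, &ngroups) == -1) {
                    /* too small: ngroups now holds the required count */
                    groups = realloc (groups, ngroups * sizeof (gid_t));
                    getgrouplist (user, pwd->pw_gid, groups, &ngroups);
            }

            for (int i = 0; i < ngroups; i++)
                    printf ("%u\n", (unsigned) groups[i]);

            free (groups);
            return 0;
    }
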
> > > 
> > > 
> > > TBH this looks a little bit strange; other filesystems (as well as
> > > the kernel) create a credentials token when a user first
> > > authenticates and keep these credentials attached to the user
> > > session for the duration. Why does GlusterFS keep hammering the
> > > system, requesting the same information again and again?
> > 
> > The GlusterFS protocol itself is very much stateless, similar to
> > NFSv3. We need all the groups of the user on the server side (brick)
> > to allow the backing filesystem (mostly XFS) to perform the
> > permission checking. In the current GlusterFS protocol, there is no
> > user authentication. (Well, there has been work done on adding
> > support for SSL; maybe that could be used for tracking sessions on a
> > per-client, not per-user, basis.)
> > 
> > Just for clarity, a GlusterFS client (like a fuse-mount, or the
> > samba/vfs_glusterfs module) is used by many different users. The
> > client builds the connection to the volume. After that, all users
> > with access to the fuse-mount or samba-share are using the same
> > client connection.
> > 
> > By default the client sends a list of groups in each RPC request,
> > and the server side trusts the list the client provides. However,
> > for environments where these lists are too small to hold all the
> > groups, there is an option to do the group resolving on the server
> > side. This is the "server.manage-gids" volume option, which acts
> > very much like the "rpc.mountd --manage-gids" functionality for NFS.
> 
> Instead of sending a list of groups every time ... wouldn't it be
> better to send a "session token" (a random 128-bit UUID) and let the
> bricks use this value to associate their cached lists?
> 
> This way you can control how caching is done from the client side.

Yes, I was hoping RPCSEC_GSS could help with that. But that is a major
change, and it will take a while before it is stable and used in
deployments.

Looking at it, there is an AUTH_SHORT option that we can probably use.
We do not use AUTH_SYS, but a variation called AUTH_GLUSTERFS. In the
end, they function pretty much the same. More on AUTH_SHORT:
- http://tools.ietf.org/html/rfc5531#page-25

One of the difficulties would be making all the bricks aware of the
token. There is no inter-brick communication...
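
For reference, a rough C rendering of the RPC authentication structures
from RFC 5531 (names follow the RFC; this is a sketch, not the
GlusterFS code):

    /* With AUTH_SHORT, the server returns a small opaque handle in the
     * reply verifier; the client can then send that handle instead of
     * the full (potentially large) credential body in later calls. */

    enum auth_flavor {
            AUTH_NONE  = 0,
            AUTH_SYS   = 1,
            AUTH_SHORT = 2,
            AUTH_DH    = 3,
            RPCSEC_GSS = 6
    };

    struct opaque_auth {
            enum auth_flavor flavor;    /* AUTH_SYS, AUTH_SHORT, ...     */
            unsigned int     length;    /* body length, 400 bytes max    */
            unsigned char    body[400]; /* opaque, interpreted by flavor */
    };

An AUTH_GLUSTERFS credential could, in principle, be exchanged for such
a short-hand handle in the same way.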

> > > Keep in mind that the use of getgrouplist() is an inherently costly
> > > operation. Even adding caches, the system cannot cache for long
> > > because it needs to return updated results eventually. Only the
> > > application knows when a user session terminates and/or the list
> > > needs to be refreshed, so "caching" for this type of operation
> > > should be done mostly on the application side.
> > 
> > I assume that your "application side" here is the brick process that
> > runs on the same system as sssd. As mentioned above, the brick
> > processes do cache the result of getgrouplist(). It may well be
> > possible that the default expiry of 2 seconds is too short for many
> > environments. But users can change that timeout easily with the
> > "server.gid-timeout" volume option.
> 
> Well, the problem is that, unless you know you have some sort of user
> session, longer caches only have the effect of upsetting users whose
> credentials have just changed.
> 
> The way it *should* work, at least if you want POSIX compatibility[1],
> is that once a user on a client starts a session, his credentials
> never change until the user logs out and logs back in, regardless of
> what happens in the identity management system (or the passwd/group
> files).

Well, we would like to improve the Gluster behaviour; making it "POSIX
compliant" at the same time works for me. I imagine that this would be
possible when we use RPCSEC_GSS, just like NFS does.

> > From my understanding of this thread, we (the Gluster Community) have
> > two things to do:
> > 
> > 1. Clearly document side-effects that can be caused by enabling the
> >    "server.manage-gids" option, and suggest increasing the
> >    "server.gid-timeout" value (maybe change the default?).
> > 
> > 2. Think about improving the GlusterFS protocol(s) and introduce some
> >    kind of credentials token that is linked with the groups of a user.
> >    Token expiry should invalidate the group-cache. One option would be
> >    to use Kerberos like NFS (RPCSEC_GSS).
> 
> Using RPCSEC_GSS is one good way to tie a user to its credentials, as
> said credentials are tied to the GSS context and never change until
> the context is destroyed. In general, using a token created on
> "session establishment"[2] and used for as long as it is valid would
> resolve a host of issues and make your filesystem more POSIX
> compliant and predictable when it comes to access control decisions.

The biggest advantage for the Gluster use-case seems to be that the
token would be valid on all the systems hosting a brick for a
particular volume. At least, I hope that is the case. Because of the
nature of a scale-out, scale-up filesystem, systems and bricks can get
added whenever a sysadmin deems it necessary. I do not immediately see
a solution for the issue in your [*] footnote; preventing it would
require Gluster to pass credentials (and tokens?) around to all the
bricks when they come online. That is not impossible, but it requires
quite a bit more work.

> > Does this all make sense to others too? I'm adding gluster-devel@ to
> > CC so that others can chime in and this topic won't be forgotten.
> 
> It does.
> 
> Simo.
> 
> 
> 
> [1] IIRC POSIX requires that the credentials set in the kernel at login
> time are used throughout the lifetime of the process, unchanged. This
> is particularly important as a process may *intentionally* drop
> auxiliary groups or even change its credentials set entirely (like root
> switching to a different uid and an arbitrary set of gids).
> You may decide this is not something you want to care about for network
> access, and in fact NFS + RPCSEC_GSS does *not* do this, as it always
> computes the credentials set on the server side at (GSSAPI) context
> establishment time. It is up to you to decide what semantics you want
> to follow, but they should at least be predictable if at all possible.

If the NFS + RPCSEC_GSS semantics are well understood, they should work
for Gluster too. The main requirement would be that a userspace process
can get the token for a user and pass it on through a library call that
then does the GlusterFS RPC handling. Samba with vfs_glusterfs would be
one of these users, glusterfs-fuse another.
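
To illustrate what I mean, the library interface could look something
like this (purely hypothetical, no such call exists in libgfapi today):

    /* HYPOTHETICAL sketch only, not an existing libgfapi function.
     * An application acting on behalf of many users (Samba,
     * glusterfs-fuse) would obtain the credential token for a user and
     * hand it to the client library before doing I/O for that user. */
    int glfs_set_session_token (glfs_t *fs, uid_t uid,
                                const void *token, size_t token_len);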

> [2] You will have to define what this means for GlusterFS; I can see
> only a few constraints needed to make it useful:
> - the session needs to be initiated by a specific client
> - you need a way to either pass the information that a new session is
>   being established or pass the credential set to the bricks
> - you need to cache this session on the brick side and you cannot
>   discard it at will (yes, this means state needs to be kept)*
> - if a client connects randomly to multiple bricks, it means this cache
>   needs to be distributed and accessible to any brick anywhere that
>   needs the information
> - if state cannot be kept then you have no other option but to always
>   re-transmit the whole credential token, as big as it may be (the
>   maximum size on a Linux system would be 256K at the moment: 1 32-bit
>   uid + 65k 32-bit gids).

Maybe we can ask for the whole credential token when a client connects
to the brick for the first time, and after that use the session token.
This would solve the issue I mentioned above about adding systems and
bricks.
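
Roughly, I imagine something like the following (hypothetical
structures, nothing like this exists in the GlusterFS protocol today):

    #include <stdint.h>

    /* Sent once, when a client connects to a brick for the first time
     * on behalf of a user. */
    struct gf_cred_full {
            uint32_t uid;
            uint32_t ngids;
            uint32_t gids[];        /* full group list, possibly large */
    };

    /* The session token (e.g. a random 128-bit UUID as suggested
     * above) that gets used in subsequent requests instead of the
     * full group list, until it expires or is invalidated. */
    struct gf_session_token {
            unsigned char uuid[16];
    };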

> * the reason you do not want to let each brick resolve the groups is
> that you may end up with different bricks having a different list of
> groups a uid is a member of. This would lead to nasty, very
> hard-to-debug access issues that admins would hate you for :)

Yes, that is a very good point.

Thanks again,
Niels

