[Gluster-devel] Spurious regression: Checking on 3 test cases

Tue May 26 15:14:40 UTC 2015

On Tue, May 26, 2015 at 10:17:05AM -0400, Jiffin Thottan wrote:
> 
> 
> ----- Original Message -----
> From: "Vijay Bellur" <vbellur at redhat.com>
> To: "Krishnan Parthasarathi" <kparthas at redhat.com>, "Shyam" <srangana at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Friday, 22 May, 2015 9:49:15 AM
> Subject: Re: [Gluster-devel] Spurious regression: Checking on 3 test cases
> 
> On 05/22/2015 07:13 AM, Krishnan Parthasarathi wrote:
> >
> >> Are the following tests in any spurious regression failure lists? or,
> >> noticed by others?
...
> >> 2) ./tests/basic/mount-nfs-auth.t
> >>     Run:
> >> http://build.gluster.org/job/rackspace-regression-2GB-triggered/9406/consoleFull
> >>
> 
> Right now there is no easy fix for the issue. It may require to
> reconstruct entire netgroup/export structures used for this feature.

Indeed, my suspicion is that the current structures for
netgroups/exports and the auth_caching is not completely thread safe.
The structures use a dict for gathering entries, and hash some of the
contents of the entry as a key. This makes it quick to check if the
entry has been cached.

There also is a refresh thread that can read the exports and netgroups
file from disk, and creates entries in the dict for the respective
functionality.

The problem (likely) occurs when this happens:

    Step | NFS-client                    |  refresh thread
   ------+-------------------------------+---------------------------
         |                               |
     1   | mount request                 |
         | fetch entry from caching dict |
         |                               |
     2   |                               | timeout expired
         |                               | read file from disk
         |                               | create and fill a new dict
         |                               | replace dict with new one
         |                               |
     3   |                               | free old dict and entries
         |                               |
     4   | try to use the cache entry    |
         | SEGFAULT                      |
         |                               |

Step 1 and 2 can happen at the same time, but 3 has to come after 1 and
before 4. Step 4 is a very minimal usage of the cache entry, this makes
hitting this problem very rare.

Because the netgroups/exports cache entries are kept in a dict, there is
a lot of type-casting going on. It is not trivial to understand how this
works. My preference would be to modify the structures so that we can do
without the dicts, but that is not straight forward either.

I would like to have a refcount for the entries that are fetched from
the dict. But that means that after type-casting the contents from the
dict, there still is a window where the cache entry can get free'd (or
--refcount'd).

The next best thing to do, is adding a lock on the cache structure
itself. Everywhere the cache is accessed, taking a read-lock would be
needed. Adding an entry to the cache would require a write-lock. Looks
like a decent amount of work, and needs careful checking to not miss any
occurrences. This is currently what I think is the most suitable
approach.

Testing can probably be done by adding a delay in the mount path. Either
a gdb script or systemtap that delays the execution of the thread that
handles the mount.

Other ideas are very much welcome :)
Niels