[Gluster-devel] Spurious regression: Checking on 3 test cases

Jiffin Tony Thottan jthottan at redhat.com
Wed May 27 04:57:35 UTC 2015



On 26/05/15 20:44, Niels de Vos wrote:
> On Tue, May 26, 2015 at 10:17:05AM -0400, Jiffin Thottan wrote:
>>
>> ----- Original Message -----
>> From: "Vijay Bellur" <vbellur at redhat.com>
>> To: "Krishnan Parthasarathi" <kparthas at redhat.com>, "Shyam" <srangana at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Friday, 22 May, 2015 9:49:15 AM
>> Subject: Re: [Gluster-devel] Spurious regression: Checking on 3 test cases
>>
>> On 05/22/2015 07:13 AM, Krishnan Parthasarathi wrote:
>>>> Are the following tests in any spurious regression failure lists? or,
>>>> noticed by others?
> ...
>>>> 2) ./tests/basic/mount-nfs-auth.t
>>>>      Run:
>>>> http://build.gluster.org/job/rackspace-regression-2GB-triggered/9406/consoleFull
>>>>
>> Right now there is no easy fix for the issue. It may require to
>> reconstruct entire netgroup/export structures used for this feature.
> Indeed, my suspicion is that the current structures for
> netgroups/exports and the auth_caching is not completely thread safe.
> The structures use a dict for gathering entries, and hash some of the
> contents of the entry as a key. This makes it quick to check if the
> entry has been cached.
>
> There also is a refresh thread that can read the exports and netgroups
> file from disk, and creates entries in the dict for the respective
> functionality.
>
> The problem (likely) occurs when this happens:
>
>      Step | NFS-client                    |  refresh thread
>     ------+-------------------------------+---------------------------
>           |                               |
>       1   | mount request                 |
>           | fetch entry from caching dict |
>           |                               |
>       2   |                               | timeout expired
>           |                               | read file from disk
>           |                               | create and fill a new dict
>           |                               | replace dict with new one
>           |                               |
>       3   |                               | free old dict and entries
>           |                               |
>       4   | try to use the cache entry    |
>           | SEGFAULT                      |
>           |                               |
>
> Step 1 and 2 can happen at the same time, but 3 has to come after 1 and
> before 4. Step 4 is a very minimal usage of the cache entry, this makes
> hitting this problem very rare.
>
> Because the netgroups/exports cache entries are kept in a dict, there is
> a lot of type-casting going on. It is not trivial to understand how this
> works. My preference would be to modify the structures so that we can do
> without the dicts, but that is not straight forward either.
>
> I would like to have a refcount for the entries that are fetched from
> the dict. But that means that after type-casting the contents from the
> dict, there still is a window where the cache entry can get free'd (or
> --refcount'd).
>
> The next best thing to do, is adding a lock on the cache structure
> itself. Everywhere the cache is accessed, taking a read-lock would be
> needed. Adding an entry to the cache would require a write-lock. Looks
> like a decent amount of work, and needs careful checking to not miss any
> occurrences. This is currently what I think is the most suitable
> approach.

+1 for the detailed explanation.
> Testing can probably be done by adding a delay in the mount path. Either
> a gdb script or systemtap that delays the execution of the thread that
> handles the mount.
>
> Other ideas are very much welcome :)
> Niels



More information about the Gluster-devel mailing list