[Gluster-devel] NetBSD regression tests hanging after ./tests/basic/mgmt_v3-locks.t
Rajesh Joseph
rjoseph at redhat.com
Mon Jun 15 12:58:26 UTC 2015
On Monday 15 June 2015 05:21 PM, Kaushal M wrote:
> The hang we observe is not something specific to Gluster. I've
> observed this kind of hang when a filesystem that is in use goes
> offline.
> For example, I've accidentally shut down machines that were being used
> for mounting NFS, which led to the client systems hanging completely
> and required a hard reboot.
>
> If there are ways to avoid these kinds of hangs when they eventually
> occur, I'm all ears.
For these test cases, can't we use the NFS soft mount option to prevent
the hang?
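For reference, a soft mount makes NFS operations return an error once the
retry count is exhausted, instead of retrying forever. The flags below are a
sketch from memory and should be verified against mount_nfs(8) on the NetBSD
test VMs before wiring them into the test harness:

```sh
# NetBSD mount_nfs (to be verified against mount_nfs(8) on the VM):
#   -s  request a soft mount, so operations fail rather than hang
#   -x  number of retries before an operation is failed
mount_nfs -s -x 3 server:/patchy /mnt/patchy

# Linux equivalent, for comparison:
#   mount -t nfs -o soft,retrans=3,timeo=30 server:/patchy /mnt/patchy
```

Note that a soft mount can surface spurious EIO errors to the tests during
transient server restarts, so the retry count would need some tuning.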
>
> On Mon, Jun 15, 2015 at 4:38 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>> Emmanuel,
>> I am not sure of the feasibility, but just wanted to ask you: do you
>> think there is a possibility to error out operations on the mount when the
>> server crashes, instead of hanging? That would prevent a lot of manual
>> intervention in the future.
>>
>> Pranith.
>>
>> On 06/15/2015 01:35 PM, Niels de Vos wrote:
>>> Hi,
>>>
>>> sometimes the NetBSD regression tests hang with messages like this:
>>>
>>> [12:29:07] ./tests/basic/mgmt_v3-locks.t
>>> ........................................... ok 79867 ms
>>> No volumes present
>>> mount_nfs: can't access /patchy: Permission denied
>>> mount_nfs: can't access /patchy: Permission denied
>>> mount_nfs: can't access /patchy: Permission denied
>>>
>>> Most (if not all) of these hangs are caused by a crashing Gluster/NFS
>>> process. Once the Gluster/NFS server is not reachable anymore,
>>> unmounting fails.
>>>
>>> The only way to recover is to reboot the VM and retrigger the test. For
>>> rebooting, the http://build.gluster.org/job/reboot-vm job can be used,
>>> and retriggering works by clicking the "retrigger" link in the left menu
>>> once the test has been marked as failed/aborted.
>>>
>>> When logging in on the NetBSD system that hangs, you can verify with
>>> these steps:
>>>
>>> 1. check if there is a /glusterfsd.core file
>>> 2. run gdb on the core:
>>>
>>> # cd /build/install
>>> # gdb --core=/glusterfsd.core sbin/glusterfs
>>> ...
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0 0xb9b94f0b in auth_cache_lookup (cache=0xb9aa2310, fh=0xb9044bf8,
>>> host_addr=0xb900e400 "104.130.205.187", timestamp=0xbf7fd900,
>>> can_write=0xbf7fd8fc)
>>> at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/nfs/server/src/auth-cache.c:164
>>> 164 *can_write = lookup_res->item->opts->rw;
>>>
>>> 3. verify the lookup_res structure:
>>>
>>> (gdb) p *lookup_res
>>> $1 = {timestamp = 1434284981, item = 0xb901e3b0}
>>> (gdb) p *lookup_res->item
>>> $2 = {name = 0xffffff00 <error: Cannot access memory at address 0xffffff00>, opts = 0xffffffff}
>>>
>>>
>>> A fix for this has been sent; it is currently waiting for an update to
>>> the proposed reference counting:
>>>
>>> - http://review.gluster.org/11022
>>> core: add "gf_ref_t" for common refcounting structures
>>> - http://review.gluster.org/11023
>>> nfs: refcount each auth_cache_entry and related data_t
>>>
>>> Thanks,
>>> Niels
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>