[Gluster-devel] gluster doesn't like Oracle's FSINFO RPC call

Fri Apr 12 19:58:04 UTC 2013

KERBOOM

[michael at fleming1 ~]$ sudo mount -a -t nfs
[sudo] password for michael:
mount: fearless1:/gv0 failed, reason given by server: No such file or directory
mount: fearless1:/gv0/fleming1/db0/ALTUS_config failed, reason given by server: unknown nfs status return value: 22
mount: fearless1:/gv0/fleming1/db0/ALTUS_data failed, reason given by server: unknown nfs status return value: 22
mount: fearless1:/gv0/fleming1/db0/ALTUS_flash failed, reason given by server: unknown nfs status return value: 22
mount.nfs: mount point /db/flash_recovery_area/ALTUS/onlinelog does not exist

nfs.log:
[2013-04-12 15:55:16.507084] E [nfs3.c:305:__nfs3_get_volume_id]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo+0x22c)
[0x7f45bfbb852c]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo_reply+0x29)
[0x7f45bfbb2ce9]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x51)
[0x7f45bfbb2481]))) 0-nfs-nfsv3: invalid argument: xl
[2013-04-12 15:55:16.538560] E [nfs3.c:4706:nfs3_fsinfo] 0-nfs-nfsv3: Bad Handle
[2013-04-12 15:55:16.538580] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 242c1550, FSINFO: NFS: 10001(Illegal NFS file handle), POSIX: 14(Bad address)
[2013-04-12 15:55:16.538617] E [nfs3.c:305:__nfs3_get_volume_id]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo+0x22c)
[0x7f45bfbb852c]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo_reply+0x29)
[0x7f45bfbb2ce9]
(-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x51)
[0x7f45bfbb2481]))) 0-nfs-nfsv3: invalid argument: xl

(I tried both with and without changing your 'uint32_t size' to
'int32_t size' to correct the signedness of the argument.)

Get hold of me on IRC and let's get this figured out. I've got a
debugger attached.

M.

On 13-04-12 11:32 AM, Niels de Vos wrote:
> On Fri, Apr 12, 2013 at 05:23:08PM +0200, Niels de Vos wrote:
>> On Thu, Apr 11, 2013 at 12:37:30PM -0400, Michael Brown wrote:
>>> That actually broke everything (including Linux clients mounting over NFS).
>>>
>>> I've modified it slightly to be:
>>>
>>> bool_t
>>> xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>>> {
>>>         if (!xdr_bytes (xdrs, (char **)&objp->data.data_val, (u_int *) &objp->data.data_len, NFS3_FHSIZE))
>>>                 if (!xdr_opaque (xdrs, &objp, (u_int *) &objp->data.data_len))
>>>                         return FALSE;
>>>         return TRUE;
>>> }
>>>
>>> (i.e. only call the xdr_opaque function if the xdr_bytes decode fails)
>> Nah, that won't work. The xdr_* functions advance the cursor in the
>> XDR stream, so each read continues where the previous one finished.
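A note on the cursor point above: the minimal sketch below models a decode stream as a byte buffer plus a read position, and follows the order of operations in the Sun RPC xdr_bytes as implemented in glibc: the 4-byte length word is consumed first and only then compared with the maximum. So when xdr_bytes rejects a handle, the stream has already moved past the length word, and a fallback xdr_opaque call would start reading from the wrong place. The names toy_xdr, toy_u_int and toy_bytes are hypothetical; this is a simplified model, not the real XDR implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decode stream: a buffer plus a read cursor.
 * Simplified model for illustration, not the Sun RPC XDR struct. */
typedef struct {
    const unsigned char *buf;
    size_t len;
    size_t pos; /* bytes consumed so far */
} toy_xdr;

/* Read a big-endian 32-bit word and advance the cursor by 4 bytes. */
static int toy_u_int(toy_xdr *x, uint32_t *out)
{
    if (x->len - x->pos < 4)
        return 0;
    const unsigned char *p = x->buf + x->pos;
    *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    x->pos += 4;
    return 1;
}

/* Counted opaque data: length word first, then the bytes padded to 4.
 * The length is consumed BEFORE it is checked against maxsize, so a
 * rejected length leaves the cursor 4 bytes further along. */
static int toy_bytes(toy_xdr *x, unsigned char *dst, uint32_t *size,
                     uint32_t maxsize)
{
    if (!toy_u_int(x, size))
        return 0;
    if (*size > maxsize)
        return 0; /* no rewind: the length word is already gone */
    size_t padded = ((size_t)*size + 3) & ~(size_t)3;
    if (x->len - x->pos < padded)
        return 0;
    memcpy(dst, x->buf + x->pos, *size);
    x->pos += padded;
    return 1;
}

/* Usage: a 100-byte length is rejected against a 64-byte limit,
 * yet the cursor has still moved from 0 to 4. */
static void demo_failed_bytes_moves_cursor(void)
{
    const unsigned char wire[8] = {0, 0, 0, 100, 0xde, 0xad, 0xbe, 0xef};
    unsigned char fh[64];
    uint32_t n = 0;
    toy_xdr x = {wire, sizeof wire, 0};

    assert(!toy_bytes(&x, fh, &n, 64));
    assert(n == 100);
    assert(x.pos == 4); /* a fallback read would start at 0xde, not 0x00 */
}
```

A fallback decoder would have to remember the position before the first attempt and restore it, rather than simply calling a second xdr_* routine.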
>>
>> What you probably need to do is something like this:
>>
>> bool_t
>> xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>> {
>> 	uint32_t size;
>>
>> 	if (!xdr_int (xdrs, &size))
>> 		if (!xdr_opaque (xdrs, (u_int *)&objp->data.data_len, size))
> ^ that should be objp->data.data_val of course :-/
>
>> 			return FALSE;
>> 	return TRUE;
>> }
>>
>> That will read the size of the fhandle first, to determine how long the opaque 
>> fhandle is, and use that size to read it.
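One more thing worth checking in the snippet above: the second if is nested inside the first, so xdr_opaque only runs when xdr_int fails. When the size word decodes successfully, the function returns TRUE without reading the handle bytes at all. Switching uint32_t to int32_t only changes how that same 4-byte word is typed, so on its own it wouldn't be expected to change the result. The steps need to run in sequence, each one able to abort the decode. Below is a minimal sketch of that sequence on a plain byte buffer, assuming the RFC 1813 layout of nfs_fh3 (a 4-byte length, then up to NFS3_FHSIZE = 64 bytes of opaque data padded to a multiple of 4). decode_fh3 and FH3_MAX are hypothetical names, not glusterfs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH3_MAX 64 /* hypothetical; matches NFS3_FHSIZE in RFC 1813 */

/* Hypothetical decoder for an nfs_fh3 on the wire:
 *   4-byte big-endian length, then `length` bytes padded to a multiple of 4.
 * Each step is checked in sequence; any failure aborts the whole decode.
 * On success, *consumed holds the number of wire bytes used. */
static int decode_fh3(const unsigned char *wire, size_t wire_len,
                      unsigned char fh[FH3_MAX], uint32_t *fh_len,
                      size_t *consumed)
{
    size_t pos = 0;

    /* Step 1: read the length word. */
    if (wire_len < 4)
        return 0;
    uint32_t size = ((uint32_t)wire[0] << 24) | ((uint32_t)wire[1] << 16) |
                    ((uint32_t)wire[2] << 8) | (uint32_t)wire[3];
    pos += 4;

    /* Step 2: bound it before using it as a copy length. */
    if (size > FH3_MAX)
        return 0;

    /* Step 3: copy the body, then skip the XDR padding. */
    size_t padded = ((size_t)size + 3) & ~(size_t)3;
    if (wire_len - pos < padded)
        return 0;
    memcpy(fh, wire + pos, size);
    pos += padded;

    *fh_len = size;
    *consumed = pos;
    return 1;
}

/* Usage: a 5-byte handle (padded to 8) decodes and consumes 12 bytes;
 * a length above FH3_MAX is rejected before any copy. */
static void demo_decode_fh3(void)
{
    const unsigned char wire[12] = {0, 0, 0, 5, 1, 2, 3, 4, 5, 0, 0, 0};
    const unsigned char too_big[4] = {0, 0, 0, 65};
    unsigned char fh[FH3_MAX];
    uint32_t n = 0;
    size_t used = 0;

    assert(decode_fh3(wire, sizeof wire, fh, &n, &used));
    assert(n == 5 && used == 12);
    assert(!decode_fh3(too_big, sizeof too_big, fh, &n, &used));
}
```

On the wire this is the same counted layout that the original xdr_bytes (..., NFS3_FHSIZE) call decodes, so a corrected version of the snippet would mostly differ from the original in how (or whether) it bounds the length.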
>>
>> Cheers,
>> Niels
>>
>>> But I get no change in behaviour.
>>>
>>> I also get these warnings:
>>>
>>> xdr-nfs3.c: In function 'xdr_nfs_fh3':
>>> xdr-nfs3.c:197: warning: passing argument 2 of 'xdr_opaque' from incompatible pointer type
>>> /usr/include/rpc/xdr.h:313: note: expected 'caddr_t' but argument is of type 'struct nfs_fh3 **'
>>> xdr-nfs3.c:197: warning: passing argument 3 of 'xdr_opaque' makes integer from pointer without a cast
>>> /usr/include/rpc/xdr.h:313: note: expected 'u_int' but argument is of type 'u_int *'
>>>
>>> M.
>>>
>>> On 13-04-11 07:42 AM, Niels de Vos wrote:
>>>> My guess is that this (untested) change would fix it, can you try that?
>>>>
>>>> --- a/rpc/xdr/src/xdr-nfs3.c
>>>> +++ b/rpc/xdr/src/xdr-nfs3.c
>>>> @@ -184,7 +184,7 @@ xdr_specdata3 (XDR *xdrs, specdata3 *objp)
>>>>  bool_t
>>>>  xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>>>>  {
>>>> -	 if (!xdr_bytes (xdrs, (char **)&objp->data.data_val, (u_int *) &objp->data.data_len, NFS3_FHSIZE))
>>>> +	 if (!xdr_opaque (xdrs, &objp, (u_int *) &objp->data.data_len))
>>>>  		 return FALSE;
>>>>  	return TRUE;
>>>>  }
>>>>
>>>>
>>>> HTH,
>>>> Niels
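For comparison with the diff above: xdr_opaque (declared in rpc/xdr.h as bool_t xdr_opaque(XDR *xdrs, caddr_t cp, u_int cnt), which is what the compiler warnings quoted above complain about) decodes a fixed number of bytes and reads no length word, while nfs_fh3 is sent in the counted form that xdr_bytes expects. The sketch below uses hypothetical helpers (pad4, read_fixed_opaque) in plain C rather than the RPC library, and feeds a counted 8-byte handle to a fixed-length reader: the length word comes back as the first four "handle" bytes and the rest of the stream is shifted by four. That may be part of why the patch broke Linux mounts as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes occupied on the wire by n bytes of XDR opaque data (padded to 4). */
static size_t pad4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Hypothetical fixed-length reader, modelling xdr_opaque(xdrs, buf, cnt):
 * copy exactly cnt bytes from the current position with no length word,
 * and report the wire bytes consumed (cnt plus padding). */
static int read_fixed_opaque(const unsigned char *wire, size_t wire_len,
                             unsigned char *dst, size_t cnt, size_t *consumed)
{
    if (wire_len < pad4(cnt))
        return 0;
    memcpy(dst, wire, cnt);
    *consumed = pad4(cnt);
    return 1;
}

/* Usage: a counted handle (length 8, then 8 bytes) read as fixed opaque. */
static void demo_counted_vs_fixed(void)
{
    const unsigned char counted[12] = {0, 0, 0, 8, 0xa0, 0xa1, 0xa2, 0xa3,
                                       0xa4, 0xa5, 0xa6, 0xa7};
    unsigned char fh[8];
    size_t used = 0;

    assert(read_fixed_opaque(counted, sizeof counted, fh, 8, &used));
    assert(used == 8);
    /* The length word has been taken as handle data... */
    assert(fh[0] == 0 && fh[3] == 8);
    /* ...only half of the real handle made it in, and 0xa4..0xa7 are
     * left in the stream to be misread as the next field. */
    assert(fh[4] == 0xa0 && fh[7] == 0xa3);
}
```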
>>>>
>>>>> All I get out of gluster is:
>>>>> [2013-04-08 12:54:32.206312] E [nfs3.c:4741:nfs3svc_fsinfo] 0-nfs-nfsv3: Error decoding arguments
>>>>>
>>>>>
>>>>> I've attached abridged packet captures and text explanations of the
>>>>> packets (thanks to Wireshark).
>>>>>
>>>>> Can someone please look at this and determine whether gluster's parsing
>>>>> of the RPC call is to blame, or Oracle?
>>>>>
>>>>> This is the same setup on which I reported the NFS race condition bug.
>>>>> It does have that patch applied.
>>>>> Details:
>>>>> http://lists.gnu.org/archive/html/gluster-devel/2013-04/msg00014.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Michael
>>>>>
>>>>> -- 
>>>>> Michael Brown               | `One of the main causes of the fall of
>>>>> Systems Consultant          | the Roman Empire was that, lacking zero,
>>>>> Net Direct Inc.             | they had no way to indicate successful
>>>>> ☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
>>>>>
>>>>
>>>>
>>>>
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at nongnu.org
>>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>
>> -- 
>> Niels de Vos
>> Sr. Software Maintenance Engineer
>> Support Engineering Group
>> Red Hat Global Support Services
