[Gluster-devel] gluster doesn't like Oracle's FSINFO RPC call

Mon Apr 15 18:23:15 UTC 2013

Given that the xdr_decode function has already done its work by the
time the check for the extra bytes is made
(https://gist.github.com/Supermathie/5389349#file-gistfile1-c-L19) I was
curious to see what would happen if I just ignored the failing call:

bool_t
xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
{
        if (!xdr_uint32 (xdrs, &objp->data.data_len)) {
                gf_log ("glusterfs (nfs)", GF_LOG_ERROR,
                        "xdr_uint32 failed, data_len: %u",
                        objp->data.data_len);
                return FALSE;
        }
        if (!xdr_opaque (xdrs, objp->data.data_val, objp->data.data_len)) {
                gf_log ("glusterfs (nfs)", GF_LOG_ERROR,
                        "xdr_opaque failed, data_len: %u, (ignoring)",
                        objp->data.data_len);
                /* Experiment: carry on despite the failed decode. */
                //return FALSE;
        }
        return TRUE;
}

Things Go Badly:

[2013-04-15 14:13:42.757528] E [xdr-nfs3.c:201:xdr_nfs_fh3] 0-glusterfs
(nfs): xdr_opaque failed, data_len: 34, (ignoring)
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-04-15 14:13:42
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.1
/lib64/libc.so.6[0x3c48c32920]
/lib64/libc.so.6(xdr_string+0xa7)[0x3c48d17ac7]
/usr/lib64/libgfxdr.so.0(xdr_filename3+0xe)[0x7fab414302ce]
/usr/lib64/libgfxdr.so.0(xdr_diropargs3+0x2d)[0x7fab4143089d]
/usr/lib64/libgfxdr.so.0(xdr_lookup3args+0x9)[0x7fab41430959]
/usr/lib64/libgfxdr.so.0(xdr_to_generic+0x73)[0x7fab4142a7c3]
/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3svc_lookup+0xa5)[0x7fab3c9d1e25]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x293)[0x7fab41643443]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93)[0x7fab416435b3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fab41644018]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_poll_in+0x34)[0x7fab3e130924]
/usr/lib64/glusterfs/3.3.1/rpc-transport/socket.so(socket_event_handler+0xc7)[0x7fab3e130a07]
/usr/lib64/libglusterfs.so.0(+0x3ed14)[0x7fab4188ed14]
/usr/sbin/glusterfs(main+0x58a)[0x40741a]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c48c1ecdd]
/usr/sbin/glusterfs[0x4043c9]
---------

Looks like the fsinfo call is OK, but there's code elsewhere that falls
over when given this file handle.

If nothing else, these codepaths ought to be checked so that a rogue
client can't crash the NFS server by passing a poorly-encoded XDR frame.
(I know it wasn't my changes as I reverted the change above, restarted
the NFS server and it crashed again).
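
To illustrate the kind of check meant here, a minimal standalone sketch
follows. It is not gluster's code and does not use the libc xdr_*
routines; struct cursor, get_u32 and get_fh are hypothetical names. It
rejects a handle whose declared length exceeds NFS3_FHSIZE (64 in
RFC 1813) or whose data and padding run past the end of the record, and
it leaves the read position untouched on failure so the caller can drop
the request cleanly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH_MAX 64u /* same value as NFS3_FHSIZE in RFC 1813 */

/* Hypothetical read cursor over one received RPC record. */
struct cursor {
        const uint8_t *buf;
        size_t         len; /* total bytes in the record */
        size_t         pos; /* offset of the next unread byte */
};

/* Read a big-endian 32-bit word; the cursor moves only on success. */
static bool
get_u32 (struct cursor *c, uint32_t *out)
{
        const uint8_t *p;

        assert (c->pos <= c->len);
        if (c->len - c->pos < 4)
                return false;
        p = c->buf + c->pos;
        *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
        c->pos += 4;
        return true;
}

/* Read a variable-length opaque: length word, data, pad to 4 bytes. */
static bool
get_fh (struct cursor *c, uint8_t fh[FH_MAX], uint32_t *fh_len)
{
        struct cursor saved = *c;
        uint32_t      n;
        size_t        padded;

        if (!get_u32 (c, &n))
                goto fail;
        if (n > FH_MAX) /* never trust the length sent by the client */
                goto fail;
        padded = ((size_t)n + 3u) & ~(size_t)3u;
        if (c->len - c->pos < padded) /* data or pad bytes missing */
                goto fail;
        memcpy (fh, c->buf + c->pos, n);
        c->pos += padded;
        *fh_len = n;
        return true;
fail:
        *c = saved; /* rewind so nothing downstream reads from a bad offset */
        return false;
}
```

A record that declares a 34-byte handle but carries only 34 bytes after
the length word is refused by this sketch, because the two pad bytes are
missing. Whether to refuse or tolerate that case is a policy choice; the
point is only that every length is checked before it is used.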

At this point, I'm going to explore modifying the nfs3 file handle
glusterd is passing out to be 36 bytes long. Yeah, ugly and we shouldn't
have to do it, but it'll (hopefully) make it work.
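
For reference, the arithmetic behind 36: XDR (RFC 4506) pads
variable-length opaque data to a multiple of four bytes. A rough sketch,
with pad4 and fh_wire_size as hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define NFS3_FHSIZE 64u /* maximum handle size, per RFC 1813 */

/* Round an opaque length up to the next multiple of 4 (XDR alignment). */
static uint32_t
pad4 (uint32_t len)
{
        return (len + 3u) & ~3u;
}

/* Bytes an nfs_fh3 occupies on the wire: length word, data, padding. */
static uint32_t
fh_wire_size (uint32_t fh_len)
{
        assert (fh_len <= NFS3_FHSIZE);
        return 4u + pad4 (fh_len);
}
```

With these, pad4(34) is 36 and pad4(36) is 36, so a 34-byte handle
carries two pad bytes on the wire while a 36-byte handle carries none.
That is why padding the handle itself might sidestep a client that gets
the pad bytes wrong.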

M.

On 13-04-12 06:50 PM, Niels de Vos wrote:
> On Fri, Apr 12, 2013 at 03:58:04PM -0400, Michael Brown wrote:
>> KERBOOM
>>
>> [michael@fleming1 ~]$ sudo mount -a -t nfs
>> [sudo] password for michael:
>> mount: fearless1:/gv0 failed, reason given by server: No such file or
>> directory
>> mount: fearless1:/gv0/fleming1/db0/ALTUS_config failed, reason given by
>> server: unknown nfs status return value: 22
>> mount: fearless1:/gv0/fleming1/db0/ALTUS_data failed, reason given by
>> server: unknown nfs status return value: 22
>> mount: fearless1:/gv0/fleming1/db0/ALTUS_flash failed, reason given by
>> server: unknown nfs status return value: 22
>> mount.nfs: mount point /db/flash_recovery_area/ALTUS/onlinelog does not
>> exist
>>
>> nfs.log:
>> [2013-04-12 15:55:16.507084] E [nfs3.c:305:__nfs3_get_volume_id]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo+0x22c)
>> [0x7f45bfbb852c]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo_reply+0x29)
>> [0x7f45bfbb2ce9]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x51)
>> [0x7f45bfbb2481]))) 0-nfs-nfsv3: invalid argument: xl
>> [2013-04-12 15:55:16.538560] E [nfs3.c:4706:nfs3_fsinfo] 0-nfs-nfsv3:
>> Bad Handle
>> [2013-04-12 15:55:16.538580] W [nfs3-helpers.c:3389:nfs3_log_common_res]
>> 0-nfs-nfsv3: XID: 242c1550, FSINFO: NFS: 10001(Illegal NFS file handle),
>> POSIX: 14(Bad address)
>> [2013-04-12 15:55:16.538617] E [nfs3.c:305:__nfs3_get_volume_id]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo+0x22c)
>> [0x7f45bfbb852c]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_fsinfo_reply+0x29)
>> [0x7f45bfbb2ce9]
>> (-->/usr/lib64/glusterfs/3.3.1/xlator/nfs/server.so(nfs3_request_xlator_deviceid+0x51)
>> [0x7f45bfbb2481]))) 0-nfs-nfsv3: invalid argument: xl
>>
>> (I tried both with and without modifying your uint32_t size to a
>> 'int32_t size' to correct the signedness of the argument)
>>
>> Get ahold of me in IRC and let's get this figured out. I've got a
>> debugger attached.
> 23:51 < ndevos> Supermathie: ah, I've thought of the error in my 
>    suggestion - that function is used to encode and decode
> 23:52 < ndevos> which means, that the size parameter must be set 
>    correctly - the .data_len attribute contains the size when encoding, 
>    and should be overwritten when decoding
> 23:53 < ndevos> KERBOOM happens when an idea is only half looked at :-/
>
> Maybe something like the attached patch works better? It should encode/decode 
> both the length and the fhandle value. Compile tested only.
>
> Niels
>
>> M.
>>
>> On 13-04-12 11:32 AM, Niels de Vos wrote:
>>> On Fri, Apr 12, 2013 at 05:23:08PM +0200, Niels de Vos wrote:
>>>> On Thu, Apr 11, 2013 at 12:37:30PM -0400, Michael Brown wrote:
>>>>> That actually broke everything (including Linux trying to mount NFS).
>>>>>
>>>>> I've modified it slightly to be:
>>>>>
>>>>> bool_t
>>>>> xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>>>>> {
>>>>>         if (!xdr_bytes (xdrs, (char **)&objp->data.data_val, (u_int *)
>>>>> &objp->data.data_len, NFS3_FHSIZE))
>>>>>                 if (!xdr_opaque (xdrs, &objp, (u_int *)
>>>>> &objp->data.data_len))
>>>>>                         return FALSE;
>>>>>         return TRUE;
>>>>> }
>>>>>
>>>>> (i.e. only call the xdr_opaque function if the xdr_bytes decode fails)
>>>> Nah, that won't work. The xdr_* functions are modifying the position of 
>>>> the cursor in the XDR-stream. Subsequent reads will continue where the 
>>>> previous one finished.
>>>>
>>>> What you probably need to do is something like this:
>>>>
>>>> xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>>>> {
>>>> 	uint32_t size;
>>>>
>>>> 	if (!xdr_int (xdrs, &size))
>>>> 		if (!xdr_opaque (xdrs, (u_int *)&objp->data.data_len, size))
>>> ^ that should be objp->data.data_val of course :-/
>>>
>>>> 			return FALSE;
>>>> 	return TRUE;
>>>> }
>>>>
>>>> That will read the size of the fhandle first, to determine how long the opaque 
>>>> fhandle is, and use that size to read it.
>>>>
>>>> Cheers,
>>>> Niels
>>>>
>>>>> But I get no change in behaviour.
>>>>>
>>>>> Also get these warnings:
>>>>>
>>>>> xdr-nfs3.c: In function 'xdr_nfs_fh3':
>>>>> xdr-nfs3.c:197: warning: passing argument 2 of 'xdr_opaque' from
>>>>> incompatible pointer type
>>>>> /usr/include/rpc/xdr.h:313: note: expected 'caddr_t' but argument is of
>>>>> type 'struct nfs_fh3 **'
>>>>> xdr-nfs3.c:197: warning: passing argument 3 of 'xdr_opaque' makes
>>>>> integer from pointer without a cast
>>>>> /usr/include/rpc/xdr.h:313: note: expected 'u_int' but argument is of
>>>>> type 'u_int *'
>>>>>
>>>>> M.
>>>>>
>>>>> On 13-04-11 07:42 AM, Niels de Vos wrote:
>>>>>> My guess is that this (untested) change would fix it, can you try that?
>>>>>>
>>>>>> --- a/rpc/xdr/src/xdr-nfs3.c
>>>>>> +++ b/rpc/xdr/src/xdr-nfs3.c
>>>>>> @@ -184,7 +184,7 @@ xdr_specdata3 (XDR *xdrs, specdata3 *objp)
>>>>>>  bool_t
>>>>>>  xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
>>>>>>  {
>>>>>> -	 if (!xdr_bytes (xdrs, (char **)&objp->data.data_val, (u_int *) &objp->data.data_len, NFS3_FHSIZE))
>>>>>> +	 if (!xdr_opaque (xdrs, &objp, (u_int *) &objp->data.data_len))
>>>>>>  		 return FALSE;
>>>>>>  	return TRUE;
>>>>>>  }
>>>>>>
>>>>>>
>>>>>> HTH,
>>>>>> Niels
>>>>>>
>>>>>>> All I get out of gluster is:
>>>>>>> [2013-04-08 12:54:32.206312] E [nfs3.c:4741:nfs3svc_fsinfo] 0-nfs-nfsv3:
>>>>>>> Error decoding arguments
>>>>>>>
>>>>>>>
>>>>>>> I've attached abridged packet captures and text explanations of the
>>>>>>> packets (thanks to wireshark).
>>>>>>>
>>>>>>> Can someone please look at this and determine if it's gluster's parsing
>>>>>>> of the RPC call to blame, or if it's Oracle?
>>>>>>>
>>>>>>> This is the same setup on which I reported the NFS race condition bug.
>>>>>>> It does have that patch applied.
>>>>>>> Details:
>>>>>>> http://lists.gnu.org/archive/html/gluster-devel/2013-04/msg00014.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> -- 
>>>>>>> Michael Brown               | `One of the main causes of the fall of
>>>>>>> Systems Consultant          | the Roman Empire was that, lacking zero,
>>>>>>> Net Direct Inc.             | they had no way to indicate successful
>>>>>>> ☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
>>>>>>>
>>>>>>
>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-devel mailing list
>>>>>>> Gluster-devel at nongnu.org
>>>>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>> -- 
>>>> Niels de Vos
>>>> Sr. Software Maintenance Engineer
>>>> Support Engineering Group
>>>> Red Hat Global Support Services
>>>>
>>

-- 
Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth