[Gluster-devel] Possible bug in the communications layer ?
Xavier Hernandez
xhernandez at datalab.es
Wed May 4 13:01:41 UTC 2016
On 04/05/16 14:47, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Wednesday, May 4, 2016 5:37:56 PM
>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>
>> I think I've found the problem.
>>
>> 1567         case SP_STATE_READING_PROC_HEADER:
>> 1568                 __socket_proto_read (priv, ret);
>> 1569
>> 1570                 /* there can be 'xdata' in read response, figure it out */
>> 1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>> 1572                                XDR_DECODE);
>> 1573
>> 1574                 /* This will fail if there is xdata sent from server, if not,
>> 1575                    well and good, we don't need to worry about */
>> 1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>> 1577
>> 1578                 free (read_rsp.xdata.xdata_val);
>> 1579
>> 1580                 /* need to round off to proper roof (%4), as XDR packing pads
>> 1581                    the end of opaque object with '0' */
>> 1582                 size = roof (read_rsp.xdata.xdata_len, 4);
>> 1583
>> 1584                 if (!size) {
>> 1585                         frag->call_body.reply.accepted_success_state
>> 1586                                 = SP_STATE_READ_PROC_OPAQUE;
>> 1587                         goto read_proc_opaque;
>> 1588                 }
>> 1589
>> 1590                 __socket_proto_init_pending (priv, size);
>> 1591
>> 1592                 frag->call_body.reply.accepted_success_state
>> 1593                         = SP_STATE_READING_PROC_OPAQUE;
>> 1594
>> 1595         case SP_STATE_READING_PROC_OPAQUE:
>> 1596                 __socket_proto_read (priv, ret);
>> 1597
>> 1598                 frag->call_body.reply.accepted_success_state
>> 1599                         = SP_STATE_READ_PROC_OPAQUE;
>>
>> On line 1568 we read, at most, 116 bytes because we calculate the size
>> of a read response without xdata. Then we detect that we really need
>> more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
>> allocated, even if xdr_gfs3_read_rsp() fails?)
>
> No, it need not be. It's guaranteed to be allocated only on successful
> completion. However, _if_ decoding fails only because the xdr stream
> (created with default_read_size, which doesn't include xdata) runs out
> before the xdata bits, then the xdr library would still have filled in
> read_rsp.xdata.xdata_len (read_rsp.xdata.xdata_val can still be NULL).
The question is: is it guaranteed that after an unsuccessful completion
xdata_val will be NULL (i.e. not touched by the function, even if
xdata_len is != 0)? Otherwise the free() could corrupt memory.
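
If not, one way to make the free() unconditionally safe would be to
zero-initialize the response before decoding. A minimal sketch, reusing
the names from the excerpt above (proghdr_buf, default_read_size):

        gfs3_read_rsp read_rsp = {0, };
        XDR           xdr;

        /* With the zero-initialization above, xdata_val is either NULL
           or a pointer the xdr library really allocated, so the free()
           below cannot see garbage even when decoding fails half-way. */
        xdrmem_create (&xdr, proghdr_buf, default_read_size, XDR_DECODE);
        xdr_gfs3_read_rsp (&xdr, &read_rsp);

        /* free (NULL) is a no-op, so this is safe on every path. */
        free (read_rsp.xdata.xdata_val);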
>
>>
>> So we get into line 1596 with the pending info initialized to read the
>> remaining data. This is the __socket_proto_read macro:
>>
>> 166 /* This will be used in a switch case and breaks from the switch case
>> 167  * if all the pending data is not read.
>> 168  */
>> 169 #define __socket_proto_read(priv, ret)                                 \
>> 170 {                                                                      \
>> 171         size_t bytes_read = 0;                                         \
>> 172         struct gf_sock_incoming *in;                                   \
>> 173         in = &priv->incoming;                                          \
>> 174                                                                        \
>> 175         __socket_proto_update_pending (priv);                          \
>> 176                                                                        \
>> 177         ret = __socket_readv (this,                                    \
>> 178                               in->pending_vector, 1,                   \
>> 179                               &in->pending_vector,                     \
>> 180                               &in->pending_count,                      \
>> 181                               &bytes_read);                            \
>> 182         if (ret == -1)                                                 \
>> 183                 break;                                                 \
>> 184         __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
>> 185 }
>>
>> We read from the socket using __socket_readv(). If it fails, we quit.
>> However, if the socket doesn't have more data to read, this function
>> does not return -1:
>>
>> 555                 ret = __socket_cached_read (this, opvector, opcount);
>> 556
>> 557                 if (ret == 0) {
>> 558                         gf_log (this->name, GF_LOG_DEBUG, "EOF on socket");
>> 559                         errno = ENODATA;
>> 560                         ret = -1;
>> 561                 }
>> 562                 if (ret == -1 && errno == EAGAIN) {
>> 563                         /* done for now */
>> 564                         break;
>> 565                 }
>> 566                 this->total_bytes_read += ret;
>>
>> If __socket_cached_read() fails with errno == EAGAIN, we break and
>> return opcount, which is >= 0, causing the process to continue instead
>> of waiting for more data.
>
> No. If you observe, there is a call to another macro,
> __socket_proto_update_priv_after_read, at line 184. As can be seen in
> the definition below, if (ret > 0) - which is the case for a short read
> - we do break. I hope I am not missing anything here :).
>
> #define __socket_proto_update_priv_after_read(priv, ret, bytes_read) \
> { \
> struct gf_sock_incoming_frag *frag; \
> frag = &priv->incoming.frag; \
> \
> frag->fragcurrent += bytes_read; \
> frag->bytes_read += bytes_read; \
> \
> if ((ret > 0) || (frag->remaining_size != 0)) { \
> if (frag->remaining_size != 0 && ret == 0) { \
> __socket_proto_reset_pending (priv); \
> } \
> \
> gf_log (this->name, GF_LOG_TRACE, \
> "partial read on non-blocking socket"); \
> \
> break; \
> } \
> }
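>
> A short-read walk-through of this macro, as I read it (the byte counts
> are made up):
>
>     /* pending xdata = 40 bytes, but the socket delivers only 12:
>      *
>      *   __socket_readv()  -> bytes_read = 12, returns opcount == 1
>      *   __socket_proto_read(): ret != -1, so no break there; it calls
>      *   __socket_proto_update_priv_after_read():
>      *       fragcurrent += 12; frag->bytes_read += 12;
>      *       ret > 0  -> "partial read on non-blocking socket" -> break
>      */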
>
I didn't see this. However, the fact is that with the change I've
described, it seems to work.
I'll analyze it a bit more.
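
For reference, this is the exact change I tested in __socket_proto_read()
(only a quick run, so it may just be masking the real issue):

        ret = __socket_readv (this,
                              in->pending_vector, 1,
                              &in->pending_vector,
                              &in->pending_count,
                              &bytes_read);
        /* was: if (ret == -1) break;
           now any non-zero ret (error or partial read) breaks, and only
           ret == 0 (all pending bytes consumed) falls through. */
        if (ret != 0)
                break;
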
Xavi
>>
>> As a side note, there's another problem here: if errno is not EAGAIN,
>> we'll update this->total_bytes_read, subtracting one. This shouldn't be
>> done when ret < 0.
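>>
>> Something like this would avoid it (untested, just to illustrate):
>>
>>                 if (ret > 0)
>>                         this->total_bytes_read += ret;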
>>
>> There are other places where ret is set to -1 but opcount is returned.
>> I guess we should also set opcount = -1 in those places, but I don't
>> have a deep knowledge of this implementation.
>>
>> I've done a quick test checking for (ret != 0) instead of (ret == -1) in
>> __socket_proto_read() and it seemed to work.
>>
>> Could anyone with more knowledge about the communications layer verify
>> this and explain what would be the best solution ?
>>
>> Xavi
>>
>> On 29/04/16 14:52, Xavier Hernandez wrote:
>>> With your patch applied, it seems that the bug is not hit.
>>>
>>> I guess it's a timing issue that the new logging hides. Maybe no more
>>> data available after reading the partial readv header ? (it will arrive
>>> later).
>>>
>>> I'll continue testing...
>>>
>>> Xavi
>>>
>>> On 29/04/16 13:48, Raghavendra Gowdappa wrote:
>>>> Attaching the patch.
>>>>
>>>> ----- Original Message -----
>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>> Sent: Friday, April 29, 2016 5:14:02 PM
>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>> Sent: Friday, April 29, 2016 1:21:57 PM
>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>>>>>
>>>>>> Hi Raghavendra,
>>>>>>
>>>>>> yes, the readv response contains xdata. The dict length is 38 (0x26)
>>>>>> and, at the moment of failure, rsp.xdata.xdata_len already contains
>>>>>> 0x26.
>>>>>
>>>>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
>>>>> that the approach used in socket.c to get the length of xdata is
>>>>> correct. However, I cannot find any way for xdata to end up in the
>>>>> payload vector other than xdata_len being zero. Just to be doubly
>>>>> sure, I have a patch that prints a debug message with xdata_len when
>>>>> decoding fails in socket.c. Can you please apply the patch, run the
>>>>> tests and get back with the results?
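>>>>>
>>>>> Roughly along these lines (the actual patch is attached in a
>>>>> follow-up; this is just to illustrate the idea):
>>>>>
>>>>>                 if (!xdr_gfs3_read_rsp (&xdr, &read_rsp))
>>>>>                         gf_log (this->name, GF_LOG_DEBUG,
>>>>>                                 "xdr decoding of readv response failed, "
>>>>>                                 "xdata_len = %u",
>>>>>                                 read_rsp.xdata.xdata_len);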
>>>>>
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>> Sent: Friday, April 29, 2016 12:36:43 PM
>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>> layer ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>>> Cc: "Jeff Darcy" <jdarcy at redhat.com>, "Gluster Devel"
>>>>>>>>> <gluster-devel at gluster.org>
>>>>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM
>>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>>> layer ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>>>> To: "Jeff Darcy" <jdarcy at redhat.com>
>>>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM
>>>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>>>> layer
>>>>>>>>>> ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Jeff,
>>>>>>>>>>
>>>>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote:
>>>>>>>>>>
>>>>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and
>>>>>>>>>> gfapi. The volume is a distributed-disperse 4*(4+2). I'm able to
>>>>>>>>>> reproduce the problem easily doing the following test:
>>>>>>>>>>
>>>>>>>>>>     iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
>>>>>>>>>>     echo 3 >/proc/sys/vm/drop_caches
>>>>>>>>>>     iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat
>>>>>>>>>>
>>>>>>>>>> The error happens soon after starting the read test. As can be seen
>>>>>>>>>> in the data below, client3_3_readv_cbk() is processing an iovec of
>>>>>>>>>> 116 bytes, however it should be of 154 bytes (the buffer in memory
>>>>>>>>>> really seems to contain 154 bytes). The data on the network seems
>>>>>>>>>> ok (at least I haven't been able to identify any problem), so this
>>>>>>>>>> must be a processing error on the client side. The last field in
>>>>>>>>>> the cut buffer of the serialized data corresponds to the length of
>>>>>>>>>> the xdata field: 0x26. So at least 38 more bytes should be present.
>>>>>>>>>>
>>>>>>>>>> Nice detective work, Xavi. It would be *very* interesting to see
>>>>>>>>>> what the value of the "count" parameter is (it's unfortunately
>>>>>>>>>> optimized out). I'll bet it's two, and iov[1].iov_len is 38. I have
>>>>>>>>>> a weak memory of some problems with how this iov is put together, a
>>>>>>>>>> couple of years ago, and it looks like you might have tripped over
>>>>>>>>>> one more.
>>>>>>>>>>
>>>>>>>>>> It seems you are right. The count is 2 and the first 38 bytes of
>>>>>>>>>> the second vector contain the remaining data of the xdata field.
>>>>>>>>>
>>>>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all the
>>>>>>>>> actors/cbks) expects the response in at most two vectors:
>>>>>>>>>
>>>>>>>>> 1. The program header containing the request or response. This is
>>>>>>>>>    subject to encoding/decoding. This vector should point to a
>>>>>>>>>    buffer that contains the entire program header/response
>>>>>>>>>    contiguously.
>>>>>>>>> 2. If the procedure returns a payload (like a readv response or a
>>>>>>>>>    write request), the second vector contains the buffer pointing
>>>>>>>>>    to the entire (contiguous) payload. Note that this payload is
>>>>>>>>>    raw and is not subject to encoding/decoding.
>>>>>>>>>
>>>>>>>>> In your case, this _clean_ separation is broken, with part of the
>>>>>>>>> program header slipping into the 2nd vector that is supposed to
>>>>>>>>> contain read data (maybe because of rpc fragmentation). I think
>>>>>>>>> this is a bug in the socket layer. I'll update more on this.
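>>>>>>>>>
>>>>>>>>> To illustrate the contract, a hand-written sketch (not the actual
>>>>>>>>> client code, though the names mirror client3_3_readv_cbk and the
>>>>>>>>> usual xdr_to_generic() helper):
>>>>>>>>>
>>>>>>>>>         int
>>>>>>>>>         readv_cbk_sketch (struct rpc_req *req, struct iovec *iov,
>>>>>>>>>                           int count, void *myframe)
>>>>>>>>>         {
>>>>>>>>>                 gfs3_read_rsp rsp = {0, };
>>>>>>>>>
>>>>>>>>>                 /* iov[0] must hold the complete, contiguous
>>>>>>>>>                    response header; it is the only vector that is
>>>>>>>>>                    XDR-decoded. */
>>>>>>>>>                 if (xdr_to_generic (iov[0], &rsp,
>>>>>>>>>                                     (xdrproc_t) xdr_gfs3_read_rsp) < 0)
>>>>>>>>>                         return -1;
>>>>>>>>>
>>>>>>>>>                 /* iov[1], if present, must contain only the raw
>>>>>>>>>                    read payload. The bug discussed here is that the
>>>>>>>>>                    tail of the header (38 bytes of xdata) showed up
>>>>>>>>>                    at the start of iov[1] instead. */
>>>>>>>>>                 return 0;
>>>>>>>>>         }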
>>>>>>>>
>>>>>>>> Does your read response include xdata too? I think the code related
>>>>>>>> to reading xdata in the readv response is a bit murky.
>>>>>>>>
>>>>>>>> <socket.c/__socket_read_accepted_successful_reply>
>>>>>>>>
>>>>>>>>         case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
>>>>>>>>                 default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
>>>>>>>>                                                 &read_rsp);
>>>>>>>>
>>>>>>>>                 proghdr_buf = frag->fragcurrent;
>>>>>>>>
>>>>>>>>                 __socket_proto_init_pending (priv, default_read_size);
>>>>>>>>
>>>>>>>>                 frag->call_body.reply.accepted_success_state
>>>>>>>>                         = SP_STATE_READING_PROC_HEADER;
>>>>>>>>
>>>>>>>>                 /* fall through */
>>>>>>>>
>>>>>>>>         case SP_STATE_READING_PROC_HEADER:
>>>>>>>>                 __socket_proto_read (priv, ret);
>>>>>>>>
>>>>>>>>>>>>> By this time we've read the read response _minus_ the xdata
>>>>>>>
>>>>>>> I meant we have read the "readv response header"
>>>>>>>
>>>>>>>>                 /* there can be 'xdata' in read response, figure it out */
>>>>>>>>                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>>>>>>>>                                XDR_DECODE);
>>>>>>>>
>>>>>>>>>>>>>> We created the xdr stream above with "default_read_size" (this
>>>>>>>>>>>>>> doesn't include xdata)
>>>>>>>>
>>>>>>>>                 /* This will fail if there is xdata sent from server, if not,
>>>>>>>>                    well and good, we don't need to worry about */
>>>>>>>>
>>>>>>>>>>>>>> What if xdata is present and decoding failed (as the length of
>>>>>>>>>>>>>> the xdr stream above - default_read_size - doesn't include
>>>>>>>>>>>>>> xdata)? Would we have a valid value in read_rsp.xdata.xdata_len?
>>>>>>>>>>>>>> This is the part I am confused about. If read_rsp.xdata.xdata_len
>>>>>>>>>>>>>> is not correct then there is a possibility that xdata might not
>>>>>>>>>>>>>> be entirely present in the vector socket passes to higher layers
>>>>>>>>>>>>>> as progheader (with part or the entire xdata spilling over to
>>>>>>>>>>>>>> the payload vector).
>>>>>>>>
>>>>>>>>                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>>>>>>>>
>>>>>>>>                 free (read_rsp.xdata.xdata_val);
>>>>>>>>
>>>>>>>>                 /* need to round off to proper roof (%4), as XDR packing
>>>>>>>>                    pads the end of opaque object with '0' */
>>>>>>>>                 size = roof (read_rsp.xdata.xdata_len, 4);
>>>>>>>>
>>>>>>>>                 if (!size) {
>>>>>>>>                         frag->call_body.reply.accepted_success_state
>>>>>>>>                                 = SP_STATE_READ_PROC_OPAQUE;
>>>>>>>>                         goto read_proc_opaque;
>>>>>>>>                 }
>>>>>>>>
>>>>>>>> </socket.c>
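>>>>>>>>
>>>>>>>> (As an aside, for the xdata seen in this thread the arithmetic is
>>>>>>>> size = roof (38, 4) = 40, since XDR pads the 0x26-byte opaque blob
>>>>>>>> to the next 4-byte boundary.)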
>>>>>>>>
>>>>>>>> Can you please confirm whether there was xdata in the readv response
>>>>>>>> whose decoding failed (maybe by looking at the bricks)?
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Raghavendra
>>>>>>>>
>>>>>>
>>>>>
>>