[Gluster-devel] Possible bug in the communications layer ?

Xavier Hernandez xhernandez at datalab.es
Thu May 5 08:21:20 UTC 2016


I've undone all the changes and now I'm unable to reproduce the problem, so 
the modification I made is probably incorrect and not the root cause, as 
you described.

I'll continue investigating...

Xavi

On 04/05/16 15:01, Xavier Hernandez wrote:
> On 04/05/16 14:47, Raghavendra Gowdappa wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>> Sent: Wednesday, May 4, 2016 5:37:56 PM
>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>>
>>> I think I've found the problem.
>>>
>>>   1567         case SP_STATE_READING_PROC_HEADER:
>>>   1568                 __socket_proto_read (priv, ret);
>>>   1569
>>>   1570                 /* there can be 'xdata' in read response, figure it out */
>>>   1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>>>   1572                                XDR_DECODE);
>>>   1573
>>>   1574                 /* This will fail if there is xdata sent from server, if not,
>>>   1575                    well and good, we don't need to worry about  */
>>>   1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>>>   1577
>>>   1578                 free (read_rsp.xdata.xdata_val);
>>>   1579
>>>   1580                 /* need to round off to proper roof (%4), as XDR packing pads
>>>   1581                    the end of opaque object with '0' */
>>>   1582                 size = roof (read_rsp.xdata.xdata_len, 4);
>>>   1583
>>>   1584                 if (!size) {
>>>   1585                         frag->call_body.reply.accepted_success_state
>>>   1586                                 = SP_STATE_READ_PROC_OPAQUE;
>>>   1587                         goto read_proc_opaque;
>>>   1588                 }
>>>   1589
>>>   1590                 __socket_proto_init_pending (priv, size);
>>>   1591
>>>   1592                 frag->call_body.reply.accepted_success_state
>>>   1593                         = SP_STATE_READING_PROC_OPAQUE;
>>>   1594
>>>   1595         case SP_STATE_READING_PROC_OPAQUE:
>>>   1596                 __socket_proto_read (priv, ret);
>>>   1597
>>>   1598                 frag->call_body.reply.accepted_success_state
>>>   1599                         = SP_STATE_READ_PROC_OPAQUE;
>>>
>>> On line 1568 we read, at most, 116 bytes because we calculate the size
>>> of a read response without xdata. Then we detect that we really need
>>> more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
>>> allocated even if xdr_gfs3_read_rsp() fails?).
>>
>> No, it need not be. It's guaranteed to be allocated only on a
>> successful completion. However, _if_ decoding fails only because the
>> xdr stream doesn't include the xdata bits (the stream was created with
>> default_read_size, which doesn't cover them), then the xdr library
>> would've filled read_rsp.xdata.xdata_len (read_rsp.xdata.xdata_val can
>> still be NULL).
>
> The question is: is it guaranteed that after an unsuccessful completion
> xdata_val will be NULL (i.e. not touched by the function, even if
> xdata_len is != 0)? Otherwise the free() could corrupt memory.
>
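> A minimal illustration of the defensive pattern I have in mind
> (hypothetical snippet, not the actual socket.c code): zero-initialize
> the response before decoding, so that the free() stays safe even if
> xdr_gfs3_read_rsp() fails half way through.
>
>     gfs3_read_rsp read_rsp = {0, };    /* xdata_val starts out NULL */
>
>     xdrmem_create (&xdr, proghdr_buf, default_read_size, XDR_DECODE);
>     if (!xdr_gfs3_read_rsp (&xdr, &read_rsp)) {
>             /* partial decode: xdata_len may already be set while
>              * xdata_val is still NULL */
>     }
>     free (read_rsp.xdata.xdata_val);   /* no-op if still NULL */
>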
>>
>>>
>>> So we get into line 1596 with the pending info initialized to read the
>>> remaining data. This is the __socket_proto_read macro:
>>>
>>>    166 /* This will be used in a switch case and breaks from the switch case if all
>>>    167  * the pending data is not read.
>>>    168  */
>>>    169 #define __socket_proto_read(priv, ret)                                  \
>>>    170                 {                                                       \
>>>    171                 size_t bytes_read = 0;                                  \
>>>    172                 struct gf_sock_incoming *in;                            \
>>>    173                 in = &priv->incoming;                                   \
>>>    174                                                                         \
>>>    175                 __socket_proto_update_pending (priv);                   \
>>>    176                                                                         \
>>>    177                 ret = __socket_readv (this,                             \
>>>    178                                       in->pending_vector, 1,            \
>>>    179                                       &in->pending_vector,              \
>>>    180                                       &in->pending_count,               \
>>>    181                                       &bytes_read);                     \
>>>    182                 if (ret == -1)                                          \
>>>    183                         break;                                          \
>>>    184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
>>>    185         }
>>>
>>> We read from the socket using __socket_readv(). If it fails, we quit.
>>> However, if the socket doesn't have more data to read, this function
>>> does not return -1:
>>>
>>>    555                         ret = __socket_cached_read (this, opvector, opcount);
>>>    556
>>>    557                         if (ret == 0) {
>>>    558                                 gf_log (this->name, GF_LOG_DEBUG, "EOF on socket");
>>>    559                                 errno = ENODATA;
>>>    560                                 ret = -1;
>>>    561                         }
>>>    562                         if (ret == -1 && errno == EAGAIN) {
>>>    563                                 /* done for now */
>>>    564                                 break;
>>>    565                         }
>>>    566                         this->total_bytes_read += ret;
>>>
>>> If __socket_cached_read() fails with errno == EAGAIN, we break and
>>> return opcount, which is >= 0, causing the process to continue instead
>>> of waiting for more data.
>>
>> No. If you observe, there is a call to another macro,
>> __socket_proto_update_priv_after_read, at line 184. As can be seen in
>> the definition below, if (ret > 0) - which is the case for a short
>> read - we do break. I hope I am not missing anything here :).
>>
>> #define __socket_proto_update_priv_after_read(priv, ret, bytes_read)    \
>>         {                                                               \
>>                 struct gf_sock_incoming_frag *frag;                     \
>>                 frag = &priv->incoming.frag;                            \
>>                                                                         \
>>                 frag->fragcurrent += bytes_read;                        \
>>                 frag->bytes_read += bytes_read;                         \
>>                                                                         \
>>                 if ((ret > 0) || (frag->remaining_size != 0)) {         \
>>                         if (frag->remaining_size != 0 && ret == 0) {    \
>>                                 __socket_proto_reset_pending (priv);    \
>>                         }                                               \
>>                                                                         \
>>                         gf_log (this->name, GF_LOG_TRACE,               \
>>                                 "partial read on non-blocking socket"); \
>>                                                                         \
>>                         break;                                          \
>>                 }                                                       \
>>         }
>>
>
> I didn't see this. However, the fact is that with the change I've
> described, it seems to work.
>
> I'll analyze it a bit more.
>
> Xavi
>
>>>
>>> As a side note, there's another problem here: if errno is not EAGAIN,
>>> we'll still update this->total_bytes_read, effectively subtracting one.
>>> This shouldn't be done when ret < 0.
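>>>
>>> For that particular spot, a guard as simple as this would do (just a
>>> sketch, not a tested patch):
>>>
>>>     if (ret > 0)
>>>             this->total_bytes_read += ret;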
>>>
>>> There are other places where ret is set to -1 but opcount is returned.
>>> I guess that we should also set opcount = -1 in these places, but I
>>> don't have deep knowledge of this implementation.
>>>
>>> I've done a quick test checking for (ret != 0) instead of (ret == -1) in
>>> __socket_proto_read() and it seemed to work.
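>>>
>>> In the macro quoted above, the change I tested is roughly this (an
>>> experiment to confirm the theory, not a proposed fix):
>>>
>>>     ret = __socket_readv (this,
>>>                           in->pending_vector, 1,
>>>                           &in->pending_vector,
>>>                           &in->pending_count,
>>>                           &bytes_read);
>>>     if (ret != 0)           /* was: if (ret == -1) */
>>>             break;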
>>>
>>> Could anyone with more knowledge about the communications layer verify
>>> this and explain what would be the best solution ?
>>>
>>> Xavi
>>>
>>> On 29/04/16 14:52, Xavier Hernandez wrote:
>>>> With your patch applied, it seems that the bug is not hit.
>>>>
>>>> I guess it's a timing issue that the new logging hides. Maybe there is
>>>> no more data available after reading the partial readv header? (it will
>>>> arrive later).
>>>>
>>>> I'll continue testing...
>>>>
>>>> Xavi
>>>>
>>>> On 29/04/16 13:48, Raghavendra Gowdappa wrote:
>>>>> Attaching the patch.
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>> Sent: Friday, April 29, 2016 5:14:02 PM
>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>> layer ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>> Sent: Friday, April 29, 2016 1:21:57 PM
>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>> layer ?
>>>>>>>
>>>>>>> Hi Raghavendra,
>>>>>>>
>>>>>>> yes, the readv response contains xdata. The dict length is 38 (0x26)
>>>>>>> and, at the moment of failure, rsp.xdata.xdata_len already contains
>>>>>>> 0x26.
>>>>>>
>>>>>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
>>>>>> that the approach used in socket.c to get the length of xdata is
>>>>>> correct. However, I cannot find any way for xdata to end up in the
>>>>>> payload vector other than xdata_len being zero. Just to be doubly
>>>>>> sure, I have a patch that adds a debug message printing xdata_len
>>>>>> when decoding fails in socket.c. Can you please apply the patch, run
>>>>>> the tests and reply with the results?
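>>>>>>
>>>>>> Roughly along these lines (a sketch of the kind of debug message I
>>>>>> mean, not the exact patch):
>>>>>>
>>>>>>     if (!xdr_gfs3_read_rsp (&xdr, &read_rsp))
>>>>>>             gf_log (this->name, GF_LOG_DEBUG,
>>>>>>                     "failed to decode readv response header, "
>>>>>>                     "xdata_len = %u", read_rsp.xdata.xdata_len);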
>>>>>>
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>>> Sent: Friday, April 29, 2016 12:36:43 PM
>>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>>> layer ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>>>> Cc: "Jeff Darcy" <jdarcy at redhat.com>, "Gluster Devel"
>>>>>>>>>> <gluster-devel at gluster.org>
>>>>>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM
>>>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>>>> layer ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>>>>> To: "Jeff Darcy" <jdarcy at redhat.com>
>>>>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM
>>>>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>>>>> layer
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>
>>>>>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
>>>>>>>>>>> The volume is a distributed-disperse 4*(4+2). I'm able to reproduce
>>>>>>>>>>> the problem easily doing the following test:
>>>>>>>>>>>
>>>>>>>>>>>     iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
>>>>>>>>>>>     echo 3 >/proc/sys/vm/drop_caches
>>>>>>>>>>>     iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat
>>>>>>>>>>>
>>>>>>>>>>> The error happens soon after starting the read test. As can be seen
>>>>>>>>>>> in the data below, client3_3_readv_cbk() is processing an iovec of
>>>>>>>>>>> 116 bytes, however it should be of 154 bytes (the buffer in memory
>>>>>>>>>>> really seems to contain 154 bytes). The data on the network seems ok
>>>>>>>>>>> (at least I haven't been able to identify any problem), so this must
>>>>>>>>>>> be a processing error on the client side. The last field in the cut
>>>>>>>>>>> buffer of the serialized data corresponds to the length of the xdata
>>>>>>>>>>> field: 0x26. So at least 38 more bytes should be present.
>>>>>>>>>>>
>>>>>>>>>>> Nice detective work, Xavi.  It would be *very* interesting to see
>>>>>>>>>>> what the value of the "count" parameter is (it's unfortunately
>>>>>>>>>>> optimized out). I'll bet it's two, and iov[1].iov_len is 38.  I have
>>>>>>>>>>> a weak memory of some problems with how this iov is put together, a
>>>>>>>>>>> couple of years ago, and it looks like you might have tripped over
>>>>>>>>>>> one more.
>>>>>>>>>>>
>>>>>>>>>>> It seems you are right. The count is 2 and the first 38 bytes of the
>>>>>>>>>>> second vector contain the remaining data of the xdata field.
>>>>>>>>>>
>>>>>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all the
>>>>>>>>>> actors/cbks) expects the response in at most two vectors:
>>>>>>>>>> 1. Program header containing the request or response. This is
>>>>>>>>>>    subjected to encoding/decoding. This vector should point to a
>>>>>>>>>>    buffer that contains the entire program header/response
>>>>>>>>>>    contiguously.
>>>>>>>>>> 2. If the procedure carries a payload (like a readv response or a
>>>>>>>>>>    write request), the second vector points to the entire
>>>>>>>>>>    (contiguous) payload. Note that this payload is raw and is not
>>>>>>>>>>    subjected to encoding/decoding.
>>>>>>>>>>
>>>>>>>>>> In your case, this _clean_ separation is broken, with part of the
>>>>>>>>>> program header slipping into the 2nd vector that is supposed to
>>>>>>>>>> contain the read data (maybe because of rpc fragmentation). I think
>>>>>>>>>> this is a bug in the socket layer. I'll update more on this.
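>>>>>>>>>>
>>>>>>>>>> Schematically (only an illustration of the layouts described above,
>>>>>>>>>> not actual code):
>>>>>>>>>>
>>>>>>>>>>     struct iovec rsp[2];
>>>>>>>>>>
>>>>>>>>>>     /* expected:
>>>>>>>>>>      *   rsp[0] - complete readv response header (XDR), xdata included
>>>>>>>>>>      *   rsp[1] - raw read payload, passed up undecoded
>>>>>>>>>>      *
>>>>>>>>>>      * observed here:
>>>>>>>>>>      *   rsp[0] - only the first 116 bytes of the header (no xdata)
>>>>>>>>>>      *   rsp[1] - 38 bytes of xdata followed by the read payload
>>>>>>>>>>      */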
>>>>>>>>>
>>>>>>>>> Does your read response include xdata too? I think the code
>>>>>>>>> related to
>>>>>>>>> reading xdata in readv response is a bit murky.
>>>>>>>>>
>>>>>>>>> <socket.c/__socket_read_accepted_successful_reply>
>>>>>>>>>
>>>>>>>>>         case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
>>>>>>>>>                 default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
>>>>>>>>>                                                 &read_rsp);
>>>>>>>>>
>>>>>>>>>                 proghdr_buf = frag->fragcurrent;
>>>>>>>>>
>>>>>>>>>                 __socket_proto_init_pending (priv, default_read_size);
>>>>>>>>>
>>>>>>>>>                 frag->call_body.reply.accepted_success_state
>>>>>>>>>                         = SP_STATE_READING_PROC_HEADER;
>>>>>>>>>
>>>>>>>>>                 /* fall through */
>>>>>>>>>
>>>>>>>>>         case SP_STATE_READING_PROC_HEADER:
>>>>>>>>>                 __socket_proto_read (priv, ret);
>>>>>>>>>
>>>>>>>>>>>>>> By this time we've read read response _minus_ the xdata
>>>>>>>>
>>>>>>>> I meant we have read "readv response header"
>>>>>>>>
>>>>>>>>>
>>>>>>>>>                 /* there can be 'xdata' in read response, figure it out */
>>>>>>>>>                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>>>>>>>>>                                XDR_DECODE);
>>>>>>>>>
>>>>>>>>>>>>>>> We created xdr stream above with "default_read_size" (this
>>>>>>>>>>>>>>> doesn't include xdata)
>>>>>>>>>
>>>>>>>>>                 /* This will fail if there is xdata sent from server, if not,
>>>>>>>>>                    well and good, we don't need to worry about  */
>>>>>>>>>
>>>>>>>>>>>>>>> what if xdata is present and decoding failed (as length of xdr
>>>>>>>>>>>>>>> stream above - default_read_size - doesn't include xdata)? would
>>>>>>>>>>>>>>> we have a valid value in read_rsp.xdata.xdata_len? This is the
>>>>>>>>>>>>>>> part I am confused about. If read_rsp.xdata.xdata_len is not
>>>>>>>>>>>>>>> correct then there is a possibility that xdata might not be
>>>>>>>>>>>>>>> entirely present in the vector socket passes to higher layers as
>>>>>>>>>>>>>>> progheader (with part or entire xdata spilling over to payload
>>>>>>>>>>>>>>> vector).
>>>>>>>>>
>>>>>>>>>                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>>>>>>>>>
>>>>>>>>>                 free (read_rsp.xdata.xdata_val);
>>>>>>>>>
>>>>>>>>>                 /* need to round off to proper roof (%4), as XDR packing pads
>>>>>>>>>                    the end of opaque object with '0' */
>>>>>>>>>                 size = roof (read_rsp.xdata.xdata_len, 4);
>>>>>>>>>
>>>>>>>>>                 if (!size) {
>>>>>>>>>                         frag->call_body.reply.accepted_success_state
>>>>>>>>>                                 = SP_STATE_READ_PROC_OPAQUE;
>>>>>>>>>                         goto read_proc_opaque;
>>>>>>>>>                 }
>>>>>>>>>
>>>>>>>>> </socket.c>
>>>>>>>>>
>>>>>>>>> Can you please confirm whether there was xdata in the readv response
>>>>>>>>> (maybe by looking in the bricks) whose decoding failed?
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Raghavendra
>>>>>>>>>
>>>>>>>
>>>>>>
>>>

