[Gluster-devel] Possible bug in the communications layer ?

Wed May 4 12:07:56 UTC 2016

I think I've found the problem.

1567         case SP_STATE_READING_PROC_HEADER:
  1568                 __socket_proto_read (priv, ret);
  1569
  1570                 /* there can be 'xdata' in read response, figure 
it out */
  1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
  1572                                XDR_DECODE);
  1573
  1574                 /* This will fail if there is xdata sent from 
server, if not,
  1575                    well and good, we don't need to worry about  */
  1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
  1577
  1578                 free (read_rsp.xdata.xdata_val);
  1579
  1580                 /* need to round off to proper roof (%4), as XDR 
packing pads
  1581                    the end of opaque object with '0' */
  1582                 size = roof (read_rsp.xdata.xdata_len, 4);
  1583
  1584                 if (!size) {
  1585                         frag->call_body.reply.accepted_success_state
  1586                                 = SP_STATE_READ_PROC_OPAQUE;
  1587                         goto read_proc_opaque;
  1588                 }
  1589
  1590                 __socket_proto_init_pending (priv, size);
  1591
  1592                 frag->call_body.reply.accepted_success_state
  1593                         = SP_STATE_READING_PROC_OPAQUE;
  1594
  1595         case SP_STATE_READING_PROC_OPAQUE:
  1596                 __socket_proto_read (priv, ret);
  1597
  1598                 frag->call_body.reply.accepted_success_state
  1599                         = SP_STATE_READ_PROC_OPAQUE;

On line 1568 we read, at most, 116 bytes because we calculate the size 
of a read response without xdata. Then we detect that we really need 
more data for xdata (BTW, read_rsp.xdata.xdata_val will be always 
allocated even if xdr_gfs3_read_rsp() fails ?)

So we get into line 1596 with the pending info initialized to read the 
remaining data. This is the __socket_proto_read macro:

   166 /* This will be used in a switch case and breaks from the switch 
case if all
   167  * the pending data is not read.
   168  */
   169 #define __socket_proto_read(priv, ret) 
       \
   170                 { 
       \
   171                 size_t bytes_read = 0; 
       \
   172                 struct gf_sock_incoming *in; 
       \
   173                 in = &priv->incoming; 
       \
   174 
       \
   175                 __socket_proto_update_pending (priv); 
       \
   176 
       \
   177                 ret = __socket_readv (this, 
       \
   178                                       in->pending_vector, 1, 
       \
   179                                       &in->pending_vector, 
       \
   180                                       &in->pending_count, 
       \
   181                                       &bytes_read); 
       \
   182                 if (ret == -1) 
       \
   183                         break; 
       \
   184                 __socket_proto_update_priv_after_read (priv, ret, 
bytes_read); \
   185         }

We read from the socket using __socket_readv(). It it fails, we quit. 
However if the socket doesn't have more data to read, this function does 
not return -1:

   555                         ret = __socket_cached_read (this, 
opvector, opcount);
   556
   557                         if (ret == 0) {
   558 
gf_log(this->name,GF_LOG_DEBUG,"EOF on socket");
   559                                 errno = ENODATA;
   560                                 ret = -1;
   561                         }
   562                         if (ret == -1 && errno == EAGAIN) {
   563                                 /* done for now */
   564                                 break;
   565                         }
   566                         this->total_bytes_read += ret;

If __socket_cached_read() fails with errno == EAGAIN, we break and 
return opcount, which is >= 0. Causing the process to continue instead 
of waiting for more data.

As a side note, there's another problem here: if errno is not EAGAIN, 
we'll update this->total_bytes_read substracting one. This shouldn't be 
done when ret < 0.

There are other places where ret is set to -1, but opcount is returned. 
I guess that we should also set opcount = -1 on these places, but I 
don't have a deep knowledge about this implementation.

I've done a quick test checking for (ret != 0) instead of (ret == -1) in 
__socket_proto_read() and it seemed to work.

Could anyone with more knowledge about the communications layer verify 
this and explain what would be the best solution ?

Xavi

On 29/04/16 14:52, Xavier Hernandez wrote:
> With your patch applied, it seems that the bug is not hit.
>
> I guess it's a timing issue that the new logging hides. Maybe no more
> data available after reading the partial readv header ? (it will arrive
> later).
>
> I'll continue testing...
>
> Xavi
>
> On 29/04/16 13:48, Raghavendra Gowdappa wrote:
>> Attaching the patch.
>>
>> ----- Original Message -----
>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>> Sent: Friday, April 29, 2016 5:14:02 PM
>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>> Sent: Friday, April 29, 2016 1:21:57 PM
>>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>>>>
>>>> Hi Raghavendra,
>>>>
>>>> yes, the readv response contains xdata. The dict length is 38 (0x26)
>>>> and, at the moment of failure, rsp.xdata.xdata_len already contains
>>>> 0x26.
>>>
>>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
>>> that the
>>> approach used in socket.c to get the length of xdata is correct.
>>> However, I
>>> cannot find any other way of xdata going into payload vector other than
>>> xdata_len being zero. Just to be double sure, I've a patch containing
>>> debug
>>> message printing xdata_len when decoding fails in socket.c. Can you
>>> please
>>> apply the patch, run the tests and revert back with results?
>>>
>>>>
>>>> Xavi
>>>>
>>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>> Sent: Friday, April 29, 2016 12:36:43 PM
>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>> layer ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>> Cc: "Jeff Darcy" <jdarcy at redhat.com>, "Gluster Devel"
>>>>>>> <gluster-devel at gluster.org>
>>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM
>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>> layer ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>>>>>> To: "Jeff Darcy" <jdarcy at redhat.com>
>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM
>>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
>>>>>>>> layer
>>>>>>>> ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and
>>>>>>>> gfapi.
>>>>>>>> The
>>>>>>>> volume is a distributed-disperse 4*(4+2). I'm able to reproduce the
>>>>>>>> problem
>>>>>>>> easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w
>>>>>>>> -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2
>>>>>>>> -s10g
>>>>>>>> -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after
>>>>>>>> starting
>>>>>>>> the
>>>>>>>> read test. As can be seen in the data below,
>>>>>>>> client3_3_readv_cbk() is
>>>>>>>> processing an iovec of 116 bytes, however it should be of 154 bytes
>>>>>>>> (the
>>>>>>>> buffer in memory really seems to contain 154 bytes). The data on
>>>>>>>> the
>>>>>>>> network
>>>>>>>> seems ok (at least I haven't been able to identify any problem), so
>>>>>>>> this
>>>>>>>> must be a processing error on the client side. The last field in
>>>>>>>> cut
>>>>>>>> buffer
>>>>>>>> of the sequentialized data corresponds to the length of the xdata
>>>>>>>> field:
>>>>>>>> 0x26. So at least 38 more byte should be present.
>>>>>>>> Nice detective work, Xavi.  It would be *very* interesting to
>>>>>>>> see what
>>>>>>>> the value of the "count" parameter is (it's unfortunately optimized
>>>>>>>> out).
>>>>>>>> I'll bet it's two, and iov[1].iov_len is 38.  I have a weak
>>>>>>>> memory of
>>>>>>>> some problems with how this iov is put together, a couple of years
>>>>>>>> ago,
>>>>>>>> and it looks like you might have tripped over one more.
>>>>>>>> It seems you are right. The count is 2 and the first 38 bytes of
>>>>>>>> the
>>>>>>>> second
>>>>>>>> vector contains the remaining data of xdata field.
>>>>>>>
>>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all the
>>>>>>> actors/cbks) expects response in utmost two vectors:
>>>>>>> 1. Program header containing request or response. This is
>>>>>>> subjected to
>>>>>>> decoding/encoding. This vector should point to a buffer that
>>>>>>> contains
>>>>>>> the
>>>>>>> entire program header/response contiguously.
>>>>>>> 2. If the procedure returns payload (like readv response or a write
>>>>>>> request),
>>>>>>> second vector contains the buffer pointing to the entire
>>>>>>> (contiguous)
>>>>>>> payload. Note that this payload is raw and is not subjected to
>>>>>>> encoding/decoding.
>>>>>>>
>>>>>>> In your case, this _clean_ separation is broken with part of program
>>>>>>> header
>>>>>>> slipping into 2nd vector supposed to contain read data (may be
>>>>>>> because
>>>>>>> of
>>>>>>> rpc fragmentation). I think this is a bug in socket layer. I'll
>>>>>>> update
>>>>>>> more
>>>>>>> on this.
>>>>>>
>>>>>> Does your read response include xdata too? I think the code
>>>>>> related to
>>>>>> reading xdata in readv response is a bit murky.
>>>>>>
>>>>>> <socket.c/__socket_read_accepted_successful_reply>
>>>>>>
>>>>>>         case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
>>>>>>                 default_read_size = xdr_sizeof ((xdrproc_t)
>>>>>>                 xdr_gfs3_read_rsp,
>>>>>>                                                 &read_rsp);
>>>>>>
>>>>>>                 proghdr_buf = frag->fragcurrent;
>>>>>>
>>>>>>                 __socket_proto_init_pending (priv,
>>>>>> default_read_size);
>>>>>>
>>>>>>                 frag->call_body.reply.accepted_success_state
>>>>>>                         = SP_STATE_READING_PROC_HEADER;
>>>>>>
>>>>>>                 /* fall through */
>>>>>>
>>>>>>         case SP_STATE_READING_PROC_HEADER:
>>>>>>                 __socket_proto_read (priv, ret);
>>>>>>
>>>>>>>>>>> By this time we've read read response _minus_ the xdata
>>>>>
>>>>> I meant we have read "readv response header"
>>>>>
>>>>>>
>>>>>>                 /* there can be 'xdata' in read response, figure
>>>>>> it out
>>>>>>                 */
>>>>>>                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>>>>>>                                XDR_DECODE);
>>>>>>
>>>>>>>>>>>> We created xdr stream above with "default_read_size" (this
>>>>>>>>>>>> doesn't
>>>>>>>>>>>> include xdata)
>>>>>>
>>>>>>                 /* This will fail if there is xdata sent from
>>>>>> server, if
>>>>>>                 not,
>>>>>>                    well and good, we don't need to worry about  */
>>>>>>
>>>>>>>>>>>> what if xdata is present and decoding failed (as length of xdr
>>>>>>>>>>>> stream
>>>>>>>>>>>> above - default_read_size - doesn't include xdata)? would we
>>>>>>>>>>>> have
>>>>>>>>>>>> a
>>>>>>>>>>>> valid value in read_rsp.xdata.xdata_len? This is the part I am
>>>>>>>>>>>> confused about. If read_rsp.xdata.xdata_len is not correct then
>>>>>>>>>>>> there
>>>>>>>>>>>> is a possibility that xdata might not be entirely present in
>>>>>>>>>>>> the
>>>>>>>>>>>> vector socket passes to higher layers as progheader (with
>>>>>>>>>>>> part or
>>>>>>>>>>>> entire xdata spilling over to payload vector).
>>>>>>
>>>>>>                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>>>>>>
>>>>>>                 free (read_rsp.xdata.xdata_val);
>>>>>>
>>>>>>                 /* need to round off to proper roof (%4), as XDR
>>>>>> packing
>>>>>>                 pads
>>>>>>                    the end of opaque object with '0' */
>>>>>>                 size = roof (read_rsp.xdata.xdata_len, 4);
>>>>>>
>>>>>>                 if (!size) {
>>>>>>                         frag->call_body.reply.accepted_success_state
>>>>>>                                 = SP_STATE_READ_PROC_OPAQUE;
>>>>>>                         goto read_proc_opaque;
>>>>>>                 }
>>>>>>
>>>>>> </socket.c>
>>>>>>
>>>>>> Can you please confirm whether there was an xdata in the readv
>>>>>> response
>>>>>> (may
>>>>>> be by looking in bricks) whose decoding failed?
>>>>>>
>>>>>> regards,
>>>>>> Raghavendra
>>>>>> _______________________________________________
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel at gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>
>>>>
>>>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel