[Gluster-devel] Possible bug in the communications layer ?

Raghavendra Gowdappa rgowdapp at redhat.com
Wed May 4 12:47:44 UTC 2016



----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Wednesday, May 4, 2016 5:37:56 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> 
> I think I've found the problem.
> 
>   1567         case SP_STATE_READING_PROC_HEADER:
>   1568                 __socket_proto_read (priv, ret);
>   1569
>   1570                 /* there can be 'xdata' in read response, figure it out */
>   1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
>   1572                                XDR_DECODE);
>   1573
>   1574                 /* This will fail if there is xdata sent from server, if not,
>   1575                    well and good, we don't need to worry about  */
>   1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
>   1577
>   1578                 free (read_rsp.xdata.xdata_val);
>   1579
>   1580                 /* need to round off to proper roof (%4), as XDR packing pads
>   1581                    the end of opaque object with '0' */
>   1582                 size = roof (read_rsp.xdata.xdata_len, 4);
>   1583
>   1584                 if (!size) {
>   1585                         frag->call_body.reply.accepted_success_state
>   1586                                 = SP_STATE_READ_PROC_OPAQUE;
>   1587                         goto read_proc_opaque;
>   1588                 }
>   1589
>   1590                 __socket_proto_init_pending (priv, size);
>   1591
>   1592                 frag->call_body.reply.accepted_success_state
>   1593                         = SP_STATE_READING_PROC_OPAQUE;
>   1594
>   1595         case SP_STATE_READING_PROC_OPAQUE:
>   1596                 __socket_proto_read (priv, ret);
>   1597
>   1598                 frag->call_body.reply.accepted_success_state
>   1599                         = SP_STATE_READ_PROC_OPAQUE;
> 
> On line 1568 we read, at most, 116 bytes because we calculate the size
> of a read response without xdata. Then we detect that we really need
> more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
> allocated even if xdr_gfs3_read_rsp() fails ?)

No, it need not be. It's guaranteed to be allocated only on a successful completion. However, _if_ decoding fails only because the xdr stream doesn't include the xdata bits (its length was initialized to default_read_size), the xdr library would've still filled in read_rsp.xdata.xdata_len (read_rsp.xdata.xdata_val can still be NULL).

> 
> So we get into line 1596 with the pending info initialized to read the
> remaining data. This is the __socket_proto_read macro:
> 
>    166 /* This will be used in a switch case and breaks from the switch case if all
>    167  * the pending data is not read.
>    168  */
>    169 #define __socket_proto_read(priv, ret)                                  \
>    170         {                                                               \
>    171                 size_t bytes_read = 0;                                  \
>    172                 struct gf_sock_incoming *in;                            \
>    173                 in = &priv->incoming;                                   \
>    174                                                                         \
>    175                 __socket_proto_update_pending (priv);                   \
>    176                                                                         \
>    177                 ret = __socket_readv (this,                             \
>    178                                       in->pending_vector, 1,            \
>    179                                       &in->pending_vector,              \
>    180                                       &in->pending_count,               \
>    181                                       &bytes_read);                     \
>    182                 if (ret == -1)                                          \
>    183                         break;                                          \
>    184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
>    185         }
> 
> We read from the socket using __socket_readv(). If it fails, we quit.
> However, if the socket doesn't have more data to read, this function
> does not return -1:
> 
>    555                         ret = __socket_cached_read (this, opvector, opcount);
>    556
>    557                         if (ret == 0) {
>    558                                 gf_log(this->name,GF_LOG_DEBUG,"EOF on socket");
>    559                                 errno = ENODATA;
>    560                                 ret = -1;
>    561                         }
>    562                         if (ret == -1 && errno == EAGAIN) {
>    563                                 /* done for now */
>    564                                 break;
>    565                         }
>    566                         this->total_bytes_read += ret;
> 
> If __socket_cached_read() fails with errno == EAGAIN, we break and
> return opcount, which is >= 0, causing the process to continue instead
> of waiting for more data.

No. If you observe, there is a call to another macro, __socket_proto_update_priv_after_read, at line 184. As can be seen in the definition below, if (ret > 0), which is the case for a short read, we do break. I hope I am not missing anything here :).

#define __socket_proto_update_priv_after_read(priv, ret, bytes_read)    \
        {                                                               \
                struct gf_sock_incoming_frag *frag;                     \
                frag = &priv->incoming.frag;                            \
                                                                        \
                frag->fragcurrent += bytes_read;                        \
                frag->bytes_read += bytes_read;                         \
                                                                        \
                if ((ret > 0) || (frag->remaining_size != 0)) {         \
                        if (frag->remaining_size != 0 && ret == 0) {    \
                                __socket_proto_reset_pending (priv);    \
                        }                                               \
                                                                        \
                        gf_log (this->name, GF_LOG_TRACE,               \
                                "partial read on non-blocking socket"); \
                                                                        \
                        break;                                          \
                }                                                       \
        }

> 
> As a side note, there's another problem here: if errno is not EAGAIN,
> we'll update this->total_bytes_read by subtracting one. This shouldn't
> be done when ret < 0.
> 
> There are other places where ret is set to -1 but opcount is returned.
> I guess we should also set opcount = -1 in these places, but I don't
> have deep knowledge of this implementation.
> 
> I've done a quick test checking for (ret != 0) instead of (ret == -1) in
> __socket_proto_read() and it seemed to work.
> 
> Could anyone with more knowledge about the communications layer verify
> this and explain what would be the best solution ?
> 
> Xavi
> 
> On 29/04/16 14:52, Xavier Hernandez wrote:
> > With your patch applied, it seems that the bug is not hit.
> >
> > I guess it's a timing issue that the new logging hides. Maybe no more
> > data available after reading the partial readv header ? (it will arrive
> > later).
> >
> > I'll continue testing...
> >
> > Xavi
> >
> > On 29/04/16 13:48, Raghavendra Gowdappa wrote:
> >> Attaching the patch.
> >>
> >> ----- Original Message -----
> >>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>> Sent: Friday, April 29, 2016 5:14:02 PM
> >>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>> Sent: Friday, April 29, 2016 1:21:57 PM
> >>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>>>
> >>>> Hi Raghavendra,
> >>>>
> >>>> yes, the readv response contains xdata. The dict length is 38 (0x26)
> >>>> and, at the moment of failure, rsp.xdata.xdata_len already contains
> >>>> 0x26.
> >>>
> >>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
> >>> that the
> >>> approach used in socket.c to get the length of xdata is correct.
> >>> However, I
> >>> cannot find any other way of xdata going into payload vector other than
> >>> xdata_len being zero. Just to be double sure, I've a patch containing
> >>> debug
> >>> message printing xdata_len when decoding fails in socket.c. Can you
> >>> please
> >>> apply the patch, run the tests and revert back with results?
> >>>
> >>>>
> >>>> Xavi
> >>>>
> >>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>>> Sent: Friday, April 29, 2016 12:36:43 PM
> >>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>> layer ?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>>> Cc: "Jeff Darcy" <jdarcy at redhat.com>, "Gluster Devel"
> >>>>>>> <gluster-devel at gluster.org>
> >>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM
> >>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>>> layer ?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>>>> To: "Jeff Darcy" <jdarcy at redhat.com>
> >>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM
> >>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>>>> layer
> >>>>>>>> ?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jeff,
> >>>>>>>>
> >>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and
> >>>>>>>> gfapi. The volume is a distributed-disperse 4*(4+2). I'm able
> >>>>>>>> to reproduce the problem easily doing the following test:
> >>>>>>>>
> >>>>>>>>     iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
> >>>>>>>>     echo 3 > /proc/sys/vm/drop_caches
> >>>>>>>>     iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat
> >>>>>>>>
> >>>>>>>> The error happens soon after starting the read test. As can be
> >>>>>>>> seen in the data below, client3_3_readv_cbk() is processing an
> >>>>>>>> iovec of 116 bytes, however it should be of 154 bytes (the
> >>>>>>>> buffer in memory really seems to contain 154 bytes). The data
> >>>>>>>> on the network seems ok (at least I haven't been able to
> >>>>>>>> identify any problem), so this must be a processing error on
> >>>>>>>> the client side. The last field in the cut buffer of the
> >>>>>>>> sequentialized data corresponds to the length of the xdata
> >>>>>>>> field: 0x26. So at least 38 more bytes should be present.
> >>>>>>>>
> >>>>>>>> Nice detective work, Xavi.  It would be *very* interesting to
> >>>>>>>> see what the value of the "count" parameter is (it's
> >>>>>>>> unfortunately optimized out).  I'll bet it's two, and
> >>>>>>>> iov[1].iov_len is 38.  I have a weak memory of some problems
> >>>>>>>> with how this iov is put together, a couple of years ago, and
> >>>>>>>> it looks like you might have tripped over one more.
> >>>>>>>>
> >>>>>>>> It seems you are right. The count is 2 and the first 38 bytes
> >>>>>>>> of the second vector contain the remaining data of the xdata
> >>>>>>>> field.
> >>>>>>>
> >>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all
> >>>>>>> the actors/cbks) expects the response in at most two vectors:
> >>>>>>> 1. Program header containing the request or response. This is
> >>>>>>>    subjected to decoding/encoding. This vector should point to a
> >>>>>>>    buffer that contains the entire program header/response
> >>>>>>>    contiguously.
> >>>>>>> 2. If the procedure returns a payload (like a readv response or
> >>>>>>>    a write request), the second vector contains the buffer
> >>>>>>>    pointing to the entire (contiguous) payload. Note that this
> >>>>>>>    payload is raw and is not subjected to encoding/decoding.
> >>>>>>>
> >>>>>>> In your case, this _clean_ separation is broken, with part of
> >>>>>>> the program header slipping into the 2nd vector, which is
> >>>>>>> supposed to contain read data (maybe because of rpc
> >>>>>>> fragmentation). I think this is a bug in the socket layer. I'll
> >>>>>>> update more on this.
> >>>>>>
> >>>>>> Does your read response include xdata too? I think the code
> >>>>>> related to
> >>>>>> reading xdata in readv response is a bit murky.
> >>>>>>
> >>>>>> <socket.c/__socket_read_accepted_successful_reply>
> >>>>>>
> >>>>>>         case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
> >>>>>>                 default_read_size = xdr_sizeof ((xdrproc_t)
> >>>>>>                 xdr_gfs3_read_rsp,
> >>>>>>                                                 &read_rsp);
> >>>>>>
> >>>>>>                 proghdr_buf = frag->fragcurrent;
> >>>>>>
> >>>>>>                 __socket_proto_init_pending (priv,
> >>>>>> default_read_size);
> >>>>>>
> >>>>>>                 frag->call_body.reply.accepted_success_state
> >>>>>>                         = SP_STATE_READING_PROC_HEADER;
> >>>>>>
> >>>>>>                 /* fall through */
> >>>>>>
> >>>>>>         case SP_STATE_READING_PROC_HEADER:
> >>>>>>                 __socket_proto_read (priv, ret);
> >>>>>>
> >>>>>>>>>>> By this time we've read read response _minus_ the xdata
> >>>>>
> >>>>> I meant we have read "readv response header"
> >>>>>
> >>>>>>
> >>>>>>                 /* there can be 'xdata' in read response, figure
> >>>>>> it out
> >>>>>>                 */
> >>>>>>                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
> >>>>>>                                XDR_DECODE);
> >>>>>>
> >>>>>>>>>>>> We created xdr stream above with "default_read_size" (this
> >>>>>>>>>>>> doesn't
> >>>>>>>>>>>> include xdata)
> >>>>>>
> >>>>>>                 /* This will fail if there is xdata sent from
> >>>>>> server, if
> >>>>>>                 not,
> >>>>>>                    well and good, we don't need to worry about  */
> >>>>>>
> >>>>>>>>>>>> what if xdata is present and decoding failed (as length of xdr
> >>>>>>>>>>>> stream
> >>>>>>>>>>>> above - default_read_size - doesn't include xdata)? would we
> >>>>>>>>>>>> have
> >>>>>>>>>>>> a
> >>>>>>>>>>>> valid value in read_rsp.xdata.xdata_len? This is the part I am
> >>>>>>>>>>>> confused about. If read_rsp.xdata.xdata_len is not correct then
> >>>>>>>>>>>> there
> >>>>>>>>>>>> is a possibility that xdata might not be entirely present in
> >>>>>>>>>>>> the
> >>>>>>>>>>>> vector socket passes to higher layers as progheader (with
> >>>>>>>>>>>> part or
> >>>>>>>>>>>> entire xdata spilling over to payload vector).
> >>>>>>
> >>>>>>                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
> >>>>>>
> >>>>>>                 free (read_rsp.xdata.xdata_val);
> >>>>>>
> >>>>>>                 /* need to round off to proper roof (%4), as XDR
> >>>>>> packing
> >>>>>>                 pads
> >>>>>>                    the end of opaque object with '0' */
> >>>>>>                 size = roof (read_rsp.xdata.xdata_len, 4);
> >>>>>>
> >>>>>>                 if (!size) {
> >>>>>>                         frag->call_body.reply.accepted_success_state
> >>>>>>                                 = SP_STATE_READ_PROC_OPAQUE;
> >>>>>>                         goto read_proc_opaque;
> >>>>>>                 }
> >>>>>>
> >>>>>> </socket.c>
> >>>>>>
> >>>>>> Can you please confirm whether there was an xdata in the readv
> >>>>>> response
> >>>>>> (may
> >>>>>> be by looking in bricks) whose decoding failed?
> >>>>>>
> >>>>>> regards,
> >>>>>> Raghavendra
> >>>>>> _______________________________________________
> >>>>>> Gluster-devel mailing list
> >>>>>> Gluster-devel at gluster.org
> >>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>
> >>>>
> >>>
> 

