[Gluster-devel] Possible bug in the communications layer ?
Raghavendra Gowdappa
rgowdapp at redhat.com
Wed May 4 12:47:44 UTC 2016
----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Wednesday, May 4, 2016 5:37:56 PM
> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
>
> I think I've found the problem.
>
> 1567         case SP_STATE_READING_PROC_HEADER:
> 1568                 __socket_proto_read (priv, ret);
> 1569
> 1570                 /* there can be 'xdata' in read response, figure it out */
> 1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
> 1572                                XDR_DECODE);
> 1573
> 1574                 /* This will fail if there is xdata sent from server, if not,
> 1575                    well and good, we don't need to worry about */
> 1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
> 1577
> 1578                 free (read_rsp.xdata.xdata_val);
> 1579
> 1580                 /* need to round off to proper roof (%4), as XDR packing pads
> 1581                    the end of opaque object with '0' */
> 1582                 size = roof (read_rsp.xdata.xdata_len, 4);
> 1583
> 1584                 if (!size) {
> 1585                         frag->call_body.reply.accepted_success_state
> 1586                                 = SP_STATE_READ_PROC_OPAQUE;
> 1587                         goto read_proc_opaque;
> 1588                 }
> 1589
> 1590                 __socket_proto_init_pending (priv, size);
> 1591
> 1592                 frag->call_body.reply.accepted_success_state
> 1593                         = SP_STATE_READING_PROC_OPAQUE;
> 1594
> 1595         case SP_STATE_READING_PROC_OPAQUE:
> 1596                 __socket_proto_read (priv, ret);
> 1597
> 1598                 frag->call_body.reply.accepted_success_state
> 1599                         = SP_STATE_READ_PROC_OPAQUE;
>
> On line 1568 we read, at most, 116 bytes because we calculate the size
> of a read response without xdata. Then we detect that we really need
> more data for xdata (BTW, will read_rsp.xdata.xdata_val always be
> allocated even if xdr_gfs3_read_rsp() fails ?)
No, it need not be. It's guaranteed to be allocated only on successful completion. However, _if_ decoding fails only because the xdr stream doesn't include the xdata bytes (the stream having been sized to default_read_size), then the xdr library would've already filled read_rsp.xdata.xdata_len (while read_rsp.xdata.xdata_val can still be NULL).
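This behavior can be illustrated with a small standalone model. This is not the real xdr library: the `stream`/`decode_opaque` types below are hypothetical simplifications of what `xdrmem_create`/`xdr_bytes` do when decoding a length-prefixed opaque from a buffer that was sized without the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified model of an XDR-style decoder consuming a
 * length-prefixed opaque: the 4-byte length is decoded first, and only
 * then does the decoder try to consume the payload.  If the stream was
 * sized without the payload (default_read_size), the length field gets
 * filled in even though the decode as a whole fails. */
struct stream { const uint8_t *buf; size_t len, pos; };

static int read_u32 (struct stream *s, uint32_t *out)
{
        if (s->pos + 4 > s->len)
                return 0;              /* stream exhausted */
        *out = ((uint32_t)s->buf[s->pos]     << 24) |
               ((uint32_t)s->buf[s->pos + 1] << 16) |
               ((uint32_t)s->buf[s->pos + 2] << 8)  |
                (uint32_t)s->buf[s->pos + 3];
        s->pos += 4;
        return 1;
}

/* Decode an opaque<> the way xdr_bytes() does: length, then data. */
static int decode_opaque (struct stream *s, uint32_t *len, const uint8_t **val)
{
        *val = NULL;
        if (!read_u32 (s, len))
                return 0;
        if (s->pos + *len > s->len)
                return 0;              /* fails, but *len is already set */
        *val = s->buf + s->pos;
        s->pos += *len;
        return 1;
}
```

With a stream containing only the 4-byte length (xdata_len = 0x26) and no payload, the decode fails, the length is filled in, and the value pointer stays NULL, matching the behavior described above.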
>
> So we get into line 1596 with the pending info initialized to read the
> remaining data. This is the __socket_proto_read macro:
>
> 166 /* This will be used in a switch case and breaks from the switch case if all
> 167  * the pending data is not read.
> 168  */
> 169 #define __socket_proto_read(priv, ret)                                  \
> 170         {                                                               \
> 171                 size_t bytes_read = 0;                                  \
> 172                 struct gf_sock_incoming *in;                            \
> 173                 in = &priv->incoming;                                   \
> 174                                                                         \
> 175                 __socket_proto_update_pending (priv);                   \
> 176                                                                         \
> 177                 ret = __socket_readv (this,                             \
> 178                                       in->pending_vector, 1,            \
> 179                                       &in->pending_vector,              \
> 180                                       &in->pending_count,               \
> 181                                       &bytes_read);                     \
> 182                 if (ret == -1)                                          \
> 183                         break;                                          \
> 184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
> 185         }
>
> We read from the socket using __socket_readv(). If it fails, we quit.
> However, if the socket doesn't have more data to read, this function does
> not return -1:
>
> 555                 ret = __socket_cached_read (this, opvector, opcount);
> 556
> 557                 if (ret == 0) {
> 558                         gf_log (this->name, GF_LOG_DEBUG, "EOF on socket");
> 559                         errno = ENODATA;
> 560                         ret = -1;
> 561                 }
> 562                 if (ret == -1 && errno == EAGAIN) {
> 563                         /* done for now */
> 564                         break;
> 565                 }
> 566                 this->total_bytes_read += ret;
>
> If __socket_cached_read() fails with errno == EAGAIN, we break and
> return opcount, which is >= 0. Causing the process to continue instead
> of waiting for more data.
No. If you observe, there is a call to another macro, __socket_proto_update_priv_after_read, at line 184. As can be seen in the definition below, if (ret > 0) - which is the case for a short read - we do break. I hope I am not missing anything here :).
#define __socket_proto_update_priv_after_read(priv, ret, bytes_read) \
{ \
struct gf_sock_incoming_frag *frag; \
frag = &priv->incoming.frag; \
\
frag->fragcurrent += bytes_read; \
frag->bytes_read += bytes_read; \
\
if ((ret > 0) || (frag->remaining_size != 0)) { \
if (frag->remaining_size != 0 && ret == 0) { \
__socket_proto_reset_pending (priv); \
} \
\
gf_log (this->name, GF_LOG_TRACE, \
"partial read on non-blocking socket"); \
\
break; \
} \
}
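That break-and-resume behavior can be sketched with a stripped-down model. The `frag`/`consume` types below are hypothetical simplifications standing in for `gf_sock_incoming_frag` and the macro pair above: on a short read the state machine bails out of the switch and re-enters the same state once more data arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down model of the socket state machine's
 * short-read handling: each call feeds however many bytes are currently
 * available toward the pending opaque.  If fewer bytes arrived than
 * were pending, we report "must wait" (the break in
 * __socket_proto_update_priv_after_read) and stay in the same state. */
enum state { READING_OPAQUE, READ_OPAQUE_DONE };

struct frag {
        size_t     remaining;  /* bytes of the opaque still pending */
        enum state st;
};

/* Returns 1 if we must wait for more data, 0 once the opaque is complete. */
static int consume (struct frag *f, size_t avail)
{
        size_t n = avail < f->remaining ? avail : f->remaining;

        f->remaining -= n;
        if (f->remaining != 0)
                return 1;      /* short read: stay in READING_OPAQUE */
        f->st = READ_OPAQUE_DONE;
        return 0;
}
```

For the 40 pending xdata bytes discussed in this thread, a first read delivering only 16 bytes leaves the machine in READING_OPAQUE with 24 bytes pending; a later read delivering the rest completes the transition.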
>
> As a side note, there's another problem here: if errno is not EAGAIN,
> we'll update this->total_bytes_read by subtracting one. This shouldn't be
> done when ret < 0.
>
> There are other places where ret is set to -1 but opcount is returned.
> I guess we should also set opcount = -1 in these places, but I don't
> have deep knowledge of this implementation.
>
> I've done a quick test checking for (ret != 0) instead of (ret == -1) in
> __socket_proto_read() and it seemed to work.
>
> Could anyone with more knowledge about the communications layer verify
> this and explain what would be the best solution ?
>
> Xavi
>
> On 29/04/16 14:52, Xavier Hernandez wrote:
> > With your patch applied, it seems that the bug is not hit.
> >
> > I guess it's a timing issue that the new logging hides. Maybe no more
> > data available after reading the partial readv header ? (it will arrive
> > later).
> >
> > I'll continue testing...
> >
> > Xavi
> >
> > On 29/04/16 13:48, Raghavendra Gowdappa wrote:
> >> Attaching the patch.
> >>
> >> ----- Original Message -----
> >>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>> Sent: Friday, April 29, 2016 5:14:02 PM
> >>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>> Sent: Friday, April 29, 2016 1:21:57 PM
> >>>> Subject: Re: [Gluster-devel] Possible bug in the communications layer ?
> >>>>
> >>>> Hi Raghavendra,
> >>>>
> >>>> yes, the readv response contains xdata. The dict length is 38 (0x26)
> >>>> and, at the moment of failure, rsp.xdata.xdata_len already contains
> >>>> 0x26.
> >>>
> >>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
> >>> that the
> >>> approach used in socket.c to get the length of xdata is correct.
> >>> However, I
> >>> cannot find any other way of xdata going into payload vector other than
> >>> xdata_len being zero. Just to be double sure, I've a patch containing
> >>> debug
> >>> message printing xdata_len when decoding fails in socket.c. Can you
> >>> please
> >>> apply the patch, run the tests and revert back with results?
> >>>
> >>>>
> >>>> Xavi
> >>>>
> >>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote:
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>>> Sent: Friday, April 29, 2016 12:36:43 PM
> >>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>> layer ?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >>>>>>> To: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>>> Cc: "Jeff Darcy" <jdarcy at redhat.com>, "Gluster Devel"
> >>>>>>> <gluster-devel at gluster.org>
> >>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM
> >>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>>> layer ?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> >>>>>>>> To: "Jeff Darcy" <jdarcy at redhat.com>
> >>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM
> >>>>>>>> Subject: Re: [Gluster-devel] Possible bug in the communications
> >>>>>>>> layer
> >>>>>>>> ?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Jeff,
> >>>>>>>>
> >>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
> >>>>>>>> The volume is a distributed-disperse 4*(4+2). I'm able to reproduce
> >>>>>>>> the problem easily doing the following test:
> >>>>>>>>
> >>>>>>>>     iozone -t2 -s10g -r1024k -i0 -w -F /iozone{1..2}.dat
> >>>>>>>>     echo 3 >/proc/sys/vm/drop_caches
> >>>>>>>>     iozone -t2 -s10g -r1024k -i1 -w -F /iozone{1..2}.dat
> >>>>>>>>
> >>>>>>>> The error happens soon after starting the read test. As can be seen
> >>>>>>>> in the data below, client3_3_readv_cbk() is processing an iovec of
> >>>>>>>> 116 bytes, however it should be of 154 bytes (the buffer in memory
> >>>>>>>> really seems to contain 154 bytes). The data on the network seems
> >>>>>>>> ok (at least I haven't been able to identify any problem), so this
> >>>>>>>> must be a processing error on the client side. The last field in
> >>>>>>>> the cut buffer of the serialized data corresponds to the length of
> >>>>>>>> the xdata field: 0x26. So at least 38 more bytes should be present.
> >>>>>>>>
> >>>>>>>> Nice detective work, Xavi. It would be *very* interesting to see
> >>>>>>>> what the value of the "count" parameter is (it's unfortunately
> >>>>>>>> optimized out). I'll bet it's two, and iov[1].iov_len is 38. I have
> >>>>>>>> a weak memory of some problems with how this iov is put together, a
> >>>>>>>> couple of years ago, and it looks like you might have tripped over
> >>>>>>>> one more.
> >>>>>>>>
> >>>>>>>> It seems you are right. The count is 2 and the first 38 bytes of
> >>>>>>>> the second vector contain the remaining data of the xdata field.
> >>>>>>>
> >>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all the
> >>>>>>> actors/cbks) expects the response in at most two vectors:
> >>>>>>> 1. Program header containing the request or response. This is
> >>>>>>> subjected to decoding/encoding. This vector should point to a buffer
> >>>>>>> that contains the entire program header/response contiguously.
> >>>>>>> 2. If the procedure returns a payload (like a readv response or a
> >>>>>>> write request), the second vector contains the buffer pointing to
> >>>>>>> the entire (contiguous) payload. Note that this payload is raw and
> >>>>>>> is not subjected to encoding/decoding.
> >>>>>>>
> >>>>>>> In your case, this _clean_ separation is broken, with part of the
> >>>>>>> program header slipping into the 2nd vector, which is supposed to
> >>>>>>> contain only read data (maybe because of rpc fragmentation). I think
> >>>>>>> this is a bug in the socket layer. I'll update more on this.
> >>>>>>
> >>>>>> Does your read response include xdata too? I think the code
> >>>>>> related to
> >>>>>> reading xdata in readv response is a bit murky.
> >>>>>>
> >>>>>> <socket.c/__socket_read_accepted_successful_reply>
> >>>>>>
> >>>>>> case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
> >>>>>>         default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
> >>>>>>                                         &read_rsp);
> >>>>>>
> >>>>>>         proghdr_buf = frag->fragcurrent;
> >>>>>>
> >>>>>>         __socket_proto_init_pending (priv, default_read_size);
> >>>>>>
> >>>>>>         frag->call_body.reply.accepted_success_state
> >>>>>>                 = SP_STATE_READING_PROC_HEADER;
> >>>>>>
> >>>>>>         /* fall through */
> >>>>>>
> >>>>>> case SP_STATE_READING_PROC_HEADER:
> >>>>>>         __socket_proto_read (priv, ret);
> >>>>>>
> >>>>>>>>>>> By this time we've read read response _minus_ the xdata
> >>>>>
> >>>>> I meant we have read "readv response header"
> >>>>>
> >>>>>>
> >>>>>>         /* there can be 'xdata' in read response, figure it out */
> >>>>>>         xdrmem_create (&xdr, proghdr_buf, default_read_size,
> >>>>>>                        XDR_DECODE);
> >>>>>>
> >>>>>>>>>>>> We created the xdr stream above with "default_read_size" (this
> >>>>>>>>>>>> doesn't include xdata)
> >>>>>>
> >>>>>>         /* This will fail if there is xdata sent from server, if not,
> >>>>>>            well and good, we don't need to worry about */
> >>>>>>
> >>>>>>>>>>>> What if xdata is present and decoding failed (as the length of
> >>>>>>>>>>>> the xdr stream above - default_read_size - doesn't include
> >>>>>>>>>>>> xdata)? Would we have a valid value in read_rsp.xdata.xdata_len?
> >>>>>>>>>>>> This is the part I am confused about. If
> >>>>>>>>>>>> read_rsp.xdata.xdata_len is not correct, then there is a
> >>>>>>>>>>>> possibility that xdata might not be entirely present in the
> >>>>>>>>>>>> vector socket passes to higher layers as progheader (with part
> >>>>>>>>>>>> or the entire xdata spilling over to the payload vector).
> >>>>>>
> >>>>>>         xdr_gfs3_read_rsp (&xdr, &read_rsp);
> >>>>>>
> >>>>>>         free (read_rsp.xdata.xdata_val);
> >>>>>>
> >>>>>>         /* need to round off to proper roof (%4), as XDR packing pads
> >>>>>>            the end of opaque object with '0' */
> >>>>>>         size = roof (read_rsp.xdata.xdata_len, 4);
> >>>>>>
> >>>>>>         if (!size) {
> >>>>>>                 frag->call_body.reply.accepted_success_state
> >>>>>>                         = SP_STATE_READ_PROC_OPAQUE;
> >>>>>>                 goto read_proc_opaque;
> >>>>>>         }
> >>>>>>
> >>>>>> </socket.c>
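The roof() rounding in the excerpt above is plain arithmetic: XDR pads a variable-length opaque with '\0' bytes up to a 4-byte boundary, so the socket layer must wait for the rounded-up size. A self-contained sketch (the macro here is a simplified stand-in for the one in the GlusterFS headers):

```c
#include <assert.h>

/* Simplified stand-in for the roof() macro used in socket.c: round 'a'
 * up to the nearest multiple of 'b'.  XDR pads the opaque xdata blob
 * with '\0' to a 4-byte boundary, so this gives the bytes actually
 * pending on the wire. */
#define roof(a, b) ((((a) + (b) - 1) / (b)) * (b))
```

For the xdata_len of 0x26 (38) seen in this thread, roof(38, 4) is 40, so two padding bytes follow the 38 bytes of xdata on the wire.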
> >>>>>>
> >>>>>> Can you please confirm whether there was xdata in the readv response
> >>>>>> (maybe by looking at the bricks) whose decoding failed?
> >>>>>>
> >>>>>> regards,
> >>>>>> Raghavendra
> >>>>>> _______________________________________________
> >>>>>> Gluster-devel mailing list
> >>>>>> Gluster-devel at gluster.org
> >>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>
> >>>>
> >>>
>