[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

Raghavendra G raghavendra at gluster.com
Fri Mar 4 09:39:39 UTC 2016


On Fri, Mar 4, 2016 at 2:02 PM, Raghavendra G <raghavendra at gluster.com>
wrote:

>
>
> On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
> khiremat at redhat.com> wrote:
>
>> Hi,
>>
>> Yes, with this patch we need not set conn->trans to NULL in
>> rpc_clnt_disable
>>
>
> While [1] fixes the crash, things can be improved in the way how changelog
> is using rpc.
>
> 1. In the current code, there is an rpc_clnt object leak during disconnect
> event.
> 2. Also, freed "mydata" of changelog is still associated with rpc_clnt
> object (corollary of 1), though change log might not get any events with
> "mydata" (as connection is dead).
>
> I've discussed with Kotresh about changes needed, offline. So, following
> are the action items.
> 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
> 2. [2] can be accepted. However, someone might want to re-use an rpc
> object after disabling it, like introducing a new api rpc_clnt_enable_again
> (though no of such use-cases is very less). But [2] doesn't allow it. The
> point is as long as rpc-clnt object is alive, transport object is alive
> (though disconnected) and we can re-use it. So, I would prefer not to
> accept it.
>

[2] will be accepted now.


> 3. Kotresh will work on new changes to make sure changelog makes correct
> use of rpc-clnt.
>
> [1] http://review.gluster.org/#/c/13592
> [2] http://review.gluster.org/#/c/1359
>
> regards,
> Raghavendra.
>
>
>> Thanks and Regards,
>> Kotresh H R
>>
>> ----- Original Message -----
>> > From: "Soumya Koduri" <skoduri at redhat.com>
>> > To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Raghavendra
>> G" <raghavendra at gluster.com>
>> > Cc: "Gluster Devel" <gluster-devel at gluster.org>
>> > Sent: Thursday, March 3, 2016 5:06:00 PM
>> > Subject: Re: [Gluster-devel] Cores generated with
>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>> >
>> >
>> >
>> > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
>> > > [Replying on top of my own reply]
>> > >
>> > > Hi,
>> > >
>> > > I have submitted the below patch [1] to avoid the issue of
>> > > 'rpc_clnt_submit'
>> > > getting reconnected. But it won't take care of memory leak problem
>> you were
>> > > trying to fix. That we have to carefully go through all cases and fix
>> it.
>> > > Please have a look at it.
>> > >
>> > Looks good. IIUC, with this patch, we need not set conn->trans to NULL
>> > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak as
>> > the transport object shall then get freed as part of
>> > 'rpc_clnt_trigger_destroy'.
>> >
>> >
>> > > http://review.gluster.org/#/c/13592/
>> > >
>> > > Thanks and Regards,
>> > > Kotresh H R
>> > >
>> > > ----- Original Message -----
>> > >> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
>> > >> To: "Soumya Koduri" <skoduri at redhat.com>
>> > >> Cc: "Raghavendra G" <raghavendra at gluster.com>, "Gluster Devel"
>> > >> <gluster-devel at gluster.org>
>> > >> Sent: Thursday, March 3, 2016 3:39:11 PM
>> > >> Subject: Re: [Gluster-devel] Cores generated with
>> > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
>> > >>
>> > >> Hi Soumya,
>> > >>
>> > >> I tested the lastes patch [2] on master where your previous patch
>> [1] in
>> > >> merged.
>> > >> I see crashes at different places.
>> > >>
>> > >> 1. If there are code paths that are holding rpc object without
>> taking ref
>> > >> on
>> > >> it, all those
>> > >>     code path will crash on invoking rpc submit on that object as rpc
>> > >>     object
>> > >>     would have freed
>> > >>     by last unref on DISCONNECT event. I see this kind of use-case in
>> > >>     chagnelog rpc code.
>> > >>     Need to check on other users of rpc.
>> > Agree. We should fix all such code-paths. Since this seem to be an
>> > intricate fix, shall we take these patches only in master branch and not
>> > in 3.7 release for now till we fix all such paths as we encounter?
>> >
>> > >>
>> > >> 2. And also we need to take care of reconnect timers that are being
>> set
>> > >> and
>> > >> are re-tried to
>> > >>     connect back on expiration. In those cases also, we might crash
>> as rpc
>> > >>     object would have freed.
>> > Your patch addresses this..right?
>> >
>> > Thanks,
>> > Soumya
>> >
>> > >>
>> > >>
>> > >> [1] http://review.gluster.org/#/c/13507/
>> > >> [2] http://review.gluster.org/#/c/13587/
>> > >>
>> > >> Thanks and Regards,
>> > >> Kotresh H R
>> > >>
>> > >> ----- Original Message -----
>> > >>> From: "Soumya Koduri" <skoduri at redhat.com>
>> > >>> To: "Raghavendra G" <raghavendra at gluster.com>, "Kotresh Hiremath
>> > >>> Ravishankar" <khiremat at redhat.com>
>> > >>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>> > >>> Sent: Thursday, March 3, 2016 12:24:00 PM
>> > >>> Subject: Re: [Gluster-devel] Cores generated with
>> > >>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>> > >>>
>> > >>> Thanks a lot Kotresh.
>> > >>>
>> > >>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
>> > >>>> Hi Soumya,
>> > >>>>
>> > >>>> Can you send a fix to this regression on upstream master too? This
>> patch
>> > >>>> is merged there.
>> > >>>>
>> > >>> I have submitted below patch.
>> > >>>   http://review.gluster.org/#/c/13587/
>> > >>>
>> > >>> Kindly review the same.
>> > >>>
>> > >>> Thanks,
>> > >>> Soumya
>> > >>>
>> > >>>> regards,
>> > >>>> Raghavendra
>> > >>>>
>> > >>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar
>> > >>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
>> > >>>>
>> > >>>>      Hi Soumya,
>> > >>>>
>> > >>>>      I analysed the issue and found out that crash has happened
>> because
>> > >>>>      of the patch [1].
>> > >>>>
>> > >>>>      The patch doesn't set transport object to NULL in
>> > >>>>      'rpc_clnt_disable'
>> > >>>>      but instead does it on
>> > >>>>      'rpc_clnt_trigger_destroy'. So if there are pending rpc
>> invocations
>> > >>>>      on the rpc object that
>> > >>>>      is disabled (those instances are possible as happening now in
>> > >>>>      changelog), it will trigger a
>> > >>>>      CONNECT notify again with 'mydata' that is freed causing a
>> crash.
>> > >>>>      This happens because
>> > >>>>      'rpc_clnt_submit' reconnects if rpc is not connected.
>> > >>>>
>> > >>>>        rpc_clnt_submit (...) {
>> > >>>>          ...
>> > >>>>                       if (conn->connected == 0) {
>> > >>>>                               ret = rpc_transport_connect
>> (conn->trans,
>> > >>>>
>> > >>>>        conn->config.remote_port);
>> > >>>>                       }
>> > >>>>          ...
>> > >>>>        }
>> > >>>>
>> > >>>>      Without your patch, conn->trans was set NULL and hence CONNECT
>> > >>>>      fails
>> > >>>>      not resulting with
>> > >>>>      CONNECT notify call. And also the cleanup happens in failure
>> path.
>> > >>>>
>> > >>>>      So the memory leak can happen, if there is no try for rpc
>> > >>>>      invocation
>> > >>>>      after DISCONNECT.
>> > >>>>      It will be cleaned up otherwise.
>> > >>>>
>> > >>>>
>> > >>>>      [1] http://review.gluster.org/#/c/13507/
>> > >>>>
>> > >>>>      Thanks and Regards,
>> > >>>>      Kotresh H R
>> > >>>>
>> > >>>>      ----- Original Message -----
>> > >>>>       > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
>> > >>>>      <mailto:khiremat at redhat.com>>
>> > >>>>       > To: "Soumya Koduri" <skoduri at redhat.com
>> > >>>>       > <mailto:skoduri at redhat.com>>
>> > >>>>       > Cc: avishwan at redhat.com <mailto:avishwan at redhat.com>,
>> "Gluster
>> > >>>>      Devel" <gluster-devel at gluster.org
>> > >>>>      <mailto:gluster-devel at gluster.org>>
>> > >>>>       > Sent: Monday, February 29, 2016 4:15:22 PM
>> > >>>>       > Subject: Re: Cores generated with
>> > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
>> > >>>>       >
>> > >>>>       > Hi Soumya,
>> > >>>>       >
>> > >>>>       > I just tested that it is reproducible only with your patch
>> both
>> > >>>>      in master and
>> > >>>>       > 3.76 branch.
>> > >>>>       > The geo-rep test cases are marked bad in master. So it's
>> not hit
>> > >>>>      in master.
>> > >>>>       > rpc is introduced
>> > >>>>       > in changelog xlator to communicate to applications via
>> > >>>>      libgfchangelog.
>> > >>>>       > Venky/Me will check
>> > >>>>       > why is the crash happening and will update.
>> > >>>>       >
>> > >>>>       >
>> > >>>>       > Thanks and Regards,
>> > >>>>       > Kotresh H R
>> > >>>>       >
>> > >>>>       > ----- Original Message -----
>> > >>>>       > > From: "Soumya Koduri" <skoduri at redhat.com
>> > >>>>      <mailto:skoduri at redhat.com>>
>> > >>>>       > > To: avishwan at redhat.com <mailto:avishwan at redhat.com>,
>> > >>>>       > > "kotresh"
>> > >>>>      <khiremat at redhat.com <mailto:khiremat at redhat.com>>
>> > >>>>       > > Cc: "Gluster Devel" <gluster-devel at gluster.org
>> > >>>>      <mailto:gluster-devel at gluster.org>>
>> > >>>>       > > Sent: Monday, February 29, 2016 2:10:51 PM
>> > >>>>       > > Subject: Cores generated with
>> > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
>> > >>>>       > >
>> > >>>>       > > Hi Aravinda/Kotresh,
>> > >>>>       > >
>> > >>>>       > > With [1], I consistently see cores generated with the
>> test
>> > >>>>       > > './tests/geo-rep/georep-basic-dr-tarssh.t' in release-3.7
>> > >>>>      branch. From
>> > >>>>       > > the cores, looks like we are trying to dereference a
>> freed
>> > >>>>       > > changelog_rpc_clnt_t(crpc) object in
>> changelog_rpc_notify().
>> > >>>>      Strangely
>> > >>>>       > > this was not reported in master branch.
>> > >>>>       > >
>> > >>>>       > > I tried debugging but couldn't find any possible
>> suspects. I
>> > >>>>      request you
>> > >>>>       > > to take a look and let me know if [1] caused any
>> regression.
>> > >>>>       > >
>> > >>>>       > > Thanks,
>> > >>>>       > > Soumya
>> > >>>>       > >
>> > >>>>       > > [1] http://review.gluster.org/#/c/13507/
>> > >>>>       > >
>> > >>>>       >
>> > >>>>      _______________________________________________
>> > >>>>      Gluster-devel mailing list
>> > >>>>      Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>> > >>>>      http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> --
>> > >>>> Raghavendra G
>> > >>>
>> > >> _______________________________________________
>> > >> Gluster-devel mailing list
>> > >> Gluster-devel at gluster.org
>> > >> http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >>
>> >
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20160304/b6171706/attachment-0001.html>


More information about the Gluster-devel mailing list