[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

Kotresh Hiremath Ravishankar khiremat at redhat.com
Wed Mar 9 15:46:30 UTC 2016


Hi All,

The following patch is sent to address changelog rpc mem-leak issues.
The fix is intricate and needs lot of testing before taking in. Please
review the same.

http://review.gluster.org/#/c/13658/1

Thanks and Regards,
Kotresh H R


----- Original Message -----
> From: "Venky Shankar" <vshankar at redhat.com>
> To: "Raghavendra G" <raghavendra at gluster.com>
> Cc: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Monday, March 7, 2016 3:52:13 PM
> Subject: Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
> 
> On Fri, Mar 04, 2016 at 02:02:32PM +0530, Raghavendra G wrote:
> > On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
> > khiremat at redhat.com> wrote:
> > 
> > > Hi,
> > >
> > > Yes, with this patch we need not set conn->trans to NULL in
> > > rpc_clnt_disable
> > >
> > 
> > While [1] fixes the crash, things can be improved in the way how changelog
> > is using rpc.
> > 
> > 1. In the current code, there is an rpc_clnt object leak during disconnect
> > event.
> 
> Just for the record (as I couldn't find this information while building up
> rpc infrastructure in changelog):
> 
> Unref rpc-clnt object after calling rpc_clnt_disable() in notify() upon
> RPC_CLNT_DISCONNECT. Free up 'mydata' in notify() upon RPC_CLNT_DESTROY.
> 
> > 2. Also, freed "mydata" of changelog is still associated with rpc_clnt
> > object (corollary of 1), though change log might not get any events with
> > "mydata" (as connection is dead).
> > 
> > I've discussed with Kotresh about changes needed, offline. So, following
> > are the action items.
> > 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
> > 2. [2] can be accepted. However, someone might want to re-use an rpc object
> > after disabling it, like introducing a new api rpc_clnt_enable_again
> > (though no of such use-cases is very less). But [2] doesn't allow it. The
> > point is as long as rpc-clnt object is alive, transport object is alive
> > (though disconnected) and we can re-use it. So, I would prefer not to
> > accept it.
> > 3. Kotresh will work on new changes to make sure changelog makes correct
> > use of rpc-clnt.
> > 
> > [1] http://review.gluster.org/#/c/13592
> > [2] http://review.gluster.org/#/c/1359
> > 
> > regards,
> > Raghavendra.
> > 
> > 
> > > Thanks and Regards,
> > > Kotresh H R
> > >
> > > ----- Original Message -----
> > > > From: "Soumya Koduri" <skoduri at redhat.com>
> > > > To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Raghavendra
> > > G" <raghavendra at gluster.com>
> > > > Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > > > Sent: Thursday, March 3, 2016 5:06:00 PM
> > > > Subject: Re: [Gluster-devel] Cores generated with
> > > ./tests/geo-rep/georep-basic-dr-tarssh.t
> > > >
> > > >
> > > >
> > > > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
> > > > > [Replying on top of my own reply]
> > > > >
> > > > > Hi,
> > > > >
> > > > > I have submitted the below patch [1] to avoid the issue of
> > > > > 'rpc_clnt_submit'
> > > > > getting reconnected. But it won't take care of memory leak problem
> > > > > you
> > > were
> > > > > trying to fix. That we have to carefully go through all cases and fix
> > > it.
> > > > > Please have a look at it.
> > > > >
> > > > Looks good. IIUC, with this patch, we need not set conn->trans to NULL
> > > > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak
> > > > as
> > > > the transport object shall then get freed as part of
> > > > 'rpc_clnt_trigger_destroy'.
> > > >
> > > >
> > > > > http://review.gluster.org/#/c/13592/
> > > > >
> > > > > Thanks and Regards,
> > > > > Kotresh H R
> > > > >
> > > > > ----- Original Message -----
> > > > >> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> > > > >> To: "Soumya Koduri" <skoduri at redhat.com>
> > > > >> Cc: "Raghavendra G" <raghavendra at gluster.com>, "Gluster Devel"
> > > > >> <gluster-devel at gluster.org>
> > > > >> Sent: Thursday, March 3, 2016 3:39:11 PM
> > > > >> Subject: Re: [Gluster-devel] Cores generated with
> > > > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > > > >>
> > > > >> Hi Soumya,
> > > > >>
> > > > >> I tested the lastes patch [2] on master where your previous patch
> > > > >> [1]
> > > in
> > > > >> merged.
> > > > >> I see crashes at different places.
> > > > >>
> > > > >> 1. If there are code paths that are holding rpc object without
> > > > >> taking
> > > ref
> > > > >> on
> > > > >> it, all those
> > > > >>     code path will crash on invoking rpc submit on that object as
> > > > >>     rpc
> > > > >>     object
> > > > >>     would have freed
> > > > >>     by last unref on DISCONNECT event. I see this kind of use-case
> > > > >>     in
> > > > >>     chagnelog rpc code.
> > > > >>     Need to check on other users of rpc.
> > > > Agree. We should fix all such code-paths. Since this seem to be an
> > > > intricate fix, shall we take these patches only in master branch and
> > > > not
> > > > in 3.7 release for now till we fix all such paths as we encounter?
> > > >
> > > > >>
> > > > >> 2. And also we need to take care of reconnect timers that are being
> > > set
> > > > >> and
> > > > >> are re-tried to
> > > > >>     connect back on expiration. In those cases also, we might crash
> > > as rpc
> > > > >>     object would have freed.
> > > > Your patch addresses this..right?
> > > >
> > > > Thanks,
> > > > Soumya
> > > >
> > > > >>
> > > > >>
> > > > >> [1] http://review.gluster.org/#/c/13507/
> > > > >> [2] http://review.gluster.org/#/c/13587/
> > > > >>
> > > > >> Thanks and Regards,
> > > > >> Kotresh H R
> > > > >>
> > > > >> ----- Original Message -----
> > > > >>> From: "Soumya Koduri" <skoduri at redhat.com>
> > > > >>> To: "Raghavendra G" <raghavendra at gluster.com>, "Kotresh Hiremath
> > > > >>> Ravishankar" <khiremat at redhat.com>
> > > > >>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> > > > >>> Sent: Thursday, March 3, 2016 12:24:00 PM
> > > > >>> Subject: Re: [Gluster-devel] Cores generated with
> > > > >>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > > > >>>
> > > > >>> Thanks a lot Kotresh.
> > > > >>>
> > > > >>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
> > > > >>>> Hi Soumya,
> > > > >>>>
> > > > >>>> Can you send a fix to this regression on upstream master too? This
> > > patch
> > > > >>>> is merged there.
> > > > >>>>
> > > > >>> I have submitted below patch.
> > > > >>>   http://review.gluster.org/#/c/13587/
> > > > >>>
> > > > >>> Kindly review the same.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Soumya
> > > > >>>
> > > > >>>> regards,
> > > > >>>> Raghavendra
> > > > >>>>
> > > > >>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar
> > > > >>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
> > > > >>>>
> > > > >>>>      Hi Soumya,
> > > > >>>>
> > > > >>>>      I analysed the issue and found out that crash has happened
> > > because
> > > > >>>>      of the patch [1].
> > > > >>>>
> > > > >>>>      The patch doesn't set transport object to NULL in
> > > > >>>>      'rpc_clnt_disable'
> > > > >>>>      but instead does it on
> > > > >>>>      'rpc_clnt_trigger_destroy'. So if there are pending rpc
> > > invocations
> > > > >>>>      on the rpc object that
> > > > >>>>      is disabled (those instances are possible as happening now in
> > > > >>>>      changelog), it will trigger a
> > > > >>>>      CONNECT notify again with 'mydata' that is freed causing a
> > > crash.
> > > > >>>>      This happens because
> > > > >>>>      'rpc_clnt_submit' reconnects if rpc is not connected.
> > > > >>>>
> > > > >>>>        rpc_clnt_submit (...) {
> > > > >>>>          ...
> > > > >>>>                       if (conn->connected == 0) {
> > > > >>>>                               ret = rpc_transport_connect
> > > (conn->trans,
> > > > >>>>
> > > > >>>>        conn->config.remote_port);
> > > > >>>>                       }
> > > > >>>>          ...
> > > > >>>>        }
> > > > >>>>
> > > > >>>>      Without your patch, conn->trans was set NULL and hence
> > > > >>>>      CONNECT
> > > > >>>>      fails
> > > > >>>>      not resulting with
> > > > >>>>      CONNECT notify call. And also the cleanup happens in failure
> > > path.
> > > > >>>>
> > > > >>>>      So the memory leak can happen, if there is no try for rpc
> > > > >>>>      invocation
> > > > >>>>      after DISCONNECT.
> > > > >>>>      It will be cleaned up otherwise.
> > > > >>>>
> > > > >>>>
> > > > >>>>      [1] http://review.gluster.org/#/c/13507/
> > > > >>>>
> > > > >>>>      Thanks and Regards,
> > > > >>>>      Kotresh H R
> > > > >>>>
> > > > >>>>      ----- Original Message -----
> > > > >>>>       > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
> > > > >>>>      <mailto:khiremat at redhat.com>>
> > > > >>>>       > To: "Soumya Koduri" <skoduri at redhat.com
> > > > >>>>       > <mailto:skoduri at redhat.com>>
> > > > >>>>       > Cc: avishwan at redhat.com <mailto:avishwan at redhat.com>,
> > > "Gluster
> > > > >>>>      Devel" <gluster-devel at gluster.org
> > > > >>>>      <mailto:gluster-devel at gluster.org>>
> > > > >>>>       > Sent: Monday, February 29, 2016 4:15:22 PM
> > > > >>>>       > Subject: Re: Cores generated with
> > > > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
> > > > >>>>       >
> > > > >>>>       > Hi Soumya,
> > > > >>>>       >
> > > > >>>>       > I just tested that it is reproducible only with your patch
> > > both
> > > > >>>>      in master and
> > > > >>>>       > 3.76 branch.
> > > > >>>>       > The geo-rep test cases are marked bad in master. So it's
> > > not hit
> > > > >>>>      in master.
> > > > >>>>       > rpc is introduced
> > > > >>>>       > in changelog xlator to communicate to applications via
> > > > >>>>      libgfchangelog.
> > > > >>>>       > Venky/Me will check
> > > > >>>>       > why is the crash happening and will update.
> > > > >>>>       >
> > > > >>>>       >
> > > > >>>>       > Thanks and Regards,
> > > > >>>>       > Kotresh H R
> > > > >>>>       >
> > > > >>>>       > ----- Original Message -----
> > > > >>>>       > > From: "Soumya Koduri" <skoduri at redhat.com
> > > > >>>>      <mailto:skoduri at redhat.com>>
> > > > >>>>       > > To: avishwan at redhat.com <mailto:avishwan at redhat.com>,
> > > > >>>>       > > "kotresh"
> > > > >>>>      <khiremat at redhat.com <mailto:khiremat at redhat.com>>
> > > > >>>>       > > Cc: "Gluster Devel" <gluster-devel at gluster.org
> > > > >>>>      <mailto:gluster-devel at gluster.org>>
> > > > >>>>       > > Sent: Monday, February 29, 2016 2:10:51 PM
> > > > >>>>       > > Subject: Cores generated with
> > > > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
> > > > >>>>       > >
> > > > >>>>       > > Hi Aravinda/Kotresh,
> > > > >>>>       > >
> > > > >>>>       > > With [1], I consistently see cores generated with the
> > > > >>>>       > > test
> > > > >>>>       > > './tests/geo-rep/georep-basic-dr-tarssh.t' in
> > > > >>>>       > > release-3.7
> > > > >>>>      branch. From
> > > > >>>>       > > the cores, looks like we are trying to dereference a
> > > > >>>>       > > freed
> > > > >>>>       > > changelog_rpc_clnt_t(crpc) object in
> > > changelog_rpc_notify().
> > > > >>>>      Strangely
> > > > >>>>       > > this was not reported in master branch.
> > > > >>>>       > >
> > > > >>>>       > > I tried debugging but couldn't find any possible
> > > suspects. I
> > > > >>>>      request you
> > > > >>>>       > > to take a look and let me know if [1] caused any
> > > regression.
> > > > >>>>       > >
> > > > >>>>       > > Thanks,
> > > > >>>>       > > Soumya
> > > > >>>>       > >
> > > > >>>>       > > [1] http://review.gluster.org/#/c/13507/
> > > > >>>>       > >
> > > > >>>>       >
> > > > >>>>      _______________________________________________
> > > > >>>>      Gluster-devel mailing list
> > > > >>>>      Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> > > > >>>>      http://www.gluster.org/mailman/listinfo/gluster-devel
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> Raghavendra G
> > > > >>>
> > > > >> _______________________________________________
> > > > >> Gluster-devel mailing list
> > > > >> Gluster-devel at gluster.org
> > > > >> http://www.gluster.org/mailman/listinfo/gluster-devel
> > > > >>
> > > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > >
> > 
> > 
> > 
> > --
> > Raghavendra G
> 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> 


More information about the Gluster-devel mailing list