[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Jun 28 09:58:15 UTC 2016


Thanks for the reminder. Will take that up.

----- Original Message -----
> From: "Soumya Koduri" <skoduri at redhat.com>
> To: "Venky Shankar" <vshankar at redhat.com>, "Raghavendra G" <raghavendra at gluster.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, June 28, 2016 3:24:13 PM
> Subject: Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
> 
> Hi Raghavendra/Venky,
> 
> Gentle reminder. Oleksander (post-factum) confirmed that their
> production setup has been running with this patch included since
> sometime and had no issues. Shall we consider merging this patch if
> there are no review comments?
> 
> Thanks,
> Soumya
> 
> On 03/09/2016 09:16 PM, Kotresh Hiremath Ravishankar wrote:
> > Hi All,
> >
> > The following patch is sent to address changelog rpc mem-leak issues.
> > The fix is intricate and needs lot of testing before taking in. Please
> > review the same.
> >
> > http://review.gluster.org/#/c/13658/1
> >
> > Thanks and Regards,
> > Kotresh H R
> >
> >
> > ----- Original Message -----
> >> From: "Venky Shankar" <vshankar at redhat.com>
> >> To: "Raghavendra G" <raghavendra at gluster.com>
> >> Cc: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Gluster Devel"
> >> <gluster-devel at gluster.org>
> >> Sent: Monday, March 7, 2016 3:52:13 PM
> >> Subject: Re: [Gluster-devel] Cores generated with
> >> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>
> >> On Fri, Mar 04, 2016 at 02:02:32PM +0530, Raghavendra G wrote:
> >>> On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
> >>> khiremat at redhat.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Yes, with this patch we need not set conn->trans to NULL in
> >>>> rpc_clnt_disable
> >>>>
> >>>
> >>> While [1] fixes the crash, things can be improved in the way how
> >>> changelog
> >>> is using rpc.
> >>>
> >>> 1. In the current code, there is an rpc_clnt object leak during
> >>> disconnect
> >>> event.
> >>
> >> Just for the record (as I couldn't find this information while building up
> >> rpc infrastructure in changelog):
> >>
> >> Unref rpc-clnt object after calling rpc_clnt_disable() in notify() upon
> >> RPC_CLNT_DISCONNECT. Free up 'mydata' in notify() upon RPC_CLNT_DESTROY.
> >>
> >>> 2. Also, freed "mydata" of changelog is still associated with rpc_clnt
> >>> object (corollary of 1), though change log might not get any events with
> >>> "mydata" (as connection is dead).
> >>>
> >>> I've discussed with Kotresh about changes needed, offline. So, following
> >>> are the action items.
> >>> 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
> >>> 2. [2] can be accepted. However, someone might want to re-use an rpc
> >>> object
> >>> after disabling it, like introducing a new api rpc_clnt_enable_again
> >>> (though no of such use-cases is very less). But [2] doesn't allow it. The
> >>> point is as long as rpc-clnt object is alive, transport object is alive
> >>> (though disconnected) and we can re-use it. So, I would prefer not to
> >>> accept it.
> >>> 3. Kotresh will work on new changes to make sure changelog makes correct
> >>> use of rpc-clnt.
> >>>
> >>> [1] http://review.gluster.org/#/c/13592
> >>> [2] http://review.gluster.org/#/c/1359
> >>>
> >>> regards,
> >>> Raghavendra.
> >>>
> >>>
> >>>> Thanks and Regards,
> >>>> Kotresh H R
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Soumya Koduri" <skoduri at redhat.com>
> >>>>> To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Raghavendra
> >>>> G" <raghavendra at gluster.com>
> >>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>> Sent: Thursday, March 3, 2016 5:06:00 PM
> >>>>> Subject: Re: [Gluster-devel] Cores generated with
> >>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
> >>>>>> [Replying on top of my own reply]
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have submitted the below patch [1] to avoid the issue of
> >>>>>> 'rpc_clnt_submit'
> >>>>>> getting reconnected. But it won't take care of memory leak problem
> >>>>>> you
> >>>> were
> >>>>>> trying to fix. That we have to carefully go through all cases and fix
> >>>> it.
> >>>>>> Please have a look at it.
> >>>>>>
> >>>>> Looks good. IIUC, with this patch, we need not set conn->trans to NULL
> >>>>> in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak
> >>>>> as
> >>>>> the transport object shall then get freed as part of
> >>>>> 'rpc_clnt_trigger_destroy'.
> >>>>>
> >>>>>
> >>>>>> http://review.gluster.org/#/c/13592/
> >>>>>>
> >>>>>> Thanks and Regards,
> >>>>>> Kotresh H R
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
> >>>>>>> To: "Soumya Koduri" <skoduri at redhat.com>
> >>>>>>> Cc: "Raghavendra G" <raghavendra at gluster.com>, "Gluster Devel"
> >>>>>>> <gluster-devel at gluster.org>
> >>>>>>> Sent: Thursday, March 3, 2016 3:39:11 PM
> >>>>>>> Subject: Re: [Gluster-devel] Cores generated with
> >>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>>>>>>
> >>>>>>> Hi Soumya,
> >>>>>>>
> >>>>>>> I tested the lastes patch [2] on master where your previous patch
> >>>>>>> [1]
> >>>> in
> >>>>>>> merged.
> >>>>>>> I see crashes at different places.
> >>>>>>>
> >>>>>>> 1. If there are code paths that are holding rpc object without
> >>>>>>> taking
> >>>> ref
> >>>>>>> on
> >>>>>>> it, all those
> >>>>>>>      code path will crash on invoking rpc submit on that object as
> >>>>>>>      rpc
> >>>>>>>      object
> >>>>>>>      would have freed
> >>>>>>>      by last unref on DISCONNECT event. I see this kind of use-case
> >>>>>>>      in
> >>>>>>>      chagnelog rpc code.
> >>>>>>>      Need to check on other users of rpc.
> >>>>> Agree. We should fix all such code-paths. Since this seem to be an
> >>>>> intricate fix, shall we take these patches only in master branch and
> >>>>> not
> >>>>> in 3.7 release for now till we fix all such paths as we encounter?
> >>>>>
> >>>>>>>
> >>>>>>> 2. And also we need to take care of reconnect timers that are being
> >>>> set
> >>>>>>> and
> >>>>>>> are re-tried to
> >>>>>>>      connect back on expiration. In those cases also, we might crash
> >>>> as rpc
> >>>>>>>      object would have freed.
> >>>>> Your patch addresses this..right?
> >>>>>
> >>>>> Thanks,
> >>>>> Soumya
> >>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [1] http://review.gluster.org/#/c/13507/
> >>>>>>> [2] http://review.gluster.org/#/c/13587/
> >>>>>>>
> >>>>>>> Thanks and Regards,
> >>>>>>> Kotresh H R
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Soumya Koduri" <skoduri at redhat.com>
> >>>>>>>> To: "Raghavendra G" <raghavendra at gluster.com>, "Kotresh Hiremath
> >>>>>>>> Ravishankar" <khiremat at redhat.com>
> >>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> >>>>>>>> Sent: Thursday, March 3, 2016 12:24:00 PM
> >>>>>>>> Subject: Re: [Gluster-devel] Cores generated with
> >>>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>>>>>>>
> >>>>>>>> Thanks a lot Kotresh.
> >>>>>>>>
> >>>>>>>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
> >>>>>>>>> Hi Soumya,
> >>>>>>>>>
> >>>>>>>>> Can you send a fix to this regression on upstream master too? This
> >>>> patch
> >>>>>>>>> is merged there.
> >>>>>>>>>
> >>>>>>>> I have submitted below patch.
> >>>>>>>>    http://review.gluster.org/#/c/13587/
> >>>>>>>>
> >>>>>>>> Kindly review the same.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Soumya
> >>>>>>>>
> >>>>>>>>> regards,
> >>>>>>>>> Raghavendra
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar
> >>>>>>>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
> >>>>>>>>>
> >>>>>>>>>       Hi Soumya,
> >>>>>>>>>
> >>>>>>>>>       I analysed the issue and found out that crash has happened
> >>>> because
> >>>>>>>>>       of the patch [1].
> >>>>>>>>>
> >>>>>>>>>       The patch doesn't set transport object to NULL in
> >>>>>>>>>       'rpc_clnt_disable'
> >>>>>>>>>       but instead does it on
> >>>>>>>>>       'rpc_clnt_trigger_destroy'. So if there are pending rpc
> >>>> invocations
> >>>>>>>>>       on the rpc object that
> >>>>>>>>>       is disabled (those instances are possible as happening now in
> >>>>>>>>>       changelog), it will trigger a
> >>>>>>>>>       CONNECT notify again with 'mydata' that is freed causing a
> >>>> crash.
> >>>>>>>>>       This happens because
> >>>>>>>>>       'rpc_clnt_submit' reconnects if rpc is not connected.
> >>>>>>>>>
> >>>>>>>>>         rpc_clnt_submit (...) {
> >>>>>>>>>           ...
> >>>>>>>>>                        if (conn->connected == 0) {
> >>>>>>>>>                                ret = rpc_transport_connect
> >>>> (conn->trans,
> >>>>>>>>>
> >>>>>>>>>         conn->config.remote_port);
> >>>>>>>>>                        }
> >>>>>>>>>           ...
> >>>>>>>>>         }
> >>>>>>>>>
> >>>>>>>>>       Without your patch, conn->trans was set NULL and hence
> >>>>>>>>>       CONNECT
> >>>>>>>>>       fails
> >>>>>>>>>       not resulting with
> >>>>>>>>>       CONNECT notify call. And also the cleanup happens in failure
> >>>> path.
> >>>>>>>>>
> >>>>>>>>>       So the memory leak can happen, if there is no try for rpc
> >>>>>>>>>       invocation
> >>>>>>>>>       after DISCONNECT.
> >>>>>>>>>       It will be cleaned up otherwise.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>       [1] http://review.gluster.org/#/c/13507/
> >>>>>>>>>
> >>>>>>>>>       Thanks and Regards,
> >>>>>>>>>       Kotresh H R
> >>>>>>>>>
> >>>>>>>>>       ----- Original Message -----
> >>>>>>>>>        > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
> >>>>>>>>>       <mailto:khiremat at redhat.com>>
> >>>>>>>>>        > To: "Soumya Koduri" <skoduri at redhat.com
> >>>>>>>>>        > <mailto:skoduri at redhat.com>>
> >>>>>>>>>        > Cc: avishwan at redhat.com <mailto:avishwan at redhat.com>,
> >>>> "Gluster
> >>>>>>>>>       Devel" <gluster-devel at gluster.org
> >>>>>>>>>       <mailto:gluster-devel at gluster.org>>
> >>>>>>>>>        > Sent: Monday, February 29, 2016 4:15:22 PM
> >>>>>>>>>        > Subject: Re: Cores generated with
> >>>>>>>>>       ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>>>>>>>>        >
> >>>>>>>>>        > Hi Soumya,
> >>>>>>>>>        >
> >>>>>>>>>        > I just tested that it is reproducible only with your patch
> >>>> both
> >>>>>>>>>       in master and
> >>>>>>>>>        > 3.76 branch.
> >>>>>>>>>        > The geo-rep test cases are marked bad in master. So it's
> >>>> not hit
> >>>>>>>>>       in master.
> >>>>>>>>>        > rpc is introduced
> >>>>>>>>>        > in changelog xlator to communicate to applications via
> >>>>>>>>>       libgfchangelog.
> >>>>>>>>>        > Venky/Me will check
> >>>>>>>>>        > why is the crash happening and will update.
> >>>>>>>>>        >
> >>>>>>>>>        >
> >>>>>>>>>        > Thanks and Regards,
> >>>>>>>>>        > Kotresh H R
> >>>>>>>>>        >
> >>>>>>>>>        > ----- Original Message -----
> >>>>>>>>>        > > From: "Soumya Koduri" <skoduri at redhat.com
> >>>>>>>>>       <mailto:skoduri at redhat.com>>
> >>>>>>>>>        > > To: avishwan at redhat.com <mailto:avishwan at redhat.com>,
> >>>>>>>>>        > > "kotresh"
> >>>>>>>>>       <khiremat at redhat.com <mailto:khiremat at redhat.com>>
> >>>>>>>>>        > > Cc: "Gluster Devel" <gluster-devel at gluster.org
> >>>>>>>>>       <mailto:gluster-devel at gluster.org>>
> >>>>>>>>>        > > Sent: Monday, February 29, 2016 2:10:51 PM
> >>>>>>>>>        > > Subject: Cores generated with
> >>>>>>>>>       ./tests/geo-rep/georep-basic-dr-tarssh.t
> >>>>>>>>>        > >
> >>>>>>>>>        > > Hi Aravinda/Kotresh,
> >>>>>>>>>        > >
> >>>>>>>>>        > > With [1], I consistently see cores generated with the
> >>>>>>>>>        > > test
> >>>>>>>>>        > > './tests/geo-rep/georep-basic-dr-tarssh.t' in
> >>>>>>>>>        > > release-3.7
> >>>>>>>>>       branch. From
> >>>>>>>>>        > > the cores, looks like we are trying to dereference a
> >>>>>>>>>        > > freed
> >>>>>>>>>        > > changelog_rpc_clnt_t(crpc) object in
> >>>> changelog_rpc_notify().
> >>>>>>>>>       Strangely
> >>>>>>>>>        > > this was not reported in master branch.
> >>>>>>>>>        > >
> >>>>>>>>>        > > I tried debugging but couldn't find any possible
> >>>> suspects. I
> >>>>>>>>>       request you
> >>>>>>>>>        > > to take a look and let me know if [1] caused any
> >>>> regression.
> >>>>>>>>>        > >
> >>>>>>>>>        > > Thanks,
> >>>>>>>>>        > > Soumya
> >>>>>>>>>        > >
> >>>>>>>>>        > > [1] http://review.gluster.org/#/c/13507/
> >>>>>>>>>        > >
> >>>>>>>>>        >
> >>>>>>>>>       _______________________________________________
> >>>>>>>>>       Gluster-devel mailing list
> >>>>>>>>>       Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> >>>>>>>>>       http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Raghavendra G
> >>>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Gluster-devel mailing list
> >>>>>>> Gluster-devel at gluster.org
> >>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>>>>
> >>>>>
> >>>> _______________________________________________
> >>>> Gluster-devel mailing list
> >>>> Gluster-devel at gluster.org
> >>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Raghavendra G
> >>
> >>> _______________________________________________
> >>> Gluster-devel mailing list
> >>> Gluster-devel at gluster.org
> >>> http://www.gluster.org/mailman/listinfo/gluster-devel
> >>
> >>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 


More information about the Gluster-devel mailing list