[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
Soumya Koduri
skoduri at redhat.com
Fri Mar 4 12:20:16 UTC 2016
Hi Raghavendra,
On 03/04/2016 03:09 PM, Raghavendra G wrote:
>
>
> On Fri, Mar 4, 2016 at 2:02 PM, Raghavendra G <raghavendra at gluster.com
> <mailto:raghavendra at gluster.com>> wrote:
>
>
>
> On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar
> <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
>
> Hi,
>
> Yes, with this patch we need not set conn->trans to NULL in
> rpc_clnt_disable
>
>
> While [1] fixes the crash, things can be improved in the way how
> changelog is using rpc.
>
> 1. In the current code, there is an rpc_clnt object leak during
> disconnect event.
> 2. Also, freed "mydata" of changelog is still associated with
> rpc_clnt object (corollary of 1), though change log might not get
> any events with "mydata" (as connection is dead).
>
> I've discussed with Kotresh about changes needed, offline. So,
> following are the action items.
> 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
> 2. [2] can be accepted. However, someone might want to re-use an rpc
> object after disabling it, like introducing a new api
> rpc_clnt_enable_again (though no of such use-cases is very less).
> But [2] doesn't allow it. The point is as long as rpc-clnt object is
> alive, transport object is alive (though disconnected) and we can
> re-use it. So, I would prefer not to accept it.
>
>
> [2] will be accepted now.
>
The link mentioned seems to be invalid. Just to be clear, we have 3
patches in question here -
[1] Original patch (merged in master but not in release-3.7)
http://review.gluster.org/13456
There are two patches proposed to fix the regression
[2] http://review.gluster.org/#/c/13587/
[3] http://review.gluster.org/#/c/13592/
Since the patch by Kotresh [3] completely fixes the regression, we have
decided to abandon [2]. In addition, since these fixes look very
intricate, till we are sure that the code is stable, we thought it may
be better not to back-port these patches to stable release branch.
Please confirm if you are implying to revive [2] and back-port all
these 3 patches to release-3.7 branch?
Thanks,
Soumya
> 3. Kotresh will work on new changes to make sure changelog makes
> correct use of rpc-clnt.
>
> [1] http://review.gluster.org/#/c/13592
> [2] http://review.gluster.org/#/c/1359
>
> regards,
> Raghavendra.
>
>
> Thanks and Regards,
> Kotresh H R
>
> ----- Original Message -----
> > From: "Soumya Koduri" <skoduri at redhat.com <mailto:skoduri at redhat.com>>
> > To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com <mailto:khiremat at redhat.com>>, "Raghavendra
> G" <raghavendra at gluster.com <mailto:raghavendra at gluster.com>>
> > Cc: "Gluster Devel" <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>
> > Sent: Thursday, March 3, 2016 5:06:00 PM
> > Subject: Re: [Gluster-devel] Cores generated with
> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >
> >
> >
> > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
> > > [Replying on top of my own reply]
> > >
> > > Hi,
> > >
> > > I have submitted the below patch [1] to avoid the issue of
> > > 'rpc_clnt_submit'
> > > getting reconnected. But it won't take care of memory leak
> problem you were
> > > trying to fix. That we have to carefully go through all
> cases and fix it.
> > > Please have a look at it.
> > >
> > Looks good. IIUC, with this patch, we need not set
> conn->trans to NULL
> > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of
> memleak as
> > the transport object shall then get freed as part of
> > 'rpc_clnt_trigger_destroy'.
> >
> >
> > > http://review.gluster.org/#/c/13592/
> > >
> > > Thanks and Regards,
> > > Kotresh H R
> > >
> > > ----- Original Message -----
> > >> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
> <mailto:khiremat at redhat.com>>
> > >> To: "Soumya Koduri" <skoduri at redhat.com
> <mailto:skoduri at redhat.com>>
> > >> Cc: "Raghavendra G" <raghavendra at gluster.com
> <mailto:raghavendra at gluster.com>>, "Gluster Devel"
> > >> <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>
> > >> Sent: Thursday, March 3, 2016 3:39:11 PM
> > >> Subject: Re: [Gluster-devel] Cores generated with
> > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >>
> > >> Hi Soumya,
> > >>
> > >> I tested the lastes patch [2] on master where your
> previous patch [1] in
> > >> merged.
> > >> I see crashes at different places.
> > >>
> > >> 1. If there are code paths that are holding rpc object
> without taking ref
> > >> on
> > >> it, all those
> > >> code path will crash on invoking rpc submit on that
> object as rpc
> > >> object
> > >> would have freed
> > >> by last unref on DISCONNECT event. I see this kind of
> use-case in
> > >> chagnelog rpc code.
> > >> Need to check on other users of rpc.
> > Agree. We should fix all such code-paths. Since this seem to
> be an
> > intricate fix, shall we take these patches only in master
> branch and not
> > in 3.7 release for now till we fix all such paths as we
> encounter?
> >
> > >>
> > >> 2. And also we need to take care of reconnect timers that
> are being set
> > >> and
> > >> are re-tried to
> > >> connect back on expiration. In those cases also, we
> might crash as rpc
> > >> object would have freed.
> > Your patch addresses this..right?
> >
> > Thanks,
> > Soumya
> >
> > >>
> > >>
> > >> [1] http://review.gluster.org/#/c/13507/
> > >> [2] http://review.gluster.org/#/c/13587/
> > >>
> > >> Thanks and Regards,
> > >> Kotresh H R
> > >>
> > >> ----- Original Message -----
> > >>> From: "Soumya Koduri" <skoduri at redhat.com
> <mailto:skoduri at redhat.com>>
> > >>> To: "Raghavendra G" <raghavendra at gluster.com
> <mailto:raghavendra at gluster.com>>, "Kotresh Hiremath
> > >>> Ravishankar" <khiremat at redhat.com
> <mailto:khiremat at redhat.com>>
> > >>> Cc: "Gluster Devel" <gluster-devel at gluster.org
> <mailto:gluster-devel at gluster.org>>
> > >>> Sent: Thursday, March 3, 2016 12:24:00 PM
> > >>> Subject: Re: [Gluster-devel] Cores generated with
> > >>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >>>
> > >>> Thanks a lot Kotresh.
> > >>>
> > >>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
> > >>>> Hi Soumya,
> > >>>>
> > >>>> Can you send a fix to this regression on upstream master
> too? This patch
> > >>>> is merged there.
> > >>>>
> > >>> I have submitted below patch.
> > >>> http://review.gluster.org/#/c/13587/
> > >>>
> > >>> Kindly review the same.
> > >>>
> > >>> Thanks,
> > >>> Soumya
> > >>>
> > >>>> regards,
> > >>>> Raghavendra
> > >>>>
> > >>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath
> Ravishankar
> > >>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>
> <mailto:khiremat at redhat.com <mailto:khiremat at redhat.com>>> wrote:
> > >>>>
> > >>>> Hi Soumya,
> > >>>>
> > >>>> I analysed the issue and found out that crash has
> happened because
> > >>>> of the patch [1].
> > >>>>
> > >>>> The patch doesn't set transport object to NULL in
> > >>>> 'rpc_clnt_disable'
> > >>>> but instead does it on
> > >>>> 'rpc_clnt_trigger_destroy'. So if there are pending
> rpc invocations
> > >>>> on the rpc object that
> > >>>> is disabled (those instances are possible as
> happening now in
> > >>>> changelog), it will trigger a
> > >>>> CONNECT notify again with 'mydata' that is freed
> causing a crash.
> > >>>> This happens because
> > >>>> 'rpc_clnt_submit' reconnects if rpc is not connected.
> > >>>>
> > >>>> rpc_clnt_submit (...) {
> > >>>> ...
> > >>>> if (conn->connected == 0) {
> > >>>> ret =
> rpc_transport_connect (conn->trans,
> > >>>>
> > >>>> conn->config.remote_port);
> > >>>> }
> > >>>> ...
> > >>>> }
> > >>>>
> > >>>> Without your patch, conn->trans was set NULL and
> hence CONNECT
> > >>>> fails
> > >>>> not resulting with
> > >>>> CONNECT notify call. And also the cleanup happens
> in failure path.
> > >>>>
> > >>>> So the memory leak can happen, if there is no try
> for rpc
> > >>>> invocation
> > >>>> after DISCONNECT.
> > >>>> It will be cleaned up otherwise.
> > >>>>
> > >>>>
> > >>>> [1] http://review.gluster.org/#/c/13507/
> > >>>>
> > >>>> Thanks and Regards,
> > >>>> Kotresh H R
> > >>>>
> > >>>> ----- Original Message -----
> > >>>> > From: "Kotresh Hiremath Ravishankar"
> <khiremat at redhat.com <mailto:khiremat at redhat.com>
> > >>>> <mailto:khiremat at redhat.com
> <mailto:khiremat at redhat.com>>>
> > >>>> > To: "Soumya Koduri" <skoduri at redhat.com
> <mailto:skoduri at redhat.com>
> > >>>> > <mailto:skoduri at redhat.com
> <mailto:skoduri at redhat.com>>>
> > >>>> > Cc: avishwan at redhat.com
> <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
> <mailto:avishwan at redhat.com>>, "Gluster
> > >>>> Devel" <gluster-devel at gluster.org
> <mailto:gluster-devel at gluster.org>
> > >>>> <mailto:gluster-devel at gluster.org
> <mailto:gluster-devel at gluster.org>>>
> > >>>> > Sent: Monday, February 29, 2016 4:15:22 PM
> > >>>> > Subject: Re: Cores generated with
> > >>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >>>> >
> > >>>> > Hi Soumya,
> > >>>> >
> > >>>> > I just tested that it is reproducible only with
> your patch both
> > >>>> in master and
> > >>>> > 3.76 branch.
> > >>>> > The geo-rep test cases are marked bad in master.
> So it's not hit
> > >>>> in master.
> > >>>> > rpc is introduced
> > >>>> > in changelog xlator to communicate to
> applications via
> > >>>> libgfchangelog.
> > >>>> > Venky/Me will check
> > >>>> > why is the crash happening and will update.
> > >>>> >
> > >>>> >
> > >>>> > Thanks and Regards,
> > >>>> > Kotresh H R
> > >>>> >
> > >>>> > ----- Original Message -----
> > >>>> > > From: "Soumya Koduri" <skoduri at redhat.com
> <mailto:skoduri at redhat.com>
> > >>>> <mailto:skoduri at redhat.com
> <mailto:skoduri at redhat.com>>>
> > >>>> > > To: avishwan at redhat.com
> <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
> <mailto:avishwan at redhat.com>>,
> > >>>> > > "kotresh"
> > >>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>
> <mailto:khiremat at redhat.com <mailto:khiremat at redhat.com>>>
> > >>>> > > Cc: "Gluster Devel" <gluster-devel at gluster.org
> <mailto:gluster-devel at gluster.org>
> > >>>> <mailto:gluster-devel at gluster.org
> <mailto:gluster-devel at gluster.org>>>
> > >>>> > > Sent: Monday, February 29, 2016 2:10:51 PM
> > >>>> > > Subject: Cores generated with
> > >>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >>>> > >
> > >>>> > > Hi Aravinda/Kotresh,
> > >>>> > >
> > >>>> > > With [1], I consistently see cores generated
> with the test
> > >>>> > > './tests/geo-rep/georep-basic-dr-tarssh.t' in
> release-3.7
> > >>>> branch. From
> > >>>> > > the cores, looks like we are trying to
> dereference a freed
> > >>>> > > changelog_rpc_clnt_t(crpc) object in
> changelog_rpc_notify().
> > >>>> Strangely
> > >>>> > > this was not reported in master branch.
> > >>>> > >
> > >>>> > > I tried debugging but couldn't find any
> possible suspects. I
> > >>>> request you
> > >>>> > > to take a look and let me know if [1] caused
> any regression.
> > >>>> > >
> > >>>> > > Thanks,
> > >>>> > > Soumya
> > >>>> > >
> > >>>> > > [1] http://review.gluster.org/#/c/13507/
> > >>>> > >
> > >>>> >
> > >>>> _______________________________________________
> > >>>> Gluster-devel mailing list
> > >>>> Gluster-devel at gluster.org
> <mailto:Gluster-devel at gluster.org>
> <mailto:Gluster-devel at gluster.org
> <mailto:Gluster-devel at gluster.org>>
> > >>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Raghavendra G
> > >>>
> > >> _______________________________________________
> > >> Gluster-devel mailing list
> > >> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> > >> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>
> >
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
>
>
>
> --
> Raghavendra G
>
>
>
>
> --
> Raghavendra G
More information about the Gluster-devel
mailing list