[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
Soumya Koduri
skoduri at redhat.com
Tue Jun 28 09:54:13 UTC 2016
Hi Raghavendra/Venky,
Gentle reminder. Oleksander (post-factum) confirmed that their
production setup has been running with this patch included since
sometime and had no issues. Shall we consider merging this patch if
there are no review comments?
Thanks,
Soumya
On 03/09/2016 09:16 PM, Kotresh Hiremath Ravishankar wrote:
> Hi All,
>
> The following patch is sent to address changelog rpc mem-leak issues.
> The fix is intricate and needs lot of testing before taking in. Please
> review the same.
>
> http://review.gluster.org/#/c/13658/1
>
> Thanks and Regards,
> Kotresh H R
>
>
> ----- Original Message -----
>> From: "Venky Shankar" <vshankar at redhat.com>
>> To: "Raghavendra G" <raghavendra at gluster.com>
>> Cc: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Monday, March 7, 2016 3:52:13 PM
>> Subject: Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
>>
>> On Fri, Mar 04, 2016 at 02:02:32PM +0530, Raghavendra G wrote:
>>> On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
>>> khiremat at redhat.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, with this patch we need not set conn->trans to NULL in
>>>> rpc_clnt_disable
>>>>
>>>
>>> While [1] fixes the crash, things can be improved in the way how changelog
>>> is using rpc.
>>>
>>> 1. In the current code, there is an rpc_clnt object leak during disconnect
>>> event.
>>
>> Just for the record (as I couldn't find this information while building up
>> rpc infrastructure in changelog):
>>
>> Unref rpc-clnt object after calling rpc_clnt_disable() in notify() upon
>> RPC_CLNT_DISCONNECT. Free up 'mydata' in notify() upon RPC_CLNT_DESTROY.
>>
>>> 2. Also, freed "mydata" of changelog is still associated with rpc_clnt
>>> object (corollary of 1), though change log might not get any events with
>>> "mydata" (as connection is dead).
>>>
>>> I've discussed with Kotresh about changes needed, offline. So, following
>>> are the action items.
>>> 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
>>> 2. [2] can be accepted. However, someone might want to re-use an rpc object
>>> after disabling it, like introducing a new api rpc_clnt_enable_again
>>> (though no of such use-cases is very less). But [2] doesn't allow it. The
>>> point is as long as rpc-clnt object is alive, transport object is alive
>>> (though disconnected) and we can re-use it. So, I would prefer not to
>>> accept it.
>>> 3. Kotresh will work on new changes to make sure changelog makes correct
>>> use of rpc-clnt.
>>>
>>> [1] http://review.gluster.org/#/c/13592
>>> [2] http://review.gluster.org/#/c/1359
>>>
>>> regards,
>>> Raghavendra.
>>>
>>>
>>>> Thanks and Regards,
>>>> Kotresh H R
>>>>
>>>> ----- Original Message -----
>>>>> From: "Soumya Koduri" <skoduri at redhat.com>
>>>>> To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>, "Raghavendra
>>>> G" <raghavendra at gluster.com>
>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>> Sent: Thursday, March 3, 2016 5:06:00 PM
>>>>> Subject: Re: [Gluster-devel] Cores generated with
>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>>>>>
>>>>>
>>>>>
>>>>> On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
>>>>>> [Replying on top of my own reply]
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have submitted the below patch [1] to avoid the issue of
>>>>>> 'rpc_clnt_submit'
>>>>>> getting reconnected. But it won't take care of memory leak problem
>>>>>> you
>>>> were
>>>>>> trying to fix. That we have to carefully go through all cases and fix
>>>> it.
>>>>>> Please have a look at it.
>>>>>>
>>>>> Looks good. IIUC, with this patch, we need not set conn->trans to NULL
>>>>> in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak
>>>>> as
>>>>> the transport object shall then get freed as part of
>>>>> 'rpc_clnt_trigger_destroy'.
>>>>>
>>>>>
>>>>>> http://review.gluster.org/#/c/13592/
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Kotresh H R
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com>
>>>>>>> To: "Soumya Koduri" <skoduri at redhat.com>
>>>>>>> Cc: "Raghavendra G" <raghavendra at gluster.com>, "Gluster Devel"
>>>>>>> <gluster-devel at gluster.org>
>>>>>>> Sent: Thursday, March 3, 2016 3:39:11 PM
>>>>>>> Subject: Re: [Gluster-devel] Cores generated with
>>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>>>>>>>
>>>>>>> Hi Soumya,
>>>>>>>
>>>>>>> I tested the lastes patch [2] on master where your previous patch
>>>>>>> [1]
>>>> in
>>>>>>> merged.
>>>>>>> I see crashes at different places.
>>>>>>>
>>>>>>> 1. If there are code paths that are holding rpc object without
>>>>>>> taking
>>>> ref
>>>>>>> on
>>>>>>> it, all those
>>>>>>> code path will crash on invoking rpc submit on that object as
>>>>>>> rpc
>>>>>>> object
>>>>>>> would have freed
>>>>>>> by last unref on DISCONNECT event. I see this kind of use-case
>>>>>>> in
>>>>>>> chagnelog rpc code.
>>>>>>> Need to check on other users of rpc.
>>>>> Agree. We should fix all such code-paths. Since this seem to be an
>>>>> intricate fix, shall we take these patches only in master branch and
>>>>> not
>>>>> in 3.7 release for now till we fix all such paths as we encounter?
>>>>>
>>>>>>>
>>>>>>> 2. And also we need to take care of reconnect timers that are being
>>>> set
>>>>>>> and
>>>>>>> are re-tried to
>>>>>>> connect back on expiration. In those cases also, we might crash
>>>> as rpc
>>>>>>> object would have freed.
>>>>> Your patch addresses this..right?
>>>>>
>>>>> Thanks,
>>>>> Soumya
>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [1] http://review.gluster.org/#/c/13507/
>>>>>>> [2] http://review.gluster.org/#/c/13587/
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Kotresh H R
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Soumya Koduri" <skoduri at redhat.com>
>>>>>>>> To: "Raghavendra G" <raghavendra at gluster.com>, "Kotresh Hiremath
>>>>>>>> Ravishankar" <khiremat at redhat.com>
>>>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>>>>>>>> Sent: Thursday, March 3, 2016 12:24:00 PM
>>>>>>>> Subject: Re: [Gluster-devel] Cores generated with
>>>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>>>>>>>>
>>>>>>>> Thanks a lot Kotresh.
>>>>>>>>
>>>>>>>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
>>>>>>>>> Hi Soumya,
>>>>>>>>>
>>>>>>>>> Can you send a fix to this regression on upstream master too? This
>>>> patch
>>>>>>>>> is merged there.
>>>>>>>>>
>>>>>>>> I have submitted below patch.
>>>>>>>> http://review.gluster.org/#/c/13587/
>>>>>>>>
>>>>>>>> Kindly review the same.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Soumya
>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Raghavendra
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar
>>>>>>>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Soumya,
>>>>>>>>>
>>>>>>>>> I analysed the issue and found out that crash has happened
>>>> because
>>>>>>>>> of the patch [1].
>>>>>>>>>
>>>>>>>>> The patch doesn't set transport object to NULL in
>>>>>>>>> 'rpc_clnt_disable'
>>>>>>>>> but instead does it on
>>>>>>>>> 'rpc_clnt_trigger_destroy'. So if there are pending rpc
>>>> invocations
>>>>>>>>> on the rpc object that
>>>>>>>>> is disabled (those instances are possible as happening now in
>>>>>>>>> changelog), it will trigger a
>>>>>>>>> CONNECT notify again with 'mydata' that is freed causing a
>>>> crash.
>>>>>>>>> This happens because
>>>>>>>>> 'rpc_clnt_submit' reconnects if rpc is not connected.
>>>>>>>>>
>>>>>>>>> rpc_clnt_submit (...) {
>>>>>>>>> ...
>>>>>>>>> if (conn->connected == 0) {
>>>>>>>>> ret = rpc_transport_connect
>>>> (conn->trans,
>>>>>>>>>
>>>>>>>>> conn->config.remote_port);
>>>>>>>>> }
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Without your patch, conn->trans was set NULL and hence
>>>>>>>>> CONNECT
>>>>>>>>> fails
>>>>>>>>> not resulting with
>>>>>>>>> CONNECT notify call. And also the cleanup happens in failure
>>>> path.
>>>>>>>>>
>>>>>>>>> So the memory leak can happen, if there is no try for rpc
>>>>>>>>> invocation
>>>>>>>>> after DISCONNECT.
>>>>>>>>> It will be cleaned up otherwise.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] http://review.gluster.org/#/c/13507/
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Kotresh H R
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>> > From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
>>>>>>>>> <mailto:khiremat at redhat.com>>
>>>>>>>>> > To: "Soumya Koduri" <skoduri at redhat.com
>>>>>>>>> > <mailto:skoduri at redhat.com>>
>>>>>>>>> > Cc: avishwan at redhat.com <mailto:avishwan at redhat.com>,
>>>> "Gluster
>>>>>>>>> Devel" <gluster-devel at gluster.org
>>>>>>>>> <mailto:gluster-devel at gluster.org>>
>>>>>>>>> > Sent: Monday, February 29, 2016 4:15:22 PM
>>>>>>>>> > Subject: Re: Cores generated with
>>>>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>>>>>>>>> >
>>>>>>>>> > Hi Soumya,
>>>>>>>>> >
>>>>>>>>> > I just tested that it is reproducible only with your patch
>>>> both
>>>>>>>>> in master and
>>>>>>>>> > 3.76 branch.
>>>>>>>>> > The geo-rep test cases are marked bad in master. So it's
>>>> not hit
>>>>>>>>> in master.
>>>>>>>>> > rpc is introduced
>>>>>>>>> > in changelog xlator to communicate to applications via
>>>>>>>>> libgfchangelog.
>>>>>>>>> > Venky/Me will check
>>>>>>>>> > why is the crash happening and will update.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Thanks and Regards,
>>>>>>>>> > Kotresh H R
>>>>>>>>> >
>>>>>>>>> > ----- Original Message -----
>>>>>>>>> > > From: "Soumya Koduri" <skoduri at redhat.com
>>>>>>>>> <mailto:skoduri at redhat.com>>
>>>>>>>>> > > To: avishwan at redhat.com <mailto:avishwan at redhat.com>,
>>>>>>>>> > > "kotresh"
>>>>>>>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>>
>>>>>>>>> > > Cc: "Gluster Devel" <gluster-devel at gluster.org
>>>>>>>>> <mailto:gluster-devel at gluster.org>>
>>>>>>>>> > > Sent: Monday, February 29, 2016 2:10:51 PM
>>>>>>>>> > > Subject: Cores generated with
>>>>>>>>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>>>>>>>>> > >
>>>>>>>>> > > Hi Aravinda/Kotresh,
>>>>>>>>> > >
>>>>>>>>> > > With [1], I consistently see cores generated with the
>>>>>>>>> > > test
>>>>>>>>> > > './tests/geo-rep/georep-basic-dr-tarssh.t' in
>>>>>>>>> > > release-3.7
>>>>>>>>> branch. From
>>>>>>>>> > > the cores, looks like we are trying to dereference a
>>>>>>>>> > > freed
>>>>>>>>> > > changelog_rpc_clnt_t(crpc) object in
>>>> changelog_rpc_notify().
>>>>>>>>> Strangely
>>>>>>>>> > > this was not reported in master branch.
>>>>>>>>> > >
>>>>>>>>> > > I tried debugging but couldn't find any possible
>>>> suspects. I
>>>>>>>>> request you
>>>>>>>>> > > to take a look and let me know if [1] caused any
>>>> regression.
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > > Soumya
>>>>>>>>> > >
>>>>>>>>> > > [1] http://review.gluster.org/#/c/13507/
>>>>>>>>> > >
>>>>>>>>> >
>>>>>>>>> _______________________________________________
>>>>>>>>> Gluster-devel mailing list
>>>>>>>>> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Raghavendra G
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-devel mailing list
>>>>>>> Gluster-devel at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>
>>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>
>>>
>>>
>>> --
>>> Raghavendra G
>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>>
More information about the Gluster-devel
mailing list