[Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

Soumya Koduri skoduri at redhat.com
Fri Mar 4 12:20:16 UTC 2016


Hi Raghavendra,

On 03/04/2016 03:09 PM, Raghavendra G wrote:
>
>
> On Fri, Mar 4, 2016 at 2:02 PM, Raghavendra G <raghavendra at gluster.com
> <mailto:raghavendra at gluster.com>> wrote:
>
>
>
>     On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar
>     <khiremat at redhat.com <mailto:khiremat at redhat.com>> wrote:
>
>         Hi,
>
>         Yes, with this patch we need not set conn->trans to NULL in
>         rpc_clnt_disable
>
>
>     While [1] fixes the crash, things can be improved in the way how
>     changelog is using rpc.
>
>     1. In the current code, there is an rpc_clnt object leak during
>     disconnect event.
>     2. Also, freed "mydata" of changelog is still associated with
>     rpc_clnt object (corollary of 1), though change log might not get
>     any events with "mydata" (as connection is dead).
>
>     I've discussed with Kotresh about changes needed, offline. So,
>     following are the action items.
>     1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
>     2. [2] can be accepted. However, someone might want to re-use an rpc
>     object after disabling it, like introducing a new api
>     rpc_clnt_enable_again (though no of such use-cases is very less).
>     But [2] doesn't allow it. The point is as long as rpc-clnt object is
>     alive, transport object is alive (though disconnected) and we can
>     re-use it. So, I would prefer not to accept it.
>
>
> [2] will be accepted now.
>
The link mentioned seems to be invalid. Just to be clear, we have 3 
patches in question here -

[1] Original patch (merged in master but not in release-3.7) 
http://review.gluster.org/13456

There are two patches proposed to fix the regression
[2] http://review.gluster.org/#/c/13587/
[3] http://review.gluster.org/#/c/13592/

Since the patch by Kotresh [3] completely fixes the regression, we have 
decided to abandon [2]. In addition, since these fixes look very 
intricate, till we are sure that the code is stable, we thought it may 
be better not to back-port these patches to stable release branch.

Please confirm if you are implying to revive [2] and  back-port all 
these 3 patches to release-3.7 branch?

Thanks,
Soumya

>     3. Kotresh will work on new changes to make sure changelog makes
>     correct use of rpc-clnt.
>
>     [1] http://review.gluster.org/#/c/13592
>     [2] http://review.gluster.org/#/c/1359
>
>     regards,
>     Raghavendra.
>
>
>         Thanks and Regards,
>         Kotresh H R
>
>         ----- Original Message -----
>         > From: "Soumya Koduri" <skoduri at redhat.com <mailto:skoduri at redhat.com>>
>         > To: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com <mailto:khiremat at redhat.com>>, "Raghavendra
>         G" <raghavendra at gluster.com <mailto:raghavendra at gluster.com>>
>         > Cc: "Gluster Devel" <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>
>          > Sent: Thursday, March 3, 2016 5:06:00 PM
>          > Subject: Re: [Gluster-devel] Cores generated with
>         ./tests/geo-rep/georep-basic-dr-tarssh.t
>          >
>          >
>          >
>          > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
>          > > [Replying on top of my own reply]
>          > >
>          > > Hi,
>          > >
>          > > I have submitted the below patch [1] to avoid the issue of
>          > > 'rpc_clnt_submit'
>          > > getting reconnected. But it won't take care of memory leak
>         problem you were
>          > > trying to fix. That we have to carefully go through all
>         cases and fix it.
>          > > Please have a look at it.
>          > >
>          > Looks good. IIUC, with this patch, we need not set
>         conn->trans to NULL
>          > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of
>         memleak as
>          > the transport object shall then get freed as part of
>          > 'rpc_clnt_trigger_destroy'.
>          >
>          >
>          > > http://review.gluster.org/#/c/13592/
>          > >
>          > > Thanks and Regards,
>          > > Kotresh H R
>          > >
>          > > ----- Original Message -----
>          > >> From: "Kotresh Hiremath Ravishankar" <khiremat at redhat.com
>         <mailto:khiremat at redhat.com>>
>          > >> To: "Soumya Koduri" <skoduri at redhat.com
>         <mailto:skoduri at redhat.com>>
>          > >> Cc: "Raghavendra G" <raghavendra at gluster.com
>         <mailto:raghavendra at gluster.com>>, "Gluster Devel"
>          > >> <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>
>          > >> Sent: Thursday, March 3, 2016 3:39:11 PM
>          > >> Subject: Re: [Gluster-devel] Cores generated with
>          > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
>          > >>
>          > >> Hi Soumya,
>          > >>
>          > >> I tested the lastes patch [2] on master where your
>         previous patch [1] in
>          > >> merged.
>          > >> I see crashes at different places.
>          > >>
>          > >> 1. If there are code paths that are holding rpc object
>         without taking ref
>          > >> on
>          > >> it, all those
>          > >>     code path will crash on invoking rpc submit on that
>         object as rpc
>          > >>     object
>          > >>     would have freed
>          > >>     by last unref on DISCONNECT event. I see this kind of
>         use-case in
>          > >>     chagnelog rpc code.
>          > >>     Need to check on other users of rpc.
>          > Agree. We should fix all such code-paths. Since this seem to
>         be an
>          > intricate fix, shall we take these patches only in master
>         branch and not
>          > in 3.7 release for now till we fix all such paths as we
>         encounter?
>          >
>          > >>
>          > >> 2. And also we need to take care of reconnect timers that
>         are being set
>          > >> and
>          > >> are re-tried to
>          > >>     connect back on expiration. In those cases also, we
>         might crash as rpc
>          > >>     object would have freed.
>          > Your patch addresses this..right?
>          >
>          > Thanks,
>          > Soumya
>          >
>          > >>
>          > >>
>          > >> [1] http://review.gluster.org/#/c/13507/
>          > >> [2] http://review.gluster.org/#/c/13587/
>          > >>
>          > >> Thanks and Regards,
>          > >> Kotresh H R
>          > >>
>          > >> ----- Original Message -----
>          > >>> From: "Soumya Koduri" <skoduri at redhat.com
>         <mailto:skoduri at redhat.com>>
>          > >>> To: "Raghavendra G" <raghavendra at gluster.com
>         <mailto:raghavendra at gluster.com>>, "Kotresh Hiremath
>          > >>> Ravishankar" <khiremat at redhat.com
>         <mailto:khiremat at redhat.com>>
>          > >>> Cc: "Gluster Devel" <gluster-devel at gluster.org
>         <mailto:gluster-devel at gluster.org>>
>          > >>> Sent: Thursday, March 3, 2016 12:24:00 PM
>          > >>> Subject: Re: [Gluster-devel] Cores generated with
>          > >>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>          > >>>
>          > >>> Thanks a lot Kotresh.
>          > >>>
>          > >>> On 03/03/2016 08:47 AM, Raghavendra G wrote:
>          > >>>> Hi Soumya,
>          > >>>>
>          > >>>> Can you send a fix to this regression on upstream master
>         too? This patch
>          > >>>> is merged there.
>          > >>>>
>          > >>> I have submitted below patch.
>          > >>> http://review.gluster.org/#/c/13587/
>          > >>>
>          > >>> Kindly review the same.
>          > >>>
>          > >>> Thanks,
>          > >>> Soumya
>          > >>>
>          > >>>> regards,
>          > >>>> Raghavendra
>          > >>>>
>          > >>>> On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath
>         Ravishankar
>          > >>>> <khiremat at redhat.com <mailto:khiremat at redhat.com>
>         <mailto:khiremat at redhat.com <mailto:khiremat at redhat.com>>> wrote:
>          > >>>>
>          > >>>>      Hi Soumya,
>          > >>>>
>          > >>>>      I analysed the issue and found out that crash has
>         happened because
>          > >>>>      of the patch [1].
>          > >>>>
>          > >>>>      The patch doesn't set transport object to NULL in
>          > >>>>      'rpc_clnt_disable'
>          > >>>>      but instead does it on
>          > >>>>      'rpc_clnt_trigger_destroy'. So if there are pending
>         rpc invocations
>          > >>>>      on the rpc object that
>          > >>>>      is disabled (those instances are possible as
>         happening now in
>          > >>>>      changelog), it will trigger a
>          > >>>>      CONNECT notify again with 'mydata' that is freed
>         causing a crash.
>          > >>>>      This happens because
>          > >>>>      'rpc_clnt_submit' reconnects if rpc is not connected.
>          > >>>>
>          > >>>>        rpc_clnt_submit (...) {
>          > >>>>          ...
>          > >>>>                       if (conn->connected == 0) {
>          > >>>>                               ret =
>         rpc_transport_connect (conn->trans,
>          > >>>>
>          > >>>>        conn->config.remote_port);
>          > >>>>                       }
>          > >>>>          ...
>          > >>>>        }
>          > >>>>
>          > >>>>      Without your patch, conn->trans was set NULL and
>         hence CONNECT
>          > >>>>      fails
>          > >>>>      not resulting with
>          > >>>>      CONNECT notify call. And also the cleanup happens
>         in failure path.
>          > >>>>
>          > >>>>      So the memory leak can happen, if there is no try
>         for rpc
>          > >>>>      invocation
>          > >>>>      after DISCONNECT.
>          > >>>>      It will be cleaned up otherwise.
>          > >>>>
>          > >>>>
>          > >>>>      [1] http://review.gluster.org/#/c/13507/
>          > >>>>
>          > >>>>      Thanks and Regards,
>          > >>>>      Kotresh H R
>          > >>>>
>          > >>>>      ----- Original Message -----
>          > >>>>       > From: "Kotresh Hiremath Ravishankar"
>         <khiremat at redhat.com <mailto:khiremat at redhat.com>
>          > >>>>      <mailto:khiremat at redhat.com
>         <mailto:khiremat at redhat.com>>>
>          > >>>>       > To: "Soumya Koduri" <skoduri at redhat.com
>         <mailto:skoduri at redhat.com>
>          > >>>>       > <mailto:skoduri at redhat.com
>         <mailto:skoduri at redhat.com>>>
>          > >>>>       > Cc: avishwan at redhat.com
>         <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>, "Gluster
>          > >>>>      Devel" <gluster-devel at gluster.org
>         <mailto:gluster-devel at gluster.org>
>          > >>>>      <mailto:gluster-devel at gluster.org
>         <mailto:gluster-devel at gluster.org>>>
>          > >>>>       > Sent: Monday, February 29, 2016 4:15:22 PM
>          > >>>>       > Subject: Re: Cores generated with
>          > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
>          > >>>>       >
>          > >>>>       > Hi Soumya,
>          > >>>>       >
>          > >>>>       > I just tested that it is reproducible only with
>         your patch both
>          > >>>>      in master and
>          > >>>>       > 3.76 branch.
>          > >>>>       > The geo-rep test cases are marked bad in master.
>         So it's not hit
>          > >>>>      in master.
>          > >>>>       > rpc is introduced
>          > >>>>       > in changelog xlator to communicate to
>         applications via
>          > >>>>      libgfchangelog.
>          > >>>>       > Venky/Me will check
>          > >>>>       > why is the crash happening and will update.
>          > >>>>       >
>          > >>>>       >
>          > >>>>       > Thanks and Regards,
>          > >>>>       > Kotresh H R
>          > >>>>       >
>          > >>>>       > ----- Original Message -----
>          > >>>>       > > From: "Soumya Koduri" <skoduri at redhat.com
>         <mailto:skoduri at redhat.com>
>          > >>>>      <mailto:skoduri at redhat.com
>         <mailto:skoduri at redhat.com>>>
>          > >>>>       > > To: avishwan at redhat.com
>         <mailto:avishwan at redhat.com> <mailto:avishwan at redhat.com
>         <mailto:avishwan at redhat.com>>,
>          > >>>>       > > "kotresh"
>          > >>>>      <khiremat at redhat.com <mailto:khiremat at redhat.com>
>         <mailto:khiremat at redhat.com <mailto:khiremat at redhat.com>>>
>          > >>>>       > > Cc: "Gluster Devel" <gluster-devel at gluster.org
>         <mailto:gluster-devel at gluster.org>
>          > >>>>      <mailto:gluster-devel at gluster.org
>         <mailto:gluster-devel at gluster.org>>>
>          > >>>>       > > Sent: Monday, February 29, 2016 2:10:51 PM
>          > >>>>       > > Subject: Cores generated with
>          > >>>>      ./tests/geo-rep/georep-basic-dr-tarssh.t
>          > >>>>       > >
>          > >>>>       > > Hi Aravinda/Kotresh,
>          > >>>>       > >
>          > >>>>       > > With [1], I consistently see cores generated
>         with the test
>          > >>>>       > > './tests/geo-rep/georep-basic-dr-tarssh.t' in
>         release-3.7
>          > >>>>      branch. From
>          > >>>>       > > the cores, looks like we are trying to
>         dereference a freed
>          > >>>>       > > changelog_rpc_clnt_t(crpc) object in
>         changelog_rpc_notify().
>          > >>>>      Strangely
>          > >>>>       > > this was not reported in master branch.
>          > >>>>       > >
>          > >>>>       > > I tried debugging but couldn't find any
>         possible suspects. I
>          > >>>>      request you
>          > >>>>       > > to take a look and let me know if [1] caused
>         any regression.
>          > >>>>       > >
>          > >>>>       > > Thanks,
>          > >>>>       > > Soumya
>          > >>>>       > >
>          > >>>>       > > [1] http://review.gluster.org/#/c/13507/
>          > >>>>       > >
>          > >>>>       >
>          > >>>>      _______________________________________________
>          > >>>>      Gluster-devel mailing list
>          > >>>> Gluster-devel at gluster.org
>         <mailto:Gluster-devel at gluster.org>
>         <mailto:Gluster-devel at gluster.org
>         <mailto:Gluster-devel at gluster.org>>
>          > >>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>          > >>>>
>          > >>>>
>          > >>>>
>          > >>>>
>          > >>>> --
>          > >>>> Raghavendra G
>          > >>>
>          > >> _______________________________________________
>          > >> Gluster-devel mailing list
>          > >> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>          > >> http://www.gluster.org/mailman/listinfo/gluster-devel
>          > >>
>          >
>         _______________________________________________
>         Gluster-devel mailing list
>         Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
>         http://www.gluster.org/mailman/listinfo/gluster-devel
>
>
>
>
>     --
>     Raghavendra G
>
>
>
>
> --
> Raghavendra G


More information about the Gluster-devel mailing list