[Gluster-Maintainers] [Gluster-devel] GF_CALLOC to GF_MALLOC conversion - is it safe?

Wed Mar 27 08:54:21 UTC 2019

On Tue, Mar 26, 2019 at 02:52:33PM -0700, Vijay Bellur wrote:
> On Thu, Mar 21, 2019 at 8:44 AM Yaniv Kaul <ykaul at redhat.com> wrote:
> 
> >
> >
> > On Thu, Mar 21, 2019 at 5:23 PM Nithya Balachandran <nbalacha at redhat.com>
> > wrote:
> >
> >>
> >>
> >> On Thu, 21 Mar 2019 at 16:16, Atin Mukherjee <amukherj at redhat.com> wrote:
> >>
> >>> All,
> >>>
> >>> In the last few releases of glusterfs, with stability as a primary theme
> >>> of the releases, there has been lots of changes done on the code
> >>> optimization with an expectation that such changes will have gluster to
> >>> provide better performance. While many of these changes do help, but off
> >>> late we have started seeing some diverse effects of them, one especially
> >>> being the calloc to malloc conversions. While I do understand that malloc
> >>> syscall will eliminate the extra memset bottleneck which calloc bears, but
> >>> with recent kernels having in-built strong compiler optimizations I am not
> >>> sure whether that makes any significant difference, but as I mentioned
> >>> earlier certainly if this isn't done carefully it can potentially introduce
> >>> lot of bugs and I'm writing this email to share one of such experiences.
> >>>
> >>> Sanju & I were having troubles for last two days to figure out why
> >>> https://review.gluster.org/#/c/glusterfs/+/22388/ wasn't working in
> >>> Sanju's system but it had no problems running the same fix in my gluster
> >>> containers. After spending a significant amount of time, what we now
> >>> figured out is that a malloc call [1] (which was a calloc earlier) is the
> >>> culprit here. As you all can see, in this function we allocate txn_id and
> >>> copy the event->txn_id into it through gf_uuid_copy () . But when we were
> >>> debugging this step wise through gdb, txn_id wasn't exactly copied with the
> >>> exact event->txn_id and it had some junk values which made the
> >>> glusterd_clear_txn_opinfo to be invoked with a wrong txn_id later on
> >>> resulting the leaks to remain the same which was the original intention of
> >>> the fix.
> >>>
> >>> This was quite painful to debug and we had to spend some time to figure
> >>> this out. Considering we have converted many such calls in past, I'd urge
> >>> that we review all such conversions and see if there're any side effects to
> >>> it. Otherwise we might end up running into many potential memory related
> >>> bugs later on. OTOH, going forward I'd request every patch
> >>> owners/maintainers to pay some special attention to these conversions and
> >>> see they are really beneficial and error free. IMO, general guideline
> >>> should be - for bigger buffers, malloc would make better sense but has to
> >>> be done carefully, for smaller size, we stick to calloc.
> >>>
> >>
> >>> What do others think about it?
> >>>
> >>
> >> I believe that replacing calloc with malloc everywhere without adequate
> >> testing and review is not safe and am against doing so for the following
> >> reasons:
> >>
> >
> > No patch should get in without adequate testing and thorough review.
> >
> 
> 
> There are lots of interesting points to glean in this thread. However, this
> particular one caught my attention. How about we introduce a policy that no
> patch gets merged unless it is thoroughly tested? The onus would be on the
> developer to provide a .t test case to show completeness in the testing of
> that patch. If the developer does not or cannot for any reason, we could
> have the maintainer run tests and add a note in gerrit explaining the tests
> run. This would provide more assurance about the patches being tested
> before getting merged. Obviously, patches that fix typos or that cannot
> affect any functionality need not be subject to this policy.
> 
> As far as review thoroughness is concerned, it might be better to mandate
> acks from respective maintainers before merging a patch that affects
> several components. More eyeballs that specialize in particular
> component(s) will hopefully catch some of these issues during the review
> phase.

Both of these points have always been strongly encouraged. They are also
documented in the
https://docs.gluster.org/en/latest/Contributors-Guide/Guidelines-For-Maintainers/
https://github.com/gluster/glusterdocs/blob/master/docs/Contributors-Guide/Guidelines-For-Maintainers.md
(formatting is broken in the 1st link, but I dont know how to fix it)

We probably need to apply our own guidelines a little better, and
remember developers that > 90% of the patch(series) should come with a
.t file or added test in an existing one.

And a big +1 for getting reviews or at least some involvement of the
component maintainers.

Niels