From atumball at redhat.com  Wed May  1 12:42:30 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Wed, 1 May 2019 18:12:30 +0530
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID:

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of the missing free? We will
continue to investigate the leak further, but would really appreciate
getting the patch that has already been worked on landed into upstream
master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:

> Ok, I am clear now.
>
> I've added ssl_free to the socket reset and socket finish functions.
> Though the glusterfsd memory leak is smaller now, it is still leaking,
> and from the source code I cannot find anything else.
>
> Could you help to check if this issue exists in your env? If not, I may
> try to merge your patch.
>
> Steps:
>
> 1> while true; do gluster v heal info; done
>
> 2> check the vol-name glusterfsd memory usage; it is obviously
> increasing.
>
> cynthia
>
>
>
> *From:* Milind Changire
> *Sent:* Monday, April 22, 2019 2:36 PM
> *To:* Zhou, Cynthia (NSB - CN/Hangzhou)
> *Cc:* Atin Mukherjee ; gluster-devel at gluster.org
> *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after
> enable ssl
>
>
>
> According to the BIO_new_socket() man page ...
>
>
>
> *If the close flag is set then the socket is shut down and closed when the
> BIO is freed.*
>
>
>
> For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE
> flag is set. Otherwise, SSL takes control of socket shutdown whenever the
> BIO is freed.
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From srakonde at redhat.com  Wed May  1 15:11:30 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Wed, 1 May 2019 20:41:30 +0530
Subject: [Gluster-devel] ./tests/basic/uss.t is timing out in release-6
 branch
In-Reply-To:
References:
Message-ID:

Thank you Raghavendra.

On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath <
rabhat at redhat.com> wrote:

>
> To make things relatively easy for the cleanup () function in the test
> framework, I think it would be better to ensure that uss.t itself deletes
> snapshots and the volume once the tests are done. Patch [1] has been
> submitted for review.
>
> [1] https://review.gluster.org/#/c/glusterfs/+/22649/
>
> Regards,
> Raghavendra
>
> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath <
> rabhat at redhat.com> wrote:
>
>>
>> The failure looks similar to the issue I had mentioned in [1].
>>
>> In short, for some reason the cleanup (the cleanup function that we call
>> in our .t files) seems to be taking more time and also not cleaning up
>> properly. This leads to problems in the 2nd iteration (where basic things
>> such as volume creation or volume start itself fail due to ENODATA or
>> ENOENT errors).
>>
>> The 2nd iteration of the uss.t run had the following errors:
>>
>> "[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
>> 39 gluster --mode=script --wignore volume set patchy nfs.disable false
>> ++++++++++
>> [2019-04-29 09:08:15.390550] : volume set patchy nfs.disable false :
>> SUCCESS
>> [2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
>> 42 gluster --mode=script --wignore volume start patchy ++++++++++
>> [2019-04-29 09:08:15.468780] : volume start patchy : FAILED : Failed to
>> get extended attribute trusted.glusterfs.volume-id for brick dir
>> /d/backends/3/patchy_snap_mnt. Reason : No data available
>> "
>>
>> These are the initial steps to create and start the volume. Why the
>> trusted.glusterfs.volume-id extended attribute is absent is not clear.
>> The analysis in [1] had errors of ENOENT (i.e. the export directory
>> itself was absent).
>> I suspect this to be because of some issue with the cleanup mechanism at
>> the end of the tests.
>>
>> [1]
>> https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html
>>
>> On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> ./tests/basic/uss.t is timing out in the release-6 branch consistently.
>>> One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/.
>>> Can you please look into this?
>>>
>>> --
>>> Thanks,
>>> Sanju
>>>
>> --
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From moagrawa at redhat.com  Thu May  2 06:45:09 2019
From: moagrawa at redhat.com (Mohit Agrawal)
Date: Thu, 2 May 2019 12:15:09 +0530
Subject: [Gluster-devel] Query regarding dictionary logic
In-Reply-To:
References:
Message-ID:

Hi Vijay,

I have tried to execute the smallfile tool on a volume (12x3); I have not
found any significant performance improvement for smallfile operations. I
configured 4 clients and 8 threads to run the operations.

I generated a statedump and found the below data for the dictionaries,
specific to the gluster processes:

brick
max-pairs-per-dict=50
total-pairs-used=192212171
total-dicts-used=24794349
average-pairs-per-dict=7

glusterd
max-pairs-per-dict=301
total-pairs-used=156677
total-dicts-used=30719
average-pairs-per-dict=5

fuse process
[dict]
max-pairs-per-dict=50
total-pairs-used=88669561
total-dicts-used=12360543
average-pairs-per-dict=7

It seems the dictionary has the most pairs in the case of glusterd, and
when the number of volumes is high that number can increase further. I
think there is no performance regression in the case of brick and fuse. I
have used a hash_size of 20 for the dictionary.
Let me know if you can provide some other test to validate the same.

Thanks,
Mohit Agrawal

On Tue, Apr 30, 2019 at 2:29 PM Mohit Agrawal  wrote:

> Thanks, Amar for sharing the patch, I will test and share the result.
>
> On Tue, Apr 30, 2019 at 2:23 PM Amar Tumballi Suryanarayan <
> atumball at redhat.com> wrote:
>
>> Shreyas/Kevin tried to address it some time back using
>> https://bugzilla.redhat.com/show_bug.cgi?id=1428049 (
>> https://review.gluster.org/16830)
>>
>> I vaguely remember the reason to keep the hash value 1 was done during
>> the time when we had dictionary itself sent as on wire protocol, and in
>> most other places, number of entries in dictionary was on an avg, 3. So, we
>> felt, saving on a bit of memory for optimization was better at that time.
>>
>> -Amar
>>
>> On Tue, Apr 30, 2019 at 12:02 PM Mohit Agrawal 
>> wrote:
>>
>>> sure Vijay, I will try and update.
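(As an aside for anyone who wants to reproduce the [dict] figures quoted
above: they come from a statedump. A minimal sketch, assuming a volume
named patchy and the default statedump directory /var/run/gluster:

    # Dump the state of all brick processes of the volume.
    gluster volume statedump patchy

    # For a fuse client, send SIGUSR1 to the glusterfs process instead.
    kill -USR1 $(pgrep -f 'glusterfs.*patchy')

    # The [dict] section of each dump carries the pair/dict counters.
    grep -A 4 '\[dict\]' /var/run/gluster/*dump*

The exact dump file names vary by process and pid, so the glob above is
only indicative.)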
>>> >>> Regards, >>> Mohit Agrawal >>> >>> On Tue, Apr 30, 2019 at 11:44 AM Vijay Bellur >>> wrote: >>> >>>> Hi Mohit, >>>> >>>> On Mon, Apr 29, 2019 at 7:15 AM Mohit Agrawal >>>> wrote: >>>> >>>>> Hi All, >>>>> >>>>> I was just looking at the code of dict, I have one query current >>>>> dictionary logic. >>>>> I am not able to understand why we use hash_size is 1 for a >>>>> dictionary.IMO with the >>>>> hash_size of 1 dictionary always work like a list, not a hash, for >>>>> every lookup >>>>> in dictionary complexity is O(n). >>>>> >>>>> Before optimizing the code I just want to know what was the exact >>>>> reason to define >>>>> hash_size is 1? >>>>> >>>> >>>> This is a good question. I looked up the source in gluster's historic >>>> repo [1] and hash_size is 1 even there. So, this could have been the case >>>> since the first version of the dictionary code. >>>> >>>> Would you be able to run some tests with a larger hash_size and share >>>> your observations? >>>> >>>> Thanks, >>>> Vijay >>>> >>>> [1] >>>> https://github.com/gluster/historic/blob/master/libglusterfs/src/dict.c >>>> >>>> >>>> >>>>> >>>>> Please share your view on the same. >>>>> >>>>> Thanks, >>>>> Mohit Agrawal >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> >> >> -- >> Amar Tumballi (amarts) >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 10:45:38 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 12:45:38 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? Message-ID: Hi all, there's a feature in the locks xlator that sends a notification to current owner of a lock when another client tries to acquire the same lock. This way the current owner is made aware of the contention and can release the lock as soon as possible to allow the other client to proceed. This is specially useful when eager-locking is used and multiple clients access the same files and directories. Currently both replicated and dispersed volumes use eager-locking and can use contention notification to force an early release of the lock. Eager-locking reduces the number of network requests required for each operation, improving performance, but could add delays to other clients while it keeps the inode or entry locked. With the contention notification feature we avoid this delay, so we get the best performance with minimal issues in multiclient environments. Currently the contention notification feature is controlled by the 'features.lock-notify-contention' option and it's disabled by default. Should we enable it by default ? I don't see any reason to keep it disabled by default. Does anyone foresee any problem ? Regards, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From aspandey at redhat.com Thu May 2 12:17:51 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Thu, 2 May 2019 08:17:51 -0400 (EDT) Subject: [Gluster-devel] Should we enable contention notification by default ? 
In-Reply-To: References: Message-ID: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Xavi, I would like to keep this option (features.lock-notify-contention) enabled by default. However, I can see that there is one more option which will impact the working of this option which is "notify-contention-delay" .description = "This value determines the minimum amount of time " "(in seconds) between upcall contention notifications " "on the same inode. If multiple lock requests are " "received during this period, only one upcall will " "be sent."}, I am not sure what should be the best value for this option if we want to keep features.lock-notify-contention ON by default? It looks like if we keep the value of notify-contention-delay more, say 5 sec, it will wait for this much time to send up call notification which does not look good. Is my understanding correct? What will be impact of this value and what should be the default value of this option? --- Ashish ----- Original Message ----- From: "Xavi Hernandez" To: "gluster-devel" Cc: "Pranith Kumar Karampuri" , "Ashish Pandey" , "Amar Tumballi" Sent: Thursday, May 2, 2019 4:15:38 PM Subject: Should we enable contention notification by default ? Hi all, there's a feature in the locks xlator that sends a notification to current owner of a lock when another client tries to acquire the same lock. This way the current owner is made aware of the contention and can release the lock as soon as possible to allow the other client to proceed. This is specially useful when eager-locking is used and multiple clients access the same files and directories. Currently both replicated and dispersed volumes use eager-locking and can use contention notification to force an early release of the lock. Eager-locking reduces the number of network requests required for each operation, improving performance, but could add delays to other clients while it keeps the inode or entry locked. With the contention notification feature we avoid this delay, so we get the best performance with minimal issues in multiclient environments. Currently the contention notification feature is controlled by the 'features.lock-notify-contention' option and it's disabled by default. Should we enable it by default ? I don't see any reason to keep it disabled by default. Does anyone foresee any problem ? Regards, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 13:13:36 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 15:13:36 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: Hi Ashish, On Thu, May 2, 2019 at 2:17 PM Ashish Pandey wrote: > Xavi, > > I would like to keep this option (features.lock-notify-contention) enabled > by default. > However, I can see that there is one more option which will impact the > working of this option which is "notify-contention-delay" > .description = "This value determines the minimum amount of time " > "(in seconds) between upcall contention notifications " > "on the same inode. If multiple lock requests are " > "received during this period, only one upcall will " > "be sent."}, > > I am not sure what should be the best value for this option if we want to > keep features.lock-notify-contention ON by default? 
It looks like if we set the value of notify-contention-delay higher, say 5
sec, it will wait that long to send the upcall notification, which does
not look good.
Is my understanding correct?
What will the impact of this value be, and what should the default value
of this option be?

---
Ashish

------------------------------
*From: *"Xavi Hernandez" 
*To: *"gluster-devel" 
*Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
aspandey at redhat.com>, "Amar Tumballi" 
*Sent: *Thursday, May 2, 2019 4:15:38 PM
*Subject: *Should we enable contention notification by default ?

Hi all,

there's a feature in the locks xlator that sends a notification to current
owner of a lock when another client tries to acquire the same lock. This
way the current owner is made aware of the contention and can release the
lock as soon as possible to allow the other client to proceed.

This is specially useful when eager-locking is used and multiple clients
access the same files and directories. Currently both replicated and
dispersed volumes use eager-locking and can use contention notification to
force an early release of the lock.

Eager-locking reduces the number of network requests required for each
operation, improving performance, but could add delays to other clients
while it keeps the inode or entry locked. With the contention notification
feature we avoid this delay, so we get the best performance with minimal
issues in multiclient environments.

Currently the contention notification feature is controlled by the
'features.lock-notify-contention' option and it's disabled by default.
Should we enable it by default ?

I don't see any reason to keep it disabled by default. Does anyone foresee
any problem ?

Regards,

Xavi

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From xhernandez at redhat.com  Thu May  2 13:13:36 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 2 May 2019 15:13:36 +0200
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

Hi Ashish,

On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:

> Xavi,
>
> I would like to keep this option (features.lock-notify-contention) enabled
> by default.
> However, I can see that there is one more option which will impact the
> working of this option which is "notify-contention-delay"
> .description = "This value determines the minimum amount of time "
>                "(in seconds) between upcall contention notifications "
>                "on the same inode. If multiple lock requests are "
>                "received during this period, only one upcall will "
>                "be sent."},
>
> I am not sure what should be the best value for this option if we want to
> keep features.lock-notify-contention ON by default?
> It looks like if we keep the value of notify-contention-delay more, say 5
> sec, it will wait for this much time to send up call
> notification which does not look good.
>

No, the first notification is sent immediately. What this option does is
to define the minimum interval between notifications. This interval is per
lock. This is done to avoid storms of notifications if many requests come
referencing the same lock.

Is my understanding correct?
> What will be impact of this value and what should be the default value of
> this option?
>

I think the current default value of 5 seconds seems good enough. If there
are many bricks, each brick could send a notification per lock. 1000
bricks would mean a client would receive 1000 notifications every 5
seconds. It doesn't seem too much, but in cases like that, and considering
we could have other locks, maybe a higher value (like 10) could be better.

Xavi


>
> ---
> Ashish
>
> ------------------------------
> *From: *"Xavi Hernandez" 
> *To: *"gluster-devel" 
> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
> aspandey at redhat.com>, "Amar Tumballi" 
> *Sent: *Thursday, May 2, 2019 4:15:38 PM
> *Subject: *Should we enable contention notification by default ?
>
> Hi all,
>
> there's a feature in the locks xlator that sends a notification to current
> owner of a lock when another client tries to acquire the same lock. This
> way the current owner is made aware of the contention and can release the
> lock as soon as possible to allow the other client to proceed.
>
> This is specially useful when eager-locking is used and multiple clients
> access the same files and directories. Currently both replicated and
> dispersed volumes use eager-locking and can use contention notification to
> force an early release of the lock.
>
> Eager-locking reduces the number of network requests required for each
> operation, improving performance, but could add delays to other clients
> while it keeps the inode or entry locked. With the contention notification
> feature we avoid this delay, so we get the best performance with minimal
> issues in multiclient environments.
>
> Currently the contention notification feature is controlled by the
> 'features.lock-notify-contention' option and it's disabled by default.
> Should we enable it by default ?
>
> I don't see any reason to keep it disabled by default. Does anyone foresee
> any problem ?
>
> Regards,
>
> Xavi
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mchangir at redhat.com  Thu May  2 13:37:02 2019
From: mchangir at redhat.com (Milind Changire)
Date: Thu, 2 May 2019 19:07:02 +0530
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To:
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez  wrote:

> Hi Ashish,
>
> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:
>
>> Xavi,
>>
>> I would like to keep this option (features.lock-notify-contention)
>> enabled by default.
>> However, I can see that there is one more option which will impact the
>> working of this option which is "notify-contention-delay"
>>
>
Just a nit. I wish the option was called "notify-contention-interval"
The "delay" part doesn't really emphasize where the delay would be put in.

> .description = "This value determines the minimum amount of time "
>> "(in seconds) between upcall contention notifications "
>> "on the same inode.
If multiple lock requests are " >> "received during this period, only one upcall will " >> "be sent."}, >> >> I am not sure what should be the best value for this option if we want to >> keep features.lock-notify-contention ON by default? >> It looks like if we keep the value of notify-contention-delay more, say 5 >> sec, it will wait for this much time to send up call >> notification which does not look good. >> > > No, the first notification is sent immediately. What this option does is > to define the minimum interval between notifications. This interval is per > lock. This is done to avoid storms of notifications if many requests come > referencing the same lock. > > Is my understanding correct? >> What will be impact of this value and what should be the default value of >> this option? >> > > I think the current default value of 5 seconds seems good enough. If there > are many bricks, each brick could send a notification per lock. 1000 bricks > would mean a client would receive 1000 notifications every 5 seconds. It > doesn't seem too much, but in those cases 10, and considering we could have > other locks, maybe a higher value could be better. > > Xavi > > >> >> --- >> Ashish >> >> >> >> >> >> >> ------------------------------ >> *From: *"Xavi Hernandez" >> *To: *"gluster-devel" >> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" < >> aspandey at redhat.com>, "Amar Tumballi" >> *Sent: *Thursday, May 2, 2019 4:15:38 PM >> *Subject: *Should we enable contention notification by default ? >> >> Hi all, >> >> there's a feature in the locks xlator that sends a notification to >> current owner of a lock when another client tries to acquire the same lock. >> This way the current owner is made aware of the contention and can release >> the lock as soon as possible to allow the other client to proceed. >> >> This is specially useful when eager-locking is used and multiple clients >> access the same files and directories. Currently both replicated and >> dispersed volumes use eager-locking and can use contention notification to >> force an early release of the lock. >> >> Eager-locking reduces the number of network requests required for each >> operation, improving performance, but could add delays to other clients >> while it keeps the inode or entry locked. With the contention notification >> feature we avoid this delay, so we get the best performance with minimal >> issues in multiclient environments. >> >> Currently the contention notification feature is controlled by the >> 'features.lock-notify-contention' option and it's disabled by default. >> Should we enable it by default ? >> >> I don't see any reason to keep it disabled by default. Does anyone >> foresee any problem ? >> >> Regards, >> >> Xavi >> >> _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Milind -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 13:44:40 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 15:44:40 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? 
In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019, 15:37 Milind Changire, wrote: > On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez > wrote: > >> Hi Ashish, >> >> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey wrote: >> >>> Xavi, >>> >>> I would like to keep this option (features.lock-notify-contention) >>> enabled by default. >>> However, I can see that there is one more option which will impact the >>> working of this option which is "notify-contention-delay" >>> >> > Just a nit. I wish the option was called "notify-contention-interval" > The "delay" part doesn't really emphasize where the delay would be put in. > It makes sense. Maybe we can also rename it or add a second name (alias). If there are no objections, I will send a patch with the change. Xavi > >> .description = "This value determines the minimum amount of time " >>> "(in seconds) between upcall contention >>> notifications " >>> "on the same inode. If multiple lock requests are " >>> "received during this period, only one upcall will " >>> "be sent."}, >>> >>> I am not sure what should be the best value for this option if we want >>> to keep features.lock-notify-contention ON by default? >>> It looks like if we keep the value of notify-contention-delay more, say >>> 5 sec, it will wait for this much time to send up call >>> notification which does not look good. >>> >> >> No, the first notification is sent immediately. What this option does is >> to define the minimum interval between notifications. This interval is per >> lock. This is done to avoid storms of notifications if many requests come >> referencing the same lock. >> >> Is my understanding correct? >>> What will be impact of this value and what should be the default value >>> of this option? >>> >> >> I think the current default value of 5 seconds seems good enough. If >> there are many bricks, each brick could send a notification per lock. 1000 >> bricks would mean a client would receive 1000 notifications every 5 >> seconds. It doesn't seem too much, but in those cases 10, and considering >> we could have other locks, maybe a higher value could be better. >> >> Xavi >> >> >>> >>> --- >>> Ashish >>> >>> >>> >>> >>> >>> >>> ------------------------------ >>> *From: *"Xavi Hernandez" >>> *To: *"gluster-devel" >>> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" < >>> aspandey at redhat.com>, "Amar Tumballi" >>> *Sent: *Thursday, May 2, 2019 4:15:38 PM >>> *Subject: *Should we enable contention notification by default ? >>> >>> Hi all, >>> >>> there's a feature in the locks xlator that sends a notification to >>> current owner of a lock when another client tries to acquire the same lock. >>> This way the current owner is made aware of the contention and can release >>> the lock as soon as possible to allow the other client to proceed. >>> >>> This is specially useful when eager-locking is used and multiple clients >>> access the same files and directories. Currently both replicated and >>> dispersed volumes use eager-locking and can use contention notification to >>> force an early release of the lock. >>> >>> Eager-locking reduces the number of network requests required for each >>> operation, improving performance, but could add delays to other clients >>> while it keeps the inode or entry locked. With the contention notification >>> feature we avoid this delay, so we get the best performance with minimal >>> issues in multiclient environments. 
>>> >>> Currently the contention notification feature is controlled by the >>> 'features.lock-notify-contention' option and it's disabled by default. >>> Should we enable it by default ? >>> >>> I don't see any reason to keep it disabled by default. Does anyone >>> foresee any problem ? >>> >>> Regards, >>> >>> Xavi >>> >>> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > -- > Milind > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Thu May 2 14:06:19 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Thu, 2 May 2019 19:36:19 +0530 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019 at 19:14, Xavi Hernandez wrote: > On Thu, 2 May 2019, 15:37 Milind Changire, wrote: > >> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez >> wrote: >> >>> Hi Ashish, >>> >>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey >>> wrote: >>> >>>> Xavi, >>>> >>>> I would like to keep this option (features.lock-notify-contention) >>>> enabled by default. >>>> However, I can see that there is one more option which will impact the >>>> working of this option which is "notify-contention-delay" >>>> >>> >> Just a nit. I wish the option was called "notify-contention-interval" >> The "delay" part doesn't really emphasize where the delay would be put in. >> > > It makes sense. Maybe we can also rename it or add a second name (alias). > If there are no objections, I will send a patch with the change. > > Xavi > > >> >>> .description = "This value determines the minimum amount of time " >>>> "(in seconds) between upcall contention >>>> notifications " >>>> "on the same inode. If multiple lock requests are " >>>> "received during this period, only one upcall will " >>>> "be sent."}, >>>> >>>> I am not sure what should be the best value for this option if we want >>>> to keep features.lock-notify-contention ON by default? >>>> It looks like if we keep the value of notify-contention-delay more, say >>>> 5 sec, it will wait for this much time to send up call >>>> notification which does not look good. >>>> >>> >>> No, the first notification is sent immediately. What this option does is >>> to define the minimum interval between notifications. This interval is per >>> lock. This is done to avoid storms of notifications if many requests come >>> referencing the same lock. >>> >>> Is my understanding correct? >>>> What will be impact of this value and what should be the default value >>>> of this option? >>>> >>> >>> I think the current default value of 5 seconds seems good enough. If >>> there are many bricks, each brick could send a notification per lock. 1000 >>> bricks would mean a client would receive 1000 notifications every 5 >>> seconds. It doesn't seem too much, but in those cases 10, and considering >>> we could have other locks, maybe a higher value could be better. >>> >>> Xavi >>> >>> >>>> >>>> --- >>>> Ashish >>>> >>>> >>>> >>>> >>>> >>>> >>>> ------------------------------ >>>> *From: *"Xavi Hernandez" >>>> *To: *"gluster-devel" >>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" >>>> , "Amar Tumballi" >>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM >>>> *Subject: *Should we enable contention notification by default ? 
>>>>
>>>> Hi all,
>>>>
>>>> there's a feature in the locks xlator that sends a notification to
>>>> current owner of a lock when another client tries to acquire the same lock.
>>>> This way the current owner is made aware of the contention and can release
>>>> the lock as soon as possible to allow the other client to proceed.
>>>>
>>>> This is specially useful when eager-locking is used and multiple clients
>>>> access the same files and directories. Currently both replicated and
>>>> dispersed volumes use eager-locking and can use contention notification to
>>>> force an early release of the lock.
>>>>
>>>> Eager-locking reduces the number of network requests required for each
>>>> operation, improving performance, but could add delays to other clients
>>>> while it keeps the inode or entry locked. With the contention notification
>>>> feature we avoid this delay, so we get the best performance with minimal
>>>> issues in multiclient environments.
>>>>
>>>> Currently the contention notification feature is controlled by the
>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>> Should we enable it by default ?
>>>>
>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>> foresee any problem ?
>>>>
Is it a server only option? Otherwise it will break backward compatibility
if we rename the key. If an alias can get this fixed, that's a better
choice, but I'm not sure if it solves all the problems.

>>>> Regards,
>>>>
>>>> Xavi
>>>>
>>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> Milind
>>
>> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
-- 
--Atin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From xhernandez at redhat.com  Thu May  2 15:08:45 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 2 May 2019 17:08:45 +0200
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To:
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee 
wrote:

>
>
> On Thu, 2 May 2019 at 19:14, Xavi Hernandez  wrote:
>
>> On Thu, 2 May 2019, 15:37 Milind Changire,  wrote:
>>
>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey 
>>>> wrote:
>>>>
>>>>> Xavi,
>>>>>
>>>>> I would like to keep this option (features.lock-notify-contention)
>>>>> enabled by default.
>>>>> However, I can see that there is one more option which will impact the
>>>>> working of this option which is "notify-contention-delay"
>>>>>
>>>>
>>> Just a nit. I wish the option was called "notify-contention-interval"
>>> The "delay" part doesn't really emphasize where the delay would be put
>>> in.
>>>
>>
>> It makes sense. Maybe we can also rename it or add a second name (alias).
>> If there are no objections, I will send a patch with the change.
>>
>> Xavi
>>
>>
>>>
>>>> .description = "This value determines the minimum amount of time "
>>>>> "(in seconds) between upcall contention
>>>>> notifications "
>>>>> "on the same inode.
>>>>> If multiple lock requests are "
>>>>> "received during this period, only one upcall will "
>>>>> "be sent."},
>>>>>
>>>>> I am not sure what should be the best value for this option if we want
>>>>> to keep features.lock-notify-contention ON by default?
>>>>> It looks like if we keep the value of notify-contention-delay more, say
>>>>> 5 sec, it will wait for this much time to send up call
>>>>> notification which does not look good.
>>>>>
>>>>
>>>> No, the first notification is sent immediately. What this option does
>>>> is to define the minimum interval between notifications. This interval is
>>>> per lock. This is done to avoid storms of notifications if many requests
>>>> come referencing the same lock.
>>>>
>>>> Is my understanding correct?
>>>>> What will be impact of this value and what should be the default value
>>>>> of this option?
>>>>>
>>>>
>>>> I think the current default value of 5 seconds seems good enough. If
>>>> there are many bricks, each brick could send a notification per lock. 1000
>>>> bricks would mean a client would receive 1000 notifications every 5
>>>> seconds. It doesn't seem too much, but in those cases 10, and considering
>>>> we could have other locks, maybe a higher value could be better.
>>>>
>>>> Xavi
>>>>
>>>>>
>>>>> ---
>>>>> Ashish
>>>>>
>>>>> ------------------------------
>>>>> *From: *"Xavi Hernandez" 
>>>>> *To: *"gluster-devel" 
>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>> Pandey" , "Amar Tumballi" 
>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>
>>>>> Hi all,
>>>>>
>>>>> there's a feature in the locks xlator that sends a notification to
>>>>> current owner of a lock when another client tries to acquire the same lock.
>>>>> This way the current owner is made aware of the contention and can release
>>>>> the lock as soon as possible to allow the other client to proceed.
>>>>>
>>>>> This is specially useful when eager-locking is used and multiple
>>>>> clients access the same files and directories. Currently both replicated
>>>>> and dispersed volumes use eager-locking and can use contention notification
>>>>> to force an early release of the lock.
>>>>>
>>>>> Eager-locking reduces the number of network requests required for each
>>>>> operation, improving performance, but could add delays to other clients
>>>>> while it keeps the inode or entry locked. With the contention notification
>>>>> feature we avoid this delay, so we get the best performance with minimal
>>>>> issues in multiclient environments.
>>>>>
>>>>> Currently the contention notification feature is controlled by the
>>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>>> Should we enable it by default ?
>>>>>
>>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>>> foresee any problem ?
>>>>>
>>>>
> Is it a server only option? Otherwise it will break backward compatibility
> if we rename the key. If an alias can get this fixed, that's a better
> choice, but I'm not sure if it solves all the problems.
>

It's a server side option. I thought that an alias didn't have any other
implication than accepting two names for the same option. Is there
anything else I need to consider ?
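(For anyone who wants to experiment with this, a minimal sketch on a test
volume named patchy follows. The enable key is the one named in this
thread; the interval key name is an assumption worth checking against
"gluster volume set help" on your build:

    # Enable contention notification (a server-side option).
    gluster volume set patchy features.lock-notify-contention on

    # Confirm the effective value.
    gluster volume get patchy features.lock-notify-contention

    # Minimum interval (seconds) between notifications for the same lock;
    # the exact CLI key name here is assumed, not confirmed by the thread.
    gluster volume set patchy features.lock-notify-contention-delay 5
)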
> >>>>> Regards, >>>>> >>>>> Xavi >>>>> >>>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> >>> >>> -- >>> Milind >>> >>> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- > --Atin > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Thu May 2 15:45:39 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Thu, 2 May 2019 21:15:39 +0530 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019 at 20:38, Xavi Hernandez wrote: > On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee > wrote: > >> >> >> On Thu, 2 May 2019 at 19:14, Xavi Hernandez >> wrote: >> >>> On Thu, 2 May 2019, 15:37 Milind Changire, wrote: >>> >>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez >>>> wrote: >>>> >>>>> Hi Ashish, >>>>> >>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey >>>>> wrote: >>>>> >>>>>> Xavi, >>>>>> >>>>>> I would like to keep this option (features.lock-notify-contention) >>>>>> enabled by default. >>>>>> However, I can see that there is one more option which will impact >>>>>> the working of this option which is "notify-contention-delay" >>>>>> >>>>> >>>> Just a nit. I wish the option was called "notify-contention-interval" >>>> The "delay" part doesn't really emphasize where the delay would be put >>>> in. >>>> >>> >>> It makes sense. Maybe we can also rename it or add a second name >>> (alias). If there are no objections, I will send a patch with the change. >>> >>> Xavi >>> >>> >>>> >>>>> .description = "This value determines the minimum amount of time " >>>>>> "(in seconds) between upcall contention >>>>>> notifications " >>>>>> "on the same inode. If multiple lock requests are >>>>>> " >>>>>> "received during this period, only one upcall >>>>>> will " >>>>>> "be sent."}, >>>>>> >>>>>> I am not sure what should be the best value for this option if we >>>>>> want to keep features.lock-notify-contention ON by default? >>>>>> It looks like if we keep the value of notify-contention-delay more, >>>>>> say 5 sec, it will wait for this much time to send up call >>>>>> notification which does not look good. >>>>>> >>>>> >>>>> No, the first notification is sent immediately. What this option does >>>>> is to define the minimum interval between notifications. This interval is >>>>> per lock. This is done to avoid storms of notifications if many requests >>>>> come referencing the same lock. >>>>> >>>>> Is my understanding correct? >>>>>> What will be impact of this value and what should be the default >>>>>> value of this option? >>>>>> >>>>> >>>>> I think the current default value of 5 seconds seems good enough. If >>>>> there are many bricks, each brick could send a notification per lock. 1000 >>>>> bricks would mean a client would receive 1000 notifications every 5 >>>>> seconds. It doesn't seem too much, but in those cases 10, and considering >>>>> we could have other locks, maybe a higher value could be better. 
>>>>>
>>>>> Xavi
>>>>>
>>>>>>
>>>>>> ---
>>>>>> Ashish
>>>>>>
>>>>>> ------------------------------
>>>>>> *From: *"Xavi Hernandez" 
>>>>>> *To: *"gluster-devel" 
>>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>>> Pandey" , "Amar Tumballi" 
>>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> there's a feature in the locks xlator that sends a notification to
>>>>>> current owner of a lock when another client tries to acquire the same lock.
>>>>>> This way the current owner is made aware of the contention and can release
>>>>>> the lock as soon as possible to allow the other client to proceed.
>>>>>>
>>>>>> This is specially useful when eager-locking is used and multiple
>>>>>> clients access the same files and directories. Currently both replicated
>>>>>> and dispersed volumes use eager-locking and can use contention notification
>>>>>> to force an early release of the lock.
>>>>>>
>>>>>> Eager-locking reduces the number of network requests required for
>>>>>> each operation, improving performance, but could add delays to other
>>>>>> clients while it keeps the inode or entry locked. With the contention
>>>>>> notification feature we avoid this delay, so we get the best performance
>>>>>> with minimal issues in multiclient environments.
>>>>>>
>>>>>> Currently the contention notification feature is controlled by the
>>>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>>>> Should we enable it by default ?
>>>>>>
>>>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>>>> foresee any problem ?
>>>>>>
>>>>>
>> Is it a server only option? Otherwise it will break backward
>> compatibility if we rename the key. If an alias can get this fixed,
>> that's a better choice, but I'm not sure if it solves all the problems.
>>
>
> It's a server side option. I thought that an alias didn't have any other
> implication than accepting two names for the same option. Is there
> anything else I need to consider ?
>

If it's a server-side option then there's no challenge in the alias. If
you do rename it, though, volume set wouldn't work across heterogeneous
server versions.

>
>>
>>>>>> Regards,
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> --
>>>> Milind
>>>>
>>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> --Atin
>>
>
-- 
--Atin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pkalever at redhat.com  Thu May  2 17:34:41 2019
From: pkalever at redhat.com (Prasanna Kalever)
Date: Thu, 2 May 2019 23:04:41 +0530
Subject: [Gluster-devel] gluster-block v0.4 is alive!
Message-ID:

Hello Gluster folks,

The gluster-block team is happy to announce the v0.4 release [1].

This is the new stable version of gluster-block; lots of new and
exciting features and interesting bug fixes are made available as
part of this release.

Please find the big list of release highlights and notable fixes at [2].

Details about installation can be found in the easy install guide at [3].

Find the details about prerequisites and setup guide at [4].

If you are a new user, check out the demo video attached in the README
doc [5], which will be a good source of intro to the project.
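To give a quick flavour of the CLI, here is a minimal sketch; the host
address, volume and block names below are made up, so check the usage
docs for the authoritative syntax:

    # Create a 1 GiB block device backed by gluster volume "block-test",
    # exported from a single host (ha 1).
    gluster-block create block-test/sample-block ha 1 192.168.1.11 1GiB

    # List the block devices on the volume and inspect one of them.
    gluster-block list block-test
    gluster-block info block-test/sample-block

    # Remove the block device when done.
    gluster-block delete block-test/sample-block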
If you are a new user, checkout the demo video attached in the README doc [5], which will be a good source of intro to the project. There are good examples about how to use gluster-block both in the man pages [6] and test file [7] (also in the README). gluster-block is part of fedora package collection, an updated package with release version v0.4 will be soon made available. And the community provided packages will be soon made available at [8]. Please spend a minute to report any kind of issue that comes to your notice with this handy link [9]. We look forward to your feedback, which will help gluster-block get better! We would like to thank all our users, contributors for bug filing and fixes, also the whole team who involved in the huge effort with pre-release testing. [1] https://github.com/gluster/gluster-block [2] https://github.com/gluster/gluster-block/releases [3] https://github.com/gluster/gluster-block/blob/master/INSTALL [4] https://github.com/gluster/gluster-block#usage [5] https://github.com/gluster/gluster-block/blob/master/README.md [6] https://github.com/gluster/gluster-block/tree/master/docs [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t [8] https://download.gluster.org/pub/gluster/gluster-block/ [9] https://github.com/gluster/gluster-block/issues/new Cheers, Team Gluster-Block! From xhernandez at redhat.com Thu May 2 20:58:12 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 22:58:12 +0200 Subject: [Gluster-devel] Weird performance behavior Message-ID: Hi, doing some tests to compare performance I've found some weird results. I've seen this in different tests, but probably the more clear an easier to reproduce is to use smallfile tool to create files. The test command is: # python smallfile_cli.py --operation create --files-per-dir 100 --file-size 32768 --threads 16 --files 256 --top --stonewall no I've run this test 5 times sequentially using the same initial conditions (at least this is what I think): bricks cleared, all gluster processes stopped, volume destroyed and recreated, caches emptied. This is the data I've obtained for each execution: Time us sy ni id wa hi si st read write use 435 1.80 3.70 0.00 81.62 11.06 0.00 0.00 0.00 32.931 608715.575 97.632 450 1.67 3.62 0.00 80.67 12.19 0.00 0.00 0.00 30.989 589078.308 97.714 425 1.74 3.75 0.00 81.85 10.76 0.00 0.00 0.00 37.588 622034.812 97.706 320 2.47 5.06 0.00 82.84 7.75 0.00 0.00 0.00 46.406 828637.359 96.891 365 2.19 4.44 0.00 84.45 7.12 0.00 0.00 0.00 45.822 734566.685 97.466 Time is in seconds. us, sy, ni, id, wa, hi, si and st are the CPU times, as reported by top. read and write are the disk throughput in KiB/s. use is the disk usage percentage. Based on this we can see that there's a big difference between the best and the worst cases. But it seems more relevant that when it performed better, in fact disk utilization and CPU wait time were a bit lower. Disk is a NVMe and I used a recent commit from master (2b86da69). Volume type is a replica 3 with 3 bricks. I'm not sure what can be causing this. Any idea ? can anyone try to reproduce it to see if it's a problem in my environment or it's a common problem ? Thanks, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Fri May 3 05:44:39 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Thu, 2 May 2019 22:44:39 -0700 Subject: [Gluster-devel] Query regarding dictionary logic In-Reply-To: References: Message-ID: Hi Mohit, Thank you for the update. 
More inline. On Wed, May 1, 2019 at 11:45 PM Mohit Agrawal wrote: > Hi Vijay, > > I have tried to execute smallfile tool on volume(12x3), i have not found > any significant performance improvement > for smallfile operations, I have configured 4 clients and 8 thread to run > operations. > For measuring performance, did you measure both time taken and cpu consumed? Normally O(n) computations are cpu expensive and we might see better results with a hash table when a large number of objects ( a few thousands) are present in a single dictionary. If you haven't gathered cpu statistics, please also gather that for comparison. > I have generated statedump and found below data for dictionaries specific > to gluster processes > > brick > max-pairs-per-dict=50 > total-pairs-used=192212171 > total-dicts-used=24794349 > average-pairs-per-dict=7 > > > glusterd > max-pairs-per-dict=301 > total-pairs-used=156677 > total-dicts-used=30719 > average-pairs-per-dict=5 > > > fuse process > [dict] > max-pairs-per-dict=50 > total-pairs-used=88669561 > total-dicts-used=12360543 > average-pairs-per-dict=7 > > It seems dictionary has max-pairs in case of glusterd and while no. of > volumes are high the number can be increased. > I think there is no performance regression in case of brick and fuse. I > have used hash_size 20 for the dictionary. > Let me know if you can provide some other test to validate the same. > A few more items to try out: 1. Vary the number of buckets and test. 2. Create about 10000 volumes and measure performance for a volume info operation on some random volume? 3. Check the related patch from Facebook and see if we can incorporate any ideas from their patch. Thanks, Vijay > Thanks, > Mohit Agrawal > > On Tue, Apr 30, 2019 at 2:29 PM Mohit Agrawal wrote: > >> Thanks, Amar for sharing the patch, I will test and share the result. >> >> On Tue, Apr 30, 2019 at 2:23 PM Amar Tumballi Suryanarayan < >> atumball at redhat.com> wrote: >> >>> Shreyas/Kevin tried to address it some time back using >>> https://bugzilla.redhat.com/show_bug.cgi?id=1428049 ( >>> https://review.gluster.org/16830) >>> >>> I vaguely remember the reason to keep the hash value 1 was done during >>> the time when we had dictionary itself sent as on wire protocol, and in >>> most other places, number of entries in dictionary was on an avg, 3. So, we >>> felt, saving on a bit of memory for optimization was better at that time. >>> >>> -Amar >>> >>> On Tue, Apr 30, 2019 at 12:02 PM Mohit Agrawal >>> wrote: >>> >>>> sure Vijay, I will try and update. >>>> >>>> Regards, >>>> Mohit Agrawal >>>> >>>> On Tue, Apr 30, 2019 at 11:44 AM Vijay Bellur >>>> wrote: >>>> >>>>> Hi Mohit, >>>>> >>>>> On Mon, Apr 29, 2019 at 7:15 AM Mohit Agrawal >>>>> wrote: >>>>> >>>>>> Hi All, >>>>>> >>>>>> I was just looking at the code of dict, I have one query current >>>>>> dictionary logic. >>>>>> I am not able to understand why we use hash_size is 1 for a >>>>>> dictionary.IMO with the >>>>>> hash_size of 1 dictionary always work like a list, not a hash, for >>>>>> every lookup >>>>>> in dictionary complexity is O(n). >>>>>> >>>>>> Before optimizing the code I just want to know what was the exact >>>>>> reason to define >>>>>> hash_size is 1? >>>>>> >>>>> >>>>> This is a good question. I looked up the source in gluster's historic >>>>> repo [1] and hash_size is 1 even there. So, this could have been the case >>>>> since the first version of the dictionary code. 
>>>>>
>>>>> Would you be able to run some tests with a larger hash_size and share
>>>>> your observations?
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>>
>>>>> [1]
>>>>> https://github.com/gluster/historic/blob/master/libglusterfs/src/dict.c
>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Fri May  3 06:04:50 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Fri, 3 May 2019 11:34:50 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA
 cluster solution back to gluster code as gluster-7 feature
In-Reply-To: <7d75b62f0eb0495782c46ef8521790d5@ul-exc-pr-mbx13.ulaval.ca>
References: <9BE7F129-DE42-46A5-896B-81460E605E9E@gmail.com>
 <7d75b62f0eb0495782c46ef8521790d5@ul-exc-pr-mbx13.ulaval.ca>
Message-ID:

On 30/04/19 6:41 PM, Renaud Fortier wrote:
>
> IMO, you should keep storhaug and maintain it. At the beginning, we
> were with pacemaker and corosync. Then we moved to storhaug with the
> upgrade to gluster 4.1.x. Now you are talking about going back like it
> was. Maybe it will be better with pacemaker and corosync, but the
> important thing is to have a solution that will be stable and
> maintained.
>
I agree it is very frustrating. There is no development planned for the
future unless someone picks it up and works on its stabilization and
improvement. My plan is just to get back what gluster and nfs-ganesha had
before.

--

Jiffin

> thanks
>
> Renaud
>
> *From:* gluster-users-bounces at gluster.org
> [mailto:gluster-users-bounces at gluster.org] *On behalf of* Jim Kinney
> *Sent:* 30 April 2019 08:20
> *To:* gluster-users at gluster.org; Jiffin Tony Thottan
> ; gluster-users at gluster.org; Gluster Devel
> ; gluster-maintainers at gluster.org;
> nfs-ganesha ; devel at lists.nfs-ganesha.org
> *Subject:* Re: [Gluster-users] Proposing to previous ganesha HA cluster
> solution back to gluster code as gluster-7 feature
>
> +1!
> I'm using nfs-ganesha in my next upgrade so my client systems can use
> NFS instead of fuse mounts. Having an integrated, designed in process
> to coordinate multiple nodes into an HA cluster will very welcome.
>
> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan
>  wrote:
>
>     Hi all,
>
>     Some of you folks may be familiar with HA solution provided for
>     nfs-ganesha by gluster using pacemaker and corosync.
>
>     That feature was removed in glusterfs 3.10 in favour for common HA
>     project "Storhaug". Even Storhaug was not progressed
>
>     much from last two years and current development is in halt state,
>     hence planning to restore old HA ganesha solution back
>
>     to gluster code repository with some improvement and targetting
>     for next gluster release 7.
>
>     I have opened up an issue [1] with details and posted initial set
>     of patches [2]
>
>     Please share your thoughts on the same
>
>     Regards,
>
>     Jiffin
>
>     [1] https://github.com/gluster/glusterfs/issues/663
>
>     [2]
>     https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>
> --
> Sent from my Android device with K-9 Mail. All tyopes are thumb
> related and reflect authenticity.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Fri May  3 06:08:07 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Fri, 3 May 2019 11:38:07 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA
 cluster solution back to gluster code as gluster-7 feature
In-Reply-To: <1028413072.2343069.1556630991785@mail.yahoo.com>
References: <1028413072.2343069.1556630991785@mail.yahoo.com>
Message-ID: <84885b70-e6b0-6e9b-f43d-a13dbafc6b6a@redhat.com>

On 30/04/19 6:59 PM, Strahil Nikolov wrote:
> Hi,
>
> I'm posting this again as it got bounced.
> Keep in mind that corosync/pacemaker is hard for proper setup by new
> admins/users.
>
> I'm still trying to remediate the effects of poor configuration at work.
> Also, storhaug is nice for hyperconverged setups where the host is not
> only hosting bricks, but other workloads.
> Corosync/pacemaker require proper fencing to be setup and most of the
> stonith resources 'shoot the other node in the head'.
> I would be happy to see an easy to deploy (let say
> 'cluster.enable-ha-ganesha true') and gluster to be bringing up the
> Floating IPs and taking care of the NFS locks, so no disruption will be
> felt by the clients.

It does take care of those, but certain prerequisites need to be
followed. Please note that fencing won't be configured for this setup;
we may think about it in the future.

--

Jiffin

>
> Still, this will be a lot of work to achieve.
>
> Best Regards,
> Strahil Nikolov
>
> On Apr 30, 2019 15:19, Jim Kinney wrote:
>>
>> +1!
>> I'm using nfs-ganesha in my next upgrade so my client systems can use
>> NFS instead of fuse mounts. Having an integrated, designed in process
>> to coordinate multiple nodes into an HA cluster will very welcome.
>>
>> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
>>>
>>> Hi all,
>>>
>>> Some of you folks may be familiar with HA solution provided for
>>> nfs-ganesha by gluster using pacemaker and corosync.
>>>
>>> That feature was removed in glusterfs 3.10 in favour for common HA
>>> project "Storhaug". Even Storhaug was not progressed
>>>
>>> much from last two years and current development is in halt state,
>>> hence planning to restore old HA ganesha solution back
>>>
>>> to gluster code repository with some improvement and targetting for
>>> next gluster release 7.
>>>
>>> I have opened up an issue [1] with details and posted initial set of
>>> patches [2]
>>>
>>> Please share your thoughts on the same
>>>
>>> Regards,
>>>
>>> Jiffin
>>>
>>> [1] https://github.com/gluster/glusterfs/issues/663
>>>
>>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>>>
>> --
>> Sent from my Android device with K-9 Mail. All tyopes are thumb
>> related and reflect authenticity.

From amukherj at redhat.com  Fri May  3 08:56:28 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Fri, 3 May 2019 14:26:28 +0530
Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed
 defects & annotations?
Message-ID:

I'm a bit puzzled by the way Coverity is reporting the open defects on the
GD1 component. As you can see from [1], technically we have 6 open defects
and all of the rest are marked as dismissed. We tried to put some
additional annotations in the code through [2] to see if Coverity starts
feeling happy, but the result doesn't change. I still see the report
complaining about 25 open defects for GD1 (7 High, 18 Medium and 1 Low).
More interestingly, yesterday's report claimed we fixed 8 defects and
introduced 1, but the overall count remained at 102. I'm not able to
connect the dots of this puzzle, can anyone?

[1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
[2] https://review.gluster.org/#/c/22619/

From jahernan at redhat.com  Fri May  3 09:29:21 2019
From: jahernan at redhat.com (Xavi Hernandez)
Date: Fri, 3 May 2019 11:29:21 +0200
Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed
 defects & annotations?
In-Reply-To:
References:
Message-ID:

Hi Atin,

On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee  wrote:

> I'm a bit puzzled by the way Coverity is reporting the open defects on the
> GD1 component. As you can see from [1], technically we have 6 open defects
> and all of the rest are marked as dismissed. We tried to put some
> additional annotations in the code through [2] to see if Coverity starts
> feeling happy, but the result doesn't change. I still see the report
> complaining about 25 open defects for GD1 (7 High, 18 Medium and 1 Low).
> More interestingly, yesterday's report claimed we fixed 8 defects and
> introduced 1, but the overall count remained at 102. I'm not able to
> connect the dots of this puzzle, can anyone?
>
> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
> [2] https://review.gluster.org/#/c/22619/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

Maybe we need to modify all dismissed CID's so that Coverity considers
them again and, hopefully, marks them as solved with the newer updates.
They have been manually marked to be ignored, so they are still there...
Just a thought, I'm not sure how this really works.
Xavi > > [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects > [2] https://review.gluster.org/#/c/22619/ > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Fri May 3 09:46:49 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Fri, 3 May 2019 15:16:49 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: > Hi Atin, > > On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee > wrote: > >> I'm bit puzzled on the way coverity is reporting the open defects on GD1 >> component. As you can see from [1], technically we have 6 open defects and >> all of the rest are being marked as dismissed. We tried to put some >> additional annotations in the code through [2] to see if coverity starts >> feeling happy but the result doesn't change. I still see in the report it >> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >> Low). More interestingly yesterday's report claimed we fixed 8 defects, >> introduced 1, but the overall count remained as 102. I'm not able to >> connect the dots of this puzzle, can anyone? >> > > Maybe we need to modify all dismissed CID's so that Coverity considers > them again and, hopefully, mark them as solved with the newer updates. They > have been manually marked to be ignored, so they are still there... > After yesterday?s run I set the severity for all of them to see if modifications to these CIDs make any difference or not. So fingers crossed till the next report comes :-) . > Just a thought, I'm not sure how this really works. > Same here, I don?t understand the exact workflow and hence seeking additional ideas. > Xavi > > >> >> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects >> [2] https://review.gluster.org/#/c/22619/ >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- - Atin (atinm) -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Fri May 3 10:36:36 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Fri, 3 May 2019 16:06:36 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, May 3, 2019 at 3:17 PM Atin Mukherjee wrote: > > > On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: > >> Hi Atin, >> >> On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee >> wrote: >> >>> I'm bit puzzled on the way coverity is reporting the open defects on GD1 >>> component. As you can see from [1], technically we have 6 open defects and >>> all of the rest are being marked as dismissed. We tried to put some >>> additional annotations in the code through [2] to see if coverity starts >>> feeling happy but the result doesn't change. I still see in the report it >>> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >>> Low). More interestingly yesterday's report claimed we fixed 8 defects, >>> introduced 1, but the overall count remained as 102. I'm not able to >>> connect the dots of this puzzle, can anyone? 
>>> >> >> Maybe we need to modify all dismissed CID's so that Coverity considers >> them again and, hopefully, mark them as solved with the newer updates. They >> have been manually marked to be ignored, so they are still there... >> > > After yesterday?s run I set the severity for all of them to see if > modifications to these CIDs make any difference or not. So fingers crossed > till the next report comes :-) . > If you noticed the previous day report, it was 101 'Open defects' and 65 'Dismissed' (which means, they are not 'fixed in code', but dismissed as false positive or ignore in CID dashboard. Now, it is 57 'Dismissed', which means, your patch has actually fixed 8 defects. > > >> Just a thought, I'm not sure how this really works. >> > > Same here, I don?t understand the exact workflow and hence seeking > additional ideas. > > Looks like we should consider overall open defects as Open + Dismissed. > >> Xavi >> >> >>> >>> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects >>> [2] https://review.gluster.org/#/c/22619/ >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -- > - Atin (atinm) > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Fri May 3 15:40:15 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Fri, 3 May 2019 21:10:15 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, 3 May 2019 at 16:07, Amar Tumballi Suryanarayan wrote: > > > On Fri, May 3, 2019 at 3:17 PM Atin Mukherjee wrote: > >> >> >> On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: >> >>> Hi Atin, >>> >>> On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee >>> wrote: >>> >>>> I'm bit puzzled on the way coverity is reporting the open defects on >>>> GD1 component. As you can see from [1], technically we have 6 open defects >>>> and all of the rest are being marked as dismissed. We tried to put some >>>> additional annotations in the code through [2] to see if coverity starts >>>> feeling happy but the result doesn't change. I still see in the report it >>>> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >>>> Low). More interestingly yesterday's report claimed we fixed 8 defects, >>>> introduced 1, but the overall count remained as 102. I'm not able to >>>> connect the dots of this puzzle, can anyone? >>>> >>> >>> Maybe we need to modify all dismissed CID's so that Coverity considers >>> them again and, hopefully, mark them as solved with the newer updates. They >>> have been manually marked to be ignored, so they are still there... >>> >> >> After yesterday?s run I set the severity for all of them to see if >> modifications to these CIDs make any difference or not. So fingers crossed >> till the next report comes :-) . >> > > If you noticed the previous day report, it was 101 'Open defects' and 65 > 'Dismissed' (which means, they are not 'fixed in code', but dismissed as > false positive or ignore in CID dashboard. > > Now, it is 57 'Dismissed', which means, your patch has actually fixed 8 > defects. > > >> >> >>> Just a thought, I'm not sure how this really works. 
>>>
>> Same here, I don't understand the exact workflow and hence I am seeking
>> additional ideas.
>>
> Looks like we should consider overall open defects as Open + Dismissed.
>

This is why I'm concerned. There are defects which we clearly can't or don't want to fix, and in that case, even though they are marked as dismissed, the overall open defect count doesn't come down. So we'd never be able to come down below the total number of dismissed defects :-( .

However, today's report brought the overall count down to 97 from 102. Coverity claimed we fixed 0 defects since the last scan, which means that somehow my updates to those dismissed GD1 defects did the trick for 5 defects. This continues to be a great puzzle for me!

>
>>> Xavi
>>>
>>>> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
>>>> [2] https://review.gluster.org/#/c/22619/
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> --
>> - Atin (atinm)
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
> --
> Amar Tumballi (amarts)

-- 
- Atin (atinm)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jenkins at build.gluster.org  Mon May 6 01:45:02 2019
From: jenkins at build.gluster.org (jenkins at build.gluster.org)
Date: Mon, 6 May 2019 01:45:02 +0000 (UTC)
Subject: [Gluster-devel] Weekly Untriaged Bugs
Message-ID: <324506145.71.1557107102526.JavaMail.jenkins@jenkins-el7.rht.gluster.org>

[...truncated 6 lines...]
https://bugzilla.redhat.com/1702316 / core: Cannot upgrade 5.x volume to 6.1 because of unused 'crypt' and 'bd' xlators
https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed immediately even with O_SYNC in glfs_create or with glfs_fsync/glfs_fdatasync after glfs_write.
https://bugzilla.redhat.com/1698861 / disperse: Renaming a directory when 2 bricks of multiple disperse subvols are down leaves both old and new dirs on the bricks.
https://bugzilla.redhat.com/1697293 / distribute: DHT: print hash and layout values in hexadecimal format in the logs
https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes.
https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE
https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd
https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running
https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job
https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level
https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work
https://bugzilla.redhat.com/1699309 / snapshot: Gluster snapshot fails with systemd autmounted bricks
https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229
https://bugzilla.redhat.com/1697812 / website: mention a pointer to all the mailing lists available under glusterfs project(https://www.gluster.org/community/)
[...truncated 2 lines...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log
Type: application/octet-stream
Size: 2089 bytes
Desc: not available
URL: 

From cynthia.zhou at nokia-sbell.com  Mon May 6 02:34:08 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Mon, 6 May 2019 02:34:08 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
	<6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
	<3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
	<5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID: 

Hi,
Sorry, I am so busy with other issues these days; could you help me to submit my patch for review? It is based on the glusterfs 3.12.15 code.
But even with this patch the memory leak still exists; from the memory-leak tool it should be related to ssl_accept, and I am not sure whether it is because of the openssl library or because of improper use of the ssl interfaces.

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) {
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) {
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     if (priv->ssl_private_key) {
         GF_FREE(priv->ssl_private_key);
     }

From: Amar Tumballi Suryanarayan
Sent: Wednesday, May 01, 2019 8:43 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Milind Changire; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) wrote:

Ok, I am clear now.
I've added ssl_free in the socket reset and socket finish functions. Though the glusterfsd memory leak is not that much, it is still leaking; from the source code I can not find anything else.
Could you help to check if this issue exists in your env? If not, I may have a try to merge your patch.

Step
1> while true;do gluster v heal info,
2> check the vol-name glusterfsd memory usage, it is obviously increasing.

cynthia

From: Milind Changire
Sent: Monday, April 22, 2019 2:36 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Atin Mukherjee; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

According to the BIO_new_socket() man page ...

If the close flag is set then the socket is shut down and closed when the BIO is freed.

For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever the BIO is freed.
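To make the interaction explicit: because the transport creates its BIO with BIO_NOCLOSE, freeing the SSL object does not close the underlying file descriptor, so a teardown along the lines of the patch above has to release the SSL state itself and then let the transport close the socket on its own terms. A sketch of that ordering, assuming OpenSSL 1.0.x and the socket_private_t fields used in socket.c; this is an illustration, not the committed code:

#include <openssl/ssl.h>

/* "priv" mirrors socket_private_t from rpc/rpc-transport/socket/src/
 * socket.h: priv->ssl_ssl is the per-connection SSL object and
 * priv->sock the raw fd. */
static void
ssl_teardown_connection(socket_private_t *priv)
{
    if (priv->use_ssl && priv->ssl_ssl) {
        /* Send our close_notify; a second SSL_shutdown() would wait
         * for the peer's close_notify, which teardown does not need. */
        SSL_shutdown(priv->ssl_ssl);
        /* SSL_free() releases the record buffers that valgrind traces
         * back to ssl3_setup_buffers(); the extra SSL_clear() in the
         * patch is harmless but redundant, since SSL_free() releases
         * everything anyway. */
        SSL_free(priv->ssl_ssl);
        priv->ssl_ssl = NULL;
    }
    /* Because the BIO was created with BIO_NOCLOSE, priv->sock is
     * still open here and the transport closes it itself. */
}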
_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cynthia.zhou at nokia-sbell.com  Mon May 6 06:11:58 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Mon, 6 May 2019 06:11:58 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
	<6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
	<3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
	<5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID: <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>

Hi,
From our tests, valgrind and libleak both blame ssl3_accept.

/////////////////////////// from valgrind attached to glusterfsd ///////////////////////////

==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==
==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==

valgrind --leak-check=f

//////////////////////////////////// with libleak attached to glusterfsd ////////////////////////////////////

callstack[2419] expires. count=1 size=224/224 alloc=362 free=350
/home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065]
/lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978]
/lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749]
/lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced]
/lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2]
/lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f]
/lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46]
/lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da]
/lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf]

callstack[2432] expires. count=1 size=104/104 alloc=362 free=0
/home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065]
/lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978]
/lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627]
/lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd]
/lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9]
/lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d]
/lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972]
/lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5]
/lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51]
/lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b]
/lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2]
/lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f]
/lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46]
/lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da]
/lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf]

One interesting thing is that the memory goes up to about 300 MB and then stops increasing! I am wondering if this is caused by the openssl library? But when I searched the openssl community, no such issue has been reported before. Is glusterfs using ssl_accept correctly?

cynthia

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 10:34 AM
To: 'Amar Tumballi Suryanarayan'
Cc: Milind Changire; gluster-devel at gluster.org
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,
Sorry, I am so busy with other issues these days; could you help me to submit my patch for review? It is based on the glusterfs 3.12.15 code.
But even with this patch the memory leak still exists; from the memory-leak tool it should be related to ssl_accept, and I am not sure whether it is because of the openssl library or because of improper use of the ssl interfaces.
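One pattern in the valgrind records earlier in this message is worth calling out: the two "definitely lost" records are 198,720 and 200,544 bytes across 12 blocks each, i.e. about 16,560 and 16,712 bytes per block, which is roughly one maximum TLS record plus overhead. That matches ssl3_setup_read_buffer()/ssl3_setup_write_buffer() keeping one read and one write buffer per accepted connection, and would also fit memory growing and then plateauing once every connection has its buffers. If the per-connection buffers (rather than unbounded growth) are the main cost, OpenSSL >= 1.0.0 can be asked to drop them while a connection is idle. A hedged sketch, not present in the Gluster code, and whether it helps these numbers is untested:

#include <openssl/ssl.h>

/* Ask OpenSSL to free a connection's ~17 KiB read/write buffers
 * whenever they are empty, instead of holding them for the lifetime
 * of the SSL object. Illustrative only. */
static void
ssl_ctx_release_idle_buffers(SSL_CTX *ctx)
{
    SSL_CTX_set_mode(ctx, SSL_MODE_RELEASE_BUFFERS);
}

The patch below frees the SSL object on socket reset and in fini, which releases those buffers when a connection goes away; SSL_MODE_RELEASE_BUFFERS would additionally release them while connections stay open.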
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) {
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) {
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     if (priv->ssl_private_key) {
         GF_FREE(priv->ssl_private_key);
     }

From: Amar Tumballi Suryanarayan
Sent: Wednesday, May 01, 2019 8:43 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Milind Changire; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) wrote:

Ok, I am clear now.
I've added ssl_free in the socket reset and socket finish functions. Though the glusterfsd memory leak is not that much, it is still leaking; from the source code I can not find anything else.
Could you help to check if this issue exists in your env? If not, I may have a try to merge your patch.

Step
1> while true;do gluster v heal info,
2> check the vol-name glusterfsd memory usage, it is obviously increasing.

cynthia

From: Milind Changire
Sent: Monday, April 22, 2019 2:36 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Atin Mukherjee; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

According to the BIO_new_socket() man page ...

If the close flag is set then the socket is shut down and closed when the BIO is freed.

For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever the BIO is freed.

_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From atumball at redhat.com  Mon May 6 18:15:04 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Mon, 6 May 2019 14:15:04 -0400
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To: 
References: 
Message-ID: 

On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:

> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block, lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>

Good work Team (Prasanna and Xiubo Li to be precise)!!
This was a much-needed release for the gluster-block project, mainly because of the number of improvements made since the last release. Also, gluster-block release 0.3 was not compatible with the glusterfs-6.x series.

All, feel free to use it if your deployment has any use case for block storage, and give us feedback. Happy to make sure gluster-block is stable for you.

Regards,
Amar

> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and setup guide at [4].
> If you are a new user, checkout the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection, an updated package
> with release version v0.4 will soon be made available. And the
> community provided packages will soon be made available at [8].
>
> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users and contributors for bug filing and
> fixes, and also the whole team involved in the huge effort with
> pre-release testing.
>
> [1] https://github.com/gluster/gluster-block
> [2] https://github.com/gluster/gluster-block/releases
> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> [4] https://github.com/gluster/gluster-block#usage
> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> [6] https://github.com/gluster/gluster-block/tree/master/docs
> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> [8] https://download.gluster.org/pub/gluster/gluster-block/
> [9] https://github.com/gluster/gluster-block/issues/new
>
> Cheers,
> Team Gluster-Block!
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rajibcse2k10 at gmail.com  Mon May 6 20:04:59 2019
From: rajibcse2k10 at gmail.com (Rajib Hossen)
Date: Mon, 6 May 2019 15:04:59 -0500
Subject: [Gluster-devel] New in GlusterFS
Message-ID: 

Hello all,
I am new to glusterfs development. I would like to contribute to the Erasure Coding part of glusterfs. I have already studied non-systematic coding and its theory. Now I want to know how erasure-coded read/write works in terms of code. Could you please point me to any documentation that will help me understand the glusterfs ec read/write path and code structure? Any help is appreciated. Thank you very much.

Sincerely,
Md Rajib Hossen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Tue May 7 04:10:11 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Tue, 7 May 2019 09:40:11 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA clustersolution back to gluster code as gluster-7 feature
In-Reply-To: 
References: 
Message-ID: 

Hi

On 04/05/19 12:04 PM, Strahil wrote:
> Hi Jiffin,
>
> No vendor will support your corosync/pacemaker stack if you do not have proper fencing.
> As Gluster is already a cluster of its own, it makes sense to control everything from there.
>
> Best Regards,

Yeah, I agree with your point.
What I meant to say is that, by default, this feature won't provide any fencing mechanism; the user needs to manually configure fencing for the cluster. In the future we can try to include a default fencing configuration for the ganesha cluster as part of the Ganesha HA configuration.

--
Jiffin

> Still, this will be a lot of work to achieve.
>
> Best Regards,
> Strahil Nikolov

On May 3, 2019 09:08, Jiffin Tony Thottan wrote:
>>
>> On 30/04/19 6:59 PM, Strahil Nikolov wrote:
>>> Hi,
>>>
>>> I'm posting this again as it got bounced.
>>> Keep in mind that corosync/pacemaker is hard to set up properly for new admins/users.
>>>
>>> I'm still trying to remediate the effects of poor configuration at work.
>>> Also, storhaug is nice for hyperconverged setups where the host is not only hosting bricks, but other workloads.
>>> Corosync/pacemaker require proper fencing to be set up, and most of the stonith resources 'shoot the other node in the head'.
>>> I would be happy to see something easy to deploy (let's say 'cluster.enable-ha-ganesha true') with gluster bringing up the floating IPs and taking care of the NFS locks, so no disruption will be felt by the clients.
>>
>> It does take care of those, but certain prerequisites need to be followed; please note that fencing won't be configured for this setup. We may think about that in the future.
>>
>> --
>>
>> Jiffin
>>
>>> Still, this will be a lot of work to achieve.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Apr 30, 2019 15:19, Jim Kinney wrote:
>>>>
>>>> +1!
>>>> I'm using nfs-ganesha in my next upgrade so my client systems can use NFS instead of fuse mounts. Having an integrated, designed-in process to coordinate multiple nodes into an HA cluster will be very welcome.
>>>>
>>>> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
>>>>> Hi all,
>>>>>
>>>>> Some of you folks may be familiar with the HA solution provided for nfs-ganesha by gluster using pacemaker and corosync.
>>>>>
>>>>> That feature was removed in glusterfs 3.10 in favour of the common HA project "Storhaug". Storhaug has not progressed much over the last two years and its development is currently halted, hence the plan to restore the old HA ganesha solution back to the gluster code repository, with some improvements, targeting the next gluster release 7.
>>>>>
>>>>> I have opened up an issue [1] with details and posted an initial set of patches [2]
>>>>>
>>>>> Please share your thoughts on the same
>>>>>
>>>>> Regards,
>>>>>
>>>>> Jiffin
>>>>>
>>>>> [1] https://github.com/gluster/glusterfs/issues/663
>>>>>
>>>>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>>>>
>>>> --
>>>> Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.

From ndevos at redhat.com  Tue May 7 05:35:34 2019
From: ndevos at redhat.com (Niels de Vos)
Date: Tue, 7 May 2019 07:35:34 +0200
Subject: [Gluster-devel] gluster-block v0.4 is alive!
In-Reply-To: 
References: 
Message-ID: <20190507053534.GF5209@ndevos-x270>

On Thu, May 02, 2019 at 11:04:41PM +0530, Prasanna Kalever wrote:
> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block, lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>
> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and setup guide at [4].
> If you are a new user, checkout the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection, an updated package
> with release version v0.4 will soon be made available. And the
> community provided packages will soon be made available at [8].

Updates for Fedora are available in the testing repositories:

Fedora 30: https://bodhi.fedoraproject.org/updates/FEDORA-2019-76730d7230
Fedora 29: https://bodhi.fedoraproject.org/updates/FEDORA-2019-cc7cdce2a4
Fedora 28: https://bodhi.fedoraproject.org/updates/FEDORA-2019-9e9a210110

Installation instructions can be found at the above links. Please leave testing feedback as comments on the Fedora Update pages.

Thanks,
Niels

> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users, contributors for bug filing and
> fixes, also the whole team who involved in the huge effort with
> pre-release testing.
> > > [1] https://github.com/gluster/gluster-block > [2] https://github.com/gluster/gluster-block/releases > [3] https://github.com/gluster/gluster-block/blob/master/INSTALL > [4] https://github.com/gluster/gluster-block#usage > [5] https://github.com/gluster/gluster-block/blob/master/README.md > [6] https://github.com/gluster/gluster-block/tree/master/docs > [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t > [8] https://download.gluster.org/pub/gluster/gluster-block/ > [9] https://github.com/gluster/gluster-block/issues/new > > Cheers, > Team Gluster-Block! > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel From aspandey at redhat.com Tue May 7 09:03:52 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Tue, 7 May 2019 05:03:52 -0400 (EDT) Subject: [Gluster-devel] New in GlusterFS In-Reply-To: References: Message-ID: <1109996399.17152975.1557219832489.JavaMail.zimbra@redhat.com> Hi Rajib, Welcome to the gluster community. I am attaching some of the documents which I found while I started working on Erasure Coded volumes. Once you clone, you can also find out following documents - glusterfs/doc/developer-guide/ec-implementation.md You can start with above documents and code reading, If you have any doubts, please feel free to send it to gluster-devel mailing list. Also, you can attend gluster community meetings to discuss technical details of gluster. https://github.com/gluster/community Meeting schedule - * APAC friendly hours * Tuesday 14th May 2019, 11:30AM IST * Bridge: https://bluejeans.com/836554017 * NA/EMEA * Tuesday 7th May 2019, 01:00 PM EDT * Bridge: https://bluejeans.com/486278655 --- Ashish ----- Original Message ----- From: "Rajib Hossen" To: gluster-devel at gluster.org Sent: Tuesday, May 7, 2019 1:34:59 AM Subject: [Gluster-devel] New in GlusterFS Hello all, I am new in glusterfs development. I would like to contribute in Erasure Coding part of glusterfs. I already studied non-systematic code and its theory. Now, I want to know how erasure coding read/write works in terms of coding. Can you please give me any documentation that'll help to understand glusterfs ec read/write, coding structure. Please any help is appreciated. Thanks you very much. Sincerely, Md Rajib Hossen _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Erasure Coding - Design Type: application/octet-stream Size: 10021 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Disperse_Xlator_Ramon_Datalab.pdf Type: application/pdf Size: 1594327 bytes Desc: not available URL: From aspandey at redhat.com Tue May 7 09:19:05 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Tue, 7 May 2019 05:19:05 -0400 (EDT) Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> Message-ID: <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Hi, While we send a mail on gluster-devel or gluster-user mailing list, following content gets auto generated and placed at the end of mail. 
Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? Like this - Meeting schedule - * APAC friendly hours * Tuesday 14th May 2019 , 11:30AM IST * Bridge: https://bluejeans.com/836554017 * NA/EMEA * Tuesday 7th May 2019 , 01:00 PM EDT * Bridge: https://bluejeans.com/486278655 Or just a link to meeting minutes details?? https://github.com/gluster/community/tree/master/meetings This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. --- Ashish -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Tue May 7 14:34:52 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Tue, 7 May 2019 20:04:52 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Looks like is_nfs_export_available started failing again in recent centos-regressions. Michael, can you please check? On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer > wrote: > >> Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >> > Is this back again? The recent patches are failing regression :-\ . >> >> So, on builder206, it took me a while to find that the issue is that >> nfs (the service) was running. >> >> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation >> failed with a rather cryptic message: >> >> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0- >> socket.nfs-server: process started listening on port (38465) >> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0- >> socket.nfs-server: binding to failed: Address already in use >> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0- >> socket.nfs-server: Port is already in use >> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >> socket.nfs-server: __socket_server_bind failed;closing socket 14 >> >> I found where this came from, but a few stuff did surprised me: >> >> - the order of print is different that the order in the code >> > > Indeed strange... > >> - the message on "started listening" didn't take in account the fact >> that bind failed on: >> > > Shouldn't it bail out if it failed to bind? > Some missing 'goto out' around line 975/976? > Y. > >> >> >> >> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >> >> The message about port 38465 also threw me off the track. The real >> issue is that the service nfs was already running, and I couldn't find >> anything listening on port 38465 >> >> once I do service nfs stop, it no longer failed. >> >> So far, I do know why nfs.service was activated. >> >> But at least, 206 should be fixed, and we know a bit more on what would >> be causing some failure. >> >> >> >> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer >> > wrote: >> > >> > > Le mercredi 03 avril 2019 ? 
16:30 +0530, Atin Mukherjee a ?crit : >> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >> > > > jthottan at redhat.com> >> > > > wrote: >> > > > >> > > > > Hi, >> > > > > >> > > > > is_nfs_export_available is just a wrapper around "showmount" >> > > > > command AFAIR. >> > > > > I saw following messages in console output. >> > > > > mount.nfs: rpc.statd is not running but is required for remote >> > > > > locking. >> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, >> > > > > or >> > > > > start >> > > > > statd. >> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified >> > > > > >> > > > > For me it looks rpcbind may not be running on the machine. >> > > > > Usually rpcbind starts automatically on machines, don't know >> > > > > whether it >> > > > > can happen or not. >> > > > > >> > > > >> > > > That's precisely what the question is. Why suddenly we're seeing >> > > > this >> > > > happening too frequently. Today I saw atleast 4 to 5 such >> > > > failures >> > > > already. >> > > > >> > > > Deepshika - Can you please help in inspecting this? >> > > >> > > So we think (we are not sure) that the issue is a bit complex. >> > > >> > > What we were investigating was nightly run fail on aws. When the >> > > build >> > > crash, the builder is restarted, since that's the easiest way to >> > > clean >> > > everything (since even with a perfect test suite that would clean >> > > itself, we could always end in a corrupt state on the system, WRT >> > > mount, fs, etc). >> > > >> > > In turn, this seems to cause trouble on aws, since cloud-init or >> > > something rename eth0 interface to ens5, without cleaning to the >> > > network configuration. >> > > >> > > So the network init script fail (because the image say "start eth0" >> > > and >> > > that's not present), but fail in a weird way. Network is >> > > initialised >> > > and working (we can connect), but the dhclient process is not in >> > > the >> > > right cgroup, and network.service is in failed state. Restarting >> > > network didn't work. In turn, this mean that rpc-statd refuse to >> > > start >> > > (due to systemd dependencies), which seems to impact various NFS >> > > tests. >> > > >> > > We have also seen that on some builders, rpcbind pick some IP v6 >> > > autoconfiguration, but we can't reproduce that, and there is no ip >> > > v6 >> > > set up anywhere. I suspect the network.service failure is somehow >> > > involved, but fail to see how. In turn, rpcbind.socket not starting >> > > could cause NFS test troubles. >> > > >> > > Our current stop gap fix was to fix all the builders one by one. >> > > Remove >> > > the config, kill the rogue dhclient, restart network service. >> > > >> > > However, we can't be sure this is going to fix the problem long >> > > term >> > > since this only manifest after a crash of the test suite, and it >> > > doesn't happen so often. (plus, it was working before some day in >> > > the >> > > past, when something did make this fail, and I do not know if >> > > that's a >> > > system upgrade, or a test change, or both). >> > > >> > > So we are still looking at it to have a complete understanding of >> > > the >> > > issue, but so far, we hacked our way to make it work (or so do I >> > > think). >> > > >> > > Deepshika is working to fix it long term, by fixing the issue >> > > regarding >> > > eth0/ens5 with a new base image. 
>> > > -- >> > > Michael Scherer >> > > Sysadmin, Community Infrastructure and Platform, OSAS >> > > >> > > >> > > -- >> > >> > - Atin (atinm) >> -- >> Michael Scherer >> Sysadmin, Community Infrastructure >> >> >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Tue May 7 16:11:29 2019 From: mscherer at redhat.com (Michael Scherer) Date: Tue, 07 May 2019 18:11:29 +0200 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : > Looks like is_nfs_export_available started failing again in recent > centos-regressions. > > Michael, can you please check? I will try but I am leaving for vacation tonight, so if I find nothing, until I leave, I guess Deepshika will have to look. > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > > > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < > > mscherer at redhat.com> > > wrote: > > > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > > > > Is this back again? The recent patches are failing regression > > > > :-\ . > > > > > > So, on builder206, it took me a while to find that the issue is > > > that > > > nfs (the service) was running. > > > > > > ./tests/basic/afr/tarissue.t failed, because the nfs > > > initialisation > > > failed with a rather cryptic message: > > > > > > [2019-04-23 13:17:05.371733] I > > > [socket.c:991:__socket_server_bind] 0- > > > socket.nfs-server: process started listening on port (38465) > > > [2019-04-23 13:17:05.385819] E > > > [socket.c:972:__socket_server_bind] 0- > > > socket.nfs-server: binding to failed: Address already in use > > > [2019-04-23 13:17:05.385843] E > > > [socket.c:974:__socket_server_bind] 0- > > > socket.nfs-server: Port is already in use > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 > > > > > > I found where this came from, but a few stuff did surprised me: > > > > > > - the order of print is different that the order in the code > > > > > > > Indeed strange... > > > > > - the message on "started listening" didn't take in account the > > > fact > > > that bind failed on: > > > > > > > Shouldn't it bail out if it failed to bind? > > Some missing 'goto out' around line 975/976? > > Y. > > > > > > > > > > > > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 > > > > > > The message about port 38465 also threw me off the track. The > > > real > > > issue is that the service nfs was already running, and I couldn't > > > find > > > anything listening on port 38465 > > > > > > once I do service nfs stop, it no longer failed. > > > > > > So far, I do know why nfs.service was activated. > > > > > > But at least, 206 should be fixed, and we know a bit more on what > > > would > > > be causing some failure. 
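The missing-bail-out theory from the exchange above is easy to picture. A hedged sketch of the suspected fix in __socket_server_bind(), illustrative only and not the actual socket.c lines around 975/976; gf_log() and the GF_LOG_* levels are Gluster's real logging API (from its logging.h), everything else here is simplified:

#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Simplified shape of the suspected fix: jump past the success log
 * when bind() fails, so "process started listening" is only printed
 * after a successful bind. */
static int
server_bind_sketch(int sock, struct sockaddr *addr, socklen_t len, int port)
{
    int ret = bind(sock, addr, len);

    if (ret == -1) {
        gf_log("socket", GF_LOG_ERROR, "binding to port %d failed: %s",
               port, strerror(errno));
        goto out; /* the 'goto out' Yaniv suspects is missing */
    }

    gf_log("socket", GF_LOG_INFO,
           "process started listening on port (%d)", port);

out:
    return ret;
}

This would also account for the confusing log output quoted above: without an early bail-out, both the failure messages and the "started listening" message can be emitted for the same socket.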
> > > > > > > > > > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < > > > > mscherer at redhat.com> > > > > wrote: > > > > > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a > > > > > ?crit : > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > > > > jthottan at redhat.com> > > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > is_nfs_export_available is just a wrapper around > > > > > > > "showmount" > > > > > > > command AFAIR. > > > > > > > I saw following messages in console output. > > > > > > > mount.nfs: rpc.statd is not running but is required for > > > > > > > remote > > > > > > > locking. > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks > > > > > > > local, > > > > > > > or > > > > > > > start > > > > > > > statd. > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was > > > > > > > specified > > > > > > > > > > > > > > For me it looks rpcbind may not be running on the > > > > > > > machine. > > > > > > > Usually rpcbind starts automatically on machines, don't > > > > > > > know > > > > > > > whether it > > > > > > > can happen or not. > > > > > > > > > > > > > > > > > > > That's precisely what the question is. Why suddenly we're > > > > > > seeing > > > > > > this > > > > > > happening too frequently. Today I saw atleast 4 to 5 such > > > > > > failures > > > > > > already. > > > > > > > > > > > > Deepshika - Can you please help in inspecting this? > > > > > > > > > > So we think (we are not sure) that the issue is a bit > > > > > complex. > > > > > > > > > > What we were investigating was nightly run fail on aws. When > > > > > the > > > > > build > > > > > crash, the builder is restarted, since that's the easiest way > > > > > to > > > > > clean > > > > > everything (since even with a perfect test suite that would > > > > > clean > > > > > itself, we could always end in a corrupt state on the system, > > > > > WRT > > > > > mount, fs, etc). > > > > > > > > > > In turn, this seems to cause trouble on aws, since cloud-init > > > > > or > > > > > something rename eth0 interface to ens5, without cleaning to > > > > > the > > > > > network configuration. > > > > > > > > > > So the network init script fail (because the image say "start > > > > > eth0" > > > > > and > > > > > that's not present), but fail in a weird way. Network is > > > > > initialised > > > > > and working (we can connect), but the dhclient process is not > > > > > in > > > > > the > > > > > right cgroup, and network.service is in failed state. > > > > > Restarting > > > > > network didn't work. In turn, this mean that rpc-statd refuse > > > > > to > > > > > start > > > > > (due to systemd dependencies), which seems to impact various > > > > > NFS > > > > > tests. > > > > > > > > > > We have also seen that on some builders, rpcbind pick some IP > > > > > v6 > > > > > autoconfiguration, but we can't reproduce that, and there is > > > > > no ip > > > > > v6 > > > > > set up anywhere. I suspect the network.service failure is > > > > > somehow > > > > > involved, but fail to see how. In turn, rpcbind.socket not > > > > > starting > > > > > could cause NFS test troubles. > > > > > > > > > > Our current stop gap fix was to fix all the builders one by > > > > > one. > > > > > Remove > > > > > the config, kill the rogue dhclient, restart network service. 
> > > > > > > > > > However, we can't be sure this is going to fix the problem > > > > > long > > > > > term > > > > > since this only manifest after a crash of the test suite, and > > > > > it > > > > > doesn't happen so often. (plus, it was working before some > > > > > day in > > > > > the > > > > > past, when something did make this fail, and I do not know if > > > > > that's a > > > > > system upgrade, or a test change, or both). > > > > > > > > > > So we are still looking at it to have a complete > > > > > understanding of > > > > > the > > > > > issue, but so far, we hacked our way to make it work (or so > > > > > do I > > > > > think). > > > > > > > > > > Deepshika is working to fix it long term, by fixing the issue > > > > > regarding > > > > > eth0/ens5 with a new base image. > > > > > -- > > > > > Michael Scherer > > > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > > > > > > > -- > > > > > > > > - Atin (atinm) > > > > > > -- > > > Michael Scherer > > > Sysadmin, Community Infrastructure > > > > > > > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From rabhat at redhat.com Tue May 7 18:14:33 2019 From: rabhat at redhat.com (FNU Raghavendra Manjunath) Date: Tue, 7 May 2019 14:14:33 -0400 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: + 1 to this. There is also one more thing. For some reason, the community meeting is not visible in my calendar (especially NA region). I am not sure if anyone else also facing this issue. Regards, Raghavendra On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > Hi, > > While we send a mail on gluster-devel or gluster-user mailing list, > following content gets auto generated and placed at the end of mail. > > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? > Like this - > > Meeting schedule - > > > - APAC friendly hours > - Tuesday 14th May 2019, 11:30AM IST > - Bridge: https://bluejeans.com/836554017 > - NA/EMEA > - Tuesday 7th May 2019, 01:00 PM EDT > - Bridge: https://bluejeans.com/486278655 > > Or just a link to meeting minutes details?? > https://github.com/gluster/community/tree/master/meetings > > This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. 
> > --- > Ashish > > > > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Tue May 7 18:37:27 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 7 May 2019 11:37:27 -0700 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath wrote: > > + 1 to this. > I have updated the footer of gluster-devel. If that looks ok, we can extend it to gluster-users too. In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always stick to the first 4 Tuesdays of every month. That will help in describing the community meeting schedule better. If we want to keep the schedule running on alternate Tuesdays, please let me know and the mailing list footers can be updated accordingly :-). > There is also one more thing. For some reason, the community meeting is > not visible in my calendar (especially NA region). I am not sure if anyone > else also facing this issue. > I did face this issue. Realized that we had a meeting today and showed up at the meeting a while later but did not see many participants. Perhaps, the calendar invite has to be made a recurring one. Thanks, Vijay > > Regards, > Raghavendra > > On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > >> Hi, >> >> While we send a mail on gluster-devel or gluster-user mailing list, >> following content gets auto generated and placed at the end of mail. >> >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? >> Like this - >> >> Meeting schedule - >> >> >> - APAC friendly hours >> - Tuesday 14th May 2019, 11:30AM IST >> - Bridge: https://bluejeans.com/836554017 >> - NA/EMEA >> - Tuesday 7th May 2019, 01:00 PM EDT >> - Bridge: https://bluejeans.com/486278655 >> >> Or just a link to meeting minutes details?? >> https://github.com/gluster/community/tree/master/meetings >> >> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >> >> --- >> Ashish >> >> >> >> >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Tue May 7 18:53:05 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Wed, 8 May 2019 00:23:05 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? 
In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Sanju, can you please give us more info about the failures. I see the failures occurring on just one of the builder (builder206). I'm taking it back offline for now. On Tue, May 7, 2019 at 9:42 PM Michael Scherer wrote: > Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : > > Looks like is_nfs_export_available started failing again in recent > > centos-regressions. > > > > Michael, can you please check? > > I will try but I am leaving for vacation tonight, so if I find nothing, > until I leave, I guess Deepshika will have to look. > > > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > > > > > > > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < > > > mscherer at redhat.com> > > > wrote: > > > > > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > > > > > Is this back again? The recent patches are failing regression > > > > > :-\ . > > > > > > > > So, on builder206, it took me a while to find that the issue is > > > > that > > > > nfs (the service) was running. > > > > > > > > ./tests/basic/afr/tarissue.t failed, because the nfs > > > > initialisation > > > > failed with a rather cryptic message: > > > > > > > > [2019-04-23 13:17:05.371733] I > > > > [socket.c:991:__socket_server_bind] 0- > > > > socket.nfs-server: process started listening on port (38465) > > > > [2019-04-23 13:17:05.385819] E > > > > [socket.c:972:__socket_server_bind] 0- > > > > socket.nfs-server: binding to failed: Address already in use > > > > [2019-04-23 13:17:05.385843] E > > > > [socket.c:974:__socket_server_bind] 0- > > > > socket.nfs-server: Port is already in use > > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- > > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 > > > > > > > > I found where this came from, but a few stuff did surprised me: > > > > > > > > - the order of print is different that the order in the code > > > > > > > > > > Indeed strange... > > > > > > > - the message on "started listening" didn't take in account the > > > > fact > > > > that bind failed on: > > > > > > > > > > Shouldn't it bail out if it failed to bind? > > > Some missing 'goto out' around line 975/976? > > > Y. > > > > > > > > > > > > > > > > > > > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 > > > > > > > > The message about port 38465 also threw me off the track. The > > > > real > > > > issue is that the service nfs was already running, and I couldn't > > > > find > > > > anything listening on port 38465 > > > > > > > > once I do service nfs stop, it no longer failed. > > > > > > > > So far, I do know why nfs.service was activated. > > > > > > > > But at least, 206 should be fixed, and we know a bit more on what > > > > would > > > > be causing some failure. > > > > > > > > > > > > > > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < > > > > > mscherer at redhat.com> > > > > > wrote: > > > > > > > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a > > > > > > ?crit : > > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > > > > > jthottan at redhat.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > is_nfs_export_available is just a wrapper around > > > > > > > > "showmount" > > > > > > > > command AFAIR. > > > > > > > > I saw following messages in console output. 
> > > > > > > > mount.nfs: rpc.statd is not running but is required for > > > > > > > > remote > > > > > > > > locking. > > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks > > > > > > > > local, > > > > > > > > or > > > > > > > > start > > > > > > > > statd. > > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was > > > > > > > > specified > > > > > > > > > > > > > > > > For me it looks rpcbind may not be running on the > > > > > > > > machine. > > > > > > > > Usually rpcbind starts automatically on machines, don't > > > > > > > > know > > > > > > > > whether it > > > > > > > > can happen or not. > > > > > > > > > > > > > > > > > > > > > > That's precisely what the question is. Why suddenly we're > > > > > > > seeing > > > > > > > this > > > > > > > happening too frequently. Today I saw atleast 4 to 5 such > > > > > > > failures > > > > > > > already. > > > > > > > > > > > > > > Deepshika - Can you please help in inspecting this? > > > > > > > > > > > > So we think (we are not sure) that the issue is a bit > > > > > > complex. > > > > > > > > > > > > What we were investigating was nightly run fail on aws. When > > > > > > the > > > > > > build > > > > > > crash, the builder is restarted, since that's the easiest way > > > > > > to > > > > > > clean > > > > > > everything (since even with a perfect test suite that would > > > > > > clean > > > > > > itself, we could always end in a corrupt state on the system, > > > > > > WRT > > > > > > mount, fs, etc). > > > > > > > > > > > > In turn, this seems to cause trouble on aws, since cloud-init > > > > > > or > > > > > > something rename eth0 interface to ens5, without cleaning to > > > > > > the > > > > > > network configuration. > > > > > > > > > > > > So the network init script fail (because the image say "start > > > > > > eth0" > > > > > > and > > > > > > that's not present), but fail in a weird way. Network is > > > > > > initialised > > > > > > and working (we can connect), but the dhclient process is not > > > > > > in > > > > > > the > > > > > > right cgroup, and network.service is in failed state. > > > > > > Restarting > > > > > > network didn't work. In turn, this mean that rpc-statd refuse > > > > > > to > > > > > > start > > > > > > (due to systemd dependencies), which seems to impact various > > > > > > NFS > > > > > > tests. > > > > > > > > > > > > We have also seen that on some builders, rpcbind pick some IP > > > > > > v6 > > > > > > autoconfiguration, but we can't reproduce that, and there is > > > > > > no ip > > > > > > v6 > > > > > > set up anywhere. I suspect the network.service failure is > > > > > > somehow > > > > > > involved, but fail to see how. In turn, rpcbind.socket not > > > > > > starting > > > > > > could cause NFS test troubles. > > > > > > > > > > > > Our current stop gap fix was to fix all the builders one by > > > > > > one. > > > > > > Remove > > > > > > the config, kill the rogue dhclient, restart network service. > > > > > > > > > > > > However, we can't be sure this is going to fix the problem > > > > > > long > > > > > > term > > > > > > since this only manifest after a crash of the test suite, and > > > > > > it > > > > > > doesn't happen so often. (plus, it was working before some > > > > > > day in > > > > > > the > > > > > > past, when something did make this fail, and I do not know if > > > > > > that's a > > > > > > system upgrade, or a test change, or both). 
> > > > > > > > > > > > So we are still looking at it to have a complete > > > > > > understanding of > > > > > > the > > > > > > issue, but so far, we hacked our way to make it work (or so > > > > > > do I > > > > > > think). > > > > > > > > > > > > Deepshika is working to fix it long term, by fixing the issue > > > > > > regarding > > > > > > eth0/ens5 with a new base image. > > > > > > -- > > > > > > Michael Scherer > > > > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > - Atin (atinm) > > > > > > > > -- > > > > Michael Scherer > > > > Sysadmin, Community Infrastructure > > > > > > > > > > > > > > > > _______________________________________________ > > > > Gluster-devel mailing list > > > > Gluster-devel at gluster.org > > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > > -- > Michael Scherer > Sysadmin, Community Infrastructure > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Wed May 8 01:45:53 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Wed, 8 May 2019 07:15:53 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Deepshikha, I see the failure here[1] which ran on builder206. So, we are good. [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal wrote: > Sanju, can you please give us more info about the failures. > > I see the failures occurring on just one of the builder (builder206). I'm > taking it back offline for now. > > On Tue, May 7, 2019 at 9:42 PM Michael Scherer > wrote: > >> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >> > Looks like is_nfs_export_available started failing again in recent >> > centos-regressions. >> > >> > Michael, can you please check? >> >> I will try but I am leaving for vacation tonight, so if I find nothing, >> until I leave, I guess Deepshika will have to look. >> >> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >> > >> > > >> > > >> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >> > > mscherer at redhat.com> >> > > wrote: >> > > >> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >> > > > > Is this back again? The recent patches are failing regression >> > > > > :-\ . >> > > > >> > > > So, on builder206, it took me a while to find that the issue is >> > > > that >> > > > nfs (the service) was running. 
>> > > > >> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >> > > > initialisation >> > > > failed with a rather cryptic message: >> > > > >> > > > [2019-04-23 13:17:05.371733] I >> > > > [socket.c:991:__socket_server_bind] 0- >> > > > socket.nfs-server: process started listening on port (38465) >> > > > [2019-04-23 13:17:05.385819] E >> > > > [socket.c:972:__socket_server_bind] 0- >> > > > socket.nfs-server: binding to failed: Address already in use >> > > > [2019-04-23 13:17:05.385843] E >> > > > [socket.c:974:__socket_server_bind] 0- >> > > > socket.nfs-server: Port is already in use >> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >> > > > >> > > > I found where this came from, but a few stuff did surprised me: >> > > > >> > > > - the order of print is different that the order in the code >> > > > >> > > >> > > Indeed strange... >> > > >> > > > - the message on "started listening" didn't take in account the >> > > > fact >> > > > that bind failed on: >> > > > >> > > >> > > Shouldn't it bail out if it failed to bind? >> > > Some missing 'goto out' around line 975/976? >> > > Y. >> > > >> > > > >> > > > >> > > > >> > > > >> >> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >> > > > >> > > > The message about port 38465 also threw me off the track. The >> > > > real >> > > > issue is that the service nfs was already running, and I couldn't >> > > > find >> > > > anything listening on port 38465 >> > > > >> > > > once I do service nfs stop, it no longer failed. >> > > > >> > > > So far, I do know why nfs.service was activated. >> > > > >> > > > But at least, 206 should be fixed, and we know a bit more on what >> > > > would >> > > > be causing some failure. >> > > > >> > > > >> > > > >> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >> > > > > mscherer at redhat.com> >> > > > > wrote: >> > > > > >> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >> > > > > > ?crit : >> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >> > > > > > > jthottan at redhat.com> >> > > > > > > wrote: >> > > > > > > >> > > > > > > > Hi, >> > > > > > > > >> > > > > > > > is_nfs_export_available is just a wrapper around >> > > > > > > > "showmount" >> > > > > > > > command AFAIR. >> > > > > > > > I saw following messages in console output. >> > > > > > > > mount.nfs: rpc.statd is not running but is required for >> > > > > > > > remote >> > > > > > > > locking. >> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >> > > > > > > > local, >> > > > > > > > or >> > > > > > > > start >> > > > > > > > statd. >> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >> > > > > > > > specified >> > > > > > > > >> > > > > > > > For me it looks rpcbind may not be running on the >> > > > > > > > machine. >> > > > > > > > Usually rpcbind starts automatically on machines, don't >> > > > > > > > know >> > > > > > > > whether it >> > > > > > > > can happen or not. >> > > > > > > > >> > > > > > > >> > > > > > > That's precisely what the question is. Why suddenly we're >> > > > > > > seeing >> > > > > > > this >> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >> > > > > > > failures >> > > > > > > already. >> > > > > > > >> > > > > > > Deepshika - Can you please help in inspecting this? >> > > > > > >> > > > > > So we think (we are not sure) that the issue is a bit >> > > > > > complex. 
>> > > > > > >> > > > > > What we were investigating was nightly run fail on aws. When >> > > > > > the >> > > > > > build >> > > > > > crash, the builder is restarted, since that's the easiest way >> > > > > > to >> > > > > > clean >> > > > > > everything (since even with a perfect test suite that would >> > > > > > clean >> > > > > > itself, we could always end in a corrupt state on the system, >> > > > > > WRT >> > > > > > mount, fs, etc). >> > > > > > >> > > > > > In turn, this seems to cause trouble on aws, since cloud-init >> > > > > > or >> > > > > > something rename eth0 interface to ens5, without cleaning to >> > > > > > the >> > > > > > network configuration. >> > > > > > >> > > > > > So the network init script fail (because the image say "start >> > > > > > eth0" >> > > > > > and >> > > > > > that's not present), but fail in a weird way. Network is >> > > > > > initialised >> > > > > > and working (we can connect), but the dhclient process is not >> > > > > > in >> > > > > > the >> > > > > > right cgroup, and network.service is in failed state. >> > > > > > Restarting >> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >> > > > > > to >> > > > > > start >> > > > > > (due to systemd dependencies), which seems to impact various >> > > > > > NFS >> > > > > > tests. >> > > > > > >> > > > > > We have also seen that on some builders, rpcbind pick some IP >> > > > > > v6 >> > > > > > autoconfiguration, but we can't reproduce that, and there is >> > > > > > no ip >> > > > > > v6 >> > > > > > set up anywhere. I suspect the network.service failure is >> > > > > > somehow >> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >> > > > > > starting >> > > > > > could cause NFS test troubles. >> > > > > > >> > > > > > Our current stop gap fix was to fix all the builders one by >> > > > > > one. >> > > > > > Remove >> > > > > > the config, kill the rogue dhclient, restart network service. >> > > > > > >> > > > > > However, we can't be sure this is going to fix the problem >> > > > > > long >> > > > > > term >> > > > > > since this only manifest after a crash of the test suite, and >> > > > > > it >> > > > > > doesn't happen so often. (plus, it was working before some >> > > > > > day in >> > > > > > the >> > > > > > past, when something did make this fail, and I do not know if >> > > > > > that's a >> > > > > > system upgrade, or a test change, or both). >> > > > > > >> > > > > > So we are still looking at it to have a complete >> > > > > > understanding of >> > > > > > the >> > > > > > issue, but so far, we hacked our way to make it work (or so >> > > > > > do I >> > > > > > think). >> > > > > > >> > > > > > Deepshika is working to fix it long term, by fixing the issue >> > > > > > regarding >> > > > > > eth0/ens5 with a new base image. 
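For reference, the per-builder stop-gap described above amounts to something like the following on an EL7 builder -- the exact paths and unit names are my assumption, not from the original mail:

  rm -f /etc/sysconfig/network-scripts/ifcfg-eth0   # drop the stale config for the renamed NIC
  pkill dhclient                                    # kill the rogue dhclient
  systemctl restart network                         # clear the failed network.service state
  systemctl start rpcbind.socket rpc-statd          # the NFS-related units can start again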
>> > > > > > -- >> > > > > > Michael Scherer >> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >> > > > > > >> > > > > > >> > > > > > -- >> > > > > >> > > > > - Atin (atinm) >> > > > >> > > > -- >> > > > Michael Scherer >> > > > Sysadmin, Community Infrastructure >> > > > >> > > > >> > > > >> > > > _______________________________________________ >> > > > Gluster-devel mailing list >> > > > Gluster-devel at gluster.org >> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > > >> > > _______________________________________________ >> > > Gluster-devel mailing list >> > > Gluster-devel at gluster.org >> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > >> > >> > >> -- >> Michael Scherer >> Sysadmin, Community Infrastructure >> >> >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:15:10 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:45:10 +0530 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Wed, May 8, 2019 at 12:08 AM Vijay Bellur wrote: > > > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> + 1 to this. >> > > I have updated the footer of gluster-devel. If that looks ok, we can > extend it to gluster-users too. > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always > stick to the first 4 Tuesdays of every month. That will help in describing > the community meeting schedule better. If we want to keep the schedule > running on alternate Tuesdays, please let me know and the mailing list > footers can be updated accordingly :-). > > >> There is also one more thing. For some reason, the community meeting is >> not visible in my calendar (especially NA region). I am not sure if anyone >> else also facing this issue. >> > > I did face this issue. Realized that we had a meeting today and showed up > at the meeting a while later but did not see many participants. Perhaps, > the calendar invite has to be made a recurring one. > We'd need to explicitly import the invite and add it to our calendar, otherwise it doesn't reflect. > Thanks, > Vijay > > >> >> Regards, >> Raghavendra >> >> On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: >> >>> Hi, >>> >>> While we send a mail on gluster-devel or gluster-user mailing list, >>> following content gets auto generated and placed at the end of mail. >>> >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? 
>>> Like this - >>> >>> Meeting schedule - >>> >>> >>> - APAC friendly hours >>> - Tuesday 14th May 2019, 11:30AM IST >>> - Bridge: https://bluejeans.com/836554017 >>> - NA/EMEA >>> - Tuesday 7th May 2019, 01:00 PM EDT >>> - Bridge: https://bluejeans.com/486278655 >>> >>> Or just a link to meeting minutes details?? >>> https://github.com/gluster/community/tree/master/meetings >>> >>> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >>> >>> --- >>> Ashish >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:16:47 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:46:47 +0530 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Wed, May 8, 2019 at 9:45 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 12:08 AM Vijay Bellur wrote: > >> >> >> On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < >> rabhat at redhat.com> wrote: >> >>> >>> + 1 to this. >>> >> >> I have updated the footer of gluster-devel. If that looks ok, we can >> extend it to gluster-users too. >> >> In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and >> always stick to the first 4 Tuesdays of every month. That will help in >> describing the community meeting schedule better. If we want to keep the >> schedule running on alternate Tuesdays, please let me know and the mailing >> list footers can be updated accordingly :-). >> >> >>> There is also one more thing. For some reason, the community meeting is >>> not visible in my calendar (especially NA region). I am not sure if anyone >>> else also facing this issue. >>> >> >> I did face this issue. Realized that we had a meeting today and showed up >> at the meeting a while later but did not see many participants. Perhaps, >> the calendar invite has to be made a recurring one. >> > > We'd need to explicitly import the invite and add it to our calendar, > otherwise it doesn't reflect. > And you're right that the last series wasn't a recurring one either. > >> Thanks, >> Vijay >> >> >>> >>> Regards, >>> Raghavendra >>> >>> On Tue, May 7, 2019 at 5:19 AM Ashish Pandey >>> wrote: >>> >>>> Hi, >>>> >>>> While we send a mail on gluster-devel or gluster-user mailing list, >>>> following content gets auto generated and placed at the end of mail. 
>>>> >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? >>>> Like this - >>>> >>>> Meeting schedule - >>>> >>>> >>>> - APAC friendly hours >>>> - Tuesday 14th May 2019, 11:30AM IST >>>> - Bridge: https://bluejeans.com/836554017 >>>> - NA/EMEA >>>> - Tuesday 7th May 2019, 01:00 PM EDT >>>> - Bridge: https://bluejeans.com/486278655 >>>> >>>> Or just a link to meeting minutes details?? >>>> https://github.com/gluster/community/tree/master/meetings >>>> >>>> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >>>> >>>> --- >>>> Ashish >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> >> Community Meeting Calendar: >> >> APAC Schedule - >> Every 2nd and 4th Tuesday at 11:30 AM IST >> Bridge: https://bluejeans.com/836554017 >> >> NA/EMEA Schedule - >> Every 1st and 3rd Tuesday at 01:00 PM EDT >> Bridge: https://bluejeans.com/486278655 >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:23:04 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:53:04 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: > Deepshikha, > > I see the failure here[1] which ran on builder206. So, we are good. > Not really, https://build.gluster.org/job/centos7-regression/5909/consoleFull failed on builder204 for similar reasons I believe? I am bit more worried on this issue being resurfacing more often these days. What can we do to fix this permanently? > [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull > > On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal > wrote: > >> Sanju, can you please give us more info about the failures. >> >> I see the failures occurring on just one of the builder (builder206). I'm >> taking it back offline for now. >> >> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >> wrote: >> >>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>> > Looks like is_nfs_export_available started failing again in recent >>> > centos-regressions. >>> > >>> > Michael, can you please check? >>> >>> I will try but I am leaving for vacation tonight, so if I find nothing, >>> until I leave, I guess Deepshika will have to look. 
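Side note on the __socket_server_bind() oddity quoted below (the "process started listening" message following a failed bind): a sketch of the early bail-out Yaniv suggests, assuming socket.c's usual variable names and log style; this is not the actual upstream change:

    ret = bind(priv->sock, (struct sockaddr *)&this->myinfo.sockaddr,
               this->myinfo.sockaddr_len);
    if (ret == -1) {
        gf_log(this->name, GF_LOG_ERROR, "binding to %s failed: %s",
               this->myinfo.identifier, strerror(errno));
        if (errno == EADDRINUSE)
            gf_log(this->name, GF_LOG_ERROR, "Port is already in use");
        goto out; /* the suspected missing jump: never fall through to
                     the listen()/"started listening" success path after
                     a failed bind */
    }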
>>> >>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>> > >>> > > >>> > > >>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>> > > mscherer at redhat.com> >>> > > wrote: >>> > > >>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>> > > > > Is this back again? The recent patches are failing regression >>> > > > > :-\ . >>> > > > >>> > > > So, on builder206, it took me a while to find that the issue is >>> > > > that >>> > > > nfs (the service) was running. >>> > > > >>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>> > > > initialisation >>> > > > failed with a rather cryptic message: >>> > > > >>> > > > [2019-04-23 13:17:05.371733] I >>> > > > [socket.c:991:__socket_server_bind] 0- >>> > > > socket.nfs-server: process started listening on port (38465) >>> > > > [2019-04-23 13:17:05.385819] E >>> > > > [socket.c:972:__socket_server_bind] 0- >>> > > > socket.nfs-server: binding to failed: Address already in use >>> > > > [2019-04-23 13:17:05.385843] E >>> > > > [socket.c:974:__socket_server_bind] 0- >>> > > > socket.nfs-server: Port is already in use >>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>> > > > >>> > > > I found where this came from, but a few stuff did surprised me: >>> > > > >>> > > > - the order of print is different that the order in the code >>> > > > >>> > > >>> > > Indeed strange... >>> > > >>> > > > - the message on "started listening" didn't take in account the >>> > > > fact >>> > > > that bind failed on: >>> > > > >>> > > >>> > > Shouldn't it bail out if it failed to bind? >>> > > Some missing 'goto out' around line 975/976? >>> > > Y. >>> > > >>> > > > >>> > > > >>> > > > >>> > > > >>> >>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>> > > > >>> > > > The message about port 38465 also threw me off the track. The >>> > > > real >>> > > > issue is that the service nfs was already running, and I couldn't >>> > > > find >>> > > > anything listening on port 38465 >>> > > > >>> > > > once I do service nfs stop, it no longer failed. >>> > > > >>> > > > So far, I do know why nfs.service was activated. >>> > > > >>> > > > But at least, 206 should be fixed, and we know a bit more on what >>> > > > would >>> > > > be causing some failure. >>> > > > >>> > > > >>> > > > >>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>> > > > > mscherer at redhat.com> >>> > > > > wrote: >>> > > > > >>> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >>> > > > > > ?crit : >>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>> > > > > > > jthottan at redhat.com> >>> > > > > > > wrote: >>> > > > > > > >>> > > > > > > > Hi, >>> > > > > > > > >>> > > > > > > > is_nfs_export_available is just a wrapper around >>> > > > > > > > "showmount" >>> > > > > > > > command AFAIR. >>> > > > > > > > I saw following messages in console output. >>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>> > > > > > > > remote >>> > > > > > > > locking. >>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>> > > > > > > > local, >>> > > > > > > > or >>> > > > > > > > start >>> > > > > > > > statd. >>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>> > > > > > > > specified >>> > > > > > > > >>> > > > > > > > For me it looks rpcbind may not be running on the >>> > > > > > > > machine. 
>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>> > > > > > > > know >>> > > > > > > > whether it >>> > > > > > > > can happen or not. >>> > > > > > > > >>> > > > > > > >>> > > > > > > That's precisely what the question is. Why suddenly we're >>> > > > > > > seeing >>> > > > > > > this >>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>> > > > > > > failures >>> > > > > > > already. >>> > > > > > > >>> > > > > > > Deepshika - Can you please help in inspecting this? >>> > > > > > >>> > > > > > So we think (we are not sure) that the issue is a bit >>> > > > > > complex. >>> > > > > > >>> > > > > > What we were investigating was nightly run fail on aws. When >>> > > > > > the >>> > > > > > build >>> > > > > > crash, the builder is restarted, since that's the easiest way >>> > > > > > to >>> > > > > > clean >>> > > > > > everything (since even with a perfect test suite that would >>> > > > > > clean >>> > > > > > itself, we could always end in a corrupt state on the system, >>> > > > > > WRT >>> > > > > > mount, fs, etc). >>> > > > > > >>> > > > > > In turn, this seems to cause trouble on aws, since cloud-init >>> > > > > > or >>> > > > > > something rename eth0 interface to ens5, without cleaning to >>> > > > > > the >>> > > > > > network configuration. >>> > > > > > >>> > > > > > So the network init script fail (because the image say "start >>> > > > > > eth0" >>> > > > > > and >>> > > > > > that's not present), but fail in a weird way. Network is >>> > > > > > initialised >>> > > > > > and working (we can connect), but the dhclient process is not >>> > > > > > in >>> > > > > > the >>> > > > > > right cgroup, and network.service is in failed state. >>> > > > > > Restarting >>> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >>> > > > > > to >>> > > > > > start >>> > > > > > (due to systemd dependencies), which seems to impact various >>> > > > > > NFS >>> > > > > > tests. >>> > > > > > >>> > > > > > We have also seen that on some builders, rpcbind pick some IP >>> > > > > > v6 >>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>> > > > > > no ip >>> > > > > > v6 >>> > > > > > set up anywhere. I suspect the network.service failure is >>> > > > > > somehow >>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>> > > > > > starting >>> > > > > > could cause NFS test troubles. >>> > > > > > >>> > > > > > Our current stop gap fix was to fix all the builders one by >>> > > > > > one. >>> > > > > > Remove >>> > > > > > the config, kill the rogue dhclient, restart network service. >>> > > > > > >>> > > > > > However, we can't be sure this is going to fix the problem >>> > > > > > long >>> > > > > > term >>> > > > > > since this only manifest after a crash of the test suite, and >>> > > > > > it >>> > > > > > doesn't happen so often. (plus, it was working before some >>> > > > > > day in >>> > > > > > the >>> > > > > > past, when something did make this fail, and I do not know if >>> > > > > > that's a >>> > > > > > system upgrade, or a test change, or both). >>> > > > > > >>> > > > > > So we are still looking at it to have a complete >>> > > > > > understanding of >>> > > > > > the >>> > > > > > issue, but so far, we hacked our way to make it work (or so >>> > > > > > do I >>> > > > > > think). >>> > > > > > >>> > > > > > Deepshika is working to fix it long term, by fixing the issue >>> > > > > > regarding >>> > > > > > eth0/ens5 with a new base image. 
>>> > > > > > -- >>> > > > > > Michael Scherer >>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>> > > > > > >>> > > > > > >>> > > > > > -- >>> > > > > >>> > > > > - Atin (atinm) >>> > > > >>> > > > -- >>> > > > Michael Scherer >>> > > > Sysadmin, Community Infrastructure >>> > > > >>> > > > >>> > > > >>> > > > _______________________________________________ >>> > > > Gluster-devel mailing list >>> > > > Gluster-devel at gluster.org >>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>> > > >>> > > _______________________________________________ >>> > > Gluster-devel mailing list >>> > > Gluster-devel at gluster.org >>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>> > >>> > >>> > >>> -- >>> Michael Scherer >>> Sysadmin, Community Infrastructure >>> >>> >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> > > -- > Thanks, > Sanju > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndevos at redhat.com Wed May 8 07:08:08 2019 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 8 May 2019 09:08:08 +0200 Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: <20190508070808.GA22482@ndevos-x270> On Tue, May 07, 2019 at 11:37:27AM -0700, Vijay Bellur wrote: > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath > wrote: > > > > > + 1 to this. > > > > I have updated the footer of gluster-devel. If that looks ok, we can extend > it to gluster-users too. > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always > stick to the first 4 Tuesdays of every month. That will help in describing > the community meeting schedule better. If we want to keep the schedule > running on alternate Tuesdays, please let me know and the mailing list > footers can be updated accordingly :-). > > > > There is also one more thing. For some reason, the community meeting is > > not visible in my calendar (especially NA region). I am not sure if anyone > > else also facing this issue. > > > > I did face this issue. Realized that we had a meeting today and showed up > at the meeting a while later but did not see many participants. Perhaps, > the calendar invite has to be made a recurring one. Maybe a new invite can be sent with the minutes after a meeting has finished. This makes it easier for people that recently subscribed to the list to add it to their calendar? Niels > > Thanks, > Vijay > > > > > > Regards, > > Raghavendra > > > > On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > > > >> Hi, > >> > >> While we send a mail on gluster-devel or gluster-user mailing list, > >> following content gets auto generated and placed at the end of mail. 
> >> > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> > >> Gluster-devel mailing list > >> Gluster-devel at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-devel > >> > >> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? > >> Like this - > >> > >> Meeting schedule - > >> > >> > >> - APAC friendly hours > >> - Tuesday 14th May 2019, 11:30AM IST > >> - Bridge: https://bluejeans.com/836554017 > >> - NA/EMEA > >> - Tuesday 7th May 2019, 01:00 PM EDT > >> - Bridge: https://bluejeans.com/486278655 > >> > >> Or just a link to meeting minutes details?? > >> https://github.com/gluster/community/tree/master/meetings > >> > >> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. > >> > >> --- > >> Ashish > >> > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From vbellur at redhat.com Wed May 8 07:31:37 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Wed, 8 May 2019 00:31:37 -0700 Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <20190508070808.GA22482@ndevos-x270> References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> <20190508070808.GA22482@ndevos-x270> Message-ID: On Wed, May 8, 2019 at 12:08 AM Niels de Vos wrote: > On Tue, May 07, 2019 at 11:37:27AM -0700, Vijay Bellur wrote: > > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < > rabhat at redhat.com> > > wrote: > > > > > > > > + 1 to this. > > > > > > > I have updated the footer of gluster-devel. If that looks ok, we can > extend > > it to gluster-users too. > > > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and > always > > stick to the first 4 Tuesdays of every month. That will help in > describing > > the community meeting schedule better. If we want to keep the schedule > > running on alternate Tuesdays, please let me know and the mailing list > > footers can be updated accordingly :-). > > > > > > > There is also one more thing. For some reason, the community meeting is > > > not visible in my calendar (especially NA region). I am not sure if > anyone > > > else also facing this issue. > > > > > > > I did face this issue. Realized that we had a meeting today and showed up > > at the meeting a while later but did not see many participants. Perhaps, > > the calendar invite has to be made a recurring one. > > Maybe a new invite can be sent with the minutes after a meeting has > finished. This makes it easier for people that recently subscribed to > the list to add it to their calendar? > > > That is a good point. I have observed in google groups based mailing lists that a calendar invite for a recurring event is sent automatically to people after they subscribe to the list. I don't think mailman has a similar feature yet. 
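Making the invite recur is an iCalendar detail rather than a mailman one: the event just needs an RFC 5545 RRULE. A minimal sketch matching the 2nd-and-4th-Tuesday APAC slot (UID/DTSTAMP and the NA/EMEA twin omitted; the one-hour duration is an assumption):

BEGIN:VEVENT
SUMMARY:Gluster Community Meeting (APAC)
DTSTART;TZID=Asia/Kolkata:20190514T113000
DURATION:PT1H
RRULE:FREQ=MONTHLY;BYDAY=2TU,4TU
END:VEVENT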
Thanks, Vijay -------------- next part -------------- An HTML attachment was scrubbed... URL: From cynthia.zhou at nokia-sbell.com Wed May 8 07:58:20 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Wed, 8 May 2019 07:58:20 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
Message-ID: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>

Hi Milind Changire,

The leak is now much clearer to me. The remaining leak exists because in glusterfs 3.12.15 (my environment) the SSL context is shared: when we call ssl_accept, SSL allocates read/write buffers for the SSL object, but when ssl_free runs in the socket_reset or fini function of socket.c, those buffers are returned to the SSL context's free list instead of being freed outright.

So the following patch fixes the memory leak completely (created against the gluster master branch):

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
 gf_log(this->name, GF_LOG_DEBUG,
 "SSL verification succeeded (client: %s) (server: %s)",
 this->peerinfo.identifier, this->myinfo.identifier);
+ X509_free(peer);
 return gf_strdup(peer_CN);

 /* Error paths. */
@@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this)
 memset(&priv->incoming, 0, sizeof(priv->incoming));

 event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+ if(priv->use_ssl&& priv->ssl_ssl)
+ {
+ gf_log(this->name, GF_LOG_TRACE,
+ "clear and reset for socket(%d), free ssl ",
+ priv->sock);
+ if(priv->ssl_ctx)
+ {
+ SSL_CTX_free(priv->ssl_ctx);
+ priv->ssl_ctx = NULL;
+ }
+ SSL_shutdown(priv->ssl_ssl);
+ SSL_clear(priv->ssl_ssl);
+ SSL_free(priv->ssl_ssl);
+ priv->ssl_ssl = NULL;
+ }
 priv->sock = -1;
 priv->idx = -1;
 priv->connected = -1;
@@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this)
 pthread_mutex_destroy(&priv->out_lock);
 pthread_mutex_destroy(&priv->cond_lock);
 pthread_cond_destroy(&priv->cond);
+ if(priv->use_ssl&& priv->ssl_ssl)
+ {
+ gf_log(this->name, GF_LOG_TRACE,
+ "clear and reset for socket(%d), free ssl ",
+ priv->sock);
+ if(priv->ssl_ctx)
+ {
+ SSL_CTX_free(priv->ssl_ctx);
+ priv->ssl_ctx = NULL;
+ }
+ SSL_shutdown(priv->ssl_ssl);
+ SSL_clear(priv->ssl_ssl);
+ SSL_free(priv->ssl_ssl);

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 2:12 PM
To: 'Amar Tumballi Suryanarayan'
Cc: 'Milind Changire' ; 'gluster-devel at gluster.org'
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,

From our tests, valgrind and libleak both blame ssl3_accept:

///////////////////////////from valgrind attached to glusterfsd///////////////////////////////////////////
==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA610DDF:
ssl_complete_connection (socket.c:400) ==16673== by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409) ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) ==16673== by 0xA618788: socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123 ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p) ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) ==16673== by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409) ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) ==16673== by 0xA618788: socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== valgrind --leak-check=f ////////////////////////////////////with libleak attached to glusterfsd///////////////////////////////////////// callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749] /lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced] /lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] callstack[2432] expires. 
count=1 size=104/104 alloc=362 free=0 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627] /lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd] /lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9] /lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d] /lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972] /lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5] /lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51] /lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b] /lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] one interesting thing is that the memory goes up to about 300m then it stopped increasing !!! I am wondering if this is caused by open-ssl library? But when I search from openssl community, there is no such issue reported before. Is glusterfs using ssl_accept correctly? cynthia From: Zhou, Cynthia (NSB - CN/Hangzhou) Sent: Monday, May 06, 2019 10:34 AM To: 'Amar Tumballi Suryanarayan' > Cc: Milind Changire >; gluster-devel at gluster.org Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi, Sorry, I am so busy with other issues these days, could you help me to submit my patch for review? It is based on glusterfs3.12.15 code. But even with this patch , memory leak still exists, from memory leak tool it should be related with ssl_accept, not sure if it is because of openssl library or because improper use of ssl interfaces. --- a/rpc/rpc-transport/socket/src/socket.c +++ b/rpc/rpc-transport/socket/src/socket.c @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { memset(&priv->incoming, 0, sizeof(priv->incoming)); event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); - + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } priv->sock = -1; priv->idx = -1; priv->connected = -1; @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { pthread_mutex_destroy(&priv->out_lock); pthread_mutex_destroy(&priv->cond_lock); pthread_cond_destroy(&priv->cond); + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } if (priv->ssl_private_key) { GF_FREE(priv->ssl_private_key); } From: Amar Tumballi Suryanarayan > Sent: Wednesday, May 01, 2019 8:43 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Milind Changire >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi Cynthia Zhou, Can you post the patch which fixes the issue of missing free? 
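As to whether this is the OpenSSL library: one plausible reading, consistent with the fix that eventually worked in this thread, is OpenSSL 1.0.2's per-SSL_CTX read/write buffer freelists (removed in 1.1.0). SSL_free() parks the buffers on the shared context's freelist rather than handing them back to the allocator, and since the freelist is bounded, this would also explain why the growth eventually plateaus. A toy sketch of the pattern, assuming OpenSSL 1.0.x; this is not the Gluster code itself:

#include <openssl/ssl.h>

int main(void) {
    SSL_library_init();
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
    for (int i = 0; i < 100; i++) {
        SSL *ssl = SSL_new(ctx);
        /* SSL_set_fd() + SSL_accept() would allocate rbuf/wbuf here
         * via ssl3_setup_buffers(), as in the valgrind stacks above */
        SSL_free(ssl);  /* buffers go back to ctx's freelist, still resident */
    }
    SSL_CTX_free(ctx);  /* only this finally releases the freelist memory */
    return 0;
}

Her May 06 patch, quoted next, predates that insight; the newer patch earlier in this mail adds the SSL_CTX_free() calls that actually release the memory.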
We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master. -Amar On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote: Ok, I am clear now. I?ve added ssl_free in socket reset and socket finish function, though glusterfsd memory leak is not that much, still it is leaking, from source code I can not find anything else, Could you help to check if this issue exists in your env? If not I may have a try to merge your patch . Step 1> while true;do gluster v heal info, 2> check the vol-name glusterfsd memory usage, it is obviously increasing. cynthia From: Milind Changire > Sent: Monday, April 22, 2019 2:36 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Atin Mukherjee >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl According to BIO_new_socket() man page ... If the close flag is set then the socket is shut down and closed when the BIO is freed. For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO is freed. _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mchangir at redhat.com Wed May 8 09:01:29 2019 From: mchangir at redhat.com (Milind Changire) Date: Wed, 8 May 2019 14:31:29 +0530 Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl In-Reply-To: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com> <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> Message-ID: awesome! well done! thank you for taking pain to fix the memory leak On Wed, May 8, 2019 at 1:28 PM Zhou, Cynthia (NSB - CN/Hangzhou) < cynthia.zhou at nokia-sbell.com> wrote: > Hi 'Milind Changire' , > > The leak is getting more and more clear to me now. the unsolved memory > leak is because of in gluterfs version 3.12.15 (in my env)the ssl context > is a shared one, while we do ssl_acept, ssl will allocate some read/write > buffer to ssl object, however, ssl_free in socket_reset or fini function of > socket.c, the buffer is returened back to ssl context free list instead of > completely freed. > > > > So following patch is able to fix the memory leak issue > completely.(created for gluster master branch) > > > > --- a/rpc/rpc-transport/socket/src/socket.c > +++ b/rpc/rpc-transport/socket/src/socket.c > @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this) > gf_log(this->name, GF_LOG_DEBUG, > "SSL verification succeeded (client: %s) (server: %s)", > this->peerinfo.identifier, this->myinfo.identifier); > + X509_free(peer); > return gf_strdup(peer_CN); > > /* Error paths. 
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > -- Milind -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From rgowdapp at redhat.com Wed May 8 11:35:19 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 8 May 2019 17:05:19 +0530
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
 <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
 <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
Message-ID: 

On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:

> Hi Milind Changire,
>
> The leak is getting more and more clear to me now. The unsolved memory
> leak is because, in glusterfs version 3.12.15 (in my env), the SSL context
> is a shared one: when we do SSL_accept, SSL allocates read/write buffers
> for the SSL object, but on SSL_free in the socket_reset or fini function
> of socket.c the buffers are returned to the SSL context's free list
> instead of being completely freed.
>

Thanks Cynthia for your efforts in identifying and fixing the leak. If you
post a patch to gerrit, I'll be happy to merge it and get the fix into the
codebase.


> So the following patch is able to fix the memory leak issue completely
> (created for the gluster master branch):
>
> --- a/rpc/rpc-transport/socket/src/socket.c
> +++ b/rpc/rpc-transport/socket/src/socket.c
> @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
>      gf_log(this->name, GF_LOG_DEBUG,
>             "SSL verification succeeded (client: %s) (server: %s)",
>             this->peerinfo.identifier, this->myinfo.identifier);
> +    X509_free(peer);
>      return gf_strdup(peer_CN);
>
>  /* Error paths.
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
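A way to cross-check where the ssl3_setup_buffers allocations end up, independent of valgrind and libleak, is to install counting wrappers for OpenSSL's allocator. A debugging sketch, assuming the OpenSSL 1.0.x callback signatures (1.1.0 adds file/line parameters); it must run before OpenSSL's first allocation, and all names here are illustrative. Her 3.12.15 patch is quoted next.

#include <stdio.h>
#include <stdlib.h>
#include <openssl/crypto.h>

static long ssl_live_allocs; /* debug only, not thread-safe */

static void *counting_malloc(size_t n) { ssl_live_allocs++; return malloc(n); }
static void *counting_realloc(void *p, size_t n) { return realloc(p, n); }
static void counting_free(void *p) { if (p) ssl_live_allocs--; free(p); }

/* Call first thing in main(); returns 0 if OpenSSL has already
 * allocated memory and the functions can no longer be replaced. */
int install_openssl_counters(void)
{
    return CRYPTO_set_mem_functions(counting_malloc, counting_realloc,
                                    counting_free);
}

Sampling ssl_live_allocs across the 'gluster v heal info' loop would show whether the count plateaus (buffers cached on the context free list, matching the ~300m ceiling reported above) or keeps growing (a true leak).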
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. 
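In code form, the ownership choice Milind describes is just the close flag passed when the BIO is created. A sketch, with sock and ssl standing in for the transport's descriptor and SSL object:

/* With BIO_NOCLOSE, freeing the BIO leaves the descriptor alone; with
 * BIO_CLOSE, SSL_free() would shut down and close it as a side effect.
 * Gluster keeps control of socket shutdown by choosing BIO_NOCLOSE. */
BIO *bio = BIO_new_socket(sock, BIO_NOCLOSE);
SSL_set_bio(ssl, bio, bio); /* ssl takes ownership of the BIO */
/* ... handshake and I/O ... */
SSL_free(ssl);              /* frees the BIO but leaves sock open */
close(sock);                /* still the transport's job, by design */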
> > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 14:08:15 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 19:38:15 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: builder204 needs to be fixed, too many failures, mostly none of the patches are passing regression. On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: > >> Deepshikha, >> >> I see the failure here[1] which ran on builder206. So, we are good. >> > > Not really, > https://build.gluster.org/job/centos7-regression/5909/consoleFull failed > on builder204 for similar reasons I believe? > > I am bit more worried on this issue being resurfacing more often these > days. What can we do to fix this permanently? > > >> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >> >> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >> dkhandel at redhat.com> wrote: >> >>> Sanju, can you please give us more info about the failures. >>> >>> I see the failures occurring on just one of the builder (builder206). >>> I'm taking it back offline for now. >>> >>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>> wrote: >>> >>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>> > Looks like is_nfs_export_available started failing again in recent >>>> > centos-regressions. >>>> > >>>> > Michael, can you please check? >>>> >>>> I will try but I am leaving for vacation tonight, so if I find nothing, >>>> until I leave, I guess Deepshika will have to look. >>>> >>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>>> > >>>> > > >>>> > > >>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>> > > mscherer at redhat.com> >>>> > > wrote: >>>> > > >>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>> > > > > Is this back again? The recent patches are failing regression >>>> > > > > :-\ . >>>> > > > >>>> > > > So, on builder206, it took me a while to find that the issue is >>>> > > > that >>>> > > > nfs (the service) was running. 
>>>> > > >
>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs
>>>> > > > initialisation failed with a rather cryptic message:
>>>> > > >
>>>> > > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: process started listening on port (38465)
>>>> > > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: binding to failed: Address already in use
>>>> > > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: Port is already in use
>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14
>>>> > > >
>>>> > > > I found where this came from, but a few things surprised me:
>>>> > > >
>>>> > > > - the order of the prints is different from the order in the code
>>>> > >
>>>> > > Indeed strange...
>>>> > >
>>>> > > > - the message on "started listening" didn't take into account the
>>>> > > > fact that bind failed on:
>>>> > >
>>>> > > Shouldn't it bail out if it failed to bind?
>>>> > > Some missing 'goto out' around line 975/976?
>>>> > > Y.
>>>> > >
>>>> > > >
>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>>> > > >
>>>> > > > The message about port 38465 also threw me off the track. The real
>>>> > > > issue is that the service nfs was already running, and I couldn't
>>>> > > > find anything listening on port 38465.
>>>> > > >
>>>> > > > Once I did 'service nfs stop', it no longer failed.
>>>> > > >
>>>> > > > So far, I do not know why nfs.service was activated.
>>>> > > >
>>>> > > > But at least 206 should be fixed, and we know a bit more about what
>>>> > > > could be causing some failures.
>>>> > > >
>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <
>>>> > > > > mscherer at redhat.com> wrote:
>>>> > > > >
>>>> > > > > > Le mercredi 03 avril 2019 à 16:30 +0530, Atin Mukherjee a
>>>> > > > > > écrit :
>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
>>>> > > > > > > jthottan at redhat.com> wrote:
>>>> > > > > > >
>>>> > > > > > > > Hi,
>>>> > > > > > > >
>>>> > > > > > > > is_nfs_export_available is just a wrapper around the
>>>> > > > > > > > "showmount" command AFAIR.
>>>> > > > > > > > I saw the following messages in the console output.
>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for
>>>> > > > > > > > remote locking.
>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks
>>>> > > > > > > > local, or start statd.
>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>>>> > > > > > > >
>>>> > > > > > > > To me it looks like rpcbind may not be running on the
>>>> > > > > > > > machine. Usually rpcbind starts automatically on machines;
>>>> > > > > > > > I don't know whether this can fail to happen or not.
>>>> > > > > > > >
>>>> > > > > > >
>>>> > > > > > > That's precisely what the question is: why are we suddenly
>>>> > > > > > > seeing this happen so frequently? Today I saw at least 4 to
>>>> > > > > > > 5 such failures already.
>>>> > > > > > >
>>>> > > > > > > Deepshika - Can you please help in inspecting this?
>>>> > > > > >
>>>> > > > > > So we think (we are not sure) that the issue is a bit complex.
>>>> > > > > >
>>>> > > > > > What we were investigating was the nightly runs failing on aws.
>>>> > > > > > When the build crashes, the builder is restarted, since that's
>>>> > > > > > the easiest way to clean everything (even with a perfect test
>>>> > > > > > suite that cleaned up after itself, we could always end up in a
>>>> > > > > > corrupt state on the system, WRT mounts, fs, etc).
>>>> > > > > >
>>>> > > > > > In turn, this seems to cause trouble on aws, since cloud-init or
>>>> > > > > > something renames the eth0 interface to ens5 without cleaning up
>>>> > > > > > the network configuration.
>>>> > > > > >
>>>> > > > > > So the network init script fails (because the image says "start
>>>> > > > > > eth0" and that's not present), but fails in a weird way. The
>>>> > > > > > network is initialised and working (we can connect), but the
>>>> > > > > > dhclient process is not in the right cgroup, and network.service
>>>> > > > > > is in a failed state. Restarting the network didn't work. In
>>>> > > > > > turn, this means that rpc-statd refuses to start (due to systemd
>>>> > > > > > dependencies), which seems to impact various NFS tests.
>>>> > > > > >
>>>> > > > > > We have also seen that on some builders, rpcbind picks up some
>>>> > > > > > IPv6 autoconfiguration, but we can't reproduce that, and there
>>>> > > > > > is no IPv6 set up anywhere. I suspect the network.service
>>>> > > > > > failure is somehow involved, but fail to see how. In turn,
>>>> > > > > > rpcbind.socket not starting could cause NFS test troubles.
>>>> > > > > >
>>>> > > > > > Our current stop-gap fix was to fix all the builders one by one:
>>>> > > > > > remove the config, kill the rogue dhclient, restart the network
>>>> > > > > > service.
>>>> > > > > >
>>>> > > > > > However, we can't be sure this is going to fix the problem long
>>>> > > > > > term, since this only manifests after a crash of the test suite,
>>>> > > > > > and it doesn't happen so often. (Plus, it was working before
>>>> > > > > > some day in the past, when something did make this fail, and I
>>>> > > > > > do not know if that was a system upgrade, or a test change, or
>>>> > > > > > both.)
>>>> > > > > >
>>>> > > > > > So we are still looking at it to get a complete understanding of
>>>> > > > > > the issue, but so far, we hacked our way to make it work (or so
>>>> > > > > > I think).
>>>> > > > > >
>>>> > > > > > Deepshika is working to fix it long term, by fixing the issue
>>>> > > > > > regarding eth0/ens5 with a new base image.
>>>> > > > > > --
>>>> > > > > > Michael Scherer
>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS
>>>> > > > >
>>>> > > > > --
>>>> > > > > - Atin (atinm)
>>>> > > >
>>>> > > > --
>>>> > > > Michael Scherer
>>>> > > > Sysadmin, Community Infrastructure
>>>> > > >
>>>> > > > _______________________________________________
>>>> > > > Gluster-devel mailing list
>>>> > > > Gluster-devel at gluster.org
>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>> > >
>>>> > > _______________________________________________
>>>> > > Gluster-devel mailing list
>>>> > > Gluster-devel at gluster.org
>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> --
>>>> Michael Scherer
>>>> Sysadmin, Community Infrastructure
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> Thanks,
>> Sanju
>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cynthia.zhou at nokia-sbell.com Thu May 9 03:04:00 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Thu, 9 May 2019 03:04:00 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
 <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
 <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
Message-ID: <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com>

Hi,
Ok, it is posted to https://review.gluster.org/#/c/glusterfs/+/22687/

From: Raghavendra Gowdappa
Sent: Wednesday, May 08, 2019 7:35 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Amar Tumballi Suryanarayan ; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote:
Hi Milind Changire,
The leak is getting more and more clear to me now. The unsolved memory leak is because, in glusterfs version 3.12.15 (in my env), the SSL context is a shared one: when we do SSL_accept, SSL allocates read/write buffers for the SSL object, but on SSL_free in the socket_reset or fini function of socket.c the buffers are returned to the SSL context's free list instead of being completely freed.
Thanks Cynthia for your efforts in identifying and fixing the leak. If you post a patch to gerrit, I'll be happy to merge it and get the fix into the codebase.
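Her description matches OpenSSL's buffer caching: SSL_free() parks a connection's read/write buffers on the owning SSL_CTX's free list, so a single long-lived shared context keeps them resident for the life of the process. A sketch of the two patterns (the actual patch follows below); the refcounting note reflects OpenSSL 1.0.x behaviour as documented, not verified against this codebase:

/* Shared-context pattern: buffers released by SSL_free() are cached
 * on shared_ctx and never handed back to the allocator. */
SSL *ssl = SSL_new(shared_ctx);
/* ... SSL_accept(), I/O ... */
SSL_free(ssl); /* buffers parked on shared_ctx's free list */

/* Per-connection context, as in the patch below: once the last
 * reference on the ctx is dropped, the cached buffers go with it.
 * SSL_new() takes its own reference on the ctx, so calling
 * SSL_CTX_free() before SSL_free() is safe. */
SSL_CTX_free(priv->ssl_ctx);
priv->ssl_ctx = NULL;
SSL_free(priv->ssl_ssl);
priv->ssl_ssl = NULL;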
So the following patch is able to fix the memory leak issue completely (created for the gluster master branch):

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
     gf_log(this->name, GF_LOG_DEBUG,
            "SSL verification succeeded (client: %s) (server: %s)",
            this->peerinfo.identifier, this->myinfo.identifier);
+    X509_free(peer);
     return gf_strdup(peer_CN);

 /* Error paths. */
@@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this)
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_TRACE,
+               "clear and reset for socket(%d), free ssl ", priv->sock);
+        if (priv->ssl_ctx) {
+            SSL_CTX_free(priv->ssl_ctx);
+            priv->ssl_ctx = NULL;
+        }
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this)
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_TRACE,
+               "clear and reset for socket(%d), free ssl ", priv->sock);
+        if (priv->ssl_ctx) {
+            SSL_CTX_free(priv->ssl_ctx);
+            priv->ssl_ctx = NULL;
+        }
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 2:12 PM
To: 'Amar Tumballi Suryanarayan'
Cc: 'Milind Changire' ; 'gluster-devel at gluster.org'
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,
From our test, valgrind and libleak both blame ssl3_accept:

///////////////////////////from valgrind attached to glusterfsd///////////////////////////////////////////

==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==
==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788:
socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== valgrind --leak-check=f ////////////////////////////////////with libleak attached to glusterfsd///////////////////////////////////////// callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749] /lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced] /lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627] /lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd] /lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9] /lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d] /lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972] /lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5] /lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51] /lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b] /lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] one interesting thing is that the memory goes up to about 300m then it stopped increasing !!! I am wondering if this is caused by open-ssl library? But when I search from openssl community, there is no such issue reported before. Is glusterfs using ssl_accept correctly? cynthia From: Zhou, Cynthia (NSB - CN/Hangzhou) Sent: Monday, May 06, 2019 10:34 AM To: 'Amar Tumballi Suryanarayan' > Cc: Milind Changire >; gluster-devel at gluster.org Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi, Sorry, I am so busy with other issues these days, could you help me to submit my patch for review? It is based on glusterfs3.12.15 code. But even with this patch , memory leak still exists, from memory leak tool it should be related with ssl_accept, not sure if it is because of openssl library or because improper use of ssl interfaces. 
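Since the retained memory is per-connection read/write buffers, one more knob may be worth testing alongside the fix: SSL_MODE_RELEASE_BUFFERS, available since OpenSSL 1.0.0, asks the library to free a connection's buffers whenever they are idle instead of keeping them attached. Whether it also shrinks the context free list in 1.0.2 would need measuring; the quoted glusterfs code does not set it. (Her 3.12.15 patch is quoted next.)

/* Sketch: request eager release of idle read/write buffers for every
 * connection created from this context. */
SSL_CTX_set_mode(priv->ssl_ctx, SSL_MODE_RELEASE_BUFFERS);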
--- a/rpc/rpc-transport/socket/src/socket.c +++ b/rpc/rpc-transport/socket/src/socket.c @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { memset(&priv->incoming, 0, sizeof(priv->incoming)); event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); - + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } priv->sock = -1; priv->idx = -1; priv->connected = -1; @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { pthread_mutex_destroy(&priv->out_lock); pthread_mutex_destroy(&priv->cond_lock); pthread_cond_destroy(&priv->cond); + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } if (priv->ssl_private_key) { GF_FREE(priv->ssl_private_key); } From: Amar Tumballi Suryanarayan > Sent: Wednesday, May 01, 2019 8:43 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Milind Changire >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi Cynthia Zhou, Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master. -Amar On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote: Ok, I am clear now. I?ve added ssl_free in socket reset and socket finish function, though glusterfsd memory leak is not that much, still it is leaking, from source code I can not find anything else, Could you help to check if this issue exists in your env? If not I may have a try to merge your patch . Step 1> while true;do gluster v heal info, 2> check the vol-name glusterfsd memory usage, it is obviously increasing. cynthia From: Milind Changire > Sent: Monday, April 22, 2019 2:36 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Atin Mukherjee >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl According to BIO_new_socket() man page ... If the close flag is set then the socket is shut down and closed when the BIO is freed. For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO is freed. _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rgowdapp at redhat.com Thu May 9 04:12:49 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Thu, 9 May 2019 09:42:49 +0530 Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl In-Reply-To: <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com> References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com> <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com> Message-ID: Thanks!! On Thu, May 9, 2019 at 8:34 AM Zhou, Cynthia (NSB - CN/Hangzhou) < cynthia.zhou at nokia-sbell.com> wrote: > Hi, > > Ok, It is posted to https://review.gluster.org/#/c/glusterfs/+/22687/ > > > > > > > > *From:* Raghavendra Gowdappa > *Sent:* Wednesday, May 08, 2019 7:35 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Amar Tumballi Suryanarayan ; > gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > > > > > On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Hi 'Milind Changire' , > > The leak is getting more and more clear to me now. the unsolved memory > leak is because of in gluterfs version 3.12.15 (in my env)the ssl context > is a shared one, while we do ssl_acept, ssl will allocate some read/write > buffer to ssl object, however, ssl_free in socket_reset or fini function of > socket.c, the buffer is returened back to ssl context free list instead of > completely freed. > > > > Thanks Cynthia for your efforts in identifying and fixing the leak. If you > post a patch to gerrit, I'll be happy to merge it and get the fix into the > codebase. > > > > > > So following patch is able to fix the memory leak issue > completely.(created for gluster master branch) > > > > --- a/rpc/rpc-transport/socket/src/socket.c > +++ b/rpc/rpc-transport/socket/src/socket.c > @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this) > gf_log(this->name, GF_LOG_DEBUG, > "SSL verification succeeded (client: %s) (server: %s)", > this->peerinfo.identifier, this->myinfo.identifier); > + X509_free(peer); > return gf_strdup(peer_CN); > > /* Error paths. 
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. 
> > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Thu May 9 04:31:47 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 9 May 2019 10:01:47 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote: > builder204 needs to be fixed, too many failures, mostly none of the > patches are passing regression. > And with that builder201 joins the pool, https://build.gluster.org/job/centos7-regression/5943/consoleFull > On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee wrote: > >> >> >> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: >> >>> Deepshikha, >>> >>> I see the failure here[1] which ran on builder206. So, we are good. >>> >> >> Not really, >> https://build.gluster.org/job/centos7-regression/5909/consoleFull failed >> on builder204 for similar reasons I believe? >> >> I am bit more worried on this issue being resurfacing more often these >> days. What can we do to fix this permanently? >> >> >>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >>> >>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >>> dkhandel at redhat.com> wrote: >>> >>>> Sanju, can you please give us more info about the failures. >>>> >>>> I see the failures occurring on just one of the builder (builder206). >>>> I'm taking it back offline for now. >>>> >>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>>> wrote: >>>> >>>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>>> > Looks like is_nfs_export_available started failing again in recent >>>>> > centos-regressions. >>>>> > >>>>> > Michael, can you please check? >>>>> >>>>> I will try but I am leaving for vacation tonight, so if I find nothing, >>>>> until I leave, I guess Deepshika will have to look. >>>>> >>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>>>> > >>>>> > > >>>>> > > >>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>>> > > mscherer at redhat.com> >>>>> > > wrote: >>>>> > > >>>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>>> > > > > Is this back again? The recent patches are failing regression >>>>> > > > > :-\ . >>>>> > > > >>>>> > > > So, on builder206, it took me a while to find that the issue is >>>>> > > > that >>>>> > > > nfs (the service) was running. 
>>>>> > > > >>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>>>> > > > initialisation >>>>> > > > failed with a rather cryptic message: >>>>> > > > >>>>> > > > [2019-04-23 13:17:05.371733] I >>>>> > > > [socket.c:991:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: process started listening on port (38465) >>>>> > > > [2019-04-23 13:17:05.385819] E >>>>> > > > [socket.c:972:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: binding to failed: Address already in use >>>>> > > > [2019-04-23 13:17:05.385843] E >>>>> > > > [socket.c:974:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: Port is already in use >>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>>>> > > > >>>>> > > > I found where this came from, but a few stuff did surprised me: >>>>> > > > >>>>> > > > - the order of print is different that the order in the code >>>>> > > > >>>>> > > >>>>> > > Indeed strange... >>>>> > > >>>>> > > > - the message on "started listening" didn't take in account the >>>>> > > > fact >>>>> > > > that bind failed on: >>>>> > > > >>>>> > > >>>>> > > Shouldn't it bail out if it failed to bind? >>>>> > > Some missing 'goto out' around line 975/976? >>>>> > > Y. >>>>> > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> >>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>>>> > > > >>>>> > > > The message about port 38465 also threw me off the track. The >>>>> > > > real >>>>> > > > issue is that the service nfs was already running, and I couldn't >>>>> > > > find >>>>> > > > anything listening on port 38465 >>>>> > > > >>>>> > > > once I do service nfs stop, it no longer failed. >>>>> > > > >>>>> > > > So far, I do know why nfs.service was activated. >>>>> > > > >>>>> > > > But at least, 206 should be fixed, and we know a bit more on what >>>>> > > > would >>>>> > > > be causing some failure. >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>>>> > > > > mscherer at redhat.com> >>>>> > > > > wrote: >>>>> > > > > >>>>> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >>>>> > > > > > ?crit : >>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>>>> > > > > > > jthottan at redhat.com> >>>>> > > > > > > wrote: >>>>> > > > > > > >>>>> > > > > > > > Hi, >>>>> > > > > > > > >>>>> > > > > > > > is_nfs_export_available is just a wrapper around >>>>> > > > > > > > "showmount" >>>>> > > > > > > > command AFAIR. >>>>> > > > > > > > I saw following messages in console output. >>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>>>> > > > > > > > remote >>>>> > > > > > > > locking. >>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>>>> > > > > > > > local, >>>>> > > > > > > > or >>>>> > > > > > > > start >>>>> > > > > > > > statd. >>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>>>> > > > > > > > specified >>>>> > > > > > > > >>>>> > > > > > > > For me it looks rpcbind may not be running on the >>>>> > > > > > > > machine. >>>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>>>> > > > > > > > know >>>>> > > > > > > > whether it >>>>> > > > > > > > can happen or not. >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > > That's precisely what the question is. 
Why suddenly we're >>>>> > > > > > > seeing >>>>> > > > > > > this >>>>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>>>> > > > > > > failures >>>>> > > > > > > already. >>>>> > > > > > > >>>>> > > > > > > Deepshika - Can you please help in inspecting this? >>>>> > > > > > >>>>> > > > > > So we think (we are not sure) that the issue is a bit >>>>> > > > > > complex. >>>>> > > > > > >>>>> > > > > > What we were investigating was nightly run fail on aws. When >>>>> > > > > > the >>>>> > > > > > build >>>>> > > > > > crash, the builder is restarted, since that's the easiest way >>>>> > > > > > to >>>>> > > > > > clean >>>>> > > > > > everything (since even with a perfect test suite that would >>>>> > > > > > clean >>>>> > > > > > itself, we could always end in a corrupt state on the system, >>>>> > > > > > WRT >>>>> > > > > > mount, fs, etc). >>>>> > > > > > >>>>> > > > > > In turn, this seems to cause trouble on aws, since >>>>> cloud-init >>>>> > > > > > or >>>>> > > > > > something rename eth0 interface to ens5, without cleaning to >>>>> > > > > > the >>>>> > > > > > network configuration. >>>>> > > > > > >>>>> > > > > > So the network init script fail (because the image say "start >>>>> > > > > > eth0" >>>>> > > > > > and >>>>> > > > > > that's not present), but fail in a weird way. Network is >>>>> > > > > > initialised >>>>> > > > > > and working (we can connect), but the dhclient process is not >>>>> > > > > > in >>>>> > > > > > the >>>>> > > > > > right cgroup, and network.service is in failed state. >>>>> > > > > > Restarting >>>>> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >>>>> > > > > > to >>>>> > > > > > start >>>>> > > > > > (due to systemd dependencies), which seems to impact various >>>>> > > > > > NFS >>>>> > > > > > tests. >>>>> > > > > > >>>>> > > > > > We have also seen that on some builders, rpcbind pick some IP >>>>> > > > > > v6 >>>>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>>>> > > > > > no ip >>>>> > > > > > v6 >>>>> > > > > > set up anywhere. I suspect the network.service failure is >>>>> > > > > > somehow >>>>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>>>> > > > > > starting >>>>> > > > > > could cause NFS test troubles. >>>>> > > > > > >>>>> > > > > > Our current stop gap fix was to fix all the builders one by >>>>> > > > > > one. >>>>> > > > > > Remove >>>>> > > > > > the config, kill the rogue dhclient, restart network service. >>>>> > > > > > >>>>> > > > > > However, we can't be sure this is going to fix the problem >>>>> > > > > > long >>>>> > > > > > term >>>>> > > > > > since this only manifest after a crash of the test suite, and >>>>> > > > > > it >>>>> > > > > > doesn't happen so often. (plus, it was working before some >>>>> > > > > > day in >>>>> > > > > > the >>>>> > > > > > past, when something did make this fail, and I do not know if >>>>> > > > > > that's a >>>>> > > > > > system upgrade, or a test change, or both). >>>>> > > > > > >>>>> > > > > > So we are still looking at it to have a complete >>>>> > > > > > understanding of >>>>> > > > > > the >>>>> > > > > > issue, but so far, we hacked our way to make it work (or so >>>>> > > > > > do I >>>>> > > > > > think). >>>>> > > > > > >>>>> > > > > > Deepshika is working to fix it long term, by fixing the issue >>>>> > > > > > regarding >>>>> > > > > > eth0/ens5 with a new base image. 
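For reference, a hedged sketch of the per-builder stop-gap described
above, as it would look on an EL7 builder; the exact path of the stale
interface config is an assumption:

    # drop the stale eth0 config left behind by the eth0 -> ens5 rename
    rm -f /etc/sysconfig/network-scripts/ifcfg-eth0   # assumed location
    # kill the rogue dhclient that landed outside the expected cgroup
    pkill dhclient
    # restart networking so network.service leaves its failed state
    systemctl restart network
    # rpc-statd (and with it the NFS tests) should now be startable again
    systemctl start rpcbind.socket rpc-statd
    systemctl status network rpcbind.socket rpc-statd
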
>>>>> > > > > > -- >>>>> > > > > > Michael Scherer >>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>>>> > > > > > >>>>> > > > > > >>>>> > > > > > -- >>>>> > > > > >>>>> > > > > - Atin (atinm) >>>>> > > > >>>>> > > > -- >>>>> > > > Michael Scherer >>>>> > > > Sysadmin, Community Infrastructure >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > _______________________________________________ >>>>> > > > Gluster-devel mailing list >>>>> > > > Gluster-devel at gluster.org >>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> > > >>>>> > > _______________________________________________ >>>>> > > Gluster-devel mailing list >>>>> > > Gluster-devel at gluster.org >>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> > >>>>> > >>>>> > >>>>> -- >>>>> Michael Scherer >>>>> Sysadmin, Community Infrastructure >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> >>> >>> -- >>> Thanks, >>> Sanju >>> _______________________________________________ >>> >>> Community Meeting Calendar: >>> >>> APAC Schedule - >>> Every 2nd and 4th Tuesday at 11:30 AM IST >>> Bridge: https://bluejeans.com/836554017 >>> >>> NA/EMEA Schedule - >>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>> Bridge: https://bluejeans.com/486278655 >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Thu May 9 05:56:22 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Thu, 9 May 2019 11:26:22 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: I took a quick look at the builders and noticed both have the same error of 'Cannot allocate memory' which comes up every time when the builder is rebooted after a build abort. It is happening in the same pattern. Though there's no such memory consumption on the builders. I?m investigating more on this. On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote: > >> builder204 needs to be fixed, too many failures, mostly none of the >> patches are passing regression. >> > > And with that builder201 joins the pool, > https://build.gluster.org/job/centos7-regression/5943/consoleFull > > >> On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee >> wrote: >> >>> >>> >>> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde >>> wrote: >>> >>>> Deepshikha, >>>> >>>> I see the failure here[1] which ran on builder206. So, we are good. >>>> >>> >>> Not really, >>> https://build.gluster.org/job/centos7-regression/5909/consoleFull >>> failed on builder204 for similar reasons I believe? >>> >>> I am bit more worried on this issue being resurfacing more often these >>> days. What can we do to fix this permanently? >>> >>> >>>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >>>> >>>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >>>> dkhandel at redhat.com> wrote: >>>> >>>>> Sanju, can you please give us more info about the failures. >>>>> >>>>> I see the failures occurring on just one of the builder (builder206). 
>>>>> I'm taking it back offline for now. >>>>> >>>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>>>> wrote: >>>>> >>>>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>>>> > Looks like is_nfs_export_available started failing again in recent >>>>>> > centos-regressions. >>>>>> > >>>>>> > Michael, can you please check? >>>>>> >>>>>> I will try but I am leaving for vacation tonight, so if I find >>>>>> nothing, >>>>>> until I leave, I guess Deepshika will have to look. >>>>>> >>>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul >>>>>> wrote: >>>>>> > >>>>>> > > >>>>>> > > >>>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>>>> > > mscherer at redhat.com> >>>>>> > > wrote: >>>>>> > > >>>>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>>>> > > > > Is this back again? The recent patches are failing regression >>>>>> > > > > :-\ . >>>>>> > > > >>>>>> > > > So, on builder206, it took me a while to find that the issue is >>>>>> > > > that >>>>>> > > > nfs (the service) was running. >>>>>> > > > >>>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>>>>> > > > initialisation >>>>>> > > > failed with a rather cryptic message: >>>>>> > > > >>>>>> > > > [2019-04-23 13:17:05.371733] I >>>>>> > > > [socket.c:991:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: process started listening on port (38465) >>>>>> > > > [2019-04-23 13:17:05.385819] E >>>>>> > > > [socket.c:972:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: binding to failed: Address already in use >>>>>> > > > [2019-04-23 13:17:05.385843] E >>>>>> > > > [socket.c:974:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: Port is already in use >>>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>>>>> > > > >>>>>> > > > I found where this came from, but a few stuff did surprised me: >>>>>> > > > >>>>>> > > > - the order of print is different that the order in the code >>>>>> > > > >>>>>> > > >>>>>> > > Indeed strange... >>>>>> > > >>>>>> > > > - the message on "started listening" didn't take in account the >>>>>> > > > fact >>>>>> > > > that bind failed on: >>>>>> > > > >>>>>> > > >>>>>> > > Shouldn't it bail out if it failed to bind? >>>>>> > > Some missing 'goto out' around line 975/976? >>>>>> > > Y. >>>>>> > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> >>>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>>>>> > > > >>>>>> > > > The message about port 38465 also threw me off the track. The >>>>>> > > > real >>>>>> > > > issue is that the service nfs was already running, and I >>>>>> couldn't >>>>>> > > > find >>>>>> > > > anything listening on port 38465 >>>>>> > > > >>>>>> > > > once I do service nfs stop, it no longer failed. >>>>>> > > > >>>>>> > > > So far, I do know why nfs.service was activated. >>>>>> > > > >>>>>> > > > But at least, 206 should be fixed, and we know a bit more on >>>>>> what >>>>>> > > > would >>>>>> > > > be causing some failure. >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>>>>> > > > > mscherer at redhat.com> >>>>>> > > > > wrote: >>>>>> > > > > >>>>>> > > > > > Le mercredi 03 avril 2019 ? 
16:30 +0530, Atin Mukherjee a >>>>>> > > > > > ?crit : >>>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>>>>> > > > > > > jthottan at redhat.com> >>>>>> > > > > > > wrote: >>>>>> > > > > > > >>>>>> > > > > > > > Hi, >>>>>> > > > > > > > >>>>>> > > > > > > > is_nfs_export_available is just a wrapper around >>>>>> > > > > > > > "showmount" >>>>>> > > > > > > > command AFAIR. >>>>>> > > > > > > > I saw following messages in console output. >>>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>>>>> > > > > > > > remote >>>>>> > > > > > > > locking. >>>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>>>>> > > > > > > > local, >>>>>> > > > > > > > or >>>>>> > > > > > > > start >>>>>> > > > > > > > statd. >>>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>>>>> > > > > > > > specified >>>>>> > > > > > > > >>>>>> > > > > > > > For me it looks rpcbind may not be running on the >>>>>> > > > > > > > machine. >>>>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>>>>> > > > > > > > know >>>>>> > > > > > > > whether it >>>>>> > > > > > > > can happen or not. >>>>>> > > > > > > > >>>>>> > > > > > > >>>>>> > > > > > > That's precisely what the question is. Why suddenly we're >>>>>> > > > > > > seeing >>>>>> > > > > > > this >>>>>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>>>>> > > > > > > failures >>>>>> > > > > > > already. >>>>>> > > > > > > >>>>>> > > > > > > Deepshika - Can you please help in inspecting this? >>>>>> > > > > > >>>>>> > > > > > So we think (we are not sure) that the issue is a bit >>>>>> > > > > > complex. >>>>>> > > > > > >>>>>> > > > > > What we were investigating was nightly run fail on aws. When >>>>>> > > > > > the >>>>>> > > > > > build >>>>>> > > > > > crash, the builder is restarted, since that's the easiest >>>>>> way >>>>>> > > > > > to >>>>>> > > > > > clean >>>>>> > > > > > everything (since even with a perfect test suite that would >>>>>> > > > > > clean >>>>>> > > > > > itself, we could always end in a corrupt state on the >>>>>> system, >>>>>> > > > > > WRT >>>>>> > > > > > mount, fs, etc). >>>>>> > > > > > >>>>>> > > > > > In turn, this seems to cause trouble on aws, since >>>>>> cloud-init >>>>>> > > > > > or >>>>>> > > > > > something rename eth0 interface to ens5, without cleaning to >>>>>> > > > > > the >>>>>> > > > > > network configuration. >>>>>> > > > > > >>>>>> > > > > > So the network init script fail (because the image say >>>>>> "start >>>>>> > > > > > eth0" >>>>>> > > > > > and >>>>>> > > > > > that's not present), but fail in a weird way. Network is >>>>>> > > > > > initialised >>>>>> > > > > > and working (we can connect), but the dhclient process is >>>>>> not >>>>>> > > > > > in >>>>>> > > > > > the >>>>>> > > > > > right cgroup, and network.service is in failed state. >>>>>> > > > > > Restarting >>>>>> > > > > > network didn't work. In turn, this mean that rpc-statd >>>>>> refuse >>>>>> > > > > > to >>>>>> > > > > > start >>>>>> > > > > > (due to systemd dependencies), which seems to impact various >>>>>> > > > > > NFS >>>>>> > > > > > tests. >>>>>> > > > > > >>>>>> > > > > > We have also seen that on some builders, rpcbind pick some >>>>>> IP >>>>>> > > > > > v6 >>>>>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>>>>> > > > > > no ip >>>>>> > > > > > v6 >>>>>> > > > > > set up anywhere. 
I suspect the network.service failure is >>>>>> > > > > > somehow >>>>>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>>>>> > > > > > starting >>>>>> > > > > > could cause NFS test troubles. >>>>>> > > > > > >>>>>> > > > > > Our current stop gap fix was to fix all the builders one by >>>>>> > > > > > one. >>>>>> > > > > > Remove >>>>>> > > > > > the config, kill the rogue dhclient, restart network >>>>>> service. >>>>>> > > > > > >>>>>> > > > > > However, we can't be sure this is going to fix the problem >>>>>> > > > > > long >>>>>> > > > > > term >>>>>> > > > > > since this only manifest after a crash of the test suite, >>>>>> and >>>>>> > > > > > it >>>>>> > > > > > doesn't happen so often. (plus, it was working before some >>>>>> > > > > > day in >>>>>> > > > > > the >>>>>> > > > > > past, when something did make this fail, and I do not know >>>>>> if >>>>>> > > > > > that's a >>>>>> > > > > > system upgrade, or a test change, or both). >>>>>> > > > > > >>>>>> > > > > > So we are still looking at it to have a complete >>>>>> > > > > > understanding of >>>>>> > > > > > the >>>>>> > > > > > issue, but so far, we hacked our way to make it work (or so >>>>>> > > > > > do I >>>>>> > > > > > think). >>>>>> > > > > > >>>>>> > > > > > Deepshika is working to fix it long term, by fixing the >>>>>> issue >>>>>> > > > > > regarding >>>>>> > > > > > eth0/ens5 with a new base image. >>>>>> > > > > > -- >>>>>> > > > > > Michael Scherer >>>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>>>>> > > > > > >>>>>> > > > > > >>>>>> > > > > > -- >>>>>> > > > > >>>>>> > > > > - Atin (atinm) >>>>>> > > > >>>>>> > > > -- >>>>>> > > > Michael Scherer >>>>>> > > > Sysadmin, Community Infrastructure >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > _______________________________________________ >>>>>> > > > Gluster-devel mailing list >>>>>> > > > Gluster-devel at gluster.org >>>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>>> > > >>>>>> > > _______________________________________________ >>>>>> > > Gluster-devel mailing list >>>>>> > > Gluster-devel at gluster.org >>>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>>> > >>>>>> > >>>>>> > >>>>>> -- >>>>>> Michael Scherer >>>>>> Sysadmin, Community Infrastructure >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Gluster-devel mailing list >>>>>> Gluster-devel at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> >>>>> >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> _______________________________________________ >>>> >>>> Community Meeting Calendar: >>>> >>>> APAC Schedule - >>>> Every 2nd and 4th Tuesday at 11:30 AM IST >>>> Bridge: https://bluejeans.com/836554017 >>>> >>>> NA/EMEA Schedule - >>>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>>> Bridge: https://bluejeans.com/486278655 >>>> >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Thu May 9 11:15:48 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 9 May 2019 16:45:48 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th Message-ID: Hi, Expected tagging date for release-6.2 is on May, 15th, 2019. 
Please ensure the required patches are backported, are passing
regressions, and are appropriately reviewed, for easy merging and tagging
on that date.

--
Regards,
Hari Gowtham.

From spisla80 at gmail.com  Thu May  9 14:12:03 2019
From: spisla80 at gmail.com (David Spisla)
Date: Thu, 9 May 2019 16:12:03 +0200
Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster
	(together with Samba Core Developer)
Message-ID:

Dear Gluster Community,
at the moment we are improving the stability of SMB/CTDB and Gluster. For
this purpose we are working together with an advanced Samba core developer.
He did some debugging but needs more information about Gluster core
behaviour.

*Would any of the Gluster developers want to have an online conference with
him and me?*

I would organize everything. In my opinion this is a good chance to improve
the stability of GlusterFS, and this is at the moment one of the major
issues in the community.

Regards
David Spisla
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jenkins at build.gluster.org  Mon May 13 01:45:02 2019
From: jenkins at build.gluster.org (jenkins at build.gluster.org)
Date: Mon, 13 May 2019 01:45:02 +0000 (UTC)
Subject: [Gluster-devel] Weekly Untriaged Bugs
Message-ID: <997847813.11.1557711903412.JavaMail.jenkins at jenkins-el7.rht.gluster.org>

[...truncated 6 lines...]
https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed immediately even with O_SYNC in glfs_create or with glfs_fsync/glfs_fdatasync after glfs_write.
https://bugzilla.redhat.com/1707866 / core: Thousands of duplicate files in glusterfs mountpoint directory listing
https://bugzilla.redhat.com/1708505 / disperse: [EC] /tests/basic/ec/ec-data-heal.t is failing as heal is not happening properly
https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes.
https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE
https://bugzilla.redhat.com/1706716 / glusterd: glusterd generated core while running ./tests/bugs/cli/bug-1077682.t
https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd
https://bugzilla.redhat.com/1706842 / gluster-smb: Hard Failover with Samba and Glusterfs fails
https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running
https://bugzilla.redhat.com/1707671 / project-infrastructure: Cronjob of feeding gluster blogs from different account into planet gluster isn't working
https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job
https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level
https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work
https://bugzilla.redhat.com/1708257 / project-infrastructure: Grant additional maintainers merge rights on release branches
https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229
[...truncated 2 lines...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log Type: application/octet-stream Size: 2157 bytes Desc: not available URL: From pgurusid at redhat.com Mon May 13 05:22:06 2019 From: pgurusid at redhat.com (Poornima Gurusiddaiah) Date: Mon, 13 May 2019 10:52:06 +0530 Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) In-Reply-To: References: Message-ID: Hi, We would be definitely interested in this. Thank you for contacting us. For the starter we can have an online conference. Please suggest few possible date and times for the week(preferably between IST 7.00AM - 9.PM)? Adding Anoop and Gunther who are also the main contributors to the Gluster-Samba integration. Thanks, Poornima On Thu, May 9, 2019 at 7:43 PM David Spisla wrote: > Dear Gluster Community, > at the moment we are improving the stability of SMB/CTDB and Gluster. For > this purpose we are working together with an advanced SAMBA Core Developer. > He did some debugging but needs more information about Gluster Core > Behaviour. > > *Would any of the Gluster Developer wants to have a online conference with > him and me?* > > I would organize everything. In my opinion this is a good chance to > improve stability of Glusterfs and this is at the moment one of the major > issues in the Community. > > Regards > David Spisla > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 07:19:25 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 12:49:25 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: What version of gluster are you using? Also, can you capture and share volume-profile output for a run where you manage to recreate this issue? https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command Let me know if you have any questions. -Krutika On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote: > Hi, > > there is no healing operation, not peer disconnects, no readonly > filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, > its SSD with 10G, performance is good. > > > you'd have it's log on qemu's standard output, > > If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking > for problem for more than month, tried everything. Can?t find anything. Any > more clues or leads? > > BR, > Martin > > > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: > >> Hi all, > > > > Hi > > > >> > >> I am running replica 3 on SSDs with 10G networking, everything works OK > but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked > for more than 120 seconds?. > >> Only solution is to poweroff (hard) VM and than boot it up again. I am > unable to SSH and also login with console, its stuck probably on some disk > operation. No error/warning logs or messages are store in VMs logs. 
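For reference, the volume-profile capture requested above amounts to
something like the following; VOLNAME and the output path are
placeholders:

    gluster volume profile VOLNAME start
    # ... reproduce the freeze, or let the workload run for a while ...
    gluster volume profile VOLNAME info > /tmp/profile-during-freeze.txt
    gluster volume profile VOLNAME stop
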
> >> > > > > As far as I know this should be unrelated, I get this during heals > > without any freezes, it just means the storage is slow I think. > > > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on > replica volume. Can someone advice how to debug this problem or what can > cause these issues? > >> It?s really annoying, I?ve tried to google everything but nothing came > up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but > its not related. > >> > > > > Any chance your gluster goes readonly ? Have you checked your gluster > > logs to see if maybe they lose each other some times ? > > /var/log/glusterfs > > > > For libgfapi accesses you'd have it's log on qemu's standard output, > > that might contain the actual error at the time of the freez. > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 07:21:19 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 12:51:19 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: Also, what's the caching policy that qemu is using on the affected vms? Is it cache=none? Or something else? You can get this information in the command line of qemu-kvm process corresponding to your vm in the ps output. -Krutika On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote: > What version of gluster are you using? > Also, can you capture and share volume-profile output for a run where you > manage to recreate this issue? > > https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command > Let me know if you have any questions. > > -Krutika > > On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote: > >> Hi, >> >> there is no healing operation, not peer disconnects, no readonly >> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >> its SSD with 10G, performance is good. >> >> > you'd have it's log on qemu's standard output, >> >> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >> for problem for more than month, tried everything. Can?t find anything. Any >> more clues or leads? >> >> BR, >> Martin >> >> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >> > >> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >> >> Hi all, >> > >> > Hi >> > >> >> >> >> I am running replica 3 on SSDs with 10G networking, everything works >> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >> blocked for more than 120 seconds?. >> >> Only solution is to poweroff (hard) VM and than boot it up again. I am >> unable to SSH and also login with console, its stuck probably on some disk >> operation. No error/warning logs or messages are store in VMs logs. >> >> >> > >> > As far as I know this should be unrelated, I get this during heals >> > without any freezes, it just means the storage is slow I think. >> > >> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >> replica volume. 
Can someone advice how to debug this problem or what can >> cause these issues? >> >> It?s really annoying, I?ve tried to google everything but nothing came >> up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but >> its not related. >> >> >> > >> > Any chance your gluster goes readonly ? Have you checked your gluster >> > logs to see if maybe they lose each other some times ? >> > /var/log/glusterfs >> > >> > For libgfapi accesses you'd have it's log on qemu's standard output, >> > that might contain the actual error at the time of the freez. >> > _______________________________________________ >> > Gluster-users mailing list >> > Gluster-users at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 08:20:14 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 13:50:14 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: OK. In that case, can you check if the following two changes help: # gluster volume set $VOL network.remote-dio off # gluster volume set $VOL performance.strict-o-direct on preferably one option changed at a time, its impact tested and then the next change applied and tested. Also, gluster version please? -Krutika On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote: > Cache in qemu is none. That should be correct. 
This is full command : > > /usr/bin/qemu-system-x86_64 -name one-312 -S -machine > pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp > 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 > -no-user-config -nodefaults -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait > -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime > -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device > piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 > > -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 > -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 > -drive file=/var/lib/one//datastores/116/312/*disk.0* > ,format=raw,if=none,id=drive-virtio-disk1,cache=none > -device > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 > -drive file=gluster://localhost:24007/imagestore/ > *7b64d6757acc47a39503f68731f89b8e* > ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none > -device > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 > -drive file=/var/lib/one//datastores/116/312/*disk.1* > ,format=raw,if=none,id=drive-ide0-0-0,readonly=on > -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 > > -netdev tap,fd=26,id=hostnet0 > -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 > -chardev pty,id=charserial0 -device > isa-serial,chardev=charserial0,id=serial0 > -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait > -device > virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 > -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on > > I?ve highlighted disks. First is VM context disk - Fuse used, second is > SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. > > Krutika, > I will start profiling on Gluster Volumes and wait for next VM to fail. > Than I will attach/send profiling info after some VM will be failed. I > suppose this is correct profiling strategy. > About this, how many vms do you need to recreate it? A single vm? Or multiple vms doing IO in parallel? > Thanks, > BR! > Martin > > On 13 May 2019, at 09:21, Krutika Dhananjay wrote: > > Also, what's the caching policy that qemu is using on the affected vms? > Is it cache=none? Or something else? You can get this information in the > command line of qemu-kvm process corresponding to your vm in the ps output. > > -Krutika > > On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay > wrote: > >> What version of gluster are you using? >> Also, can you capture and share volume-profile output for a run where you >> manage to recreate this issue? >> >> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >> Let me know if you have any questions. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:34 PM Martin Toth >> wrote: >> >>> Hi, >>> >>> there is no healing operation, not peer disconnects, no readonly >>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>> its SSD with 10G, performance is good. >>> >>> > you'd have it's log on qemu's standard output, >>> >>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>> for problem for more than month, tried everything. Can?t find anything. 
Any >>> more clues or leads? >>> >>> BR, >>> Martin >>> >>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>> > >>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>> >> Hi all, >>> > >>> > Hi >>> > >>> >> >>> >> I am running replica 3 on SSDs with 10G networking, everything works >>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>> blocked for more than 120 seconds?. >>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>> am unable to SSH and also login with console, its stuck probably on some >>> disk operation. No error/warning logs or messages are store in VMs logs. >>> >> >>> > >>> > As far as I know this should be unrelated, I get this during heals >>> > without any freezes, it just means the storage is slow I think. >>> > >>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >>> replica volume. Can someone advice how to debug this problem or what can >>> cause these issues? >>> >> It?s really annoying, I?ve tried to google everything but nothing >>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>> drivers, but its not related. >>> >> >>> > >>> > Any chance your gluster goes readonly ? Have you checked your gluster >>> > logs to see if maybe they lose each other some times ? >>> > /var/log/glusterfs >>> > >>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>> > that might contain the actual error at the time of the freez. >>> > _______________________________________________ >>> > Gluster-users mailing list >>> > Gluster-users at gluster.org >>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgurusid at redhat.com Tue May 14 04:36:21 2019 From: pgurusid at redhat.com (pgurusid at redhat.com) Date: Tue, 14 May 2019 04:36:21 +0000 Subject: [Gluster-devel] Invitation: Gluster Community Meeting (APAC friendly hours) @ Every 2 weeks at 11:30am on Tuesday 15 times (IST) (gluster-devel@gluster.org) Message-ID: <0000000000001e42ba0588d19373@google.com> You have been invited to the following event. Title: Gluster Community Meeting (APAC friendly hours) Bridge: https://bluejeans.com/836554017 Meeting minutes: https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g?both Previous Meeting notes: http://github.com/gluster/community When: Every 2 weeks at 11:30am on Tuesday 15 times India Standard Time - Kolkata Where: https://bluejeans.com/836554017 Calendar: gluster-devel at gluster.org Who: * pgurusid at redhat.com - organizer * gluster-users at gluster.org * maintainers at gluster.org * gluster-devel at gluster.org Event details: https://www.google.com/calendar/event?action=VIEW&eid=NTEwOGJvMGZjMnRjN3Z0YzY0OGNmb3E4dXQgZ2x1c3Rlci1kZXZlbEBnbHVzdGVyLm9yZw&tok=MTkjcGd1cnVzaWRAcmVkaGF0LmNvbTdlM2Y3OWJjZDY4NDJjMDYzYjMwOWZjNDZmOGRiMDU0YjM3ZDhjYzk&ctz=Asia%2FKolkata&hl=en&es=0 Invitation from Google Calendar: https://www.google.com/calendar/ You are receiving this courtesy email at the account gluster-devel at gluster.org because you are an attendee of this event. To stop receiving future updates for this event, decline this event. Alternatively you can sign up for a Google account at https://www.google.com/calendar/ and control your notification settings for your entire calendar. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/calendar
Size: 2143 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invite.ics
Type: application/ics
Size: 2195 bytes
Desc: not available
URL:

From pgurusid at redhat.com  Tue May 14 04:47:10 2019
From: pgurusid at redhat.com (pgurusid at redhat.com)
Date: Tue, 14 May 2019 04:47:10 +0000
Subject: [Gluster-devel] Updated invitation: Gluster Community Meeting
	(APAC friendly hours) @ Every 2 weeks from 11:30am to 12:30pm on
	Tuesday 15 times (IST) (gluster-devel at gluster.org)
Message-ID: <000000000000d586ae0588d1b9de at google.com>

This event has been changed.

Title: Gluster Community Meeting (APAC friendly hours)
Bridge: https://bluejeans.com/836554017
Meeting minutes: https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g?both
Previous Meeting notes: http://github.com/gluster/community
When: Every 2 weeks from 11:30am to 12:30pm on Tuesday 15 times India
Standard Time - Kolkata (changed)
Where: https://bluejeans.com/836554017
Calendar: gluster-devel at gluster.org
Who:
    * pgurusid at redhat.com - organizer
    * gluster-users at gluster.org
    * maintainers at gluster.org
    * gluster-devel at gluster.org
    * ranaraya at redhat.com
    * khiremat at redhat.com
    * dcunningham at voisonics.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/calendar
Size: 2586 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invite.ics
Type: application/ics
Size: 2645 bytes
Desc: not available
URL:

From amukherj at redhat.com  Wed May 15 05:54:33 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Wed, 15 May 2019 11:24:33 +0530
Subject: [Gluster-devel] tests are timing out in master branch
Message-ID:

There're random tests which are timing out after 200 secs. My belief is
that this is a major regression introduced by some recent commit, or that
the builders have become extremely slow, which I highly doubt.
I'd request that we first figure out the cause, get master back to its
proper health and then get back to the review/merge queue.

Sanju has already started looking into
/tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand
what test is specifically hanging and consuming more time.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sankarshan.mukhopadhyay at gmail.com  Wed May 15 06:14:53 2019
From: sankarshan.mukhopadhyay at gmail.com (Sankarshan Mukhopadhyay)
Date: Wed, 15 May 2019 11:44:53 +0530
Subject: [Gluster-devel] tests are timing out in master branch
In-Reply-To:
References:
Message-ID:

On Wed, May 15, 2019 at 11:24 AM Atin Mukherjee wrote:
>
> There're random tests which are timing out after 200 secs. My belief is
> that this is a major regression introduced by some recent commit, or that
> the builders have become extremely slow, which I highly doubt. I'd request
> that we first figure out the cause, get master back to its proper health
> and then get back to the review/merge queue.
>

For such dire situations, we also need to consider a proposal to back out
patches in order to keep the master healthy. The outcome we seek is a
healthy master - the isolation of the cause allows us to not repeat the
same offense.

> Sanju has already started looking into
> /tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand
> what test is specifically hanging and consuming more time.

From ndevos at redhat.com  Wed May 15 07:17:50 2019
From: ndevos at redhat.com (Niels de Vos)
Date: Wed, 15 May 2019 09:17:50 +0200
Subject: [Gluster-devel] nightly builds are available again, with slightly
	different versioning
Message-ID: <20190515071750.GA22685@ndevos-x270>

This is sort of an RCA and notification to anyone interested in using
nightly builds of GlusterFS. If you have any (automated) tests that
consume the nightly builds for non-master branches, you did not run
tests with updated packages since 2 May 2019. The nightly builds failed
to run, but nobody was notified of this or reported it.

Around two weeks ago the nightly builds for glusterfs of the non-master
branches were broken due to a change in the CI script. This has been
corrected now and a manual run of the job shows green balls again:
https://ci.centos.org/view/Gluster/job/gluster_build-rpms/

The initial breakage was introduced by an optimization to not download
the whole glusterfs git repository, but only the current HEAD. This did
not take into account that 'git checkout' would not be able to switch to
a branch that was not downloaded. With a few iterations of fixes, it
became obvious that tags were also not fetched (duh), and 'git describe'
would not work. Without tags it is not possible to mark builds with the
most recent minor release that was made of a branch. Currently the date
of the build + git-hash is part of the package version. That means that
there is a new version of each branch every day, instead of only after
commits have been merged. This might be changed in the future...
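For reference, a minimal sketch of the failure mode described above,
runnable against any clone of the repository; this is an illustration,
not the actual CI script:

    # a HEAD-only clone knows nothing about other branches or tags
    git clone --depth 1 https://github.com/gluster/glusterfs.git
    cd glusterfs
    git checkout release-6      # fails: the branch was never downloaded
    git describe                # fails: no tags to derive a version from
    # fetching the full history and the tags makes both work again
    git fetch --unshallow --tags origin
    git fetch origin release-6:release-6
    git checkout release-6
    git describe --tags         # now names the latest minor release tag
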
As a reminder, the YUM .repo files for the nightly builds can be found at http://artifacts.ci.centos.org/gluster/nightly/ Cheers, Niels From atumball at redhat.com Wed May 15 08:18:09 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 15 May 2019 13:48:09 +0530 Subject: [Gluster-devel] nightly builds are available again, with slightly different versioning In-Reply-To: <20190515071750.GA22685@ndevos-x270> References: <20190515071750.GA22685@ndevos-x270> Message-ID: Thanks for noticing and correcting the issue Niels. Very helpful. On Wed, May 15, 2019 at 12:48 PM Niels de Vos wrote: > This is sort of an RCA and notification to anyone interested in using > nightly builds of GlusterFS. If you have any (automated) tests that > consume the nightly builds for non-master branches, you did not run > tests with updated packages since 2 May 2019. The nightly builds failed > to run, but nobody was notified or reported this. > > Around two weeks ago the nightly builds for glusterfs of the non-master > branches were broken due to a change in the CI script. This has been > corrected now and a manual run of the job shows green balls again: > https://ci.centos.org/view/Gluster/job/gluster_build-rpms/ > > The initial breakage was introduced by an optimization to not download > the whole glusterfs git repository, but only the current HEAD. This did > not take into account that 'git checkout' would not be able to switch to > a branch that was not downloaded. With a few iterations of fixes, it > became obvious that also tags were not fetched (duh), and 'git describe' > would not work. Without tags it is not possible to mark builds with the > most recent minor release that was made of a branch. Currently the date > of the build + git-hash is part of the package version. That means that > there is a new version of each branch every day, instead of only after > commits have been merged. This might be changed in the future... > > As a reminder, the YUM .repo files for the nightly builds can be found > at http://artifacts.ci.centos.org/gluster/nightly/ > > Cheers, > Niels > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Wed May 15 10:57:30 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Wed, 15 May 2019 16:27:30 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th In-Reply-To: References: Message-ID: Hi, The following patch is waiting for centos regression. https://review.gluster.org/#/c/glusterfs/+/22725/ Sunny or Kotresh, please do take a look so that we can go ahead with the tagging. On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote: > > Hi, > > Expected tagging date for release-6.2 is on May, 15th, 2019. > > Please ensure required patches are backported and also are passing > regressions and are appropriately reviewed for easy merging and tagging > on the date. > > -- > Regards, > Hari Gowtham. -- Regards, Hari Gowtham. 
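Following up on the nightly-builds reminder above, a hedged sketch of
consuming them on a CentOS machine; the exact name of the .repo file
under that URL is an assumption:

    # install the nightly repository definition, then the packages
    curl -o /etc/yum.repos.d/glusterfs-nightly.repo \
        http://artifacts.ci.centos.org/gluster/nightly/master.repo   # assumed file name
    yum install glusterfs-server
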
From abhishpaliwal at gmail.com Thu May 16 05:06:14 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Thu, 16 May 2019 10:36:14 +0530 Subject: [Gluster-devel] Memory leak in glusterfs process Message-ID: Hi Team, I upload some valgrind logs from my gluster 5.4 setup. This is writing to the volume every 15 minutes. I stopped glusterd and then copy away the logs. The test was running for some simulated days. They are zipped in valgrind-54.zip. Lots of info in valgrind-2730.log. Lots of possibly lost bytes in glusterfs and even some definitely lost bytes. ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391 of 391 ==2737== at 0x4C29C25: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==2737== by 0xA22485E: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA217C94: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21D9F8: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21DED9: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21E685: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA1B9D8C: init (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd) ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd) ==2737== ==2737== LEAK SUMMARY: ==2737== definitely lost: 1,053 bytes in 10 blocks ==2737== indirectly lost: 317 bytes in 3 blocks ==2737== possibly lost: 2,374,971 bytes in 524 blocks ==2737== still reachable: 53,277 bytes in 201 blocks ==2737== suppressed: 0 bytes in 0 blocks -- Regards Abhishek Paliwal -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: valgrind-54.zip Type: application/zip Size: 45897 bytes Desc: not available URL: From abhishpaliwal at gmail.com Thu May 16 05:19:49 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Thu, 16 May 2019 10:49:49 +0530 Subject: [Gluster-devel] Memory leak in glusterfs Message-ID: Hi Team, I upload some valgrind logs from my gluster 5.4 setup. This is writing to the volume every 15 minutes. I stopped glusterd and then copy away the logs. The test was running for some simulated days. They are zipped in valgrind-54.zip. Lots of info in valgrind-2730.log. Lots of possibly lost bytes in glusterfs and even some definitely lost bytes. ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391 of 391 ==2737== at 0x4C29C25: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==2737== by 0xA22485E: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA217C94: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21D9F8: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21DED9: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21E685: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA1B9D8C: init (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8A2B8: ??? 
(in /usr/lib64/libglusterfs.so.0.0.1)
==2737== by 0x4E8AAB3: glusterfs_graph_activate (in /usr/lib64/libglusterfs.so.0.0.1)
==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
==2737==
==2737== LEAK SUMMARY:
==2737== definitely lost: 1,053 bytes in 10 blocks
==2737== indirectly lost: 317 bytes in 3 blocks
==2737== possibly lost: 2,374,971 bytes in 524 blocks
==2737== still reachable: 53,277 bytes in 201 blocks
==2737== suppressed: 0 bytes in 0 blocks

--
Regards
Abhishek Paliwal
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2748.log
Type: text/x-log
Size: 23721 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2746.log
Type: text/x-log
Size: 24526 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2730.log
Type: text/x-log
Size: 1239130 bytes
Desc: not available
URL:

From srakonde at redhat.com  Thu May 16 13:47:43 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 16 May 2019 19:17:43 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
Message-ID:

In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
is dumping core, hence the regression is failing for many patches.

Rafi/Raghavendra, can you please look into this issue?

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rabhat at redhat.com  Thu May 16 13:58:19 2019
From: rabhat at redhat.com (FNU Raghavendra Manjunath)
Date: Thu, 16 May 2019 09:58:19 -0400
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To:
References:
Message-ID:

I am working on the other uss issue, i.e. the occasional failure of uss.t
due to delays in the brick-mux regression. Rafi, can you please look into
this one?

Regards,
Raghavendra

On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde wrote:

> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
> is dumping core, hence the regression is failing for many patches.
>
> Rafi/Raghavendra, can you please look into this issue?
>
> --
> Thanks,
> Sanju
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rkavunga at redhat.com  Thu May 16 14:19:12 2019
From: rkavunga at redhat.com (RAFI KC)
Date: Thu, 16 May 2019 19:49:12 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To:
References:
Message-ID: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>

Currently I'm looking into one of the priority issues; in parallel I will
also look into this one.

Sanju,

Do you have a link to the core file?

Regards

Rafi KC

On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>
> I am working on the other uss issue, i.e. the occasional failure of uss.t
> due to delays in the brick-mux regression. Rafi, can you please look
> into this one?
>
> Regards,
> Raghavendra
>
> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde
> wrote:
>
> In most of the regression jobs
> ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t is dumping core,
> hence the regression is failing for many patches.
>
> Rafi/Raghavendra, can you please look into this issue?
>
> --
> Thanks,
> Sanju
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
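For reference, a minimal sketch of extracting a backtrace from one of the
regression cores once it has been downloaded; both paths are placeholders,
and the binary must come from the same build that dumped the core:

    gdb /build/install/sbin/glusterfsd /path/to/core \
        --batch -ex 'thread apply all bt full' > /tmp/uss-ssl-backtrace.txt
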
From srakonde at redhat.com  Thu May 16 14:31:03 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 16 May 2019 20:01:03 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
Message-ID:

Thank you for the quick responses. I missed pasting the links here. You
can find the core files in the following links.

https://build.gluster.org/job/centos7-regression/6035/consoleFull
https://build.gluster.org/job/centos7-regression/6055/consoleFull
https://build.gluster.org/job/centos7-regression/6045/consoleFull

On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote:

> Currently I'm looking into one of the priority issues; in parallel I will
> also look into this one.
>
> Sanju,
>
> Do you have a link to the core file?
>
> Regards
>
> Rafi KC
> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>
> I am working on the other uss issue, i.e. the occasional failure of uss.t
> due to delays in the brick-mux regression. Rafi, can you please look into
> this one?
>
> Regards,
> Raghavendra
>
> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde wrote:
>
>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
>> is dumping core, hence the regression is failing for many patches.
>>
>> Rafi/Raghavendra, can you please look into this issue?
>>
>> --
>> Thanks,
>> Sanju
>>
--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hgowtham at redhat.com  Fri May 17 07:24:22 2019
From: hgowtham at redhat.com (Hari Gowtham)
Date: Fri, 17 May 2019 12:54:22 +0530
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To:
References:
Message-ID:

Hi Kotresh and Sunny,
The patch has been failing regression a few times.
We need to look into why this is happening and take a decision
as to whether to take it into release 6.2 or drop it.
> >
> > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote:
> >
> > Hi,
> >
> > The following patch is waiting for centos regression.
> > https://review.gluster.org/#/c/glusterfs/+/22725/
> >
> > Sunny or Kotresh, please do take a look so that we can go ahead with
> > the tagging.
> >
> > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote:
> > >
> > > Hi,
> > >
> > > Expected tagging date for release-6.2 is on May, 15th, 2019.
> > >
> > > Please ensure required patches are backported and also are passing
> > > regressions and are appropriately reviewed for easy merging and tagging
> > > on the date.
> > >
> > > --
> > > Regards,
> > > Hari Gowtham.
> >
> > --
> > Regards,
> > Hari Gowtham.

From sunkumar at redhat.com  Fri May 17 07:35:50 2019
From: sunkumar at redhat.com (Sunny Kumar)
Date: Fri, 17 May 2019 13:05:50 +0530
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To:
References:
Message-ID:

Hi Hari,

For this to pass regression, the other 3 patches need to be merged first.
I tried to merge them, but I do not have sufficient permissions to merge
on the 6.2 branch. A bug is already in place to grant additional
permissions for us (me, you and Rinku), so until then we are waiting on
Shyam to merge them.

-Sunny

On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote:
>
> Hi Kotresh and Sunny,
> The patch has been failing regression a few times.
> We need to look into why this is happening and take a decision
> as to whether to take it into release 6.2 or drop it.
> > > > > > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham > wrote: > > > > > > > > Hi, > > > > > > > > The following patch is waiting for centos regression. > > > > https://review.gluster.org/#/c/glusterfs/+/22725/ > > > > > > > > Sunny or Kotresh, please do take a look so that we can go ahead with > > > > the tagging. > > > > > > > > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham > wrote: > > > > > > > > > > Hi, > > > > > > > > > > Expected tagging date for release-6.2 is on May, 15th, 2019. > > > > > > > > > > Please ensure required patches are backported and also are passing > > > > > regressions and are appropriately reviewed for easy merging and > tagging > > > > > on the date. > > > > > > > > > > -- > > > > > Regards, > > > > > Hari Gowtham. > > > > > > > > > > > > > > > > -- > > > > Regards, > > > > Hari Gowtham. > > > > > > > > > > > > -- > > > Regards, > > > Hari Gowtham. > > > > -- > Regards, > Hari Gowtham. > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Fri May 17 07:46:46 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Fri, 17 May 2019 13:16:46 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th In-Reply-To: References: Message-ID: https://review.gluster.org/#/q/topic:%22ref-1709738%22+(status:open%20OR%20status:merged) On Fri, May 17, 2019 at 1:13 PM Amar Tumballi Suryanarayan wrote: > > Which are the patches? I can merge it for now. > > -Amar > > On Fri, May 17, 2019 at 1:10 PM Hari Gowtham wrote: >> >> Thanks Sunny. >> Have CCed Shyam. >> >> On Fri, May 17, 2019 at 1:06 PM Sunny Kumar wrote: >> > >> > Hi Hari, >> > >> > For this to pass regression other 3 patches needs to merge first, I >> > tried to merge but do not have sufficient permissions to merge on 6.2 >> > branch. >> > I know bug is already in place to grant additional permission for >> > us(Me, you and Rinku) so until then waiting on Shyam to merge it. >> > >> > -Sunny >> > >> > On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote: >> > > >> > > Hi Kotresh ans Sunny, >> > > The patch has been failing regression a few times. >> > > We need to look into why this is happening and take a decision >> > > as to take it in release 6.2 or drop it. >> > > >> > > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote: >> > > > >> > > > Hi, >> > > > >> > > > The following patch is waiting for centos regression. >> > > > https://review.gluster.org/#/c/glusterfs/+/22725/ >> > > > >> > > > Sunny or Kotresh, please do take a look so that we can go ahead with >> > > > the tagging. >> > > > >> > > > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote: >> > > > > >> > > > > Hi, >> > > > > >> > > > > Expected tagging date for release-6.2 is on May, 15th, 2019. >> > > > > >> > > > > Please ensure required patches are backported and also are passing >> > > > > regressions and are appropriately reviewed for easy merging and tagging >> > > > > on the date. >> > > > > >> > > > > -- >> > > > > Regards, >> > > > > Hari Gowtham. >> > > > >> > > > >> > > > >> > > > -- >> > > > Regards, >> > > > Hari Gowtham. 
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Hari Gowtham.
>>
>>
>>
>> --
>> Regards,
>> Hari Gowtham.
>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>
>
> --
> Amar Tumballi (amarts)

--
Regards,
Hari Gowtham.

From abhishpaliwal at gmail.com  Fri May 17 09:16:20 2019
From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL)
Date: Fri, 17 May 2019 14:46:20 +0530
Subject: [Gluster-devel] Memory leak in glusterfs
In-Reply-To: References: Message-ID:

Anyone please reply....

On Thu, May 16, 2019, 10:49 ABHISHEK PALIWAL wrote:

> Hi Team,
>
> I uploaded some valgrind logs from my gluster 5.4 setup. This setup writes
> to the volume every 15 minutes. I stopped glusterd and then copied away the
> logs. The test was running for some simulated days. They are zipped in
> valgrind-54.zip.
>
> Lots of info in valgrind-2730.log. Lots of possibly lost bytes in
> glusterfs and even some definitely lost bytes.
>
> ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391
> of 391
> ==2737== at 0x4C29C25: calloc (in
> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==2737== by 0xA22485E: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA217C94: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21D9F8: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21DED9: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21E685: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA1B9D8C: init (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in
> /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
> ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
> ==2737==
> ==2737== LEAK SUMMARY:
> ==2737== definitely lost: 1,053 bytes in 10 blocks
> ==2737== indirectly lost: 317 bytes in 3 blocks
> ==2737== possibly lost: 2,374,971 bytes in 524 blocks
> ==2737== still reachable: 53,277 bytes in 201 blocks
> ==2737== suppressed: 0 bytes in 0 blocks
>
> --
>
>
>
>
> Regards
> Abhishek Paliwal
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From srangana at redhat.com  Fri May 17 11:13:50 2019
From: srangana at redhat.com (Shyam Ranganathan)
Date: Fri, 17 May 2019 07:13:50 -0400
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To: References: Message-ID: <47e9433f-6aaa-f044-9442-015014e14f2c@redhat.com>

These patches were dependent on each other, so a merge was not required
to get regression passing; that analysis seems incorrect. A patch
series, when tested, will pull in all the dependent patches anyway, so
please relook at what the failure could be. (I assume you would anyway.)
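To make "pull in all the dependent patches" concrete: Gerrit serves every
change as an ordinary git ref whose parent commits are its dependencies, so
checking out the tip change of a series fetches the whole series. A rough
sketch, where the patchset suffix "/3" and the log depth are illustrative
rather than taken from this thread:

    # Fetch the tip change of the series; its dependent changes come
    # along as parent commits of the fetched head.
    git fetch https://review.gluster.org/glusterfs refs/changes/25/22725/3
    git checkout FETCH_HEAD
    # The dependent changes should now be visible in the ancestry:
    git log --oneline -4

This is also why a regression run on the tip change exercises the dependent
patches even before they are merged.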
Shyam

On 5/17/19 3:46 AM, Hari Gowtham wrote:
> https://review.gluster.org/#/q/topic:%22ref-1709738%22+(status:open%20OR%20status:merged)
>
> On Fri, May 17, 2019 at 1:13 PM Amar Tumballi Suryanarayan
> wrote:
>>
>> Which are the patches? I can merge it for now.
>>
>> -Amar
>>
>> On Fri, May 17, 2019 at 1:10 PM Hari Gowtham wrote:
>>>
>>> Thanks Sunny.
>>> Have CCed Shyam.
>>>
>>> On Fri, May 17, 2019 at 1:06 PM Sunny Kumar wrote:
>>>>
>>>> Hi Hari,
>>>>
>>>> For this to pass regression other 3 patches needs to merge first, I
>>>> tried to merge but do not have sufficient permissions to merge on 6.2
>>>> branch.
>>>> I know bug is already in place to grant additional permission for
>>>> us(Me, you and Rinku) so until then waiting on Shyam to merge it.
>>>>
>>>> -Sunny
>>>>
>>>> On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote:
>>>>>
>>>>> Hi Kotresh ans Sunny,
>>>>> The patch has been failing regression a few times.
>>>>> We need to look into why this is happening and take a decision
>>>>> as to take it in release 6.2 or drop it.
>>>>>
>>>>> On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The following patch is waiting for centos regression.
>>>>>> https://review.gluster.org/#/c/glusterfs/+/22725/
>>>>>>
>>>>>> Sunny or Kotresh, please do take a look so that we can go ahead with
>>>>>> the tagging.
>>>>>>
>>>>>> On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Expected tagging date for release-6.2 is on May, 15th, 2019.
>>>>>>>
>>>>>>> Please ensure required patches are backported and also are passing
>>>>>>> regressions and are appropriately reviewed for easy merging and tagging
>>>>>>> on the date.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Hari Gowtham.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Hari Gowtham.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Hari Gowtham.
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Hari Gowtham.
>>> _______________________________________________
>>>
>>> Community Meeting Calendar:
>>>
>>> APAC Schedule -
>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>> Bridge: https://bluejeans.com/836554017
>>>
>>> NA/EMEA Schedule -
>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>> Bridge: https://bluejeans.com/486278655
>>>
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>
>>
>> --
>> Amar Tumballi (amarts)
>
>
>

From rkavunga at redhat.com  Sat May 18 10:26:16 2019
From: rkavunga at redhat.com (RAFI KC)
Date: Sat, 18 May 2019 15:56:16 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often
In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
Message-ID: <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>

All of these links have a common backtrace, suggesting a crash in the
socket layer's SSL code path. The backtrace is:

Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)):
#0  0x00007f9d14d65400 in ssl3_free_digest_list () from /lib64/libssl.so.10
No symbol table info available.
#1  0x00007f9d14d65586 in ssl3_digest_cached_records () from /lib64/libssl.so.10
No symbol table info available.
#2  0x00007f9d14d5f91d in ssl3_send_client_verify () from /lib64/libssl.so.10
No symbol table info available.
#3  0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10
No symbol table info available.
#4  0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482
        ret = -1
        cname = 0x0
        r = -1
        ssl_error = -1
        priv = 0x7f9ce802efc0
        __FUNCTION__ = "ssl_complete_connection"
#5  0x00007f9d14fbb596 in ssl_handle_client_connection_attempt (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809
        priv = 0x7f9ce802efc0
        ctx = 0x7f9d08001170
        idx = 1
        ret = -1
        fd = 16
        __FUNCTION__ = "ssl_handle_client_connection_attempt"
#6  0x00007f9d14fbb8b3 in socket_complete_connection (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908
        priv = 0x7f9ce802efc0
        ctx = 0x7f9d08001170
        idx = 1
        gen = 4
        ret = -1
        fd = 16
#7  0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 '\000') at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970
#8  0x00007f9d20c896c1 in event_dispatch_epoll_handler (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648
#9  0x00007f9d20c89bda in event_dispatch_epoll_worker (data=0x7f9cf4000b60) at /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761
#10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0

Mohit,

Do you have any idea what is going on with ssl?

Regards

Rafi KC

On 5/16/19 8:01 PM, Sanju Rakonde wrote:
> Thank you for the quick responses.
>
> I missed pasting the links here. You can find the core file in the
> following links.
> https://build.gluster.org/job/centos7-regression/6035/consoleFull
> https://build.gluster.org/job/centos7-regression/6055/consoleFull
> https://build.gluster.org/job/centos7-regression/6045/consoleFull
>
> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote:
>
> Currently I'm looking into one of the priority issue, In parallel
> I will also looking to this.
>
> Saju,
>
> Do you have a link to the core file?
>
>
> Regards
>
> Rafi KC
>
> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>>
>> I am working on other uss issue. i.e. the occasional failure of
>> uss.t due to delays in the brick-mux regression. Rafi? Can you
>> please look into this?
>>
>> Regards,
>> Raghavendra
>>
>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde
>> wrote:
>>
>> In most of the regression jobs
>> ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t is dumping
>> core, hence the regression is failing for many patches.
>>
>> Rafi/Raghavendra, can you please look into this issue?
>>
>> --
>> Thanks,
>> Sanju
>>
>
>
> --
> Thanks,
> Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From moagrawa at redhat.com  Sat May 18 12:56:35 2019
From: moagrawa at redhat.com (Mohit Agrawal)
Date: Sat, 18 May 2019 18:26:35 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often
In-Reply-To: <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>
References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>
Message-ID:

Hi Rafi,

I have not checked yet, on Monday I will check the same.
Thanks, Mohit Agrawal On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: > All of this links have a common backtrace, and suggest a crash from socket > layer with ssl code path, > > Backtrace is > > Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): > #0 0x00007f9d14d65400 in ssl3_free_digest_list () from /lib64/libssl.so.10 > No symbol table info available. > #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from > /lib64/libssl.so.10 > No symbol table info available. > #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from > /lib64/libssl.so.10 > No symbol table info available. > #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 > No symbol table info available. > #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 > ret = -1 > cname = 0x0 > r = -1 > ssl_error = -1 > priv = 0x7f9ce802efc0 > __FUNCTION__ = "ssl_complete_connection" > #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt > (this=0x7f9ce802e980) at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 > priv = 0x7f9ce802efc0 > ctx = 0x7f9d08001170 > idx = 1 > ret = -1 > fd = 16 > __FUNCTION__ = "ssl_handle_client_connection_attempt" > #6 0x00007f9d14fbb8b3 in socket_complete_connection (this=0x7f9ce802e980) > at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 > priv = 0x7f9ce802efc0 > ctx = 0x7f9d08001170 > idx = 1 > gen = 4 > ret = -1 > fd = 16 > #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, > data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 > '\000') at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 > #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler > (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at > /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 > #9 0x00007f9d20c89bda in event_dispatch_epoll_worker > (data=0x7f9cf4000b60) at > /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 > #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 > > > Mohith, > > Do you have any idea what is going on with ssl? > > Regards > > Rafi KC > On 5/16/19 8:01 PM, Sanju Rakonde wrote: > > Thank you for the quick responses. > > I missed pasting the links here. You can find the core file in the > following links. > https://build.gluster.org/job/centos7-regression/6035/consoleFull > https://build.gluster.org/job/centos7-regression/6055/consoleFull > https://build.gluster.org/job/centos7-regression/6045/consoleFull > > On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: > >> Currently I'm looking into one of the priority issue, In parallel I will >> also looking to this. >> >> Saju, >> >> Do you have a link to the core file? >> >> >> Regards >> >> Rafi KC >> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >> >> >> I am working on other uss issue. i.e. the occasional failure of uss.t due >> to delays in the brick-mux regression. Rafi? Can you please look into this? >> >> Regards, >> Raghavendra >> >> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >> wrote: >> >>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>> is dumping core, hence the regression is failing for many patches. >>> >>> Rafi/Raghavendra, can you please look into this issue? 
>>> >>> -- >>> Thanks, >>> Sanju >>> >> > > -- > Thanks, > Sanju > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon May 20 01:45:02 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 20 May 2019 01:45:02 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1886700542.45.1558316702971.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1709959 / core: Gluster causing Kubernetes containers to enter crash loop with 'mkdir ... file exists' error message https://bugzilla.redhat.com/1707866 / core: Thousands of duplicate files in glusterfs mountpoint directory listing https://bugzilla.redhat.com/1711400 / disperse: Dispersed volumes leave open file descriptors on nodes https://bugzilla.redhat.com/1708505 / disperse: [EC] /tests/basic/ec/ec-data-heal.t is failing as heal is not happening properly https://bugzilla.redhat.com/1708531 / distribute: gluster rebalance status brain splits https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes. https://bugzilla.redhat.com/1710744 / fuse: [FUSE] Endpoint is not connected after "Found anomalies" error https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE https://bugzilla.redhat.com/1706716 / glusterd: glusterd generated core while running ./tests/bugs/cli/bug-1077682.t https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd https://bugzilla.redhat.com/1706842 / gluster-smb: Hard Failover with Samba and Glusterfs fails https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work https://bugzilla.redhat.com/1708257 / project-infrastructure: Grant additional maintainers merge rights on release branches https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229 [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... Name: build.log Type: application/octet-stream Size: 2279 bytes Desc: not available URL: From pkalever at redhat.com Mon May 20 12:36:41 2019 From: pkalever at redhat.com (Prasanna Kalever) Date: Mon, 20 May 2019 18:06:41 +0530 Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive! In-Reply-To: References: Message-ID: Hey Vlad, Thanks for trying gluster-block. Appreciate your feedback. Here is the patch which should fix the issue you have noticed: https://github.com/gluster/gluster-block/pull/233 Thanks! -- Prasanna On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote: > > > straight from > > ./autogen.sh && ./configure && make -j install > > > CentOS Linux release 7.6.1810 (Core) > > > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr. 
> May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :] > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state. > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed. > > > > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote: >> >> Hello Gluster folks, >> >> Gluster-block team is happy to announce the v0.4 release [1]. >> >> This is the new stable version of gluster-block, lots of new and >> exciting features and interesting bug fixes are made available as part >> of this release. >> Please find the big list of release highlights and notable fixes at [2]. >> >> Details about installation can be found in the easy install guide at >> [3]. Find the details about prerequisites and setup guide at [4]. >> If you are a new user, checkout the demo video attached in the README >> doc [5], which will be a good source of intro to the project. >> There are good examples about how to use gluster-block both in the man >> pages [6] and test file [7] (also in the README). >> >> gluster-block is part of fedora package collection, an updated package >> with release version v0.4 will be soon made available. And the >> community provided packages will be soon made available at [8]. >> >> Please spend a minute to report any kind of issue that comes to your >> notice with this handy link [9]. >> We look forward to your feedback, which will help gluster-block get better! >> >> We would like to thank all our users, contributors for bug filing and >> fixes, also the whole team who involved in the huge effort with >> pre-release testing. >> >> >> [1] https://github.com/gluster/gluster-block >> [2] https://github.com/gluster/gluster-block/releases >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL >> [4] https://github.com/gluster/gluster-block#usage >> [5] https://github.com/gluster/gluster-block/blob/master/README.md >> [6] https://github.com/gluster/gluster-block/tree/master/docs >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t >> [8] https://download.gluster.org/pub/gluster/gluster-block/ >> [9] https://github.com/gluster/gluster-block/issues/new >> >> Cheers, >> Team Gluster-Block! >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users From dkhandel at redhat.com Mon May 20 13:09:15 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Mon, 20 May 2019 18:39:15 +0530 Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com> Message-ID: Any updates on this? It's failing few of the regression runs. On Sat, May 18, 2019 at 6:27 PM Mohit Agrawal wrote: > Hi Rafi, > > I have not checked yet, on Monday I will check the same. 
> > Thanks, > Mohit Agrawal > > On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: > >> All of this links have a common backtrace, and suggest a crash from >> socket layer with ssl code path, >> >> Backtrace is >> >> Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): >> #0 0x00007f9d14d65400 in ssl3_free_digest_list () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 >> No symbol table info available. >> #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) >> at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 >> ret = -1 >> cname = 0x0 >> r = -1 >> ssl_error = -1 >> priv = 0x7f9ce802efc0 >> __FUNCTION__ = "ssl_complete_connection" >> #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt >> (this=0x7f9ce802e980) at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 >> priv = 0x7f9ce802efc0 >> ctx = 0x7f9d08001170 >> idx = 1 >> ret = -1 >> fd = 16 >> __FUNCTION__ = "ssl_handle_client_connection_attempt" >> #6 0x00007f9d14fbb8b3 in socket_complete_connection >> (this=0x7f9ce802e980) at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 >> priv = 0x7f9ce802efc0 >> ctx = 0x7f9d08001170 >> idx = 1 >> gen = 4 >> ret = -1 >> fd = 16 >> #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, >> data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 >> '\000') at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 >> #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler >> (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at >> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 >> #9 0x00007f9d20c89bda in event_dispatch_epoll_worker >> (data=0x7f9cf4000b60) at >> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 >> #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 >> >> >> Mohith, >> >> Do you have any idea what is going on with ssl? >> >> Regards >> >> Rafi KC >> On 5/16/19 8:01 PM, Sanju Rakonde wrote: >> >> Thank you for the quick responses. >> >> I missed pasting the links here. You can find the core file in the >> following links. >> https://build.gluster.org/job/centos7-regression/6035/consoleFull >> https://build.gluster.org/job/centos7-regression/6055/consoleFull >> https://build.gluster.org/job/centos7-regression/6045/consoleFull >> >> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: >> >>> Currently I'm looking into one of the priority issue, In parallel I will >>> also looking to this. >>> >>> Saju, >>> >>> Do you have a link to the core file? >>> >>> >>> Regards >>> >>> Rafi KC >>> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >>> >>> >>> I am working on other uss issue. i.e. the occasional failure of uss.t >>> due to delays in the brick-mux regression. Rafi? Can you please look into >>> this? >>> >>> Regards, >>> Raghavendra >>> >>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >>> wrote: >>> >>>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>>> is dumping core, hence the regression is failing for many patches. 
>>>> >>>> Rafi/Raghavendra, can you please look into this issue? >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> >>> >> >> -- >> Thanks, >> Sanju >> >> _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moagrawa at redhat.com Mon May 20 13:11:39 2019 From: moagrawa at redhat.com (Mohit Agrawal) Date: Mon, 20 May 2019 18:41:39 +0530 Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com> Message-ID: I am working on it. On Mon, May 20, 2019 at 6:39 PM Deepshikha Khandelwal wrote: > Any updates on this? > > It's failing few of the regression runs. > > On Sat, May 18, 2019 at 6:27 PM Mohit Agrawal wrote: > >> Hi Rafi, >> >> I have not checked yet, on Monday I will check the same. >> >> Thanks, >> Mohit Agrawal >> >> On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: >> >>> All of this links have a common backtrace, and suggest a crash from >>> socket layer with ssl code path, >>> >>> Backtrace is >>> >>> Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): >>> #0 0x00007f9d14d65400 in ssl3_free_digest_list () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 >>> No symbol table info available. 
>>> #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) >>> at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 >>> ret = -1 >>> cname = 0x0 >>> r = -1 >>> ssl_error = -1 >>> priv = 0x7f9ce802efc0 >>> __FUNCTION__ = "ssl_complete_connection" >>> #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt >>> (this=0x7f9ce802e980) at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 >>> priv = 0x7f9ce802efc0 >>> ctx = 0x7f9d08001170 >>> idx = 1 >>> ret = -1 >>> fd = 16 >>> __FUNCTION__ = "ssl_handle_client_connection_attempt" >>> #6 0x00007f9d14fbb8b3 in socket_complete_connection >>> (this=0x7f9ce802e980) at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 >>> priv = 0x7f9ce802efc0 >>> ctx = 0x7f9d08001170 >>> idx = 1 >>> gen = 4 >>> ret = -1 >>> fd = 16 >>> #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, >>> data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 >>> '\000') at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 >>> #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler >>> (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at >>> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 >>> #9 0x00007f9d20c89bda in event_dispatch_epoll_worker >>> (data=0x7f9cf4000b60) at >>> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 >>> #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 >>> >>> >>> Mohith, >>> >>> Do you have any idea what is going on with ssl? >>> >>> Regards >>> >>> Rafi KC >>> On 5/16/19 8:01 PM, Sanju Rakonde wrote: >>> >>> Thank you for the quick responses. >>> >>> I missed pasting the links here. You can find the core file in the >>> following links. >>> https://build.gluster.org/job/centos7-regression/6035/consoleFull >>> https://build.gluster.org/job/centos7-regression/6055/consoleFull >>> https://build.gluster.org/job/centos7-regression/6045/consoleFull >>> >>> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: >>> >>>> Currently I'm looking into one of the priority issue, In parallel I >>>> will also looking to this. >>>> >>>> Saju, >>>> >>>> Do you have a link to the core file? >>>> >>>> >>>> Regards >>>> >>>> Rafi KC >>>> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >>>> >>>> >>>> I am working on other uss issue. i.e. the occasional failure of uss.t >>>> due to delays in the brick-mux regression. Rafi? Can you please look into >>>> this? >>>> >>>> Regards, >>>> Raghavendra >>>> >>>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >>>> wrote: >>>> >>>>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>>>> is dumping core, hence the regression is failing for many patches. >>>>> >>>>> Rafi/Raghavendra, can you please look into this issue? 
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Sanju
>>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Sanju
>>>
>>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cynthia.zhou at nokia-sbell.com  Tue May 21 07:12:28 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Tue, 21 May 2019 07:12:28 +0000
Subject: [Gluster-devel] glusterfs coredump--mempool
Message-ID: <3cf2c0a2d1ca4ac19e085b1ff5fe2370@nokia-sbell.com>

Hi glusterfs expert,
I hit a glusterfs process coredump again in my env, shortly after the
glusterfs process startup. The frame's local has become NULL, but it seems
this frame is not destroyed yet, since the magic number
(GF_MEM_HEADER_MAGIC) is still untouched.

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --acl --volfile-server=mn-0.local --volfile-server=mn-1.loc'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f867fcd2971 in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8654008830) at client-rpc-fops.c:1510
1510 CLIENT_STACK_UNWIND (inodelk, frame, rsp.op_ret,
[Current thread is 1 (Thread 0x7f867d6d4700 (LWP 3046))]
Missing separate debuginfos, use: dnf debuginfo-install glusterfs-fuse-3.12.15-1.wos2.wf29.x86_64
(gdb) bt
#0  0x00007f867fcd2971 in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8654008830) at client-rpc-fops.c:1510
#1  0x00007f8685ea5584 in rpc_clnt_handle_reply (clnt=clnt at entry=0x7f8678070030, pollin=pollin at entry=0x7f86702833e0) at rpc-clnt.c:782
#2  0x00007f8685ea587b in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f8678070060, event=<optimized out>, data=0x7f86702833e0) at rpc-clnt.c:975
#3  0x00007f8685ea1b83 in rpc_transport_notify (this=this at entry=0x7f8678070270, event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f86702833e0) at rpc-transport.c:538
#4  0x00007f8680b99867 in socket_event_poll_in (notify_handled=_gf_true, this=0x7f8678070270) at socket.c:2260
#5  socket_event_handler (fd=<optimized out>, idx=3, gen=1, data=0x7f8678070270, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>) at socket.c:2645
#6  0x00007f8686132911 in event_dispatch_epoll_handler (event=0x7f867d6d3e6c, event_pool=0x55e1b2792b00) at event-epoll.c:583
#7  event_dispatch_epoll_worker (data=0x7f867805ece0) at event-epoll.c:659
#8  0x00007f8684ea65da in start_thread () from /lib64/libpthread.so.0
#9  0x00007f868474eeaf in clone () from /lib64/libc.so.6
(gdb) print *(call_frame_t*)myframe
$3 = {root = 0x7f86540271a0, parent = 0x0, frames = {next = 0x7f8654027898, prev = 0x7f8654027898}, local = 0x0, this = 0x7f8678013080, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0, complete = _gf_false, xid = 0, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
(gdb) x/4xw 0x7f8654008810
0x7f8654008810: 0xcafebabe 0x00000000 0x00000000 0x00000000
(gdb) p *(pooled_obj_hdr_t *)0x7f8654008810
$2 = {magic = 3405691582, next = 0x0, pool_list = 0x7f8654000b80, power_of_two = 8}

I added "uint32_t xid" to the _call_frame data structure, and set it from
rpcreq->xid in the __save_frame function. In a normal situation this xid
should be 0 only immediately after create_frame takes the frame from the
memory pool. But in this case the xid is 0, so it seems like the frame has
been given out for use again before this unwind was done with it. Do you
have any idea how this can happen?

cynthia
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kdhananj at redhat.com  Tue May 21 10:05:07 2019
From: kdhananj at redhat.com (Krutika Dhananjay)
Date: Tue, 21 May 2019 15:35:07 +0530
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To: <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>
References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>
Message-ID:

Hi Martin,

Glad it worked!

And yes, 3.7.6 is really old! :)

So the issue is occurring when the vm flushes outstanding data to disk. And
this is taking > 120s because there's a lot of buffered writes to flush,
possibly followed by an fsync too, which needs to sync them to disk (a
volume profile would have been helpful in confirming this).

All these two options do is truly honor the O_DIRECT flag (which is what we
want anyway, given the vms are opened with the 'cache=none' qemu option).
This will skip write-caching on the gluster client side and also bypass the
page-cache on the gluster-bricks, and so data gets flushed faster, thereby
eliminating these timeouts.

-Krutika

On Mon, May 20, 2019 at 3:38 PM Martin wrote:

> Hi Krutika,
>
> Also, gluster version please?
>
> I am running old 3.7.6. (Yes I know I should upgrade asap)
>
> I've applied firstly "network.remote-dio off", behaviour did not change,
> VMs got stuck after some time again.
> Then I've set "performance.strict-o-direct on" and problem completely
> disappeared. No more hangs at all (7 days without any problems at all).
> This SOLVED the issue.
>
> Can you explain what remote-dio and strict-o-direct variables changed in
> behaviour of my Gluster? It would be great for later archive/users to
> understand what and why this solved my issue.
>
> Anyway, Thanks a LOT!!!
>
> BR,
> Martin
>
> On 13 May 2019, at 10:20, Krutika Dhananjay wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> preferably one option changed at a time, its impact tested and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
>
> On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote:
>
>> Cache in qemu is none. That should be correct.
This is full command : >> >> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine >> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp >> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 >> -no-user-config -nodefaults -chardev >> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait >> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime >> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device >> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 >> >> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 >> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 >> -drive file=/var/lib/one//datastores/116/312/*disk.0* >> ,format=raw,if=none,id=drive-virtio-disk1,cache=none >> -device >> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 >> -drive file=gluster://localhost:24007/imagestore/ >> *7b64d6757acc47a39503f68731f89b8e* >> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none >> -device >> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 >> -drive file=/var/lib/one//datastores/116/312/*disk.1* >> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on >> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 >> >> -netdev tap,fd=26,id=hostnet0 >> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 >> -chardev pty,id=charserial0 -device >> isa-serial,chardev=charserial0,id=serial0 >> -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait >> -device >> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 >> -vnc 0.0.0.0:312,password -device >> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on >> >> I?ve highlighted disks. First is VM context disk - Fuse used, second is >> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. >> >> Krutika, >> I will start profiling on Gluster Volumes and wait for next VM to fail. >> Than I will attach/send profiling info after some VM will be failed. I >> suppose this is correct profiling strategy. >> > > About this, how many vms do you need to recreate it? A single vm? Or > multiple vms doing IO in parallel? > > >> Thanks, >> BR! >> Martin >> >> On 13 May 2019, at 09:21, Krutika Dhananjay wrote: >> >> Also, what's the caching policy that qemu is using on the affected vms? >> Is it cache=none? Or something else? You can get this information in the >> command line of qemu-kvm process corresponding to your vm in the ps output. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay >> wrote: >> >>> What version of gluster are you using? >>> Also, can you capture and share volume-profile output for a run where >>> you manage to recreate this issue? >>> >>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >>> Let me know if you have any questions. >>> >>> -Krutika >>> >>> On Mon, May 13, 2019 at 12:34 PM Martin Toth >>> wrote: >>> >>>> Hi, >>>> >>>> there is no healing operation, not peer disconnects, no readonly >>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>>> its SSD with 10G, performance is good. >>>> >>>> > you'd have it's log on qemu's standard output, >>>> >>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. 
I am looking >>>> for problem for more than month, tried everything. Can?t find anything. Any >>>> more clues or leads? >>>> >>>> BR, >>>> Martin >>>> >>>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>>> > >>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>>> >> Hi all, >>>> > >>>> > Hi >>>> > >>>> >> >>>> >> I am running replica 3 on SSDs with 10G networking, everything works >>>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>>> blocked for more than 120 seconds?. >>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>>> am unable to SSH and also login with console, its stuck probably on some >>>> disk operation. No error/warning logs or messages are store in VMs logs. >>>> >> >>>> > >>>> > As far as I know this should be unrelated, I get this during heals >>>> > without any freezes, it just means the storage is slow I think. >>>> > >>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks >>>> on replica volume. Can someone advice how to debug this problem or what >>>> can cause these issues? >>>> >> It?s really annoying, I?ve tried to google everything but nothing >>>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>>> drivers, but its not related. >>>> >> >>>> > >>>> > Any chance your gluster goes readonly ? Have you checked your gluster >>>> > logs to see if maybe they lose each other some times ? >>>> > /var/log/glusterfs >>>> > >>>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>>> > that might contain the actual error at the time of the freez. >>>> > _______________________________________________ >>>> > Gluster-users mailing list >>>> > Gluster-users at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Tue May 21 13:27:59 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Tue, 21 May 2019 18:57:59 +0530 Subject: [Gluster-devel] tests are timing out in master branch In-Reply-To: References: Message-ID: Looks like after reverting a patch on RPC layer reconnection logic ( https://review.gluster.org/22750) things are back to normal. For those who submitted a patch in last 1 week, please resubmit. (which should take care of rebasing on top of this patch). This event proves that there are very delicate races in our RPC layer, which can trigger random failures. While it was discussed in brief earlier. We need to debug this further, and come up with possible next actions. Volunteers welcome. I recommend to use https://github.com/gluster/glusterfs/issues/391 to capture our observations, and continue on github from here. -Amar On Wed, May 15, 2019 at 11:46 AM Sankarshan Mukhopadhyay < sankarshan.mukhopadhyay at gmail.com> wrote: > On Wed, May 15, 2019 at 11:24 AM Atin Mukherjee > wrote: > > > > There're random tests which are timing out after 200 secs. My belief is > this is a major regression introduced by some commit recently or the > builders have become extremely slow which I highly doubt. I'd request that > we first figure out the cause, get master back to it's proper health and > then get back to the review/merge queue. 
> > > > For such dire situations, we also need to consider a proposal to back > out patches in order to keep the master healthy. The outcome we seek > is a healthy master - the isolation of the cause allows us to not > repeat the same offense. > > > Sanju has already started looking into > /tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand > what test is specifically hanging and consuming more time. > _______________________________________________ > Atin Mukherjee , Sankarshan Mukhopadhyay < > sankarshan.mukhopadhyay at gmail.com> > Community Meeting Calendar: > > APAC Schedule -https://review.gluster.org/22750 > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From pkalever at redhat.com Tue May 21 14:39:22 2019 From: pkalever at redhat.com (Prasanna Kalever) Date: Tue, 21 May 2019 20:09:22 +0530 Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive! In-Reply-To: References: Message-ID: On Mon, May 20, 2019 at 9:05 PM Vlad Kopylov wrote: > > Thank you Prasanna. > > Do we have architecture somewhere? Vlad, Although the complete set of details might be missing at one place right now, some pointers to start are available at, https://github.com/gluster/gluster-block#gluster-block and https://pkalever.wordpress.com/2019/05/06/starting-with-gluster-block, hopefully that should give some clarity about the project. Also checkout the man pages. > Dies it bypass Fuse and go directly gfapi ? yes, we don't use Fuse access with gluster-block. The management as-well-as IO happens over gfapi. Please go through the docs pointed above, if you have any specific queries, feel free to ask them here or on github. Best Regards, -- Prasanna > > v > > On Mon, May 20, 2019, 8:36 AM Prasanna Kalever wrote: >> >> Hey Vlad, >> >> Thanks for trying gluster-block. Appreciate your feedback. >> >> Here is the patch which should fix the issue you have noticed: >> https://github.com/gluster/gluster-block/pull/233 >> >> Thanks! >> -- >> Prasanna >> >> On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote: >> > >> > >> > straight from >> > >> > ./autogen.sh && ./configure && make -j install >> > >> > >> > CentOS Linux release 7.6.1810 (Core) >> > >> > >> > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory >> > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr. >> > May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :] >> > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs >> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE >> > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state. >> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed. >> > >> > >> > >> > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote: >> >> >> >> Hello Gluster folks, >> >> >> >> Gluster-block team is happy to announce the v0.4 release [1]. 
>> >> >> >> This is the new stable version of gluster-block, lots of new and >> >> exciting features and interesting bug fixes are made available as part >> >> of this release. >> >> Please find the big list of release highlights and notable fixes at [2]. >> >> >> >> Details about installation can be found in the easy install guide at >> >> [3]. Find the details about prerequisites and setup guide at [4]. >> >> If you are a new user, checkout the demo video attached in the README >> >> doc [5], which will be a good source of intro to the project. >> >> There are good examples about how to use gluster-block both in the man >> >> pages [6] and test file [7] (also in the README). >> >> >> >> gluster-block is part of fedora package collection, an updated package >> >> with release version v0.4 will be soon made available. And the >> >> community provided packages will be soon made available at [8]. >> >> >> >> Please spend a minute to report any kind of issue that comes to your >> >> notice with this handy link [9]. >> >> We look forward to your feedback, which will help gluster-block get better! >> >> >> >> We would like to thank all our users, contributors for bug filing and >> >> fixes, also the whole team who involved in the huge effort with >> >> pre-release testing. >> >> >> >> >> >> [1] https://github.com/gluster/gluster-block >> >> [2] https://github.com/gluster/gluster-block/releases >> >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL >> >> [4] https://github.com/gluster/gluster-block#usage >> >> [5] https://github.com/gluster/gluster-block/blob/master/README.md >> >> [6] https://github.com/gluster/gluster-block/tree/master/docs >> >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t >> >> [8] https://download.gluster.org/pub/gluster/gluster-block/ >> >> [9] https://github.com/gluster/gluster-block/issues/new >> >> >> >> Cheers, >> >> Team Gluster-Block! >> >> _______________________________________________ >> >> Gluster-users mailing list >> >> Gluster-users at gluster.org >> >> https://lists.gluster.org/mailman/listinfo/gluster-users From kdhananj at redhat.com Wed May 22 13:22:52 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Wed, 22 May 2019 18:52:52 +0530 Subject: [Gluster-devel] More intelligent file distribution across subvols of DHT when file size is known Message-ID: Hi, I've proposed a solution to the problem of space running out in some children of DHT even when its other children have free space available, here - https://github.com/gluster/glusterfs/issues/675. The proposal aims to solve a very specific instance of this generic class of problems where fortunately the size of the file that is getting created is known beforehand. Requesting feedback on the proposal or even alternate solutions, if you have any. -Krutika -------------- next part -------------- An HTML attachment was scrubbed... URL: From rabhat at redhat.com Wed May 22 14:18:03 2019 From: rabhat at redhat.com (FNU Raghavendra Manjunath) Date: Wed, 22 May 2019 10:18:03 -0400 Subject: [Gluster-devel] ./tests/basic/uss.t is timing out in release-6 branch In-Reply-To: References: Message-ID: More analysis: It looks like in the 1st iteration, the testcase is stuck at the test (TEST ! stat $M0/.history/snap6/aaa) from line 385 of uss.t Before, it was the last test to be executed from uss.t. So the assumption was that after the completion of that test (i.e. test from line 385), cleanup function was either getting blocked or taking more time to do cleanups. 
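For context, the sequence of steps around that test looks roughly like the
following reconstruction (illustrative, not a verbatim copy of uss.t; the
script configures the snapshot directory as .history earlier on):

    TEST $CLI snapshot delete snap6
    TEST rm -f $M0/aaa
    TEST $CLI snapshot create snap6 $V0 no-timestamp
    TEST $CLI snapshot activate snap6
    # "aaa" was removed before this snap6 was taken, so the lookup must fail:
    TEST ! stat $M0/.history/snap6/aaa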
Hence the patch [1] was submitted to reduce the amount of work done by the
cleanup function. The patch ensured that the snapshots, the volume etc.
created in the test are deleted before the cleanup function is executed.

But even with that, we observed uss.t failing sometimes (mainly with
brick-mux regressions). To get more information regarding the failure,
another patch [2] was sent. From that patch, some more information was
received.

1) Every time uss.t times out, the script (uss.t) is stuck executing the
particular test from line 385 (TEST ! stat $M0/.history/snap6/aaa)
   - This test's purpose is to ensure that looking for a file that does
not exist in a snapshot fails.

2) Adding TRACE logs via [2] indicates that:
   - the stat request sent by the test reaches the snapshot daemon and
later the gfapi client instance that the snapshot daemon spawns to
communicate with the snapshot volume.
   - The stat request is served by the md-cache xlator in the gfapi client
instance (and is hence successful).

"[2019-05-16 18:31:18.607521]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
392 ! stat /mnt/glusterfs/0/.history/snap6/aaa ++++++++++
[2019-05-16 18:31:18.617104] T [MSGID: 0] [syncop.c:2424:syncop_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from gfapi to
meta-autoload
[2019-05-16 18:31:18.617119] T [MSGID: 0] [defaults.c:2841:default_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from meta-autoload
to 0e69605de2974f1b887deee5b3f63b52
[2019-05-16 18:31:18.617130] T [MSGID: 0] [io-stats.c:2709:io_stats_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from
0e69605de2974f1b887deee5b3f63b52 to 0e69605de2974f1b887deee5b3f63b52-io-threads
[2019-05-16 18:31:18.617142] D [MSGID: 0] [io-threads.c:376:iot_schedule]
0-0e69605de2974f1b887deee5b3f63b52-io-threads: STAT scheduled as fast
priority fop
[2019-05-16 18:31:18.617162] T [MSGID: 0]
[defaults.c:2068:default_stat_resume] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52-io-threads
to 0e69605de2974f1b887deee5b3f63b52-md-cache
[2019-05-16 18:31:18.617176] T [MSGID: 0] [md-cache.c:1359:mdc_stat]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-md-cache returned 0 =========> SUCCESSFUL HERE
[2019-05-16 18:31:18.617186] T [MSGID: 0] [defaults.c:1406:default_stat_cbk]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-io-threads returned 0
[2019-05-16 18:31:18.617195] T [MSGID: 0]
[io-stats.c:2059:io_stats_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52 returned 0
"

   - The stat response does not reach the snapshot daemon. So the snapshot
daemon is not able to send any response back to the gluster client which
initiated this stat request. This leaves the client waiting for a
response, resulting in a timeout as per the regression test infra (which
sets a 200 second timeout).

Suspects:
==========

* First of all, the stat request from line 385 (TEST ! stat
$M0/.history/snap6/aaa) should not be successful. Because the test deletes
the snapshot "snap6", removes the file "aaa" from the mount point, again
takes the snapshot "snap6" and performs the stat operation on the deleted
file "aaa". So the stat should fail.
* The patch [2] has been sent to collect more information about the failure (with more logs added to snapview-server and also log level being changed to TRACE in the .t file) [1] https://review.gluster.org/#/c/glusterfs/+/22649/ [2] https://review.gluster.org/#/c/glusterfs/+/22728/ Regards, Raghavendra On Wed, May 1, 2019 at 11:11 AM Sanju Rakonde wrote: > Thank you Raghavendra. > > On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> To make things relatively easy for the cleanup () function in the test >> framework, I think it would be better to ensure that uss.t itself deletes >> snapshots and the volume once the tests are done. Patch [1] has been >> submitted for review. >> >> [1] https://review.gluster.org/#/c/glusterfs/+/22649/ >> >> Regards, >> Raghavendra >> >> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath < >> rabhat at redhat.com> wrote: >> >>> >>> The failure looks similar to the issue I had mentioned in [1] >>> >>> In short for some reason the cleanup (the cleanup function that we call >>> in our .t files) seems to be taking more time and also not cleaning up >>> properly. This leads to problems for the 2nd iteration (where basic things >>> such as volume creation or volume start itself fails due to ENODATA or >>> ENOENT errors). >>> >>> The 2nd iteration of the uss.t ran had the following errors. >>> >>> "[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t: >>> TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable >>> false ++++++++++ >>> [2019-04-29 09:08:15.390550] : volume set patchy nfs.disable false : >>> SUCCESS >>> [2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: >>> 42 gluster --mode=script --wignore volume start patchy ++++++++++ >>> [2019-04-29 09:08:15.468780] : volume start patchy : FAILED : Failed to >>> get extended attribute trusted.glusterfs.volume-id for brick dir >>> /d/backends/3/patchy_snap_mnt. Reason : No data available >>> " >>> >>> These are the initial steps to create and start volume. Why >>> trusted.glusterfs.volume-id extended attribute is absent is not sure. The >>> analysis in [1] had errors of ENOENT (i.e. export directory itself was >>> absent). >>> I suspect this to be because of some issue with the cleanup mechanism at >>> the end of the tests. >>> >>> [1] >>> https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html >>> >>> On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde >>> wrote: >>> >>>> Hi Raghavendra, >>>> >>>> ./tests/basic/uss.t is timing out in release-6 branch consistently. >>>> One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/. >>>> Can you please look into this? >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> >>> > > -- > Thanks, > Sanju > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Thu May 23 10:34:37 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Thu, 23 May 2019 16:04:37 +0530 Subject: [Gluster-devel] ./tests/basic/gfapi/gfapi-ssl-test.t is failing too often in regression Message-ID: I see a lot of patches are failing regressions due to the .t mentioned in the subject line. I've filed a bug[1] for the same. https://bugzilla.redhat.com/show_bug.cgi?id=1713284 -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From srakonde at redhat.com Thu May 23 11:04:28 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 23 May 2019 16:34:28 +0530
Subject: [Gluster-devel] ./tests/basic/gfapi/gfapi-ssl-test.t is failing too often in regression
In-Reply-To: References: Message-ID:

I apologize for the wrong mail. This .t failed only for one patch and I don't think it is spurious. Closing this bug as not a bug.

On Thu, May 23, 2019 at 4:04 PM Sanju Rakonde wrote:
> I see a lot of patches are failing regressions due to the .t mentioned in
> the subject line. I've filed a bug [1] for the same.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1713284
> --
> Thanks,
> Sanju

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From pkarampu at redhat.com Fri May 24 06:26:47 2019
From: pkarampu at redhat.com (Pranith Kumar Karampuri)
Date: Fri, 24 May 2019 11:56:47 +0530
Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames
Message-ID:

Hi,
At the moment a new stack doesn't populate frame->root->unique in all cases. This makes it difficult to debug hung frames by examining successive state dumps. Fuse and server xlators populate it whenever they can, but other xlators won't be able to assign one when they need to create a new frame/stack. Is it okay to change the create_frame() code to always populate it with an increasing number for this purpose?
I checked that both fuse and server xlators use it only in gf_log(), so there doesn't seem to be any other link between frame->root->unique and the functionality of the fuse and server xlators.
Do let me know if I missed anything before sending this change.

--
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rabhat at redhat.com Fri May 24 17:27:16 2019
From: rabhat at redhat.com (FNU Raghavendra Manjunath)
Date: Fri, 24 May 2019 13:27:16 -0400
Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames
In-Reply-To: References: Message-ID:

The idea looks OK. One of the things that probably needs to be considered (more of an implementation detail though) is how to generate frame->root->unique.

Because, for fuse, frame->root->unique is obtained from finh->unique, which IIUC comes from the incoming fop from the kernel itself.
For protocol/server, IIUC frame->root->unique comes from req->xid of the rpc request, which itself is obtained from transport->xid of the rpc_transport_t structure (and from my understanding, the transport->xid is just incremented every time a new rpc request is created).

Overall the suggestion looks fine though.

Regards,
Raghavendra

On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri wrote:
> Hi,
> At the moment a new stack doesn't populate frame->root->unique in
> all cases. This makes it difficult to debug hung frames by examining
> successive state dumps. Fuse and server xlators populate it whenever they
> can, but other xlators won't be able to assign one when they need to create
> a new frame/stack. Is it okay to change the create_frame() code to always
> populate it with an increasing number for this purpose?
> I checked that both fuse and server xlators use it only in gf_log(), so
> there doesn't seem to be any other link between frame->root->unique and
> the functionality of the fuse and server xlators.
> Do let me know if I missed anything before sending this change.
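For reference, a minimal sketch of the proposed change in plain C (illustrative only, not the actual glusterfs sources: the counter name is hypothetical, and real code would use glusterfs's own atomic helpers rather than C11 atomics):

#include <stdatomic.h>
#include <stdint.h>

/* hypothetical global counter, bumped once per created stack,
 * much like transport->xid is bumped once per rpc request */
static _Atomic uint64_t stack_unique_counter;

struct call_root { uint64_t unique; };
struct call_frame { struct call_root *root; };

static void
assign_unique(struct call_frame *frame)
{
    /* fetch-and-add guarantees distinct values even when many threads
     * create frames concurrently; relaxed ordering is enough because
     * the value is only an identifier printed in logs/statedumps */
    frame->root->unique = atomic_fetch_add_explicit(
        &stack_unique_counter, 1, memory_order_relaxed);
}

With something like this, two successive statedumps can be compared directly: a frame that keeps the same unique value across dumps is the same (possibly hung) frame, not a new one that merely reuses the address.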
> > -- > Pranith > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pkarampu at redhat.com Sat May 25 04:52:13 2019 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Sat, 25 May 2019 10:22:13 +0530 Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames In-Reply-To: References: Message-ID: On Fri, May 24, 2019 at 10:57 PM FNU Raghavendra Manjunath < rabhat at redhat.com> wrote: > > The idea looks OK. One of the things that probably need to be considered > (more of an implementation detail though) is how to generate > frame->root->unique. > > Because, for fuse, frame->root->unique is obtained by finh->unique which > IIUC is got from the incoming fop from kernel itself. > For protocol/server IIUC frame->root->unique is got from req->xit of the > rpc request, which itself is obtained from transport->xid of the > rpc_transport_t structure (and from my understanding, the transport->xid is > just incremented by everytime a > new rpc request is created). > > Overall the suggestion looks fine though. > I am planning to do the same thing transport->xid does. I will send out the patch > Regards, > Raghavendra > > > On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri < > pkarampu at redhat.com> wrote: > >> Hi, >> At the moment new stack doesn't populate frame->root->unique in >> all cases. This makes it difficult to debug hung frames by examining >> successive state dumps. Fuse and server xlator populate it whenever they >> can, but other xlators won't be able to assign one when they need to create >> a new frame/stack. Is it okay to change create_frame() code to always >> populate it with an increasing number for this purpose? >> I checked both fuse and server xlator use it only in gf_log() so it >> doesn't seem like there is any other link between frame->root->unique and >> the functionality of fuse, server xlators. >> Do let me know if I missed anything before sending this change. >> >> -- >> Pranith >> _______________________________________________ >> >> Community Meeting Calendar: >> >> APAC Schedule - >> Every 2nd and 4th Tuesday at 11:30 AM IST >> Bridge: https://bluejeans.com/836554017 >> >> NA/EMEA Schedule - >> Every 1st and 3rd Tuesday at 01:00 PM EDT >> Bridge: https://bluejeans.com/486278655 >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -- Pranith -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon May 27 01:45:02 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 27 May 2019 01:45:02 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1771699257.77.1558921502881.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1713391 / project-infrastructure: Access to wordpress instance of gluster.org required for release management [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... 
Name: build.log Type: application/octet-stream Size: 516 bytes Desc: not available URL: From pkarampu at redhat.com Mon May 27 12:04:57 2019 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Mon, 27 May 2019 17:34:57 +0530 Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames In-Reply-To: References: Message-ID: On Sat, May 25, 2019 at 10:22 AM Pranith Kumar Karampuri < pkarampu at redhat.com> wrote: > > > On Fri, May 24, 2019 at 10:57 PM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> The idea looks OK. One of the things that probably need to be considered >> (more of an implementation detail though) is how to generate >> frame->root->unique. >> >> Because, for fuse, frame->root->unique is obtained by finh->unique which >> IIUC is got from the incoming fop from kernel itself. >> For protocol/server IIUC frame->root->unique is got from req->xit of the >> rpc request, which itself is obtained from transport->xid of the >> rpc_transport_t structure (and from my understanding, the transport->xid is >> just incremented by everytime a >> new rpc request is created). >> >> Overall the suggestion looks fine though. >> > > I am planning to do the same thing transport->xid does. I will send out > the patch > https://review.gluster.org/c/glusterfs/+/22773 > >> Regards, >> Raghavendra >> >> >> On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri < >> pkarampu at redhat.com> wrote: >> >>> Hi, >>> At the moment new stack doesn't populate frame->root->unique in >>> all cases. This makes it difficult to debug hung frames by examining >>> successive state dumps. Fuse and server xlator populate it whenever they >>> can, but other xlators won't be able to assign one when they need to create >>> a new frame/stack. Is it okay to change create_frame() code to always >>> populate it with an increasing number for this purpose? >>> I checked both fuse and server xlator use it only in gf_log() so it >>> doesn't seem like there is any other link between frame->root->unique and >>> the functionality of fuse, server xlators. >>> Do let me know if I missed anything before sending this change. >>> >>> -- >>> Pranith >>> _______________________________________________ >>> >>> Community Meeting Calendar: >>> >>> APAC Schedule - >>> Every 2nd and 4th Tuesday at 11:30 AM IST >>> Bridge: https://bluejeans.com/836554017 >>> >>> NA/EMEA Schedule - >>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>> Bridge: https://bluejeans.com/486278655 >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> > > -- > Pranith > -- Pranith -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 30 06:03:18 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 30 May 2019 08:03:18 +0200 Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? Message-ID: Hi all, a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). 
This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon).

Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default.

I think we should enable it by default. Does anyone see any possible issue if we do that ?

Regards,

Xavi

[1] https://review.gluster.org/c/glusterfs/+/14736
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From atumball at redhat.com Thu May 30 06:34:43 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Thu, 30 May 2019 12:04:43 +0530
Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
In-Reply-To: References: Message-ID:

On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez wrote:
> Hi all,
>
> a patch [1] was added some time ago to send upcall notifications from the
> locks xlator to the current owner of a granted lock when another client
> tries to acquire the same lock (inodelk or entrylk). This makes it possible
> to use eager-locking on the client side, which improves performance
> significantly, while also keeping good performance when multiple clients
> are accessing the same files (the current owner of the lock receives the
> notification and releases it as soon as possible, allowing the other client
> to acquire it and proceed very soon).
>
> Currently both AFR and EC are ready to handle these contention
> notifications and both use eager-locking. However, the upcall contention
> notification is disabled by default.
>
> I think we should enable it by default. Does anyone see any possible
> issue if we do that ?
>
If it helps performance, we should ideally do it.

But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for the branch-out, and make it default for the next version? (so that it gets time for testing). Considering it is about consistency, I would like to hear everyone's opinion here.

Regards,
Amar

> Regards,
>
> Xavi
>
> [1] https://review.gluster.org/c/glusterfs/+/14736
> _______________________________________________
>
-- Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From aspandey at redhat.com Thu May 30 07:03:43 2019
From: aspandey at redhat.com (Ashish Pandey)
Date: Thu, 30 May 2019 03:03:43 -0400 (EDT)
Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
In-Reply-To: References: Message-ID: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com>

I am only concerned about in-service upgrade.
If a feature/option is not present in V1, then I would prefer not to enable it by default on V2.
We have seen some problems with other-eager-lock when we changed it to be enabled by default.

--- Ashish

----- Original Message -----
From: "Amar Tumballi Suryanarayan"
To: "Xavi Hernandez"
Cc: "gluster-devel"
Sent: Thursday, May 30, 2019 12:04:43 PM
Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez < xhernandez at redhat.com > wrote:

Hi all,

a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon).

Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default.

I think we should enable it by default. Does anyone see any possible issue if we do that ?

If it helps performance, we should ideally do it.

But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for the branch-out, and make it default for the next version? (so that it gets time for testing). Considering it is about consistency, I would like to hear everyone's opinion here.

Regards,
Amar
Regards, Xavi [1] https://review.gluster.org/c/glusterfs/+/14736 _______________________________________________
-- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 30 08:33:54 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 30 May 2019 10:33:54 +0200 Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? In-Reply-To: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> References: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> Message-ID: On Thu, May 30, 2019 at 9:03 AM Ashish Pandey wrote: > > > I am only concerned about in-service upgrade. > If a feature/option is not present in V1, then I would prefer not to > enable it by default on V2. > The problem is that without enabling it, (other-)eager-lock will cause performance issues in some cases. It doesn't seem good to keep an option disabled if enabling it solves these problems. > We have seen some problem in other-eager-lock when we changed it to enable > by default. > Which problems ? I think the only issue with other-eager-lock has been precisely that locks-notify-contention was disabled and a bug that needed to be solved anyway. The difference will be that upgraded bricks will start sending upcall notifications. If clients are too old, these will simply be ignored. So I don't see any problem right now. Am I missing something ? > --- > Ashish > > ------------------------------ > *From: *"Amar Tumballi Suryanarayan" > *To: *"Xavi Hernandez" > *Cc: *"gluster-devel" > *Sent: *Thursday, May 30, 2019 12:04:43 PM > *Subject: *Re: [Gluster-devel] Should we enable > features.locks-notify.contention by default ? > > > > On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez > wrote: > >> Hi all, >> >> a patch [1] was added some time ago to send upcall notifications from the >> locks xlator to the current owner of a granted lock when another client >> tries to acquire the same lock (inodelk or entrylk). This makes it possible >> to use eager-locking on the client side, which improves performance >> significantly, while also keeping good performance when multiple clients >> are accessing the same files (the current owner of the lock receives the >> notification and releases it as soon as possible, allowing the other client >> to acquire it and proceed very soon). >> >> Currently both AFR and EC are ready to handle these contention >> notifications and both use eager-locking. However the upcall contention >> notification is disabled by default. >> >> I think we should enabled it by default. Does anyone see any possible >> issue if we do that ? >> >> > If it helps performance, we should ideally do it. > > But, considering we are days away from glusterfs-7.0 branching, should we > do it now, or wait for branch out, and make it default for next version? > (so that it gets time for testing). Considering it is about consistency I > would like to hear everyone's opinion here. 
> > Regards, > Amar > > > > >> >> Regards, >> >> Xavi >> >> [1] https://review.gluster.org/c/glusterfs/+/14736 >> _______________________________________________ >> >> > -- > Amar Tumballi (amarts) > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Thu May 30 08:46:14 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 30 May 2019 14:16:14 +0530 Subject: [Gluster-devel] Release 4.1.9 : Expected tagging on June 4th Message-ID: Hi, Expected tagging date for release-4.1.9 is on June, 4th, 2019. Please ensure required patches are back-ported and also are passing regressions and are appropriately reviewed for easy merging and tagging on the date. Note: This will be the last release in the 4.1 series as release 7 is around the corner. -- Regards, Hari Gowtham. From hgowtham at redhat.com Thu May 30 08:49:31 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 30 May 2019 14:19:31 +0530 Subject: [Gluster-devel] Release 5.7: Expected tagging on June 6th Message-ID: Hi, Expected tagging date for release-5.7 is on June, 6th, 2019. Please ensure required patches are back-ported and also are passing regressions and are appropriately reviewed for easy merging and tagging on the date. -- Regards, Hari Gowtham. From aspandey at redhat.com Thu May 30 09:23:36 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Thu, 30 May 2019 05:23:36 -0400 (EDT) Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? In-Reply-To: References: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> Message-ID: <1863496869.20640433.1559208216880.JavaMail.zimbra@redhat.com> ----- Original Message ----- From: "Xavi Hernandez" To: "Ashish Pandey" Cc: "Amar Tumballi Suryanarayan" , "gluster-devel" Sent: Thursday, May 30, 2019 2:03:54 PM Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ? On Thu, May 30, 2019 at 9:03 AM Ashish Pandey < aspandey at redhat.com > wrote: I am only concerned about in-service upgrade. If a feature/option is not present in V1, then I would prefer not to enable it by default on V2. The problem is that without enabling it, (other-)eager-lock will cause performance issues in some cases. It doesn't seem good to keep an option disabled if enabling it solves these problems.
We have seen some problems with other-eager-lock when we changed it to be enabled by default.
Which problems ? I think the only issue with other-eager-lock has been precisely that locks-notify-contention was disabled, and a bug that needed to be solved anyway.

I was talking about the issue when we have other-eager-lock disabled and then try to do an in-service upgrade to a version where this option is ON by default. Although we don't have a root cause for that, I was wondering if a similar issue could happen in this case also.

The difference will be that upgraded bricks will start sending upcall notifications. If clients are too old, these will simply be ignored. So I don't see any problem right now. Am I missing something ?
--- Ashish From: "Amar Tumballi Suryanarayan" < atumball at redhat.com > To: "Xavi Hernandez" < xhernandez at redhat.com > Cc: "gluster-devel" < gluster-devel at gluster.org > Sent: Thursday, May 30, 2019 12:04:43 PM Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ? On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez < xhernandez at redhat.com > wrote:
Hi all, a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon). Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default. I think we should enable it by default. Does anyone see any possible issue if we do that ?
If it helps performance, we should ideally do it. But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for branch out, and make it default for next version? (so that it gets time for testing). Considering it is about consistency I would like to hear everyone's opinion here. Regards, Amar
Regards, Xavi [1] https://review.gluster.org/c/glusterfs/+/14736 _______________________________________________
-- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel
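To make the contention-notification/eager-locking interplay concrete, here is a rough sketch in plain C (all names here are hypothetical and heavily simplified; the real logic lives in the locks xlator and in the AFR/EC eager-lock code, and would of course be protected by the translator's own mutexes):

#include <stdbool.h>

struct eager_lock {
    bool held;       /* lock currently granted to this client */
    bool in_use;     /* a fop is running under the lock right now */
    bool contended;  /* a contention notification has arrived */
};

/* called when a fop finishes: keep the lock for the next fop unless
 * another client is known to be waiting for it */
static void eager_lock_fop_done(struct eager_lock *l)
{
    l->in_use = false;
    if (l->contended) {
        l->held = false;      /* unlock on the bricks right away */
        l->contended = false;
    }
    /* otherwise keep the lock: the next fop skips a lock round-trip */
}

/* called when an upcall contention notification arrives from a brick */
static void eager_lock_contention(struct eager_lock *l)
{
    if (l->in_use)
        l->contended = true;  /* release as soon as the current fop ends */
    else if (l->held)
        l->held = false;      /* idle: release immediately */
}

Roughly speaking, without the notification the owner only drops an idle eager lock after a timeout, which is why other clients accessing the same file can see long waits; with the notification, the lock is handed over almost immediately while the single-writer fast path stays intact.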
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hunter86_bg at yahoo.com Sat May 4 06:34:42 2019
From: hunter86_bg at yahoo.com (Strahil)
Date: Sat, 04 May 2019 06:34:42 -0000
Subject: [Gluster-devel] [Gluster-users] Proposing to bring the previous ganesha HA cluster solution back to gluster code as a gluster-7 feature
Message-ID:

Hi Jiffin,

No vendor will support your corosync/pacemaker stack if you do not have proper fencing. As Gluster is already a cluster of its own, it makes sense to control everything from there.

Best Regards,
Strahil Nikolov

On May 3, 2019 09:08, Jiffin Tony Thottan wrote:
>
> On 30/04/19 6:59 PM, Strahil Nikolov wrote:
> > Hi,
> >
> > I'm posting this again as it got bounced.
> > Keep in mind that corosync/pacemaker is hard for new admins/users to set up properly.
> >
> > I'm still trying to remediate the effects of poor configuration at work.
> > Also, storhaug is nice for hyperconverged setups where the host is not only hosting bricks, but other workloads.
> > Corosync/pacemaker require proper fencing to be set up, and most of the stonith resources 'shoot the other node in the head'.
> > I would be happy to see an easy-to-deploy solution (let's say 'cluster.enable-ha-ganesha true') with gluster bringing up the floating IPs and taking care of the NFS locks, so no disruption will be felt by the clients.
>
> It does take care of those, but certain prerequisites need to be followed; fencing won't be configured for this setup. We may think about it in the future.
>
> --
> Jiffin
>
> > Still, this will be a lot of work to achieve.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Apr 30, 2019 15:19, Jim Kinney wrote:
> >> +1!
> >> I'm using nfs-ganesha in my next upgrade so my client systems can use NFS instead of fuse mounts. Having an integrated, designed-in process to coordinate multiple nodes into an HA cluster will be very welcome.
> >>
> >> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
> >>> Hi all,
> >>>
> >>> Some of you folks may be familiar with the HA solution provided for nfs-ganesha by gluster using pacemaker and corosync.
> >>>
> >>> That feature was removed in glusterfs 3.10 in favour of the common HA project "Storhaug". Even Storhaug has not progressed
> >>> much in the last two years and its development is currently halted, hence the plan to restore the old HA ganesha solution back
> >>> to the gluster code repository with some improvements, targeting the next gluster release 7.
> >>>
> >>> I have opened up an issue [1] with details and posted an initial set of patches [2]
> >>>
> >>> Please share your thoughts on the same
> >>>
> >>> Regards,
> >>>
> >>> Jiffin
> >>>
> >>> [1] https://github.com/gluster/glusterfs/issues/663
> >>>
> >>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
> >>>
> >> --
> >> Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.

From rkothiya at redhat.com Tue May 7 10:56:24 2019
From: rkothiya at redhat.com (Rinku Kothiya)
Date: Tue, 07 May 2019 10:56:24 -0000
Subject: [Gluster-devel] [Gluster-Maintainers] Release 7: Kick off!
Message-ID:

It is time to start some activities for release-7.

## Scope

It is time to collect and determine scope for the release, so as usual, please send in features/enhancements that you are working towards reaching maturity for this release to the devel list, and mark/open the github issue with the required milestone [1].

## Schedule

Currently, working backwards on the schedule, here's what we have:

- Announcement: Week of Aug 4th, 2019
- GA tagging: Aug-02-2019
- RC1: On demand before GA
- RC0: July-03-2019
- Late features cut-off: Week of June-24th, 2019
- Branching (feature cutoff date): June-17-2019 (~45 days prior to GA)
- Feature/scope proposal for the release (end date): May-22-2019

Regards
Rinku Kothiya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rkothiya at redhat.com Fri May 10 10:16:01 2019
From: rkothiya at redhat.com (rkothiya at redhat.com)
Date: Fri, 10 May 2019 10:16:01 -0000
Subject: [Gluster-devel] gluster-devel, rkothiya at redhat.com recommends that you use Google Calendar
Message-ID: <00000000000067a39d058885daf4 at google.com>

I've been using Google Calendar to organize my calendar, find interesting events, and share my schedule with friends and family members. I thought you might like to use Google Calendar, too.

rkothiya at redhat.com recommends that you use Google Calendar.
To accept this invitation and register for an account, please visit: [https://www.google.com/calendar/render?cid=cmVkaGF0LmNvbV9yM2hvb3RjcjZ0MXY0YWc2MzFvY2dzZXNoZ0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t&invEmailKey=gluster-devel at gluster.org:885485af2a01c0e13938156dbb3531c58af68e52]

Google Calendar helps you keep track of everything going on in your life and those of the important people around you, and also helps you discover interesting things to do with your time.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From snowmailer at gmail.com Mon May 13 06:47:56 2019
From: snowmailer at gmail.com (Martin Toth)
Date: Mon, 13 May 2019 06:47:56 -0000
Subject: [Gluster-devel] VMs blocked for more than 120 seconds
Message-ID:

Hi all,

I am running replica 3 on SSDs with 10G networking. Everything works OK, but VMs stored in the Gluster volume occasionally freeze with "Task XY blocked for more than 120 seconds".
The only solution is to power off (hard) the VM and then boot it up again. I am unable to SSH in and also cannot log in via console; it's stuck, probably on some disk operation. No error/warning logs or messages are stored in the VMs' logs.

KVM/Libvirt (qemu) uses libgfapi and a fuse mount to access VM disks on the replica volume. Can someone advise how to debug this problem or what can cause these issues?
It's really annoying; I've tried to google everything but nothing came up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but it's not related.

BR,
Martin

These are the volume settings:

Type: Replicate
Volume ID: b021bbb6-fa99-4cc7-88f6-49152a22cb9e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node1:/imagestore/brick1
Brick2: node2:/imagestore/brick1
Brick3: node3:/imagestore/brick1
Options Reconfigured:
performance.client-io-threads: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.min-free-disk: 10%
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
cluster.data-self-heal-algorithm: full
network.remote-dio: enable
network.ping-timeout: 30
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
storage.owner-gid: 9869
storage.owner-uid: 9869
server.allow-insecure: on
nfs.disable: on
performance.readdir-ahead: on
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2019-05-13 at 08.32.24.png
Type: image/png
Size: 144426 bytes
Desc: not available
URL: From snowmailer at gmail.com Mon May 13 07:03:45 2019
From: snowmailer at gmail.com (Martin Toth)
Date: Mon, 13 May 2019 07:03:45 -0000
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To: <20190513065548.GI25080@althea.ulrar.net>
References: <20190513065548.GI25080@althea.ulrar.net>
Message-ID:

Hi,

there is no healing operation, no peer disconnects, no readonly filesystem. Yes, storage is slow and unavailable for 120 seconds, but why? It's SSD with 10G, performance is good.

> you'd have it's log on qemu's standard output,

If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been looking into this problem for more than a month, tried everything. Can't find anything. Any more clues or leads?
BR, Martin > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >> Hi all, > > Hi > >> >> I am running replica 3 on SSDs with 10G networking, everything works OK but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked for more than 120 seconds?. >> Only solution is to poweroff (hard) VM and than boot it up again. I am unable to SSH and also login with console, its stuck probably on some disk operation. No error/warning logs or messages are store in VMs logs. >> > > As far as I know this should be unrelated, I get this during heals > without any freezes, it just means the storage is slow I think. > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on replica volume. Can someone advice how to debug this problem or what can cause these issues? >> It?s really annoying, I?ve tried to google everything but nothing came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but its not related. >> > > Any chance your gluster goes readonly ? Have you checked your gluster > logs to see if maybe they lose each other some times ? > /var/log/glusterfs > > For libgfapi accesses you'd have it's log on qemu's standard output, > that might contain the actual error at the time of the freez. > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From snowmailer at gmail.com Mon May 13 07:31:57 2019 From: snowmailer at gmail.com (Martin Toth) Date: Mon, 13 May 2019 07:31:57 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Cache in qemu is none. That should be correct. 
This is the full command:

/usr/bin/qemu-system-x86_64 -name one-312 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -drive file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=26,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on

I've highlighted the disks. The first is the VM context disk (fuse used), the second is SDA (the OS is installed here; libgfapi used), the third is SWAP (fuse used).

Krutika,
I will start profiling on the Gluster volumes and wait for the next VM to fail. Then I will attach/send the profiling info after some VM has failed. I suppose this is the correct profiling strategy.

Thanks,
BR!
Martin

> On 13 May 2019, at 09:21, Krutika Dhananjay wrote:
>
> Also, what's the caching policy that qemu is using on the affected vms?
> Is it cache=none? Or something else? You can get this information in the command line of the qemu-kvm process corresponding to your vm in the ps output.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
> What version of gluster are you using?
> Also, can you capture and share volume-profile output for a run where you manage to recreate this issue?
> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
> Let me know if you have any questions.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
> Hi,
>
> there is no healing operation, no peer disconnects, no readonly filesystem. Yes, storage is slow and unavailable for 120 seconds, but why? It's SSD with 10G, performance is good.
>
> > you'd have it's log on qemu's standard output,
>
> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been looking into this problem for more than a month, tried everything. Can't find anything. Any more clues or leads?
> > BR, > Martin > > > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: > >> Hi all, > > > > Hi > > > >> > >> I am running replica 3 on SSDs with 10G networking, everything works OK but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked for more than 120 seconds?. > >> Only solution is to poweroff (hard) VM and than boot it up again. I am unable to SSH and also login with console, its stuck probably on some disk operation. No error/warning logs or messages are store in VMs logs. > >> > > > > As far as I know this should be unrelated, I get this during heals > > without any freezes, it just means the storage is slow I think. > > > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on replica volume. Can someone advice how to debug this problem or what can cause these issues? > >> It?s really annoying, I?ve tried to google everything but nothing came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but its not related. > >> > > > > Any chance your gluster goes readonly ? Have you checked your gluster > > logs to see if maybe they lose each other some times ? > > /var/log/glusterfs > > > > For libgfapi accesses you'd have it's log on qemu's standard output, > > that might contain the actual error at the time of the freez. > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrevolodin at gmail.com Mon May 13 07:34:07 2019 From: andrevolodin at gmail.com (Andrey Volodin) Date: Mon, 13 May 2019 07:34:07 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: as per https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds. , the informational warning could be suppressed with : "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" Moreover, as per their website : "*This message is not an error*. It is an indication that a program has had to wait for a very long time, and what it was doing. " More reference: https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds Regards, Andrei On Mon, May 13, 2019 at 7:32 AM Martin Toth wrote: > Cache in qemu is none. That should be correct. 
This is full command : > > /usr/bin/qemu-system-x86_64 -name one-312 -S -machine > pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp > 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 > -no-user-config -nodefaults -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait > -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime > -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device > piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 > > -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 > -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 > -drive file=/var/lib/one//datastores/116/312/*disk.0* > ,format=raw,if=none,id=drive-virtio-disk1,cache=none > -device > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 > -drive file=gluster://localhost:24007/imagestore/ > *7b64d6757acc47a39503f68731f89b8e* > ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none > -device > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 > -drive file=/var/lib/one//datastores/116/312/*disk.1* > ,format=raw,if=none,id=drive-ide0-0-0,readonly=on > -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 > > -netdev tap,fd=26,id=hostnet0 > -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 > -chardev pty,id=charserial0 -device > isa-serial,chardev=charserial0,id=serial0 > -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait > -device > virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 > -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on > > I?ve highlighted disks. First is VM context disk - Fuse used, second is > SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. > > Krutika, > I will start profiling on Gluster Volumes and wait for next VM to fail. > Than I will attach/send profiling info after some VM will be failed. I > suppose this is correct profiling strategy. > > Thanks, > BR! > Martin > > On 13 May 2019, at 09:21, Krutika Dhananjay wrote: > > Also, what's the caching policy that qemu is using on the affected vms? > Is it cache=none? Or something else? You can get this information in the > command line of qemu-kvm process corresponding to your vm in the ps output. > > -Krutika > > On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay > wrote: > >> What version of gluster are you using? >> Also, can you capture and share volume-profile output for a run where you >> manage to recreate this issue? >> >> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >> Let me know if you have any questions. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:34 PM Martin Toth >> wrote: >> >>> Hi, >>> >>> there is no healing operation, not peer disconnects, no readonly >>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>> its SSD with 10G, performance is good. >>> >>> > you'd have it's log on qemu's standard output, >>> >>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>> for problem for more than month, tried everything. Can?t find anything. Any >>> more clues or leads? 
>>> >>> BR, >>> Martin >>> >>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>> > >>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>> >> Hi all, >>> > >>> > Hi >>> > >>> >> >>> >> I am running replica 3 on SSDs with 10G networking, everything works >>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>> blocked for more than 120 seconds?. >>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>> am unable to SSH and also login with console, its stuck probably on some >>> disk operation. No error/warning logs or messages are store in VMs logs. >>> >> >>> > >>> > As far as I know this should be unrelated, I get this during heals >>> > without any freezes, it just means the storage is slow I think. >>> > >>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >>> replica volume. Can someone advice how to debug this problem or what can >>> cause these issues? >>> >> It?s really annoying, I?ve tried to google everything but nothing >>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>> drivers, but its not related. >>> >> >>> > >>> > Any chance your gluster goes readonly ? Have you checked your gluster >>> > logs to see if maybe they lose each other some times ? >>> > /var/log/glusterfs >>> > >>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>> > that might contain the actual error at the time of the freez. >>> > _______________________________________________ >>> > Gluster-users mailing list >>> > Gluster-users at gluster.org >>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrevolodin at gmail.com Mon May 13 07:37:27 2019 From: andrevolodin at gmail.com (Andrey Volodin) Date: Mon, 13 May 2019 07:37:27 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: what is the context from dmesg ? On Mon, May 13, 2019 at 7:33 AM Andrey Volodin wrote: > as per > https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds. , > the informational warning could be suppressed with : > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > Moreover, as per their website : "*This message is not an error*. > It is an indication that a program has had to wait for a very long time, > and what it was doing. " > More reference: > https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds > > Regards, > Andrei > > On Mon, May 13, 2019 at 7:32 AM Martin Toth wrote: > >> Cache in qemu is none. That should be correct. 
This is full command : >> >> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine >> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp >> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 >> -no-user-config -nodefaults -chardev >> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait >> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime >> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device >> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 >> >> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 >> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 >> -drive file=/var/lib/one//datastores/116/312/*disk.0* >> ,format=raw,if=none,id=drive-virtio-disk1,cache=none >> -device >> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 >> -drive file=gluster://localhost:24007/imagestore/ >> *7b64d6757acc47a39503f68731f89b8e* >> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none >> -device >> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 >> -drive file=/var/lib/one//datastores/116/312/*disk.1* >> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on >> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 >> >> -netdev tap,fd=26,id=hostnet0 >> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 >> -chardev pty,id=charserial0 -device >> isa-serial,chardev=charserial0,id=serial0 >> -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait >> -device >> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 >> -vnc 0.0.0.0:312,password -device >> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on >> >> I?ve highlighted disks. First is VM context disk - Fuse used, second is >> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. >> >> Krutika, >> I will start profiling on Gluster Volumes and wait for next VM to fail. >> Than I will attach/send profiling info after some VM will be failed. I >> suppose this is correct profiling strategy. >> >> Thanks, >> BR! >> Martin >> >> On 13 May 2019, at 09:21, Krutika Dhananjay wrote: >> >> Also, what's the caching policy that qemu is using on the affected vms? >> Is it cache=none? Or something else? You can get this information in the >> command line of qemu-kvm process corresponding to your vm in the ps output. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay >> wrote: >> >>> What version of gluster are you using? >>> Also, can you capture and share volume-profile output for a run where >>> you manage to recreate this issue? >>> >>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >>> Let me know if you have any questions. >>> >>> -Krutika >>> >>> On Mon, May 13, 2019 at 12:34 PM Martin Toth >>> wrote: >>> >>>> Hi, >>>> >>>> there is no healing operation, not peer disconnects, no readonly >>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>>> its SSD with 10G, performance is good. >>>> >>>> > you'd have it's log on qemu's standard output, >>>> >>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>>> for problem for more than month, tried everything. Can?t find anything. 
Any >>>> more clues or leads? >>>> >>>> BR, >>>> Martin >>>> >>>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>>> > >>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>>> >> Hi all, >>>> > >>>> > Hi >>>> > >>>> >> >>>> >> I am running replica 3 on SSDs with 10G networking, everything works >>>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>>> blocked for more than 120 seconds?. >>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>>> am unable to SSH and also login with console, its stuck probably on some >>>> disk operation. No error/warning logs or messages are store in VMs logs. >>>> >> >>>> > >>>> > As far as I know this should be unrelated, I get this during heals >>>> > without any freezes, it just means the storage is slow I think. >>>> > >>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks >>>> on replica volume. Can someone advice how to debug this problem or what >>>> can cause these issues? >>>> >> It?s really annoying, I?ve tried to google everything but nothing >>>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>>> drivers, but its not related. >>>> >> >>>> > >>>> > Any chance your gluster goes readonly ? Have you checked your gluster >>>> > logs to see if maybe they lose each other some times ? >>>> > /var/log/glusterfs >>>> > >>>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>>> > that might contain the actual error at the time of the freez. >>>> > _______________________________________________ >>>> > Gluster-users mailing list >>>> > Gluster-users at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.spisla at iternity.com Tue May 14 14:43:11 2019 From: david.spisla at iternity.com (David Spisla) Date: Tue, 14 May 2019 14:43:11 -0000 Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) In-Reply-To: References: Message-ID: Hi Poornima, thats fine. I would suggest this dates and times: May 15th ? 17th at 12:30, 13:30, 14:30 IST (9:00, 10:00, 11:00 CEST) May 20th ? 24th at 12:30, 13:30, 14:30 IST (9:00, 10:00, 11:00 CEST) I add Volker Lendecke from Sernet to the mail. He is the Samba Expert. Can someone of you provide a host via bluejeans.com? If not, I will try it with GoToMeeting (https://www.gotomeeting.com). @all Please write your prefered dates and times. For me, all oft the above dates and times are fine Regards David David Spisla Software Engineer david.spisla at iternity.com +49 761 59034852 iTernity GmbH Heinrich-von-Stephan-Str. 21 79100 Freiburg Deutschland Website Newsletter Support Portal iTernity GmbH. Gesch?ftsf?hrer: Ralf Steinemann. ?Eingetragen beim Amtsgericht Freiburg: HRB-Nr. 701332. ?USt.Id DE242664311. [v01.023] Von: Poornima Gurusiddaiah Gesendet: Montag, 13. 
Mai 2019 07:22 An: David Spisla ; Anoop C S ; Gunther Deschner Cc: Gluster Devel ; gluster-users at gluster.org List Betreff: Re: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) Hi, We would be definitely interested in this. Thank you for contacting us. For the starter we can have an online conference. Please suggest few possible date and times for the week(preferably between IST 7.00AM - 9.PM)? Adding Anoop and Gunther who are also the main contributors to the Gluster-Samba integration. Thanks, Poornima On Thu, May 9, 2019 at 7:43 PM David Spisla > wrote: Dear Gluster Community, at the moment we are improving the stability of SMB/CTDB and Gluster. For this purpose we are working together with an advanced SAMBA Core Developer. He did some debugging but needs more information about Gluster Core Behaviour. Would any of the Gluster Developer wants to have a online conference with him and me? I would organize everything. In my opinion this is a good chance to improve stability of Glusterfs and this is at the moment one of the major issues in the Community. Regards David Spisla _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image860747.png Type: image/png Size: 382 bytes Desc: image860747.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image735814.png Type: image/png Size: 412 bytes Desc: image735814.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image116096.png Type: image/png Size: 6545 bytes Desc: image116096.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image142576.png Type: image/png Size: 37146 bytes Desc: image142576.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image714843.png Type: image/png Size: 522 bytes Desc: image714843.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image293410.png Type: image/png Size: 591 bytes Desc: image293410.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image570372.png Type: image/png Size: 775 bytes Desc: image570372.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image031225.png Type: image/png Size: 508 bytes Desc: image031225.png URL: From rkothiya at redhat.com Wed May 15 06:56:16 2019 From: rkothiya at redhat.com (rkothiya at redhat.com) Date: Wed, 15 May 2019 06:56:16 -0000 Subject: [Gluster-devel] Invitation: End Date for Feature/Scope proposal for Release-7 @ Wed May 22, 2019 11am - 12pm (IST) (gluster-devel@gluster.org) Message-ID: <0000000000003d14950588e7a522@google.com> You have been invited to the following event. Title: End Date for Feature/Scope proposal for Release-7 This is just a reminder/notification for announcing that, 22-May-2019 is the end date for feature/scope proposal for the Release-7 When: Wed May 22, 2019 11am ? 
When: Wed May 22, 2019 11am - 12pm India Standard Time - Kolkata
Calendar: gluster-devel at gluster.org
Who:
    * rkothiya at redhat.com - creator
    * gluster-devel at gluster.org
    * maintainers at gluster.org

Event details:
https://www.google.com/calendar/event?action=VIEW&eid=M3UxZXRzMmg1OTZ1NWRyM2N1OHUxZDQxbDUgZ2x1c3Rlci1kZXZlbEBnbHVzdGVyLm9yZw&tok=NjMjcmVkaGF0LmNvbV9yM2hvb3RjcjZ0MXY0YWc2MzFvY2dzZXNoZ0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29tMzgzZGUyNzQ1ZTg2NDg2NmU0ODliMzkyMWY2OGY1YmFmOGViODNlNQ&ctz=Asia%2FKolkata&hl=en&es=0

From vladkopy at gmail.com  Fri May 17 23:18:49 2019
From: vladkopy at gmail.com (Vlad Kopylov)
Date: Fri, 17 May 2019 23:18:49 -0000
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To:
References:
Message-ID:

straight from

./autogen.sh && ./configure && make -j install

CentOS Linux release 7.6.1810 (Core)

May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory
May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr.
May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :]
May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs
May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE
May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state.
May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed.
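Two things stand out in that log; a minimal triage sketch for both follows.
Treat it as an assumption on my part, not a confirmed diagnosis of this
particular failure:

    # 1) "Error opening log file": the log directory may simply be missing
    mkdir -p /var/log/gluster-block

    # 2) "No such path /backstores/user:glfs": gluster-blockd drives LIO through
    #    tcmu-runner's user:glfs handler; if tcmu-runner is not running, or was
    #    built without glfs support, targetcli sees no such backstore
    systemctl status tcmu-runner
    targetcli ls /backstores    # 'user:glfs' should appear in this listing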
On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:
> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block; lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>
> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and the setup guide at [4].
> If you are a new user, check out the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and the test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection; an updated package
> with release version v0.4 will be made available soon, and the
> community-provided packages will be made available soon at [8].
>
> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users and contributors for bug filing and
> fixes, and also the whole team who was involved in the huge effort of
> pre-release testing.
>
> [1] https://github.com/gluster/gluster-block
> [2] https://github.com/gluster/gluster-block/releases
> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> [4] https://github.com/gluster/gluster-block#usage
> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> [6] https://github.com/gluster/gluster-block/tree/master/docs
> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> [8] https://download.gluster.org/pub/gluster/gluster-block/
> [9] https://github.com/gluster/gluster-block/issues/new
>
> Cheers,
> Team Gluster-Block!
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

From snowmailer at gmail.com  Mon May 20 10:07:58 2019
From: snowmailer at gmail.com (Martin)
Date: Mon, 20 May 2019 10:07:58 -0000
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To:
References: <20190513065548.GI25080@althea.ulrar.net>
	<681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com>
Message-ID: <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>

Hi Krutika,

> Also, gluster version please?

I am running old 3.7.6. (Yes, I know I should upgrade asap.)

I first applied "network.remote-dio off"; the behaviour did not change, VMs
got stuck after some time again. Then I set "performance.strict-o-direct on"
and the problem completely disappeared. No more freezes at all (7 days
without any problems). This SOLVED the issue.

Can you explain what the remote-dio and strict-o-direct options changed in
the behaviour of my Gluster? It would be great for the archive and later
users to understand what solved my issue and why.

Anyway, thanks a LOT!!!

BR,
Martin

> On 13 May 2019, at 10:20, Krutika Dhananjay wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> preferably one option changed at a time, its impact tested, and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
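For the archive, applying and verifying those two options boils down to the
following (a sketch; the volume name imagestore is taken from the qemu
command quoted below, and "gluster volume get" is assumed to be available in
the installed CLI version):

    gluster volume set imagestore network.remote-dio off
    gluster volume set imagestore performance.strict-o-direct on
    # confirm the values the volume actually runs with
    gluster volume get imagestore network.remote-dio
    gluster volume get imagestore performance.strict-o-direct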
> On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote:
> Cache in qemu is none. That should be correct. This is the full command:
>
> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on
> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>
> -netdev tap,fd=26,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> I've highlighted the disks. The first is the VM context disk (FUSE used),
> the second is SDA where the OS is installed (libgfapi used), the third is
> swap (FUSE used).
>
> Krutika,
> I will start profiling on the Gluster volumes and wait for the next VM to
> fail. Then I will attach/send the profiling info after some VM has failed.
> I suppose this is the correct profiling strategy.

About this, how many vms do you need to recreate it? A single vm? Or
multiple vms doing IO in parallel?

> Thanks,
> BR!
> Martin
>
>> On 13 May 2019, at 09:21, Krutika Dhananjay wrote:
>>
>> Also, what's the caching policy that qemu is using on the affected vms?
>> Is it cache=none? Or something else? You can get this information in the
>> command line of the qemu-kvm process corresponding to your vm in the ps
>> output.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where you
>> manage to recreate this issue?
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>> Let me know if you have any questions.
>>
>> -Krutika
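The volume-profile capture requested above boils down to a start/info/stop
cycle (a sketch, again assuming the affected volume is named imagestore):

    gluster volume profile imagestore start
    # ... wait for a VM to freeze, then grab the counters ...
    gluster volume profile imagestore info > profile-during-freeze.txt
    gluster volume profile imagestore stop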
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
>> Hi,
>>
>> there is no healing operation, no peer disconnects, no read-only
>> filesystem. Yes, the storage is slow and unavailable for 120 seconds,
>> but why? It's SSD with 10G; performance is good.
>>
>> > you'd have its log on qemu's standard output,
>>
>> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
>> looking into this problem for more than a month and have tried
>> everything. I can't find anything. Any more clues or leads?
>>
>> BR,
>> Martin
>>
>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote:
>> >
>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>> >> Hi all,
>> >
>> > Hi
>> >
>> >> I am running replica 3 on SSDs with 10G networking. Everything works
>> >> OK, but VMs stored in the Gluster volume occasionally freeze with
>> >> "Task XY blocked for more than 120 seconds".
>> >> The only solution is to power off (hard) the VM and then boot it up
>> >> again. I am unable to SSH in and also to log in via the console; it is
>> >> stuck, probably on some disk operation. No error/warning logs or
>> >> messages are stored in the VM's logs.
>> >
>> > As far as I know this should be unrelated, I get this during heals
>> > without any freezes, it just means the storage is slow I think.
>> >
>> >> KVM/Libvirt (qemu) uses libgfapi and a fuse mount to access VM disks
>> >> on the replica volume. Can someone advise how to debug this problem or
>> >> what can cause these issues?
>> >> It's really annoying; I've tried to google everything but nothing came
>> >> up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk
>> >> drivers, but it's not related.
>> >
>> > Any chance your gluster goes readonly? Have you checked your gluster
>> > logs to see if maybe they lose each other some times?
>> > /var/log/glusterfs
>> >
>> > For libgfapi accesses you'd have its log on qemu's standard output,
>> > that might contain the actual error at the time of the freeze.
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users

From vladkopy at gmail.com  Mon May 20 15:35:33 2019
From: vladkopy at gmail.com (Vlad Kopylov)
Date: Mon, 20 May 2019 15:35:33 -0000
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To:
References:
Message-ID:

Thank you, Prasanna.

Do we have the architecture documented somewhere? Does it bypass FUSE and go
directly via gfapi?

v

On Mon, May 20, 2019, 8:36 AM Prasanna Kalever wrote:

> Hey Vlad,
>
> Thanks for trying gluster-block. Appreciate your feedback.
>
> Here is the patch which should fix the issue you have noticed:
> https://github.com/gluster/gluster-block/pull/233
>
> Thanks!
> --
> Prasanna
>
> On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote:
> >
> > straight from
> >
> > ./autogen.sh && ./configure && make -j install
> >
> > CentOS Linux release 7.6.1810 (Core)
> >
> > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory
> > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr.
> > May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :]
> > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs
> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE
> > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state.
> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed.
> >
> > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:
> >>
> >> Hello Gluster folks,
> >>
> >> Gluster-block team is happy to announce the v0.4 release [1].
> >>
> >> This is the new stable version of gluster-block; lots of new and
> >> exciting features and interesting bug fixes are made available as part
> >> of this release.
> >> Please find the big list of release highlights and notable fixes at [2].
> >>
> >> Details about installation can be found in the easy install guide at
> >> [3]. Find the details about prerequisites and the setup guide at [4].
> >> If you are a new user, check out the demo video attached in the README
> >> doc [5], which will be a good source of intro to the project.
> >> There are good examples about how to use gluster-block both in the man
> >> pages [6] and the test file [7] (also in the README).
> >>
> >> gluster-block is part of the fedora package collection; an updated package
> >> with release version v0.4 will be made available soon, and the
> >> community-provided packages will be made available soon at [8].
> >>
> >> Please spend a minute to report any kind of issue that comes to your
> >> notice with this handy link [9].
> >> We look forward to your feedback, which will help gluster-block get better!
> >>
> >> We would like to thank all our users and contributors for bug filing and
> >> fixes, and also the whole team who was involved in the huge effort of
> >> pre-release testing.
> >>
> >> [1] https://github.com/gluster/gluster-block
> >> [2] https://github.com/gluster/gluster-block/releases
> >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> >> [4] https://github.com/gluster/gluster-block#usage
> >> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> >> [6] https://github.com/gluster/gluster-block/tree/master/docs
> >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> >> [8] https://download.gluster.org/pub/gluster/gluster-block/
> >> [9] https://github.com/gluster/gluster-block/issues/new
> >>
> >> Cheers,
> >> Team Gluster-Block!
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users

From amgad.saleh at nokia.com  Fri May 24 10:38:56 2019
From: amgad.saleh at nokia.com (Saleh, Amgad (Nokia - US/Naperville))
Date: Fri, 24 May 2019 10:38:56 -0000
Subject: [Gluster-devel] Failure during "git review --setup"
In-Reply-To:
References:
Message-ID:

Never mind - it worked. The $USER when adding the gerrit remote should be my
GitHub user; the documentation was not clear!
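For the archive, a sketch of the remote setup that note implies; the exact
URL shape is an assumption based on common Gerrit layouts, not something
confirmed in this thread:

    # use your Gerrit (GitHub) username here, not the shell's $USER
    git remote add gerrit ssh://<gerrit-username>@review.gluster.org/glusterfs.git
    git review --setup    # re-fetches gerrit and installs the commit-msg hook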
From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:34 PM
To: gluster-devel at gluster.org
Subject: RE: Failure during "git review --setup"
Importance: High

Looking at the document
https://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/
I ran ./rfc.sh and got the following:

[ahsaleh at null-d4bed9857109 glusterfs]$ ./rfc.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4780  100  4780    0     0    438      0  0:00:10  0:00:10 --:--:--  1087
[bugfix-pureIPv6 d14e550] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): 677
Select yes '(y)' if this patch fixes the bug/feature completely, or is the
last of the patchset which brings feature (Y/n): y
[detached HEAD 8a5bb4a] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Successfully rebased and updated refs/heads/bugfix-pureIPv6.
./rfc.sh: line 287: clang-format: command not found
[ahsaleh at null-d4bed9857109 glusterfs]$

Is the code submitted for review? Please advise what's needed next; this is
my first time using the process.

Submit for review
To submit your change for review, run the rfc.sh script,
$ ./rfc.sh

From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:19 PM
To: gluster-devel at gluster.org
Subject: Failure during "git review --setup"

Hi:

After submitting a Pull Request at Github, I got the message about the
gerrit review (attached). I followed the steps and added a public key, but
failed at the "git review --setup" step; please see the errors below.

Your urgent support is appreciated!

Regards,
Amgad Saleh
Nokia

[ahsaleh at null-d4bed9857109 glusterfs]$ git review --setup
Problem running 'git remote update gerrit'
Fetching gerrit
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch gerrit
Problems encountered installing commit-msg hook
The following command failed with exit code 1
"scp ahsaleh at review.gluster.org:hooks/commit-msg .git/hooks/commit-msg"
-----------------------
Permission denied (publickey).
-----------------------
[ahsaleh at null-d4bed9857109 glusterfs]$

From amgad.saleh at nokia.com  Sat May 25 02:06:24 2019
From: amgad.saleh at nokia.com (Saleh, Amgad (Nokia - US/Naperville))
Date: Sat, 25 May 2019 02:06:24 -0000
Subject: [Gluster-devel] Failure during "git review --setup"
In-Reply-To:
References:
Message-ID:

Looking at the document
https://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/
I ran ./rfc.sh and got the following:

[ahsaleh at null-d4bed9857109 glusterfs]$ ./rfc.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4780  100  4780    0     0    438      0  0:00:10  0:00:10 --:--:--  1087
[bugfix-pureIPv6 d14e550] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): 677
Select yes '(y)' if this patch fixes the bug/feature completely, or is the
last of the patchset which brings feature (Y/n): y
[detached HEAD 8a5bb4a] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Successfully rebased and updated refs/heads/bugfix-pureIPv6.
./rfc.sh: line 287: clang-format: command not found
[ahsaleh at null-d4bed9857109 glusterfs]$

Is the code submitted for review? Please advise what's needed next; this is
my first time using the process.

Submit for review
To submit your change for review, run the rfc.sh script,
$ ./rfc.sh
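On the "clang-format: command not found" line at the end of both runs:
rfc.sh shells out to clang-format for a style check, so the message only
means the binary is missing from PATH. A possible fix, assuming an RPM-based
system (package names vary by distro):

    # clang-format ships with the Clang tooling; on Fedora, for example:
    sudo dnf install clang-tools-extra
    # then re-run the submission script
    ./rfc.sh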
From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:19 PM
To: gluster-devel at gluster.org
Subject: Failure during "git review --setup"

Hi:

After submitting a Pull Request at Github, I got the message about the
gerrit review (attached). I followed the steps and added a public key, but
failed at the "git review --setup" step; please see the errors below.

Your urgent support is appreciated!

Regards,
Amgad Saleh
Nokia

[ahsaleh at null-d4bed9857109 glusterfs]$ git review --setup
Problem running 'git remote update gerrit'
Fetching gerrit
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch gerrit
Problems encountered installing commit-msg hook
The following command failed with exit code 1
"scp ahsaleh at review.gluster.org:hooks/commit-msg .git/hooks/commit-msg"
-----------------------
Permission denied (publickey).
-----------------------
[ahsaleh at null-d4bed9857109 glusterfs]$

From sunkumar at redhat.com  Mon May 27 12:53:17 2019
From: sunkumar at redhat.com (sunkumar at redhat.com)
Date: Mon, 27 May 2019 12:53:17 -0000
Subject: [Gluster-devel] Gluster Community Meeting (APAC friendly hours)
Message-ID: <0000000000001f378e0589de08c2@google.com>

Bridge: https://bluejeans.com/836554017
Meeting minutes: https://hackmd.io/B4vOpJumRgexzqeQiNPVOw
Flash Talk: What is Thin Arbiter? (by Ashish Pandey)
Previous Meeting notes: http://github.com/gluster/community

Title: Gluster Community Meeting (APAC friendly hours)
When: Tue May 28, 2019 11:30am - 12:30pm India Standard Time - Kolkata
Where: https://bluejeans.com/836554017
Who:
    * pgurusid at redhat.com - organizer
    * javico at paradigmadigital.com
    * spentaparthi at idirect.net
    * sstephen at redhat.com
    * brian.riddle at storagecraft.com
    * sthomas at rpstechnologysolutions.co.uk
    * kdhananj at redhat.com
    * rwareing at fb.com
    * david.spisla at iternity.com
    * khiremat at redhat.com
    * pkarampu at redhat.com
    * gluster-users at gluster.org
    * dcunningham at voisonics.com
    * m.vrgotic at activevideo.com
    * barchu02 at unm.edu
    * gluster-devel at gluster.org
    * sunkumar at redhat.com
    * jpark at dexyp.com
    * rouge2507 at gmail.com
    * dan at clough.xyz
    * Max de Graaf
    * mark.boulton at uwa.edu.au
    * hgowtham at redhat.com
    * gabriel.lindeborg at svenskaspel.se
    * maintainers at gluster.org
    * ranaraya at redhat.com
    * philip.ruenagel at gmail.com
    * spalai at redhat.com
    * m.ragusa at eurodata.de
    * pauyeung at connexity.com
    * duprel at email.sc.edu