From atumball at redhat.com  Wed May  1 12:42:30 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Wed, 1 May 2019 18:12:30 +0530
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID:

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of the missing free? We will
continue to investigate the leak further, but would really appreciate
getting the patch that has already been worked on landed into upstream
master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:

> Ok, I am clear now.
>
> I've added ssl_free to the socket reset and socket finish functions.
> Though the glusterfsd memory leak is smaller now, it is still leaking,
> and from the source code I cannot find anything else.
>
> Could you help to check if this issue exists in your env? If not, I may
> try to merge your patch.
>
> Steps:
>
> 1> while true; do gluster v heal info; done
>
> 2> check the vol-name glusterfsd memory usage; it is obviously
> increasing.
>
> cynthia
>
>
>
> *From:* Milind Changire
> *Sent:* Monday, April 22, 2019 2:36 PM
> *To:* Zhou, Cynthia (NSB - CN/Hangzhou)
> *Cc:* Atin Mukherjee ; gluster-devel at gluster.org
> *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after
> enable ssl
>
>
>
> According to the BIO_new_socket() man page ...
>
>
>
> *If the close flag is set then the socket is shut down and closed when the
> BIO is freed.*
>
>
>
> For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE
> flag is set. Otherwise, SSL takes control of socket shutdown whenever the
> BIO is freed.
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From srakonde at redhat.com  Wed May  1 15:11:30 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Wed, 1 May 2019 20:41:30 +0530
Subject: [Gluster-devel] ./tests/basic/uss.t is timing out in release-6
 branch
In-Reply-To:
References:
Message-ID:

Thank you Raghavendra.

On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath <
rabhat at redhat.com> wrote:

>
> To make things relatively easy for the cleanup () function in the test
> framework, I think it would be better to ensure that uss.t itself deletes
> snapshots and the volume once the tests are done. Patch [1] has been
> submitted for review.
>
> [1] https://review.gluster.org/#/c/glusterfs/+/22649/
>
> Regards,
> Raghavendra
>
> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath <
> rabhat at redhat.com> wrote:
>
>>
>> The failure looks similar to the issue I had mentioned in [1].
>>
>> In short, for some reason the cleanup (the cleanup function that we call
>> in our .t files) seems to be taking more time and also not cleaning up
>> properly. This leads to problems in the 2nd iteration (where basic things
>> such as volume creation or volume start itself fail due to ENODATA or
>> ENOENT errors).
>>
>> The 2nd iteration of the uss.t run had the following errors:
>>
>> "[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
>> 39 gluster --mode=script --wignore volume set patchy nfs.disable false
>> ++++++++++
>> [2019-04-29 09:08:15.390550] : volume set patchy nfs.disable false :
>> SUCCESS
>> [2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
>> 42 gluster --mode=script --wignore volume start patchy ++++++++++
>> [2019-04-29 09:08:15.468780] : volume start patchy : FAILED : Failed to
>> get extended attribute trusted.glusterfs.volume-id for brick dir
>> /d/backends/3/patchy_snap_mnt. Reason : No data available
>> "
>>
>> These are the initial steps to create and start the volume. Why the
>> trusted.glusterfs.volume-id extended attribute is absent is not clear.
>> The analysis in [1] had errors of ENOENT (i.e. the export directory
>> itself was absent).
>> I suspect this to be because of some issue with the cleanup mechanism at
>> the end of the tests.
>>
>> [1]
>> https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html
>>
>> On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde 
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> ./tests/basic/uss.t is timing out in the release-6 branch consistently.
>>> One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/.
>>> Can you please look into this?
>>>
>>> --
>>> Thanks,
>>> Sanju
>>>
>> --
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From moagrawa at redhat.com  Thu May  2 06:45:09 2019
From: moagrawa at redhat.com (Mohit Agrawal)
Date: Thu, 2 May 2019 12:15:09 +0530
Subject: [Gluster-devel] Query regarding dictionary logic
In-Reply-To:
References:
Message-ID:

Hi Vijay,

I have tried to execute the smallfile tool on a volume (12x3); I have not
found any significant performance improvement for smallfile operations. I
configured 4 clients and 8 threads to run the operations.

I generated a statedump and found the below data for the dictionaries,
specific to the gluster processes:

brick
max-pairs-per-dict=50
total-pairs-used=192212171
total-dicts-used=24794349
average-pairs-per-dict=7

glusterd
max-pairs-per-dict=301
total-pairs-used=156677
total-dicts-used=30719
average-pairs-per-dict=5

fuse process
[dict]
max-pairs-per-dict=50
total-pairs-used=88669561
total-dicts-used=12360543
average-pairs-per-dict=7

It seems the dictionary has the most pairs in the case of glusterd, and
when the number of volumes is high that number can increase further. I
think there is no performance regression in the case of brick and fuse. I
have used a hash_size of 20 for the dictionary.
Let me know if you can provide some other test to validate the same.

Thanks,
Mohit Agrawal

On Tue, Apr 30, 2019 at 2:29 PM Mohit Agrawal  wrote:

> Thanks, Amar for sharing the patch, I will test and share the result.
>
> On Tue, Apr 30, 2019 at 2:23 PM Amar Tumballi Suryanarayan <
> atumball at redhat.com> wrote:
>
>> Shreyas/Kevin tried to address it some time back using
>> https://bugzilla.redhat.com/show_bug.cgi?id=1428049 (
>> https://review.gluster.org/16830)
>>
>> I vaguely remember the reason to keep the hash value 1 was done during
>> the time when we had dictionary itself sent as on wire protocol, and in
>> most other places, number of entries in dictionary was on an avg, 3. So, we
>> felt, saving on a bit of memory for optimization was better at that time.
>>
>> -Amar
>>
>> On Tue, Apr 30, 2019 at 12:02 PM Mohit Agrawal 
>> wrote:
>>
>>> sure Vijay, I will try and update.
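(As an aside for anyone who wants to reproduce the [dict] figures quoted
above: they come from a statedump. A minimal sketch, assuming a volume
named patchy and the default statedump directory /var/run/gluster:

    # Dump the state of all brick processes of the volume.
    gluster volume statedump patchy

    # For a fuse client, send SIGUSR1 to the glusterfs process instead.
    kill -USR1 $(pgrep -f 'glusterfs.*patchy')

    # The [dict] section of each dump carries the pair/dict counters.
    grep -A 4 '\[dict\]' /var/run/gluster/*dump*

The exact dump file names vary by process and pid, so the glob above is
only indicative.)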
>>> >>> Regards, >>> Mohit Agrawal >>> >>> On Tue, Apr 30, 2019 at 11:44 AM Vijay Bellur >>> wrote: >>> >>>> Hi Mohit, >>>> >>>> On Mon, Apr 29, 2019 at 7:15 AM Mohit Agrawal >>>> wrote: >>>> >>>>> Hi All, >>>>> >>>>> I was just looking at the code of dict, I have one query current >>>>> dictionary logic. >>>>> I am not able to understand why we use hash_size is 1 for a >>>>> dictionary.IMO with the >>>>> hash_size of 1 dictionary always work like a list, not a hash, for >>>>> every lookup >>>>> in dictionary complexity is O(n). >>>>> >>>>> Before optimizing the code I just want to know what was the exact >>>>> reason to define >>>>> hash_size is 1? >>>>> >>>> >>>> This is a good question. I looked up the source in gluster's historic >>>> repo [1] and hash_size is 1 even there. So, this could have been the case >>>> since the first version of the dictionary code. >>>> >>>> Would you be able to run some tests with a larger hash_size and share >>>> your observations? >>>> >>>> Thanks, >>>> Vijay >>>> >>>> [1] >>>> https://github.com/gluster/historic/blob/master/libglusterfs/src/dict.c >>>> >>>> >>>> >>>>> >>>>> Please share your view on the same. >>>>> >>>>> Thanks, >>>>> Mohit Agrawal >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> >> >> -- >> Amar Tumballi (amarts) >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 10:45:38 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 12:45:38 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? Message-ID: Hi all, there's a feature in the locks xlator that sends a notification to current owner of a lock when another client tries to acquire the same lock. This way the current owner is made aware of the contention and can release the lock as soon as possible to allow the other client to proceed. This is specially useful when eager-locking is used and multiple clients access the same files and directories. Currently both replicated and dispersed volumes use eager-locking and can use contention notification to force an early release of the lock. Eager-locking reduces the number of network requests required for each operation, improving performance, but could add delays to other clients while it keeps the inode or entry locked. With the contention notification feature we avoid this delay, so we get the best performance with minimal issues in multiclient environments. Currently the contention notification feature is controlled by the 'features.lock-notify-contention' option and it's disabled by default. Should we enable it by default ? I don't see any reason to keep it disabled by default. Does anyone foresee any problem ? Regards, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From aspandey at redhat.com Thu May 2 12:17:51 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Thu, 2 May 2019 08:17:51 -0400 (EDT) Subject: [Gluster-devel] Should we enable contention notification by default ? 
In-Reply-To: References: Message-ID: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Xavi, I would like to keep this option (features.lock-notify-contention) enabled by default. However, I can see that there is one more option which will impact the working of this option which is "notify-contention-delay" .description = "This value determines the minimum amount of time " "(in seconds) between upcall contention notifications " "on the same inode. If multiple lock requests are " "received during this period, only one upcall will " "be sent."}, I am not sure what should be the best value for this option if we want to keep features.lock-notify-contention ON by default? It looks like if we keep the value of notify-contention-delay more, say 5 sec, it will wait for this much time to send up call notification which does not look good. Is my understanding correct? What will be impact of this value and what should be the default value of this option? --- Ashish ----- Original Message ----- From: "Xavi Hernandez" To: "gluster-devel" Cc: "Pranith Kumar Karampuri" , "Ashish Pandey" , "Amar Tumballi" Sent: Thursday, May 2, 2019 4:15:38 PM Subject: Should we enable contention notification by default ? Hi all, there's a feature in the locks xlator that sends a notification to current owner of a lock when another client tries to acquire the same lock. This way the current owner is made aware of the contention and can release the lock as soon as possible to allow the other client to proceed. This is specially useful when eager-locking is used and multiple clients access the same files and directories. Currently both replicated and dispersed volumes use eager-locking and can use contention notification to force an early release of the lock. Eager-locking reduces the number of network requests required for each operation, improving performance, but could add delays to other clients while it keeps the inode or entry locked. With the contention notification feature we avoid this delay, so we get the best performance with minimal issues in multiclient environments. Currently the contention notification feature is controlled by the 'features.lock-notify-contention' option and it's disabled by default. Should we enable it by default ? I don't see any reason to keep it disabled by default. Does anyone foresee any problem ? Regards, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 13:13:36 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 15:13:36 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: Hi Ashish, On Thu, May 2, 2019 at 2:17 PM Ashish Pandey wrote: > Xavi, > > I would like to keep this option (features.lock-notify-contention) enabled > by default. > However, I can see that there is one more option which will impact the > working of this option which is "notify-contention-delay" > .description = "This value determines the minimum amount of time " > "(in seconds) between upcall contention notifications " > "on the same inode. If multiple lock requests are " > "received during this period, only one upcall will " > "be sent."}, > > I am not sure what should be the best value for this option if we want to > keep features.lock-notify-contention ON by default? 
It looks like if we set the value of notify-contention-delay higher, say 5
sec, it will wait that long to send the upcall notification, which does
not look good.
Is my understanding correct?
What will the impact of this value be, and what should the default value
of this option be?

---
Ashish

------------------------------
*From: *"Xavi Hernandez" 
*To: *"gluster-devel" 
*Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
aspandey at redhat.com>, "Amar Tumballi" 
*Sent: *Thursday, May 2, 2019 4:15:38 PM
*Subject: *Should we enable contention notification by default ?

Hi all,

there's a feature in the locks xlator that sends a notification to current
owner of a lock when another client tries to acquire the same lock. This
way the current owner is made aware of the contention and can release the
lock as soon as possible to allow the other client to proceed.

This is specially useful when eager-locking is used and multiple clients
access the same files and directories. Currently both replicated and
dispersed volumes use eager-locking and can use contention notification to
force an early release of the lock.

Eager-locking reduces the number of network requests required for each
operation, improving performance, but could add delays to other clients
while it keeps the inode or entry locked. With the contention notification
feature we avoid this delay, so we get the best performance with minimal
issues in multiclient environments.

Currently the contention notification feature is controlled by the
'features.lock-notify-contention' option and it's disabled by default.
Should we enable it by default ?

I don't see any reason to keep it disabled by default. Does anyone foresee
any problem ?

Regards,

Xavi

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From xhernandez at redhat.com  Thu May  2 13:13:36 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 2 May 2019 15:13:36 +0200
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

Hi Ashish,

On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:

> Xavi,
>
> I would like to keep this option (features.lock-notify-contention) enabled
> by default.
> However, I can see that there is one more option which will impact the
> working of this option which is "notify-contention-delay"
> .description = "This value determines the minimum amount of time "
>                "(in seconds) between upcall contention notifications "
>                "on the same inode. If multiple lock requests are "
>                "received during this period, only one upcall will "
>                "be sent."},
>
> I am not sure what should be the best value for this option if we want to
> keep features.lock-notify-contention ON by default?
> It looks like if we keep the value of notify-contention-delay more, say 5
> sec, it will wait for this much time to send up call
> notification which does not look good.
>

No, the first notification is sent immediately. What this option does is
to define the minimum interval between notifications. This interval is per
lock. This is done to avoid storms of notifications if many requests come
referencing the same lock.

Is my understanding correct?
> What will be impact of this value and what should be the default value of
> this option?
>

I think the current default value of 5 seconds seems good enough. If there
are many bricks, each brick could send a notification per lock. 1000
bricks would mean a client would receive 1000 notifications every 5
seconds. It doesn't seem too much, but in cases like that, and considering
we could have other locks, maybe a higher value (like 10) could be better.

Xavi


>
> ---
> Ashish
>
> ------------------------------
> *From: *"Xavi Hernandez" 
> *To: *"gluster-devel" 
> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" <
> aspandey at redhat.com>, "Amar Tumballi" 
> *Sent: *Thursday, May 2, 2019 4:15:38 PM
> *Subject: *Should we enable contention notification by default ?
>
> Hi all,
>
> there's a feature in the locks xlator that sends a notification to current
> owner of a lock when another client tries to acquire the same lock. This
> way the current owner is made aware of the contention and can release the
> lock as soon as possible to allow the other client to proceed.
>
> This is specially useful when eager-locking is used and multiple clients
> access the same files and directories. Currently both replicated and
> dispersed volumes use eager-locking and can use contention notification to
> force an early release of the lock.
>
> Eager-locking reduces the number of network requests required for each
> operation, improving performance, but could add delays to other clients
> while it keeps the inode or entry locked. With the contention notification
> feature we avoid this delay, so we get the best performance with minimal
> issues in multiclient environments.
>
> Currently the contention notification feature is controlled by the
> 'features.lock-notify-contention' option and it's disabled by default.
> Should we enable it by default ?
>
> I don't see any reason to keep it disabled by default. Does anyone foresee
> any problem ?
>
> Regards,
>
> Xavi
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mchangir at redhat.com  Thu May  2 13:37:02 2019
From: mchangir at redhat.com (Milind Changire)
Date: Thu, 2 May 2019 19:07:02 +0530
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To:
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez  wrote:

> Hi Ashish,
>
> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey  wrote:
>
>> Xavi,
>>
>> I would like to keep this option (features.lock-notify-contention)
>> enabled by default.
>> However, I can see that there is one more option which will impact the
>> working of this option which is "notify-contention-delay"
>>
>
Just a nit. I wish the option was called "notify-contention-interval"
The "delay" part doesn't really emphasize where the delay would be put in.

> .description = "This value determines the minimum amount of time "
>> "(in seconds) between upcall contention notifications "
>> "on the same inode.
If multiple lock requests are " >> "received during this period, only one upcall will " >> "be sent."}, >> >> I am not sure what should be the best value for this option if we want to >> keep features.lock-notify-contention ON by default? >> It looks like if we keep the value of notify-contention-delay more, say 5 >> sec, it will wait for this much time to send up call >> notification which does not look good. >> > > No, the first notification is sent immediately. What this option does is > to define the minimum interval between notifications. This interval is per > lock. This is done to avoid storms of notifications if many requests come > referencing the same lock. > > Is my understanding correct? >> What will be impact of this value and what should be the default value of >> this option? >> > > I think the current default value of 5 seconds seems good enough. If there > are many bricks, each brick could send a notification per lock. 1000 bricks > would mean a client would receive 1000 notifications every 5 seconds. It > doesn't seem too much, but in those cases 10, and considering we could have > other locks, maybe a higher value could be better. > > Xavi > > >> >> --- >> Ashish >> >> >> >> >> >> >> ------------------------------ >> *From: *"Xavi Hernandez" >> *To: *"gluster-devel" >> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" < >> aspandey at redhat.com>, "Amar Tumballi" >> *Sent: *Thursday, May 2, 2019 4:15:38 PM >> *Subject: *Should we enable contention notification by default ? >> >> Hi all, >> >> there's a feature in the locks xlator that sends a notification to >> current owner of a lock when another client tries to acquire the same lock. >> This way the current owner is made aware of the contention and can release >> the lock as soon as possible to allow the other client to proceed. >> >> This is specially useful when eager-locking is used and multiple clients >> access the same files and directories. Currently both replicated and >> dispersed volumes use eager-locking and can use contention notification to >> force an early release of the lock. >> >> Eager-locking reduces the number of network requests required for each >> operation, improving performance, but could add delays to other clients >> while it keeps the inode or entry locked. With the contention notification >> feature we avoid this delay, so we get the best performance with minimal >> issues in multiclient environments. >> >> Currently the contention notification feature is controlled by the >> 'features.lock-notify-contention' option and it's disabled by default. >> Should we enable it by default ? >> >> I don't see any reason to keep it disabled by default. Does anyone >> foresee any problem ? >> >> Regards, >> >> Xavi >> >> _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Milind -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 2 13:44:40 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 15:44:40 +0200 Subject: [Gluster-devel] Should we enable contention notification by default ? 
In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019, 15:37 Milind Changire, wrote: > On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez > wrote: > >> Hi Ashish, >> >> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey wrote: >> >>> Xavi, >>> >>> I would like to keep this option (features.lock-notify-contention) >>> enabled by default. >>> However, I can see that there is one more option which will impact the >>> working of this option which is "notify-contention-delay" >>> >> > Just a nit. I wish the option was called "notify-contention-interval" > The "delay" part doesn't really emphasize where the delay would be put in. > It makes sense. Maybe we can also rename it or add a second name (alias). If there are no objections, I will send a patch with the change. Xavi > >> .description = "This value determines the minimum amount of time " >>> "(in seconds) between upcall contention >>> notifications " >>> "on the same inode. If multiple lock requests are " >>> "received during this period, only one upcall will " >>> "be sent."}, >>> >>> I am not sure what should be the best value for this option if we want >>> to keep features.lock-notify-contention ON by default? >>> It looks like if we keep the value of notify-contention-delay more, say >>> 5 sec, it will wait for this much time to send up call >>> notification which does not look good. >>> >> >> No, the first notification is sent immediately. What this option does is >> to define the minimum interval between notifications. This interval is per >> lock. This is done to avoid storms of notifications if many requests come >> referencing the same lock. >> >> Is my understanding correct? >>> What will be impact of this value and what should be the default value >>> of this option? >>> >> >> I think the current default value of 5 seconds seems good enough. If >> there are many bricks, each brick could send a notification per lock. 1000 >> bricks would mean a client would receive 1000 notifications every 5 >> seconds. It doesn't seem too much, but in those cases 10, and considering >> we could have other locks, maybe a higher value could be better. >> >> Xavi >> >> >>> >>> --- >>> Ashish >>> >>> >>> >>> >>> >>> >>> ------------------------------ >>> *From: *"Xavi Hernandez" >>> *To: *"gluster-devel" >>> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" < >>> aspandey at redhat.com>, "Amar Tumballi" >>> *Sent: *Thursday, May 2, 2019 4:15:38 PM >>> *Subject: *Should we enable contention notification by default ? >>> >>> Hi all, >>> >>> there's a feature in the locks xlator that sends a notification to >>> current owner of a lock when another client tries to acquire the same lock. >>> This way the current owner is made aware of the contention and can release >>> the lock as soon as possible to allow the other client to proceed. >>> >>> This is specially useful when eager-locking is used and multiple clients >>> access the same files and directories. Currently both replicated and >>> dispersed volumes use eager-locking and can use contention notification to >>> force an early release of the lock. >>> >>> Eager-locking reduces the number of network requests required for each >>> operation, improving performance, but could add delays to other clients >>> while it keeps the inode or entry locked. With the contention notification >>> feature we avoid this delay, so we get the best performance with minimal >>> issues in multiclient environments. 
>>> >>> Currently the contention notification feature is controlled by the >>> 'features.lock-notify-contention' option and it's disabled by default. >>> Should we enable it by default ? >>> >>> I don't see any reason to keep it disabled by default. Does anyone >>> foresee any problem ? >>> >>> Regards, >>> >>> Xavi >>> >>> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > -- > Milind > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Thu May 2 14:06:19 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Thu, 2 May 2019 19:36:19 +0530 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019 at 19:14, Xavi Hernandez wrote: > On Thu, 2 May 2019, 15:37 Milind Changire, wrote: > >> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez >> wrote: >> >>> Hi Ashish, >>> >>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey >>> wrote: >>> >>>> Xavi, >>>> >>>> I would like to keep this option (features.lock-notify-contention) >>>> enabled by default. >>>> However, I can see that there is one more option which will impact the >>>> working of this option which is "notify-contention-delay" >>>> >>> >> Just a nit. I wish the option was called "notify-contention-interval" >> The "delay" part doesn't really emphasize where the delay would be put in. >> > > It makes sense. Maybe we can also rename it or add a second name (alias). > If there are no objections, I will send a patch with the change. > > Xavi > > >> >>> .description = "This value determines the minimum amount of time " >>>> "(in seconds) between upcall contention >>>> notifications " >>>> "on the same inode. If multiple lock requests are " >>>> "received during this period, only one upcall will " >>>> "be sent."}, >>>> >>>> I am not sure what should be the best value for this option if we want >>>> to keep features.lock-notify-contention ON by default? >>>> It looks like if we keep the value of notify-contention-delay more, say >>>> 5 sec, it will wait for this much time to send up call >>>> notification which does not look good. >>>> >>> >>> No, the first notification is sent immediately. What this option does is >>> to define the minimum interval between notifications. This interval is per >>> lock. This is done to avoid storms of notifications if many requests come >>> referencing the same lock. >>> >>> Is my understanding correct? >>>> What will be impact of this value and what should be the default value >>>> of this option? >>>> >>> >>> I think the current default value of 5 seconds seems good enough. If >>> there are many bricks, each brick could send a notification per lock. 1000 >>> bricks would mean a client would receive 1000 notifications every 5 >>> seconds. It doesn't seem too much, but in those cases 10, and considering >>> we could have other locks, maybe a higher value could be better. >>> >>> Xavi >>> >>> >>>> >>>> --- >>>> Ashish >>>> >>>> >>>> >>>> >>>> >>>> >>>> ------------------------------ >>>> *From: *"Xavi Hernandez" >>>> *To: *"gluster-devel" >>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish Pandey" >>>> , "Amar Tumballi" >>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM >>>> *Subject: *Should we enable contention notification by default ? 
>>>>
>>>> Hi all,
>>>>
>>>> there's a feature in the locks xlator that sends a notification to
>>>> current owner of a lock when another client tries to acquire the same lock.
>>>> This way the current owner is made aware of the contention and can release
>>>> the lock as soon as possible to allow the other client to proceed.
>>>>
>>>> This is specially useful when eager-locking is used and multiple clients
>>>> access the same files and directories. Currently both replicated and
>>>> dispersed volumes use eager-locking and can use contention notification to
>>>> force an early release of the lock.
>>>>
>>>> Eager-locking reduces the number of network requests required for each
>>>> operation, improving performance, but could add delays to other clients
>>>> while it keeps the inode or entry locked. With the contention notification
>>>> feature we avoid this delay, so we get the best performance with minimal
>>>> issues in multiclient environments.
>>>>
>>>> Currently the contention notification feature is controlled by the
>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>> Should we enable it by default ?
>>>>
>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>> foresee any problem ?
>>>>
Is it a server only option? Otherwise it will break backward compatibility
if we rename the key. If an alias can get this fixed, that's a better
choice, but I'm not sure if it solves all the problems.

>>>> Regards,
>>>>
>>>> Xavi
>>>>
>>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> Milind
>>
>> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
-- 
--Atin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From xhernandez at redhat.com  Thu May  2 15:08:45 2019
From: xhernandez at redhat.com (Xavi Hernandez)
Date: Thu, 2 May 2019 17:08:45 +0200
Subject: [Gluster-devel] Should we enable contention notification by
 default ?
In-Reply-To:
References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com>
Message-ID:

On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee 
wrote:

>
>
> On Thu, 2 May 2019 at 19:14, Xavi Hernandez  wrote:
>
>> On Thu, 2 May 2019, 15:37 Milind Changire,  wrote:
>>
>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez 
>>> wrote:
>>>
>>>> Hi Ashish,
>>>>
>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey 
>>>> wrote:
>>>>
>>>>> Xavi,
>>>>>
>>>>> I would like to keep this option (features.lock-notify-contention)
>>>>> enabled by default.
>>>>> However, I can see that there is one more option which will impact the
>>>>> working of this option which is "notify-contention-delay"
>>>>>
>>>>
>>> Just a nit. I wish the option was called "notify-contention-interval"
>>> The "delay" part doesn't really emphasize where the delay would be put
>>> in.
>>>
>>
>> It makes sense. Maybe we can also rename it or add a second name (alias).
>> If there are no objections, I will send a patch with the change.
>>
>> Xavi
>>
>>
>>>
>>>> .description = "This value determines the minimum amount of time "
>>>>> "(in seconds) between upcall contention
>>>>> notifications "
>>>>> "on the same inode.
>>>>> If multiple lock requests are "
>>>>> "received during this period, only one upcall will "
>>>>> "be sent."},
>>>>>
>>>>> I am not sure what should be the best value for this option if we want
>>>>> to keep features.lock-notify-contention ON by default?
>>>>> It looks like if we keep the value of notify-contention-delay more, say
>>>>> 5 sec, it will wait for this much time to send up call
>>>>> notification which does not look good.
>>>>>
>>>>
>>>> No, the first notification is sent immediately. What this option does
>>>> is to define the minimum interval between notifications. This interval is
>>>> per lock. This is done to avoid storms of notifications if many requests
>>>> come referencing the same lock.
>>>>
>>>> Is my understanding correct?
>>>>> What will be impact of this value and what should be the default value
>>>>> of this option?
>>>>>
>>>>
>>>> I think the current default value of 5 seconds seems good enough. If
>>>> there are many bricks, each brick could send a notification per lock. 1000
>>>> bricks would mean a client would receive 1000 notifications every 5
>>>> seconds. It doesn't seem too much, but in those cases 10, and considering
>>>> we could have other locks, maybe a higher value could be better.
>>>>
>>>> Xavi
>>>>
>>>>>
>>>>> ---
>>>>> Ashish
>>>>>
>>>>> ------------------------------
>>>>> *From: *"Xavi Hernandez" 
>>>>> *To: *"gluster-devel" 
>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>> Pandey" , "Amar Tumballi" 
>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>
>>>>> Hi all,
>>>>>
>>>>> there's a feature in the locks xlator that sends a notification to
>>>>> current owner of a lock when another client tries to acquire the same lock.
>>>>> This way the current owner is made aware of the contention and can release
>>>>> the lock as soon as possible to allow the other client to proceed.
>>>>>
>>>>> This is specially useful when eager-locking is used and multiple
>>>>> clients access the same files and directories. Currently both replicated
>>>>> and dispersed volumes use eager-locking and can use contention notification
>>>>> to force an early release of the lock.
>>>>>
>>>>> Eager-locking reduces the number of network requests required for each
>>>>> operation, improving performance, but could add delays to other clients
>>>>> while it keeps the inode or entry locked. With the contention notification
>>>>> feature we avoid this delay, so we get the best performance with minimal
>>>>> issues in multiclient environments.
>>>>>
>>>>> Currently the contention notification feature is controlled by the
>>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>>> Should we enable it by default ?
>>>>>
>>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>>> foresee any problem ?
>>>>>
>>>>
> Is it a server only option? Otherwise it will break backward compatibility
> if we rename the key. If an alias can get this fixed, that's a better
> choice, but I'm not sure if it solves all the problems.
>

It's a server side option. I thought that an alias didn't have any other
implication than accepting two names for the same option. Is there
anything else I need to consider ?
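(For anyone who wants to experiment with this, a minimal sketch on a test
volume named patchy follows. The enable key is the one named in this
thread; the interval key name is an assumption worth checking against
"gluster volume set help" on your build:

    # Enable contention notification (a server-side option).
    gluster volume set patchy features.lock-notify-contention on

    # Confirm the effective value.
    gluster volume get patchy features.lock-notify-contention

    # Minimum interval (seconds) between notifications for the same lock;
    # the exact CLI key name here is assumed, not confirmed by the thread.
    gluster volume set patchy features.lock-notify-contention-delay 5
)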
> >>>>> Regards, >>>>> >>>>> Xavi >>>>> >>>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> >>> >>> -- >>> Milind >>> >>> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- > --Atin > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Thu May 2 15:45:39 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Thu, 2 May 2019 21:15:39 +0530 Subject: [Gluster-devel] Should we enable contention notification by default ? In-Reply-To: References: <2044282595.16006319.1556799471980.JavaMail.zimbra@redhat.com> Message-ID: On Thu, 2 May 2019 at 20:38, Xavi Hernandez wrote: > On Thu, May 2, 2019 at 4:06 PM Atin Mukherjee > wrote: > >> >> >> On Thu, 2 May 2019 at 19:14, Xavi Hernandez >> wrote: >> >>> On Thu, 2 May 2019, 15:37 Milind Changire, wrote: >>> >>>> On Thu, May 2, 2019 at 6:44 PM Xavi Hernandez >>>> wrote: >>>> >>>>> Hi Ashish, >>>>> >>>>> On Thu, May 2, 2019 at 2:17 PM Ashish Pandey >>>>> wrote: >>>>> >>>>>> Xavi, >>>>>> >>>>>> I would like to keep this option (features.lock-notify-contention) >>>>>> enabled by default. >>>>>> However, I can see that there is one more option which will impact >>>>>> the working of this option which is "notify-contention-delay" >>>>>> >>>>> >>>> Just a nit. I wish the option was called "notify-contention-interval" >>>> The "delay" part doesn't really emphasize where the delay would be put >>>> in. >>>> >>> >>> It makes sense. Maybe we can also rename it or add a second name >>> (alias). If there are no objections, I will send a patch with the change. >>> >>> Xavi >>> >>> >>>> >>>>> .description = "This value determines the minimum amount of time " >>>>>> "(in seconds) between upcall contention >>>>>> notifications " >>>>>> "on the same inode. If multiple lock requests are >>>>>> " >>>>>> "received during this period, only one upcall >>>>>> will " >>>>>> "be sent."}, >>>>>> >>>>>> I am not sure what should be the best value for this option if we >>>>>> want to keep features.lock-notify-contention ON by default? >>>>>> It looks like if we keep the value of notify-contention-delay more, >>>>>> say 5 sec, it will wait for this much time to send up call >>>>>> notification which does not look good. >>>>>> >>>>> >>>>> No, the first notification is sent immediately. What this option does >>>>> is to define the minimum interval between notifications. This interval is >>>>> per lock. This is done to avoid storms of notifications if many requests >>>>> come referencing the same lock. >>>>> >>>>> Is my understanding correct? >>>>>> What will be impact of this value and what should be the default >>>>>> value of this option? >>>>>> >>>>> >>>>> I think the current default value of 5 seconds seems good enough. If >>>>> there are many bricks, each brick could send a notification per lock. 1000 >>>>> bricks would mean a client would receive 1000 notifications every 5 >>>>> seconds. It doesn't seem too much, but in those cases 10, and considering >>>>> we could have other locks, maybe a higher value could be better. 
>>>>>
>>>>> Xavi
>>>>>
>>>>>>
>>>>>> ---
>>>>>> Ashish
>>>>>>
>>>>>> ------------------------------
>>>>>> *From: *"Xavi Hernandez" 
>>>>>> *To: *"gluster-devel" 
>>>>>> *Cc: *"Pranith Kumar Karampuri" , "Ashish
>>>>>> Pandey" , "Amar Tumballi" 
>>>>>> *Sent: *Thursday, May 2, 2019 4:15:38 PM
>>>>>> *Subject: *Should we enable contention notification by default ?
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> there's a feature in the locks xlator that sends a notification to
>>>>>> current owner of a lock when another client tries to acquire the same lock.
>>>>>> This way the current owner is made aware of the contention and can release
>>>>>> the lock as soon as possible to allow the other client to proceed.
>>>>>>
>>>>>> This is specially useful when eager-locking is used and multiple
>>>>>> clients access the same files and directories. Currently both replicated
>>>>>> and dispersed volumes use eager-locking and can use contention notification
>>>>>> to force an early release of the lock.
>>>>>>
>>>>>> Eager-locking reduces the number of network requests required for
>>>>>> each operation, improving performance, but could add delays to other
>>>>>> clients while it keeps the inode or entry locked. With the contention
>>>>>> notification feature we avoid this delay, so we get the best performance
>>>>>> with minimal issues in multiclient environments.
>>>>>>
>>>>>> Currently the contention notification feature is controlled by the
>>>>>> 'features.lock-notify-contention' option and it's disabled by default.
>>>>>> Should we enable it by default ?
>>>>>>
>>>>>> I don't see any reason to keep it disabled by default. Does anyone
>>>>>> foresee any problem ?
>>>>>>
>>>>>
>> Is it a server only option? Otherwise it will break backward
>> compatibility if we rename the key. If an alias can get this fixed,
>> that's a better choice, but I'm not sure if it solves all the problems.
>>
>
> It's a server side option. I thought that an alias didn't have any other
> implication than accepting two names for the same option. Is there
> anything else I need to consider ?
>

If it's a server-side option then there's no challenge in the alias. If
you do rename it, though, volume set wouldn't work across heterogeneous
server versions.

>
>>
>>>>>> Regards,
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> --
>>>> Milind
>>>>
>>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> --Atin
>>
>
-- 
--Atin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pkalever at redhat.com  Thu May  2 17:34:41 2019
From: pkalever at redhat.com (Prasanna Kalever)
Date: Thu, 2 May 2019 23:04:41 +0530
Subject: [Gluster-devel] gluster-block v0.4 is alive!
Message-ID:

Hello Gluster folks,

The gluster-block team is happy to announce the v0.4 release [1].

This is the new stable version of gluster-block; lots of new and
exciting features and interesting bug fixes are made available as
part of this release.

Please find the big list of release highlights and notable fixes at [2].

Details about installation can be found in the easy install guide at [3].

Find the details about prerequisites and setup guide at [4].

If you are a new user, check out the demo video attached in the README
doc [5], which will be a good source of intro to the project.
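To give a quick flavour of the CLI, here is a minimal sketch; the host
address, volume and block names below are made up, so check the usage
docs for the authoritative syntax:

    # Create a 1 GiB block device backed by gluster volume "block-test",
    # exported from a single host (ha 1).
    gluster-block create block-test/sample-block ha 1 192.168.1.11 1GiB

    # List the block devices on the volume and inspect one of them.
    gluster-block list block-test
    gluster-block info block-test/sample-block

    # Remove the block device when done.
    gluster-block delete block-test/sample-block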
If you are a new user, checkout the demo video attached in the README doc [5], which will be a good source of intro to the project. There are good examples about how to use gluster-block both in the man pages [6] and test file [7] (also in the README). gluster-block is part of fedora package collection, an updated package with release version v0.4 will be soon made available. And the community provided packages will be soon made available at [8]. Please spend a minute to report any kind of issue that comes to your notice with this handy link [9]. We look forward to your feedback, which will help gluster-block get better! We would like to thank all our users, contributors for bug filing and fixes, also the whole team who involved in the huge effort with pre-release testing. [1] https://github.com/gluster/gluster-block [2] https://github.com/gluster/gluster-block/releases [3] https://github.com/gluster/gluster-block/blob/master/INSTALL [4] https://github.com/gluster/gluster-block#usage [5] https://github.com/gluster/gluster-block/blob/master/README.md [6] https://github.com/gluster/gluster-block/tree/master/docs [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t [8] https://download.gluster.org/pub/gluster/gluster-block/ [9] https://github.com/gluster/gluster-block/issues/new Cheers, Team Gluster-Block! From xhernandez at redhat.com Thu May 2 20:58:12 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 2 May 2019 22:58:12 +0200 Subject: [Gluster-devel] Weird performance behavior Message-ID: Hi, doing some tests to compare performance I've found some weird results. I've seen this in different tests, but probably the more clear an easier to reproduce is to use smallfile tool to create files. The test command is: # python smallfile_cli.py --operation create --files-per-dir 100 --file-size 32768 --threads 16 --files 256 --top --stonewall no I've run this test 5 times sequentially using the same initial conditions (at least this is what I think): bricks cleared, all gluster processes stopped, volume destroyed and recreated, caches emptied. This is the data I've obtained for each execution: Time us sy ni id wa hi si st read write use 435 1.80 3.70 0.00 81.62 11.06 0.00 0.00 0.00 32.931 608715.575 97.632 450 1.67 3.62 0.00 80.67 12.19 0.00 0.00 0.00 30.989 589078.308 97.714 425 1.74 3.75 0.00 81.85 10.76 0.00 0.00 0.00 37.588 622034.812 97.706 320 2.47 5.06 0.00 82.84 7.75 0.00 0.00 0.00 46.406 828637.359 96.891 365 2.19 4.44 0.00 84.45 7.12 0.00 0.00 0.00 45.822 734566.685 97.466 Time is in seconds. us, sy, ni, id, wa, hi, si and st are the CPU times, as reported by top. read and write are the disk throughput in KiB/s. use is the disk usage percentage. Based on this we can see that there's a big difference between the best and the worst cases. But it seems more relevant that when it performed better, in fact disk utilization and CPU wait time were a bit lower. Disk is a NVMe and I used a recent commit from master (2b86da69). Volume type is a replica 3 with 3 bricks. I'm not sure what can be causing this. Any idea ? can anyone try to reproduce it to see if it's a problem in my environment or it's a common problem ? Thanks, Xavi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Fri May 3 05:44:39 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Thu, 2 May 2019 22:44:39 -0700 Subject: [Gluster-devel] Query regarding dictionary logic In-Reply-To: References: Message-ID: Hi Mohit, Thank you for the update. 
More inline. On Wed, May 1, 2019 at 11:45 PM Mohit Agrawal wrote: > Hi Vijay, > > I have tried to execute smallfile tool on volume(12x3), i have not found > any significant performance improvement > for smallfile operations, I have configured 4 clients and 8 thread to run > operations. > For measuring performance, did you measure both time taken and cpu consumed? Normally O(n) computations are cpu expensive and we might see better results with a hash table when a large number of objects ( a few thousands) are present in a single dictionary. If you haven't gathered cpu statistics, please also gather that for comparison. > I have generated statedump and found below data for dictionaries specific > to gluster processes > > brick > max-pairs-per-dict=50 > total-pairs-used=192212171 > total-dicts-used=24794349 > average-pairs-per-dict=7 > > > glusterd > max-pairs-per-dict=301 > total-pairs-used=156677 > total-dicts-used=30719 > average-pairs-per-dict=5 > > > fuse process > [dict] > max-pairs-per-dict=50 > total-pairs-used=88669561 > total-dicts-used=12360543 > average-pairs-per-dict=7 > > It seems dictionary has max-pairs in case of glusterd and while no. of > volumes are high the number can be increased. > I think there is no performance regression in case of brick and fuse. I > have used hash_size 20 for the dictionary. > Let me know if you can provide some other test to validate the same. > A few more items to try out: 1. Vary the number of buckets and test. 2. Create about 10000 volumes and measure performance for a volume info operation on some random volume? 3. Check the related patch from Facebook and see if we can incorporate any ideas from their patch. Thanks, Vijay > Thanks, > Mohit Agrawal > > On Tue, Apr 30, 2019 at 2:29 PM Mohit Agrawal wrote: > >> Thanks, Amar for sharing the patch, I will test and share the result. >> >> On Tue, Apr 30, 2019 at 2:23 PM Amar Tumballi Suryanarayan < >> atumball at redhat.com> wrote: >> >>> Shreyas/Kevin tried to address it some time back using >>> https://bugzilla.redhat.com/show_bug.cgi?id=1428049 ( >>> https://review.gluster.org/16830) >>> >>> I vaguely remember the reason to keep the hash value 1 was done during >>> the time when we had dictionary itself sent as on wire protocol, and in >>> most other places, number of entries in dictionary was on an avg, 3. So, we >>> felt, saving on a bit of memory for optimization was better at that time. >>> >>> -Amar >>> >>> On Tue, Apr 30, 2019 at 12:02 PM Mohit Agrawal >>> wrote: >>> >>>> sure Vijay, I will try and update. >>>> >>>> Regards, >>>> Mohit Agrawal >>>> >>>> On Tue, Apr 30, 2019 at 11:44 AM Vijay Bellur >>>> wrote: >>>> >>>>> Hi Mohit, >>>>> >>>>> On Mon, Apr 29, 2019 at 7:15 AM Mohit Agrawal >>>>> wrote: >>>>> >>>>>> Hi All, >>>>>> >>>>>> I was just looking at the code of dict, I have one query current >>>>>> dictionary logic. >>>>>> I am not able to understand why we use hash_size is 1 for a >>>>>> dictionary.IMO with the >>>>>> hash_size of 1 dictionary always work like a list, not a hash, for >>>>>> every lookup >>>>>> in dictionary complexity is O(n). >>>>>> >>>>>> Before optimizing the code I just want to know what was the exact >>>>>> reason to define >>>>>> hash_size is 1? >>>>>> >>>>> >>>>> This is a good question. I looked up the source in gluster's historic >>>>> repo [1] and hash_size is 1 even there. So, this could have been the case >>>>> since the first version of the dictionary code. 
>>>>>
>>>>> Would you be able to run some tests with a larger hash_size and share
>>>>> your observations?
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>>
>>>>> [1]
>>>>> https://github.com/gluster/historic/blob/master/libglusterfs/src/dict.c
>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Fri May  3 06:04:50 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Fri, 3 May 2019 11:34:50 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA
 cluster solution back to gluster code as gluster-7 feature
In-Reply-To: <7d75b62f0eb0495782c46ef8521790d5@ul-exc-pr-mbx13.ulaval.ca>
References: <9BE7F129-DE42-46A5-896B-81460E605E9E@gmail.com>
 <7d75b62f0eb0495782c46ef8521790d5@ul-exc-pr-mbx13.ulaval.ca>
Message-ID:

On 30/04/19 6:41 PM, Renaud Fortier wrote:
>
> IMO, you should keep storhaug and maintain it. At the beginning, we
> were with pacemaker and corosync. Then we moved to storhaug with the
> upgrade to gluster 4.1.x. Now you are talking about going back like it
> was. Maybe it will be better with pacemaker and corosync, but the
> important thing is to have a solution that will be stable and
> maintained.
>
I agree it is very frustrating. There is no development planned for the
future unless someone picks it up and works on its stabilization and
improvement. My plan is just to get back what gluster and nfs-ganesha had
before.

--

Jiffin

> thanks
>
> Renaud
>
> *From:* gluster-users-bounces at gluster.org
> [mailto:gluster-users-bounces at gluster.org] *On behalf of* Jim Kinney
> *Sent:* 30 April 2019 08:20
> *To:* gluster-users at gluster.org; Jiffin Tony Thottan
> ; gluster-users at gluster.org; Gluster Devel
> ; gluster-maintainers at gluster.org;
> nfs-ganesha ; devel at lists.nfs-ganesha.org
> *Subject:* Re: [Gluster-users] Proposing to previous ganesha HA cluster
> solution back to gluster code as gluster-7 feature
>
> +1!
> I'm using nfs-ganesha in my next upgrade so my client systems can use
> NFS instead of fuse mounts. Having an integrated, designed in process
> to coordinate multiple nodes into an HA cluster will very welcome.
>
> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan
>  wrote:
>
>     Hi all,
>
>     Some of you folks may be familiar with HA solution provided for
>     nfs-ganesha by gluster using pacemaker and corosync.
>
>     That feature was removed in glusterfs 3.10 in favour for common HA
>     project "Storhaug". Even Storhaug was not progressed
>
>     much from last two years and current development is in halt state,
>     hence planning to restore old HA ganesha solution back
>
>     to gluster code repository with some improvement and targetting
>     for next gluster release 7.
>
>     I have opened up an issue [1] with details and posted initial set
>     of patches [2]
>
>     Please share your thoughts on the same
>
>     Regards,
>
>     Jiffin
>
>     [1] https://github.com/gluster/glusterfs/issues/663
>
>     [2]
>     https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>
> --
> Sent from my Android device with K-9 Mail. All tyopes are thumb
> related and reflect authenticity.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Fri May  3 06:08:07 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Fri, 3 May 2019 11:38:07 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA
 cluster solution back to gluster code as gluster-7 feature
In-Reply-To: <1028413072.2343069.1556630991785@mail.yahoo.com>
References: <1028413072.2343069.1556630991785@mail.yahoo.com>
Message-ID: <84885b70-e6b0-6e9b-f43d-a13dbafc6b6a@redhat.com>

On 30/04/19 6:59 PM, Strahil Nikolov wrote:
> Hi,
>
> I'm posting this again as it got bounced.
> Keep in mind that corosync/pacemaker is hard for proper setup by new
> admins/users.
>
> I'm still trying to remediate the effects of poor configuration at work.
> Also, storhaug is nice for hyperconverged setups where the host is not
> only hosting bricks, but other workloads.
> Corosync/pacemaker require proper fencing to be setup and most of the
> stonith resources 'shoot the other node in the head'.
> I would be happy to see an easy to deploy (let say
> 'cluster.enable-ha-ganesha true') and gluster to be bringing up the
> Floating IPs and taking care of the NFS locks, so no disruption will be
> felt by the clients.

It does take care of those, but certain prerequisites need to be
followed. Please note that fencing won't be configured for this setup;
we may think about it in the future.

--

Jiffin

>
> Still, this will be a lot of work to achieve.
>
> Best Regards,
> Strahil Nikolov
>
> On Apr 30, 2019 15:19, Jim Kinney wrote:
>>
>> +1!
>> I'm using nfs-ganesha in my next upgrade so my client systems can use
>> NFS instead of fuse mounts. Having an integrated, designed in process
>> to coordinate multiple nodes into an HA cluster will very welcome.
>>
>> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
>>>
>>> Hi all,
>>>
>>> Some of you folks may be familiar with HA solution provided for
>>> nfs-ganesha by gluster using pacemaker and corosync.
>>>
>>> That feature was removed in glusterfs 3.10 in favour for common HA
>>> project "Storhaug". Even Storhaug was not progressed
>>>
>>> much from last two years and current development is in halt state,
>>> hence planning to restore old HA ganesha solution back
>>>
>>> to gluster code repository with some improvement and targetting for
>>> next gluster release 7.
>>>
>>> I have opened up an issue [1] with details and posted initial set of
>>> patches [2]
>>>
>>> Please share your thoughts on the same
>>>
>>> Regards,
>>>
>>> Jiffin
>>>
>>> [1] https://github.com/gluster/glusterfs/issues/663
>>>
>>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>>>
>> --
>> Sent from my Android device with K-9 Mail. All tyopes are thumb
>> related and reflect authenticity.

From amukherj at redhat.com  Fri May  3 08:56:28 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Fri, 3 May 2019 14:26:28 +0530
Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed
 defects & annotations?
Message-ID:

I'm a bit puzzled by the way Coverity is reporting the open defects on the
GD1 component. As you can see from [1], technically we have 6 open defects
and all of the rest are marked as dismissed. We tried to put some
additional annotations in the code through [2] to see if Coverity starts
feeling happy, but the result doesn't change. I still see the report
complaining about 25 open defects for GD1 (7 High, 18 Medium and 1 Low).
More interestingly, yesterday's report claimed we fixed 8 defects and
introduced 1, but the overall count remained at 102. I'm not able to
connect the dots of this puzzle, can anyone?

[1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
[2] https://review.gluster.org/#/c/22619/

From jahernan at redhat.com  Fri May  3 09:29:21 2019
From: jahernan at redhat.com (Xavi Hernandez)
Date: Fri, 3 May 2019 11:29:21 +0200
Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed
 defects & annotations?
In-Reply-To:
References:
Message-ID:

Hi Atin,

On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee  wrote:

> I'm a bit puzzled by the way Coverity is reporting the open defects on the
> GD1 component. As you can see from [1], technically we have 6 open defects
> and all of the rest are marked as dismissed. We tried to put some
> additional annotations in the code through [2] to see if Coverity starts
> feeling happy, but the result doesn't change. I still see the report
> complaining about 25 open defects for GD1 (7 High, 18 Medium and 1 Low).
> More interestingly, yesterday's report claimed we fixed 8 defects and
> introduced 1, but the overall count remained at 102. I'm not able to
> connect the dots of this puzzle, can anyone?
>
> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
> [2] https://review.gluster.org/#/c/22619/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

Maybe we need to modify all dismissed CID's so that Coverity considers
them again and, hopefully, marks them as solved with the newer updates.
They have been manually marked to be ignored, so they are still there...
Just a thought, I'm not sure how this really works.
Xavi > > [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects > [2] https://review.gluster.org/#/c/22619/ > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Fri May 3 09:46:49 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Fri, 3 May 2019 15:16:49 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: > Hi Atin, > > On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee > wrote: > >> I'm bit puzzled on the way coverity is reporting the open defects on GD1 >> component. As you can see from [1], technically we have 6 open defects and >> all of the rest are being marked as dismissed. We tried to put some >> additional annotations in the code through [2] to see if coverity starts >> feeling happy but the result doesn't change. I still see in the report it >> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >> Low). More interestingly yesterday's report claimed we fixed 8 defects, >> introduced 1, but the overall count remained as 102. I'm not able to >> connect the dots of this puzzle, can anyone? >> > > Maybe we need to modify all dismissed CID's so that Coverity considers > them again and, hopefully, mark them as solved with the newer updates. They > have been manually marked to be ignored, so they are still there... > After yesterday?s run I set the severity for all of them to see if modifications to these CIDs make any difference or not. So fingers crossed till the next report comes :-) . > Just a thought, I'm not sure how this really works. > Same here, I don?t understand the exact workflow and hence seeking additional ideas. > Xavi > > >> >> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects >> [2] https://review.gluster.org/#/c/22619/ >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- - Atin (atinm) -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Fri May 3 10:36:36 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Fri, 3 May 2019 16:06:36 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, May 3, 2019 at 3:17 PM Atin Mukherjee wrote: > > > On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: > >> Hi Atin, >> >> On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee >> wrote: >> >>> I'm bit puzzled on the way coverity is reporting the open defects on GD1 >>> component. As you can see from [1], technically we have 6 open defects and >>> all of the rest are being marked as dismissed. We tried to put some >>> additional annotations in the code through [2] to see if coverity starts >>> feeling happy but the result doesn't change. I still see in the report it >>> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >>> Low). More interestingly yesterday's report claimed we fixed 8 defects, >>> introduced 1, but the overall count remained as 102. I'm not able to >>> connect the dots of this puzzle, can anyone? 
>>> >> >> Maybe we need to modify all dismissed CID's so that Coverity considers >> them again and, hopefully, mark them as solved with the newer updates. They >> have been manually marked to be ignored, so they are still there... >> > > After yesterday?s run I set the severity for all of them to see if > modifications to these CIDs make any difference or not. So fingers crossed > till the next report comes :-) . > If you noticed the previous day report, it was 101 'Open defects' and 65 'Dismissed' (which means, they are not 'fixed in code', but dismissed as false positive or ignore in CID dashboard. Now, it is 57 'Dismissed', which means, your patch has actually fixed 8 defects. > > >> Just a thought, I'm not sure how this really works. >> > > Same here, I don?t understand the exact workflow and hence seeking > additional ideas. > > Looks like we should consider overall open defects as Open + Dismissed. > >> Xavi >> >> >>> >>> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects >>> [2] https://review.gluster.org/#/c/22619/ >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -- > - Atin (atinm) > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Fri May 3 15:40:15 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Fri, 3 May 2019 21:10:15 +0530 Subject: [Gluster-devel] Coverity scan - how does it ignore dismissed defects & annotations? In-Reply-To: References: Message-ID: On Fri, 3 May 2019 at 16:07, Amar Tumballi Suryanarayan wrote: > > > On Fri, May 3, 2019 at 3:17 PM Atin Mukherjee wrote: > >> >> >> On Fri, 3 May 2019 at 14:59, Xavi Hernandez wrote: >> >>> Hi Atin, >>> >>> On Fri, May 3, 2019 at 10:57 AM Atin Mukherjee >>> wrote: >>> >>>> I'm bit puzzled on the way coverity is reporting the open defects on >>>> GD1 component. As you can see from [1], technically we have 6 open defects >>>> and all of the rest are being marked as dismissed. We tried to put some >>>> additional annotations in the code through [2] to see if coverity starts >>>> feeling happy but the result doesn't change. I still see in the report it >>>> complaints about open defect of GD1 as 25 (7 as High, 18 as medium and 1 as >>>> Low). More interestingly yesterday's report claimed we fixed 8 defects, >>>> introduced 1, but the overall count remained as 102. I'm not able to >>>> connect the dots of this puzzle, can anyone? >>>> >>> >>> Maybe we need to modify all dismissed CID's so that Coverity considers >>> them again and, hopefully, mark them as solved with the newer updates. They >>> have been manually marked to be ignored, so they are still there... >>> >> >> After yesterday?s run I set the severity for all of them to see if >> modifications to these CIDs make any difference or not. So fingers crossed >> till the next report comes :-) . >> > > If you noticed the previous day report, it was 101 'Open defects' and 65 > 'Dismissed' (which means, they are not 'fixed in code', but dismissed as > false positive or ignore in CID dashboard. > > Now, it is 57 'Dismissed', which means, your patch has actually fixed 8 > defects. > > >> >> >>> Just a thought, I'm not sure how this really works. 
>>>
>> Same here, I don't understand the exact workflow and hence I am seeking
>> additional ideas.
>>
> Looks like we should consider overall open defects as Open + Dismissed.
>

This is why I'm concerned. There are defects which we clearly can't or don't want to fix, and in that case, even though they are marked as dismissed, the overall open defect count doesn't come down. So we'd never be able to come down below the total number of dismissed defects :-( .

However, today's report brought the overall count down to 97 from 102. Coverity claimed we fixed 0 defects since the last scan, which means that somehow my updates to those dismissed GD1 defects did the trick for 5 defects. This continues to be a great puzzle for me!

>
>>> Xavi
>>>
>>>> [1] https://scan.coverity.com/projects/gluster-glusterfs/view_defects
>>>> [2] https://review.gluster.org/#/c/22619/
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> --
>> - Atin (atinm)
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
> --
> Amar Tumballi (amarts)

-- 
- Atin (atinm)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jenkins at build.gluster.org  Mon May 6 01:45:02 2019
From: jenkins at build.gluster.org (jenkins at build.gluster.org)
Date: Mon, 6 May 2019 01:45:02 +0000 (UTC)
Subject: [Gluster-devel] Weekly Untriaged Bugs
Message-ID: <324506145.71.1557107102526.JavaMail.jenkins@jenkins-el7.rht.gluster.org>

[...truncated 6 lines...]
https://bugzilla.redhat.com/1702316 / core: Cannot upgrade 5.x volume to 6.1 because of unused 'crypt' and 'bd' xlators
https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed immediately even with O_SYNC in glfs_create or with glfs_fsync/glfs_fdatasync after glfs_write.
https://bugzilla.redhat.com/1698861 / disperse: Renaming a directory when 2 bricks of multiple disperse subvols are down leaves both old and new dirs on the bricks.
https://bugzilla.redhat.com/1697293 / distribute: DHT: print hash and layout values in hexadecimal format in the logs
https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes.
https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE
https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd
https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running
https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job
https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level
https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work
https://bugzilla.redhat.com/1699309 / snapshot: Gluster snapshot fails with systemd autmounted bricks
https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229
https://bugzilla.redhat.com/1697812 / website: mention a pointer to all the mailing lists available under glusterfs project(https://www.gluster.org/community/)
[...truncated 2 lines...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log
Type: application/octet-stream
Size: 2089 bytes
Desc: not available
URL: 

From cynthia.zhou at nokia-sbell.com  Mon May 6 02:34:08 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Mon, 6 May 2019 02:34:08 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
	<6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
	<3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
	<5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID: 

Hi,
Sorry, I am so busy with other issues these days; could you help me to submit my patch for review? It is based on the glusterfs 3.12.15 code.
But even with this patch the memory leak still exists; from the memory-leak tool it should be related to ssl_accept, and I am not sure whether it is because of the openssl library or because of improper use of the ssl interfaces.

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) {
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) {
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     if (priv->ssl_private_key) {
         GF_FREE(priv->ssl_private_key);
     }

From: Amar Tumballi Suryanarayan
Sent: Wednesday, May 01, 2019 8:43 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Milind Changire; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) wrote:

Ok, I am clear now.
I've added ssl_free in the socket reset and socket finish functions. Though the glusterfsd memory leak is not that much, it is still leaking; from the source code I can not find anything else.
Could you help to check if this issue exists in your env? If not, I may have a try to merge your patch.

Step
1> while true;do gluster v heal info,
2> check the vol-name glusterfsd memory usage, it is obviously increasing.

cynthia

From: Milind Changire
Sent: Monday, April 22, 2019 2:36 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Atin Mukherjee; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

According to the BIO_new_socket() man page ...

If the close flag is set then the socket is shut down and closed when the BIO is freed.

For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever the BIO is freed.
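To make the interaction explicit: because the transport creates its BIO with BIO_NOCLOSE, freeing the SSL object does not close the underlying file descriptor, so a teardown along the lines of the patch above has to release the SSL state itself and then let the transport close the socket on its own terms. A sketch of that ordering, assuming OpenSSL 1.0.x and the socket_private_t fields used in socket.c; this is an illustration, not the committed code:

#include <openssl/ssl.h>

/* "priv" mirrors socket_private_t from rpc/rpc-transport/socket/src/
 * socket.h: priv->ssl_ssl is the per-connection SSL object and
 * priv->sock the raw fd. */
static void
ssl_teardown_connection(socket_private_t *priv)
{
    if (priv->use_ssl && priv->ssl_ssl) {
        /* Send our close_notify; a second SSL_shutdown() would wait
         * for the peer's close_notify, which teardown does not need. */
        SSL_shutdown(priv->ssl_ssl);
        /* SSL_free() releases the record buffers that valgrind traces
         * back to ssl3_setup_buffers(); the extra SSL_clear() in the
         * patch is harmless but redundant, since SSL_free() releases
         * everything anyway. */
        SSL_free(priv->ssl_ssl);
        priv->ssl_ssl = NULL;
    }
    /* Because the BIO was created with BIO_NOCLOSE, priv->sock is
     * still open here and the transport closes it itself. */
}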
_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cynthia.zhou at nokia-sbell.com  Mon May 6 06:11:58 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Mon, 6 May 2019 06:11:58 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
	<6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
	<3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
	<5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
Message-ID: <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>

Hi,
From our tests, valgrind and libleak both blame ssl3_accept.

/////////////////////////// from valgrind attached to glusterfsd ///////////////////////////

==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==
==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==

valgrind --leak-check=f

//////////////////////////////////// with libleak attached to glusterfsd ////////////////////////////////////

callstack[2419] expires. count=1 size=224/224 alloc=362 free=350
/home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065]
/lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978]
/lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749]
/lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced]
/lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2]
/lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f]
/lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46]
/lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da]
/lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf]

callstack[2432] expires. count=1 size=104/104 alloc=362 free=0
/home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065]
/lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978]
/lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627]
/lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd]
/lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9]
/lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d]
/lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972]
/lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5]
/lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51]
/lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b]
/lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a]
/usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2]
/lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f]
/lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46]
/lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da]
/lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf]

One interesting thing is that the memory goes up to about 300 MB and then stops increasing! I am wondering if this is caused by the openssl library? But when I searched the openssl community, no such issue has been reported before. Is glusterfs using ssl_accept correctly?

cynthia

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 10:34 AM
To: 'Amar Tumballi Suryanarayan'
Cc: Milind Changire; gluster-devel at gluster.org
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,
Sorry, I am so busy with other issues these days; could you help me to submit my patch for review? It is based on the glusterfs 3.12.15 code.
But even with this patch the memory leak still exists; from the memory-leak tool it should be related to ssl_accept, and I am not sure whether it is because of the openssl library or because of improper use of the ssl interfaces.
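One pattern in the valgrind records earlier in this message is worth calling out: the two "definitely lost" records are 198,720 and 200,544 bytes across 12 blocks each, i.e. about 16,560 and 16,712 bytes per block, which is roughly one maximum TLS record plus overhead. That matches ssl3_setup_read_buffer()/ssl3_setup_write_buffer() keeping one read and one write buffer per accepted connection, and would also fit memory growing and then plateauing once every connection has its buffers. If the per-connection buffers (rather than unbounded growth) are the main cost, OpenSSL >= 1.0.0 can be asked to drop them while a connection is idle. A hedged sketch, not present in the Gluster code, and whether it helps these numbers is untested:

#include <openssl/ssl.h>

/* Ask OpenSSL to free a connection's ~17 KiB read/write buffers
 * whenever they are empty, instead of holding them for the lifetime
 * of the SSL object. Illustrative only. */
static void
ssl_ctx_release_idle_buffers(SSL_CTX *ctx)
{
    SSL_CTX_set_mode(ctx, SSL_MODE_RELEASE_BUFFERS);
}

The patch below frees the SSL object on socket reset and in fini, which releases those buffers when a connection goes away; SSL_MODE_RELEASE_BUFFERS would additionally release them while connections stay open.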
--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) {
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) {
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_INFO,
+               "clear and reset for socket(%d), free ssl ",
+               priv->sock);
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     if (priv->ssl_private_key) {
         GF_FREE(priv->ssl_private_key);
     }

From: Amar Tumballi Suryanarayan
Sent: Wednesday, May 01, 2019 8:43 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Milind Changire; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi Cynthia Zhou,

Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master.

-Amar

On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) wrote:

Ok, I am clear now.
I've added ssl_free in the socket reset and socket finish functions. Though the glusterfsd memory leak is not that much, it is still leaking; from the source code I can not find anything else.
Could you help to check if this issue exists in your env? If not, I may have a try to merge your patch.

Step
1> while true;do gluster v heal info,
2> check the vol-name glusterfsd memory usage, it is obviously increasing.

cynthia

From: Milind Changire
Sent: Monday, April 22, 2019 2:36 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Atin Mukherjee; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

According to the BIO_new_socket() man page ...

If the close flag is set then the socket is shut down and closed when the BIO is freed.

For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever the BIO is freed.

_______________________________________________
Gluster-devel mailing list
Gluster-devel at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From atumball at redhat.com  Mon May 6 18:15:04 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Mon, 6 May 2019 14:15:04 -0400
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To: 
References: 
Message-ID: 

On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:

> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block, lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>

Good work Team (Prasanna and Xiubo Li to be precise)!!
This was a much-needed release for the gluster-block project, mainly because of the number of improvements made since the last release. Also, gluster-block release 0.3 was not compatible with the glusterfs-6.x series.

All, feel free to use it if your deployment has any use case for block storage, and give us feedback. Happy to make sure gluster-block is stable for you.

Regards,
Amar

> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and setup guide at [4].
> If you are a new user, checkout the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection, an updated package
> with release version v0.4 will soon be made available. And the
> community provided packages will soon be made available at [8].
>
> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users and contributors for bug filing and
> fixes, and also the whole team involved in the huge effort with
> pre-release testing.
>
> [1] https://github.com/gluster/gluster-block
> [2] https://github.com/gluster/gluster-block/releases
> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> [4] https://github.com/gluster/gluster-block#usage
> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> [6] https://github.com/gluster/gluster-block/tree/master/docs
> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> [8] https://download.gluster.org/pub/gluster/gluster-block/
> [9] https://github.com/gluster/gluster-block/issues/new
>
> Cheers,
> Team Gluster-Block!
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

-- 
Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rajibcse2k10 at gmail.com  Mon May 6 20:04:59 2019
From: rajibcse2k10 at gmail.com (Rajib Hossen)
Date: Mon, 6 May 2019 15:04:59 -0500
Subject: [Gluster-devel] New in GlusterFS
Message-ID: 

Hello all,
I am new to glusterfs development. I would like to contribute to the Erasure Coding part of glusterfs. I have already studied non-systematic coding and its theory. Now I want to know how erasure-coded read/write works in terms of code. Could you please point me to any documentation that will help me understand the glusterfs ec read/write path and code structure? Any help is appreciated. Thank you very much.

Sincerely,
Md Rajib Hossen
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jthottan at redhat.com  Tue May 7 04:10:11 2019
From: jthottan at redhat.com (Jiffin Tony Thottan)
Date: Tue, 7 May 2019 09:40:11 +0530
Subject: [Gluster-devel] [Gluster-users] Proposing to previous ganesha HA clustersolution back to gluster code as gluster-7 feature
In-Reply-To: 
References: 
Message-ID: 

Hi

On 04/05/19 12:04 PM, Strahil wrote:
> Hi Jiffin,
>
> No vendor will support your corosync/pacemaker stack if you do not have proper fencing.
> As Gluster is already a cluster of its own, it makes sense to control everything from there.
>
> Best Regards,

Yeah, I agree with your point.
What I meant to say is that, by default, this feature won't provide any fencing mechanism; the user needs to manually configure fencing for the cluster. In the future we can try to include a default fencing configuration for the ganesha cluster as part of the Ganesha HA configuration.

--
Jiffin

> Still, this will be a lot of work to achieve.
>
> Best Regards,
> Strahil Nikolov

On May 3, 2019 09:08, Jiffin Tony Thottan wrote:
>>
>> On 30/04/19 6:59 PM, Strahil Nikolov wrote:
>>> Hi,
>>>
>>> I'm posting this again as it got bounced.
>>> Keep in mind that corosync/pacemaker is hard to set up properly for new admins/users.
>>>
>>> I'm still trying to remediate the effects of poor configuration at work.
>>> Also, storhaug is nice for hyperconverged setups where the host is not only hosting bricks, but other workloads.
>>> Corosync/pacemaker require proper fencing to be set up, and most of the stonith resources 'shoot the other node in the head'.
>>> I would be happy to see something easy to deploy (let's say 'cluster.enable-ha-ganesha true') with gluster bringing up the floating IPs and taking care of the NFS locks, so no disruption will be felt by the clients.
>>
>> It does take care of those, but certain prerequisites need to be followed; please note that fencing won't be configured for this setup. We may think about that in the future.
>>
>> --
>>
>> Jiffin
>>
>>> Still, this will be a lot of work to achieve.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Apr 30, 2019 15:19, Jim Kinney wrote:
>>>>
>>>> +1!
>>>> I'm using nfs-ganesha in my next upgrade so my client systems can use NFS instead of fuse mounts. Having an integrated, designed-in process to coordinate multiple nodes into an HA cluster will be very welcome.
>>>>
>>>> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
>>>>> Hi all,
>>>>>
>>>>> Some of you folks may be familiar with the HA solution provided for nfs-ganesha by gluster using pacemaker and corosync.
>>>>>
>>>>> That feature was removed in glusterfs 3.10 in favour of the common HA project "Storhaug". Storhaug has not progressed much over the last two years and its development is currently halted, hence the plan to restore the old HA ganesha solution back to the gluster code repository, with some improvements, targeting the next gluster release 7.
>>>>>
>>>>> I have opened up an issue [1] with details and posted an initial set of patches [2]
>>>>>
>>>>> Please share your thoughts on the same
>>>>>
>>>>> Regards,
>>>>>
>>>>> Jiffin
>>>>>
>>>>> [1] https://github.com/gluster/glusterfs/issues/663
>>>>>
>>>>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
>>>>
>>>> --
>>>> Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.

From ndevos at redhat.com  Tue May 7 05:35:34 2019
From: ndevos at redhat.com (Niels de Vos)
Date: Tue, 7 May 2019 07:35:34 +0200
Subject: [Gluster-devel] gluster-block v0.4 is alive!
In-Reply-To: 
References: 
Message-ID: <20190507053534.GF5209@ndevos-x270>

On Thu, May 02, 2019 at 11:04:41PM +0530, Prasanna Kalever wrote:
> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block, lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>
> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and setup guide at [4].
> If you are a new user, checkout the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection, an updated package
> with release version v0.4 will soon be made available. And the
> community provided packages will soon be made available at [8].

Updates for Fedora are available in the testing repositories:

Fedora 30: https://bodhi.fedoraproject.org/updates/FEDORA-2019-76730d7230
Fedora 29: https://bodhi.fedoraproject.org/updates/FEDORA-2019-cc7cdce2a4
Fedora 28: https://bodhi.fedoraproject.org/updates/FEDORA-2019-9e9a210110

Installation instructions can be found at the above links. Please leave testing feedback as comments on the Fedora Update pages.

Thanks,
Niels

> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users, contributors for bug filing and
> fixes, also the whole team who involved in the huge effort with
> pre-release testing.
> > > [1] https://github.com/gluster/gluster-block > [2] https://github.com/gluster/gluster-block/releases > [3] https://github.com/gluster/gluster-block/blob/master/INSTALL > [4] https://github.com/gluster/gluster-block#usage > [5] https://github.com/gluster/gluster-block/blob/master/README.md > [6] https://github.com/gluster/gluster-block/tree/master/docs > [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t > [8] https://download.gluster.org/pub/gluster/gluster-block/ > [9] https://github.com/gluster/gluster-block/issues/new > > Cheers, > Team Gluster-Block! > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel From aspandey at redhat.com Tue May 7 09:03:52 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Tue, 7 May 2019 05:03:52 -0400 (EDT) Subject: [Gluster-devel] New in GlusterFS In-Reply-To: References: Message-ID: <1109996399.17152975.1557219832489.JavaMail.zimbra@redhat.com> Hi Rajib, Welcome to the gluster community. I am attaching some of the documents which I found while I started working on Erasure Coded volumes. Once you clone, you can also find out following documents - glusterfs/doc/developer-guide/ec-implementation.md You can start with above documents and code reading, If you have any doubts, please feel free to send it to gluster-devel mailing list. Also, you can attend gluster community meetings to discuss technical details of gluster. https://github.com/gluster/community Meeting schedule - * APAC friendly hours * Tuesday 14th May 2019, 11:30AM IST * Bridge: https://bluejeans.com/836554017 * NA/EMEA * Tuesday 7th May 2019, 01:00 PM EDT * Bridge: https://bluejeans.com/486278655 --- Ashish ----- Original Message ----- From: "Rajib Hossen" To: gluster-devel at gluster.org Sent: Tuesday, May 7, 2019 1:34:59 AM Subject: [Gluster-devel] New in GlusterFS Hello all, I am new in glusterfs development. I would like to contribute in Erasure Coding part of glusterfs. I already studied non-systematic code and its theory. Now, I want to know how erasure coding read/write works in terms of coding. Can you please give me any documentation that'll help to understand glusterfs ec read/write, coding structure. Please any help is appreciated. Thanks you very much. Sincerely, Md Rajib Hossen _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Erasure Coding - Design Type: application/octet-stream Size: 10021 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Disperse_Xlator_Ramon_Datalab.pdf Type: application/pdf Size: 1594327 bytes Desc: not available URL: From aspandey at redhat.com Tue May 7 09:19:05 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Tue, 7 May 2019 05:19:05 -0400 (EDT) Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> Message-ID: <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Hi, While we send a mail on gluster-devel or gluster-user mailing list, following content gets auto generated and placed at the end of mail. 
Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? Like this - Meeting schedule - * APAC friendly hours * Tuesday 14th May 2019 , 11:30AM IST * Bridge: https://bluejeans.com/836554017 * NA/EMEA * Tuesday 7th May 2019 , 01:00 PM EDT * Bridge: https://bluejeans.com/486278655 Or just a link to meeting minutes details?? https://github.com/gluster/community/tree/master/meetings This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. --- Ashish -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Tue May 7 14:34:52 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Tue, 7 May 2019 20:04:52 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Looks like is_nfs_export_available started failing again in recent centos-regressions. Michael, can you please check? On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer > wrote: > >> Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >> > Is this back again? The recent patches are failing regression :-\ . >> >> So, on builder206, it took me a while to find that the issue is that >> nfs (the service) was running. >> >> ./tests/basic/afr/tarissue.t failed, because the nfs initialisation >> failed with a rather cryptic message: >> >> [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0- >> socket.nfs-server: process started listening on port (38465) >> [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0- >> socket.nfs-server: binding to failed: Address already in use >> [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0- >> socket.nfs-server: Port is already in use >> [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >> socket.nfs-server: __socket_server_bind failed;closing socket 14 >> >> I found where this came from, but a few stuff did surprised me: >> >> - the order of print is different that the order in the code >> > > Indeed strange... > >> - the message on "started listening" didn't take in account the fact >> that bind failed on: >> > > Shouldn't it bail out if it failed to bind? > Some missing 'goto out' around line 975/976? > Y. > >> >> >> >> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >> >> The message about port 38465 also threw me off the track. The real >> issue is that the service nfs was already running, and I couldn't find >> anything listening on port 38465 >> >> once I do service nfs stop, it no longer failed. >> >> So far, I do know why nfs.service was activated. >> >> But at least, 206 should be fixed, and we know a bit more on what would >> be causing some failure. >> >> >> >> > On Wed, 3 Apr 2019 at 19:26, Michael Scherer >> > wrote: >> > >> > > Le mercredi 03 avril 2019 ? 
16:30 +0530, Atin Mukherjee a ?crit : >> > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >> > > > jthottan at redhat.com> >> > > > wrote: >> > > > >> > > > > Hi, >> > > > > >> > > > > is_nfs_export_available is just a wrapper around "showmount" >> > > > > command AFAIR. >> > > > > I saw following messages in console output. >> > > > > mount.nfs: rpc.statd is not running but is required for remote >> > > > > locking. >> > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, >> > > > > or >> > > > > start >> > > > > statd. >> > > > > 05:06:55 mount.nfs: an incorrect mount option was specified >> > > > > >> > > > > For me it looks rpcbind may not be running on the machine. >> > > > > Usually rpcbind starts automatically on machines, don't know >> > > > > whether it >> > > > > can happen or not. >> > > > > >> > > > >> > > > That's precisely what the question is. Why suddenly we're seeing >> > > > this >> > > > happening too frequently. Today I saw atleast 4 to 5 such >> > > > failures >> > > > already. >> > > > >> > > > Deepshika - Can you please help in inspecting this? >> > > >> > > So we think (we are not sure) that the issue is a bit complex. >> > > >> > > What we were investigating was nightly run fail on aws. When the >> > > build >> > > crash, the builder is restarted, since that's the easiest way to >> > > clean >> > > everything (since even with a perfect test suite that would clean >> > > itself, we could always end in a corrupt state on the system, WRT >> > > mount, fs, etc). >> > > >> > > In turn, this seems to cause trouble on aws, since cloud-init or >> > > something rename eth0 interface to ens5, without cleaning to the >> > > network configuration. >> > > >> > > So the network init script fail (because the image say "start eth0" >> > > and >> > > that's not present), but fail in a weird way. Network is >> > > initialised >> > > and working (we can connect), but the dhclient process is not in >> > > the >> > > right cgroup, and network.service is in failed state. Restarting >> > > network didn't work. In turn, this mean that rpc-statd refuse to >> > > start >> > > (due to systemd dependencies), which seems to impact various NFS >> > > tests. >> > > >> > > We have also seen that on some builders, rpcbind pick some IP v6 >> > > autoconfiguration, but we can't reproduce that, and there is no ip >> > > v6 >> > > set up anywhere. I suspect the network.service failure is somehow >> > > involved, but fail to see how. In turn, rpcbind.socket not starting >> > > could cause NFS test troubles. >> > > >> > > Our current stop gap fix was to fix all the builders one by one. >> > > Remove >> > > the config, kill the rogue dhclient, restart network service. >> > > >> > > However, we can't be sure this is going to fix the problem long >> > > term >> > > since this only manifest after a crash of the test suite, and it >> > > doesn't happen so often. (plus, it was working before some day in >> > > the >> > > past, when something did make this fail, and I do not know if >> > > that's a >> > > system upgrade, or a test change, or both). >> > > >> > > So we are still looking at it to have a complete understanding of >> > > the >> > > issue, but so far, we hacked our way to make it work (or so do I >> > > think). >> > > >> > > Deepshika is working to fix it long term, by fixing the issue >> > > regarding >> > > eth0/ens5 with a new base image. 
>> > > -- >> > > Michael Scherer >> > > Sysadmin, Community Infrastructure and Platform, OSAS >> > > >> > > >> > > -- >> > >> > - Atin (atinm) >> -- >> Michael Scherer >> Sysadmin, Community Infrastructure >> >> >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Tue May 7 16:11:29 2019 From: mscherer at redhat.com (Michael Scherer) Date: Tue, 07 May 2019 18:11:29 +0200 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : > Looks like is_nfs_export_available started failing again in recent > centos-regressions. > > Michael, can you please check? I will try but I am leaving for vacation tonight, so if I find nothing, until I leave, I guess Deepshika will have to look. > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > > > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < > > mscherer at redhat.com> > > wrote: > > > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > > > > Is this back again? The recent patches are failing regression > > > > :-\ . > > > > > > So, on builder206, it took me a while to find that the issue is > > > that > > > nfs (the service) was running. > > > > > > ./tests/basic/afr/tarissue.t failed, because the nfs > > > initialisation > > > failed with a rather cryptic message: > > > > > > [2019-04-23 13:17:05.371733] I > > > [socket.c:991:__socket_server_bind] 0- > > > socket.nfs-server: process started listening on port (38465) > > > [2019-04-23 13:17:05.385819] E > > > [socket.c:972:__socket_server_bind] 0- > > > socket.nfs-server: binding to failed: Address already in use > > > [2019-04-23 13:17:05.385843] E > > > [socket.c:974:__socket_server_bind] 0- > > > socket.nfs-server: Port is already in use > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 > > > > > > I found where this came from, but a few stuff did surprised me: > > > > > > - the order of print is different that the order in the code > > > > > > > Indeed strange... > > > > > - the message on "started listening" didn't take in account the > > > fact > > > that bind failed on: > > > > > > > Shouldn't it bail out if it failed to bind? > > Some missing 'goto out' around line 975/976? > > Y. > > > > > > > > > > > > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 > > > > > > The message about port 38465 also threw me off the track. The > > > real > > > issue is that the service nfs was already running, and I couldn't > > > find > > > anything listening on port 38465 > > > > > > once I do service nfs stop, it no longer failed. > > > > > > So far, I do know why nfs.service was activated. > > > > > > But at least, 206 should be fixed, and we know a bit more on what > > > would > > > be causing some failure. 
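The missing-bail-out theory from the exchange above is easy to picture. A hedged sketch of the suspected fix in __socket_server_bind(), illustrative only and not the actual socket.c lines around 975/976; gf_log() and the GF_LOG_* levels are Gluster's real logging API (from its logging.h), everything else here is simplified:

#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Simplified shape of the suspected fix: jump past the success log
 * when bind() fails, so "process started listening" is only printed
 * after a successful bind. */
static int
server_bind_sketch(int sock, struct sockaddr *addr, socklen_t len, int port)
{
    int ret = bind(sock, addr, len);

    if (ret == -1) {
        gf_log("socket", GF_LOG_ERROR, "binding to port %d failed: %s",
               port, strerror(errno));
        goto out; /* the 'goto out' Yaniv suspects is missing */
    }

    gf_log("socket", GF_LOG_INFO,
           "process started listening on port (%d)", port);

out:
    return ret;
}

This would also account for the confusing log output quoted above: without an early bail-out, both the failure messages and the "started listening" message can be emitted for the same socket.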
> > > > > > > > > > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < > > > > mscherer at redhat.com> > > > > wrote: > > > > > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a > > > > > ?crit : > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > > > > jthottan at redhat.com> > > > > > > wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > is_nfs_export_available is just a wrapper around > > > > > > > "showmount" > > > > > > > command AFAIR. > > > > > > > I saw following messages in console output. > > > > > > > mount.nfs: rpc.statd is not running but is required for > > > > > > > remote > > > > > > > locking. > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks > > > > > > > local, > > > > > > > or > > > > > > > start > > > > > > > statd. > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was > > > > > > > specified > > > > > > > > > > > > > > For me it looks rpcbind may not be running on the > > > > > > > machine. > > > > > > > Usually rpcbind starts automatically on machines, don't > > > > > > > know > > > > > > > whether it > > > > > > > can happen or not. > > > > > > > > > > > > > > > > > > > That's precisely what the question is. Why suddenly we're > > > > > > seeing > > > > > > this > > > > > > happening too frequently. Today I saw atleast 4 to 5 such > > > > > > failures > > > > > > already. > > > > > > > > > > > > Deepshika - Can you please help in inspecting this? > > > > > > > > > > So we think (we are not sure) that the issue is a bit > > > > > complex. > > > > > > > > > > What we were investigating was nightly run fail on aws. When > > > > > the > > > > > build > > > > > crash, the builder is restarted, since that's the easiest way > > > > > to > > > > > clean > > > > > everything (since even with a perfect test suite that would > > > > > clean > > > > > itself, we could always end in a corrupt state on the system, > > > > > WRT > > > > > mount, fs, etc). > > > > > > > > > > In turn, this seems to cause trouble on aws, since cloud-init > > > > > or > > > > > something rename eth0 interface to ens5, without cleaning to > > > > > the > > > > > network configuration. > > > > > > > > > > So the network init script fail (because the image say "start > > > > > eth0" > > > > > and > > > > > that's not present), but fail in a weird way. Network is > > > > > initialised > > > > > and working (we can connect), but the dhclient process is not > > > > > in > > > > > the > > > > > right cgroup, and network.service is in failed state. > > > > > Restarting > > > > > network didn't work. In turn, this mean that rpc-statd refuse > > > > > to > > > > > start > > > > > (due to systemd dependencies), which seems to impact various > > > > > NFS > > > > > tests. > > > > > > > > > > We have also seen that on some builders, rpcbind pick some IP > > > > > v6 > > > > > autoconfiguration, but we can't reproduce that, and there is > > > > > no ip > > > > > v6 > > > > > set up anywhere. I suspect the network.service failure is > > > > > somehow > > > > > involved, but fail to see how. In turn, rpcbind.socket not > > > > > starting > > > > > could cause NFS test troubles. > > > > > > > > > > Our current stop gap fix was to fix all the builders one by > > > > > one. > > > > > Remove > > > > > the config, kill the rogue dhclient, restart network service. 
> > > > > > > > > > However, we can't be sure this is going to fix the problem > > > > > long > > > > > term > > > > > since this only manifest after a crash of the test suite, and > > > > > it > > > > > doesn't happen so often. (plus, it was working before some > > > > > day in > > > > > the > > > > > past, when something did make this fail, and I do not know if > > > > > that's a > > > > > system upgrade, or a test change, or both). > > > > > > > > > > So we are still looking at it to have a complete > > > > > understanding of > > > > > the > > > > > issue, but so far, we hacked our way to make it work (or so > > > > > do I > > > > > think). > > > > > > > > > > Deepshika is working to fix it long term, by fixing the issue > > > > > regarding > > > > > eth0/ens5 with a new base image. > > > > > -- > > > > > Michael Scherer > > > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > > > > > > > -- > > > > > > > > - Atin (atinm) > > > > > > -- > > > Michael Scherer > > > Sysadmin, Community Infrastructure > > > > > > > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From rabhat at redhat.com Tue May 7 18:14:33 2019 From: rabhat at redhat.com (FNU Raghavendra Manjunath) Date: Tue, 7 May 2019 14:14:33 -0400 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: + 1 to this. There is also one more thing. For some reason, the community meeting is not visible in my calendar (especially NA region). I am not sure if anyone else also facing this issue. Regards, Raghavendra On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > Hi, > > While we send a mail on gluster-devel or gluster-user mailing list, > following content gets auto generated and placed at the end of mail. > > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? > Like this - > > Meeting schedule - > > > - APAC friendly hours > - Tuesday 14th May 2019, 11:30AM IST > - Bridge: https://bluejeans.com/836554017 > - NA/EMEA > - Tuesday 7th May 2019, 01:00 PM EDT > - Bridge: https://bluejeans.com/486278655 > > Or just a link to meeting minutes details?? > https://github.com/gluster/community/tree/master/meetings > > This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. 
> > --- > Ashish > > > > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Tue May 7 18:37:27 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Tue, 7 May 2019 11:37:27 -0700 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath wrote: > > + 1 to this. > I have updated the footer of gluster-devel. If that looks ok, we can extend it to gluster-users too. In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always stick to the first 4 Tuesdays of every month. That will help in describing the community meeting schedule better. If we want to keep the schedule running on alternate Tuesdays, please let me know and the mailing list footers can be updated accordingly :-). > There is also one more thing. For some reason, the community meeting is > not visible in my calendar (especially NA region). I am not sure if anyone > else also facing this issue. > I did face this issue. Realized that we had a meeting today and showed up at the meeting a while later but did not see many participants. Perhaps, the calendar invite has to be made a recurring one. Thanks, Vijay > > Regards, > Raghavendra > > On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > >> Hi, >> >> While we send a mail on gluster-devel or gluster-user mailing list, >> following content gets auto generated and placed at the end of mail. >> >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? >> Like this - >> >> Meeting schedule - >> >> >> - APAC friendly hours >> - Tuesday 14th May 2019, 11:30AM IST >> - Bridge: https://bluejeans.com/836554017 >> - NA/EMEA >> - Tuesday 7th May 2019, 01:00 PM EDT >> - Bridge: https://bluejeans.com/486278655 >> >> Or just a link to meeting minutes details?? >> https://github.com/gluster/community/tree/master/meetings >> >> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >> >> --- >> Ashish >> >> >> >> >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Tue May 7 18:53:05 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Wed, 8 May 2019 00:23:05 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? 
In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Sanju, can you please give us more info about the failures. I see the failures occurring on just one of the builder (builder206). I'm taking it back offline for now. On Tue, May 7, 2019 at 9:42 PM Michael Scherer wrote: > Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : > > Looks like is_nfs_export_available started failing again in recent > > centos-regressions. > > > > Michael, can you please check? > > I will try but I am leaving for vacation tonight, so if I find nothing, > until I leave, I guess Deepshika will have to look. > > > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: > > > > > > > > > > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < > > > mscherer at redhat.com> > > > wrote: > > > > > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > > > > > Is this back again? The recent patches are failing regression > > > > > :-\ . > > > > > > > > So, on builder206, it took me a while to find that the issue is > > > > that > > > > nfs (the service) was running. > > > > > > > > ./tests/basic/afr/tarissue.t failed, because the nfs > > > > initialisation > > > > failed with a rather cryptic message: > > > > > > > > [2019-04-23 13:17:05.371733] I > > > > [socket.c:991:__socket_server_bind] 0- > > > > socket.nfs-server: process started listening on port (38465) > > > > [2019-04-23 13:17:05.385819] E > > > > [socket.c:972:__socket_server_bind] 0- > > > > socket.nfs-server: binding to failed: Address already in use > > > > [2019-04-23 13:17:05.385843] E > > > > [socket.c:974:__socket_server_bind] 0- > > > > socket.nfs-server: Port is already in use > > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- > > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 > > > > > > > > I found where this came from, but a few stuff did surprised me: > > > > > > > > - the order of print is different that the order in the code > > > > > > > > > > Indeed strange... > > > > > > > - the message on "started listening" didn't take in account the > > > > fact > > > > that bind failed on: > > > > > > > > > > Shouldn't it bail out if it failed to bind? > > > Some missing 'goto out' around line 975/976? > > > Y. > > > > > > > > > > > > > > > > > > > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 > > > > > > > > The message about port 38465 also threw me off the track. The > > > > real > > > > issue is that the service nfs was already running, and I couldn't > > > > find > > > > anything listening on port 38465 > > > > > > > > once I do service nfs stop, it no longer failed. > > > > > > > > So far, I do know why nfs.service was activated. > > > > > > > > But at least, 206 should be fixed, and we know a bit more on what > > > > would > > > > be causing some failure. > > > > > > > > > > > > > > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < > > > > > mscherer at redhat.com> > > > > > wrote: > > > > > > > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a > > > > > > ?crit : > > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > > > > > jthottan at redhat.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > is_nfs_export_available is just a wrapper around > > > > > > > > "showmount" > > > > > > > > command AFAIR. > > > > > > > > I saw following messages in console output. 
> > > > > > > > mount.nfs: rpc.statd is not running but is required for > > > > > > > > remote > > > > > > > > locking. > > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks > > > > > > > > local, > > > > > > > > or > > > > > > > > start > > > > > > > > statd. > > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was > > > > > > > > specified > > > > > > > > > > > > > > > > For me it looks rpcbind may not be running on the > > > > > > > > machine. > > > > > > > > Usually rpcbind starts automatically on machines, don't > > > > > > > > know > > > > > > > > whether it > > > > > > > > can happen or not. > > > > > > > > > > > > > > > > > > > > > > That's precisely what the question is. Why suddenly we're > > > > > > > seeing > > > > > > > this > > > > > > > happening too frequently. Today I saw atleast 4 to 5 such > > > > > > > failures > > > > > > > already. > > > > > > > > > > > > > > Deepshika - Can you please help in inspecting this? > > > > > > > > > > > > So we think (we are not sure) that the issue is a bit > > > > > > complex. > > > > > > > > > > > > What we were investigating was nightly run fail on aws. When > > > > > > the > > > > > > build > > > > > > crash, the builder is restarted, since that's the easiest way > > > > > > to > > > > > > clean > > > > > > everything (since even with a perfect test suite that would > > > > > > clean > > > > > > itself, we could always end in a corrupt state on the system, > > > > > > WRT > > > > > > mount, fs, etc). > > > > > > > > > > > > In turn, this seems to cause trouble on aws, since cloud-init > > > > > > or > > > > > > something rename eth0 interface to ens5, without cleaning to > > > > > > the > > > > > > network configuration. > > > > > > > > > > > > So the network init script fail (because the image say "start > > > > > > eth0" > > > > > > and > > > > > > that's not present), but fail in a weird way. Network is > > > > > > initialised > > > > > > and working (we can connect), but the dhclient process is not > > > > > > in > > > > > > the > > > > > > right cgroup, and network.service is in failed state. > > > > > > Restarting > > > > > > network didn't work. In turn, this mean that rpc-statd refuse > > > > > > to > > > > > > start > > > > > > (due to systemd dependencies), which seems to impact various > > > > > > NFS > > > > > > tests. > > > > > > > > > > > > We have also seen that on some builders, rpcbind pick some IP > > > > > > v6 > > > > > > autoconfiguration, but we can't reproduce that, and there is > > > > > > no ip > > > > > > v6 > > > > > > set up anywhere. I suspect the network.service failure is > > > > > > somehow > > > > > > involved, but fail to see how. In turn, rpcbind.socket not > > > > > > starting > > > > > > could cause NFS test troubles. > > > > > > > > > > > > Our current stop gap fix was to fix all the builders one by > > > > > > one. > > > > > > Remove > > > > > > the config, kill the rogue dhclient, restart network service. > > > > > > > > > > > > However, we can't be sure this is going to fix the problem > > > > > > long > > > > > > term > > > > > > since this only manifest after a crash of the test suite, and > > > > > > it > > > > > > doesn't happen so often. (plus, it was working before some > > > > > > day in > > > > > > the > > > > > > past, when something did make this fail, and I do not know if > > > > > > that's a > > > > > > system upgrade, or a test change, or both). 
> > > > > > > > > > > > So we are still looking at it to have a complete > > > > > > understanding of > > > > > > the > > > > > > issue, but so far, we hacked our way to make it work (or so > > > > > > do I > > > > > > think). > > > > > > > > > > > > Deepshika is working to fix it long term, by fixing the issue > > > > > > regarding > > > > > > eth0/ens5 with a new base image. > > > > > > -- > > > > > > Michael Scherer > > > > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > - Atin (atinm) > > > > > > > > -- > > > > Michael Scherer > > > > Sysadmin, Community Infrastructure > > > > > > > > > > > > > > > > _______________________________________________ > > > > Gluster-devel mailing list > > > > Gluster-devel at gluster.org > > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > _______________________________________________ > > > Gluster-devel mailing list > > > Gluster-devel at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > > > -- > Michael Scherer > Sysadmin, Community Infrastructure > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Wed May 8 01:45:53 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Wed, 8 May 2019 07:15:53 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Deepshikha, I see the failure here[1] which ran on builder206. So, we are good. [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal wrote: > Sanju, can you please give us more info about the failures. > > I see the failures occurring on just one of the builder (builder206). I'm > taking it back offline for now. > > On Tue, May 7, 2019 at 9:42 PM Michael Scherer > wrote: > >> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >> > Looks like is_nfs_export_available started failing again in recent >> > centos-regressions. >> > >> > Michael, can you please check? >> >> I will try but I am leaving for vacation tonight, so if I find nothing, >> until I leave, I guess Deepshika will have to look. >> >> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >> > >> > > >> > > >> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >> > > mscherer at redhat.com> >> > > wrote: >> > > >> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >> > > > > Is this back again? The recent patches are failing regression >> > > > > :-\ . >> > > > >> > > > So, on builder206, it took me a while to find that the issue is >> > > > that >> > > > nfs (the service) was running. 
>> > > > >> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >> > > > initialisation >> > > > failed with a rather cryptic message: >> > > > >> > > > [2019-04-23 13:17:05.371733] I >> > > > [socket.c:991:__socket_server_bind] 0- >> > > > socket.nfs-server: process started listening on port (38465) >> > > > [2019-04-23 13:17:05.385819] E >> > > > [socket.c:972:__socket_server_bind] 0- >> > > > socket.nfs-server: binding to failed: Address already in use >> > > > [2019-04-23 13:17:05.385843] E >> > > > [socket.c:974:__socket_server_bind] 0- >> > > > socket.nfs-server: Port is already in use >> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >> > > > >> > > > I found where this came from, but a few stuff did surprised me: >> > > > >> > > > - the order of print is different that the order in the code >> > > > >> > > >> > > Indeed strange... >> > > >> > > > - the message on "started listening" didn't take in account the >> > > > fact >> > > > that bind failed on: >> > > > >> > > >> > > Shouldn't it bail out if it failed to bind? >> > > Some missing 'goto out' around line 975/976? >> > > Y. >> > > >> > > > >> > > > >> > > > >> > > > >> >> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >> > > > >> > > > The message about port 38465 also threw me off the track. The >> > > > real >> > > > issue is that the service nfs was already running, and I couldn't >> > > > find >> > > > anything listening on port 38465 >> > > > >> > > > once I do service nfs stop, it no longer failed. >> > > > >> > > > So far, I do know why nfs.service was activated. >> > > > >> > > > But at least, 206 should be fixed, and we know a bit more on what >> > > > would >> > > > be causing some failure. >> > > > >> > > > >> > > > >> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >> > > > > mscherer at redhat.com> >> > > > > wrote: >> > > > > >> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >> > > > > > ?crit : >> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >> > > > > > > jthottan at redhat.com> >> > > > > > > wrote: >> > > > > > > >> > > > > > > > Hi, >> > > > > > > > >> > > > > > > > is_nfs_export_available is just a wrapper around >> > > > > > > > "showmount" >> > > > > > > > command AFAIR. >> > > > > > > > I saw following messages in console output. >> > > > > > > > mount.nfs: rpc.statd is not running but is required for >> > > > > > > > remote >> > > > > > > > locking. >> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >> > > > > > > > local, >> > > > > > > > or >> > > > > > > > start >> > > > > > > > statd. >> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >> > > > > > > > specified >> > > > > > > > >> > > > > > > > For me it looks rpcbind may not be running on the >> > > > > > > > machine. >> > > > > > > > Usually rpcbind starts automatically on machines, don't >> > > > > > > > know >> > > > > > > > whether it >> > > > > > > > can happen or not. >> > > > > > > > >> > > > > > > >> > > > > > > That's precisely what the question is. Why suddenly we're >> > > > > > > seeing >> > > > > > > this >> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >> > > > > > > failures >> > > > > > > already. >> > > > > > > >> > > > > > > Deepshika - Can you please help in inspecting this? >> > > > > > >> > > > > > So we think (we are not sure) that the issue is a bit >> > > > > > complex. 
>> > > > > > >> > > > > > What we were investigating was nightly run fail on aws. When >> > > > > > the >> > > > > > build >> > > > > > crash, the builder is restarted, since that's the easiest way >> > > > > > to >> > > > > > clean >> > > > > > everything (since even with a perfect test suite that would >> > > > > > clean >> > > > > > itself, we could always end in a corrupt state on the system, >> > > > > > WRT >> > > > > > mount, fs, etc). >> > > > > > >> > > > > > In turn, this seems to cause trouble on aws, since cloud-init >> > > > > > or >> > > > > > something rename eth0 interface to ens5, without cleaning to >> > > > > > the >> > > > > > network configuration. >> > > > > > >> > > > > > So the network init script fail (because the image say "start >> > > > > > eth0" >> > > > > > and >> > > > > > that's not present), but fail in a weird way. Network is >> > > > > > initialised >> > > > > > and working (we can connect), but the dhclient process is not >> > > > > > in >> > > > > > the >> > > > > > right cgroup, and network.service is in failed state. >> > > > > > Restarting >> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >> > > > > > to >> > > > > > start >> > > > > > (due to systemd dependencies), which seems to impact various >> > > > > > NFS >> > > > > > tests. >> > > > > > >> > > > > > We have also seen that on some builders, rpcbind pick some IP >> > > > > > v6 >> > > > > > autoconfiguration, but we can't reproduce that, and there is >> > > > > > no ip >> > > > > > v6 >> > > > > > set up anywhere. I suspect the network.service failure is >> > > > > > somehow >> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >> > > > > > starting >> > > > > > could cause NFS test troubles. >> > > > > > >> > > > > > Our current stop gap fix was to fix all the builders one by >> > > > > > one. >> > > > > > Remove >> > > > > > the config, kill the rogue dhclient, restart network service. >> > > > > > >> > > > > > However, we can't be sure this is going to fix the problem >> > > > > > long >> > > > > > term >> > > > > > since this only manifest after a crash of the test suite, and >> > > > > > it >> > > > > > doesn't happen so often. (plus, it was working before some >> > > > > > day in >> > > > > > the >> > > > > > past, when something did make this fail, and I do not know if >> > > > > > that's a >> > > > > > system upgrade, or a test change, or both). >> > > > > > >> > > > > > So we are still looking at it to have a complete >> > > > > > understanding of >> > > > > > the >> > > > > > issue, but so far, we hacked our way to make it work (or so >> > > > > > do I >> > > > > > think). >> > > > > > >> > > > > > Deepshika is working to fix it long term, by fixing the issue >> > > > > > regarding >> > > > > > eth0/ens5 with a new base image. 
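For reference, the per-builder stop-gap described above amounts to something like the following on an EL7 builder -- the exact paths and unit names are my assumption, not from the original mail:

  rm -f /etc/sysconfig/network-scripts/ifcfg-eth0   # drop the stale config for the renamed NIC
  pkill dhclient                                    # kill the rogue dhclient
  systemctl restart network                         # clear the failed network.service state
  systemctl start rpcbind.socket rpc-statd          # the NFS-related units can start again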
>> > > > > > -- >> > > > > > Michael Scherer >> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >> > > > > > >> > > > > > >> > > > > > -- >> > > > > >> > > > > - Atin (atinm) >> > > > >> > > > -- >> > > > Michael Scherer >> > > > Sysadmin, Community Infrastructure >> > > > >> > > > >> > > > >> > > > _______________________________________________ >> > > > Gluster-devel mailing list >> > > > Gluster-devel at gluster.org >> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > > >> > > _______________________________________________ >> > > Gluster-devel mailing list >> > > Gluster-devel at gluster.org >> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >> > >> > >> > >> -- >> Michael Scherer >> Sysadmin, Community Infrastructure >> >> >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:15:10 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:45:10 +0530 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Wed, May 8, 2019 at 12:08 AM Vijay Bellur wrote: > > > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> + 1 to this. >> > > I have updated the footer of gluster-devel. If that looks ok, we can > extend it to gluster-users too. > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always > stick to the first 4 Tuesdays of every month. That will help in describing > the community meeting schedule better. If we want to keep the schedule > running on alternate Tuesdays, please let me know and the mailing list > footers can be updated accordingly :-). > > >> There is also one more thing. For some reason, the community meeting is >> not visible in my calendar (especially NA region). I am not sure if anyone >> else also facing this issue. >> > > I did face this issue. Realized that we had a meeting today and showed up > at the meeting a while later but did not see many participants. Perhaps, > the calendar invite has to be made a recurring one. > We'd need to explicitly import the invite and add it to our calendar, otherwise it doesn't reflect. > Thanks, > Vijay > > >> >> Regards, >> Raghavendra >> >> On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: >> >>> Hi, >>> >>> While we send a mail on gluster-devel or gluster-user mailing list, >>> following content gets auto generated and placed at the end of mail. >>> >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? 
>>> Like this - >>> >>> Meeting schedule - >>> >>> >>> - APAC friendly hours >>> - Tuesday 14th May 2019, 11:30AM IST >>> - Bridge: https://bluejeans.com/836554017 >>> - NA/EMEA >>> - Tuesday 7th May 2019, 01:00 PM EDT >>> - Bridge: https://bluejeans.com/486278655 >>> >>> Or just a link to meeting minutes details?? >>> https://github.com/gluster/community/tree/master/meetings >>> >>> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >>> >>> --- >>> Ashish >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:16:47 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:46:47 +0530 Subject: [Gluster-devel] [Gluster-users] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: On Wed, May 8, 2019 at 9:45 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 12:08 AM Vijay Bellur wrote: > >> >> >> On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < >> rabhat at redhat.com> wrote: >> >>> >>> + 1 to this. >>> >> >> I have updated the footer of gluster-devel. If that looks ok, we can >> extend it to gluster-users too. >> >> In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and >> always stick to the first 4 Tuesdays of every month. That will help in >> describing the community meeting schedule better. If we want to keep the >> schedule running on alternate Tuesdays, please let me know and the mailing >> list footers can be updated accordingly :-). >> >> >>> There is also one more thing. For some reason, the community meeting is >>> not visible in my calendar (especially NA region). I am not sure if anyone >>> else also facing this issue. >>> >> >> I did face this issue. Realized that we had a meeting today and showed up >> at the meeting a while later but did not see many participants. Perhaps, >> the calendar invite has to be made a recurring one. >> > > We'd need to explicitly import the invite and add it to our calendar, > otherwise it doesn't reflect. > And you're right that the last series wasn't a recurring one either. > >> Thanks, >> Vijay >> >> >>> >>> Regards, >>> Raghavendra >>> >>> On Tue, May 7, 2019 at 5:19 AM Ashish Pandey >>> wrote: >>> >>>> Hi, >>>> >>>> While we send a mail on gluster-devel or gluster-user mailing list, >>>> following content gets auto generated and placed at the end of mail. 
>>>> >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? >>>> Like this - >>>> >>>> Meeting schedule - >>>> >>>> >>>> - APAC friendly hours >>>> - Tuesday 14th May 2019, 11:30AM IST >>>> - Bridge: https://bluejeans.com/836554017 >>>> - NA/EMEA >>>> - Tuesday 7th May 2019, 01:00 PM EDT >>>> - Bridge: https://bluejeans.com/486278655 >>>> >>>> Or just a link to meeting minutes details?? >>>> https://github.com/gluster/community/tree/master/meetings >>>> >>>> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. >>>> >>>> --- >>>> Ashish >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> >> Community Meeting Calendar: >> >> APAC Schedule - >> Every 2nd and 4th Tuesday at 11:30 AM IST >> Bridge: https://bluejeans.com/836554017 >> >> NA/EMEA Schedule - >> Every 1st and 3rd Tuesday at 01:00 PM EDT >> Bridge: https://bluejeans.com/486278655 >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 04:23:04 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 09:53:04 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: > Deepshikha, > > I see the failure here[1] which ran on builder206. So, we are good. > Not really, https://build.gluster.org/job/centos7-regression/5909/consoleFull failed on builder204 for similar reasons I believe? I am bit more worried on this issue being resurfacing more often these days. What can we do to fix this permanently? > [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull > > On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal > wrote: > >> Sanju, can you please give us more info about the failures. >> >> I see the failures occurring on just one of the builder (builder206). I'm >> taking it back offline for now. >> >> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >> wrote: >> >>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>> > Looks like is_nfs_export_available started failing again in recent >>> > centos-regressions. >>> > >>> > Michael, can you please check? >>> >>> I will try but I am leaving for vacation tonight, so if I find nothing, >>> until I leave, I guess Deepshika will have to look. 
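Side note on the __socket_server_bind() oddity quoted below (the "process started listening" message following a failed bind): a sketch of the early bail-out Yaniv suggests, assuming socket.c's usual variable names and log style; this is not the actual upstream change:

    ret = bind(priv->sock, (struct sockaddr *)&this->myinfo.sockaddr,
               this->myinfo.sockaddr_len);
    if (ret == -1) {
        gf_log(this->name, GF_LOG_ERROR, "binding to %s failed: %s",
               this->myinfo.identifier, strerror(errno));
        if (errno == EADDRINUSE)
            gf_log(this->name, GF_LOG_ERROR, "Port is already in use");
        goto out; /* the suspected missing jump: never fall through to
                     the listen()/"started listening" success path after
                     a failed bind */
    }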
>>> >>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>> > >>> > > >>> > > >>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>> > > mscherer at redhat.com> >>> > > wrote: >>> > > >>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>> > > > > Is this back again? The recent patches are failing regression >>> > > > > :-\ . >>> > > > >>> > > > So, on builder206, it took me a while to find that the issue is >>> > > > that >>> > > > nfs (the service) was running. >>> > > > >>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>> > > > initialisation >>> > > > failed with a rather cryptic message: >>> > > > >>> > > > [2019-04-23 13:17:05.371733] I >>> > > > [socket.c:991:__socket_server_bind] 0- >>> > > > socket.nfs-server: process started listening on port (38465) >>> > > > [2019-04-23 13:17:05.385819] E >>> > > > [socket.c:972:__socket_server_bind] 0- >>> > > > socket.nfs-server: binding to failed: Address already in use >>> > > > [2019-04-23 13:17:05.385843] E >>> > > > [socket.c:974:__socket_server_bind] 0- >>> > > > socket.nfs-server: Port is already in use >>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>> > > > >>> > > > I found where this came from, but a few stuff did surprised me: >>> > > > >>> > > > - the order of print is different that the order in the code >>> > > > >>> > > >>> > > Indeed strange... >>> > > >>> > > > - the message on "started listening" didn't take in account the >>> > > > fact >>> > > > that bind failed on: >>> > > > >>> > > >>> > > Shouldn't it bail out if it failed to bind? >>> > > Some missing 'goto out' around line 975/976? >>> > > Y. >>> > > >>> > > > >>> > > > >>> > > > >>> > > > >>> >>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>> > > > >>> > > > The message about port 38465 also threw me off the track. The >>> > > > real >>> > > > issue is that the service nfs was already running, and I couldn't >>> > > > find >>> > > > anything listening on port 38465 >>> > > > >>> > > > once I do service nfs stop, it no longer failed. >>> > > > >>> > > > So far, I do know why nfs.service was activated. >>> > > > >>> > > > But at least, 206 should be fixed, and we know a bit more on what >>> > > > would >>> > > > be causing some failure. >>> > > > >>> > > > >>> > > > >>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>> > > > > mscherer at redhat.com> >>> > > > > wrote: >>> > > > > >>> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >>> > > > > > ?crit : >>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>> > > > > > > jthottan at redhat.com> >>> > > > > > > wrote: >>> > > > > > > >>> > > > > > > > Hi, >>> > > > > > > > >>> > > > > > > > is_nfs_export_available is just a wrapper around >>> > > > > > > > "showmount" >>> > > > > > > > command AFAIR. >>> > > > > > > > I saw following messages in console output. >>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>> > > > > > > > remote >>> > > > > > > > locking. >>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>> > > > > > > > local, >>> > > > > > > > or >>> > > > > > > > start >>> > > > > > > > statd. >>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>> > > > > > > > specified >>> > > > > > > > >>> > > > > > > > For me it looks rpcbind may not be running on the >>> > > > > > > > machine. 
>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>> > > > > > > > know >>> > > > > > > > whether it >>> > > > > > > > can happen or not. >>> > > > > > > > >>> > > > > > > >>> > > > > > > That's precisely what the question is. Why suddenly we're >>> > > > > > > seeing >>> > > > > > > this >>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>> > > > > > > failures >>> > > > > > > already. >>> > > > > > > >>> > > > > > > Deepshika - Can you please help in inspecting this? >>> > > > > > >>> > > > > > So we think (we are not sure) that the issue is a bit >>> > > > > > complex. >>> > > > > > >>> > > > > > What we were investigating was nightly run fail on aws. When >>> > > > > > the >>> > > > > > build >>> > > > > > crash, the builder is restarted, since that's the easiest way >>> > > > > > to >>> > > > > > clean >>> > > > > > everything (since even with a perfect test suite that would >>> > > > > > clean >>> > > > > > itself, we could always end in a corrupt state on the system, >>> > > > > > WRT >>> > > > > > mount, fs, etc). >>> > > > > > >>> > > > > > In turn, this seems to cause trouble on aws, since cloud-init >>> > > > > > or >>> > > > > > something rename eth0 interface to ens5, without cleaning to >>> > > > > > the >>> > > > > > network configuration. >>> > > > > > >>> > > > > > So the network init script fail (because the image say "start >>> > > > > > eth0" >>> > > > > > and >>> > > > > > that's not present), but fail in a weird way. Network is >>> > > > > > initialised >>> > > > > > and working (we can connect), but the dhclient process is not >>> > > > > > in >>> > > > > > the >>> > > > > > right cgroup, and network.service is in failed state. >>> > > > > > Restarting >>> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >>> > > > > > to >>> > > > > > start >>> > > > > > (due to systemd dependencies), which seems to impact various >>> > > > > > NFS >>> > > > > > tests. >>> > > > > > >>> > > > > > We have also seen that on some builders, rpcbind pick some IP >>> > > > > > v6 >>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>> > > > > > no ip >>> > > > > > v6 >>> > > > > > set up anywhere. I suspect the network.service failure is >>> > > > > > somehow >>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>> > > > > > starting >>> > > > > > could cause NFS test troubles. >>> > > > > > >>> > > > > > Our current stop gap fix was to fix all the builders one by >>> > > > > > one. >>> > > > > > Remove >>> > > > > > the config, kill the rogue dhclient, restart network service. >>> > > > > > >>> > > > > > However, we can't be sure this is going to fix the problem >>> > > > > > long >>> > > > > > term >>> > > > > > since this only manifest after a crash of the test suite, and >>> > > > > > it >>> > > > > > doesn't happen so often. (plus, it was working before some >>> > > > > > day in >>> > > > > > the >>> > > > > > past, when something did make this fail, and I do not know if >>> > > > > > that's a >>> > > > > > system upgrade, or a test change, or both). >>> > > > > > >>> > > > > > So we are still looking at it to have a complete >>> > > > > > understanding of >>> > > > > > the >>> > > > > > issue, but so far, we hacked our way to make it work (or so >>> > > > > > do I >>> > > > > > think). >>> > > > > > >>> > > > > > Deepshika is working to fix it long term, by fixing the issue >>> > > > > > regarding >>> > > > > > eth0/ens5 with a new base image. 
>>> > > > > > -- >>> > > > > > Michael Scherer >>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>> > > > > > >>> > > > > > >>> > > > > > -- >>> > > > > >>> > > > > - Atin (atinm) >>> > > > >>> > > > -- >>> > > > Michael Scherer >>> > > > Sysadmin, Community Infrastructure >>> > > > >>> > > > >>> > > > >>> > > > _______________________________________________ >>> > > > Gluster-devel mailing list >>> > > > Gluster-devel at gluster.org >>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>> > > >>> > > _______________________________________________ >>> > > Gluster-devel mailing list >>> > > Gluster-devel at gluster.org >>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>> > >>> > >>> > >>> -- >>> Michael Scherer >>> Sysadmin, Community Infrastructure >>> >>> >>> >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> > > -- > Thanks, > Sanju > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ndevos at redhat.com Wed May 8 07:08:08 2019 From: ndevos at redhat.com (Niels de Vos) Date: Wed, 8 May 2019 09:08:08 +0200 Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> Message-ID: <20190508070808.GA22482@ndevos-x270> On Tue, May 07, 2019 at 11:37:27AM -0700, Vijay Bellur wrote: > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath > wrote: > > > > > + 1 to this. > > > > I have updated the footer of gluster-devel. If that looks ok, we can extend > it to gluster-users too. > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and always > stick to the first 4 Tuesdays of every month. That will help in describing > the community meeting schedule better. If we want to keep the schedule > running on alternate Tuesdays, please let me know and the mailing list > footers can be updated accordingly :-). > > > > There is also one more thing. For some reason, the community meeting is > > not visible in my calendar (especially NA region). I am not sure if anyone > > else also facing this issue. > > > > I did face this issue. Realized that we had a meeting today and showed up > at the meeting a while later but did not see many participants. Perhaps, > the calendar invite has to be made a recurring one. Maybe a new invite can be sent with the minutes after a meeting has finished. This makes it easier for people that recently subscribed to the list to add it to their calendar? Niels > > Thanks, > Vijay > > > > > > Regards, > > Raghavendra > > > > On Tue, May 7, 2019 at 5:19 AM Ashish Pandey wrote: > > > >> Hi, > >> > >> While we send a mail on gluster-devel or gluster-user mailing list, > >> following content gets auto generated and placed at the end of mail. 
> >> > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> > >> Gluster-devel mailing list > >> Gluster-devel at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-devel > >> > >> In the similar way, is it possible to attach meeting schedule and link at the end of every such mails? > >> Like this - > >> > >> Meeting schedule - > >> > >> > >> - APAC friendly hours > >> - Tuesday 14th May 2019, 11:30AM IST > >> - Bridge: https://bluejeans.com/836554017 > >> - NA/EMEA > >> - Tuesday 7th May 2019, 01:00 PM EDT > >> - Bridge: https://bluejeans.com/486278655 > >> > >> Or just a link to meeting minutes details?? > >> https://github.com/gluster/community/tree/master/meetings > >> > >> This will help developers and users of the community to know when and where meeting happens and how to attend those meetings. > >> > >> --- > >> Ashish > >> > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From vbellur at redhat.com Wed May 8 07:31:37 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Wed, 8 May 2019 00:31:37 -0700 Subject: [Gluster-devel] Meeting Details on footer of the gluster-devel and gluster-user mailing list In-Reply-To: <20190508070808.GA22482@ndevos-x270> References: <2029030585.17155612.1557220163425.JavaMail.zimbra@redhat.com> <1839109616.17156274.1557220745006.JavaMail.zimbra@redhat.com> <20190508070808.GA22482@ndevos-x270> Message-ID: On Wed, May 8, 2019 at 12:08 AM Niels de Vos wrote: > On Tue, May 07, 2019 at 11:37:27AM -0700, Vijay Bellur wrote: > > On Tue, May 7, 2019 at 11:15 AM FNU Raghavendra Manjunath < > rabhat at redhat.com> > > wrote: > > > > > > > > + 1 to this. > > > > > > > I have updated the footer of gluster-devel. If that looks ok, we can > extend > > it to gluster-users too. > > > > In case of a month with 5 Tuesdays, we can skip the 5th Tuesday and > always > > stick to the first 4 Tuesdays of every month. That will help in > describing > > the community meeting schedule better. If we want to keep the schedule > > running on alternate Tuesdays, please let me know and the mailing list > > footers can be updated accordingly :-). > > > > > > > There is also one more thing. For some reason, the community meeting is > > > not visible in my calendar (especially NA region). I am not sure if > anyone > > > else also facing this issue. > > > > > > > I did face this issue. Realized that we had a meeting today and showed up > > at the meeting a while later but did not see many participants. Perhaps, > > the calendar invite has to be made a recurring one. > > Maybe a new invite can be sent with the minutes after a meeting has > finished. This makes it easier for people that recently subscribed to > the list to add it to their calendar? > > > That is a good point. I have observed in google groups based mailing lists that a calendar invite for a recurring event is sent automatically to people after they subscribe to the list. I don't think mailman has a similar feature yet. 
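Making the invite recur is an iCalendar detail rather than a mailman one: the event just needs an RFC 5545 RRULE. A minimal sketch matching the 2nd-and-4th-Tuesday APAC slot (UID/DTSTAMP and the NA/EMEA twin omitted; the one-hour duration is an assumption):

BEGIN:VEVENT
SUMMARY:Gluster Community Meeting (APAC)
DTSTART;TZID=Asia/Kolkata:20190514T113000
DURATION:PT1H
RRULE:FREQ=MONTHLY;BYDAY=2TU,4TU
END:VEVENT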
Thanks, Vijay -------------- next part -------------- An HTML attachment was scrubbed... URL: From cynthia.zhou at nokia-sbell.com Wed May 8 07:58:20 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Wed, 8 May 2019 07:58:20 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
Message-ID: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>

Hi Milind Changire,

The leak is now much clearer to me. The remaining leak exists because in glusterfs 3.12.15 (my environment) the SSL context is shared: when we call ssl_accept, SSL allocates read/write buffers for the SSL object, but when ssl_free runs in the socket_reset or fini function of socket.c, those buffers are returned to the SSL context's free list instead of being freed outright.

So the following patch fixes the memory leak completely (created against the gluster master branch):

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
 gf_log(this->name, GF_LOG_DEBUG,
 "SSL verification succeeded (client: %s) (server: %s)",
 this->peerinfo.identifier, this->myinfo.identifier);
+ X509_free(peer);
 return gf_strdup(peer_CN);

 /* Error paths. */
@@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this)
 memset(&priv->incoming, 0, sizeof(priv->incoming));

 event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+ if(priv->use_ssl&& priv->ssl_ssl)
+ {
+ gf_log(this->name, GF_LOG_TRACE,
+ "clear and reset for socket(%d), free ssl ",
+ priv->sock);
+ if(priv->ssl_ctx)
+ {
+ SSL_CTX_free(priv->ssl_ctx);
+ priv->ssl_ctx = NULL;
+ }
+ SSL_shutdown(priv->ssl_ssl);
+ SSL_clear(priv->ssl_ssl);
+ SSL_free(priv->ssl_ssl);
+ priv->ssl_ssl = NULL;
+ }
 priv->sock = -1;
 priv->idx = -1;
 priv->connected = -1;
@@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this)
 pthread_mutex_destroy(&priv->out_lock);
 pthread_mutex_destroy(&priv->cond_lock);
 pthread_cond_destroy(&priv->cond);
+ if(priv->use_ssl&& priv->ssl_ssl)
+ {
+ gf_log(this->name, GF_LOG_TRACE,
+ "clear and reset for socket(%d), free ssl ",
+ priv->sock);
+ if(priv->ssl_ctx)
+ {
+ SSL_CTX_free(priv->ssl_ctx);
+ priv->ssl_ctx = NULL;
+ }
+ SSL_shutdown(priv->ssl_ssl);
+ SSL_clear(priv->ssl_ssl);
+ SSL_free(priv->ssl_ssl);

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 2:12 PM
To: 'Amar Tumballi Suryanarayan'
Cc: 'Milind Changire' ; 'gluster-devel at gluster.org'
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,

From our tests, valgrind and libleak both blame ssl3_accept:

///////////////////////////from valgrind attached to glusterfsd///////////////////////////////////////////
==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673== by 0xA610DDF:
ssl_complete_connection (socket.c:400) ==16673== by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409) ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) ==16673== by 0xA618788: socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123 ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p) ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p) ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) ==16673== by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409) ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) ==16673== by 0xA618788: socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== valgrind --leak-check=f ////////////////////////////////////with libleak attached to glusterfsd///////////////////////////////////////// callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749] /lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced] /lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] callstack[2432] expires. 
count=1 size=104/104 alloc=362 free=0 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627] /lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd] /lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9] /lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d] /lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972] /lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5] /lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51] /lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b] /lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] one interesting thing is that the memory goes up to about 300m then it stopped increasing !!! I am wondering if this is caused by open-ssl library? But when I search from openssl community, there is no such issue reported before. Is glusterfs using ssl_accept correctly? cynthia From: Zhou, Cynthia (NSB - CN/Hangzhou) Sent: Monday, May 06, 2019 10:34 AM To: 'Amar Tumballi Suryanarayan' > Cc: Milind Changire >; gluster-devel at gluster.org Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi, Sorry, I am so busy with other issues these days, could you help me to submit my patch for review? It is based on glusterfs3.12.15 code. But even with this patch , memory leak still exists, from memory leak tool it should be related with ssl_accept, not sure if it is because of openssl library or because improper use of ssl interfaces. --- a/rpc/rpc-transport/socket/src/socket.c +++ b/rpc/rpc-transport/socket/src/socket.c @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { memset(&priv->incoming, 0, sizeof(priv->incoming)); event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); - + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } priv->sock = -1; priv->idx = -1; priv->connected = -1; @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { pthread_mutex_destroy(&priv->out_lock); pthread_mutex_destroy(&priv->cond_lock); pthread_cond_destroy(&priv->cond); + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } if (priv->ssl_private_key) { GF_FREE(priv->ssl_private_key); } From: Amar Tumballi Suryanarayan > Sent: Wednesday, May 01, 2019 8:43 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Milind Changire >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi Cynthia Zhou, Can you post the patch which fixes the issue of missing free? 
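As to whether this is the OpenSSL library: one plausible reading, consistent with the fix that eventually worked in this thread, is OpenSSL 1.0.2's per-SSL_CTX read/write buffer freelists (removed in 1.1.0). SSL_free() parks the buffers on the shared context's freelist rather than handing them back to the allocator, and since the freelist is bounded, this would also explain why the growth eventually plateaus. A toy sketch of the pattern, assuming OpenSSL 1.0.x; this is not the Gluster code itself:

#include <openssl/ssl.h>

int main(void) {
    SSL_library_init();
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
    for (int i = 0; i < 100; i++) {
        SSL *ssl = SSL_new(ctx);
        /* SSL_set_fd() + SSL_accept() would allocate rbuf/wbuf here
         * via ssl3_setup_buffers(), as in the valgrind stacks above */
        SSL_free(ssl);  /* buffers go back to ctx's freelist, still resident */
    }
    SSL_CTX_free(ctx);  /* only this finally releases the freelist memory */
    return 0;
}

Her May 06 patch, quoted next, predates that insight; the newer patch earlier in this mail adds the SSL_CTX_free() calls that actually release the memory.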
We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master. -Amar On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote: Ok, I am clear now. I?ve added ssl_free in socket reset and socket finish function, though glusterfsd memory leak is not that much, still it is leaking, from source code I can not find anything else, Could you help to check if this issue exists in your env? If not I may have a try to merge your patch . Step 1> while true;do gluster v heal info, 2> check the vol-name glusterfsd memory usage, it is obviously increasing. cynthia From: Milind Changire > Sent: Monday, April 22, 2019 2:36 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Atin Mukherjee >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl According to BIO_new_socket() man page ... If the close flag is set then the socket is shut down and closed when the BIO is freed. For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO is freed. _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mchangir at redhat.com Wed May 8 09:01:29 2019 From: mchangir at redhat.com (Milind Changire) Date: Wed, 8 May 2019 14:31:29 +0530 Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl In-Reply-To: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com> <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> Message-ID: awesome! well done! thank you for taking pain to fix the memory leak On Wed, May 8, 2019 at 1:28 PM Zhou, Cynthia (NSB - CN/Hangzhou) < cynthia.zhou at nokia-sbell.com> wrote: > Hi 'Milind Changire' , > > The leak is getting more and more clear to me now. the unsolved memory > leak is because of in gluterfs version 3.12.15 (in my env)the ssl context > is a shared one, while we do ssl_acept, ssl will allocate some read/write > buffer to ssl object, however, ssl_free in socket_reset or fini function of > socket.c, the buffer is returened back to ssl context free list instead of > completely freed. > > > > So following patch is able to fix the memory leak issue > completely.(created for gluster master branch) > > > > --- a/rpc/rpc-transport/socket/src/socket.c > +++ b/rpc/rpc-transport/socket/src/socket.c > @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this) > gf_log(this->name, GF_LOG_DEBUG, > "SSL verification succeeded (client: %s) (server: %s)", > this->peerinfo.identifier, this->myinfo.identifier); > + X509_free(peer); > return gf_strdup(peer_CN); > > /* Error paths. 
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > -- Milind -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From rgowdapp at redhat.com Wed May 8 11:35:19 2019
From: rgowdapp at redhat.com (Raghavendra Gowdappa)
Date: Wed, 8 May 2019 17:05:19 +0530
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
 <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
 <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
Message-ID: 

On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) <
cynthia.zhou at nokia-sbell.com> wrote:

> Hi Milind Changire,
>
> The leak is getting more and more clear to me now. The unsolved memory
> leak is because, in glusterfs version 3.12.15 (in my env), the SSL context
> is a shared one: when we do SSL_accept, SSL allocates read/write buffers
> for the SSL object, but on SSL_free in the socket_reset or fini function
> of socket.c the buffers are returned to the SSL context's free list
> instead of being completely freed.
>

Thanks Cynthia for your efforts in identifying and fixing the leak. If you
post a patch to gerrit, I'll be happy to merge it and get the fix into the
codebase.


> So the following patch is able to fix the memory leak issue completely
> (created for the gluster master branch):
>
> --- a/rpc/rpc-transport/socket/src/socket.c
> +++ b/rpc/rpc-transport/socket/src/socket.c
> @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
>      gf_log(this->name, GF_LOG_DEBUG,
>             "SSL verification succeeded (client: %s) (server: %s)",
>             this->peerinfo.identifier, this->myinfo.identifier);
> +    X509_free(peer);
>      return gf_strdup(peer_CN);
>
>  /* Error paths.
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
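A way to cross-check where the ssl3_setup_buffers allocations end up, independent of valgrind and libleak, is to install counting wrappers for OpenSSL's allocator. A debugging sketch, assuming the OpenSSL 1.0.x callback signatures (1.1.0 adds file/line parameters); it must run before OpenSSL's first allocation, and all names here are illustrative. Her 3.12.15 patch is quoted next.

#include <stdio.h>
#include <stdlib.h>
#include <openssl/crypto.h>

static long ssl_live_allocs; /* debug only, not thread-safe */

static void *counting_malloc(size_t n) { ssl_live_allocs++; return malloc(n); }
static void *counting_realloc(void *p, size_t n) { return realloc(p, n); }
static void counting_free(void *p) { if (p) ssl_live_allocs--; free(p); }

/* Call first thing in main(); returns 0 if OpenSSL has already
 * allocated memory and the functions can no longer be replaced. */
int install_openssl_counters(void)
{
    return CRYPTO_set_mem_functions(counting_malloc, counting_realloc,
                                    counting_free);
}

Sampling ssl_live_allocs across the 'gluster v heal info' loop would show whether the count plateaus (buffers cached on the context free list, matching the ~300m ceiling reported above) or keeps growing (a true leak).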
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. 
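In code form, the ownership choice Milind describes is just the close flag passed when the BIO is created. A sketch, with sock and ssl standing in for the transport's descriptor and SSL object:

/* With BIO_NOCLOSE, freeing the BIO leaves the descriptor alone; with
 * BIO_CLOSE, SSL_free() would shut down and close it as a side effect.
 * Gluster keeps control of socket shutdown by choosing BIO_NOCLOSE. */
BIO *bio = BIO_new_socket(sock, BIO_NOCLOSE);
SSL_set_bio(ssl, bio, bio); /* ssl takes ownership of the BIO */
/* ... handshake and I/O ... */
SSL_free(ssl);              /* frees the BIO but leaves sock open */
close(sock);                /* still the transport's job, by design */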
> > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed May 8 14:08:15 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 8 May 2019 19:38:15 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: builder204 needs to be fixed, too many failures, mostly none of the patches are passing regression. On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: > >> Deepshikha, >> >> I see the failure here[1] which ran on builder206. So, we are good. >> > > Not really, > https://build.gluster.org/job/centos7-regression/5909/consoleFull failed > on builder204 for similar reasons I believe? > > I am bit more worried on this issue being resurfacing more often these > days. What can we do to fix this permanently? > > >> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >> >> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >> dkhandel at redhat.com> wrote: >> >>> Sanju, can you please give us more info about the failures. >>> >>> I see the failures occurring on just one of the builder (builder206). >>> I'm taking it back offline for now. >>> >>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>> wrote: >>> >>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>> > Looks like is_nfs_export_available started failing again in recent >>>> > centos-regressions. >>>> > >>>> > Michael, can you please check? >>>> >>>> I will try but I am leaving for vacation tonight, so if I find nothing, >>>> until I leave, I guess Deepshika will have to look. >>>> >>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>>> > >>>> > > >>>> > > >>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>> > > mscherer at redhat.com> >>>> > > wrote: >>>> > > >>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>> > > > > Is this back again? The recent patches are failing regression >>>> > > > > :-\ . >>>> > > > >>>> > > > So, on builder206, it took me a while to find that the issue is >>>> > > > that >>>> > > > nfs (the service) was running. 
>>>> > > >
>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs
>>>> > > > initialisation failed with a rather cryptic message:
>>>> > > >
>>>> > > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: process started listening on port (38465)
>>>> > > > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: binding to failed: Address already in use
>>>> > > > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0-
>>>> > > > socket.nfs-server: Port is already in use
>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0-
>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14
>>>> > > >
>>>> > > > I found where this came from, but a few things surprised me:
>>>> > > >
>>>> > > > - the order of the prints is different from the order in the code
>>>> > >
>>>> > > Indeed strange...
>>>> > >
>>>> > > > - the message on "started listening" didn't take into account the
>>>> > > > fact that bind failed on:
>>>> > >
>>>> > > Shouldn't it bail out if it failed to bind?
>>>> > > Some missing 'goto out' around line 975/976?
>>>> > > Y.
>>>> > >
>>>> > > >
>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967
>>>> > > >
>>>> > > > The message about port 38465 also threw me off the track. The real
>>>> > > > issue is that the service nfs was already running, and I couldn't
>>>> > > > find anything listening on port 38465.
>>>> > > >
>>>> > > > Once I did 'service nfs stop', it no longer failed.
>>>> > > >
>>>> > > > So far, I do not know why nfs.service was activated.
>>>> > > >
>>>> > > > But at least 206 should be fixed, and we know a bit more about what
>>>> > > > could be causing some failures.
>>>> > > >
>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer <
>>>> > > > > mscherer at redhat.com> wrote:
>>>> > > > >
>>>> > > > > > Le mercredi 03 avril 2019 à 16:30 +0530, Atin Mukherjee a
>>>> > > > > > écrit :
>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan <
>>>> > > > > > > jthottan at redhat.com> wrote:
>>>> > > > > > >
>>>> > > > > > > > Hi,
>>>> > > > > > > >
>>>> > > > > > > > is_nfs_export_available is just a wrapper around the
>>>> > > > > > > > "showmount" command AFAIR.
>>>> > > > > > > > I saw the following messages in the console output.
>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for
>>>> > > > > > > > remote locking.
>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks
>>>> > > > > > > > local, or start statd.
>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified
>>>> > > > > > > >
>>>> > > > > > > > To me it looks like rpcbind may not be running on the
>>>> > > > > > > > machine. Usually rpcbind starts automatically on machines;
>>>> > > > > > > > I don't know whether this can fail to happen or not.
>>>> > > > > > > >
>>>> > > > > > >
>>>> > > > > > > That's precisely what the question is: why are we suddenly
>>>> > > > > > > seeing this happen so frequently? Today I saw at least 4 to
>>>> > > > > > > 5 such failures already.
>>>> > > > > > >
>>>> > > > > > > Deepshika - Can you please help in inspecting this?
>>>> > > > > >
>>>> > > > > > So we think (we are not sure) that the issue is a bit complex.
>>>> > > > > >
>>>> > > > > > What we were investigating was the nightly runs failing on aws.
>>>> > > > > > When the build crashes, the builder is restarted, since that's
>>>> > > > > > the easiest way to clean everything (even with a perfect test
>>>> > > > > > suite that cleaned up after itself, we could always end up in a
>>>> > > > > > corrupt state on the system, WRT mounts, fs, etc).
>>>> > > > > >
>>>> > > > > > In turn, this seems to cause trouble on aws, since cloud-init or
>>>> > > > > > something renames the eth0 interface to ens5 without cleaning up
>>>> > > > > > the network configuration.
>>>> > > > > >
>>>> > > > > > So the network init script fails (because the image says "start
>>>> > > > > > eth0" and that's not present), but fails in a weird way. The
>>>> > > > > > network is initialised and working (we can connect), but the
>>>> > > > > > dhclient process is not in the right cgroup, and network.service
>>>> > > > > > is in a failed state. Restarting the network didn't work. In
>>>> > > > > > turn, this means that rpc-statd refuses to start (due to systemd
>>>> > > > > > dependencies), which seems to impact various NFS tests.
>>>> > > > > >
>>>> > > > > > We have also seen that on some builders, rpcbind picks up some
>>>> > > > > > IPv6 autoconfiguration, but we can't reproduce that, and there
>>>> > > > > > is no IPv6 set up anywhere. I suspect the network.service
>>>> > > > > > failure is somehow involved, but fail to see how. In turn,
>>>> > > > > > rpcbind.socket not starting could cause NFS test troubles.
>>>> > > > > >
>>>> > > > > > Our current stop-gap fix was to fix all the builders one by one:
>>>> > > > > > remove the config, kill the rogue dhclient, restart the network
>>>> > > > > > service.
>>>> > > > > >
>>>> > > > > > However, we can't be sure this is going to fix the problem long
>>>> > > > > > term, since this only manifests after a crash of the test suite,
>>>> > > > > > and it doesn't happen so often. (Plus, it was working before
>>>> > > > > > some day in the past, when something did make this fail, and I
>>>> > > > > > do not know if that was a system upgrade, or a test change, or
>>>> > > > > > both.)
>>>> > > > > >
>>>> > > > > > So we are still looking at it to get a complete understanding of
>>>> > > > > > the issue, but so far, we hacked our way to make it work (or so
>>>> > > > > > I think).
>>>> > > > > >
>>>> > > > > > Deepshika is working to fix it long term, by fixing the issue
>>>> > > > > > regarding eth0/ens5 with a new base image.
>>>> > > > > > --
>>>> > > > > > Michael Scherer
>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS
>>>> > > > >
>>>> > > > > --
>>>> > > > > - Atin (atinm)
>>>> > > >
>>>> > > > --
>>>> > > > Michael Scherer
>>>> > > > Sysadmin, Community Infrastructure
>>>> > > >
>>>> > > > _______________________________________________
>>>> > > > Gluster-devel mailing list
>>>> > > > Gluster-devel at gluster.org
>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>> > >
>>>> > > _______________________________________________
>>>> > > Gluster-devel mailing list
>>>> > > Gluster-devel at gluster.org
>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> --
>>>> Michael Scherer
>>>> Sysadmin, Community Infrastructure
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> --
>> Thanks,
>> Sanju
>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From cynthia.zhou at nokia-sbell.com Thu May 9 03:04:00 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Thu, 9 May 2019 03:04:00 +0000
Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl
In-Reply-To: 
References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com>
 <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com>
 <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com>
 <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com>
 <6d3f68f73e6d440dab19028526745171@nokia-sbell.com>
 <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com>
Message-ID: <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com>

Hi,
Ok, it is posted to https://review.gluster.org/#/c/glusterfs/+/22687/

From: Raghavendra Gowdappa
Sent: Wednesday, May 08, 2019 7:35 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou)
Cc: Amar Tumballi Suryanarayan ; gluster-devel at gluster.org
Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote:
Hi Milind Changire,
The leak is getting more and more clear to me now. The unsolved memory leak is because, in glusterfs version 3.12.15 (in my env), the SSL context is a shared one: when we do SSL_accept, SSL allocates read/write buffers for the SSL object, but on SSL_free in the socket_reset or fini function of socket.c the buffers are returned to the SSL context's free list instead of being completely freed.
Thanks Cynthia for your efforts in identifying and fixing the leak. If you post a patch to gerrit, I'll be happy to merge it and get the fix into the codebase.
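Her description matches OpenSSL's buffer caching: SSL_free() parks a connection's read/write buffers on the owning SSL_CTX's free list, so a single long-lived shared context keeps them resident for the life of the process. A sketch of the two patterns (the actual patch follows below); the refcounting note reflects OpenSSL 1.0.x behaviour as documented, not verified against this codebase:

/* Shared-context pattern: buffers released by SSL_free() are cached
 * on shared_ctx and never handed back to the allocator. */
SSL *ssl = SSL_new(shared_ctx);
/* ... SSL_accept(), I/O ... */
SSL_free(ssl); /* buffers parked on shared_ctx's free list */

/* Per-connection context, as in the patch below: once the last
 * reference on the ctx is dropped, the cached buffers go with it.
 * SSL_new() takes its own reference on the ctx, so calling
 * SSL_CTX_free() before SSL_free() is safe. */
SSL_CTX_free(priv->ssl_ctx);
priv->ssl_ctx = NULL;
SSL_free(priv->ssl_ssl);
priv->ssl_ssl = NULL;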
So the following patch is able to fix the memory leak issue completely (created for the gluster master branch):

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this)
     gf_log(this->name, GF_LOG_DEBUG,
            "SSL verification succeeded (client: %s) (server: %s)",
            this->peerinfo.identifier, this->myinfo.identifier);
+    X509_free(peer);
     return gf_strdup(peer_CN);

 /* Error paths. */
@@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this)
     memset(&priv->incoming, 0, sizeof(priv->incoming));

     event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx);
-
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_TRACE,
+               "clear and reset for socket(%d), free ssl ", priv->sock);
+        if (priv->ssl_ctx) {
+            SSL_CTX_free(priv->ssl_ctx);
+            priv->ssl_ctx = NULL;
+        }
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);
+        priv->ssl_ssl = NULL;
+    }
     priv->sock = -1;
     priv->idx = -1;
     priv->connected = -1;
@@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this)
     pthread_mutex_destroy(&priv->out_lock);
     pthread_mutex_destroy(&priv->cond_lock);
     pthread_cond_destroy(&priv->cond);
+    if (priv->use_ssl && priv->ssl_ssl) {
+        gf_log(this->name, GF_LOG_TRACE,
+               "clear and reset for socket(%d), free ssl ", priv->sock);
+        if (priv->ssl_ctx) {
+            SSL_CTX_free(priv->ssl_ctx);
+            priv->ssl_ctx = NULL;
+        }
+        SSL_shutdown(priv->ssl_ssl);
+        SSL_clear(priv->ssl_ssl);
+        SSL_free(priv->ssl_ssl);

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Monday, May 06, 2019 2:12 PM
To: 'Amar Tumballi Suryanarayan'
Cc: 'Milind Changire' ; 'gluster-devel at gluster.org'
Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl

Hi,
From our test, valgrind and libleak both blame ssl3_accept:

///////////////////////////from valgrind attached to glusterfsd///////////////////////////////////////////

==16673== 198,720 bytes in 12 blocks are definitely lost in loss record 1,114 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788: socket_event_handler (socket.c:2613)
==16673==    by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587)
==16673==    by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663)
==16673==    by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so)
==16673==
==16673== 200,544 bytes in 12 blocks are definitely lost in loss record 1,115 of 1,123
==16673==    at 0x4C2EB7B: malloc (vg_replace_malloc.c:299)
==16673==    by 0x63E1977: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.2p)
==16673==    by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA8485D9: ssl3_accept (in /usr/lib64/libssl.so.1.0.2p)
==16673==    by 0xA610DDF: ssl_complete_connection (socket.c:400)
==16673==    by 0xA617F38: ssl_handle_server_connection_attempt (socket.c:2409)
==16673==    by 0xA618420: socket_complete_connection (socket.c:2554)
==16673==    by 0xA618788:
socket_event_handler (socket.c:2613) ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) ==16673== by 0x615C5D9: start_thread (in /usr/lib64/libpthread-2.27.so) ==16673== valgrind --leak-check=f ////////////////////////////////////with libleak attached to glusterfsd///////////////////////////////////////// callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(EVP_DigestInit_ex+0x2a9) [0x7f145ed95749] /lib64/libssl.so.10(ssl3_digest_cached_records+0x11d) [0x7f145abb6ced] /lib64/libssl.so.10(ssl3_accept+0xc8f) [0x7f145abadc4f] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 /home/robot/libleak/libleak.so(malloc+0x25) [0x7f1460604065] /lib64/libcrypto.so.10(CRYPTO_malloc+0x58) [0x7f145ecd9978] /lib64/libcrypto.so.10(BN_MONT_CTX_new+0x17) [0x7f145ed48627] /lib64/libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d) [0x7f145ed489fd] /lib64/libcrypto.so.10(+0xff4d9) [0x7f145ed6a4d9] /lib64/libcrypto.so.10(int_rsa_verify+0x1cd) [0x7f145ed6d41d] /lib64/libcrypto.so.10(RSA_verify+0x32) [0x7f145ed6d972] /lib64/libcrypto.so.10(+0x107ff5) [0x7f145ed72ff5] /lib64/libcrypto.so.10(EVP_VerifyFinal+0x211) [0x7f145ed9dd51] /lib64/libssl.so.10(ssl3_get_cert_verify+0x5bb) [0x7f145abac06b] /lib64/libssl.so.10(ssl3_accept+0x988) [0x7f145abad948] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(ssl_complete_connection+0x5e) [0x7f145ae00f3a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc16d) [0x7f145ae0816d] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc68a) [0x7f145ae0868a] /usr/lib64/glusterfs/3.12.15/rpc-transport/socket.so(+0xc9f2) [0x7f145ae089f2] /lib64/libglusterfs.so.0(+0x9b96f) [0x7f146038596f] /lib64/libglusterfs.so.0(+0x9bc46) [0x7f1460385c46] /lib64/libpthread.so.0(+0x75da) [0x7f145f0d15da] /lib64/libc.so.6(clone+0x3f) [0x7f145e9a7eaf] one interesting thing is that the memory goes up to about 300m then it stopped increasing !!! I am wondering if this is caused by open-ssl library? But when I search from openssl community, there is no such issue reported before. Is glusterfs using ssl_accept correctly? cynthia From: Zhou, Cynthia (NSB - CN/Hangzhou) Sent: Monday, May 06, 2019 10:34 AM To: 'Amar Tumballi Suryanarayan' > Cc: Milind Changire >; gluster-devel at gluster.org Subject: RE: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi, Sorry, I am so busy with other issues these days, could you help me to submit my patch for review? It is based on glusterfs3.12.15 code. But even with this patch , memory leak still exists, from memory leak tool it should be related with ssl_accept, not sure if it is because of openssl library or because improper use of ssl interfaces. 
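Since the retained memory is per-connection read/write buffers, one more knob may be worth testing alongside the fix: SSL_MODE_RELEASE_BUFFERS, available since OpenSSL 1.0.0, asks the library to free a connection's buffers whenever they are idle instead of keeping them attached. Whether it also shrinks the context free list in 1.0.2 would need measuring; the quoted glusterfs code does not set it. (Her 3.12.15 patch is quoted next.)

/* Sketch: request eager release of idle read/write buffers for every
 * connection created from this context. */
SSL_CTX_set_mode(priv->ssl_ctx, SSL_MODE_RELEASE_BUFFERS);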
--- a/rpc/rpc-transport/socket/src/socket.c +++ b/rpc/rpc-transport/socket/src/socket.c @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { memset(&priv->incoming, 0, sizeof(priv->incoming)); event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); - + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } priv->sock = -1; priv->idx = -1; priv->connected = -1; @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { pthread_mutex_destroy(&priv->out_lock); pthread_mutex_destroy(&priv->cond_lock); pthread_cond_destroy(&priv->cond); + if(priv->use_ssl&& priv->ssl_ssl) + { + gf_log(this->name, GF_LOG_INFO, + "clear and reset for socket(%d), free ssl ", + priv->sock); + SSL_shutdown(priv->ssl_ssl); + SSL_clear(priv->ssl_ssl); + SSL_free(priv->ssl_ssl); + priv->ssl_ssl = NULL; + } if (priv->ssl_private_key) { GF_FREE(priv->ssl_private_key); } From: Amar Tumballi Suryanarayan > Sent: Wednesday, May 01, 2019 8:43 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Milind Changire >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl Hi Cynthia Zhou, Can you post the patch which fixes the issue of missing free? We will continue to investigate the leak further, but would really appreciate getting the patch which is already worked on land into upstream master. -Amar On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) > wrote: Ok, I am clear now. I?ve added ssl_free in socket reset and socket finish function, though glusterfsd memory leak is not that much, still it is leaking, from source code I can not find anything else, Could you help to check if this issue exists in your env? If not I may have a try to merge your patch . Step 1> while true;do gluster v heal info, 2> check the vol-name glusterfsd memory usage, it is obviously increasing. cynthia From: Milind Changire > Sent: Monday, April 22, 2019 2:36 PM To: Zhou, Cynthia (NSB - CN/Hangzhou) > Cc: Atin Mukherjee >; gluster-devel at gluster.org Subject: Re: [Gluster-devel] glusterfsd memory leak issue found after enable ssl According to BIO_new_socket() man page ... If the close flag is set then the socket is shut down and closed when the BIO is freed. For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO is freed. _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rgowdapp at redhat.com Thu May 9 04:12:49 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Thu, 9 May 2019 09:42:49 +0530 Subject: [Gluster-devel] glusterfsd memory leak issue found after enable ssl In-Reply-To: <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com> References: <07cb1c3aa08b414dbe37442955ddad36@nokia-sbell.com> <6ce04fb69243465295a71b6953eafa19@nokia-sbell.com> <3cd91d1ce39541e7ad30c60ef15000aa@nokia-sbell.com> <5d0c2ed30e884b86ba29bff5a47c960e@nokia-sbell.com> <6d3f68f73e6d440dab19028526745171@nokia-sbell.com> <0d7934cac01f4a43b4581a2f74928dbc@nokia-sbell.com> <9ea2678487544232bfe66e0e7c6d3091@nokia-sbell.com> Message-ID: Thanks!! On Thu, May 9, 2019 at 8:34 AM Zhou, Cynthia (NSB - CN/Hangzhou) < cynthia.zhou at nokia-sbell.com> wrote: > Hi, > > Ok, It is posted to https://review.gluster.org/#/c/glusterfs/+/22687/ > > > > > > > > *From:* Raghavendra Gowdappa > *Sent:* Wednesday, May 08, 2019 7:35 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Amar Tumballi Suryanarayan ; > gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > > > > > On Wed, May 8, 2019 at 1:29 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Hi 'Milind Changire' , > > The leak is getting more and more clear to me now. the unsolved memory > leak is because of in gluterfs version 3.12.15 (in my env)the ssl context > is a shared one, while we do ssl_acept, ssl will allocate some read/write > buffer to ssl object, however, ssl_free in socket_reset or fini function of > socket.c, the buffer is returened back to ssl context free list instead of > completely freed. > > > > Thanks Cynthia for your efforts in identifying and fixing the leak. If you > post a patch to gerrit, I'll be happy to merge it and get the fix into the > codebase. > > > > > > So following patch is able to fix the memory leak issue > completely.(created for gluster master branch) > > > > --- a/rpc/rpc-transport/socket/src/socket.c > +++ b/rpc/rpc-transport/socket/src/socket.c > @@ -446,6 +446,7 @@ ssl_setup_connection_postfix(rpc_transport_t *this) > gf_log(this->name, GF_LOG_DEBUG, > "SSL verification succeeded (client: %s) (server: %s)", > this->peerinfo.identifier, this->myinfo.identifier); > + X509_free(peer); > return gf_strdup(peer_CN); > > /* Error paths. 
*/ > @@ -1157,7 +1158,21 @@ __socket_reset(rpc_transport_t *this) > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > - > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > + priv->ssl_ssl = NULL; > + } > priv->sock = -1; > priv->idx = -1; > priv->connected = -1; > @@ -4675,6 +4690,21 @@ fini(rpc_transport_t *this) > pthread_mutex_destroy(&priv->out_lock); > pthread_mutex_destroy(&priv->cond_lock); > pthread_cond_destroy(&priv->cond); > + if(priv->use_ssl&& priv->ssl_ssl) > + { > + gf_log(this->name, GF_LOG_TRACE, > + "clear and reset for socket(%d), free ssl > ", > + priv->sock); > + if(priv->ssl_ctx) > + { > + SSL_CTX_free(priv->ssl_ctx); > + priv->ssl_ctx = NULL; > + } > + SSL_shutdown(priv->ssl_ssl); > + SSL_clear(priv->ssl_ssl); > + SSL_free(priv->ssl_ssl); > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 2:12 PM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* 'Milind Changire' ; 'gluster-devel at gluster.org' > > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > From our test valgrind and libleak all blame ssl3_accept > > ///////////////////////////from valgrind attached to > glusterfds/////////////////////////////////////////// > > ==16673== 198,720 bytes in 12 blocks are definitely lost in loss record > 1,114 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855E0C: ssl3_setup_write_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E77: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in /usr/lib64/*libpthread-2.27.so > *) > ==16673== > ==16673== 200,544 bytes in 12 blocks are definitely lost in loss record > 1,115 of 1,123 > ==16673== at 0x4C2EB7B: malloc (vg_replace_malloc.c:299) > ==16673== by 0x63E1977: CRYPTO_malloc (in /usr/lib64/ > *libcrypto.so.1.0.2p*) > ==16673== by 0xA855D12: ssl3_setup_read_buffer (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA855E68: ssl3_setup_buffers (in /usr/lib64/ > *libssl.so.1.0.2p*) > ==16673== by 0xA8485D9: ssl3_accept (in /usr/lib64/*libssl.so.1.0.2p*) > ==16673== by 0xA610DDF: ssl_complete_connection (socket.c:400) > ==16673== by 0xA617F38: ssl_handle_server_connection_attempt > (socket.c:2409) > ==16673== by 0xA618420: socket_complete_connection (socket.c:2554) > ==16673== by 0xA618788: socket_event_handler (socket.c:2613) > ==16673== by 0x4ED6983: event_dispatch_epoll_handler (event-epoll.c:587) > ==16673== by 0x4ED6C5A: event_dispatch_epoll_worker (event-epoll.c:663) > ==16673== by 0x615C5D9: start_thread (in 
/usr/lib64/*libpthread-2.27.so > *) > ==16673== > valgrind --leak-check=f > > > > > > ////////////////////////////////////with libleak attached to > glusterfsd///////////////////////////////////////// > > callstack[2419] expires. count=1 size=224/224 alloc=362 free=350 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(EVP_DigestInit_ex+0x2a9*) [0x7f145ed95749] > /lib64/*libssl.so.10(ssl3_digest_cached_records+0x11d*) > [0x7f145abb6ced] > /lib64/*libssl.so.10(**ssl3_accept**+0xc8f*) [0x7f145abadc4f] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > callstack[2432] expires. count=1 size=104/104 alloc=362 free=0 > /home/robot/libleak/*libleak.so(malloc+0x25*) [0x7f1460604065] > /lib64/*libcrypto.so.10(CRYPTO_malloc+0x58*) [0x7f145ecd9978] > /lib64/*libcrypto.so.10(BN_MONT_CTX_new+0x17*) [0x7f145ed48627] > /lib64/*libcrypto.so.10(BN_MONT_CTX_set_locked+0x6d*) [0x7f145ed489fd] > /lib64/*libcrypto.so.10(+0xff4d9*) [0x7f145ed6a4d9] > /lib64/*libcrypto.so.10(int_rsa_verify+0x1cd*) [0x7f145ed6d41d] > /lib64/*libcrypto.so.10(RSA_verify+0x32*) [0x7f145ed6d972] > /lib64/*libcrypto.so.10(+0x107ff5*) [0x7f145ed72ff5] > /lib64/*libcrypto.so.10(EVP_VerifyFinal+0x211*) [0x7f145ed9dd51] > /lib64/*libssl.so.10(ssl3_get_cert_verify+0x5bb*) [0x7f145abac06b] > /lib64/*libssl.so.10(**ssl3_accept**+0x988*) [0x7f145abad948] > /usr/lib64/glusterfs/3.12.15/rpc-transport/ > *socket.so(ssl_complete_connection+0x5e*) [0x7f145ae00f3a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc16d*) > [0x7f145ae0816d] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc68a*) > [0x7f145ae0868a] > /usr/lib64/glusterfs/3.12.15/rpc-transport/*socket.so(+0xc9f2*) > [0x7f145ae089f2] > /lib64/*libglusterfs.so.0(+0x9b96f*) [0x7f146038596f] > /lib64/*libglusterfs.so.0(+0x9bc46*) [0x7f1460385c46] > /lib64/*libpthread.so.0(+0x75da*) [0x7f145f0d15da] > /lib64/*libc.so.6(clone+0x3f*) [0x7f145e9a7eaf] > > > > one interesting thing is that the memory goes up to about 300m then it > stopped increasing !!! > > I am wondering if this is caused by open-ssl library? But when I search > from openssl community, there is no such issue reported before. > > Is glusterfs using ssl_accept correctly? > > > > cynthia > > *From:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Sent:* Monday, May 06, 2019 10:34 AM > *To:* 'Amar Tumballi Suryanarayan' > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* RE: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi, > > Sorry, I am so busy with other issues these days, could you help me to > submit my patch for review? It is based on glusterfs3.12.15 code. But even > with this patch , memory leak still exists, from memory leak tool it should > be related with ssl_accept, not sure if it is because of openssl library or > because improper use of ssl interfaces. 
> > --- a/rpc/rpc-transport/socket/src/socket.c > > +++ b/rpc/rpc-transport/socket/src/socket.c > > @@ -1019,7 +1019,16 @@ static void __socket_reset(rpc_transport_t *this) { > > memset(&priv->incoming, 0, sizeof(priv->incoming)); > > > > event_unregister_close(this->ctx->event_pool, priv->sock, priv->idx); > > - > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > priv->sock = -1; > > priv->idx = -1; > > priv->connected = -1; > > @@ -4238,6 +4250,16 @@ void fini(rpc_transport_t *this) { > > pthread_mutex_destroy(&priv->out_lock); > > pthread_mutex_destroy(&priv->cond_lock); > > pthread_cond_destroy(&priv->cond); > > + if(priv->use_ssl&& priv->ssl_ssl) > > + { > > + gf_log(this->name, GF_LOG_INFO, > > + "clear and reset for socket(%d), free ssl ", > > + priv->sock); > > + SSL_shutdown(priv->ssl_ssl); > > + SSL_clear(priv->ssl_ssl); > > + SSL_free(priv->ssl_ssl); > > + priv->ssl_ssl = NULL; > > + } > > if (priv->ssl_private_key) { > > GF_FREE(priv->ssl_private_key); > > } > > > > > > *From:* Amar Tumballi Suryanarayan > *Sent:* Wednesday, May 01, 2019 8:43 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Milind Changire ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > Hi Cynthia Zhou, > > > > Can you post the patch which fixes the issue of missing free? We will > continue to investigate the leak further, but would really appreciate > getting the patch which is already worked on land into upstream master. > > > > -Amar > > > > On Mon, Apr 22, 2019 at 1:38 PM Zhou, Cynthia (NSB - CN/Hangzhou) < > cynthia.zhou at nokia-sbell.com> wrote: > > Ok, I am clear now. > > I?ve added ssl_free in socket reset and socket finish function, though > glusterfsd memory leak is not that much, still it is leaking, from source > code I can not find anything else, > > Could you help to check if this issue exists in your env? If not I may > have a try to merge your patch . > > Step > > 1> while true;do gluster v heal info, > > 2> check the vol-name glusterfsd memory usage, it is obviously > increasing. > > cynthia > > > > *From:* Milind Changire > *Sent:* Monday, April 22, 2019 2:36 PM > *To:* Zhou, Cynthia (NSB - CN/Hangzhou) > *Cc:* Atin Mukherjee ; gluster-devel at gluster.org > *Subject:* Re: [Gluster-devel] glusterfsd memory leak issue found after > enable ssl > > > > According to BIO_new_socket() man page ... > > > > *If the close flag is set then the socket is shut down and closed when the > BIO is freed.* > > > > For Gluster to have more control over the socket shutdown, the BIO_NOCLOSE > flag is set. Otherwise, SSL takes control of socket shutdown whenever BIO > is freed. 
> > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > > > -- > > Amar Tumballi (amarts) > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Thu May 9 04:31:47 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 9 May 2019 10:01:47 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote: > builder204 needs to be fixed, too many failures, mostly none of the > patches are passing regression. > And with that builder201 joins the pool, https://build.gluster.org/job/centos7-regression/5943/consoleFull > On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee wrote: > >> >> >> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde wrote: >> >>> Deepshikha, >>> >>> I see the failure here[1] which ran on builder206. So, we are good. >>> >> >> Not really, >> https://build.gluster.org/job/centos7-regression/5909/consoleFull failed >> on builder204 for similar reasons I believe? >> >> I am bit more worried on this issue being resurfacing more often these >> days. What can we do to fix this permanently? >> >> >>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >>> >>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >>> dkhandel at redhat.com> wrote: >>> >>>> Sanju, can you please give us more info about the failures. >>>> >>>> I see the failures occurring on just one of the builder (builder206). >>>> I'm taking it back offline for now. >>>> >>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>>> wrote: >>>> >>>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>>> > Looks like is_nfs_export_available started failing again in recent >>>>> > centos-regressions. >>>>> > >>>>> > Michael, can you please check? >>>>> >>>>> I will try but I am leaving for vacation tonight, so if I find nothing, >>>>> until I leave, I guess Deepshika will have to look. >>>>> >>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul wrote: >>>>> > >>>>> > > >>>>> > > >>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>>> > > mscherer at redhat.com> >>>>> > > wrote: >>>>> > > >>>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>>> > > > > Is this back again? The recent patches are failing regression >>>>> > > > > :-\ . >>>>> > > > >>>>> > > > So, on builder206, it took me a while to find that the issue is >>>>> > > > that >>>>> > > > nfs (the service) was running. 
>>>>> > > > >>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>>>> > > > initialisation >>>>> > > > failed with a rather cryptic message: >>>>> > > > >>>>> > > > [2019-04-23 13:17:05.371733] I >>>>> > > > [socket.c:991:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: process started listening on port (38465) >>>>> > > > [2019-04-23 13:17:05.385819] E >>>>> > > > [socket.c:972:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: binding to failed: Address already in use >>>>> > > > [2019-04-23 13:17:05.385843] E >>>>> > > > [socket.c:974:__socket_server_bind] 0- >>>>> > > > socket.nfs-server: Port is already in use >>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>>>> > > > >>>>> > > > I found where this came from, but a few stuff did surprised me: >>>>> > > > >>>>> > > > - the order of print is different that the order in the code >>>>> > > > >>>>> > > >>>>> > > Indeed strange... >>>>> > > >>>>> > > > - the message on "started listening" didn't take in account the >>>>> > > > fact >>>>> > > > that bind failed on: >>>>> > > > >>>>> > > >>>>> > > Shouldn't it bail out if it failed to bind? >>>>> > > Some missing 'goto out' around line 975/976? >>>>> > > Y. >>>>> > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> >>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>>>> > > > >>>>> > > > The message about port 38465 also threw me off the track. The >>>>> > > > real >>>>> > > > issue is that the service nfs was already running, and I couldn't >>>>> > > > find >>>>> > > > anything listening on port 38465 >>>>> > > > >>>>> > > > once I do service nfs stop, it no longer failed. >>>>> > > > >>>>> > > > So far, I do know why nfs.service was activated. >>>>> > > > >>>>> > > > But at least, 206 should be fixed, and we know a bit more on what >>>>> > > > would >>>>> > > > be causing some failure. >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>>>> > > > > mscherer at redhat.com> >>>>> > > > > wrote: >>>>> > > > > >>>>> > > > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a >>>>> > > > > > ?crit : >>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>>>> > > > > > > jthottan at redhat.com> >>>>> > > > > > > wrote: >>>>> > > > > > > >>>>> > > > > > > > Hi, >>>>> > > > > > > > >>>>> > > > > > > > is_nfs_export_available is just a wrapper around >>>>> > > > > > > > "showmount" >>>>> > > > > > > > command AFAIR. >>>>> > > > > > > > I saw following messages in console output. >>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>>>> > > > > > > > remote >>>>> > > > > > > > locking. >>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>>>> > > > > > > > local, >>>>> > > > > > > > or >>>>> > > > > > > > start >>>>> > > > > > > > statd. >>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>>>> > > > > > > > specified >>>>> > > > > > > > >>>>> > > > > > > > For me it looks rpcbind may not be running on the >>>>> > > > > > > > machine. >>>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>>>> > > > > > > > know >>>>> > > > > > > > whether it >>>>> > > > > > > > can happen or not. >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > > That's precisely what the question is. 
Why suddenly we're >>>>> > > > > > > seeing >>>>> > > > > > > this >>>>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>>>> > > > > > > failures >>>>> > > > > > > already. >>>>> > > > > > > >>>>> > > > > > > Deepshika - Can you please help in inspecting this? >>>>> > > > > > >>>>> > > > > > So we think (we are not sure) that the issue is a bit >>>>> > > > > > complex. >>>>> > > > > > >>>>> > > > > > What we were investigating was nightly run fail on aws. When >>>>> > > > > > the >>>>> > > > > > build >>>>> > > > > > crash, the builder is restarted, since that's the easiest way >>>>> > > > > > to >>>>> > > > > > clean >>>>> > > > > > everything (since even with a perfect test suite that would >>>>> > > > > > clean >>>>> > > > > > itself, we could always end in a corrupt state on the system, >>>>> > > > > > WRT >>>>> > > > > > mount, fs, etc). >>>>> > > > > > >>>>> > > > > > In turn, this seems to cause trouble on aws, since >>>>> cloud-init >>>>> > > > > > or >>>>> > > > > > something rename eth0 interface to ens5, without cleaning to >>>>> > > > > > the >>>>> > > > > > network configuration. >>>>> > > > > > >>>>> > > > > > So the network init script fail (because the image say "start >>>>> > > > > > eth0" >>>>> > > > > > and >>>>> > > > > > that's not present), but fail in a weird way. Network is >>>>> > > > > > initialised >>>>> > > > > > and working (we can connect), but the dhclient process is not >>>>> > > > > > in >>>>> > > > > > the >>>>> > > > > > right cgroup, and network.service is in failed state. >>>>> > > > > > Restarting >>>>> > > > > > network didn't work. In turn, this mean that rpc-statd refuse >>>>> > > > > > to >>>>> > > > > > start >>>>> > > > > > (due to systemd dependencies), which seems to impact various >>>>> > > > > > NFS >>>>> > > > > > tests. >>>>> > > > > > >>>>> > > > > > We have also seen that on some builders, rpcbind pick some IP >>>>> > > > > > v6 >>>>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>>>> > > > > > no ip >>>>> > > > > > v6 >>>>> > > > > > set up anywhere. I suspect the network.service failure is >>>>> > > > > > somehow >>>>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>>>> > > > > > starting >>>>> > > > > > could cause NFS test troubles. >>>>> > > > > > >>>>> > > > > > Our current stop gap fix was to fix all the builders one by >>>>> > > > > > one. >>>>> > > > > > Remove >>>>> > > > > > the config, kill the rogue dhclient, restart network service. >>>>> > > > > > >>>>> > > > > > However, we can't be sure this is going to fix the problem >>>>> > > > > > long >>>>> > > > > > term >>>>> > > > > > since this only manifest after a crash of the test suite, and >>>>> > > > > > it >>>>> > > > > > doesn't happen so often. (plus, it was working before some >>>>> > > > > > day in >>>>> > > > > > the >>>>> > > > > > past, when something did make this fail, and I do not know if >>>>> > > > > > that's a >>>>> > > > > > system upgrade, or a test change, or both). >>>>> > > > > > >>>>> > > > > > So we are still looking at it to have a complete >>>>> > > > > > understanding of >>>>> > > > > > the >>>>> > > > > > issue, but so far, we hacked our way to make it work (or so >>>>> > > > > > do I >>>>> > > > > > think). >>>>> > > > > > >>>>> > > > > > Deepshika is working to fix it long term, by fixing the issue >>>>> > > > > > regarding >>>>> > > > > > eth0/ens5 with a new base image. 
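For reference, a hedged sketch of the per-builder stop-gap described
above, as it would look on an EL7 builder; the exact path of the stale
interface config is an assumption:

    # drop the stale eth0 config left behind by the eth0 -> ens5 rename
    rm -f /etc/sysconfig/network-scripts/ifcfg-eth0   # assumed location
    # kill the rogue dhclient that landed outside the expected cgroup
    pkill dhclient
    # restart networking so network.service leaves its failed state
    systemctl restart network
    # rpc-statd (and with it the NFS tests) should now be startable again
    systemctl start rpcbind.socket rpc-statd
    systemctl status network rpcbind.socket rpc-statd
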
>>>>> > > > > > -- >>>>> > > > > > Michael Scherer >>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>>>> > > > > > >>>>> > > > > > >>>>> > > > > > -- >>>>> > > > > >>>>> > > > > - Atin (atinm) >>>>> > > > >>>>> > > > -- >>>>> > > > Michael Scherer >>>>> > > > Sysadmin, Community Infrastructure >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > _______________________________________________ >>>>> > > > Gluster-devel mailing list >>>>> > > > Gluster-devel at gluster.org >>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> > > >>>>> > > _______________________________________________ >>>>> > > Gluster-devel mailing list >>>>> > > Gluster-devel at gluster.org >>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> > >>>>> > >>>>> > >>>>> -- >>>>> Michael Scherer >>>>> Sysadmin, Community Infrastructure >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Gluster-devel mailing list >>>>> Gluster-devel at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> >>> >>> -- >>> Thanks, >>> Sanju >>> _______________________________________________ >>> >>> Community Meeting Calendar: >>> >>> APAC Schedule - >>> Every 2nd and 4th Tuesday at 11:30 AM IST >>> Bridge: https://bluejeans.com/836554017 >>> >>> NA/EMEA Schedule - >>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>> Bridge: https://bluejeans.com/486278655 >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Thu May 9 05:56:22 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Thu, 9 May 2019 11:26:22 +0530 Subject: [Gluster-devel] [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: I took a quick look at the builders and noticed both have the same error of 'Cannot allocate memory' which comes up every time when the builder is rebooted after a build abort. It is happening in the same pattern. Though there's no such memory consumption on the builders. I?m investigating more on this. On Thu, May 9, 2019 at 10:02 AM Atin Mukherjee wrote: > > > On Wed, May 8, 2019 at 7:38 PM Atin Mukherjee wrote: > >> builder204 needs to be fixed, too many failures, mostly none of the >> patches are passing regression. >> > > And with that builder201 joins the pool, > https://build.gluster.org/job/centos7-regression/5943/consoleFull > > >> On Wed, May 8, 2019 at 9:53 AM Atin Mukherjee >> wrote: >> >>> >>> >>> On Wed, May 8, 2019 at 7:16 AM Sanju Rakonde >>> wrote: >>> >>>> Deepshikha, >>>> >>>> I see the failure here[1] which ran on builder206. So, we are good. >>>> >>> >>> Not really, >>> https://build.gluster.org/job/centos7-regression/5909/consoleFull >>> failed on builder204 for similar reasons I believe? >>> >>> I am bit more worried on this issue being resurfacing more often these >>> days. What can we do to fix this permanently? >>> >>> >>>> [1] https://build.gluster.org/job/centos7-regression/5901/consoleFull >>>> >>>> On Wed, May 8, 2019 at 12:23 AM Deepshikha Khandelwal < >>>> dkhandel at redhat.com> wrote: >>>> >>>>> Sanju, can you please give us more info about the failures. >>>>> >>>>> I see the failures occurring on just one of the builder (builder206). 
>>>>> I'm taking it back offline for now. >>>>> >>>>> On Tue, May 7, 2019 at 9:42 PM Michael Scherer >>>>> wrote: >>>>> >>>>>> Le mardi 07 mai 2019 ? 20:04 +0530, Sanju Rakonde a ?crit : >>>>>> > Looks like is_nfs_export_available started failing again in recent >>>>>> > centos-regressions. >>>>>> > >>>>>> > Michael, can you please check? >>>>>> >>>>>> I will try but I am leaving for vacation tonight, so if I find >>>>>> nothing, >>>>>> until I leave, I guess Deepshika will have to look. >>>>>> >>>>>> > On Wed, Apr 24, 2019 at 5:30 PM Yaniv Kaul >>>>>> wrote: >>>>>> > >>>>>> > > >>>>>> > > >>>>>> > > On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer < >>>>>> > > mscherer at redhat.com> >>>>>> > > wrote: >>>>>> > > >>>>>> > > > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : >>>>>> > > > > Is this back again? The recent patches are failing regression >>>>>> > > > > :-\ . >>>>>> > > > >>>>>> > > > So, on builder206, it took me a while to find that the issue is >>>>>> > > > that >>>>>> > > > nfs (the service) was running. >>>>>> > > > >>>>>> > > > ./tests/basic/afr/tarissue.t failed, because the nfs >>>>>> > > > initialisation >>>>>> > > > failed with a rather cryptic message: >>>>>> > > > >>>>>> > > > [2019-04-23 13:17:05.371733] I >>>>>> > > > [socket.c:991:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: process started listening on port (38465) >>>>>> > > > [2019-04-23 13:17:05.385819] E >>>>>> > > > [socket.c:972:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: binding to failed: Address already in use >>>>>> > > > [2019-04-23 13:17:05.385843] E >>>>>> > > > [socket.c:974:__socket_server_bind] 0- >>>>>> > > > socket.nfs-server: Port is already in use >>>>>> > > > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- >>>>>> > > > socket.nfs-server: __socket_server_bind failed;closing socket 14 >>>>>> > > > >>>>>> > > > I found where this came from, but a few stuff did surprised me: >>>>>> > > > >>>>>> > > > - the order of print is different that the order in the code >>>>>> > > > >>>>>> > > >>>>>> > > Indeed strange... >>>>>> > > >>>>>> > > > - the message on "started listening" didn't take in account the >>>>>> > > > fact >>>>>> > > > that bind failed on: >>>>>> > > > >>>>>> > > >>>>>> > > Shouldn't it bail out if it failed to bind? >>>>>> > > Some missing 'goto out' around line 975/976? >>>>>> > > Y. >>>>>> > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> >>>>>> https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 >>>>>> > > > >>>>>> > > > The message about port 38465 also threw me off the track. The >>>>>> > > > real >>>>>> > > > issue is that the service nfs was already running, and I >>>>>> couldn't >>>>>> > > > find >>>>>> > > > anything listening on port 38465 >>>>>> > > > >>>>>> > > > once I do service nfs stop, it no longer failed. >>>>>> > > > >>>>>> > > > So far, I do know why nfs.service was activated. >>>>>> > > > >>>>>> > > > But at least, 206 should be fixed, and we know a bit more on >>>>>> what >>>>>> > > > would >>>>>> > > > be causing some failure. >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer < >>>>>> > > > > mscherer at redhat.com> >>>>>> > > > > wrote: >>>>>> > > > > >>>>>> > > > > > Le mercredi 03 avril 2019 ? 
16:30 +0530, Atin Mukherjee a >>>>>> > > > > > ?crit : >>>>>> > > > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < >>>>>> > > > > > > jthottan at redhat.com> >>>>>> > > > > > > wrote: >>>>>> > > > > > > >>>>>> > > > > > > > Hi, >>>>>> > > > > > > > >>>>>> > > > > > > > is_nfs_export_available is just a wrapper around >>>>>> > > > > > > > "showmount" >>>>>> > > > > > > > command AFAIR. >>>>>> > > > > > > > I saw following messages in console output. >>>>>> > > > > > > > mount.nfs: rpc.statd is not running but is required for >>>>>> > > > > > > > remote >>>>>> > > > > > > > locking. >>>>>> > > > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks >>>>>> > > > > > > > local, >>>>>> > > > > > > > or >>>>>> > > > > > > > start >>>>>> > > > > > > > statd. >>>>>> > > > > > > > 05:06:55 mount.nfs: an incorrect mount option was >>>>>> > > > > > > > specified >>>>>> > > > > > > > >>>>>> > > > > > > > For me it looks rpcbind may not be running on the >>>>>> > > > > > > > machine. >>>>>> > > > > > > > Usually rpcbind starts automatically on machines, don't >>>>>> > > > > > > > know >>>>>> > > > > > > > whether it >>>>>> > > > > > > > can happen or not. >>>>>> > > > > > > > >>>>>> > > > > > > >>>>>> > > > > > > That's precisely what the question is. Why suddenly we're >>>>>> > > > > > > seeing >>>>>> > > > > > > this >>>>>> > > > > > > happening too frequently. Today I saw atleast 4 to 5 such >>>>>> > > > > > > failures >>>>>> > > > > > > already. >>>>>> > > > > > > >>>>>> > > > > > > Deepshika - Can you please help in inspecting this? >>>>>> > > > > > >>>>>> > > > > > So we think (we are not sure) that the issue is a bit >>>>>> > > > > > complex. >>>>>> > > > > > >>>>>> > > > > > What we were investigating was nightly run fail on aws. When >>>>>> > > > > > the >>>>>> > > > > > build >>>>>> > > > > > crash, the builder is restarted, since that's the easiest >>>>>> way >>>>>> > > > > > to >>>>>> > > > > > clean >>>>>> > > > > > everything (since even with a perfect test suite that would >>>>>> > > > > > clean >>>>>> > > > > > itself, we could always end in a corrupt state on the >>>>>> system, >>>>>> > > > > > WRT >>>>>> > > > > > mount, fs, etc). >>>>>> > > > > > >>>>>> > > > > > In turn, this seems to cause trouble on aws, since >>>>>> cloud-init >>>>>> > > > > > or >>>>>> > > > > > something rename eth0 interface to ens5, without cleaning to >>>>>> > > > > > the >>>>>> > > > > > network configuration. >>>>>> > > > > > >>>>>> > > > > > So the network init script fail (because the image say >>>>>> "start >>>>>> > > > > > eth0" >>>>>> > > > > > and >>>>>> > > > > > that's not present), but fail in a weird way. Network is >>>>>> > > > > > initialised >>>>>> > > > > > and working (we can connect), but the dhclient process is >>>>>> not >>>>>> > > > > > in >>>>>> > > > > > the >>>>>> > > > > > right cgroup, and network.service is in failed state. >>>>>> > > > > > Restarting >>>>>> > > > > > network didn't work. In turn, this mean that rpc-statd >>>>>> refuse >>>>>> > > > > > to >>>>>> > > > > > start >>>>>> > > > > > (due to systemd dependencies), which seems to impact various >>>>>> > > > > > NFS >>>>>> > > > > > tests. >>>>>> > > > > > >>>>>> > > > > > We have also seen that on some builders, rpcbind pick some >>>>>> IP >>>>>> > > > > > v6 >>>>>> > > > > > autoconfiguration, but we can't reproduce that, and there is >>>>>> > > > > > no ip >>>>>> > > > > > v6 >>>>>> > > > > > set up anywhere. 
I suspect the network.service failure is >>>>>> > > > > > somehow >>>>>> > > > > > involved, but fail to see how. In turn, rpcbind.socket not >>>>>> > > > > > starting >>>>>> > > > > > could cause NFS test troubles. >>>>>> > > > > > >>>>>> > > > > > Our current stop gap fix was to fix all the builders one by >>>>>> > > > > > one. >>>>>> > > > > > Remove >>>>>> > > > > > the config, kill the rogue dhclient, restart network >>>>>> service. >>>>>> > > > > > >>>>>> > > > > > However, we can't be sure this is going to fix the problem >>>>>> > > > > > long >>>>>> > > > > > term >>>>>> > > > > > since this only manifest after a crash of the test suite, >>>>>> and >>>>>> > > > > > it >>>>>> > > > > > doesn't happen so often. (plus, it was working before some >>>>>> > > > > > day in >>>>>> > > > > > the >>>>>> > > > > > past, when something did make this fail, and I do not know >>>>>> if >>>>>> > > > > > that's a >>>>>> > > > > > system upgrade, or a test change, or both). >>>>>> > > > > > >>>>>> > > > > > So we are still looking at it to have a complete >>>>>> > > > > > understanding of >>>>>> > > > > > the >>>>>> > > > > > issue, but so far, we hacked our way to make it work (or so >>>>>> > > > > > do I >>>>>> > > > > > think). >>>>>> > > > > > >>>>>> > > > > > Deepshika is working to fix it long term, by fixing the >>>>>> issue >>>>>> > > > > > regarding >>>>>> > > > > > eth0/ens5 with a new base image. >>>>>> > > > > > -- >>>>>> > > > > > Michael Scherer >>>>>> > > > > > Sysadmin, Community Infrastructure and Platform, OSAS >>>>>> > > > > > >>>>>> > > > > > >>>>>> > > > > > -- >>>>>> > > > > >>>>>> > > > > - Atin (atinm) >>>>>> > > > >>>>>> > > > -- >>>>>> > > > Michael Scherer >>>>>> > > > Sysadmin, Community Infrastructure >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > _______________________________________________ >>>>>> > > > Gluster-devel mailing list >>>>>> > > > Gluster-devel at gluster.org >>>>>> > > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>>> > > >>>>>> > > _______________________________________________ >>>>>> > > Gluster-devel mailing list >>>>>> > > Gluster-devel at gluster.org >>>>>> > > https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>>> > >>>>>> > >>>>>> > >>>>>> -- >>>>>> Michael Scherer >>>>>> Sysadmin, Community Infrastructure >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Gluster-devel mailing list >>>>>> Gluster-devel at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>>> >>>>> >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> _______________________________________________ >>>> >>>> Community Meeting Calendar: >>>> >>>> APAC Schedule - >>>> Every 2nd and 4th Tuesday at 11:30 AM IST >>>> Bridge: https://bluejeans.com/836554017 >>>> >>>> NA/EMEA Schedule - >>>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>>> Bridge: https://bluejeans.com/486278655 >>>> >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Thu May 9 11:15:48 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 9 May 2019 16:45:48 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th Message-ID: Hi, Expected tagging date for release-6.2 is on May, 15th, 2019. 
Please ensure the required patches are backported, are passing
regressions, and are appropriately reviewed, for easy merging and tagging
on that date.

--
Regards,
Hari Gowtham.

From spisla80 at gmail.com  Thu May  9 14:12:03 2019
From: spisla80 at gmail.com (David Spisla)
Date: Thu, 9 May 2019 16:12:03 +0200
Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster
	(together with Samba Core Developer)
Message-ID:

Dear Gluster Community,
at the moment we are improving the stability of SMB/CTDB and Gluster. For
this purpose we are working together with an advanced Samba core developer.
He did some debugging but needs more information about Gluster core
behaviour.

*Would any of the Gluster developers want to have an online conference with
him and me?*

I would organize everything. In my opinion this is a good chance to improve
the stability of GlusterFS, and this is at the moment one of the major
issues in the community.

Regards
David Spisla
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jenkins at build.gluster.org  Mon May 13 01:45:02 2019
From: jenkins at build.gluster.org (jenkins at build.gluster.org)
Date: Mon, 13 May 2019 01:45:02 +0000 (UTC)
Subject: [Gluster-devel] Weekly Untriaged Bugs
Message-ID: <997847813.11.1557711903412.JavaMail.jenkins at jenkins-el7.rht.gluster.org>

[...truncated 6 lines...]
https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed immediately even with O_SYNC in glfs_create or with glfs_fsync/glfs_fdatasync after glfs_write.
https://bugzilla.redhat.com/1707866 / core: Thousands of duplicate files in glusterfs mountpoint directory listing
https://bugzilla.redhat.com/1708505 / disperse: [EC] /tests/basic/ec/ec-data-heal.t is failing as heal is not happening properly
https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes.
https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE
https://bugzilla.redhat.com/1706716 / glusterd: glusterd generated core while running ./tests/bugs/cli/bug-1077682.t
https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd
https://bugzilla.redhat.com/1706842 / gluster-smb: Hard Failover with Samba and Glusterfs fails
https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running
https://bugzilla.redhat.com/1707671 / project-infrastructure: Cronjob of feeding gluster blogs from different account into planet gluster isn't working
https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job
https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level
https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work
https://bugzilla.redhat.com/1708257 / project-infrastructure: Grant additional maintainers merge rights on release branches
https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229
[...truncated 2 lines...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: build.log Type: application/octet-stream Size: 2157 bytes Desc: not available URL: From pgurusid at redhat.com Mon May 13 05:22:06 2019 From: pgurusid at redhat.com (Poornima Gurusiddaiah) Date: Mon, 13 May 2019 10:52:06 +0530 Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) In-Reply-To: References: Message-ID: Hi, We would be definitely interested in this. Thank you for contacting us. For the starter we can have an online conference. Please suggest few possible date and times for the week(preferably between IST 7.00AM - 9.PM)? Adding Anoop and Gunther who are also the main contributors to the Gluster-Samba integration. Thanks, Poornima On Thu, May 9, 2019 at 7:43 PM David Spisla wrote: > Dear Gluster Community, > at the moment we are improving the stability of SMB/CTDB and Gluster. For > this purpose we are working together with an advanced SAMBA Core Developer. > He did some debugging but needs more information about Gluster Core > Behaviour. > > *Would any of the Gluster Developer wants to have a online conference with > him and me?* > > I would organize everything. In my opinion this is a good chance to > improve stability of Glusterfs and this is at the moment one of the major > issues in the Community. > > Regards > David Spisla > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 07:19:25 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 12:49:25 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: What version of gluster are you using? Also, can you capture and share volume-profile output for a run where you manage to recreate this issue? https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command Let me know if you have any questions. -Krutika On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote: > Hi, > > there is no healing operation, not peer disconnects, no readonly > filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, > its SSD with 10G, performance is good. > > > you'd have it's log on qemu's standard output, > > If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking > for problem for more than month, tried everything. Can?t find anything. Any > more clues or leads? > > BR, > Martin > > > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: > >> Hi all, > > > > Hi > > > >> > >> I am running replica 3 on SSDs with 10G networking, everything works OK > but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked > for more than 120 seconds?. > >> Only solution is to poweroff (hard) VM and than boot it up again. I am > unable to SSH and also login with console, its stuck probably on some disk > operation. No error/warning logs or messages are store in VMs logs. 
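For reference, the volume-profile capture requested above amounts to
something like the following; VOLNAME and the output path are
placeholders:

    gluster volume profile VOLNAME start
    # ... reproduce the freeze, or let the workload run for a while ...
    gluster volume profile VOLNAME info > /tmp/profile-during-freeze.txt
    gluster volume profile VOLNAME stop
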
> >> > > > > As far as I know this should be unrelated, I get this during heals > > without any freezes, it just means the storage is slow I think. > > > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on > replica volume. Can someone advice how to debug this problem or what can > cause these issues? > >> It?s really annoying, I?ve tried to google everything but nothing came > up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but > its not related. > >> > > > > Any chance your gluster goes readonly ? Have you checked your gluster > > logs to see if maybe they lose each other some times ? > > /var/log/glusterfs > > > > For libgfapi accesses you'd have it's log on qemu's standard output, > > that might contain the actual error at the time of the freez. > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 07:21:19 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 12:51:19 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: Also, what's the caching policy that qemu is using on the affected vms? Is it cache=none? Or something else? You can get this information in the command line of qemu-kvm process corresponding to your vm in the ps output. -Krutika On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote: > What version of gluster are you using? > Also, can you capture and share volume-profile output for a run where you > manage to recreate this issue? > > https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command > Let me know if you have any questions. > > -Krutika > > On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote: > >> Hi, >> >> there is no healing operation, not peer disconnects, no readonly >> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >> its SSD with 10G, performance is good. >> >> > you'd have it's log on qemu's standard output, >> >> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >> for problem for more than month, tried everything. Can?t find anything. Any >> more clues or leads? >> >> BR, >> Martin >> >> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >> > >> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >> >> Hi all, >> > >> > Hi >> > >> >> >> >> I am running replica 3 on SSDs with 10G networking, everything works >> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >> blocked for more than 120 seconds?. >> >> Only solution is to poweroff (hard) VM and than boot it up again. I am >> unable to SSH and also login with console, its stuck probably on some disk >> operation. No error/warning logs or messages are store in VMs logs. >> >> >> > >> > As far as I know this should be unrelated, I get this during heals >> > without any freezes, it just means the storage is slow I think. >> > >> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >> replica volume. 
Can someone advice how to debug this problem or what can >> cause these issues? >> >> It?s really annoying, I?ve tried to google everything but nothing came >> up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but >> its not related. >> >> >> > >> > Any chance your gluster goes readonly ? Have you checked your gluster >> > logs to see if maybe they lose each other some times ? >> > /var/log/glusterfs >> > >> > For libgfapi accesses you'd have it's log on qemu's standard output, >> > that might contain the actual error at the time of the freez. >> > _______________________________________________ >> > Gluster-users mailing list >> > Gluster-users at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Mon May 13 08:20:14 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Mon, 13 May 2019 13:50:14 +0530 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: OK. In that case, can you check if the following two changes help: # gluster volume set $VOL network.remote-dio off # gluster volume set $VOL performance.strict-o-direct on preferably one option changed at a time, its impact tested and then the next change applied and tested. Also, gluster version please? -Krutika On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote: > Cache in qemu is none. That should be correct. 
This is full command : > > /usr/bin/qemu-system-x86_64 -name one-312 -S -machine > pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp > 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 > -no-user-config -nodefaults -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait > -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime > -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device > piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 > > -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 > -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 > -drive file=/var/lib/one//datastores/116/312/*disk.0* > ,format=raw,if=none,id=drive-virtio-disk1,cache=none > -device > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 > -drive file=gluster://localhost:24007/imagestore/ > *7b64d6757acc47a39503f68731f89b8e* > ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none > -device > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 > -drive file=/var/lib/one//datastores/116/312/*disk.1* > ,format=raw,if=none,id=drive-ide0-0-0,readonly=on > -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 > > -netdev tap,fd=26,id=hostnet0 > -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 > -chardev pty,id=charserial0 -device > isa-serial,chardev=charserial0,id=serial0 > -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait > -device > virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 > -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on > > I?ve highlighted disks. First is VM context disk - Fuse used, second is > SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. > > Krutika, > I will start profiling on Gluster Volumes and wait for next VM to fail. > Than I will attach/send profiling info after some VM will be failed. I > suppose this is correct profiling strategy. > About this, how many vms do you need to recreate it? A single vm? Or multiple vms doing IO in parallel? > Thanks, > BR! > Martin > > On 13 May 2019, at 09:21, Krutika Dhananjay wrote: > > Also, what's the caching policy that qemu is using on the affected vms? > Is it cache=none? Or something else? You can get this information in the > command line of qemu-kvm process corresponding to your vm in the ps output. > > -Krutika > > On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay > wrote: > >> What version of gluster are you using? >> Also, can you capture and share volume-profile output for a run where you >> manage to recreate this issue? >> >> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >> Let me know if you have any questions. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:34 PM Martin Toth >> wrote: >> >>> Hi, >>> >>> there is no healing operation, not peer disconnects, no readonly >>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>> its SSD with 10G, performance is good. >>> >>> > you'd have it's log on qemu's standard output, >>> >>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>> for problem for more than month, tried everything. Can?t find anything. 
Any >>> more clues or leads? >>> >>> BR, >>> Martin >>> >>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>> > >>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>> >> Hi all, >>> > >>> > Hi >>> > >>> >> >>> >> I am running replica 3 on SSDs with 10G networking, everything works >>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>> blocked for more than 120 seconds?. >>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>> am unable to SSH and also login with console, its stuck probably on some >>> disk operation. No error/warning logs or messages are store in VMs logs. >>> >> >>> > >>> > As far as I know this should be unrelated, I get this during heals >>> > without any freezes, it just means the storage is slow I think. >>> > >>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >>> replica volume. Can someone advice how to debug this problem or what can >>> cause these issues? >>> >> It?s really annoying, I?ve tried to google everything but nothing >>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>> drivers, but its not related. >>> >> >>> > >>> > Any chance your gluster goes readonly ? Have you checked your gluster >>> > logs to see if maybe they lose each other some times ? >>> > /var/log/glusterfs >>> > >>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>> > that might contain the actual error at the time of the freez. >>> > _______________________________________________ >>> > Gluster-users mailing list >>> > Gluster-users at gluster.org >>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgurusid at redhat.com Tue May 14 04:36:21 2019 From: pgurusid at redhat.com (pgurusid at redhat.com) Date: Tue, 14 May 2019 04:36:21 +0000 Subject: [Gluster-devel] Invitation: Gluster Community Meeting (APAC friendly hours) @ Every 2 weeks at 11:30am on Tuesday 15 times (IST) (gluster-devel@gluster.org) Message-ID: <0000000000001e42ba0588d19373@google.com> You have been invited to the following event. Title: Gluster Community Meeting (APAC friendly hours) Bridge: https://bluejeans.com/836554017 Meeting minutes: https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g?both Previous Meeting notes: http://github.com/gluster/community When: Every 2 weeks at 11:30am on Tuesday 15 times India Standard Time - Kolkata Where: https://bluejeans.com/836554017 Calendar: gluster-devel at gluster.org Who: * pgurusid at redhat.com - organizer * gluster-users at gluster.org * maintainers at gluster.org * gluster-devel at gluster.org Event details: https://www.google.com/calendar/event?action=VIEW&eid=NTEwOGJvMGZjMnRjN3Z0YzY0OGNmb3E4dXQgZ2x1c3Rlci1kZXZlbEBnbHVzdGVyLm9yZw&tok=MTkjcGd1cnVzaWRAcmVkaGF0LmNvbTdlM2Y3OWJjZDY4NDJjMDYzYjMwOWZjNDZmOGRiMDU0YjM3ZDhjYzk&ctz=Asia%2FKolkata&hl=en&es=0 Invitation from Google Calendar: https://www.google.com/calendar/ You are receiving this courtesy email at the account gluster-devel at gluster.org because you are an attendee of this event. To stop receiving future updates for this event, decline this event. Alternatively you can sign up for a Google account at https://www.google.com/calendar/ and control your notification settings for your entire calendar. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/calendar
Size: 2143 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invite.ics
Type: application/ics
Size: 2195 bytes
Desc: not available
URL:

From pgurusid at redhat.com  Tue May 14 04:47:10 2019
From: pgurusid at redhat.com (pgurusid at redhat.com)
Date: Tue, 14 May 2019 04:47:10 +0000
Subject: [Gluster-devel] Updated invitation: Gluster Community Meeting
	(APAC friendly hours) @ Every 2 weeks from 11:30am to 12:30pm on
	Tuesday 15 times (IST) (gluster-devel at gluster.org)
Message-ID: <000000000000d586ae0588d1b9de at google.com>

This event has been changed.

Title: Gluster Community Meeting (APAC friendly hours)
Bridge: https://bluejeans.com/836554017
Meeting minutes: https://hackmd.io/OqZbh7gfQe6uvVUXUVKJ5g?both
Previous Meeting notes: http://github.com/gluster/community
When: Every 2 weeks from 11:30am to 12:30pm on Tuesday 15 times India
Standard Time - Kolkata (changed)
Where: https://bluejeans.com/836554017
Calendar: gluster-devel at gluster.org
Who:
    * pgurusid at redhat.com - organizer
    * gluster-users at gluster.org
    * maintainers at gluster.org
    * gluster-devel at gluster.org
    * ranaraya at redhat.com
    * khiremat at redhat.com
    * dcunningham at voisonics.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/calendar
Size: 2586 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invite.ics
Type: application/ics
Size: 2645 bytes
Desc: not available
URL:

From amukherj at redhat.com  Wed May 15 05:54:33 2019
From: amukherj at redhat.com (Atin Mukherjee)
Date: Wed, 15 May 2019 11:24:33 +0530
Subject: [Gluster-devel] tests are timing out in master branch
Message-ID:

There're random tests which are timing out after 200 secs. My belief is
that this is a major regression introduced by some recent commit, or that
the builders have become extremely slow, which I highly doubt.
I'd request that we first figure out the cause, get master back to its
proper health and then get back to the review/merge queue.

Sanju has already started looking into
/tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand
what test is specifically hanging and consuming more time.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sankarshan.mukhopadhyay at gmail.com  Wed May 15 06:14:53 2019
From: sankarshan.mukhopadhyay at gmail.com (Sankarshan Mukhopadhyay)
Date: Wed, 15 May 2019 11:44:53 +0530
Subject: [Gluster-devel] tests are timing out in master branch
In-Reply-To:
References:
Message-ID:

On Wed, May 15, 2019 at 11:24 AM Atin Mukherjee wrote:
>
> There're random tests which are timing out after 200 secs. My belief is
> that this is a major regression introduced by some recent commit, or that
> the builders have become extremely slow, which I highly doubt. I'd request
> that we first figure out the cause, get master back to its proper health
> and then get back to the review/merge queue.
>

For such dire situations, we also need to consider a proposal to back out
patches in order to keep the master healthy. The outcome we seek is a
healthy master - the isolation of the cause allows us to not repeat the
same offense.

> Sanju has already started looking into
> /tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand
> what test is specifically hanging and consuming more time.

From ndevos at redhat.com  Wed May 15 07:17:50 2019
From: ndevos at redhat.com (Niels de Vos)
Date: Wed, 15 May 2019 09:17:50 +0200
Subject: [Gluster-devel] nightly builds are available again, with slightly
	different versioning
Message-ID: <20190515071750.GA22685@ndevos-x270>

This is sort of an RCA and notification to anyone interested in using
nightly builds of GlusterFS. If you have any (automated) tests that
consume the nightly builds for non-master branches, you did not run
tests with updated packages since 2 May 2019. The nightly builds failed
to run, but nobody was notified of this or reported it.

Around two weeks ago the nightly builds for glusterfs of the non-master
branches were broken due to a change in the CI script. This has been
corrected now and a manual run of the job shows green balls again:
https://ci.centos.org/view/Gluster/job/gluster_build-rpms/

The initial breakage was introduced by an optimization to not download
the whole glusterfs git repository, but only the current HEAD. This did
not take into account that 'git checkout' would not be able to switch to
a branch that was not downloaded. With a few iterations of fixes, it
became obvious that tags were also not fetched (duh), and 'git describe'
would not work. Without tags it is not possible to mark builds with the
most recent minor release that was made of a branch. Currently the date
of the build + git-hash is part of the package version. That means that
there is a new version of each branch every day, instead of only after
commits have been merged. This might be changed in the future...
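For reference, a minimal sketch of the failure mode described above,
runnable against any clone of the repository; this is an illustration,
not the actual CI script:

    # a HEAD-only clone knows nothing about other branches or tags
    git clone --depth 1 https://github.com/gluster/glusterfs.git
    cd glusterfs
    git checkout release-6      # fails: the branch was never downloaded
    git describe                # fails: no tags to derive a version from
    # fetching the full history and the tags makes both work again
    git fetch --unshallow --tags origin
    git fetch origin release-6:release-6
    git checkout release-6
    git describe --tags         # now names the latest minor release tag
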
As a reminder, the YUM .repo files for the nightly builds can be found at http://artifacts.ci.centos.org/gluster/nightly/ Cheers, Niels From atumball at redhat.com Wed May 15 08:18:09 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 15 May 2019 13:48:09 +0530 Subject: [Gluster-devel] nightly builds are available again, with slightly different versioning In-Reply-To: <20190515071750.GA22685@ndevos-x270> References: <20190515071750.GA22685@ndevos-x270> Message-ID: Thanks for noticing and correcting the issue Niels. Very helpful. On Wed, May 15, 2019 at 12:48 PM Niels de Vos wrote: > This is sort of an RCA and notification to anyone interested in using > nightly builds of GlusterFS. If you have any (automated) tests that > consume the nightly builds for non-master branches, you did not run > tests with updated packages since 2 May 2019. The nightly builds failed > to run, but nobody was notified or reported this. > > Around two weeks ago the nightly builds for glusterfs of the non-master > branches were broken due to a change in the CI script. This has been > corrected now and a manual run of the job shows green balls again: > https://ci.centos.org/view/Gluster/job/gluster_build-rpms/ > > The initial breakage was introduced by an optimization to not download > the whole glusterfs git repository, but only the current HEAD. This did > not take into account that 'git checkout' would not be able to switch to > a branch that was not downloaded. With a few iterations of fixes, it > became obvious that also tags were not fetched (duh), and 'git describe' > would not work. Without tags it is not possible to mark builds with the > most recent minor release that was made of a branch. Currently the date > of the build + git-hash is part of the package version. That means that > there is a new version of each branch every day, instead of only after > commits have been merged. This might be changed in the future... > > As a reminder, the YUM .repo files for the nightly builds can be found > at http://artifacts.ci.centos.org/gluster/nightly/ > > Cheers, > Niels > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Wed May 15 10:57:30 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Wed, 15 May 2019 16:27:30 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th In-Reply-To: References: Message-ID: Hi, The following patch is waiting for centos regression. https://review.gluster.org/#/c/glusterfs/+/22725/ Sunny or Kotresh, please do take a look so that we can go ahead with the tagging. On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote: > > Hi, > > Expected tagging date for release-6.2 is on May, 15th, 2019. > > Please ensure required patches are backported and also are passing > regressions and are appropriately reviewed for easy merging and tagging > on the date. > > -- > Regards, > Hari Gowtham. -- Regards, Hari Gowtham. 
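Following up on the nightly-builds reminder above, a hedged sketch of
consuming them on a CentOS machine; the exact name of the .repo file
under that URL is an assumption:

    # install the nightly repository definition, then the packages
    curl -o /etc/yum.repos.d/glusterfs-nightly.repo \
        http://artifacts.ci.centos.org/gluster/nightly/master.repo   # assumed file name
    yum install glusterfs-server
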
From abhishpaliwal at gmail.com Thu May 16 05:06:14 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Thu, 16 May 2019 10:36:14 +0530 Subject: [Gluster-devel] Memory leak in glusterfs process Message-ID: Hi Team, I upload some valgrind logs from my gluster 5.4 setup. This is writing to the volume every 15 minutes. I stopped glusterd and then copy away the logs. The test was running for some simulated days. They are zipped in valgrind-54.zip. Lots of info in valgrind-2730.log. Lots of possibly lost bytes in glusterfs and even some definitely lost bytes. ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391 of 391 ==2737== at 0x4C29C25: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==2737== by 0xA22485E: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA217C94: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21D9F8: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21DED9: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21E685: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA1B9D8C: init (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd) ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd) ==2737== ==2737== LEAK SUMMARY: ==2737== definitely lost: 1,053 bytes in 10 blocks ==2737== indirectly lost: 317 bytes in 3 blocks ==2737== possibly lost: 2,374,971 bytes in 524 blocks ==2737== still reachable: 53,277 bytes in 201 blocks ==2737== suppressed: 0 bytes in 0 blocks -- Regards Abhishek Paliwal -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: valgrind-54.zip Type: application/zip Size: 45897 bytes Desc: not available URL: From abhishpaliwal at gmail.com Thu May 16 05:19:49 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Thu, 16 May 2019 10:49:49 +0530 Subject: [Gluster-devel] Memory leak in glusterfs Message-ID: Hi Team, I upload some valgrind logs from my gluster 5.4 setup. This is writing to the volume every 15 minutes. I stopped glusterd and then copy away the logs. The test was running for some simulated days. They are zipped in valgrind-54.zip. Lots of info in valgrind-2730.log. Lots of possibly lost bytes in glusterfs and even some definitely lost bytes. ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391 of 391 ==2737== at 0x4C29C25: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==2737== by 0xA22485E: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA217C94: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21D9F8: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21DED9: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA21E685: ??? (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0xA1B9D8C: init (in /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so) ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1) ==2737== by 0x4E8A2B8: ??? 
(in /usr/lib64/libglusterfs.so.0.0.1)
==2737== by 0x4E8AAB3: glusterfs_graph_activate (in /usr/lib64/libglusterfs.so.0.0.1)
==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
==2737==
==2737== LEAK SUMMARY:
==2737== definitely lost: 1,053 bytes in 10 blocks
==2737== indirectly lost: 317 bytes in 3 blocks
==2737== possibly lost: 2,374,971 bytes in 524 blocks
==2737== still reachable: 53,277 bytes in 201 blocks
==2737== suppressed: 0 bytes in 0 blocks

--
Regards
Abhishek Paliwal
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2748.log
Type: text/x-log
Size: 23721 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2746.log
Type: text/x-log
Size: 24526 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind-2730.log
Type: text/x-log
Size: 1239130 bytes
Desc: not available
URL:

From srakonde at redhat.com  Thu May 16 13:47:43 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 16 May 2019 19:17:43 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
Message-ID:

In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
is dumping core, hence the regression is failing for many patches.

Rafi/Raghavendra, can you please look into this issue?

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rabhat at redhat.com  Thu May 16 13:58:19 2019
From: rabhat at redhat.com (FNU Raghavendra Manjunath)
Date: Thu, 16 May 2019 09:58:19 -0400
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To:
References:
Message-ID:

I am working on the other uss issue, i.e. the occasional failure of uss.t
due to delays in the brick-mux regression. Rafi, can you please look into
this one?

Regards,
Raghavendra

On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde wrote:

> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
> is dumping core, hence the regression is failing for many patches.
>
> Rafi/Raghavendra, can you please look into this issue?
>
> --
> Thanks,
> Sanju
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rkavunga at redhat.com  Thu May 16 14:19:12 2019
From: rkavunga at redhat.com (RAFI KC)
Date: Thu, 16 May 2019 19:49:12 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To:
References:
Message-ID: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>

Currently I'm looking into one of the priority issues; in parallel I will
also look into this one.

Sanju,

Do you have a link to the core file?

Regards

Rafi KC

On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>
> I am working on the other uss issue, i.e. the occasional failure of uss.t
> due to delays in the brick-mux regression. Rafi, can you please look
> into this one?
>
> Regards,
> Raghavendra
>
> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde
> wrote:
>
> In most of the regression jobs
> ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t is dumping core,
> hence the regression is failing for many patches.
>
> Rafi/Raghavendra, can you please look into this issue?
>
> --
> Thanks,
> Sanju
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
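For reference, a minimal sketch of extracting a backtrace from one of the
regression cores once it has been downloaded; both paths are placeholders,
and the binary must come from the same build that dumped the core:

    gdb /build/install/sbin/glusterfsd /path/to/core \
        --batch -ex 'thread apply all bt full' > /tmp/uss-ssl-backtrace.txt
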
From srakonde at redhat.com  Thu May 16 14:31:03 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 16 May 2019 20:01:03 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
	generating core very often
In-Reply-To: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
Message-ID:

Thank you for the quick responses. I missed pasting the links here. You
can find the core files in the following links.

https://build.gluster.org/job/centos7-regression/6035/consoleFull
https://build.gluster.org/job/centos7-regression/6055/consoleFull
https://build.gluster.org/job/centos7-regression/6045/consoleFull

On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote:

> Currently I'm looking into one of the priority issues; in parallel I will
> also look into this one.
>
> Sanju,
>
> Do you have a link to the core file?
>
> Regards
>
> Rafi KC
> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>
> I am working on the other uss issue, i.e. the occasional failure of uss.t
> due to delays in the brick-mux regression. Rafi, can you please look into
> this one?
>
> Regards,
> Raghavendra
>
> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde wrote:
>
>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t
>> is dumping core, hence the regression is failing for many patches.
>>
>> Rafi/Raghavendra, can you please look into this issue?
>>
>> --
>> Thanks,
>> Sanju
>>
--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hgowtham at redhat.com  Fri May 17 07:24:22 2019
From: hgowtham at redhat.com (Hari Gowtham)
Date: Fri, 17 May 2019 12:54:22 +0530
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To:
References:
Message-ID:

Hi Kotresh and Sunny,
The patch has been failing regression a few times.
We need to look into why this is happening and take a decision
as to whether to take it into release 6.2 or drop it.
> >
> > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote:
> >
> > Hi,
> >
> > The following patch is waiting for centos regression.
> > https://review.gluster.org/#/c/glusterfs/+/22725/
> >
> > Sunny or Kotresh, please do take a look so that we can go ahead with
> > the tagging.
> >
> > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote:
> > >
> > > Hi,
> > >
> > > Expected tagging date for release-6.2 is on May, 15th, 2019.
> > >
> > > Please ensure required patches are backported and also are passing
> > > regressions and are appropriately reviewed for easy merging and tagging
> > > on the date.
> > >
> > > --
> > > Regards,
> > > Hari Gowtham.
> >
> > --
> > Regards,
> > Hari Gowtham.

From sunkumar at redhat.com  Fri May 17 07:35:50 2019
From: sunkumar at redhat.com (Sunny Kumar)
Date: Fri, 17 May 2019 13:05:50 +0530
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To:
References:
Message-ID:

Hi Hari,

For this to pass regression, the other 3 patches need to be merged first.
I tried to merge them, but I do not have sufficient permissions to merge
on the 6.2 branch. A bug is already in place to grant additional
permissions for us (me, you and Rinku), so until then we are waiting on
Shyam to merge them.

-Sunny

On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote:
>
> Hi Kotresh and Sunny,
> The patch has been failing regression a few times.
> We need to look into why this is happening and take a decision
> as to whether to take it into release 6.2 or drop it.
> > > > > > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham > wrote: > > > > > > > > Hi, > > > > > > > > The following patch is waiting for centos regression. > > > > https://review.gluster.org/#/c/glusterfs/+/22725/ > > > > > > > > Sunny or Kotresh, please do take a look so that we can go ahead with > > > > the tagging. > > > > > > > > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham > wrote: > > > > > > > > > > Hi, > > > > > > > > > > Expected tagging date for release-6.2 is on May, 15th, 2019. > > > > > > > > > > Please ensure required patches are backported and also are passing > > > > > regressions and are appropriately reviewed for easy merging and > tagging > > > > > on the date. > > > > > > > > > > -- > > > > > Regards, > > > > > Hari Gowtham. > > > > > > > > > > > > > > > > -- > > > > Regards, > > > > Hari Gowtham. > > > > > > > > > > > > -- > > > Regards, > > > Hari Gowtham. > > > > -- > Regards, > Hari Gowtham. > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Fri May 17 07:46:46 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Fri, 17 May 2019 13:16:46 +0530 Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th In-Reply-To: References: Message-ID: https://review.gluster.org/#/q/topic:%22ref-1709738%22+(status:open%20OR%20status:merged) On Fri, May 17, 2019 at 1:13 PM Amar Tumballi Suryanarayan wrote: > > Which are the patches? I can merge it for now. > > -Amar > > On Fri, May 17, 2019 at 1:10 PM Hari Gowtham wrote: >> >> Thanks Sunny. >> Have CCed Shyam. >> >> On Fri, May 17, 2019 at 1:06 PM Sunny Kumar wrote: >> > >> > Hi Hari, >> > >> > For this to pass regression other 3 patches needs to merge first, I >> > tried to merge but do not have sufficient permissions to merge on 6.2 >> > branch. >> > I know bug is already in place to grant additional permission for >> > us(Me, you and Rinku) so until then waiting on Shyam to merge it. >> > >> > -Sunny >> > >> > On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote: >> > > >> > > Hi Kotresh ans Sunny, >> > > The patch has been failing regression a few times. >> > > We need to look into why this is happening and take a decision >> > > as to take it in release 6.2 or drop it. >> > > >> > > On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote: >> > > > >> > > > Hi, >> > > > >> > > > The following patch is waiting for centos regression. >> > > > https://review.gluster.org/#/c/glusterfs/+/22725/ >> > > > >> > > > Sunny or Kotresh, please do take a look so that we can go ahead with >> > > > the tagging. >> > > > >> > > > On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote: >> > > > > >> > > > > Hi, >> > > > > >> > > > > Expected tagging date for release-6.2 is on May, 15th, 2019. >> > > > > >> > > > > Please ensure required patches are backported and also are passing >> > > > > regressions and are appropriately reviewed for easy merging and tagging >> > > > > on the date. >> > > > > >> > > > > -- >> > > > > Regards, >> > > > > Hari Gowtham. >> > > > >> > > > >> > > > >> > > > -- >> > > > Regards, >> > > > Hari Gowtham. 
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > Hari Gowtham.
>>
>>
>>
>> --
>> Regards,
>> Hari Gowtham.
>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>
>
> --
> Amar Tumballi (amarts)

--
Regards,
Hari Gowtham.

From abhishpaliwal at gmail.com  Fri May 17 09:16:20 2019
From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL)
Date: Fri, 17 May 2019 14:46:20 +0530
Subject: [Gluster-devel] Memory leak in glusterfs
In-Reply-To: References: Message-ID:

Anyone please reply....

On Thu, May 16, 2019, 10:49 ABHISHEK PALIWAL wrote:

> Hi Team,
>
> I uploaded some valgrind logs from my gluster 5.4 setup. This setup writes
> to the volume every 15 minutes. I stopped glusterd and then copied away the
> logs. The test was running for some simulated days. They are zipped in
> valgrind-54.zip.
>
> Lots of info in valgrind-2730.log. Lots of possibly lost bytes in
> glusterfs and even some definitely lost bytes.
>
> ==2737== 1,572,880 bytes in 1 blocks are possibly lost in loss record 391
> of 391
> ==2737== at 0x4C29C25: calloc (in
> /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==2737== by 0xA22485E: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA217C94: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21D9F8: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21DED9: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA21E685: ??? (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0xA1B9D8C: init (in
> /usr/lib64/glusterfs/5.4/xlator/mgmt/glusterd.so)
> ==2737== by 0x4E511CE: xlator_init (in /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x4E8A2B8: ??? (in /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x4E8AAB3: glusterfs_graph_activate (in
> /usr/lib64/libglusterfs.so.0.0.1)
> ==2737== by 0x409C35: glusterfs_process_volfp (in /usr/sbin/glusterfsd)
> ==2737== by 0x409D99: glusterfs_volumes_init (in /usr/sbin/glusterfsd)
> ==2737==
> ==2737== LEAK SUMMARY:
> ==2737== definitely lost: 1,053 bytes in 10 blocks
> ==2737== indirectly lost: 317 bytes in 3 blocks
> ==2737== possibly lost: 2,374,971 bytes in 524 blocks
> ==2737== still reachable: 53,277 bytes in 201 blocks
> ==2737== suppressed: 0 bytes in 0 blocks
>
> --
>
>
>
>
> Regards
> Abhishek Paliwal
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From srangana at redhat.com  Fri May 17 11:13:50 2019
From: srangana at redhat.com (Shyam Ranganathan)
Date: Fri, 17 May 2019 07:13:50 -0400
Subject: [Gluster-devel] Release 6.2: Expected tagging on May 15th
In-Reply-To: References: Message-ID: <47e9433f-6aaa-f044-9442-015014e14f2c@redhat.com>

These patches were dependent on each other, so a merge was not required
to get regression passing; that analysis seems incorrect. A patch
series, when tested, will pull in all the dependent patches anyway, so
please relook at what the failure could be. (I assume you would anyway.)
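To make "pull in all the dependent patches" concrete: Gerrit serves every
change as an ordinary git ref whose parent commits are its dependencies, so
checking out the tip change of a series fetches the whole series. A rough
sketch, where the patchset suffix "/3" and the log depth are illustrative
rather than taken from this thread:

    # Fetch the tip change of the series; its dependent changes come
    # along as parent commits of the fetched head.
    git fetch https://review.gluster.org/glusterfs refs/changes/25/22725/3
    git checkout FETCH_HEAD
    # The dependent changes should now be visible in the ancestry:
    git log --oneline -4

This is also why a regression run on the tip change exercises the dependent
patches even before they are merged.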
Shyam

On 5/17/19 3:46 AM, Hari Gowtham wrote:
> https://review.gluster.org/#/q/topic:%22ref-1709738%22+(status:open%20OR%20status:merged)
>
> On Fri, May 17, 2019 at 1:13 PM Amar Tumballi Suryanarayan
> wrote:
>>
>> Which are the patches? I can merge it for now.
>>
>> -Amar
>>
>> On Fri, May 17, 2019 at 1:10 PM Hari Gowtham wrote:
>>>
>>> Thanks Sunny.
>>> Have CCed Shyam.
>>>
>>> On Fri, May 17, 2019 at 1:06 PM Sunny Kumar wrote:
>>>>
>>>> Hi Hari,
>>>>
>>>> For this to pass regression other 3 patches needs to merge first, I
>>>> tried to merge but do not have sufficient permissions to merge on 6.2
>>>> branch.
>>>> I know bug is already in place to grant additional permission for
>>>> us(Me, you and Rinku) so until then waiting on Shyam to merge it.
>>>>
>>>> -Sunny
>>>>
>>>> On Fri, May 17, 2019 at 12:54 PM Hari Gowtham wrote:
>>>>>
>>>>> Hi Kotresh ans Sunny,
>>>>> The patch has been failing regression a few times.
>>>>> We need to look into why this is happening and take a decision
>>>>> as to take it in release 6.2 or drop it.
>>>>>
>>>>> On Wed, May 15, 2019 at 4:27 PM Hari Gowtham wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The following patch is waiting for centos regression.
>>>>>> https://review.gluster.org/#/c/glusterfs/+/22725/
>>>>>>
>>>>>> Sunny or Kotresh, please do take a look so that we can go ahead with
>>>>>> the tagging.
>>>>>>
>>>>>> On Thu, May 9, 2019 at 4:45 PM Hari Gowtham wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Expected tagging date for release-6.2 is on May, 15th, 2019.
>>>>>>>
>>>>>>> Please ensure required patches are backported and also are passing
>>>>>>> regressions and are appropriately reviewed for easy merging and tagging
>>>>>>> on the date.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Hari Gowtham.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Hari Gowtham.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Hari Gowtham.
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Hari Gowtham.
>>> _______________________________________________
>>>
>>> Community Meeting Calendar:
>>>
>>> APAC Schedule -
>>> Every 2nd and 4th Tuesday at 11:30 AM IST
>>> Bridge: https://bluejeans.com/836554017
>>>
>>> NA/EMEA Schedule -
>>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>>> Bridge: https://bluejeans.com/486278655
>>>
>>> Gluster-devel mailing list
>>> Gluster-devel at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>>
>>
>>
>> --
>> Amar Tumballi (amarts)
>
>
>

From rkavunga at redhat.com  Sat May 18 10:26:16 2019
From: rkavunga at redhat.com (RAFI KC)
Date: Sat, 18 May 2019 15:56:16 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often
In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com>
Message-ID: <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>

All of these links have a common backtrace, suggesting a crash in the
socket layer's SSL code path. The backtrace is:

Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)):
#0  0x00007f9d14d65400 in ssl3_free_digest_list () from /lib64/libssl.so.10
No symbol table info available.
#1  0x00007f9d14d65586 in ssl3_digest_cached_records () from /lib64/libssl.so.10
No symbol table info available.
#2  0x00007f9d14d5f91d in ssl3_send_client_verify () from /lib64/libssl.so.10
No symbol table info available.
#3  0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10
No symbol table info available.
#4  0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482
        ret = -1
        cname = 0x0
        r = -1
        ssl_error = -1
        priv = 0x7f9ce802efc0
        __FUNCTION__ = "ssl_complete_connection"
#5  0x00007f9d14fbb596 in ssl_handle_client_connection_attempt (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809
        priv = 0x7f9ce802efc0
        ctx = 0x7f9d08001170
        idx = 1
        ret = -1
        fd = 16
        __FUNCTION__ = "ssl_handle_client_connection_attempt"
#6  0x00007f9d14fbb8b3 in socket_complete_connection (this=0x7f9ce802e980) at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908
        priv = 0x7f9ce802efc0
        ctx = 0x7f9d08001170
        idx = 1
        gen = 4
        ret = -1
        fd = 16
#7  0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 '\000') at /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970
#8  0x00007f9d20c896c1 in event_dispatch_epoll_handler (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648
#9  0x00007f9d20c89bda in event_dispatch_epoll_worker (data=0x7f9cf4000b60) at /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761
#10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0

Mohit,

Do you have any idea what is going on with ssl?

Regards

Rafi KC

On 5/16/19 8:01 PM, Sanju Rakonde wrote:
> Thank you for the quick responses.
>
> I missed pasting the links here. You can find the core file in the
> following links.
> https://build.gluster.org/job/centos7-regression/6035/consoleFull
> https://build.gluster.org/job/centos7-regression/6055/consoleFull
> https://build.gluster.org/job/centos7-regression/6045/consoleFull
>
> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote:
>
> Currently I'm looking into one of the priority issue, In parallel
> I will also looking to this.
>
> Saju,
>
> Do you have a link to the core file?
>
>
> Regards
>
> Rafi KC
>
> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote:
>>
>> I am working on other uss issue. i.e. the occasional failure of
>> uss.t due to delays in the brick-mux regression. Rafi? Can you
>> please look into this?
>>
>> Regards,
>> Raghavendra
>>
>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde
>> wrote:
>>
>> In most of the regression jobs
>> ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t is dumping
>> core, hence the regression is failing for many patches.
>>
>> Rafi/Raghavendra, can you please look into this issue?
>>
>> --
>> Thanks,
>> Sanju
>>
>
>
> --
> Thanks,
> Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From moagrawa at redhat.com  Sat May 18 12:56:35 2019
From: moagrawa at redhat.com (Mohit Agrawal)
Date: Sat, 18 May 2019 18:26:35 +0530
Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often
In-Reply-To: <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>
References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com>
Message-ID:

Hi Rafi,

I have not checked yet, on Monday I will check the same.
Thanks, Mohit Agrawal On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: > All of this links have a common backtrace, and suggest a crash from socket > layer with ssl code path, > > Backtrace is > > Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): > #0 0x00007f9d14d65400 in ssl3_free_digest_list () from /lib64/libssl.so.10 > No symbol table info available. > #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from > /lib64/libssl.so.10 > No symbol table info available. > #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from > /lib64/libssl.so.10 > No symbol table info available. > #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 > No symbol table info available. > #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 > ret = -1 > cname = 0x0 > r = -1 > ssl_error = -1 > priv = 0x7f9ce802efc0 > __FUNCTION__ = "ssl_complete_connection" > #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt > (this=0x7f9ce802e980) at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 > priv = 0x7f9ce802efc0 > ctx = 0x7f9d08001170 > idx = 1 > ret = -1 > fd = 16 > __FUNCTION__ = "ssl_handle_client_connection_attempt" > #6 0x00007f9d14fbb8b3 in socket_complete_connection (this=0x7f9ce802e980) > at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 > priv = 0x7f9ce802efc0 > ctx = 0x7f9d08001170 > idx = 1 > gen = 4 > ret = -1 > fd = 16 > #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, > data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 > '\000') at > /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 > #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler > (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at > /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 > #9 0x00007f9d20c89bda in event_dispatch_epoll_worker > (data=0x7f9cf4000b60) at > /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 > #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 > > > Mohith, > > Do you have any idea what is going on with ssl? > > Regards > > Rafi KC > On 5/16/19 8:01 PM, Sanju Rakonde wrote: > > Thank you for the quick responses. > > I missed pasting the links here. You can find the core file in the > following links. > https://build.gluster.org/job/centos7-regression/6035/consoleFull > https://build.gluster.org/job/centos7-regression/6055/consoleFull > https://build.gluster.org/job/centos7-regression/6045/consoleFull > > On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: > >> Currently I'm looking into one of the priority issue, In parallel I will >> also looking to this. >> >> Saju, >> >> Do you have a link to the core file? >> >> >> Regards >> >> Rafi KC >> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >> >> >> I am working on other uss issue. i.e. the occasional failure of uss.t due >> to delays in the brick-mux regression. Rafi? Can you please look into this? >> >> Regards, >> Raghavendra >> >> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >> wrote: >> >>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>> is dumping core, hence the regression is failing for many patches. >>> >>> Rafi/Raghavendra, can you please look into this issue? 
>>> >>> -- >>> Thanks, >>> Sanju >>> >> > > -- > Thanks, > Sanju > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon May 20 01:45:02 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 20 May 2019 01:45:02 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1886700542.45.1558316702971.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1709959 / core: Gluster causing Kubernetes containers to enter crash loop with 'mkdir ... file exists' error message https://bugzilla.redhat.com/1707866 / core: Thousands of duplicate files in glusterfs mountpoint directory listing https://bugzilla.redhat.com/1711400 / disperse: Dispersed volumes leave open file descriptors on nodes https://bugzilla.redhat.com/1708505 / disperse: [EC] /tests/basic/ec/ec-data-heal.t is failing as heal is not happening properly https://bugzilla.redhat.com/1708531 / distribute: gluster rebalance status brain splits https://bugzilla.redhat.com/1703322 / doc: Need to document about fips-mode-rchecksum in gluster-7 release notes. https://bugzilla.redhat.com/1710744 / fuse: [FUSE] Endpoint is not connected after "Found anomalies" error https://bugzilla.redhat.com/1702043 / fuse: Newly created files are inaccessible via FUSE https://bugzilla.redhat.com/1706716 / glusterd: glusterd generated core while running ./tests/bugs/cli/bug-1077682.t https://bugzilla.redhat.com/1703007 / glusterd: The telnet or something would cause high memory usage for glusterd & glusterfsd https://bugzilla.redhat.com/1706842 / gluster-smb: Hard Failover with Samba and Glusterfs fails https://bugzilla.redhat.com/1705351 / HDFS: glusterfsd crash after days of running https://bugzilla.redhat.com/1703433 / project-infrastructure: gluster-block: setup GCOV & LCOV job https://bugzilla.redhat.com/1703435 / project-infrastructure: gluster-block: Upstream Jenkins job which get triggered at PR level https://bugzilla.redhat.com/1703329 / project-infrastructure: [gluster-infra]: Please create repo for plus one scale work https://bugzilla.redhat.com/1708257 / project-infrastructure: Grant additional maintainers merge rights on release branches https://bugzilla.redhat.com/1702289 / tiering: Promotion failed for a0afd3e3-0109-49b7-9b74-ba77bf653aba.11229 [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... Name: build.log Type: application/octet-stream Size: 2279 bytes Desc: not available URL: From pkalever at redhat.com Mon May 20 12:36:41 2019 From: pkalever at redhat.com (Prasanna Kalever) Date: Mon, 20 May 2019 18:06:41 +0530 Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive! In-Reply-To: References: Message-ID: Hey Vlad, Thanks for trying gluster-block. Appreciate your feedback. Here is the patch which should fix the issue you have noticed: https://github.com/gluster/gluster-block/pull/233 Thanks! -- Prasanna On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote: > > > straight from > > ./autogen.sh && ./configure && make -j install > > > CentOS Linux release 7.6.1810 (Core) > > > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr. 
> May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :] > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state. > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed. > > > > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote: >> >> Hello Gluster folks, >> >> Gluster-block team is happy to announce the v0.4 release [1]. >> >> This is the new stable version of gluster-block, lots of new and >> exciting features and interesting bug fixes are made available as part >> of this release. >> Please find the big list of release highlights and notable fixes at [2]. >> >> Details about installation can be found in the easy install guide at >> [3]. Find the details about prerequisites and setup guide at [4]. >> If you are a new user, checkout the demo video attached in the README >> doc [5], which will be a good source of intro to the project. >> There are good examples about how to use gluster-block both in the man >> pages [6] and test file [7] (also in the README). >> >> gluster-block is part of fedora package collection, an updated package >> with release version v0.4 will be soon made available. And the >> community provided packages will be soon made available at [8]. >> >> Please spend a minute to report any kind of issue that comes to your >> notice with this handy link [9]. >> We look forward to your feedback, which will help gluster-block get better! >> >> We would like to thank all our users, contributors for bug filing and >> fixes, also the whole team who involved in the huge effort with >> pre-release testing. >> >> >> [1] https://github.com/gluster/gluster-block >> [2] https://github.com/gluster/gluster-block/releases >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL >> [4] https://github.com/gluster/gluster-block#usage >> [5] https://github.com/gluster/gluster-block/blob/master/README.md >> [6] https://github.com/gluster/gluster-block/tree/master/docs >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t >> [8] https://download.gluster.org/pub/gluster/gluster-block/ >> [9] https://github.com/gluster/gluster-block/issues/new >> >> Cheers, >> Team Gluster-Block! >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users From dkhandel at redhat.com Mon May 20 13:09:15 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Mon, 20 May 2019 18:39:15 +0530 Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com> Message-ID: Any updates on this? It's failing few of the regression runs. On Sat, May 18, 2019 at 6:27 PM Mohit Agrawal wrote: > Hi Rafi, > > I have not checked yet, on Monday I will check the same. 
> > Thanks, > Mohit Agrawal > > On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: > >> All of this links have a common backtrace, and suggest a crash from >> socket layer with ssl code path, >> >> Backtrace is >> >> Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): >> #0 0x00007f9d14d65400 in ssl3_free_digest_list () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from >> /lib64/libssl.so.10 >> No symbol table info available. >> #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 >> No symbol table info available. >> #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) >> at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 >> ret = -1 >> cname = 0x0 >> r = -1 >> ssl_error = -1 >> priv = 0x7f9ce802efc0 >> __FUNCTION__ = "ssl_complete_connection" >> #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt >> (this=0x7f9ce802e980) at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 >> priv = 0x7f9ce802efc0 >> ctx = 0x7f9d08001170 >> idx = 1 >> ret = -1 >> fd = 16 >> __FUNCTION__ = "ssl_handle_client_connection_attempt" >> #6 0x00007f9d14fbb8b3 in socket_complete_connection >> (this=0x7f9ce802e980) at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 >> priv = 0x7f9ce802efc0 >> ctx = 0x7f9d08001170 >> idx = 1 >> gen = 4 >> ret = -1 >> fd = 16 >> #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, >> data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 >> '\000') at >> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 >> #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler >> (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at >> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 >> #9 0x00007f9d20c89bda in event_dispatch_epoll_worker >> (data=0x7f9cf4000b60) at >> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 >> #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 >> >> >> Mohith, >> >> Do you have any idea what is going on with ssl? >> >> Regards >> >> Rafi KC >> On 5/16/19 8:01 PM, Sanju Rakonde wrote: >> >> Thank you for the quick responses. >> >> I missed pasting the links here. You can find the core file in the >> following links. >> https://build.gluster.org/job/centos7-regression/6035/consoleFull >> https://build.gluster.org/job/centos7-regression/6055/consoleFull >> https://build.gluster.org/job/centos7-regression/6045/consoleFull >> >> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: >> >>> Currently I'm looking into one of the priority issue, In parallel I will >>> also looking to this. >>> >>> Saju, >>> >>> Do you have a link to the core file? >>> >>> >>> Regards >>> >>> Rafi KC >>> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >>> >>> >>> I am working on other uss issue. i.e. the occasional failure of uss.t >>> due to delays in the brick-mux regression. Rafi? Can you please look into >>> this? >>> >>> Regards, >>> Raghavendra >>> >>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >>> wrote: >>> >>>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>>> is dumping core, hence the regression is failing for many patches. 
>>>> >>>> Rafi/Raghavendra, can you please look into this issue? >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> >>> >> >> -- >> Thanks, >> Sanju >> >> _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moagrawa at redhat.com Mon May 20 13:11:39 2019 From: moagrawa at redhat.com (Mohit Agrawal) Date: Mon, 20 May 2019 18:41:39 +0530 Subject: [Gluster-devel] ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t generating core very often In-Reply-To: References: <2e4ddeef-91d0-b197-35e6-7e4f9b6e6b86@redhat.com> <9df5d805-2250-7b45-d8f2-5eb48b10ffbc@redhat.com> Message-ID: I am working on it. On Mon, May 20, 2019 at 6:39 PM Deepshikha Khandelwal wrote: > Any updates on this? > > It's failing few of the regression runs. > > On Sat, May 18, 2019 at 6:27 PM Mohit Agrawal wrote: > >> Hi Rafi, >> >> I have not checked yet, on Monday I will check the same. >> >> Thanks, >> Mohit Agrawal >> >> On Sat, May 18, 2019 at 3:56 PM RAFI KC wrote: >> >>> All of this links have a common backtrace, and suggest a crash from >>> socket layer with ssl code path, >>> >>> Backtrace is >>> >>> Thread 1 (Thread 0x7f9cfbfff700 (LWP 31373)): >>> #0 0x00007f9d14d65400 in ssl3_free_digest_list () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #1 0x00007f9d14d65586 in ssl3_digest_cached_records () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #2 0x00007f9d14d5f91d in ssl3_send_client_verify () from >>> /lib64/libssl.so.10 >>> No symbol table info available. >>> #3 0x00007f9d14d61be7 in ssl3_connect () from /lib64/libssl.so.10 >>> No symbol table info available. 
>>> #4 0x00007f9d14fb3585 in ssl_complete_connection (this=0x7f9ce802e980) >>> at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:482 >>> ret = -1 >>> cname = 0x0 >>> r = -1 >>> ssl_error = -1 >>> priv = 0x7f9ce802efc0 >>> __FUNCTION__ = "ssl_complete_connection" >>> #5 0x00007f9d14fbb596 in ssl_handle_client_connection_attempt >>> (this=0x7f9ce802e980) at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2809 >>> priv = 0x7f9ce802efc0 >>> ctx = 0x7f9d08001170 >>> idx = 1 >>> ret = -1 >>> fd = 16 >>> __FUNCTION__ = "ssl_handle_client_connection_attempt" >>> #6 0x00007f9d14fbb8b3 in socket_complete_connection >>> (this=0x7f9ce802e980) at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2908 >>> priv = 0x7f9ce802efc0 >>> ctx = 0x7f9d08001170 >>> idx = 1 >>> gen = 4 >>> ret = -1 >>> fd = 16 >>> #7 0x00007f9d14fbbc16 in socket_event_handler (fd=16, idx=1, gen=4, >>> data=0x7f9ce802e980, poll_in=0, poll_out=4, poll_err=0, event_thread_died=0 >>> '\000') at >>> /home/jenkins/root/workspace/centos7-regression/rpc/rpc-transport/socket/src/socket.c:2970 >>> #8 0x00007f9d20c896c1 in event_dispatch_epoll_handler >>> (event_pool=0x7f9d08034960, event=0x7f9cfbffe140) at >>> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:648 >>> #9 0x00007f9d20c89bda in event_dispatch_epoll_worker >>> (data=0x7f9cf4000b60) at >>> /home/jenkins/root/workspace/centos7-regression/libglusterfs/src/event-epoll.c:761 >>> #10 0x00007f9d1fa39dd5 in start_thread () from /lib64/libpthread.so.0 >>> >>> >>> Mohith, >>> >>> Do you have any idea what is going on with ssl? >>> >>> Regards >>> >>> Rafi KC >>> On 5/16/19 8:01 PM, Sanju Rakonde wrote: >>> >>> Thank you for the quick responses. >>> >>> I missed pasting the links here. You can find the core file in the >>> following links. >>> https://build.gluster.org/job/centos7-regression/6035/consoleFull >>> https://build.gluster.org/job/centos7-regression/6055/consoleFull >>> https://build.gluster.org/job/centos7-regression/6045/consoleFull >>> >>> On Thu, May 16, 2019 at 7:49 PM RAFI KC wrote: >>> >>>> Currently I'm looking into one of the priority issue, In parallel I >>>> will also looking to this. >>>> >>>> Saju, >>>> >>>> Do you have a link to the core file? >>>> >>>> >>>> Regards >>>> >>>> Rafi KC >>>> On 5/16/19 7:28 PM, FNU Raghavendra Manjunath wrote: >>>> >>>> >>>> I am working on other uss issue. i.e. the occasional failure of uss.t >>>> due to delays in the brick-mux regression. Rafi? Can you please look into >>>> this? >>>> >>>> Regards, >>>> Raghavendra >>>> >>>> On Thu, May 16, 2019 at 9:48 AM Sanju Rakonde >>>> wrote: >>>> >>>>> In most of the regression jobs ./tests/bugs/snapshot/bug-1399598-uss-with-ssl.t >>>>> is dumping core, hence the regression is failing for many patches. >>>>> >>>>> Rafi/Raghavendra, can you please look into this issue? 
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Sanju
>>>>>
>>>>
>>>
>>> --
>>> Thanks,
>>> Sanju
>>>
>>> _______________________________________________
>>
>> Community Meeting Calendar:
>>
>> APAC Schedule -
>> Every 2nd and 4th Tuesday at 11:30 AM IST
>> Bridge: https://bluejeans.com/836554017
>>
>> NA/EMEA Schedule -
>> Every 1st and 3rd Tuesday at 01:00 PM EDT
>> Bridge: https://bluejeans.com/486278655
>>
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cynthia.zhou at nokia-sbell.com  Tue May 21 07:12:28 2019
From: cynthia.zhou at nokia-sbell.com (Zhou, Cynthia (NSB - CN/Hangzhou))
Date: Tue, 21 May 2019 07:12:28 +0000
Subject: [Gluster-devel] glusterfs coredump--mempool
Message-ID: <3cf2c0a2d1ca4ac19e085b1ff5fe2370@nokia-sbell.com>

Hi glusterfs expert,
I hit a glusterfs process coredump again in my env, shortly after the
glusterfs process startup. The frame's local has become NULL, but it seems
this frame is not destroyed yet, since the magic number
(GF_MEM_HEADER_MAGIC) is still untouched.

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --acl --volfile-server=mn-0.local --volfile-server=mn-1.loc'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f867fcd2971 in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8654008830) at client-rpc-fops.c:1510
1510 CLIENT_STACK_UNWIND (inodelk, frame, rsp.op_ret,
[Current thread is 1 (Thread 0x7f867d6d4700 (LWP 3046))]
Missing separate debuginfos, use: dnf debuginfo-install glusterfs-fuse-3.12.15-1.wos2.wf29.x86_64
(gdb) bt
#0  0x00007f867fcd2971 in client3_3_inodelk_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8654008830) at client-rpc-fops.c:1510
#1  0x00007f8685ea5584 in rpc_clnt_handle_reply (clnt=clnt at entry=0x7f8678070030, pollin=pollin at entry=0x7f86702833e0) at rpc-clnt.c:782
#2  0x00007f8685ea587b in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f8678070060, event=<optimized out>, data=0x7f86702833e0) at rpc-clnt.c:975
#3  0x00007f8685ea1b83 in rpc_transport_notify (this=this at entry=0x7f8678070270, event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=data at entry=0x7f86702833e0) at rpc-transport.c:538
#4  0x00007f8680b99867 in socket_event_poll_in (notify_handled=_gf_true, this=0x7f8678070270) at socket.c:2260
#5  socket_event_handler (fd=<optimized out>, idx=3, gen=1, data=0x7f8678070270, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>) at socket.c:2645
#6  0x00007f8686132911 in event_dispatch_epoll_handler (event=0x7f867d6d3e6c, event_pool=0x55e1b2792b00) at event-epoll.c:583
#7  event_dispatch_epoll_worker (data=0x7f867805ece0) at event-epoll.c:659
#8  0x00007f8684ea65da in start_thread () from /lib64/libpthread.so.0
#9  0x00007f868474eeaf in clone () from /lib64/libc.so.6
(gdb) print *(call_frame_t*)myframe
$3 = {root = 0x7f86540271a0, parent = 0x0, frames = {next = 0x7f8654027898, prev = 0x7f8654027898}, local = 0x0, this = 0x7f8678013080, ret = 0x0, ref_count = 0, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, cookie = 0x0, complete = _gf_false, xid = 0, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x0, wind_to = 0x0, unwind_from = 0x0, unwind_to = 0x0}
(gdb) x/4xw 0x7f8654008810
0x7f8654008810: 0xcafebabe 0x00000000 0x00000000 0x00000000
(gdb) p *(pooled_obj_hdr_t *)0x7f8654008810
$2 = {magic = 3405691582, next = 0x0, pool_list = 0x7f8654000b80, power_of_two = 8}

I added "uint32_t xid" to the _call_frame data structure, and set it from
rpcreq->xid in the __save_frame function. In a normal situation this xid
should be 0 only immediately after create_frame takes the frame from the
memory pool. But in this case the xid is 0, so it seems like the frame has
been given out for use again before this unwind was done with it. Do you
have any idea how this can happen?

cynthia
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kdhananj at redhat.com  Tue May 21 10:05:07 2019
From: kdhananj at redhat.com (Krutika Dhananjay)
Date: Tue, 21 May 2019 15:35:07 +0530
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To: <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>
References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>
Message-ID:

Hi Martin,

Glad it worked!

And yes, 3.7.6 is really old! :)

So the issue is occurring when the vm flushes outstanding data to disk. And
this is taking > 120s because there's a lot of buffered writes to flush,
possibly followed by an fsync too, which needs to sync them to disk (a
volume profile would have been helpful in confirming this).

All these two options do is truly honor the O_DIRECT flag (which is what we
want anyway, given the vms are opened with the 'cache=none' qemu option).
This will skip write-caching on the gluster client side and also bypass the
page-cache on the gluster-bricks, and so data gets flushed faster, thereby
eliminating these timeouts.

-Krutika

On Mon, May 20, 2019 at 3:38 PM Martin wrote:

> Hi Krutika,
>
> Also, gluster version please?
>
> I am running old 3.7.6. (Yes I know I should upgrade asap)
>
> I've applied firstly "network.remote-dio off", behaviour did not change,
> VMs got stuck after some time again.
> Then I've set "performance.strict-o-direct on" and problem completely
> disappeared. No more hangs at all (7 days without any problems at all).
> This SOLVED the issue.
>
> Can you explain what remote-dio and strict-o-direct variables changed in
> behaviour of my Gluster? It would be great for later archive/users to
> understand what and why this solved my issue.
>
> Anyway, Thanks a LOT!!!
>
> BR,
> Martin
>
> On 13 May 2019, at 10:20, Krutika Dhananjay wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> preferably one option changed at a time, its impact tested and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
>
> On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote:
>
>> Cache in qemu is none. That should be correct.
This is full command : >> >> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine >> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp >> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 >> -no-user-config -nodefaults -chardev >> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait >> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime >> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device >> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 >> >> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 >> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 >> -drive file=/var/lib/one//datastores/116/312/*disk.0* >> ,format=raw,if=none,id=drive-virtio-disk1,cache=none >> -device >> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 >> -drive file=gluster://localhost:24007/imagestore/ >> *7b64d6757acc47a39503f68731f89b8e* >> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none >> -device >> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 >> -drive file=/var/lib/one//datastores/116/312/*disk.1* >> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on >> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 >> >> -netdev tap,fd=26,id=hostnet0 >> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 >> -chardev pty,id=charserial0 -device >> isa-serial,chardev=charserial0,id=serial0 >> -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait >> -device >> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 >> -vnc 0.0.0.0:312,password -device >> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on >> >> I?ve highlighted disks. First is VM context disk - Fuse used, second is >> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. >> >> Krutika, >> I will start profiling on Gluster Volumes and wait for next VM to fail. >> Than I will attach/send profiling info after some VM will be failed. I >> suppose this is correct profiling strategy. >> > > About this, how many vms do you need to recreate it? A single vm? Or > multiple vms doing IO in parallel? > > >> Thanks, >> BR! >> Martin >> >> On 13 May 2019, at 09:21, Krutika Dhananjay wrote: >> >> Also, what's the caching policy that qemu is using on the affected vms? >> Is it cache=none? Or something else? You can get this information in the >> command line of qemu-kvm process corresponding to your vm in the ps output. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay >> wrote: >> >>> What version of gluster are you using? >>> Also, can you capture and share volume-profile output for a run where >>> you manage to recreate this issue? >>> >>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >>> Let me know if you have any questions. >>> >>> -Krutika >>> >>> On Mon, May 13, 2019 at 12:34 PM Martin Toth >>> wrote: >>> >>>> Hi, >>>> >>>> there is no healing operation, not peer disconnects, no readonly >>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>>> its SSD with 10G, performance is good. >>>> >>>> > you'd have it's log on qemu's standard output, >>>> >>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. 
I am looking >>>> for problem for more than month, tried everything. Can?t find anything. Any >>>> more clues or leads? >>>> >>>> BR, >>>> Martin >>>> >>>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>>> > >>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>>> >> Hi all, >>>> > >>>> > Hi >>>> > >>>> >> >>>> >> I am running replica 3 on SSDs with 10G networking, everything works >>>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>>> blocked for more than 120 seconds?. >>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>>> am unable to SSH and also login with console, its stuck probably on some >>>> disk operation. No error/warning logs or messages are store in VMs logs. >>>> >> >>>> > >>>> > As far as I know this should be unrelated, I get this during heals >>>> > without any freezes, it just means the storage is slow I think. >>>> > >>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks >>>> on replica volume. Can someone advice how to debug this problem or what >>>> can cause these issues? >>>> >> It?s really annoying, I?ve tried to google everything but nothing >>>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>>> drivers, but its not related. >>>> >> >>>> > >>>> > Any chance your gluster goes readonly ? Have you checked your gluster >>>> > logs to see if maybe they lose each other some times ? >>>> > /var/log/glusterfs >>>> > >>>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>>> > that might contain the actual error at the time of the freez. >>>> > _______________________________________________ >>>> > Gluster-users mailing list >>>> > Gluster-users at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Tue May 21 13:27:59 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Tue, 21 May 2019 18:57:59 +0530 Subject: [Gluster-devel] tests are timing out in master branch In-Reply-To: References: Message-ID: Looks like after reverting a patch on RPC layer reconnection logic ( https://review.gluster.org/22750) things are back to normal. For those who submitted a patch in last 1 week, please resubmit. (which should take care of rebasing on top of this patch). This event proves that there are very delicate races in our RPC layer, which can trigger random failures. While it was discussed in brief earlier. We need to debug this further, and come up with possible next actions. Volunteers welcome. I recommend to use https://github.com/gluster/glusterfs/issues/391 to capture our observations, and continue on github from here. -Amar On Wed, May 15, 2019 at 11:46 AM Sankarshan Mukhopadhyay < sankarshan.mukhopadhyay at gmail.com> wrote: > On Wed, May 15, 2019 at 11:24 AM Atin Mukherjee > wrote: > > > > There're random tests which are timing out after 200 secs. My belief is > this is a major regression introduced by some commit recently or the > builders have become extremely slow which I highly doubt. I'd request that > we first figure out the cause, get master back to it's proper health and > then get back to the review/merge queue. 
> > > > For such dire situations, we also need to consider a proposal to back > out patches in order to keep the master healthy. The outcome we seek > is a healthy master - the isolation of the cause allows us to not > repeat the same offense. > > > Sanju has already started looking into > /tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t to understand > what test is specifically hanging and consuming more time. > _______________________________________________ > Atin Mukherjee , Sankarshan Mukhopadhyay < > sankarshan.mukhopadhyay at gmail.com> > Community Meeting Calendar: > > APAC Schedule -https://review.gluster.org/22750 > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From pkalever at redhat.com Tue May 21 14:39:22 2019 From: pkalever at redhat.com (Prasanna Kalever) Date: Tue, 21 May 2019 20:09:22 +0530 Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive! In-Reply-To: References: Message-ID: On Mon, May 20, 2019 at 9:05 PM Vlad Kopylov wrote: > > Thank you Prasanna. > > Do we have architecture somewhere? Vlad, Although the complete set of details might be missing at one place right now, some pointers to start are available at, https://github.com/gluster/gluster-block#gluster-block and https://pkalever.wordpress.com/2019/05/06/starting-with-gluster-block, hopefully that should give some clarity about the project. Also checkout the man pages. > Dies it bypass Fuse and go directly gfapi ? yes, we don't use Fuse access with gluster-block. The management as-well-as IO happens over gfapi. Please go through the docs pointed above, if you have any specific queries, feel free to ask them here or on github. Best Regards, -- Prasanna > > v > > On Mon, May 20, 2019, 8:36 AM Prasanna Kalever wrote: >> >> Hey Vlad, >> >> Thanks for trying gluster-block. Appreciate your feedback. >> >> Here is the patch which should fix the issue you have noticed: >> https://github.com/gluster/gluster-block/pull/233 >> >> Thanks! >> -- >> Prasanna >> >> On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote: >> > >> > >> > straight from >> > >> > ./autogen.sh && ./configure && make -j install >> > >> > >> > CentOS Linux release 7.6.1810 (Core) >> > >> > >> > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory >> > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr. >> > May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :] >> > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs >> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE >> > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state. >> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed. >> > >> > >> > >> > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote: >> >> >> >> Hello Gluster folks, >> >> >> >> Gluster-block team is happy to announce the v0.4 release [1]. 
>> >> >> >> This is the new stable version of gluster-block, lots of new and >> >> exciting features and interesting bug fixes are made available as part >> >> of this release. >> >> Please find the big list of release highlights and notable fixes at [2]. >> >> >> >> Details about installation can be found in the easy install guide at >> >> [3]. Find the details about prerequisites and setup guide at [4]. >> >> If you are a new user, checkout the demo video attached in the README >> >> doc [5], which will be a good source of intro to the project. >> >> There are good examples about how to use gluster-block both in the man >> >> pages [6] and test file [7] (also in the README). >> >> >> >> gluster-block is part of fedora package collection, an updated package >> >> with release version v0.4 will be soon made available. And the >> >> community provided packages will be soon made available at [8]. >> >> >> >> Please spend a minute to report any kind of issue that comes to your >> >> notice with this handy link [9]. >> >> We look forward to your feedback, which will help gluster-block get better! >> >> >> >> We would like to thank all our users, contributors for bug filing and >> >> fixes, also the whole team who involved in the huge effort with >> >> pre-release testing. >> >> >> >> >> >> [1] https://github.com/gluster/gluster-block >> >> [2] https://github.com/gluster/gluster-block/releases >> >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL >> >> [4] https://github.com/gluster/gluster-block#usage >> >> [5] https://github.com/gluster/gluster-block/blob/master/README.md >> >> [6] https://github.com/gluster/gluster-block/tree/master/docs >> >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t >> >> [8] https://download.gluster.org/pub/gluster/gluster-block/ >> >> [9] https://github.com/gluster/gluster-block/issues/new >> >> >> >> Cheers, >> >> Team Gluster-Block! >> >> _______________________________________________ >> >> Gluster-users mailing list >> >> Gluster-users at gluster.org >> >> https://lists.gluster.org/mailman/listinfo/gluster-users From kdhananj at redhat.com Wed May 22 13:22:52 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Wed, 22 May 2019 18:52:52 +0530 Subject: [Gluster-devel] More intelligent file distribution across subvols of DHT when file size is known Message-ID: Hi, I've proposed a solution to the problem of space running out in some children of DHT even when its other children have free space available, here - https://github.com/gluster/glusterfs/issues/675. The proposal aims to solve a very specific instance of this generic class of problems where fortunately the size of the file that is getting created is known beforehand. Requesting feedback on the proposal or even alternate solutions, if you have any. -Krutika -------------- next part -------------- An HTML attachment was scrubbed... URL: From rabhat at redhat.com Wed May 22 14:18:03 2019 From: rabhat at redhat.com (FNU Raghavendra Manjunath) Date: Wed, 22 May 2019 10:18:03 -0400 Subject: [Gluster-devel] ./tests/basic/uss.t is timing out in release-6 branch In-Reply-To: References: Message-ID: More analysis: It looks like in the 1st iteration, the testcase is stuck at the test (TEST ! stat $M0/.history/snap6/aaa) from line 385 of uss.t Before, it was the last test to be executed from uss.t. So the assumption was that after the completion of that test (i.e. test from line 385), cleanup function was either getting blocked or taking more time to do cleanups. 
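For context, the sequence of steps around that test looks roughly like the
following reconstruction (illustrative, not a verbatim copy of uss.t; the
script configures the snapshot directory as .history earlier on):

    TEST $CLI snapshot delete snap6
    TEST rm -f $M0/aaa
    TEST $CLI snapshot create snap6 $V0 no-timestamp
    TEST $CLI snapshot activate snap6
    # "aaa" was removed before this snap6 was taken, so the lookup must fail:
    TEST ! stat $M0/.history/snap6/aaa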
Hence the patch [1] was submitted to reduce the amount of work done by the
cleanup function. The patch ensured that the snapshots, the volume etc.
created in the test are deleted before the cleanup function is executed.

But even with that, we observed uss.t failing sometimes (mainly with
brick-mux regressions). To get more information regarding the failure,
another patch [2] was sent. From that patch, some more information was
received.

1) Every time uss.t times out, the script (uss.t) is stuck executing the
particular test from line 385 (TEST ! stat $M0/.history/snap6/aaa)
   - This test's purpose is to ensure that looking for a file that does
not exist in a snapshot fails.

2) Adding TRACE logs via [2] indicates that:
   - the stat request sent by the test reaches the snapshot daemon and
later the gfapi client instance that the snapshot daemon spawns to
communicate with the snapshot volume.
   - The stat request is served by the md-cache xlator in the gfapi client
instance (and is hence successful).

"[2019-05-16 18:31:18.607521]:++++++++++ G_LOG:./tests/basic/uss.t: TEST:
392 ! stat /mnt/glusterfs/0/.history/snap6/aaa ++++++++++
[2019-05-16 18:31:18.617104] T [MSGID: 0] [syncop.c:2424:syncop_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from gfapi to
meta-autoload
[2019-05-16 18:31:18.617119] T [MSGID: 0] [defaults.c:2841:default_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from meta-autoload
to 0e69605de2974f1b887deee5b3f63b52
[2019-05-16 18:31:18.617130] T [MSGID: 0] [io-stats.c:2709:io_stats_stat]
0-stack-trace: stack-address: 0x7fc63405dba8, winding from
0e69605de2974f1b887deee5b3f63b52 to 0e69605de2974f1b887deee5b3f63b52-io-threads
[2019-05-16 18:31:18.617142] D [MSGID: 0] [io-threads.c:376:iot_schedule]
0-0e69605de2974f1b887deee5b3f63b52-io-threads: STAT scheduled as fast
priority fop
[2019-05-16 18:31:18.617162] T [MSGID: 0]
[defaults.c:2068:default_stat_resume] 0-stack-trace: stack-address:
0x7fc63405dba8, winding from 0e69605de2974f1b887deee5b3f63b52-io-threads
to 0e69605de2974f1b887deee5b3f63b52-md-cache
[2019-05-16 18:31:18.617176] T [MSGID: 0] [md-cache.c:1359:mdc_stat]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-md-cache returned 0 =========> SUCCESSFUL HERE
[2019-05-16 18:31:18.617186] T [MSGID: 0] [defaults.c:1406:default_stat_cbk]
0-stack-trace: stack-address: 0x7fc63405dba8,
0e69605de2974f1b887deee5b3f63b52-io-threads returned 0
[2019-05-16 18:31:18.617195] T [MSGID: 0]
[io-stats.c:2059:io_stats_stat_cbk] 0-stack-trace: stack-address:
0x7fc63405dba8, 0e69605de2974f1b887deee5b3f63b52 returned 0
"

   - The stat response does not reach the snapshot daemon. So the snapshot
daemon is not able to send any response back to the gluster client which
initiated this stat request. This leaves the client waiting for a
response, resulting in a timeout as per the regression test infra (which
sets a 200 second timeout).

Suspects:
==========

* First of all, the stat request from line 385 (TEST ! stat
$M0/.history/snap6/aaa) should not be successful. Because the test deletes
the snapshot "snap6", removes the file "aaa" from the mount point, again
takes the snapshot "snap6" and performs the stat operation on the deleted
file "aaa". So the stat should fail.
* The patch [2] has been sent to collect more information about the failure (with more logs added to snapview-server and also log level being changed to TRACE in the .t file) [1] https://review.gluster.org/#/c/glusterfs/+/22649/ [2] https://review.gluster.org/#/c/glusterfs/+/22728/ Regards, Raghavendra On Wed, May 1, 2019 at 11:11 AM Sanju Rakonde wrote: > Thank you Raghavendra. > > On Tue, Apr 30, 2019 at 11:46 PM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> To make things relatively easy for the cleanup () function in the test >> framework, I think it would be better to ensure that uss.t itself deletes >> snapshots and the volume once the tests are done. Patch [1] has been >> submitted for review. >> >> [1] https://review.gluster.org/#/c/glusterfs/+/22649/ >> >> Regards, >> Raghavendra >> >> On Tue, Apr 30, 2019 at 10:42 AM FNU Raghavendra Manjunath < >> rabhat at redhat.com> wrote: >> >>> >>> The failure looks similar to the issue I had mentioned in [1] >>> >>> In short for some reason the cleanup (the cleanup function that we call >>> in our .t files) seems to be taking more time and also not cleaning up >>> properly. This leads to problems for the 2nd iteration (where basic things >>> such as volume creation or volume start itself fails due to ENODATA or >>> ENOENT errors). >>> >>> The 2nd iteration of the uss.t ran had the following errors. >>> >>> "[2019-04-29 09:08:15.275773]:++++++++++ G_LOG:./tests/basic/uss.t: >>> TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable >>> false ++++++++++ >>> [2019-04-29 09:08:15.390550] : volume set patchy nfs.disable false : >>> SUCCESS >>> [2019-04-29 09:08:15.404624]:++++++++++ G_LOG:./tests/basic/uss.t: TEST: >>> 42 gluster --mode=script --wignore volume start patchy ++++++++++ >>> [2019-04-29 09:08:15.468780] : volume start patchy : FAILED : Failed to >>> get extended attribute trusted.glusterfs.volume-id for brick dir >>> /d/backends/3/patchy_snap_mnt. Reason : No data available >>> " >>> >>> These are the initial steps to create and start volume. Why >>> trusted.glusterfs.volume-id extended attribute is absent is not sure. The >>> analysis in [1] had errors of ENOENT (i.e. export directory itself was >>> absent). >>> I suspect this to be because of some issue with the cleanup mechanism at >>> the end of the tests. >>> >>> [1] >>> https://lists.gluster.org/pipermail/gluster-devel/2019-April/056104.html >>> >>> On Tue, Apr 30, 2019 at 8:37 AM Sanju Rakonde >>> wrote: >>> >>>> Hi Raghavendra, >>>> >>>> ./tests/basic/uss.t is timing out in release-6 branch consistently. >>>> One such instance is https://review.gluster.org/#/c/glusterfs/+/22641/. >>>> Can you please look into this? >>>> >>>> -- >>>> Thanks, >>>> Sanju >>>> >>> > > -- > Thanks, > Sanju > -------------- next part -------------- An HTML attachment was scrubbed... URL: From srakonde at redhat.com Thu May 23 10:34:37 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Thu, 23 May 2019 16:04:37 +0530 Subject: [Gluster-devel] ./tests/basic/gfapi/gfapi-ssl-test.t is failing too often in regression Message-ID: I see a lot of patches are failing regressions due to the .t mentioned in the subject line. I've filed a bug[1] for the same. https://bugzilla.redhat.com/show_bug.cgi?id=1713284 -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From srakonde at redhat.com Thu May 23 11:04:28 2019
From: srakonde at redhat.com (Sanju Rakonde)
Date: Thu, 23 May 2019 16:34:28 +0530
Subject: [Gluster-devel] ./tests/basic/gfapi/gfapi-ssl-test.t is failing too often in regression
In-Reply-To: References: Message-ID:

I apologize for the wrong mail. This .t failed only for one patch and I don't think it is spurious. Closing this bug as not a bug.

On Thu, May 23, 2019 at 4:04 PM Sanju Rakonde wrote:
> I see a lot of patches are failing regressions due to the .t mentioned in
> the subject line. I've filed a bug [1] for the same.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1713284
> --
> Thanks,
> Sanju

--
Thanks,
Sanju
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From pkarampu at redhat.com Fri May 24 06:26:47 2019
From: pkarampu at redhat.com (Pranith Kumar Karampuri)
Date: Fri, 24 May 2019 11:56:47 +0530
Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames
Message-ID:

Hi,
At the moment a new stack doesn't populate frame->root->unique in all cases. This makes it difficult to debug hung frames by examining successive state dumps. Fuse and server xlators populate it whenever they can, but other xlators won't be able to assign one when they need to create a new frame/stack. Is it okay to change the create_frame() code to always populate it with an increasing number for this purpose?
I checked that both fuse and server xlators use it only in gf_log(), so there doesn't seem to be any other link between frame->root->unique and the functionality of the fuse and server xlators.
Do let me know if I missed anything before sending this change.

--
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rabhat at redhat.com Fri May 24 17:27:16 2019
From: rabhat at redhat.com (FNU Raghavendra Manjunath)
Date: Fri, 24 May 2019 13:27:16 -0400
Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames
In-Reply-To: References: Message-ID:

The idea looks OK. One of the things that probably needs to be considered (more of an implementation detail though) is how to generate frame->root->unique.

Because, for fuse, frame->root->unique is obtained from finh->unique, which IIUC comes from the incoming fop from the kernel itself.
For protocol/server, IIUC frame->root->unique comes from req->xid of the rpc request, which itself is obtained from transport->xid of the rpc_transport_t structure (and from my understanding, the transport->xid is just incremented every time a new rpc request is created).

Overall the suggestion looks fine though.

Regards,
Raghavendra

On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri wrote:
> Hi,
> At the moment a new stack doesn't populate frame->root->unique in
> all cases. This makes it difficult to debug hung frames by examining
> successive state dumps. Fuse and server xlators populate it whenever they
> can, but other xlators won't be able to assign one when they need to create
> a new frame/stack. Is it okay to change the create_frame() code to always
> populate it with an increasing number for this purpose?
> I checked that both fuse and server xlators use it only in gf_log(), so
> there doesn't seem to be any other link between frame->root->unique and
> the functionality of the fuse and server xlators.
> Do let me know if I missed anything before sending this change.
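For reference, a minimal sketch of the proposed change in plain C (illustrative only, not the actual glusterfs sources: the counter name is hypothetical, and real code would use glusterfs's own atomic helpers rather than C11 atomics):

#include <stdatomic.h>
#include <stdint.h>

/* hypothetical global counter, bumped once per created stack,
 * much like transport->xid is bumped once per rpc request */
static _Atomic uint64_t stack_unique_counter;

struct call_root { uint64_t unique; };
struct call_frame { struct call_root *root; };

static void
assign_unique(struct call_frame *frame)
{
    /* fetch-and-add guarantees distinct values even when many threads
     * create frames concurrently; relaxed ordering is enough because
     * the value is only an identifier printed in logs/statedumps */
    frame->root->unique = atomic_fetch_add_explicit(
        &stack_unique_counter, 1, memory_order_relaxed);
}

With something like this, two successive statedumps can be compared directly: a frame that keeps the same unique value across dumps is the same (possibly hung) frame, not a new one that merely reuses the address.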
> > -- > Pranith > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pkarampu at redhat.com Sat May 25 04:52:13 2019 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Sat, 25 May 2019 10:22:13 +0530 Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames In-Reply-To: References: Message-ID: On Fri, May 24, 2019 at 10:57 PM FNU Raghavendra Manjunath < rabhat at redhat.com> wrote: > > The idea looks OK. One of the things that probably need to be considered > (more of an implementation detail though) is how to generate > frame->root->unique. > > Because, for fuse, frame->root->unique is obtained by finh->unique which > IIUC is got from the incoming fop from kernel itself. > For protocol/server IIUC frame->root->unique is got from req->xit of the > rpc request, which itself is obtained from transport->xid of the > rpc_transport_t structure (and from my understanding, the transport->xid is > just incremented by everytime a > new rpc request is created). > > Overall the suggestion looks fine though. > I am planning to do the same thing transport->xid does. I will send out the patch > Regards, > Raghavendra > > > On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri < > pkarampu at redhat.com> wrote: > >> Hi, >> At the moment new stack doesn't populate frame->root->unique in >> all cases. This makes it difficult to debug hung frames by examining >> successive state dumps. Fuse and server xlator populate it whenever they >> can, but other xlators won't be able to assign one when they need to create >> a new frame/stack. Is it okay to change create_frame() code to always >> populate it with an increasing number for this purpose? >> I checked both fuse and server xlator use it only in gf_log() so it >> doesn't seem like there is any other link between frame->root->unique and >> the functionality of fuse, server xlators. >> Do let me know if I missed anything before sending this change. >> >> -- >> Pranith >> _______________________________________________ >> >> Community Meeting Calendar: >> >> APAC Schedule - >> Every 2nd and 4th Tuesday at 11:30 AM IST >> Bridge: https://bluejeans.com/836554017 >> >> NA/EMEA Schedule - >> Every 1st and 3rd Tuesday at 01:00 PM EDT >> Bridge: https://bluejeans.com/486278655 >> >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel >> >> -- Pranith -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at build.gluster.org Mon May 27 01:45:02 2019 From: jenkins at build.gluster.org (jenkins at build.gluster.org) Date: Mon, 27 May 2019 01:45:02 +0000 (UTC) Subject: [Gluster-devel] Weekly Untriaged Bugs Message-ID: <1771699257.77.1558921502881.JavaMail.jenkins@jenkins-el7.rht.gluster.org> [...truncated 6 lines...] https://bugzilla.redhat.com/1713391 / project-infrastructure: Access to wordpress instance of gluster.org required for release management [...truncated 2 lines...] -------------- next part -------------- A non-text attachment was scrubbed... 
Name: build.log Type: application/octet-stream Size: 516 bytes Desc: not available URL: From pkarampu at redhat.com Mon May 27 12:04:57 2019 From: pkarampu at redhat.com (Pranith Kumar Karampuri) Date: Mon, 27 May 2019 17:34:57 +0530 Subject: [Gluster-devel] making frame->root->unique more effective in debugging hung frames In-Reply-To: References: Message-ID: On Sat, May 25, 2019 at 10:22 AM Pranith Kumar Karampuri < pkarampu at redhat.com> wrote: > > > On Fri, May 24, 2019 at 10:57 PM FNU Raghavendra Manjunath < > rabhat at redhat.com> wrote: > >> >> The idea looks OK. One of the things that probably need to be considered >> (more of an implementation detail though) is how to generate >> frame->root->unique. >> >> Because, for fuse, frame->root->unique is obtained by finh->unique which >> IIUC is got from the incoming fop from kernel itself. >> For protocol/server IIUC frame->root->unique is got from req->xit of the >> rpc request, which itself is obtained from transport->xid of the >> rpc_transport_t structure (and from my understanding, the transport->xid is >> just incremented by everytime a >> new rpc request is created). >> >> Overall the suggestion looks fine though. >> > > I am planning to do the same thing transport->xid does. I will send out > the patch > https://review.gluster.org/c/glusterfs/+/22773 > >> Regards, >> Raghavendra >> >> >> On Fri, May 24, 2019 at 2:27 AM Pranith Kumar Karampuri < >> pkarampu at redhat.com> wrote: >> >>> Hi, >>> At the moment new stack doesn't populate frame->root->unique in >>> all cases. This makes it difficult to debug hung frames by examining >>> successive state dumps. Fuse and server xlator populate it whenever they >>> can, but other xlators won't be able to assign one when they need to create >>> a new frame/stack. Is it okay to change create_frame() code to always >>> populate it with an increasing number for this purpose? >>> I checked both fuse and server xlator use it only in gf_log() so it >>> doesn't seem like there is any other link between frame->root->unique and >>> the functionality of fuse, server xlators. >>> Do let me know if I missed anything before sending this change. >>> >>> -- >>> Pranith >>> _______________________________________________ >>> >>> Community Meeting Calendar: >>> >>> APAC Schedule - >>> Every 2nd and 4th Tuesday at 11:30 AM IST >>> Bridge: https://bluejeans.com/836554017 >>> >>> NA/EMEA Schedule - >>> Every 1st and 3rd Tuesday at 01:00 PM EDT >>> Bridge: https://bluejeans.com/486278655 >>> >>> Gluster-devel mailing list >>> Gluster-devel at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-devel >>> >>> > > -- > Pranith > -- Pranith -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 30 06:03:18 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 30 May 2019 08:03:18 +0200 Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? Message-ID: Hi all, a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). 
This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon).

Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default.

I think we should enable it by default. Does anyone see any possible issue if we do that ?

Regards,

Xavi

[1] https://review.gluster.org/c/glusterfs/+/14736
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From atumball at redhat.com Thu May 30 06:34:43 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Thu, 30 May 2019 12:04:43 +0530
Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
In-Reply-To: References: Message-ID:

On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez wrote:
> Hi all,
>
> a patch [1] was added some time ago to send upcall notifications from the
> locks xlator to the current owner of a granted lock when another client
> tries to acquire the same lock (inodelk or entrylk). This makes it possible
> to use eager-locking on the client side, which improves performance
> significantly, while also keeping good performance when multiple clients
> are accessing the same files (the current owner of the lock receives the
> notification and releases it as soon as possible, allowing the other client
> to acquire it and proceed very soon).
>
> Currently both AFR and EC are ready to handle these contention
> notifications and both use eager-locking. However, the upcall contention
> notification is disabled by default.
>
> I think we should enable it by default. Does anyone see any possible
> issue if we do that ?
>
If it helps performance, we should ideally do it.

But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for the branch-out, and make it default for the next version? (so that it gets time for testing). Considering it is about consistency, I would like to hear everyone's opinion here.

Regards,
Amar

> Regards,
>
> Xavi
>
> [1] https://review.gluster.org/c/glusterfs/+/14736
> _______________________________________________
>
-- Amar Tumballi (amarts)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From aspandey at redhat.com Thu May 30 07:03:43 2019
From: aspandey at redhat.com (Ashish Pandey)
Date: Thu, 30 May 2019 03:03:43 -0400 (EDT)
Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
In-Reply-To: References: Message-ID: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com>

I am only concerned about in-service upgrade.
If a feature/option is not present in V1, then I would prefer not to enable it by default on V2.
We have seen some problems with other-eager-lock when we changed it to be enabled by default.

--- Ashish

----- Original Message -----
From: "Amar Tumballi Suryanarayan"
To: "Xavi Hernandez"
Cc: "gluster-devel"
Sent: Thursday, May 30, 2019 12:04:43 PM
Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ?
On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez < xhernandez at redhat.com > wrote:

Hi all,

a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon).

Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default.

I think we should enable it by default. Does anyone see any possible issue if we do that ?

If it helps performance, we should ideally do it.

But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for the branch-out, and make it default for the next version? (so that it gets time for testing). Considering it is about consistency, I would like to hear everyone's opinion here.

Regards,
Amar
Regards, Xavi [1] https://review.gluster.org/c/glusterfs/+/14736 _______________________________________________
-- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From xhernandez at redhat.com Thu May 30 08:33:54 2019 From: xhernandez at redhat.com (Xavi Hernandez) Date: Thu, 30 May 2019 10:33:54 +0200 Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? In-Reply-To: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> References: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> Message-ID: On Thu, May 30, 2019 at 9:03 AM Ashish Pandey wrote: > > > I am only concerned about in-service upgrade. > If a feature/option is not present in V1, then I would prefer not to > enable it by default on V2. > The problem is that without enabling it, (other-)eager-lock will cause performance issues in some cases. It doesn't seem good to keep an option disabled if enabling it solves these problems. > We have seen some problem in other-eager-lock when we changed it to enable > by default. > Which problems ? I think the only issue with other-eager-lock has been precisely that locks-notify-contention was disabled and a bug that needed to be solved anyway. The difference will be that upgraded bricks will start sending upcall notifications. If clients are too old, these will simply be ignored. So I don't see any problem right now. Am I missing something ? > --- > Ashish > > ------------------------------ > *From: *"Amar Tumballi Suryanarayan" > *To: *"Xavi Hernandez" > *Cc: *"gluster-devel" > *Sent: *Thursday, May 30, 2019 12:04:43 PM > *Subject: *Re: [Gluster-devel] Should we enable > features.locks-notify.contention by default ? > > > > On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez > wrote: > >> Hi all, >> >> a patch [1] was added some time ago to send upcall notifications from the >> locks xlator to the current owner of a granted lock when another client >> tries to acquire the same lock (inodelk or entrylk). This makes it possible >> to use eager-locking on the client side, which improves performance >> significantly, while also keeping good performance when multiple clients >> are accessing the same files (the current owner of the lock receives the >> notification and releases it as soon as possible, allowing the other client >> to acquire it and proceed very soon). >> >> Currently both AFR and EC are ready to handle these contention >> notifications and both use eager-locking. However the upcall contention >> notification is disabled by default. >> >> I think we should enabled it by default. Does anyone see any possible >> issue if we do that ? >> >> > If it helps performance, we should ideally do it. > > But, considering we are days away from glusterfs-7.0 branching, should we > do it now, or wait for branch out, and make it default for next version? > (so that it gets time for testing). Considering it is about consistency I > would like to hear everyone's opinion here. 
> > Regards, > Amar > > > > >> >> Regards, >> >> Xavi >> >> [1] https://review.gluster.org/c/glusterfs/+/14736 >> _______________________________________________ >> >> > -- > Amar Tumballi (amarts) > > _______________________________________________ > > Community Meeting Calendar: > > APAC Schedule - > Every 2nd and 4th Tuesday at 11:30 AM IST > Bridge: https://bluejeans.com/836554017 > > NA/EMEA Schedule - > Every 1st and 3rd Tuesday at 01:00 PM EDT > Bridge: https://bluejeans.com/486278655 > > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hgowtham at redhat.com Thu May 30 08:46:14 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 30 May 2019 14:16:14 +0530 Subject: [Gluster-devel] Release 4.1.9 : Expected tagging on June 4th Message-ID: Hi, Expected tagging date for release-4.1.9 is on June, 4th, 2019. Please ensure required patches are back-ported and also are passing regressions and are appropriately reviewed for easy merging and tagging on the date. Note: This will be the last release in the 4.1 series as release 7 is around the corner. -- Regards, Hari Gowtham. From hgowtham at redhat.com Thu May 30 08:49:31 2019 From: hgowtham at redhat.com (Hari Gowtham) Date: Thu, 30 May 2019 14:19:31 +0530 Subject: [Gluster-devel] Release 5.7: Expected tagging on June 6th Message-ID: Hi, Expected tagging date for release-5.7 is on June, 6th, 2019. Please ensure required patches are back-ported and also are passing regressions and are appropriately reviewed for easy merging and tagging on the date. -- Regards, Hari Gowtham. From aspandey at redhat.com Thu May 30 09:23:36 2019 From: aspandey at redhat.com (Ashish Pandey) Date: Thu, 30 May 2019 05:23:36 -0400 (EDT) Subject: [Gluster-devel] Should we enable features.locks-notify.contention by default ? In-Reply-To: References: <82655844.20628389.1559199823610.JavaMail.zimbra@redhat.com> Message-ID: <1863496869.20640433.1559208216880.JavaMail.zimbra@redhat.com> ----- Original Message ----- From: "Xavi Hernandez" To: "Ashish Pandey" Cc: "Amar Tumballi Suryanarayan" , "gluster-devel" Sent: Thursday, May 30, 2019 2:03:54 PM Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ? On Thu, May 30, 2019 at 9:03 AM Ashish Pandey < aspandey at redhat.com > wrote: I am only concerned about in-service upgrade. If a feature/option is not present in V1, then I would prefer not to enable it by default on V2. The problem is that without enabling it, (other-)eager-lock will cause performance issues in some cases. It doesn't seem good to keep an option disabled if enabling it solves these problems.
We have seen some problems with other-eager-lock when we changed it to be enabled by default.
Which problems ? I think the only issue with other-eager-lock has been precisely that locks-notify-contention was disabled, and a bug that needed to be solved anyway.

I was talking about the issue when we have other-eager-lock disabled and then try to do an in-service upgrade to a version where this option is ON by default. Although we don't have a root cause for that, I was wondering if a similar issue could happen in this case also.

The difference will be that upgraded bricks will start sending upcall notifications. If clients are too old, these will simply be ignored. So I don't see any problem right now. Am I missing something ?
--- Ashish From: "Amar Tumballi Suryanarayan" < atumball at redhat.com > To: "Xavi Hernandez" < xhernandez at redhat.com > Cc: "gluster-devel" < gluster-devel at gluster.org > Sent: Thursday, May 30, 2019 12:04:43 PM Subject: Re: [Gluster-devel] Should we enable features.locks-notify.contention by default ? On Thu, May 30, 2019 at 11:34 AM Xavi Hernandez < xhernandez at redhat.com > wrote:
Hi all, a patch [1] was added some time ago to send upcall notifications from the locks xlator to the current owner of a granted lock when another client tries to acquire the same lock (inodelk or entrylk). This makes it possible to use eager-locking on the client side, which improves performance significantly, while also keeping good performance when multiple clients are accessing the same files (the current owner of the lock receives the notification and releases it as soon as possible, allowing the other client to acquire it and proceed very soon). Currently both AFR and EC are ready to handle these contention notifications and both use eager-locking. However, the upcall contention notification is disabled by default. I think we should enable it by default. Does anyone see any possible issue if we do that ?
If it helps performance, we should ideally do it. But, considering we are days away from glusterfs-7.0 branching, should we do it now, or wait for branch out, and make it default for next version? (so that it gets time for testing). Considering it is about consistency I would like to hear everyone's opinion here. Regards, Amar
Regards, Xavi [1] https://review.gluster.org/c/glusterfs/+/14736 _______________________________________________
-- Amar Tumballi (amarts) _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel
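To make the contention-notification/eager-locking interplay concrete, here is a rough sketch in plain C (all names here are hypothetical and heavily simplified; the real logic lives in the locks xlator and in the AFR/EC eager-lock code, and would of course be protected by the translator's own mutexes):

#include <stdbool.h>

struct eager_lock {
    bool held;       /* lock currently granted to this client */
    bool in_use;     /* a fop is running under the lock right now */
    bool contended;  /* a contention notification has arrived */
};

/* called when a fop finishes: keep the lock for the next fop unless
 * another client is known to be waiting for it */
static void eager_lock_fop_done(struct eager_lock *l)
{
    l->in_use = false;
    if (l->contended) {
        l->held = false;      /* unlock on the bricks right away */
        l->contended = false;
    }
    /* otherwise keep the lock: the next fop skips a lock round-trip */
}

/* called when an upcall contention notification arrives from a brick */
static void eager_lock_contention(struct eager_lock *l)
{
    if (l->in_use)
        l->contended = true;  /* release as soon as the current fop ends */
    else if (l->held)
        l->held = false;      /* idle: release immediately */
}

Roughly speaking, without the notification the owner only drops an idle eager lock after a timeout, which is why other clients accessing the same file can see long waits; with the notification, the lock is handed over almost immediately while the single-writer fast path stays intact.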
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hunter86_bg at yahoo.com Sat May 4 06:34:42 2019
From: hunter86_bg at yahoo.com (Strahil)
Date: Sat, 04 May 2019 06:34:42 -0000
Subject: [Gluster-devel] [Gluster-users] Proposing to bring the previous ganesha HA cluster solution back to gluster code as a gluster-7 feature
Message-ID:

Hi Jiffin,

No vendor will support your corosync/pacemaker stack if you do not have proper fencing. As Gluster is already a cluster of its own, it makes sense to control everything from there.

Best Regards,
Strahil Nikolov

On May 3, 2019 09:08, Jiffin Tony Thottan wrote:
>
> On 30/04/19 6:59 PM, Strahil Nikolov wrote:
> > Hi,
> >
> > I'm posting this again as it got bounced.
> > Keep in mind that corosync/pacemaker is hard for new admins/users to set up properly.
> >
> > I'm still trying to remediate the effects of poor configuration at work.
> > Also, storhaug is nice for hyperconverged setups where the host is not only hosting bricks, but other workloads.
> > Corosync/pacemaker require proper fencing to be set up, and most of the stonith resources 'shoot the other node in the head'.
> > I would be happy to see an easy-to-deploy solution (let's say 'cluster.enable-ha-ganesha true') with gluster bringing up the floating IPs and taking care of the NFS locks, so no disruption will be felt by the clients.
>
> It does take care of those, but certain prerequisites need to be followed; fencing won't be configured for this setup. We may think about it in the future.
>
> --
> Jiffin
>
> > Still, this will be a lot of work to achieve.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> > On Apr 30, 2019 15:19, Jim Kinney wrote:
> >> +1!
> >> I'm using nfs-ganesha in my next upgrade so my client systems can use NFS instead of fuse mounts. Having an integrated, designed-in process to coordinate multiple nodes into an HA cluster will be very welcome.
> >>
> >> On April 30, 2019 3:20:11 AM EDT, Jiffin Tony Thottan wrote:
> >>> Hi all,
> >>>
> >>> Some of you folks may be familiar with the HA solution provided for nfs-ganesha by gluster using pacemaker and corosync.
> >>>
> >>> That feature was removed in glusterfs 3.10 in favour of the common HA project "Storhaug". Even Storhaug has not progressed
> >>> much in the last two years and its development is currently halted, hence the plan to restore the old HA ganesha solution back
> >>> to the gluster code repository with some improvements, targeting the next gluster release 7.
> >>>
> >>> I have opened up an issue [1] with details and posted an initial set of patches [2]
> >>>
> >>> Please share your thoughts on the same
> >>>
> >>> Regards,
> >>>
> >>> Jiffin
> >>>
> >>> [1] https://github.com/gluster/glusterfs/issues/663
> >>>
> >>> [2] https://review.gluster.org/#/q/topic:rfc-663+(status:open+OR+status:merged)
> >>>
> >> --
> >> Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity.

From rkothiya at redhat.com Tue May 7 10:56:24 2019
From: rkothiya at redhat.com (Rinku Kothiya)
Date: Tue, 07 May 2019 10:56:24 -0000
Subject: [Gluster-devel] [Gluster-Maintainers] Release 7: Kick off!
Message-ID:

It is time to start some activities for release-7.

## Scope

It is time to collect and determine scope for the release, so as usual, please send in features/enhancements that you are working towards reaching maturity for this release to the devel list, and mark/open the github issue with the required milestone [1].

## Schedule

Currently, working backwards on the schedule, here's what we have:

- Announcement: Week of Aug 4th, 2019
- GA tagging: Aug-02-2019
- RC1: On demand before GA
- RC0: July-03-2019
- Late features cut-off: Week of June-24th, 2019
- Branching (feature cutoff date): June-17-2019 (~45 days prior to GA)
- Feature/scope proposal for the release (end date): May-22-2019

Regards
Rinku Kothiya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rkothiya at redhat.com Fri May 10 10:16:01 2019
From: rkothiya at redhat.com (rkothiya at redhat.com)
Date: Fri, 10 May 2019 10:16:01 -0000
Subject: [Gluster-devel] gluster-devel, rkothiya at redhat.com recommends that you use Google Calendar
Message-ID: <00000000000067a39d058885daf4 at google.com>

I've been using Google Calendar to organize my calendar, find interesting events, and share my schedule with friends and family members. I thought you might like to use Google Calendar, too.

rkothiya at redhat.com recommends that you use Google Calendar.
To accept this invitation and register for an account, please visit: [https://www.google.com/calendar/render?cid=cmVkaGF0LmNvbV9yM2hvb3RjcjZ0MXY0YWc2MzFvY2dzZXNoZ0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t&invEmailKey=gluster-devel at gluster.org:885485af2a01c0e13938156dbb3531c58af68e52]

Google Calendar helps you keep track of everything going on in your life and those of the important people around you, and also helps you discover interesting things to do with your time.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From snowmailer at gmail.com Mon May 13 06:47:56 2019
From: snowmailer at gmail.com (Martin Toth)
Date: Mon, 13 May 2019 06:47:56 -0000
Subject: [Gluster-devel] VMs blocked for more than 120 seconds
Message-ID:

Hi all,

I am running replica 3 on SSDs with 10G networking. Everything works OK, but VMs stored in the Gluster volume occasionally freeze with "Task XY blocked for more than 120 seconds".
The only solution is to power off (hard) the VM and then boot it up again. I am unable to SSH in and also cannot log in via console; it's stuck, probably on some disk operation. No error/warning logs or messages are stored in the VMs' logs.

KVM/Libvirt (qemu) uses libgfapi and a fuse mount to access VM disks on the replica volume. Can someone advise how to debug this problem or what can cause these issues?
It's really annoying; I've tried to google everything but nothing came up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but it's not related.

BR,
Martin

These are the volume settings:

Type: Replicate
Volume ID: b021bbb6-fa99-4cc7-88f6-49152a22cb9e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node1:/imagestore/brick1
Brick2: node2:/imagestore/brick1
Brick3: node3:/imagestore/brick1
Options Reconfigured:
performance.client-io-threads: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.min-free-disk: 10%
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
cluster.data-self-heal-algorithm: full
network.remote-dio: enable
network.ping-timeout: 30
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
client.event-threads: 4
server.event-threads: 4
storage.owner-gid: 9869
storage.owner-uid: 9869
server.allow-insecure: on
nfs.disable: on
performance.readdir-ahead: on
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2019-05-13 at 08.32.24.png
Type: image/png
Size: 144426 bytes
Desc: not available
URL: From snowmailer at gmail.com Mon May 13 07:03:45 2019
From: snowmailer at gmail.com (Martin Toth)
Date: Mon, 13 May 2019 07:03:45 -0000
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To: <20190513065548.GI25080@althea.ulrar.net>
References: <20190513065548.GI25080@althea.ulrar.net>
Message-ID:

Hi,

there is no healing operation, no peer disconnects, no readonly filesystem. Yes, storage is slow and unavailable for 120 seconds, but why? It's SSD with 10G, performance is good.

> you'd have it's log on qemu's standard output,

If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been looking into this problem for more than a month, tried everything. Can't find anything. Any more clues or leads?
BR, Martin > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >> Hi all, > > Hi > >> >> I am running replica 3 on SSDs with 10G networking, everything works OK but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked for more than 120 seconds?. >> Only solution is to poweroff (hard) VM and than boot it up again. I am unable to SSH and also login with console, its stuck probably on some disk operation. No error/warning logs or messages are store in VMs logs. >> > > As far as I know this should be unrelated, I get this during heals > without any freezes, it just means the storage is slow I think. > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on replica volume. Can someone advice how to debug this problem or what can cause these issues? >> It?s really annoying, I?ve tried to google everything but nothing came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but its not related. >> > > Any chance your gluster goes readonly ? Have you checked your gluster > logs to see if maybe they lose each other some times ? > /var/log/glusterfs > > For libgfapi accesses you'd have it's log on qemu's standard output, > that might contain the actual error at the time of the freez. > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From snowmailer at gmail.com Mon May 13 07:31:57 2019 From: snowmailer at gmail.com (Martin Toth) Date: Mon, 13 May 2019 07:31:57 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> Message-ID: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Cache in qemu is none. That should be correct. 
This is the full command:

/usr/bin/qemu-system-x86_64 -name one-312 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -drive file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=26,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on

I've highlighted the disks. The first is the VM context disk (fuse used), the second is SDA (the OS is installed here; libgfapi used), the third is SWAP (fuse used).

Krutika,
I will start profiling on the Gluster volumes and wait for the next VM to fail. Then I will attach/send the profiling info after some VM has failed. I suppose this is the correct profiling strategy.

Thanks,
BR!
Martin

> On 13 May 2019, at 09:21, Krutika Dhananjay wrote:
>
> Also, what's the caching policy that qemu is using on the affected vms?
> Is it cache=none? Or something else? You can get this information in the command line of the qemu-kvm process corresponding to your vm in the ps output.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
> What version of gluster are you using?
> Also, can you capture and share volume-profile output for a run where you manage to recreate this issue?
> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
> Let me know if you have any questions.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
> Hi,
>
> there is no healing operation, no peer disconnects, no readonly filesystem. Yes, storage is slow and unavailable for 120 seconds, but why? It's SSD with 10G, performance is good.
>
> > you'd have it's log on qemu's standard output,
>
> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been looking into this problem for more than a month, tried everything. Can't find anything. Any more clues or leads?
> > BR, > Martin > > > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: > > > > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: > >> Hi all, > > > > Hi > > > >> > >> I am running replica 3 on SSDs with 10G networking, everything works OK but VMs stored in Gluster volume occasionally freeze with ?Task XY blocked for more than 120 seconds?. > >> Only solution is to poweroff (hard) VM and than boot it up again. I am unable to SSH and also login with console, its stuck probably on some disk operation. No error/warning logs or messages are store in VMs logs. > >> > > > > As far as I know this should be unrelated, I get this during heals > > without any freezes, it just means the storage is slow I think. > > > >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on replica volume. Can someone advice how to debug this problem or what can cause these issues? > >> It?s really annoying, I?ve tried to google everything but nothing came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk drivers, but its not related. > >> > > > > Any chance your gluster goes readonly ? Have you checked your gluster > > logs to see if maybe they lose each other some times ? > > /var/log/glusterfs > > > > For libgfapi accesses you'd have it's log on qemu's standard output, > > that might contain the actual error at the time of the freez. > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrevolodin at gmail.com Mon May 13 07:34:07 2019 From: andrevolodin at gmail.com (Andrey Volodin) Date: Mon, 13 May 2019 07:34:07 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: as per https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds. , the informational warning could be suppressed with : "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" Moreover, as per their website : "*This message is not an error*. It is an indication that a program has had to wait for a very long time, and what it was doing. " More reference: https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds Regards, Andrei On Mon, May 13, 2019 at 7:32 AM Martin Toth wrote: > Cache in qemu is none. That should be correct. 
This is full command : > > /usr/bin/qemu-system-x86_64 -name one-312 -S -machine > pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp > 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 > -no-user-config -nodefaults -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait > -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime > -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device > piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 > > -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 > -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 > -drive file=/var/lib/one//datastores/116/312/*disk.0* > ,format=raw,if=none,id=drive-virtio-disk1,cache=none > -device > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 > -drive file=gluster://localhost:24007/imagestore/ > *7b64d6757acc47a39503f68731f89b8e* > ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none > -device > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 > -drive file=/var/lib/one//datastores/116/312/*disk.1* > ,format=raw,if=none,id=drive-ide0-0-0,readonly=on > -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 > > -netdev tap,fd=26,id=hostnet0 > -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 > -chardev pty,id=charserial0 -device > isa-serial,chardev=charserial0,id=serial0 > -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait > -device > virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 > -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on > > I?ve highlighted disks. First is VM context disk - Fuse used, second is > SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. > > Krutika, > I will start profiling on Gluster Volumes and wait for next VM to fail. > Than I will attach/send profiling info after some VM will be failed. I > suppose this is correct profiling strategy. > > Thanks, > BR! > Martin > > On 13 May 2019, at 09:21, Krutika Dhananjay wrote: > > Also, what's the caching policy that qemu is using on the affected vms? > Is it cache=none? Or something else? You can get this information in the > command line of qemu-kvm process corresponding to your vm in the ps output. > > -Krutika > > On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay > wrote: > >> What version of gluster are you using? >> Also, can you capture and share volume-profile output for a run where you >> manage to recreate this issue? >> >> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >> Let me know if you have any questions. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:34 PM Martin Toth >> wrote: >> >>> Hi, >>> >>> there is no healing operation, not peer disconnects, no readonly >>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>> its SSD with 10G, performance is good. >>> >>> > you'd have it's log on qemu's standard output, >>> >>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>> for problem for more than month, tried everything. Can?t find anything. Any >>> more clues or leads? 
>>> >>> BR, >>> Martin >>> >>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>> > >>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>> >> Hi all, >>> > >>> > Hi >>> > >>> >> >>> >> I am running replica 3 on SSDs with 10G networking, everything works >>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>> blocked for more than 120 seconds?. >>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>> am unable to SSH and also login with console, its stuck probably on some >>> disk operation. No error/warning logs or messages are store in VMs logs. >>> >> >>> > >>> > As far as I know this should be unrelated, I get this during heals >>> > without any freezes, it just means the storage is slow I think. >>> > >>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks on >>> replica volume. Can someone advice how to debug this problem or what can >>> cause these issues? >>> >> It?s really annoying, I?ve tried to google everything but nothing >>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>> drivers, but its not related. >>> >> >>> > >>> > Any chance your gluster goes readonly ? Have you checked your gluster >>> > logs to see if maybe they lose each other some times ? >>> > /var/log/glusterfs >>> > >>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>> > that might contain the actual error at the time of the freez. >>> > _______________________________________________ >>> > Gluster-users mailing list >>> > Gluster-users at gluster.org >>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrevolodin at gmail.com Mon May 13 07:37:27 2019 From: andrevolodin at gmail.com (Andrey Volodin) Date: Mon, 13 May 2019 07:37:27 -0000 Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds In-Reply-To: References: <20190513065548.GI25080@althea.ulrar.net> <681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com> Message-ID: what is the context from dmesg ? On Mon, May 13, 2019 at 7:33 AM Andrey Volodin wrote: > as per > https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds. , > the informational warning could be suppressed with : > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > Moreover, as per their website : "*This message is not an error*. > It is an indication that a program has had to wait for a very long time, > and what it was doing. " > More reference: > https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds > > Regards, > Andrei > > On Mon, May 13, 2019 at 7:32 AM Martin Toth wrote: > >> Cache in qemu is none. That should be correct. 
This is full command : >> >> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine >> pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp >> 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 >> -no-user-config -nodefaults -chardev >> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait >> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime >> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device >> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 >> >> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 >> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 >> -drive file=/var/lib/one//datastores/116/312/*disk.0* >> ,format=raw,if=none,id=drive-virtio-disk1,cache=none >> -device >> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 >> -drive file=gluster://localhost:24007/imagestore/ >> *7b64d6757acc47a39503f68731f89b8e* >> ,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none >> -device >> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 >> -drive file=/var/lib/one//datastores/116/312/*disk.1* >> ,format=raw,if=none,id=drive-ide0-0-0,readonly=on >> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 >> >> -netdev tap,fd=26,id=hostnet0 >> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 >> -chardev pty,id=charserial0 -device >> isa-serial,chardev=charserial0,id=serial0 >> -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait >> -device >> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 >> -vnc 0.0.0.0:312,password -device >> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device >> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on >> >> I?ve highlighted disks. First is VM context disk - Fuse used, second is >> SDA (OS is installed here) - libgfapi used, third is SWAP - Fuse used. >> >> Krutika, >> I will start profiling on Gluster Volumes and wait for next VM to fail. >> Than I will attach/send profiling info after some VM will be failed. I >> suppose this is correct profiling strategy. >> >> Thanks, >> BR! >> Martin >> >> On 13 May 2019, at 09:21, Krutika Dhananjay wrote: >> >> Also, what's the caching policy that qemu is using on the affected vms? >> Is it cache=none? Or something else? You can get this information in the >> command line of qemu-kvm process corresponding to your vm in the ps output. >> >> -Krutika >> >> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay >> wrote: >> >>> What version of gluster are you using? >>> Also, can you capture and share volume-profile output for a run where >>> you manage to recreate this issue? >>> >>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command >>> Let me know if you have any questions. >>> >>> -Krutika >>> >>> On Mon, May 13, 2019 at 12:34 PM Martin Toth >>> wrote: >>> >>>> Hi, >>>> >>>> there is no healing operation, not peer disconnects, no readonly >>>> filesystem. Yes, storage is slow and unavailable for 120 seconds, but why, >>>> its SSD with 10G, performance is good. >>>> >>>> > you'd have it's log on qemu's standard output, >>>> >>>> If you mean /var/log/libvirt/qemu/vm.log there is nothing. I am looking >>>> for problem for more than month, tried everything. Can?t find anything. 
Any >>>> more clues or leads? >>>> >>>> BR, >>>> Martin >>>> >>>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote: >>>> > >>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote: >>>> >> Hi all, >>>> > >>>> > Hi >>>> > >>>> >> >>>> >> I am running replica 3 on SSDs with 10G networking, everything works >>>> OK but VMs stored in Gluster volume occasionally freeze with ?Task XY >>>> blocked for more than 120 seconds?. >>>> >> Only solution is to poweroff (hard) VM and than boot it up again. I >>>> am unable to SSH and also login with console, its stuck probably on some >>>> disk operation. No error/warning logs or messages are store in VMs logs. >>>> >> >>>> > >>>> > As far as I know this should be unrelated, I get this during heals >>>> > without any freezes, it just means the storage is slow I think. >>>> > >>>> >> KVM/Libvirt(qemu) using libgfapi and fuse mount to access VM disks >>>> on replica volume. Can someone advice how to debug this problem or what >>>> can cause these issues? >>>> >> It?s really annoying, I?ve tried to google everything but nothing >>>> came up. I?ve tried changing virtio-scsi-pci to virtio-blk-pci disk >>>> drivers, but its not related. >>>> >> >>>> > >>>> > Any chance your gluster goes readonly ? Have you checked your gluster >>>> > logs to see if maybe they lose each other some times ? >>>> > /var/log/glusterfs >>>> > >>>> > For libgfapi accesses you'd have it's log on qemu's standard output, >>>> > that might contain the actual error at the time of the freez. >>>> > _______________________________________________ >>>> > Gluster-users mailing list >>>> > Gluster-users at gluster.org >>>> > https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.spisla at iternity.com Tue May 14 14:43:11 2019 From: david.spisla at iternity.com (David Spisla) Date: Tue, 14 May 2019 14:43:11 -0000 Subject: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) In-Reply-To: References: Message-ID: Hi Poornima, thats fine. I would suggest this dates and times: May 15th ? 17th at 12:30, 13:30, 14:30 IST (9:00, 10:00, 11:00 CEST) May 20th ? 24th at 12:30, 13:30, 14:30 IST (9:00, 10:00, 11:00 CEST) I add Volker Lendecke from Sernet to the mail. He is the Samba Expert. Can someone of you provide a host via bluejeans.com? If not, I will try it with GoToMeeting (https://www.gotomeeting.com). @all Please write your prefered dates and times. For me, all oft the above dates and times are fine Regards David David Spisla Software Engineer david.spisla at iternity.com +49 761 59034852 iTernity GmbH Heinrich-von-Stephan-Str. 21 79100 Freiburg Deutschland Website Newsletter Support Portal iTernity GmbH. Gesch?ftsf?hrer: Ralf Steinemann. ?Eingetragen beim Amtsgericht Freiburg: HRB-Nr. 701332. ?USt.Id DE242664311. [v01.023] Von: Poornima Gurusiddaiah Gesendet: Montag, 13. 
Mai 2019 07:22 An: David Spisla ; Anoop C S ; Gunther Deschner Cc: Gluster Devel ; gluster-users at gluster.org List Betreff: Re: [Gluster-devel] Improve stability between SMB/CTDB and Gluster (together with Samba Core Developer) Hi, We would be definitely interested in this. Thank you for contacting us. For the starter we can have an online conference. Please suggest few possible date and times for the week(preferably between IST 7.00AM - 9.PM)? Adding Anoop and Gunther who are also the main contributors to the Gluster-Samba integration. Thanks, Poornima On Thu, May 9, 2019 at 7:43 PM David Spisla > wrote: Dear Gluster Community, at the moment we are improving the stability of SMB/CTDB and Gluster. For this purpose we are working together with an advanced SAMBA Core Developer. He did some debugging but needs more information about Gluster Core Behaviour. Would any of the Gluster Developer wants to have a online conference with him and me? I would organize everything. In my opinion this is a good chance to improve stability of Glusterfs and this is at the moment one of the major issues in the Community. Regards David Spisla _______________________________________________ Community Meeting Calendar: APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST Bridge: https://bluejeans.com/836554017 NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT Bridge: https://bluejeans.com/486278655 Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image860747.png Type: image/png Size: 382 bytes Desc: image860747.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image735814.png Type: image/png Size: 412 bytes Desc: image735814.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image116096.png Type: image/png Size: 6545 bytes Desc: image116096.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image142576.png Type: image/png Size: 37146 bytes Desc: image142576.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image714843.png Type: image/png Size: 522 bytes Desc: image714843.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image293410.png Type: image/png Size: 591 bytes Desc: image293410.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image570372.png Type: image/png Size: 775 bytes Desc: image570372.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image031225.png Type: image/png Size: 508 bytes Desc: image031225.png URL: From rkothiya at redhat.com Wed May 15 06:56:16 2019 From: rkothiya at redhat.com (rkothiya at redhat.com) Date: Wed, 15 May 2019 06:56:16 -0000 Subject: [Gluster-devel] Invitation: End Date for Feature/Scope proposal for Release-7 @ Wed May 22, 2019 11am - 12pm (IST) (gluster-devel@gluster.org) Message-ID: <0000000000003d14950588e7a522@google.com> You have been invited to the following event. Title: End Date for Feature/Scope proposal for Release-7 This is just a reminder/notification for announcing that, 22-May-2019 is the end date for feature/scope proposal for the Release-7 When: Wed May 22, 2019 11am ? 
When: Wed May 22, 2019 11am - 12pm India Standard Time - Kolkata
Calendar: gluster-devel at gluster.org
Who:
    * rkothiya at redhat.com - creator
    * gluster-devel at gluster.org
    * maintainers at gluster.org

Event details:
https://www.google.com/calendar/event?action=VIEW&eid=M3UxZXRzMmg1OTZ1NWRyM2N1OHUxZDQxbDUgZ2x1c3Rlci1kZXZlbEBnbHVzdGVyLm9yZw&tok=NjMjcmVkaGF0LmNvbV9yM2hvb3RjcjZ0MXY0YWc2MzFvY2dzZXNoZ0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29tMzgzZGUyNzQ1ZTg2NDg2NmU0ODliMzkyMWY2OGY1YmFmOGViODNlNQ&ctz=Asia%2FKolkata&hl=en&es=0

From vladkopy at gmail.com  Fri May 17 23:18:49 2019
From: vladkopy at gmail.com (Vlad Kopylov)
Date: Fri, 17 May 2019 23:18:49 -0000
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To:
References:
Message-ID:

straight from

./autogen.sh && ./configure && make -j install

CentOS Linux release 7.6.1810 (Core)

May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory
May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr.
May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :]
May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs
May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE
May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state.
May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed.
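Two things stand out in that log; a minimal triage sketch for both follows.
Treat it as an assumption on my part, not a confirmed diagnosis of this
particular failure:

    # 1) "Error opening log file": the log directory may simply be missing
    mkdir -p /var/log/gluster-block

    # 2) "No such path /backstores/user:glfs": gluster-blockd drives LIO through
    #    tcmu-runner's user:glfs handler; if tcmu-runner is not running, or was
    #    built without glfs support, targetcli sees no such backstore
    systemctl status tcmu-runner
    targetcli ls /backstores    # 'user:glfs' should appear in this listing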
On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:
> Hello Gluster folks,
>
> Gluster-block team is happy to announce the v0.4 release [1].
>
> This is the new stable version of gluster-block; lots of new and
> exciting features and interesting bug fixes are made available as part
> of this release.
> Please find the big list of release highlights and notable fixes at [2].
>
> Details about installation can be found in the easy install guide at
> [3]. Find the details about prerequisites and the setup guide at [4].
> If you are a new user, check out the demo video attached in the README
> doc [5], which will be a good source of intro to the project.
> There are good examples about how to use gluster-block both in the man
> pages [6] and the test file [7] (also in the README).
>
> gluster-block is part of the fedora package collection; an updated package
> with release version v0.4 will be made available soon, and the
> community-provided packages will be made available soon at [8].
>
> Please spend a minute to report any kind of issue that comes to your
> notice with this handy link [9].
> We look forward to your feedback, which will help gluster-block get better!
>
> We would like to thank all our users and contributors for bug filing and
> fixes, and also the whole team who was involved in the huge effort of
> pre-release testing.
>
> [1] https://github.com/gluster/gluster-block
> [2] https://github.com/gluster/gluster-block/releases
> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> [4] https://github.com/gluster/gluster-block#usage
> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> [6] https://github.com/gluster/gluster-block/tree/master/docs
> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> [8] https://download.gluster.org/pub/gluster/gluster-block/
> [9] https://github.com/gluster/gluster-block/issues/new
>
> Cheers,
> Team Gluster-Block!
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users

From snowmailer at gmail.com  Mon May 20 10:07:58 2019
From: snowmailer at gmail.com (Martin)
Date: Mon, 20 May 2019 10:07:58 -0000
Subject: [Gluster-devel] [Gluster-users] VMs blocked for more than 120 seconds
In-Reply-To:
References: <20190513065548.GI25080@althea.ulrar.net>
	<681F0862-7C80-414D-9637-7697A8C65AFA@gmail.com>
Message-ID: <76CB580E-0F53-468F-B7F9-FE46C2971B8C@gmail.com>

Hi Krutika,

> Also, gluster version please?

I am running old 3.7.6. (Yes, I know I should upgrade asap.)

I first applied "network.remote-dio off"; the behaviour did not change, VMs
got stuck after some time again. Then I set "performance.strict-o-direct on"
and the problem completely disappeared. No more freezes at all (7 days
without any problems). This SOLVED the issue.

Can you explain what the remote-dio and strict-o-direct options changed in
the behaviour of my Gluster? It would be great for the archive and later
users to understand what solved my issue and why.

Anyway, thanks a LOT!!!

BR,
Martin

> On 13 May 2019, at 10:20, Krutika Dhananjay wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> preferably one option changed at a time, its impact tested, and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
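For the archive, applying and verifying those two options boils down to the
following (a sketch; the volume name imagestore is taken from the qemu
command quoted below, and "gluster volume get" is assumed to be available in
the installed CLI version):

    gluster volume set imagestore network.remote-dio off
    gluster volume set imagestore performance.strict-o-direct on
    # confirm the values the volume actually runs with
    gluster volume get imagestore network.remote-dio
    gluster volume get imagestore performance.strict-o-direct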
> On Mon, May 13, 2019 at 1:02 PM Martin Toth wrote:
> Cache in qemu is none. That should be correct. This is the full command:
>
> /usr/bin/qemu-system-x86_64 -name one-312 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive file=/var/lib/one//datastores/116/312/disk.0,format=raw,if=none,id=drive-virtio-disk1,cache=none
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/7b64d6757acc47a39503f68731f89b8e,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive file=/var/lib/one//datastores/116/312/disk.1,format=raw,if=none,id=drive-ide0-0-0,readonly=on
> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>
> -netdev tap,fd=26,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> I've highlighted the disks. The first is the VM context disk (FUSE used),
> the second is SDA where the OS is installed (libgfapi used), the third is
> swap (FUSE used).
>
> Krutika,
> I will start profiling on the Gluster volumes and wait for the next VM to
> fail. Then I will attach/send the profiling info after some VM has failed.
> I suppose this is the correct profiling strategy.

About this, how many vms do you need to recreate it? A single vm? Or
multiple vms doing IO in parallel?

> Thanks,
> BR!
> Martin
>
>> On 13 May 2019, at 09:21, Krutika Dhananjay wrote:
>>
>> Also, what's the caching policy that qemu is using on the affected vms?
>> Is it cache=none? Or something else? You can get this information in the
>> command line of the qemu-kvm process corresponding to your vm in the ps
>> output.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay wrote:
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where you
>> manage to recreate this issue?
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>> Let me know if you have any questions.
>>
>> -Krutika
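The volume-profile capture requested above boils down to a start/info/stop
cycle (a sketch, again assuming the affected volume is named imagestore):

    gluster volume profile imagestore start
    # ... wait for a VM to freeze, then grab the counters ...
    gluster volume profile imagestore info > profile-during-freeze.txt
    gluster volume profile imagestore stop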
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth wrote:
>> Hi,
>>
>> there is no healing operation, no peer disconnects, no read-only
>> filesystem. Yes, the storage is slow and unavailable for 120 seconds,
>> but why? It's SSD with 10G; performance is good.
>>
>> > you'd have its log on qemu's standard output,
>>
>> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
>> looking into this problem for more than a month and have tried
>> everything. I can't find anything. Any more clues or leads?
>>
>> BR,
>> Martin
>>
>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote:
>> >
>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>> >> Hi all,
>> >
>> > Hi
>> >
>> >> I am running replica 3 on SSDs with 10G networking. Everything works
>> >> OK, but VMs stored in the Gluster volume occasionally freeze with
>> >> "Task XY blocked for more than 120 seconds".
>> >> The only solution is to power off (hard) the VM and then boot it up
>> >> again. I am unable to SSH in and also to log in via the console; it is
>> >> stuck, probably on some disk operation. No error/warning logs or
>> >> messages are stored in the VM's logs.
>> >
>> > As far as I know this should be unrelated, I get this during heals
>> > without any freezes, it just means the storage is slow I think.
>> >
>> >> KVM/Libvirt (qemu) uses libgfapi and a fuse mount to access VM disks
>> >> on the replica volume. Can someone advise how to debug this problem or
>> >> what can cause these issues?
>> >> It's really annoying; I've tried to google everything but nothing came
>> >> up. I've tried changing virtio-scsi-pci to virtio-blk-pci disk
>> >> drivers, but it's not related.
>> >
>> > Any chance your gluster goes readonly? Have you checked your gluster
>> > logs to see if maybe they lose each other some times?
>> > /var/log/glusterfs
>> >
>> > For libgfapi accesses you'd have its log on qemu's standard output,
>> > that might contain the actual error at the time of the freeze.
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users

From vladkopy at gmail.com  Mon May 20 15:35:33 2019
From: vladkopy at gmail.com (Vlad Kopylov)
Date: Mon, 20 May 2019 15:35:33 -0000
Subject: [Gluster-devel] [Gluster-users] gluster-block v0.4 is alive!
In-Reply-To:
References:
Message-ID:

Thank you, Prasanna.

Do we have the architecture documented somewhere? Does it bypass FUSE and go
directly via gfapi?

v

On Mon, May 20, 2019, 8:36 AM Prasanna Kalever wrote:

> Hey Vlad,
>
> Thanks for trying gluster-block. Appreciate your feedback.
>
> Here is the patch which should fix the issue you have noticed:
> https://github.com/gluster/gluster-block/pull/233
>
> Thanks!
> --
> Prasanna
>
> On Sat, May 18, 2019 at 4:48 AM Vlad Kopylov wrote:
> >
> > straight from
> >
> > ./autogen.sh && ./configure && make -j install
> >
> > CentOS Linux release 7.6.1810 (Core)
> >
> > May 17 19:13:18 vm2 gluster-blockd[24294]: Error opening log file: No such file or directory
> > May 17 19:13:18 vm2 gluster-blockd[24294]: Logging to stderr.
> > May 17 19:13:18 vm2 gluster-blockd[24294]: [2019-05-17 23:13:18.966992] CRIT: trying to change logDir from /var/log/gluster-block to /var/log/gluster-block [at utils.c+495 :]
> > May 17 19:13:19 vm2 gluster-blockd[24294]: No such path /backstores/user:glfs
> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service: main process exited, code=exited, status=1/FAILURE
> > May 17 19:13:19 vm2 systemd[1]: Unit gluster-blockd.service entered failed state.
> > May 17 19:13:19 vm2 systemd[1]: gluster-blockd.service failed.
> >
> > On Thu, May 2, 2019 at 1:35 PM Prasanna Kalever wrote:
> >>
> >> Hello Gluster folks,
> >>
> >> Gluster-block team is happy to announce the v0.4 release [1].
> >>
> >> This is the new stable version of gluster-block; lots of new and
> >> exciting features and interesting bug fixes are made available as part
> >> of this release.
> >> Please find the big list of release highlights and notable fixes at [2].
> >>
> >> Details about installation can be found in the easy install guide at
> >> [3]. Find the details about prerequisites and the setup guide at [4].
> >> If you are a new user, check out the demo video attached in the README
> >> doc [5], which will be a good source of intro to the project.
> >> There are good examples about how to use gluster-block both in the man
> >> pages [6] and the test file [7] (also in the README).
> >>
> >> gluster-block is part of the fedora package collection; an updated package
> >> with release version v0.4 will be made available soon, and the
> >> community-provided packages will be made available soon at [8].
> >>
> >> Please spend a minute to report any kind of issue that comes to your
> >> notice with this handy link [9].
> >> We look forward to your feedback, which will help gluster-block get better!
> >>
> >> We would like to thank all our users and contributors for bug filing and
> >> fixes, and also the whole team who was involved in the huge effort of
> >> pre-release testing.
> >>
> >> [1] https://github.com/gluster/gluster-block
> >> [2] https://github.com/gluster/gluster-block/releases
> >> [3] https://github.com/gluster/gluster-block/blob/master/INSTALL
> >> [4] https://github.com/gluster/gluster-block#usage
> >> [5] https://github.com/gluster/gluster-block/blob/master/README.md
> >> [6] https://github.com/gluster/gluster-block/tree/master/docs
> >> [7] https://github.com/gluster/gluster-block/blob/master/tests/basic.t
> >> [8] https://download.gluster.org/pub/gluster/gluster-block/
> >> [9] https://github.com/gluster/gluster-block/issues/new
> >>
> >> Cheers,
> >> Team Gluster-Block!
> >> _______________________________________________
> >> Gluster-users mailing list
> >> Gluster-users at gluster.org
> >> https://lists.gluster.org/mailman/listinfo/gluster-users

From amgad.saleh at nokia.com  Fri May 24 10:38:56 2019
From: amgad.saleh at nokia.com (Saleh, Amgad (Nokia - US/Naperville))
Date: Fri, 24 May 2019 10:38:56 -0000
Subject: [Gluster-devel] Failure during "git review --setup"
In-Reply-To:
References:
Message-ID:

Never mind - it worked. The $USER when adding the gerrit remote should be my
GitHub user; the documentation was not clear!
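For the archive, a sketch of the remote setup that note implies; the exact
URL shape is an assumption based on common Gerrit layouts, not something
confirmed in this thread:

    # use your Gerrit (GitHub) username here, not the shell's $USER
    git remote add gerrit ssh://<gerrit-username>@review.gluster.org/glusterfs.git
    git review --setup    # re-fetches gerrit and installs the commit-msg hook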
From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:34 PM
To: gluster-devel at gluster.org
Subject: RE: Failure during "git review --setup"
Importance: High

Looking at the document
https://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/
I ran ./rfc.sh and got the following:

[ahsaleh at null-d4bed9857109 glusterfs]$ ./rfc.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4780  100  4780    0     0    438      0  0:00:10  0:00:10 --:--:--  1087
[bugfix-pureIPv6 d14e550] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): 677
Select yes '(y)' if this patch fixes the bug/feature completely, or is the
last of the patchset which brings feature (Y/n): y
[detached HEAD 8a5bb4a] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Successfully rebased and updated refs/heads/bugfix-pureIPv6.
./rfc.sh: line 287: clang-format: command not found
[ahsaleh at null-d4bed9857109 glusterfs]$

Is the code submitted for review? Please advise what's needed next; this is
my first time using the process.

Submit for review
To submit your change for review, run the rfc.sh script,
$ ./rfc.sh

From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:19 PM
To: gluster-devel at gluster.org
Subject: Failure during "git review --setup"

Hi:

After submitting a Pull Request at Github, I got the message about the
gerrit review (attached). I followed the steps and added a public key, but
failed at the "git review --setup" step; please see the errors below.

Your urgent support is appreciated!

Regards,
Amgad Saleh
Nokia

[ahsaleh at null-d4bed9857109 glusterfs]$ git review --setup
Problem running 'git remote update gerrit'
Fetching gerrit
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch gerrit
Problems encountered installing commit-msg hook
The following command failed with exit code 1
"scp ahsaleh at review.gluster.org:hooks/commit-msg .git/hooks/commit-msg"
-----------------------
Permission denied (publickey).
-----------------------
[ahsaleh at null-d4bed9857109 glusterfs]$

From amgad.saleh at nokia.com  Sat May 25 02:06:24 2019
From: amgad.saleh at nokia.com (Saleh, Amgad (Nokia - US/Naperville))
Date: Sat, 25 May 2019 02:06:24 -0000
Subject: [Gluster-devel] Failure during "git review --setup"
In-Reply-To:
References:
Message-ID:

Looking at the document
https://gluster.readthedocs.io/en/latest/Developer-guide/Simplified-Development-Workflow/
I ran ./rfc.sh and got the following:

[ahsaleh at null-d4bed9857109 glusterfs]$ ./rfc.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4780  100  4780    0     0    438      0  0:00:10  0:00:10 --:--:--  1087
[bugfix-pureIPv6 d14e550] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): amgads
Invalid reference ID (amgads)!!!
Commit: "Return IPv6 when exists and not -1"
Reference (Bugzilla ID or Github Issue ID): 677
Select yes '(y)' if this patch fixes the bug/feature completely, or is the
last of the patchset which brings feature (Y/n): y
[detached HEAD 8a5bb4a] Return IPv6 when exists and not -1
 1 file changed, 10 insertions(+), 5 deletions(-)
Successfully rebased and updated refs/heads/bugfix-pureIPv6.
./rfc.sh: line 287: clang-format: command not found
[ahsaleh at null-d4bed9857109 glusterfs]$

Is the code submitted for review? Please advise what's needed next; this is
my first time using the process.

Submit for review
To submit your change for review, run the rfc.sh script,
$ ./rfc.sh
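On the "clang-format: command not found" line at the end of both runs:
rfc.sh shells out to clang-format for a style check, so the message only
means the binary is missing from PATH. A possible fix, assuming an RPM-based
system (package names vary by distro):

    # clang-format ships with the Clang tooling; on Fedora, for example:
    sudo dnf install clang-tools-extra
    # then re-run the submission script
    ./rfc.sh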
From: Saleh, Amgad (Nokia - US/Naperville)
Sent: Thursday, May 23, 2019 11:19 PM
To: gluster-devel at gluster.org
Subject: Failure during "git review --setup"

Hi:

After submitting a Pull Request at Github, I got the message about the
gerrit review (attached). I followed the steps and added a public key, but
failed at the "git review --setup" step; please see the errors below.

Your urgent support is appreciated!

Regards,
Amgad Saleh
Nokia

[ahsaleh at null-d4bed9857109 glusterfs]$ git review --setup
Problem running 'git remote update gerrit'
Fetching gerrit
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch gerrit
Problems encountered installing commit-msg hook
The following command failed with exit code 1
"scp ahsaleh at review.gluster.org:hooks/commit-msg .git/hooks/commit-msg"
-----------------------
Permission denied (publickey).
-----------------------
[ahsaleh at null-d4bed9857109 glusterfs]$

From sunkumar at redhat.com  Mon May 27 12:53:17 2019
From: sunkumar at redhat.com (sunkumar at redhat.com)
Date: Mon, 27 May 2019 12:53:17 -0000
Subject: [Gluster-devel] Gluster Community Meeting (APAC friendly hours)
Message-ID: <0000000000001f378e0589de08c2@google.com>

Bridge: https://bluejeans.com/836554017
Meeting minutes: https://hackmd.io/B4vOpJumRgexzqeQiNPVOw
Flash Talk: What is Thin Arbiter? (by Ashish Pandey)
Previous Meeting notes: http://github.com/gluster/community

Title: Gluster Community Meeting (APAC friendly hours)
When: Tue May 28, 2019 11:30am - 12:30pm India Standard Time - Kolkata
Where: https://bluejeans.com/836554017
Who:
    * pgurusid at redhat.com - organizer
    * javico at paradigmadigital.com
    * spentaparthi at idirect.net
    * sstephen at redhat.com
    * brian.riddle at storagecraft.com
    * sthomas at rpstechnologysolutions.co.uk
    * kdhananj at redhat.com
    * rwareing at fb.com
    * david.spisla at iternity.com
    * khiremat at redhat.com
    * pkarampu at redhat.com
    * gluster-users at gluster.org
    * dcunningham at voisonics.com
    * m.vrgotic at activevideo.com
    * barchu02 at unm.edu
    * gluster-devel at gluster.org
    * sunkumar at redhat.com
    * jpark at dexyp.com
    * rouge2507 at gmail.com
    * dan at clough.xyz
    * Max de Graaf
    * mark.boulton at uwa.edu.au
    * hgowtham at redhat.com
    * gabriel.lindeborg at svenskaspel.se
    * maintainers at gluster.org
    * ranaraya at redhat.com
    * philip.ruenagel at gmail.com
    * spalai at redhat.com
    * m.ragusa at eurodata.de
    * pauyeung at connexity.com
    * duprel at email.sc.edu