From ykaul at redhat.com Mon Apr 1 06:19:52 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Mon, 1 Apr 2019 09:19:52 +0300 Subject: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream? In-Reply-To: References: Message-ID: On Mon, Apr 1, 2019 at 8:56 AM Vijay Bhaskar Reddy Avuthu < vavuthu at redhat.com> wrote: > > > On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul wrote: > >> >> >> On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway >> wrote: >> >>> >>> >>> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay < >>> sankarshan.mukhopadhyay at gmail.com> wrote: >>> >>>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul wrote: >>>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >> >>>> >> What I am essentially looking to understand is whether there are >>>> >> regular Glusto runs and whether the tests receive refreshes. However, >>>> >> if there is no available Glusto service running upstream - that is a >>>> >> whole new conversation. >>>> > >>>> > >>>> > I'm* still trying to get it running properly on my simple >>>> Vagrant+Ansible setup[1]. >>>> > Right now I'm installing Gluster + Glusto + creating bricks, pool and >>>> a volume in ~3m on my latop. >>>> > >>>> >>>> This is good. I think my original question was to the maintainer(s) of >>>> Glusto along with the individuals involved in the automated testing >>>> part of Gluster to understand the challenges in deploying this for the >>>> project. >>>> >>>> > Once I do get it fully working, we'll get to make it work faster, >>>> clean it up and and see how can we get code coverage. >>>> > >>>> > Unless there's an alternative to the whole framework that I'm not >>>> aware of? >>>> >>>> I haven't read anything to this effect on any list. >>>> >>>> >>> This is cool. I haven't had a chance to give it a run on my laptop, but >>> it looked good. >>> Are you running into issues with Glusto, glusterlibs, and/or >>> Glusto-tests? >>> >> >> All of the above. >> - The client consumes at times 100% CPU, not sure why. >> - There are missing deps which I'm reverse engineering from Gluster CI >> (which by itself has some strange deps - why do we need python-docx ?) >> - I'm failing with the cvt test, with >> test_shrinking_volume_when_io_in_progress with the error: >> AssertionError: IO failed on some of the clients >> >> I had hoped it could give me a bit more hint: >> - which clients? (I happen to have one, so that's easy) >> - What IO workload? >> - What error? >> >> - I hope there's a mode that does NOT perform cleanup/teardown, so it's >> easier to look at the issue at hand. >> > > python-docx needs to be installed as part of "glusto-tests dependencies". > file_dir_ops.py supports writing docx files. > Anything special about docx files that we need to test with it? Have we ever had some corruption specifically there? I'd understand (sort-of) if we were running some application on top. Anyway, this is not it - I have it installed already. > IO failed on the client: 192.168.250.10. and it trying to write deep > directories with files. > Need to comment "tearDown" section if we want leave the cluster as it is > in the failed state. > I would say that in CI, we probably want to continue, and elsewhere, we probably want to stop. 
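As an illustration only (this is not the actual glusto-tests code, and GLUSTO_SKIP_CLEANUP is a hypothetical variable name), a teardown gated on an environment variable could look roughly like this:

    import os
    import unittest

    class VolumeShrinkTest(unittest.TestCase):
        """Sketch: optionally leave the cluster intact after a failure."""

        def test_shrinking_volume_when_io_in_progress(self):
            # ... shrink the volume while IO runs, assert on IO status ...
            pass

        def tearDown(self):
            # Skip cleanup when the operator wants to inspect the failed
            # state; CI leaves the variable unset and always cleans up.
            if os.environ.get('GLUSTO_SKIP_CLEANUP') == '1':
                return
            # ... unmount clients, stop and delete the volume (omitted) ...

    if __name__ == '__main__':
        unittest.main()

With something along those lines, a plain run in CI would still tear everything down, while GLUSTO_SKIP_CLEANUP=1 on a developer box would leave the bricks, volume and mounts in place for inspection.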
> > >> > - From glustomain.log, I can see: >> 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on >> 192.168.250.10:/mnt/testvol_distributed-replicated_cifs >> 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE ( >> root at 192.168.250.10): 1ESC[0m >> 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT ( >> root at 192.168.250.10)... >> Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019 >> Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' : >> Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0' : Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid >> argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid >> argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir1' : Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid >> argument >> >> I'm right now assuming something's wrong on my setup. Unclear what, yet. >> > > + vivek ; for the inputs regarding cifs issue > > I had a conversation with vivek regarding the "Invalid argument" long time > back. > > >> >>> I was using the glusto-tests container to run tests locally and for BVT >>> in the lab. >>> I was running against lab VMs, so looking forward to giving the vagrant >>> piece a go. >>> >>> By upstream service are we talking about the Jenkins in the CentOS >>> environment, etc? >>> >> >> Yes. >> Y. >> >> @Vijay Bhaskar Reddy Avuthu @Akarsha Rai >>> any insight? >>> >>> Cheers, >>> Jonathan >>> >>> > Surely for most of the positive paths, we can (and perhaps should) use >>>> the the Gluster Ansible modules. >>>> > Y. >>>> > >>>> > [1] https://github.com/mykaul/vg >>>> > * with an intern's help. >>>> _______________________________________________ >>>> automated-testing mailing list >>>> automated-testing at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>> >>> _______________________________________________ >>> automated-testing mailing list >>> automated-testing at gluster.org >>> https://lists.gluster.org/mailman/listinfo/automated-testing >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From ykaul at redhat.com Mon Apr 1 08:04:54 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Mon, 1 Apr 2019 11:04:54 +0300 Subject: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream? In-Reply-To: References: Message-ID: On Mon, Apr 1, 2019 at 9:20 AM Vivek Das wrote: > Comments inline > > On Mon, Apr 1, 2019 at 11:26 AM Vijay Bhaskar Reddy Avuthu < > vavuthu at redhat.com> wrote: > >> >> >> On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul wrote: >> >>> >>> >>> On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway >>> wrote: >>> >>>> >>>> >>>> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >>>>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul wrote: >>>>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay < >>>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>>> >> >>>>> >> What I am essentially looking to understand is whether there are >>>>> >> regular Glusto runs and whether the tests receive refreshes. >>>>> However, >>>>> >> if there is no available Glusto service running upstream - that is a >>>>> >> whole new conversation. 
>>>>> > >>>>> > >>>>> > I'm* still trying to get it running properly on my simple >>>>> Vagrant+Ansible setup[1]. >>>>> > Right now I'm installing Gluster + Glusto + creating bricks, pool >>>>> and a volume in ~3m on my latop. >>>>> > >>>>> >>>>> This is good. I think my original question was to the maintainer(s) of >>>>> Glusto along with the individuals involved in the automated testing >>>>> part of Gluster to understand the challenges in deploying this for the >>>>> project. >>>>> >>>>> > Once I do get it fully working, we'll get to make it work faster, >>>>> clean it up and and see how can we get code coverage. >>>>> > >>>>> > Unless there's an alternative to the whole framework that I'm not >>>>> aware of? >>>>> >>>>> I haven't read anything to this effect on any list. >>>>> >>>>> >>>> This is cool. I haven't had a chance to give it a run on my laptop, but >>>> it looked good. >>>> Are you running into issues with Glusto, glusterlibs, and/or >>>> Glusto-tests? >>>> >>> >>> All of the above. >>> - The client consumes at times 100% CPU, not sure why. >>> - There are missing deps which I'm reverse engineering from Gluster CI >>> (which by itself has some strange deps - why do we need python-docx ?) >>> - I'm failing with the cvt test, with >>> test_shrinking_volume_when_io_in_progress with the error: >>> AssertionError: IO failed on some of the clients >>> >>> I had hoped it could give me a bit more hint: >>> - which clients? (I happen to have one, so that's easy) >>> - What IO workload? >>> - What error? >>> >>> - I hope there's a mode that does NOT perform cleanup/teardown, so it's >>> easier to look at the issue at hand. >>> >> >> python-docx needs to be installed as part of "glusto-tests dependencies". >> file_dir_ops.py supports writing docx files. >> IO failed on the client: 192.168.250.10. and it trying to write deep >> directories with files. >> Need to comment "tearDown" section if we want leave the cluster as it is >> in the failed state. >> >> >>> >> - From glustomain.log, I can see: >>> 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on >>> 192.168.250.10:/mnt/testvol_distributed-replicated_cifs >>> 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE ( >>> root at 192.168.250.10): 1ESC[0m >>> 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT ( >>> root at 192.168.250.10)... >>> Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019 >>> Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' : >>> Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0' : Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid >>> argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid >>> argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1' : Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid >>> argument >>> >>> I'm right now assuming something's wrong on my setup. Unclear what, yet. >>> >> >> + vivek ; for the inputs regarding cifs issue >> > > A subsequent bug was raised for the above issue. > https://bugzilla.redhat.com/show_bug.cgi?id=1664618 > - The bug is lacking severity classification. - The bug is lacking AutomationBlocker keyword. - Is it a regression? - Why hasn't anyone from engineering looked at it? - How does it pass in CI? Y. 
> > > >> >> I had a conversation with vivek regarding the "Invalid argument" long >> time back. >> >> >>> >>>> I was using the glusto-tests container to run tests locally and for BVT >>>> in the lab. >>>> I was running against lab VMs, so looking forward to giving the vagrant >>>> piece a go. >>>> >>>> By upstream service are we talking about the Jenkins in the CentOS >>>> environment, etc? >>>> >>> >>> Yes. >>> Y. >>> >>> @Vijay Bhaskar Reddy Avuthu @Akarsha Rai >>>> any insight? >>>> >>>> Cheers, >>>> Jonathan >>>> >>>> > Surely for most of the positive paths, we can (and perhaps should) >>>>> use the the Gluster Ansible modules. >>>>> > Y. >>>>> > >>>>> > [1] https://github.com/mykaul/vg >>>>> > * with an intern's help. >>>>> _______________________________________________ >>>>> automated-testing mailing list >>>>> automated-testing at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>>> >>>> _______________________________________________ >>>> automated-testing mailing list >>>> automated-testing at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vavuthu at redhat.com Mon Apr 1 05:55:49 2019 From: vavuthu at redhat.com (Vijay Bhaskar Reddy Avuthu) Date: Mon, 1 Apr 2019 11:25:49 +0530 Subject: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream? In-Reply-To: References: Message-ID: On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul wrote: > > > On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway > wrote: > >> >> >> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay < >> sankarshan.mukhopadhyay at gmail.com> wrote: >> >>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul wrote: >>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay < >>> sankarshan.mukhopadhyay at gmail.com> wrote: >>> >> >>> >> What I am essentially looking to understand is whether there are >>> >> regular Glusto runs and whether the tests receive refreshes. However, >>> >> if there is no available Glusto service running upstream - that is a >>> >> whole new conversation. >>> > >>> > >>> > I'm* still trying to get it running properly on my simple >>> Vagrant+Ansible setup[1]. >>> > Right now I'm installing Gluster + Glusto + creating bricks, pool and >>> a volume in ~3m on my latop. >>> > >>> >>> This is good. I think my original question was to the maintainer(s) of >>> Glusto along with the individuals involved in the automated testing >>> part of Gluster to understand the challenges in deploying this for the >>> project. >>> >>> > Once I do get it fully working, we'll get to make it work faster, >>> clean it up and and see how can we get code coverage. >>> > >>> > Unless there's an alternative to the whole framework that I'm not >>> aware of? >>> >>> I haven't read anything to this effect on any list. >>> >>> >> This is cool. I haven't had a chance to give it a run on my laptop, but >> it looked good. >> Are you running into issues with Glusto, glusterlibs, and/or Glusto-tests? >> > > All of the above. > - The client consumes at times 100% CPU, not sure why. > - There are missing deps which I'm reverse engineering from Gluster CI > (which by itself has some strange deps - why do we need python-docx ?) > - I'm failing with the cvt test, with > test_shrinking_volume_when_io_in_progress with the error: > AssertionError: IO failed on some of the clients > > I had hoped it could give me a bit more hint: > - which clients? 
(I happen to have one, so that's easy) > - What IO workload? > - What error? > > - I hope there's a mode that does NOT perform cleanup/teardown, so it's > easier to look at the issue at hand. > python-docx needs to be installed as part of "glusto-tests dependencies". file_dir_ops.py supports writing docx files. IO failed on the client: 192.168.250.10. and it trying to write deep directories with files. Need to comment "tearDown" section if we want leave the cluster as it is in the failed state. > - From glustomain.log, I can see: > 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on > 192.168.250.10:/mnt/testvol_distributed-replicated_cifs > 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE ( > root at 192.168.250.10): 1ESC[0m > 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT ( > root at 192.168.250.10)... > Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019 > Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' : > Invalid argument > Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6/dir0' > : Invalid argument > Unable to create dir > '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid > argument > Unable to create dir > '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid > argument > Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6/dir1' > : Invalid argument > Unable to create dir > '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid > argument > > I'm right now assuming something's wrong on my setup. Unclear what, yet. > + vivek ; for the inputs regarding cifs issue I had a conversation with vivek regarding the "Invalid argument" long time back. > >> I was using the glusto-tests container to run tests locally and for BVT >> in the lab. >> I was running against lab VMs, so looking forward to giving the vagrant >> piece a go. >> >> By upstream service are we talking about the Jenkins in the CentOS >> environment, etc? >> > > Yes. > Y. > > @Vijay Bhaskar Reddy Avuthu @Akarsha Rai >> any insight? >> >> Cheers, >> Jonathan >> >> > Surely for most of the positive paths, we can (and perhaps should) use >>> the the Gluster Ansible modules. >>> > Y. >>> > >>> > [1] https://github.com/mykaul/vg >>> > * with an intern's help. >>> _______________________________________________ >>> automated-testing mailing list >>> automated-testing at gluster.org >>> https://lists.gluster.org/mailman/listinfo/automated-testing >>> >> _______________________________________________ >> automated-testing mailing list >> automated-testing at gluster.org >> https://lists.gluster.org/mailman/listinfo/automated-testing >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vdas at redhat.com Mon Apr 1 06:20:10 2019 From: vdas at redhat.com (Vivek Das) Date: Mon, 1 Apr 2019 11:50:10 +0530 Subject: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream? 
In-Reply-To: References: Message-ID: Comments inline On Mon, Apr 1, 2019 at 11:26 AM Vijay Bhaskar Reddy Avuthu < vavuthu at redhat.com> wrote: > > > On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul wrote: > >> >> >> On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway >> wrote: >> >>> >>> >>> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay < >>> sankarshan.mukhopadhyay at gmail.com> wrote: >>> >>>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul wrote: >>>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >> >>>> >> What I am essentially looking to understand is whether there are >>>> >> regular Glusto runs and whether the tests receive refreshes. However, >>>> >> if there is no available Glusto service running upstream - that is a >>>> >> whole new conversation. >>>> > >>>> > >>>> > I'm* still trying to get it running properly on my simple >>>> Vagrant+Ansible setup[1]. >>>> > Right now I'm installing Gluster + Glusto + creating bricks, pool and >>>> a volume in ~3m on my latop. >>>> > >>>> >>>> This is good. I think my original question was to the maintainer(s) of >>>> Glusto along with the individuals involved in the automated testing >>>> part of Gluster to understand the challenges in deploying this for the >>>> project. >>>> >>>> > Once I do get it fully working, we'll get to make it work faster, >>>> clean it up and and see how can we get code coverage. >>>> > >>>> > Unless there's an alternative to the whole framework that I'm not >>>> aware of? >>>> >>>> I haven't read anything to this effect on any list. >>>> >>>> >>> This is cool. I haven't had a chance to give it a run on my laptop, but >>> it looked good. >>> Are you running into issues with Glusto, glusterlibs, and/or >>> Glusto-tests? >>> >> >> All of the above. >> - The client consumes at times 100% CPU, not sure why. >> - There are missing deps which I'm reverse engineering from Gluster CI >> (which by itself has some strange deps - why do we need python-docx ?) >> - I'm failing with the cvt test, with >> test_shrinking_volume_when_io_in_progress with the error: >> AssertionError: IO failed on some of the clients >> >> I had hoped it could give me a bit more hint: >> - which clients? (I happen to have one, so that's easy) >> - What IO workload? >> - What error? >> >> - I hope there's a mode that does NOT perform cleanup/teardown, so it's >> easier to look at the issue at hand. >> > > python-docx needs to be installed as part of "glusto-tests dependencies". > file_dir_ops.py supports writing docx files. > IO failed on the client: 192.168.250.10. and it trying to write deep > directories with files. > Need to comment "tearDown" section if we want leave the cluster as it is > in the failed state. > > >> > - From glustomain.log, I can see: >> 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on >> 192.168.250.10:/mnt/testvol_distributed-replicated_cifs >> 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE ( >> root at 192.168.250.10): 1ESC[0m >> 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT ( >> root at 192.168.250.10)... 
>> Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019 >> Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' : >> Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0' : Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid >> argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid >> argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir1' : Invalid argument >> Unable to create dir >> '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid >> argument >> >> I'm right now assuming something's wrong on my setup. Unclear what, yet. >> > > + vivek ; for the inputs regarding cifs issue > A subsequent bug was raised for the above issue. https://bugzilla.redhat.com/show_bug.cgi?id=1664618 > > I had a conversation with vivek regarding the "Invalid argument" long time > back. > > >> >>> I was using the glusto-tests container to run tests locally and for BVT >>> in the lab. >>> I was running against lab VMs, so looking forward to giving the vagrant >>> piece a go. >>> >>> By upstream service are we talking about the Jenkins in the CentOS >>> environment, etc? >>> >> >> Yes. >> Y. >> >> @Vijay Bhaskar Reddy Avuthu @Akarsha Rai >>> any insight? >>> >>> Cheers, >>> Jonathan >>> >>> > Surely for most of the positive paths, we can (and perhaps should) use >>>> the the Gluster Ansible modules. >>>> > Y. >>>> > >>>> > [1] https://github.com/mykaul/vg >>>> > * with an intern's help. >>>> _______________________________________________ >>>> automated-testing mailing list >>>> automated-testing at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>> >>> _______________________________________________ >>> automated-testing mailing list >>> automated-testing at gluster.org >>> https://lists.gluster.org/mailman/listinfo/automated-testing >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From vavuthu at redhat.com Mon Apr 1 06:38:47 2019 From: vavuthu at redhat.com (Vijay Bhaskar Reddy Avuthu) Date: Mon, 1 Apr 2019 12:08:47 +0530 Subject: [Gluster-infra] [automated-testing] What is the current state of the Glusto test framework in upstream? In-Reply-To: References: Message-ID: I an not aware of any corruption issues with docx and nowhere in test cases docx is used. I think it was there as an option to use. The only place where it used is "file_dir_ops.py". Regards, Vijay A On Mon, Apr 1, 2019 at 11:50 AM Yaniv Kaul wrote: > > > On Mon, Apr 1, 2019 at 8:56 AM Vijay Bhaskar Reddy Avuthu < > vavuthu at redhat.com> wrote: > >> >> >> On Sun, Mar 31, 2019 at 6:30 PM Yaniv Kaul wrote: >> >>> >>> >>> On Wed, Mar 13, 2019 at 4:14 PM Jonathan Holloway >>> wrote: >>> >>>> >>>> >>>> On Wed, Mar 13, 2019 at 5:08 AM Sankarshan Mukhopadhyay < >>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>> >>>>> On Wed, Mar 13, 2019 at 3:03 PM Yaniv Kaul wrote: >>>>> > On Wed, Mar 13, 2019, 3:53 AM Sankarshan Mukhopadhyay < >>>>> sankarshan.mukhopadhyay at gmail.com> wrote: >>>>> >> >>>>> >> What I am essentially looking to understand is whether there are >>>>> >> regular Glusto runs and whether the tests receive refreshes. >>>>> However, >>>>> >> if there is no available Glusto service running upstream - that is a >>>>> >> whole new conversation. >>>>> > >>>>> > >>>>> > I'm* still trying to get it running properly on my simple >>>>> Vagrant+Ansible setup[1]. 
>>>>> > Right now I'm installing Gluster + Glusto + creating bricks, pool >>>>> and a volume in ~3m on my latop. >>>>> > >>>>> >>>>> This is good. I think my original question was to the maintainer(s) of >>>>> Glusto along with the individuals involved in the automated testing >>>>> part of Gluster to understand the challenges in deploying this for the >>>>> project. >>>>> >>>>> > Once I do get it fully working, we'll get to make it work faster, >>>>> clean it up and and see how can we get code coverage. >>>>> > >>>>> > Unless there's an alternative to the whole framework that I'm not >>>>> aware of? >>>>> >>>>> I haven't read anything to this effect on any list. >>>>> >>>>> >>>> This is cool. I haven't had a chance to give it a run on my laptop, but >>>> it looked good. >>>> Are you running into issues with Glusto, glusterlibs, and/or >>>> Glusto-tests? >>>> >>> >>> All of the above. >>> - The client consumes at times 100% CPU, not sure why. >>> - There are missing deps which I'm reverse engineering from Gluster CI >>> (which by itself has some strange deps - why do we need python-docx ?) >>> - I'm failing with the cvt test, with >>> test_shrinking_volume_when_io_in_progress with the error: >>> AssertionError: IO failed on some of the clients >>> >>> I had hoped it could give me a bit more hint: >>> - which clients? (I happen to have one, so that's easy) >>> - What IO workload? >>> - What error? >>> >>> - I hope there's a mode that does NOT perform cleanup/teardown, so it's >>> easier to look at the issue at hand. >>> >> >> python-docx needs to be installed as part of "glusto-tests dependencies". >> file_dir_ops.py supports writing docx files. >> > > Anything special about docx files that we need to test with it? Have we > ever had some corruption specifically there? I'd understand (sort-of) if we > were running some application on top. > Anyway, this is not it - I have it installed already. > >> IO failed on the client: 192.168.250.10. and it trying to write deep >> directories with files. >> Need to comment "tearDown" section if we want leave the cluster as it is >> in the failed state. >> > > I would say that in CI, we probably want to continue, and elsewhere, we > probably want to stop. > >> >> >>> >> - From glustomain.log, I can see: >>> 2019-03-31 12:56:00,627 INFO (validate_io_procs) Validating IO on >>> 192.168.250.10:/mnt/testvol_distributed-replicated_cifs >>> 2019-03-31 12:56:00,627 INFO (_log_results) ESC[34;1mRETCODE ( >>> root at 192.168.250.10): 1ESC[0m >>> 2019-03-31 12:56:00,628 INFO (_log_results) ESC[47;30;1mSTDOUT ( >>> root at 192.168.250.10)... >>> Starting File/Dir Ops: 12:55:27:PM:Mar_31_2019 >>> Unable to create dir '/mnt/testvol_distributed-replicated_cifs/user6' : >>> Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0' : Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir0' : Invalid >>> argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir0/dir1' : Invalid >>> argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1' : Invalid argument >>> Unable to create dir >>> '/mnt/testvol_distributed-replicated_cifs/user6/dir1/dir0' : Invalid >>> argument >>> >>> I'm right now assuming something's wrong on my setup. Unclear what, yet. >>> >> >> + vivek ; for the inputs regarding cifs issue >> >> I had a conversation with vivek regarding the "Invalid argument" long >> time back. 
>> >> >>> >>>> I was using the glusto-tests container to run tests locally and for BVT >>>> in the lab. >>>> I was running against lab VMs, so looking forward to giving the vagrant >>>> piece a go. >>>> >>>> By upstream service are we talking about the Jenkins in the CentOS >>>> environment, etc? >>>> >>> >>> Yes. >>> Y. >>> >>> @Vijay Bhaskar Reddy Avuthu @Akarsha Rai >>>> any insight? >>>> >>>> Cheers, >>>> Jonathan >>>> >>>> > Surely for most of the positive paths, we can (and perhaps should) >>>>> use the the Gluster Ansible modules. >>>>> > Y. >>>>> > >>>>> > [1] https://github.com/mykaul/vg >>>>> > * with an intern's help. >>>>> _______________________________________________ >>>>> automated-testing mailing list >>>>> automated-testing at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>>> >>>> _______________________________________________ >>>> automated-testing mailing list >>>> automated-testing at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/automated-testing >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed Apr 3 05:16:51 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 3 Apr 2019 10:46:51 +0530 Subject: [Gluster-infra] is_nfs_export_available from nfs.rc failing too often? Message-ID: I'm observing the above test function failing too often because of which arbiter-mount.t test fails in many regression jobs. Such frequency of failures wasn't there earlier. Does anyone know what has changed recently to cause these failures in regression? I also hear when such failure happens a reboot is required, is that true and if so why? One of the reference : https://build.gluster.org/job/centos7-regression/5340/consoleFull -------------- next part -------------- An HTML attachment was scrubbed... URL: From jthottan at redhat.com Wed Apr 3 06:26:20 2019 From: jthottan at redhat.com (Jiffin Thottan) Date: Wed, 3 Apr 2019 02:26:20 -0400 (EDT) Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: Message-ID: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> Hi, is_nfs_export_available is just a wrapper around "showmount" command AFAIR. I saw following messages in console output. mount.nfs: rpc.statd is not running but is required for remote locking. 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start statd. 05:06:55 mount.nfs: an incorrect mount option was specified For me it looks rpcbind may not be running on the machine. Usually rpcbind starts automatically on machines, don't know whether it can happen or not. Regards, Jiffin ----- Original Message ----- From: "Atin Mukherjee" To: "gluster-infra" , "Gluster Devel" Sent: Wednesday, April 3, 2019 10:46:51 AM Subject: [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? I'm observing the above test function failing too often because of which arbiter-mount.t test fails in many regression jobs. Such frequency of failures wasn't there earlier. Does anyone know what has changed recently to cause these failures in regression? I also hear when such failure happens a reboot is required, is that true and if so why? 
One of the reference : https://build.gluster.org/job/centos7-regression/5340/consoleFull _______________________________________________ Gluster-devel mailing list Gluster-devel at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel From bugzilla at redhat.com Wed Apr 3 08:15:37 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 03 Apr 2019 08:15:37 +0000 Subject: [Gluster-infra] [Bug 1695484] New: smoke fails with "Build root is locked by another process" Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1695484 Bug ID: 1695484 Summary: smoke fails with "Build root is locked by another process" Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: pkarampu at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: Please check https://build.gluster.org/job/devrpm-fedora/15405/console for more details. Smoke is failing with the reason mentioned in the subject. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 3 08:35:11 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 03 Apr 2019 08:35:11 +0000 Subject: [Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process" In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1695484 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com --- Comment #1 from Deepshikha khandelwal --- It happens mainly because your previously running build was aborted by a new patchset and hence no cleanup. Re-triggering might help. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 3 10:29:53 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 03 Apr 2019 10:29:53 +0000 Subject: [Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process" In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1695484 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #2 from M. Scherer --- Mhh, then shouldn't we clean up when there is something that do stop the build ? -- You are receiving this mail because: You are on the CC list for the bug. From amukherj at redhat.com Wed Apr 3 11:00:42 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 3 Apr 2019 16:30:42 +0530 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> Message-ID: On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan wrote: > Hi, > > is_nfs_export_available is just a wrapper around "showmount" command AFAIR. > I saw following messages in console output. > mount.nfs: rpc.statd is not running but is required for remote locking. > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or start > statd. 
> 05:06:55 mount.nfs: an incorrect mount option was specified > > For me it looks rpcbind may not be running on the machine. > Usually rpcbind starts automatically on machines, don't know whether it > can happen or not. > That's precisely what the question is. Why suddenly we're seeing this happening too frequently. Today I saw atleast 4 to 5 such failures already. Deepshika - Can you please help in inspecting this? > Regards, > Jiffin > > > ----- Original Message ----- > From: "Atin Mukherjee" > To: "gluster-infra" , "Gluster Devel" < > gluster-devel at gluster.org> > Sent: Wednesday, April 3, 2019 10:46:51 AM > Subject: [Gluster-devel] is_nfs_export_available from nfs.rc failing too > often? > > I'm observing the above test function failing too often because of which > arbiter-mount.t test fails in many regression jobs. Such frequency of > failures wasn't there earlier. Does anyone know what has changed recently > to cause these failures in regression? I also hear when such failure > happens a reboot is required, is that true and if so why? > > One of the reference : > https://build.gluster.org/job/centos7-regression/5340/consoleFull > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Wed Apr 3 11:52:50 2019 From: mscherer at redhat.com (Michael Scherer) Date: Wed, 03 Apr 2019 13:52:50 +0200 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> Message-ID: <46932285269538f29a3bdd0ccb177bfce091bf85.camel@redhat.com> Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan > wrote: > > > Hi, > > > > is_nfs_export_available is just a wrapper around "showmount" > > command AFAIR. > > I saw following messages in console output. > > mount.nfs: rpc.statd is not running but is required for remote > > locking. > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or > > start > > statd. > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > For me it looks rpcbind may not be running on the machine. > > Usually rpcbind starts automatically on machines, don't know > > whether it > > can happen or not. > > > > That's precisely what the question is. Why suddenly we're seeing this > happening too frequently. Today I saw atleast 4 to 5 such failures > already. > > Deepshika - Can you please help in inspecting this? So in the past, this kind of stuff did happen with ipv6, so this could be a change on AWS and/or a upgrade. We are currently investigating a set of failure that happen after reboot (resulting in partial network bring up, causing all kind of weird issue), but it take some time to verify it, and since we lost 33% of the team with Nigel departure, stuff do not move as fast as before. -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From ykaul at redhat.com Wed Apr 3 12:12:16 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Wed, 3 Apr 2019 15:12:16 +0300 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: <46932285269538f29a3bdd0ccb177bfce091bf85.camel@redhat.com> References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <46932285269538f29a3bdd0ccb177bfce091bf85.camel@redhat.com> Message-ID: On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer wrote: > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan > > wrote: > > > > > Hi, > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > command AFAIR. > > > I saw following messages in console output. > > > mount.nfs: rpc.statd is not running but is required for remote > > > locking. > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or > > > start > > > statd. > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > For me it looks rpcbind may not be running on the machine. > > > Usually rpcbind starts automatically on machines, don't know > > > whether it > > > can happen or not. > > > > > > > That's precisely what the question is. Why suddenly we're seeing this > > happening too frequently. Today I saw atleast 4 to 5 such failures > > already. > > > > Deepshika - Can you please help in inspecting this? > > So in the past, this kind of stuff did happen with ipv6, so this could > be a change on AWS and/or a upgrade. > We need to enable IPv6, for two reasons: 1. IPv6 is common these days, even if we don't test with it, it should be there. 2. We should test with IPv6... I'm not sure, but I suspect we do disable IPv6 here and there. Example[1]. Y. [1] https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml > > We are currently investigating a set of failure that happen after > reboot (resulting in partial network bring up, causing all kind of > weird issue), but it take some time to verify it, and since we lost 33% > of the team with Nigel departure, stuff do not move as fast as before. > > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Wed Apr 3 12:19:15 2019 From: mscherer at redhat.com (Michael Scherer) Date: Wed, 03 Apr 2019 14:19:15 +0200 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <46932285269538f29a3bdd0ccb177bfce091bf85.camel@redhat.com> Message-ID: <1658d7c7b3170ad7abe6afbcdf769775e9274da3.camel@redhat.com> Le mercredi 03 avril 2019 ? 15:12 +0300, Yaniv Kaul a ?crit : > On Wed, Apr 3, 2019 at 2:53 PM Michael Scherer > wrote: > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > jthottan at redhat.com> > > > wrote: > > > > > > > Hi, > > > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > > command AFAIR. > > > > I saw following messages in console output. 
> > > > mount.nfs: rpc.statd is not running but is required for remote > > > > locking. > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, > > > > or > > > > start > > > > statd. > > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > > > For me it looks rpcbind may not be running on the machine. > > > > Usually rpcbind starts automatically on machines, don't know > > > > whether it > > > > can happen or not. > > > > > > > > > > That's precisely what the question is. Why suddenly we're seeing > > > this > > > happening too frequently. Today I saw atleast 4 to 5 such > > > failures > > > already. > > > > > > Deepshika - Can you please help in inspecting this? > > > > So in the past, this kind of stuff did happen with ipv6, so this > > could > > be a change on AWS and/or a upgrade. > > > > We need to enable IPv6, for two reasons: > 1. IPv6 is common these days, even if we don't test with it, it > should be > there. > 2. We should test with IPv6... > > I'm not sure, but I suspect we do disable IPv6 here and there. > Example[1]. > Y. > > [1] > https://github.com/gluster/centosci/blob/master/jobs/scripts/glusto/setup-glusto.yml We do disable ipv6 for sure, Nigel spent 3 days just on that for the AWS migration, and we do have a dedicated playbook applied on all builders that try to disable everything in every possible way: https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/jenkins_builder/tasks/disable_ipv6_linux.yml According to the comment, that's from 2016, and I am sure this go further in the past since it wasn't just documented before. -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From mscherer at redhat.com Wed Apr 3 13:56:36 2019 From: mscherer at redhat.com (Michael Scherer) Date: Wed, 03 Apr 2019 15:56:36 +0200 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> Message-ID: <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan > wrote: > > > Hi, > > > > is_nfs_export_available is just a wrapper around "showmount" > > command AFAIR. > > I saw following messages in console output. > > mount.nfs: rpc.statd is not running but is required for remote > > locking. > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or > > start > > statd. > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > For me it looks rpcbind may not be running on the machine. > > Usually rpcbind starts automatically on machines, don't know > > whether it > > can happen or not. > > > > That's precisely what the question is. Why suddenly we're seeing this > happening too frequently. Today I saw atleast 4 to 5 such failures > already. > > Deepshika - Can you please help in inspecting this? So we think (we are not sure) that the issue is a bit complex. What we were investigating was nightly run fail on aws. When the build crash, the builder is restarted, since that's the easiest way to clean everything (since even with a perfect test suite that would clean itself, we could always end in a corrupt state on the system, WRT mount, fs, etc). 
In turn, this seems to cause trouble on aws, since cloud-init or something rename eth0 interface to ens5, without cleaning to the network configuration. So the network init script fail (because the image say "start eth0" and that's not present), but fail in a weird way. Network is initialised and working (we can connect), but the dhclient process is not in the right cgroup, and network.service is in failed state. Restarting network didn't work. In turn, this mean that rpc-statd refuse to start (due to systemd dependencies), which seems to impact various NFS tests. We have also seen that on some builders, rpcbind pick some IP v6 autoconfiguration, but we can't reproduce that, and there is no ip v6 set up anywhere. I suspect the network.service failure is somehow involved, but fail to see how. In turn, rpcbind.socket not starting could cause NFS test troubles. Our current stop gap fix was to fix all the builders one by one. Remove the config, kill the rogue dhclient, restart network service. However, we can't be sure this is going to fix the problem long term since this only manifest after a crash of the test suite, and it doesn't happen so often. (plus, it was working before some day in the past, when something did make this fail, and I do not know if that's a system upgrade, or a test change, or both). So we are still looking at it to have a complete understanding of the issue, but so far, we hacked our way to make it work (or so do I think). Deepshika is working to fix it long term, by fixing the issue regarding eth0/ens5 with a new base image. -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From bugzilla at redhat.com Wed Apr 3 15:09:52 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 03 Apr 2019 15:09:52 +0000 Subject: [Gluster-infra] [Bug 1695484] smoke fails with "Build root is locked by another process" In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1695484 --- Comment #3 from M. Scherer --- So indeed, https://build.gluster.org/job/devrpm-fedora/15404/ aborted the patch test, then https://build.gluster.org/job/devrpm-fedora/15405/ failed. but the next run worked. Maybe the problem is that it take more than 30 seconds to clean the build or something similar. Maybe we need to add some more time, but I can't seems to find a log to evaluate how long it does take when things are cancelled. Let's keep stuff opened if the issue arise again to collect the log, and see if there is a pattern. -- You are receiving this mail because: You are on the CC list for the bug. From amukherj at redhat.com Thu Apr 4 10:43:43 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 4 Apr 2019 16:13:43 +0530 Subject: [Gluster-infra] rebal-all-nodes-migrate.t always fails now Message-ID: Based on what I have seen that any multi node test case will fail and the above one is picked first from that group and If I am correct none of the code fixes will go through the regression until this is fixed. I suspect it to be an infra issue again. If we look at https://review.gluster.org/#/c/glusterfs/+/22501/ & https://build.gluster.org/job/centos7-regression/5382/ peer handshaking is stuck as 127.1.1.1 is unable to receive a response back, did we end up having firewall and other n/w settings screwed up? The test never fails locally. 
*15:51:21* Number of Peers: 2*15:51:21* *15:51:21* Hostname: 127.1.1.2*15:51:21* Uuid: 0e689ca8-d522-4b2f-b437-9dcde3579401*15:51:21* State: Accepted peer request (Connected)*15:51:21* *15:51:21* Hostname: 127.1.1.3*15:51:21* Uuid: a83a3bfa-729f-4a1c-8f9a-ae7d04ee4544*15:51:21* State: Accepted peer request (Connected) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Thu Apr 4 11:53:21 2019 From: mscherer at redhat.com (Michael Scherer) Date: Thu, 04 Apr 2019 13:53:21 +0200 Subject: [Gluster-infra] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: Message-ID: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit : > Based on what I have seen that any multi node test case will fail and > the > above one is picked first from that group and If I am correct none of > the > code fixes will go through the regression until this is fixed. I > suspect it > to be an infra issue again. If we look at > https://review.gluster.org/#/c/glusterfs/+/22501/ & > https://build.gluster.org/job/centos7-regression/5382/ peer > handshaking is > stuck as 127.1.1.1 is unable to receive a response back, did we end > up > having firewall and other n/w settings screwed up? The test never > fails > locally. The firewall didn't change, and since the start has a line: "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost interface work. (I am not even sure that netfilter do anything meaningful on the loopback interface, but maybe I am wrong, and not keen on looking kernel code for that). Ping seems to work fine as well, so we can exclude a routing issue. Maybe we should look at the socket, does it listen to a specific address or not ? > *15:51:21* Number of Peers: 2*15:51:21* *15:51:21* Hostname: > 127.1.1.2*15:51:21* Uuid: > 0e689ca8-d522-4b2f-b437-9dcde3579401*15:51:21* State: Accepted peer > request (Connected)*15:51:21* *15:51:21* Hostname: > 127.1.1.3*15:51:21* > Uuid: a83a3bfa-729f-4a1c-8f9a-ae7d04ee4544*15:51:21* State: Accepted > peer request (Connected) > _______________________________________________ > Gluster-infra mailing list > Gluster-infra at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-infra -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From mscherer at redhat.com Thu Apr 4 13:19:25 2019 From: mscherer at redhat.com (Michael Scherer) Date: Thu, 04 Apr 2019 15:19:25 +0200 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: Le jeudi 04 avril 2019 ? 13:53 +0200, Michael Scherer a ?crit : > Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit : > > Based on what I have seen that any multi node test case will fail > > and > > the > > above one is picked first from that group and If I am correct none > > of > > the > > code fixes will go through the regression until this is fixed. I > > suspect it > > to be an infra issue again. 
If we look at > > https://review.gluster.org/#/c/glusterfs/+/22501/ & > > https://build.gluster.org/job/centos7-regression/5382/ peer > > handshaking is > > stuck as 127.1.1.1 is unable to receive a response back, did we end > > up > > having firewall and other n/w settings screwed up? The test never > > fails > > locally. > > The firewall didn't change, and since the start has a line: > "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost interface > work. (I am not even sure that netfilter do anything meaningful on > the > loopback interface, but maybe I am wrong, and not keen on looking > kernel code for that). > > > Ping seems to work fine as well, so we can exclude a routing issue. > > Maybe we should look at the socket, does it listen to a specific > address or not ? So, I did look at the 20 first ailure, removed all not related to rebal-all-nodes-migrate.t and seen all were run on builder203, who was freshly reinstalled. As Deepshika noticed today, this one had a issue with ipv6, the 2nd issue we were tracking. Summary, rpcbind.socket systemd unit listen on ipv6 despites ipv6 being disabled, and the fix is to reload systemd. We have so far no idea on why it happen, but suspect this might be related to the network issue we did identify, as that happen only after a reboot, that happen only if a build is cancelled/crashed/aborted. I apply the workaround on builder203, so if the culprit is that specific issue, guess that's fixed. I started a test to see how it go: https://build.gluster.org/job/centos7-regression/5383/ -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From mscherer at redhat.com Thu Apr 4 13:54:22 2019 From: mscherer at redhat.com (Michael Scherer) Date: Thu, 04 Apr 2019 15:54:22 +0200 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: Le jeudi 04 avril 2019 ? 15:19 +0200, Michael Scherer a ?crit : > Le jeudi 04 avril 2019 ? 13:53 +0200, Michael Scherer a ?crit : > > Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit : > > > Based on what I have seen that any multi node test case will fail > > > and > > > the > > > above one is picked first from that group and If I am correct > > > none > > > of > > > the > > > code fixes will go through the regression until this is fixed. I > > > suspect it > > > to be an infra issue again. If we look at > > > https://review.gluster.org/#/c/glusterfs/+/22501/ & > > > https://build.gluster.org/job/centos7-regression/5382/ peer > > > handshaking is > > > stuck as 127.1.1.1 is unable to receive a response back, did we > > > end > > > up > > > having firewall and other n/w settings screwed up? The test never > > > fails > > > locally. > > > > The firewall didn't change, and since the start has a line: > > "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost > > interface > > work. (I am not even sure that netfilter do anything meaningful on > > the > > loopback interface, but maybe I am wrong, and not keen on looking > > kernel code for that). > > > > > > Ping seems to work fine as well, so we can exclude a routing issue. > > > > Maybe we should look at the socket, does it listen to a specific > > address or not ? 
> > So, I did look at the 20 first ailure, removed all not related to > rebal-all-nodes-migrate.t and seen all were run on builder203, who > was > freshly reinstalled. As Deepshika noticed today, this one had a issue > with ipv6, the 2nd issue we were tracking. > > Summary, rpcbind.socket systemd unit listen on ipv6 despites ipv6 > being > disabled, and the fix is to reload systemd. We have so far no idea on > why it happen, but suspect this might be related to the network issue > we did identify, as that happen only after a reboot, that happen only > if a build is cancelled/crashed/aborted. > > I apply the workaround on builder203, so if the culprit is that > specific issue, guess that's fixed. > > I started a test to see how it go: > https://build.gluster.org/job/centos7-regression/5383/ The test did just pass, so I would assume the problem was local to builder203. Not sure why it was always selected, except because this was the only one that failed, so was always up for getting new jobs. Maybe we should increase the number of builder so this doesn't happen, as I guess the others builders were busy at that time ? -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From amukherj at redhat.com Thu Apr 4 15:55:39 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Thu, 4 Apr 2019 21:25:39 +0530 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: Thanks misc. I have always seen a pattern that on a reattempt (recheck centos) the same builder is picked up many time even though it's promised to pick up the builders in a round robin manner. On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer wrote: > Le jeudi 04 avril 2019 ? 15:19 +0200, Michael Scherer a ?crit : > > Le jeudi 04 avril 2019 ? 13:53 +0200, Michael Scherer a ?crit : > > > Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit : > > > > Based on what I have seen that any multi node test case will fail > > > > and > > > > the > > > > above one is picked first from that group and If I am correct > > > > none > > > > of > > > > the > > > > code fixes will go through the regression until this is fixed. I > > > > suspect it > > > > to be an infra issue again. If we look at > > > > https://review.gluster.org/#/c/glusterfs/+/22501/ & > > > > https://build.gluster.org/job/centos7-regression/5382/ peer > > > > handshaking is > > > > stuck as 127.1.1.1 is unable to receive a response back, did we > > > > end > > > > up > > > > having firewall and other n/w settings screwed up? The test never > > > > fails > > > > locally. > > > > > > The firewall didn't change, and since the start has a line: > > > "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost > > > interface > > > work. (I am not even sure that netfilter do anything meaningful on > > > the > > > loopback interface, but maybe I am wrong, and not keen on looking > > > kernel code for that). > > > > > > > > > Ping seems to work fine as well, so we can exclude a routing issue. > > > > > > Maybe we should look at the socket, does it listen to a specific > > > address or not ? 
> > > > So, I did look at the 20 first ailure, removed all not related to > > rebal-all-nodes-migrate.t and seen all were run on builder203, who > > was > > freshly reinstalled. As Deepshika noticed today, this one had a issue > > with ipv6, the 2nd issue we were tracking. > > > > Summary, rpcbind.socket systemd unit listen on ipv6 despites ipv6 > > being > > disabled, and the fix is to reload systemd. We have so far no idea on > > why it happen, but suspect this might be related to the network issue > > we did identify, as that happen only after a reboot, that happen only > > if a build is cancelled/crashed/aborted. > > > > I apply the workaround on builder203, so if the culprit is that > > specific issue, guess that's fixed. > > > > I started a test to see how it go: > > https://build.gluster.org/job/centos7-regression/5383/ > > The test did just pass, so I would assume the problem was local to > builder203. Not sure why it was always selected, except because this > was the only one that failed, so was always up for getting new jobs. > > Maybe we should increase the number of builder so this doesn't happen, > as I guess the others builders were busy at that time ? > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ykaul at redhat.com Thu Apr 4 16:10:34 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Thu, 4 Apr 2019 19:10:34 +0300 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: I'm not convinced this is solved. Just had what I believe is a similar failure: *00:12:02.532* A dependency job for rpc-statd.service failed. See 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is not running but is required for remote locking.*00:12:02.532* mount.nfs: Either use '-o nolock' to keep locks local, or start statd.*00:12:02.532* mount.nfs: an incorrect mount option was specified (of course, it can always be my patch!) https://build.gluster.org/job/centos7-regression/5384/console On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee wrote: > Thanks misc. I have always seen a pattern that on a reattempt (recheck > centos) the same builder is picked up many time even though it's promised > to pick up the builders in a round robin manner. > > On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer > wrote: > >> Le jeudi 04 avril 2019 ? 15:19 +0200, Michael Scherer a ?crit : >> > Le jeudi 04 avril 2019 ? 13:53 +0200, Michael Scherer a ?crit : >> > > Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit : >> > > > Based on what I have seen that any multi node test case will fail >> > > > and >> > > > the >> > > > above one is picked first from that group and If I am correct >> > > > none >> > > > of >> > > > the >> > > > code fixes will go through the regression until this is fixed. I >> > > > suspect it >> > > > to be an infra issue again. If we look at >> > > > https://review.gluster.org/#/c/glusterfs/+/22501/ & >> > > > https://build.gluster.org/job/centos7-regression/5382/ peer >> > > > handshaking is >> > > > stuck as 127.1.1.1 is unable to receive a response back, did we >> > > > end >> > > > up >> > > > having firewall and other n/w settings screwed up? The test never >> > > > fails >> > > > locally. 
>> > > >> > > The firewall didn't change, and since the start has a line: >> > > "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost >> > > interface >> > > work. (I am not even sure that netfilter do anything meaningful on >> > > the >> > > loopback interface, but maybe I am wrong, and not keen on looking >> > > kernel code for that). >> > > >> > > >> > > Ping seems to work fine as well, so we can exclude a routing issue. >> > > >> > > Maybe we should look at the socket, does it listen to a specific >> > > address or not ? >> > >> > So, I did look at the 20 first ailure, removed all not related to >> > rebal-all-nodes-migrate.t and seen all were run on builder203, who >> > was >> > freshly reinstalled. As Deepshika noticed today, this one had a issue >> > with ipv6, the 2nd issue we were tracking. >> > >> > Summary, rpcbind.socket systemd unit listen on ipv6 despites ipv6 >> > being >> > disabled, and the fix is to reload systemd. We have so far no idea on >> > why it happen, but suspect this might be related to the network issue >> > we did identify, as that happen only after a reboot, that happen only >> > if a build is cancelled/crashed/aborted. >> > >> > I apply the workaround on builder203, so if the culprit is that >> > specific issue, guess that's fixed. >> > >> > I started a test to see how it go: >> > https://build.gluster.org/job/centos7-regression/5383/ >> >> The test did just pass, so I would assume the problem was local to >> builder203. Not sure why it was always selected, except because this >> was the only one that failed, so was always up for getting new jobs. >> >> Maybe we should increase the number of builder so this doesn't happen, >> as I guess the others builders were busy at that time ? >> >> -- >> Michael Scherer >> Sysadmin, Community Infrastructure and Platform, OSAS >> >> >> _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Thu Apr 4 16:24:56 2019 From: mscherer at redhat.com (Michael Scherer) Date: Thu, 04 Apr 2019 18:24:56 +0200 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : > I'm not convinced this is solved. Just had what I believe is a > similar > failure: > > *00:12:02.532* A dependency job for rpc-statd.service failed. See > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is > not running but is required for remote locking.*00:12:02.532* > mount.nfs: Either use '-o nolock' to keep locks local, or start > statd.*00:12:02.532* mount.nfs: an incorrect mount option was > specified > > (of course, it can always be my patch!) > > https://build.gluster.org/job/centos7-regression/5384/console same issue, different builder (206). I will check them all, as the issue is more widespread than I expected (or it did popup since last time I checked). > > On Thu, Apr 4, 2019 at 6:56 PM Atin Mukherjee > wrote: > > > Thanks misc. I have always seen a pattern that on a reattempt > > (recheck > > centos) the same builder is picked up many time even though it's > > promised > > to pick up the builders in a round robin manner. > > > > On Thu, Apr 4, 2019 at 7:24 PM Michael Scherer > > > > wrote: > > > > > Le jeudi 04 avril 2019 ? 
15:19 +0200, Michael Scherer a ?crit : > > > > Le jeudi 04 avril 2019 ? 13:53 +0200, Michael Scherer a ?crit : > > > > > Le jeudi 04 avril 2019 ? 16:13 +0530, Atin Mukherjee a ?crit > > > > > : > > > > > > Based on what I have seen that any multi node test case > > > > > > will fail > > > > > > and > > > > > > the > > > > > > above one is picked first from that group and If I am > > > > > > correct > > > > > > none > > > > > > of > > > > > > the > > > > > > code fixes will go through the regression until this is > > > > > > fixed. I > > > > > > suspect it > > > > > > to be an infra issue again. If we look at > > > > > > https://review.gluster.org/#/c/glusterfs/+/22501/ & > > > > > > https://build.gluster.org/job/centos7-regression/5382/ peer > > > > > > handshaking is > > > > > > stuck as 127.1.1.1 is unable to receive a response back, > > > > > > did we > > > > > > end > > > > > > up > > > > > > having firewall and other n/w settings screwed up? The test > > > > > > never > > > > > > fails > > > > > > locally. > > > > > > > > > > The firewall didn't change, and since the start has a line: > > > > > "-A INPUT -i lo -j ACCEPT", so all traffic on the localhost > > > > > interface > > > > > work. (I am not even sure that netfilter do anything > > > > > meaningful on > > > > > the > > > > > loopback interface, but maybe I am wrong, and not keen on > > > > > looking > > > > > kernel code for that). > > > > > > > > > > > > > > > Ping seems to work fine as well, so we can exclude a routing > > > > > issue. > > > > > > > > > > Maybe we should look at the socket, does it listen to a > > > > > specific > > > > > address or not ? > > > > > > > > So, I did look at the 20 first ailure, removed all not related > > > > to > > > > rebal-all-nodes-migrate.t and seen all were run on builder203, > > > > who > > > > was > > > > freshly reinstalled. As Deepshika noticed today, this one had a > > > > issue > > > > with ipv6, the 2nd issue we were tracking. > > > > > > > > Summary, rpcbind.socket systemd unit listen on ipv6 despites > > > > ipv6 > > > > being > > > > disabled, and the fix is to reload systemd. We have so far no > > > > idea on > > > > why it happen, but suspect this might be related to the network > > > > issue > > > > we did identify, as that happen only after a reboot, that > > > > happen only > > > > if a build is cancelled/crashed/aborted. > > > > > > > > I apply the workaround on builder203, so if the culprit is that > > > > specific issue, guess that's fixed. > > > > > > > > I started a test to see how it go: > > > > https://build.gluster.org/job/centos7-regression/5383/ > > > > > > The test did just pass, so I would assume the problem was local > > > to > > > builder203. Not sure why it was always selected, except because > > > this > > > was the only one that failed, so was always up for getting new > > > jobs. > > > > > > Maybe we should increase the number of builder so this doesn't > > > happen, > > > as I guess the others builders were busy at that time ? > > > > > > -- > > > Michael Scherer > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > _______________________________________________ > > > > Gluster-devel mailing list > > Gluster-devel at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From bugzilla at redhat.com Fri Apr 5 04:51:44 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 05 Apr 2019 04:51:44 +0000 Subject: [Gluster-infra] [Bug 1696518] New: builder203 does not have a valid hostname set Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1696518 Bug ID: 1696518 Summary: builder203 does not have a valid hostname set Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: dkhandel at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: After reinstallation builder203 on AWS does not have a valid hostname set and hence it's network service might behave weird. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 5 05:55:21 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 05 Apr 2019 05:55:21 +0000 Subject: [Gluster-infra] [Bug 1696518] builder203 does not have a valid hostname set In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1696518 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #1 from M. Scherer --- Can you be a bit more specific on: - what network do behave weirdly ? I also did set the hostname (using hostnamectl), so maybe this requires a reboot, and/or a different hostname. -- You are receiving this mail because: You are on the CC list for the bug. From mscherer at redhat.com Fri Apr 5 06:45:19 2019 From: mscherer at redhat.com (Michael Scherer) Date: Fri, 05 Apr 2019 08:45:19 +0200 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> Message-ID: <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> Le jeudi 04 avril 2019 ? 18:24 +0200, Michael Scherer a ?crit : > Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : > > I'm not convinced this is solved. Just had what I believe is a > > similar > > failure: > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. See > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is > > not running but is required for remote locking.*00:12:02.532* > > mount.nfs: Either use '-o nolock' to keep locks local, or start > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was > > specified > > > > (of course, it can always be my patch!) > > > > https://build.gluster.org/job/centos7-regression/5384/console > > same issue, different builder (206). I will check them all, as the > issue is more widespread than I expected (or it did popup since last > time I checked). Deepshika did notice that the issue came back on one server (builder202) after a reboot, so the rpcbind issue is not related to the network initscript one, so the RCA continue. We are looking for another workaround involving fiddling with the socket (until we find why it do use ipv6 at boot, but not after, when ipv6 is disabled). Maybe we could run the test suite on a node without all the ipv6 disabling to see if that cause a issue ? -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From dkhandel at redhat.com Fri Apr 5 06:55:02 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Fri, 5 Apr 2019 12:25:02 +0530 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> Message-ID: On Fri, Apr 5, 2019 at 12:16 PM Michael Scherer wrote: > Le jeudi 04 avril 2019 ? 18:24 +0200, Michael Scherer a ?crit : > > Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : > > > I'm not convinced this is solved. Just had what I believe is a > > > similar > > > failure: > > > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. See > > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is > > > not running but is required for remote locking.*00:12:02.532* > > > mount.nfs: Either use '-o nolock' to keep locks local, or start > > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was > > > specified > > > > > > (of course, it can always be my patch!) > > > > > > https://build.gluster.org/job/centos7-regression/5384/console > > > > same issue, different builder (206). I will check them all, as the > > issue is more widespread than I expected (or it did popup since last > > time I checked). > > Deepshika did notice that the issue came back on one server > (builder202) after a reboot, so the rpcbind issue is not related to the > network initscript one, so the RCA continue. > > We are looking for another workaround involving fiddling with the > socket (until we find why it do use ipv6 at boot, but not after, when > ipv6 is disabled). > > Maybe we could run the test suite on a node without all the ipv6 > disabling to see if that cause a issue ? > Do our test regression suit started supporting ipv6 now? Else this investigation would lead to further issues. > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > _______________________________________________ > Gluster-infra mailing list > Gluster-infra at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-infra > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bugzilla at redhat.com Fri Apr 5 07:02:14 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 05 Apr 2019 07:02:14 +0000 Subject: [Gluster-infra] [Bug 1696518] builder203 does not have a valid hostname set In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1696518 --- Comment #2 from M. Scherer --- So, answering to myself, rpc.statd didn't start after reboot, and the hostname was ip-172-31-38-158.us-east-2.compute.internal. After "hostnamectl set-hostname builder203.int.aws.gluster.org", that's better; Guess we need to automate that (as I used builder203.aws.gluster.org and this was wrong). -- You are receiving this mail because: You are on the CC list for the bug. 
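A rough sketch of what automating that step could look like on a freshly provisioned builder (the FQDN is the one from the comment above; wiring this into the provisioning playbook is an assumption, not something already in place):

    # set the static hostname so services needing a valid FQDN (rpc.statd among them) behave
    hostnamectl set-hostname builder203.int.aws.gluster.org
    # confirm it replaced the EC2-generated ip-*.compute.internal name
    hostnamectl status | grep 'Static hostname'
    hostname -f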
From ykaul at redhat.com Fri Apr 5 07:09:44 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Fri, 5 Apr 2019 10:09:44 +0300 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> Message-ID: On Fri, Apr 5, 2019 at 9:55 AM Deepshikha Khandelwal wrote: > > > On Fri, Apr 5, 2019 at 12:16 PM Michael Scherer > wrote: > >> Le jeudi 04 avril 2019 ? 18:24 +0200, Michael Scherer a ?crit : >> > Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : >> > > I'm not convinced this is solved. Just had what I believe is a >> > > similar >> > > failure: >> > > >> > > *00:12:02.532* A dependency job for rpc-statd.service failed. See >> > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is >> > > not running but is required for remote locking.*00:12:02.532* >> > > mount.nfs: Either use '-o nolock' to keep locks local, or start >> > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was >> > > specified >> > > >> > > (of course, it can always be my patch!) >> > > >> > > https://build.gluster.org/job/centos7-regression/5384/console >> > >> > same issue, different builder (206). I will check them all, as the >> > issue is more widespread than I expected (or it did popup since last >> > time I checked). >> >> Deepshika did notice that the issue came back on one server >> (builder202) after a reboot, so the rpcbind issue is not related to the >> network initscript one, so the RCA continue. >> >> We are looking for another workaround involving fiddling with the >> socket (until we find why it do use ipv6 at boot, but not after, when >> ipv6 is disabled). >> >> Maybe we could run the test suite on a node without all the ipv6 >> disabling to see if that cause a issue ? >> > Do our test regression suit started supporting ipv6 now? Else this > investigation would lead to further issues. > I suspect not yet. But we certainly would like to, at some point, to ensure we run with IPv6 as well! Y. > -- >> Michael Scherer >> Sysadmin, Community Infrastructure and Platform, OSAS >> >> >> _______________________________________________ >> Gluster-infra mailing list >> Gluster-infra at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-infra >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Fri Apr 5 11:25:58 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Fri, 5 Apr 2019 16:55:58 +0530 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> Message-ID: On Fri, 5 Apr 2019 at 12:16, Michael Scherer wrote: > Le jeudi 04 avril 2019 ? 18:24 +0200, Michael Scherer a ?crit : > > Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : > > > I'm not convinced this is solved. Just had what I believe is a > > > similar > > > failure: > > > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. 
See > > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: rpc.statd is > > > not running but is required for remote locking.*00:12:02.532* > > > mount.nfs: Either use '-o nolock' to keep locks local, or start > > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was > > > specified > > > > > > (of course, it can always be my patch!) > > > > > > https://build.gluster.org/job/centos7-regression/5384/console > > > > same issue, different builder (206). I will check them all, as the > > issue is more widespread than I expected (or it did popup since last > > time I checked). > > Deepshika did notice that the issue came back on one server > (builder202) after a reboot, so the rpcbind issue is not related to the > network initscript one, so the RCA continue. > > We are looking for another workaround involving fiddling with the > socket (until we find why it do use ipv6 at boot, but not after, when > ipv6 is disabled). > Could this be relevant? https://access.redhat.com/solutions/2798411 > > Maybe we could run the test suite on a node without all the ipv6 > disabling to see if that cause a issue ? > > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Fri Apr 5 14:40:08 2019 From: mscherer at redhat.com (Michael Scherer) Date: Fri, 05 Apr 2019 16:40:08 +0200 Subject: [Gluster-infra] [Gluster-devel] rebal-all-nodes-migrate.t always fails now In-Reply-To: References: <94bd8147c5035da76c3ac3ae90a8a02ed000106a.camel@redhat.com> <0ca34e42063ad77f323155c85a7bb3ba7a79931b.camel@redhat.com> Message-ID: <090785225412c2b5b269454f8812d0a165aea62d.camel@redhat.com> Le vendredi 05 avril 2019 ? 16:55 +0530, Nithya Balachandran a ?crit : > On Fri, 5 Apr 2019 at 12:16, Michael Scherer > wrote: > > > Le jeudi 04 avril 2019 ? 18:24 +0200, Michael Scherer a ?crit : > > > Le jeudi 04 avril 2019 ? 19:10 +0300, Yaniv Kaul a ?crit : > > > > I'm not convinced this is solved. Just had what I believe is a > > > > similar > > > > failure: > > > > > > > > *00:12:02.532* A dependency job for rpc-statd.service failed. > > > > See > > > > 'journalctl -xe' for details.*00:12:02.532* mount.nfs: > > > > rpc.statd is > > > > not running but is required for remote locking.*00:12:02.532* > > > > mount.nfs: Either use '-o nolock' to keep locks local, or start > > > > statd.*00:12:02.532* mount.nfs: an incorrect mount option was > > > > specified > > > > > > > > (of course, it can always be my patch!) > > > > > > > > https://build.gluster.org/job/centos7-regression/5384/console > > > > > > same issue, different builder (206). I will check them all, as > > > the > > > issue is more widespread than I expected (or it did popup since > > > last > > > time I checked). > > > > Deepshika did notice that the issue came back on one server > > (builder202) after a reboot, so the rpcbind issue is not related to > > the > > network initscript one, so the RCA continue. > > > > We are looking for another workaround involving fiddling with the > > socket (until we find why it do use ipv6 at boot, but not after, > > when > > ipv6 is disabled). > > > > Could this be relevant? > https://access.redhat.com/solutions/2798411 Good catch. So, we already do that, Nigel took care of that (after 2 days of research). 
But I didn't knew the exact symptoms, and decided to double check just in case. And... there is no sysctl.conf in the initrd. Running dracut -v -f do not change anything. Running "dracut -v -f -H" take care of that (and this fix the problem), but: - our ansible script already run that - -H is hostonly, which is already the default on EL7 according to the doc. However, if dracut-config-generic is installed, it doesn't build a hostonly initrd, and so do not include the sysctl.conf file (who break rpcbnd, who break the test suite). And for some reason, it is installed the image in ec2 (likely default), but not by default on the builders. So what happen is that after a kernel upgrade, dracut rebuild a generic initrd instead of a hostonly one, who break things. And kernel was likely upgraded recently (and upgrade happen nightly (for some value of "night"), so we didn't see that earlier, nor with a fresh system. So now, we have several solution: - be explicit on using hostonly in dracut, so this doesn't happen again (or not for this reason) - disable ipv6 in rpcbind in a cleaner way (to be tested) - get the test suite work with ip v6 In the long term, I also want to monitor the processes, but for that, I need a VPN between the nagios server and ec2, and that project got blocked by several issues (like EC2 not support ecdsa keys, and we use that for ansible, so we have to come back to RSA for full automated deployment, and openvon requires to use certificates, so I need a newer python openssl for doing what I want, and RHEL 7 is too old, etc, etc). As the weekend approach for me, I just rebuilt the initrd for the time being. I guess forcing hostonly is the safest fix for now, but this will be for monday. -- Michael Scherer Sysadmin, Community Infrastructure and Platform, OSAS -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From bugzilla at redhat.com Tue Apr 9 07:18:53 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 09 Apr 2019 07:18:53 +0000 Subject: [Gluster-infra] [Bug 1697812] New: mention a pointer to all the mailing lists available under glusterfs project(https://www.gluster.org/community/) Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697812 Bug ID: 1697812 Summary: mention a pointer to all the mailing lists available under glusterfs project(https://www.gluster.org/community/) Product: GlusterFS Version: 6 Status: NEW Component: website Severity: medium Assignee: bugs at gluster.org Reporter: nchilaka at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: ======================== currently under mailing lists in https://www.gluster.org/community/ only gluster-devel and gluster-users are mentioned. However they are more mailing lists available. For eg; I was stuggling to find automated-testing mailing list as it was not mentioned. Expected: Put a reference/pointer saying that more mailing lists can be subscribed to from here -- You are receiving this mail because: You are on the CC list for the bug. 
From bugzilla at redhat.com Tue Apr 9 09:00:49 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 09 Apr 2019 09:00:49 +0000 Subject: [Gluster-infra] [Bug 1697890] New: centos-regression is not giving its vote Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697890 Bug ID: 1697890 Summary: centos-regression is not giving its vote Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: srakonde at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: on completion, centos-regression job is not giving its vote at the patches in gerrit. Such patches are: https://review.gluster.org/#/c/glusterfs/+/22528/ https://review.gluster.org/#/c/glusterfs/+/22530/ -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Tue Apr 9 10:15:18 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 09 Apr 2019 10:15:18 +0000 Subject: [Gluster-infra] [Bug 1697923] New: CI: collect core file in a job artifacts Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697923 Bug ID: 1697923 Summary: CI: collect core file in a job artifacts Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Severity: high Assignee: bugs at gluster.org Reporter: ykaul at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: When CI fails with a coredump, it'll be possibly useful to save that core dump for a more thorough investigation. Example - https://build.gluster.org/job/centos7-regression/5473/ There is just /var/log/glusterfs files there. Would it be possible to run perhaps abrt-action-analyze-backtrace to get a better result? -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Tue Apr 9 13:11:37 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 09 Apr 2019 13:11:37 +0000 Subject: [Gluster-infra] [Bug 1697923] CI: collect core file in a job artifacts In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697923 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com --- Comment #1 from Deepshikha khandelwal --- We do archive the core file at a centarlized log server. For the given regression build: https://logs.aws.gluster.org/centos7-regression-5473.tgz (build/install/cores/file) I need to look more abrt-action-analyze-backtrace and it can be implemented. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Tue Apr 9 13:23:06 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 09 Apr 2019 13:23:06 +0000 Subject: [Gluster-infra] [Bug 1697923] CI: collect core file in a job artifacts In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697923 --- Comment #2 from Yaniv Kaul --- (In reply to Deepshikha khandelwal from comment #1) > We do archive the core file at a centarlized log server. > > For the given regression build: > https://logs.aws.gluster.org/centos7-regression-5473.tgz > (build/install/cores/file) Yes, I just saw it now in the console of the Jenkins job. I believe the instructions there how to use gdb to look at it are outdated, btw. 
> > I need to look more abrt-action-analyze-backtrace and it can be implemented. I couldn't get it - but perhaps on the machine itself its doable. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 10 15:05:24 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 10 Apr 2019 15:05:24 +0000 Subject: [Gluster-infra] [Bug 1697890] centos-regression is not giving its vote In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697890 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com Flags| |needinfo?(srakonde at redhat.c | |om) --- Comment #1 from M. Scherer --- Mhh, I see that it did vote, can you explain a bit more the issue you have seen in details (as it could have been just a temporary problem) ? -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 10 15:07:39 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 10 Apr 2019 15:07:39 +0000 Subject: [Gluster-infra] [Bug 1697812] mention a pointer to all the mailing lists available under glusterfs project(https://www.gluster.org/community/) In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697812 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #1 from M. Scherer --- I do agree, I am however unsure on who is formally in charge of the website :/ I guess I do have the access to do the change, so if no one volunteer first, I will just go ahead and see. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 10 15:16:38 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 10 Apr 2019 15:16:38 +0000 Subject: [Gluster-infra] [Bug 1697890] centos-regression is not giving its vote In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697890 Sanju changed: What |Removed |Added ---------------------------------------------------------------------------- Flags|needinfo?(srakonde at redhat.c | |om) | --- Comment #2 from Sanju --- There was an issue, Deepshikha has fixed it. Now, you can close this BZ as it won't exist anymore. Thanks, Sanju -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 02:43:38 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 02:43:38 +0000 Subject: [Gluster-infra] [Bug 1698694] New: regression job isn't voting back to gerrit Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 Bug ID: 1698694 Summary: regression job isn't voting back to gerrit Product: GlusterFS Version: 6 Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: amukherj at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: Please see https://review.gluster.org/#/c/glusterfs/+/22544/ & https://build.gluster.org/job/centos7-regression/5518/ which should have voted back and that didn't happen. -- You are receiving this mail because: You are on the CC list for the bug. 
From bugzilla at redhat.com Thu Apr 11 04:15:30 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 04:15:30 +0000 Subject: [Gluster-infra] [Bug 1698694] regression job isn't voting back to gerrit In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com --- Comment #1 from Deepshikha khandelwal --- I thought I solved this issue. But looking more into the problem it seems the issue is not with Gerrit trigger plugin of Jenkins but with the builder. As I see all the patches which had due vote built on the same 203builder. Looking at what recent changes introduced this. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 05:14:27 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 05:14:27 +0000 Subject: [Gluster-infra] [Bug 1698716] New: Regression job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698716 Bug ID: 1698716 Summary: Regression job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: nbalacha at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: A full regression run has passed on https://review.gluster.org/#/c/glusterfs/+/22366/ but the Centos regression vote has not been updated. Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 08:55:44 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 08:55:44 +0000 Subject: [Gluster-infra] [Bug 1698716] Regression job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698716 Atin Mukherjee changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |amukherj at redhat.com --- Comment #1 from Atin Mukherjee --- I don't think full regression votes back into the gerrit. However even a normal regression run isn't doing either and a bug https://bugzilla.redhat.com/show_bug.cgi?id=1698694 is filed for the same. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 13:42:54 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 13:42:54 +0000 Subject: [Gluster-infra] [Bug 1698716] Regression job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698716 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #2 from M. Scherer --- Yep, they don't (afaik). -- You are receiving this mail because: You are on the CC list for the bug. 
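For context on these voting bugs, the vote itself is a single Gerrit SSH CLI call made at the end of the job; a minimal sketch (the user, host and label names follow the job console output quoted later in this archive, and BUILD_URL/GIT_COMMIT are standard Jenkins variables, but treating this as the exact job script is an assumption):

    # push the regression verdict back to the change under test
    ssh -o StrictHostKeyChecking=no build@review.gluster.org \
        gerrit review --project=glusterfs \
        --message "'${BUILD_URL}consoleFull : ${VERDICT}'" \
        --label CentOS-regression=${V} "${GIT_COMMIT}"

Anything that silently breaks this call, such as a wiped known_hosts after a reinstall, leaves the change without a vote even though the run itself finished.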
From bugzilla at redhat.com Thu Apr 11 13:46:26 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 13:46:26 +0000 Subject: [Gluster-infra] [Bug 1698694] regression job isn't voting back to gerrit In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #2 from M. Scherer --- Mhh, I see: 17:51:40 Host key verification failed. Could be something related to the reinstallation of the builder last time. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 13:51:28 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 13:51:28 +0000 Subject: [Gluster-infra] [Bug 1698694] regression job isn't voting back to gerrit In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 --- Comment #3 from M. Scherer --- Ok, so I suspect the problem is the following. We reinstalled the builder, so the home directory was erased. The script does run ssh to review, but it was blocked on the host key verification step, because no one was there to type "yes", since this was the first attempt. I did it manually (su - jenkins, ssh review.gluster.org), and I suspect things should be better now. If the symptom no longer appears, then my hypothesis was right. I think the fix would be to change the ssh command to accept the key on first use; I will provide a patch. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Thu Apr 11 14:37:59 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 11 Apr 2019 14:37:59 +0000 Subject: [Gluster-infra] [Bug 1698694] regression job isn't voting back to gerrit In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 --- Comment #4 from M. Scherer --- See https://review.gluster.org/#/c/build-jobs/+/22548 for my proposed fix. -- You are receiving this mail because: You are on the CC list for the bug. From dkhandel at redhat.com Thu Apr 11 16:43:54 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Thu, 11 Apr 2019 22:13:54 +0530 Subject: [Gluster-infra] Upgrading build.gluster.org Message-ID: Hello, I've planned an upgrade of build.gluster.org tomorrow morning to install and pull in the latest security updates of the Jenkins plugins. I'll stop all the running jobs and re-trigger them once I'm done with the upgrade. The downtime window will be: UTC: 0330 to 0400, IST: 0900 to 0930. The outage is for 30 minutes. Please bear with us as we continue to ensure the latest plugins and fixes for build.gluster.org. Thanks, Deepshikha -------------- next part -------------- An HTML attachment was scrubbed...
URL: From bugzilla at redhat.com Mon Apr 15 04:11:40 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 04:11:40 +0000 Subject: [Gluster-infra] [Bug 1699712] New: regression job is voting Success even in case of failure Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1699712 Bug ID: 1699712 Summary: regression job is voting Success even in case of failure Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Severity: urgent Priority: urgent Assignee: bugs at gluster.org Reporter: atumball at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: Check : https://build.gluster.org/job/centos7-regression/5596/consoleFull ---- 09:15:07 1 test(s) failed 09:15:07 ./tests/basic/uss.t 09:15:07 09:15:07 0 test(s) generated core 09:15:07 09:15:07 09:15:07 2 test(s) needed retry 09:15:07 ./tests/basic/quick-read-with-upcall.t 09:15:07 ./tests/basic/uss.t 09:15:07 09:15:07 Result is 124 09:15:07 09:15:07 tar: Removing leading `/' from member names 09:15:10 kernel.core_pattern = /%e-%p.core 09:15:10 + RET=0 09:15:10 + '[' 0 = 0 ']' 09:15:10 + V=+1 09:15:10 + VERDICT=SUCCESS ---- Version-Release number of selected component (if applicable): latest master -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 05:16:15 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 05:16:15 +0000 Subject: [Gluster-infra] [Bug 1699712] regression job is voting Success even in case of failure In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1699712 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com --- Comment #1 from Deepshikha khandelwal --- Thank you Amar for pointing this out. It turned out to behave like this because of the changes in config we made last Friday evening. Now that it is fixed, I re-triggered on three of the impacted patches. Sorry for this. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 06:31:32 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 06:31:32 +0000 Subject: [Gluster-infra] [Bug 1691617] clang-scan tests are failing nightly. In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1691617 --- Comment #2 from Amar Tumballi --- I guess it is fine to depend on f29 or f30. I know there are some warnings which we need to fix. But those are good to fix anyways. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 06:33:13 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 06:33:13 +0000 Subject: [Gluster-infra] [Bug 1691617] clang-scan tests are failing nightly. In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1691617 Amar Tumballi changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |WORKSFORME Last Closed| |2019-04-15 06:33:13 --- Comment #3 from Amar Tumballi --- Jobs are now running! -- You are receiving this mail because: You are on the CC list for the bug. 
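Tying this back to the 1699712 console output above, the intended behaviour is simply that the wrapper propagates the test run's status instead of resetting it; a minimal sketch of that intent (variable names come from that output, while the wrapper function and the actual config change that fixed the job are assumptions not shown in the thread):

    run_regression_suite              # hypothetical wrapper; "Result is 124" came from this step
    RET=$?                            # capture the real status, not a later command's 0
    if [ "$RET" -eq 0 ]; then
        V="+1"; VERDICT="SUCCESS"
    else
        V="-1"; VERDICT="FAILED"
    fi
    # V and VERDICT then feed the gerrit review call that posts the vote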
From bugzilla at redhat.com Mon Apr 15 11:31:23 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 11:31:23 +0000 Subject: [Gluster-infra] [Bug 1691357] core archive link from regression jobs throw not found error In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1691357 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com --- Comment #2 from Deepshikha khandelwal --- It is fixed now. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 11:31:44 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 11:31:44 +0000 Subject: [Gluster-infra] [Bug 1691357] core archive link from regression jobs throw not found error In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1691357 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |NOTABUG Last Closed| |2019-04-15 11:31:44 -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 11:33:35 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 11:33:35 +0000 Subject: [Gluster-infra] [Bug 1693295] rpc.statd not started on builder204.aws.gluster.org In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1693295 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED CC| |dkhandel at redhat.com Resolution|--- |DUPLICATE Last Closed| |2019-04-15 11:33:35 --- Comment #2 from Deepshikha khandelwal --- *** This bug has been marked as a duplicate of bug 1691789 *** -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 15 11:33:35 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 15 Apr 2019 11:33:35 +0000 Subject: [Gluster-infra] [Bug 1691789] rpc-statd service stops on AWS builders In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1691789 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |nbalacha at redhat.com --- Comment #1 from Deepshikha khandelwal --- *** Bug 1693295 has been marked as a duplicate of this bug. *** -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 17 05:56:24 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 05:56:24 +0000 Subject: [Gluster-infra] [Bug 1696518] builder203 does not have a valid hostname set In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1696518 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |NOTABUG Last Closed| |2019-04-17 05:56:24 --- Comment #3 from Deepshikha khandelwal --- Misc did set it up. Closing this one. -- You are receiving this mail because: You are on the CC list for the bug. 
From bugzilla at redhat.com Wed Apr 17 05:57:06 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 05:57:06 +0000 Subject: [Gluster-infra] [Bug 1697890] centos-regression is not giving its vote In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697890 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED CC| |dkhandel at redhat.com Resolution|--- |NOTABUG Last Closed| |2019-04-17 05:57:06 -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 17 07:53:00 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 07:53:00 +0000 Subject: [Gluster-infra] [Bug 1700695] New: smoke is failing for build https://review.gluster.org/#/c/glusterfs/+/22584/ Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1700695 Bug ID: 1700695 Summary: smoke is failing for build https://review.gluster.org/#/c/glusterfs/+/22584/ Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: moagrawa at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: Smoke is failing for build https://review.gluster.org/#/c/glusterfs/+/22584/ Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 17 08:27:48 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 08:27:48 +0000 Subject: [Gluster-infra] [Bug 1700695] smoke is failing for build https://review.gluster.org/#/c/glusterfs/+/22584/ In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1700695 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED CC| |dkhandel at redhat.com Resolution|--- |NOTABUG Last Closed| |2019-04-17 08:27:48 --- Comment #1 from Deepshikha khandelwal --- Correct set of the firewall was not applied after reboot. Misc fixed it. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 17 08:38:25 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 08:38:25 +0000 Subject: [Gluster-infra] [Bug 1697923] CI: collect core file in a job artifacts In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1697923 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |NOTABUG Last Closed| |2019-04-17 08:38:25 -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Wed Apr 17 09:37:09 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Wed, 17 Apr 2019 09:37:09 +0000 Subject: [Gluster-infra] [Bug 1700695] smoke is failing for build https://review.gluster.org/#/c/glusterfs/+/22584/ In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1700695 M. 
Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com Resolution|NOTABUG |CURRENTRELEASE --- Comment #2 from M. Scherer --- yeah, freebsd was still using chrono, the test firewall (I did set it up for the migration to nftables). For some reason, that firewall (that I almost decomissioned yesterday) do not seems to route packet anymore. The rules are in place, the config is the same as the 2 prod firewall. We need also some better monitoring for that. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 22 00:57:38 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 00:57:38 +0000 Subject: [Gluster-infra] [Bug 1564372] Setup Nagios server In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1564372 sankarshan changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |sankarshan at redhat.com -- You are receiving this mail because: You are on the CC list for the bug. From sankarshan.mukhopadhyay at gmail.com Mon Apr 22 01:00:02 2019 From: sankarshan.mukhopadhyay at gmail.com (Sankarshan Mukhopadhyay) Date: Mon, 22 Apr 2019 06:30:02 +0530 Subject: [Gluster-infra] Product: GlusterFS Component: project-infrastructure RHBZs in NEW and ASSIGNED Message-ID: will list RHBZs in these 2 status. Please review the ASSIGNED ones first to check if work on some of these are actually complete and the RHBZ state is stale. From bugzilla at redhat.com Mon Apr 22 03:18:12 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 03:18:12 +0000 Subject: [Gluster-infra] [Bug 1699712] regression job is voting Success even in case of failure In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1699712 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |CURRENTRELEASE Last Closed| |2019-04-22 03:18:12 -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 22 03:18:14 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 03:18:14 +0000 Subject: [Gluster-infra] [Bug 1701808] New: weird reasons for a regression failure. Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1701808 Bug ID: 1701808 Summary: weird reasons for a regression failure. Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: amukherj at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: https://review.gluster.org/#/c/glusterfs/+/22551/ was marked verified +1 and regression https://build.gluster.org/job/centos7-regression/5681/ failed with below reason. I don't believe any privilege violation has been attempted here. 08:40:13 Run the regression test 08:40:13 *********************** 08:40:13 08:40:13 08:40:13 We trust you have received the usual lecture from the local System 08:40:13 Administrator. It usually boils down to these three things: 08:40:13 08:40:13 #1) Respect the privacy of others. 08:40:13 #2) Think before you type. 08:40:13 #3) With great power comes great responsibility. 
08:40:13 08:40:13 sudo: no tty present and no askpass program specified 08:40:15 08:40:15 We trust you have received the usual lecture from the local System 08:40:15 Administrator. It usually boils down to these three things: 08:40:15 08:40:15 #1) Respect the privacy of others. 08:40:15 #2) Think before you type. 08:40:15 #3) With great power comes great responsibility. 08:40:15 08:40:15 sudo: no tty present and no askpass program specified 08:40:17 08:40:17 We trust you have received the usual lecture from the local System 08:40:17 Administrator. It usually boils down to these three things: 08:40:17 08:40:17 #1) Respect the privacy of others. 08:40:17 #2) Think before you type. 08:40:17 #3) With great power comes great responsibility. 08:40:17 08:40:17 sudo: no tty present and no askpass program specified 08:40:19 + ssh -o StrictHostKeyChecking=no build at review.gluster.org gerrit review --message ''\''https://build.gluster.org/job/centos7-regression/5681/consoleFull : FAILED'\''' --project=glusterfs --label CentOS-regression=-1 ea59ec6c19a402717b9848e51bfe79eb4b9728a7 08:40:20 + exit 1 Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 22 08:49:35 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 08:49:35 +0000 Subject: [Gluster-infra] [Bug 1701808] weird reasons for a regression failure. In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1701808 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com Assignee|bugs at gluster.org |dkhandel at redhat.com -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 22 13:54:32 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 13:54:32 +0000 Subject: [Gluster-infra] [Bug 1701936] New: comment-on-issue smoke job is experiencing crashes in certain conditions Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1701936 Bug ID: 1701936 Summary: comment-on-issue smoke job is experiencing crashes in certain conditions Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: srangana at redhat.com CC: amukherj at redhat.com, bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: The smoke job at [1] failed with the following stack trace, and hence smoke is failing for the patch under consideration. 
The stack in the logs looks as follows, 03:04:53 [comment-on-issue] $ /bin/sh -xe /tmp/jenkins7951448835552971908.sh 03:04:53 + echo https://review.gluster.org/22471 03:04:53 https://review.gluster.org/22471 03:04:53 + echo master 03:04:53 master 03:04:53 + /opt/qa/github/handle_github.py --repo glusterfs -c 03:04:55 Issues found in the commit message: [{u'status': u'Fixes', u'id': u'647'}] 03:04:55 Bug fix, no extra flags required 03:04:55 No issues found in the commit message 03:04:55 Old issues: [] 03:04:55 Traceback (most recent call last): 03:04:55 File "/opt/qa/github/handle_github.py", line 187, in 03:04:55 main(ARGS.repo, ARGS.dry_run, ARGS.comment_file) 03:04:55 File "/opt/qa/github/handle_github.py", line 165, in main 03:04:55 github.comment_on_issues(newissues, commit_msg) 03:04:55 File "/opt/qa/github/handle_github.py", line 47, in comment_on_issues 03:04:55 self._comment_on_issue(issue['id'], comment) 03:04:55 TypeError: string indices must be integers Request an analysis and fix of the same as required. [1] Smoke job link: https://build.gluster.org/job/comment-on-issue/13672/console -- You are receiving this mail because: You are on the CC list for the bug. From amukherj at redhat.com Mon Apr 22 17:27:57 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Mon, 22 Apr 2019 22:57:57 +0530 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Is this back again? The recent patches are failing regression :-\ . On Wed, 3 Apr 2019 at 19:26, Michael Scherer wrote: > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan > > wrote: > > > > > Hi, > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > command AFAIR. > > > I saw following messages in console output. > > > mount.nfs: rpc.statd is not running but is required for remote > > > locking. > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, or > > > start > > > statd. > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > For me it looks rpcbind may not be running on the machine. > > > Usually rpcbind starts automatically on machines, don't know > > > whether it > > > can happen or not. > > > > > > > That's precisely what the question is. Why suddenly we're seeing this > > happening too frequently. Today I saw atleast 4 to 5 such failures > > already. > > > > Deepshika - Can you please help in inspecting this? > > So we think (we are not sure) that the issue is a bit complex. > > What we were investigating was nightly run fail on aws. When the build > crash, the builder is restarted, since that's the easiest way to clean > everything (since even with a perfect test suite that would clean > itself, we could always end in a corrupt state on the system, WRT > mount, fs, etc). > > In turn, this seems to cause trouble on aws, since cloud-init or > something rename eth0 interface to ens5, without cleaning to the > network configuration. > > So the network init script fail (because the image say "start eth0" and > that's not present), but fail in a weird way. Network is initialised > and working (we can connect), but the dhclient process is not in the > right cgroup, and network.service is in failed state. Restarting > network didn't work. 
In turn, this mean that rpc-statd refuse to start > (due to systemd dependencies), which seems to impact various NFS tests. > > We have also seen that on some builders, rpcbind pick some IP v6 > autoconfiguration, but we can't reproduce that, and there is no ip v6 > set up anywhere. I suspect the network.service failure is somehow > involved, but fail to see how. In turn, rpcbind.socket not starting > could cause NFS test troubles. > > Our current stop gap fix was to fix all the builders one by one. Remove > the config, kill the rogue dhclient, restart network service. > > However, we can't be sure this is going to fix the problem long term > since this only manifest after a crash of the test suite, and it > doesn't happen so often. (plus, it was working before some day in the > past, when something did make this fail, and I do not know if that's a > system upgrade, or a test change, or both). > > So we are still looking at it to have a complete understanding of the > issue, but so far, we hacked our way to make it work (or so do I > think). > > Deepshika is working to fix it long term, by fixing the issue regarding > eth0/ens5 with a new base image. > -- > Michael Scherer > Sysadmin, Community Infrastructure and Platform, OSAS > > > -- - Atin (atinm) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bugzilla at redhat.com Mon Apr 22 18:16:34 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 22 Apr 2019 18:16:34 +0000 Subject: [Gluster-infra] [Bug 1701936] comment-on-issue smoke job is experiencing crashes in certain conditions In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1701936 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |CLOSED CC| |dkhandel at redhat.com Resolution|--- |CURRENTRELEASE Severity|high |unspecified Last Closed| |2019-04-22 18:16:34 --- Comment #2 from Deepshikha khandelwal --- It is fixed now: https://build.gluster.org/job/comment-on-issue/13697/console -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Tue Apr 23 05:09:41 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 23 Apr 2019 05:09:41 +0000 Subject: [Gluster-infra] [Bug 1701808] weird reasons for a regression failure. In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1701808 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |CURRENTRELEASE Last Closed| |2019-04-23 05:09:41 --- Comment #1 from Deepshikha khandelwal --- Jenkins user on regression machines had been moved from 'wheel' group to 'mock' secondary group by ansible playbook and hence lost it's sudo permissions from sudoers config file. The change in playbook has been reverted and is fixed. -- You are receiving this mail because: You are on the CC list for the bug. From atumball at redhat.com Tue Apr 23 08:30:29 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Tue, 23 Apr 2019 14:00:29 +0530 Subject: [Gluster-infra] Fwd: glusterfs coverity scan is paused? In-Reply-To: References: Message-ID: Looks like the issue with http post to coverity site. Needs to refresh the keys. 
https://build.gluster.org/job/coverity-nightly/ Regards, Amar ---------- Forwarded message --------- From: Atin Mukherjee Date: Tue, 23 Apr, 2019, 10:07 AM Subject: glusterfs coverity scan is paused? To: Cc: Amar Tumballi Hi Team, Based on https://scan.coverity.com/projects/gluster-glusterfs?tab=overview , the scan is paused for last 5 days. Could you help us in getting this resolved as we track this metrics regularly? -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Tue Apr 23 09:02:22 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Tue, 23 Apr 2019 14:32:22 +0530 Subject: [Gluster-infra] glusterfs coverity scan is paused? In-Reply-To: References: Message-ID: I just did. https://build.gluster.org/job/coverity-nightly/237/ On Tue, Apr 23, 2019 at 2:00 PM Amar Tumballi Suryanarayan < atumball at redhat.com> wrote: > Looks like the issue with http post to coverity site. Needs to refresh the > keys. > > https://build.gluster.org/job/coverity-nightly/ > > Regards, > Amar > > ---------- Forwarded message --------- > From: Atin Mukherjee > Date: Tue, 23 Apr, 2019, 10:07 AM > Subject: glusterfs coverity scan is paused? > To: > Cc: Amar Tumballi > > > Hi Team, > > Based on https://scan.coverity.com/projects/gluster-glusterfs?tab=overview > , the scan is paused for last 5 days. Could you help us in getting this > resolved as we track this metrics regularly? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Tue Apr 23 09:45:38 2019 From: mscherer at redhat.com (Michael Scherer) Date: Tue, 23 Apr 2019 11:45:38 +0200 Subject: [Gluster-infra] Fwd: glusterfs coverity scan is paused? In-Reply-To: References: Message-ID: Le mardi 23 avril 2019 ? 14:00 +0530, Amar Tumballi Suryanarayan a ?crit : > Looks like the issue with http post to coverity site. Needs to > refresh the > keys. > > https://build.gluster.org/job/coverity-nightly/ Problem is on their side, they forgot to include the intermediate certificate (so curl complain). To be fair, that's a mistake I keep doing too, and that's hard to notice, cause firefox tend to cache the said intermediate certificate, so you can't see with a casual check. Their monitoring should pick it up somehow I hope, and I think we are not the only impacted person. -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From dkhandel at redhat.com Tue Apr 23 10:22:46 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Tue, 23 Apr 2019 15:52:46 +0530 Subject: [Gluster-infra] Fwd: glusterfs coverity scan is paused? In-Reply-To: References: Message-ID: Agreed. For the time being, I've turned off the cert verification and you will get the coverage reports. We have to check with the Coverity team about this. On Tue, Apr 23, 2019 at 3:15 PM Michael Scherer wrote: > Le mardi 23 avril 2019 ? 14:00 +0530, Amar Tumballi Suryanarayan a > ?crit : > > Looks like the issue with http post to coverity site. Needs to > > refresh the > > keys. > > > > https://build.gluster.org/job/coverity-nightly/ > > Problem is on their side, they forgot to include the intermediate > certificate (so curl complain). 
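For reference, one way to confirm an incomplete chain like this from a builder is sketched below; it assumes openssl and curl are available on the machine, and scan.coverity.com is simply the host from the scan URL above:

    # show the verification result for the chain the server actually sends;
    # a missing intermediate typically ends with
    # "Verify return code: 21 (unable to verify the first certificate)"
    openssl s_client -connect scan.coverity.com:443 -servername scan.coverity.com -showcerts </dev/null 2>/dev/null | grep 'Verify return code'

    # curl fails the same way the Jenkins job does, since it has no browser cache of intermediates
    curl -vI https://scan.coverity.com/ 2>&1 | grep -i certificate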
> > To be fair, that's a mistake I keep doing too, and that's hard to > notice, cause firefox tend to cache the said intermediate certificate, > so you can't see with a casual check. > > Their monitoring should pick it up somehow I hope, and I think we are > not the only impacted person. > -- > Michael Scherer > Sysadmin, Community Infrastructure > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mscherer at redhat.com Tue Apr 23 14:14:49 2019 From: mscherer at redhat.com (Michael Scherer) Date: Tue, 23 Apr 2019 16:14:49 +0200 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > Is this back again? The recent patches are failing regression :-\ . So, on builder206, it took me a while to find that the issue is that nfs (the service) was running. ./tests/basic/afr/tarissue.t failed, because the nfs initialisation failed with a rather cryptic message: [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0- socket.nfs-server: process started listening on port (38465) [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0- socket.nfs-server: binding to failed: Address already in use [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0- socket.nfs-server: Port is already in use [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- socket.nfs-server: __socket_server_bind failed;closing socket 14 I found where this came from, but a few stuff did surprised me: - the order of print is different that the order in the code - the message on "started listening" didn't take in account the fact that bind failed on: https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 The message about port 38465 also threw me off the track. The real issue is that the service nfs was already running, and I couldn't find anything listening on port 38465 once I do service nfs stop, it no longer failed. So far, I do know why nfs.service was activated. But at least, 206 should be fixed, and we know a bit more on what would be causing some failure. > On Wed, 3 Apr 2019 at 19:26, Michael Scherer > wrote: > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > jthottan at redhat.com> > > > wrote: > > > > > > > Hi, > > > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > > command AFAIR. > > > > I saw following messages in console output. > > > > mount.nfs: rpc.statd is not running but is required for remote > > > > locking. > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, > > > > or > > > > start > > > > statd. > > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > > > For me it looks rpcbind may not be running on the machine. > > > > Usually rpcbind starts automatically on machines, don't know > > > > whether it > > > > can happen or not. > > > > > > > > > > That's precisely what the question is. Why suddenly we're seeing > > > this > > > happening too frequently. Today I saw atleast 4 to 5 such > > > failures > > > already. > > > > > > Deepshika - Can you please help in inspecting this? > > > > So we think (we are not sure) that the issue is a bit complex. 
> > > > What we were investigating was nightly run fail on aws. When the > > build > > crash, the builder is restarted, since that's the easiest way to > > clean > > everything (since even with a perfect test suite that would clean > > itself, we could always end in a corrupt state on the system, WRT > > mount, fs, etc). > > > > In turn, this seems to cause trouble on aws, since cloud-init or > > something rename eth0 interface to ens5, without cleaning to the > > network configuration. > > > > So the network init script fail (because the image say "start eth0" > > and > > that's not present), but fail in a weird way. Network is > > initialised > > and working (we can connect), but the dhclient process is not in > > the > > right cgroup, and network.service is in failed state. Restarting > > network didn't work. In turn, this mean that rpc-statd refuse to > > start > > (due to systemd dependencies), which seems to impact various NFS > > tests. > > > > We have also seen that on some builders, rpcbind pick some IP v6 > > autoconfiguration, but we can't reproduce that, and there is no ip > > v6 > > set up anywhere. I suspect the network.service failure is somehow > > involved, but fail to see how. In turn, rpcbind.socket not starting > > could cause NFS test troubles. > > > > Our current stop gap fix was to fix all the builders one by one. > > Remove > > the config, kill the rogue dhclient, restart network service. > > > > However, we can't be sure this is going to fix the problem long > > term > > since this only manifest after a crash of the test suite, and it > > doesn't happen so often. (plus, it was working before some day in > > the > > past, when something did make this fail, and I do not know if > > that's a > > system upgrade, or a test change, or both). > > > > So we are still looking at it to have a complete understanding of > > the > > issue, but so far, we hacked our way to make it work (or so do I > > think). > > > > Deepshika is working to fix it long term, by fixing the issue > > regarding > > eth0/ens5 with a new base image. > > -- > > Michael Scherer > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > -- > > - Atin (atinm) -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From jthottan at redhat.com Wed Apr 24 06:59:11 2019 From: jthottan at redhat.com (Jiffin Thottan) Date: Wed, 24 Apr 2019 02:59:11 -0400 (EDT) Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: <527845347.24314753.1556089151794.JavaMail.zimbra@redhat.com> Below looks like kernel nfs was started (may be enabled on the machine). Did u start rpcbind manually on that machine, if yes can u please check kernel nfs status before and after that service? -- Jiffin ----- Original Message ----- From: "Michael Scherer" To: "Atin Mukherjee" Cc: "Deepshikha Khandelwal" , "Gluster Devel" , "Jiffin Thottan" , "gluster-infra" Sent: Tuesday, April 23, 2019 7:44:49 PM Subject: Re: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > Is this back again? The recent patches are failing regression :-\ . 
So, on builder206, it took me a while to find that the issue is that nfs (the service) was running. ./tests/basic/afr/tarissue.t failed, because the nfs initialisation failed with a rather cryptic message: [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0- socket.nfs-server: process started listening on port (38465) [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0- socket.nfs-server: binding to failed: Address already in use [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0- socket.nfs-server: Port is already in use [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- socket.nfs-server: __socket_server_bind failed;closing socket 14 I found where this came from, but a few stuff did surprised me: - the order of print is different that the order in the code - the message on "started listening" didn't take in account the fact that bind failed on: https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 The message about port 38465 also threw me off the track. The real issue is that the service nfs was already running, and I couldn't find anything listening on port 38465 once I do service nfs stop, it no longer failed. So far, I do know why nfs.service was activated. But at least, 206 should be fixed, and we know a bit more on what would be causing some failure. > On Wed, 3 Apr 2019 at 19:26, Michael Scherer > wrote: > > > Le mercredi 03 avril 2019 ? 16:30 +0530, Atin Mukherjee a ?crit : > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > jthottan at redhat.com> > > > wrote: > > > > > > > Hi, > > > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > > command AFAIR. > > > > I saw following messages in console output. > > > > mount.nfs: rpc.statd is not running but is required for remote > > > > locking. > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, > > > > or > > > > start > > > > statd. > > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > > > For me it looks rpcbind may not be running on the machine. > > > > Usually rpcbind starts automatically on machines, don't know > > > > whether it > > > > can happen or not. > > > > > > > > > > That's precisely what the question is. Why suddenly we're seeing > > > this > > > happening too frequently. Today I saw atleast 4 to 5 such > > > failures > > > already. > > > > > > Deepshika - Can you please help in inspecting this? > > > > So we think (we are not sure) that the issue is a bit complex. > > > > What we were investigating was nightly run fail on aws. When the > > build > > crash, the builder is restarted, since that's the easiest way to > > clean > > everything (since even with a perfect test suite that would clean > > itself, we could always end in a corrupt state on the system, WRT > > mount, fs, etc). > > > > In turn, this seems to cause trouble on aws, since cloud-init or > > something rename eth0 interface to ens5, without cleaning to the > > network configuration. > > > > So the network init script fail (because the image say "start eth0" > > and > > that's not present), but fail in a weird way. Network is > > initialised > > and working (we can connect), but the dhclient process is not in > > the > > right cgroup, and network.service is in failed state. Restarting > > network didn't work. In turn, this mean that rpc-statd refuse to > > start > > (due to systemd dependencies), which seems to impact various NFS > > tests. 
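A rough checklist for telling apart "kernel NFS grabbed the ports" from "rpcbind/rpc-statd never came up" on a builder (assuming the EL7 images used for regression; unit names may differ slightly between releases):

    # is kernel NFS enabled or active, and are rpcbind / rpc-statd healthy?
    systemctl status nfs-server rpcbind rpc-statd
    systemctl is-enabled nfs-server rpcbind

    # what is registered with the portmapper, and who owns the ports
    # (111 = rpcbind, 2049 = kernel NFS, 38465 = gluster NFS server)?
    rpcinfo -p
    ss -tlnp | grep -E ':(111|2049|38465)\b'

    # the same call that is_nfs_export_available wraps
    showmount -e localhost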
> > > > We have also seen that on some builders, rpcbind pick some IP v6 > > autoconfiguration, but we can't reproduce that, and there is no ip > > v6 > > set up anywhere. I suspect the network.service failure is somehow > > involved, but fail to see how. In turn, rpcbind.socket not starting > > could cause NFS test troubles. > > > > Our current stop gap fix was to fix all the builders one by one. > > Remove > > the config, kill the rogue dhclient, restart network service. > > > > However, we can't be sure this is going to fix the problem long > > term > > since this only manifest after a crash of the test suite, and it > > doesn't happen so often. (plus, it was working before some day in > > the > > past, when something did make this fail, and I do not know if > > that's a > > system upgrade, or a test change, or both). > > > > So we are still looking at it to have a complete understanding of > > the > > issue, but so far, we hacked our way to make it work (or so do I > > think). > > > > Deepshika is working to fix it long term, by fixing the issue > > regarding > > eth0/ens5 with a new base image. > > -- > > Michael Scherer > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > -- > > - Atin (atinm) -- Michael Scherer Sysadmin, Community Infrastructure From ykaul at redhat.com Wed Apr 24 11:59:11 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Wed, 24 Apr 2019 14:59:11 +0300 Subject: [Gluster-infra] [Gluster-devel] is_nfs_export_available from nfs.rc failing too often? In-Reply-To: References: <2056284426.17636953.1554272780313.JavaMail.zimbra@redhat.com> <797512f6ff7f1b9fedbf8b7968dd86a6968d9105.camel@redhat.com> Message-ID: On Tue, Apr 23, 2019 at 5:15 PM Michael Scherer wrote: > Le lundi 22 avril 2019 ? 22:57 +0530, Atin Mukherjee a ?crit : > > Is this back again? The recent patches are failing regression :-\ . > > So, on builder206, it took me a while to find that the issue is that > nfs (the service) was running. > > ./tests/basic/afr/tarissue.t failed, because the nfs initialisation > failed with a rather cryptic message: > > [2019-04-23 13:17:05.371733] I [socket.c:991:__socket_server_bind] 0- > socket.nfs-server: process started listening on port (38465) > [2019-04-23 13:17:05.385819] E [socket.c:972:__socket_server_bind] 0- > socket.nfs-server: binding to failed: Address already in use > [2019-04-23 13:17:05.385843] E [socket.c:974:__socket_server_bind] 0- > socket.nfs-server: Port is already in use > [2019-04-23 13:17:05.385852] E [socket.c:3788:socket_listen] 0- > socket.nfs-server: __socket_server_bind failed;closing socket 14 > > I found where this came from, but a few stuff did surprised me: > > - the order of print is different that the order in the code > Indeed strange... > - the message on "started listening" didn't take in account the fact > that bind failed on: > Shouldn't it bail out if it failed to bind? Some missing 'goto out' around line 975/976? Y. > > > > https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/socket/src/socket.c#L967 > > The message about port 38465 also threw me off the track. The real > issue is that the service nfs was already running, and I couldn't find > anything listening on port 38465 > > once I do service nfs stop, it no longer failed. > > So far, I do know why nfs.service was activated. > > But at least, 206 should be fixed, and we know a bit more on what would > be causing some failure. > > > > > On Wed, 3 Apr 2019 at 19:26, Michael Scherer > > wrote: > > > > > Le mercredi 03 avril 2019 ? 
16:30 +0530, Atin Mukherjee a ?crit : > > > > On Wed, Apr 3, 2019 at 11:56 AM Jiffin Thottan < > > > > jthottan at redhat.com> > > > > wrote: > > > > > > > > > Hi, > > > > > > > > > > is_nfs_export_available is just a wrapper around "showmount" > > > > > command AFAIR. > > > > > I saw following messages in console output. > > > > > mount.nfs: rpc.statd is not running but is required for remote > > > > > locking. > > > > > 05:06:55 mount.nfs: Either use '-o nolock' to keep locks local, > > > > > or > > > > > start > > > > > statd. > > > > > 05:06:55 mount.nfs: an incorrect mount option was specified > > > > > > > > > > For me it looks rpcbind may not be running on the machine. > > > > > Usually rpcbind starts automatically on machines, don't know > > > > > whether it > > > > > can happen or not. > > > > > > > > > > > > > That's precisely what the question is. Why suddenly we're seeing > > > > this > > > > happening too frequently. Today I saw atleast 4 to 5 such > > > > failures > > > > already. > > > > > > > > Deepshika - Can you please help in inspecting this? > > > > > > So we think (we are not sure) that the issue is a bit complex. > > > > > > What we were investigating was nightly run fail on aws. When the > > > build > > > crash, the builder is restarted, since that's the easiest way to > > > clean > > > everything (since even with a perfect test suite that would clean > > > itself, we could always end in a corrupt state on the system, WRT > > > mount, fs, etc). > > > > > > In turn, this seems to cause trouble on aws, since cloud-init or > > > something rename eth0 interface to ens5, without cleaning to the > > > network configuration. > > > > > > So the network init script fail (because the image say "start eth0" > > > and > > > that's not present), but fail in a weird way. Network is > > > initialised > > > and working (we can connect), but the dhclient process is not in > > > the > > > right cgroup, and network.service is in failed state. Restarting > > > network didn't work. In turn, this mean that rpc-statd refuse to > > > start > > > (due to systemd dependencies), which seems to impact various NFS > > > tests. > > > > > > We have also seen that on some builders, rpcbind pick some IP v6 > > > autoconfiguration, but we can't reproduce that, and there is no ip > > > v6 > > > set up anywhere. I suspect the network.service failure is somehow > > > involved, but fail to see how. In turn, rpcbind.socket not starting > > > could cause NFS test troubles. > > > > > > Our current stop gap fix was to fix all the builders one by one. > > > Remove > > > the config, kill the rogue dhclient, restart network service. > > > > > > However, we can't be sure this is going to fix the problem long > > > term > > > since this only manifest after a crash of the test suite, and it > > > doesn't happen so often. (plus, it was working before some day in > > > the > > > past, when something did make this fail, and I do not know if > > > that's a > > > system upgrade, or a test change, or both). > > > > > > So we are still looking at it to have a complete understanding of > > > the > > > issue, but so far, we hacked our way to make it work (or so do I > > > think). > > > > > > Deepshika is working to fix it long term, by fixing the issue > > > regarding > > > eth0/ens5 with a new base image. 
> > > -- > > > Michael Scherer > > > Sysadmin, Community Infrastructure and Platform, OSAS > > > > > > > > > -- > > > > - Atin (atinm) > -- > Michael Scherer > Sysadmin, Community Infrastructure > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -------------- next part -------------- An HTML attachment was scrubbed... URL: From bugzilla at redhat.com Thu Apr 25 09:45:09 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Thu, 25 Apr 2019 09:45:09 +0000 Subject: [Gluster-infra] [Bug 1698716] Regression job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698716 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED CC| |dkhandel at redhat.com Resolution|--- |NOTABUG Last Closed| |2019-04-25 09:45:09 -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 06:07:45 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 06:07:45 +0000 Subject: [Gluster-infra] [Bug 1703329] New: [Plus one scale]: Please create repo for plus one scale work Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 Bug ID: 1703329 Summary: [Plus one scale]: Please create repo for plus one scale work Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: aspandey at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: Please create repo for plus one scale work under gluster on github Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: repo on gluster github with following name - "plus-one-scale" Additional info: -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 06:08:30 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 06:08:30 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 Ashish Pandey changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|[Plus one scale]: Please |[gluster-infra]: Please |create repo for plus one |create repo for plus one |scale work |scale work -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 06:33:15 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 06:33:15 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #1 from Ashish Pandey --- The name could be "gluster-plus-one-scale" Let me know if you need any information regarding this project. -- You are receiving this mail because: You are on the CC list for the bug. 
From bugzilla at redhat.com Fri Apr 26 08:13:48 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 08:13:48 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 M. Scherer changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |mscherer at redhat.com --- Comment #2 from M. Scherer --- Yes, I would like to get more information, like what is it going to be used for (or more clearly, what is the project exactly), who need access, etc. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 11:16:08 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 11:16:08 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #3 from Ashish Pandey --- Problem: To increase the capacity of existing gluster volume (AFR or EC), we need to add more bricks on that volume. Currently it is required to add as many node as it is required to keep the volume fault tolerant, which depends on the configuration of the volume. This in turn requires adding more than 1 node to scale gluster volume. However, it is not always possible to buy and place lot of nodes/serves in one shot and provide it to scale our volume. To solve this we have to come up with some solution so that we can scale out volume even if we add one server with enough bricks on that one server. This tool is going to help user to move the drives from one server to other server so that all the bricks are properly distributed and new bricks can be added to volume with complete fault tolerance. Access is needed by following team members for now - 1 - vbellur at redhat.com 2 - atumball at redhat.com 3 - aspandey at redhat.com github issues related to this - https://github.com/gluster/glusterfs/issues/169 https://github.com/gluster/glusterfs/issues/497 https://github.com/gluster/glusterfs/issues/632 --- Ashish -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 12:08:59 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 12:08:59 +0000 Subject: [Gluster-infra] [Bug 1703433] New: gluster-block: setup GCOV & LCOV job Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703433 Bug ID: 1703433 Summary: gluster-block: setup GCOV & LCOV job Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: prasanna.kalever at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: ### Kind of issue Infra request: Setup GCOV & LCOV job for gluster-block project which should run nightly or weekly to help visualize test & line coverage achieved, which intern help us understand: * how often each line of code executes * what lines of code are actually executed * how much computing time each section of code uses and work on possible optimizations. 
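### Possible job outline (sketch)

Roughly, the nightly/weekly job could build with coverage flags and publish an lcov report. This is only a sketch: it assumes the usual autotools flow accepts standard CFLAGS, and the output file names are placeholders:

    # build with gcov instrumentation
    ./autogen.sh && ./configure CFLAGS="-O0 -g --coverage" LDFLAGS="--coverage"
    make -j && make install

    # run the test suite so the .gcda data gets generated
    ./tests/basic.t

    # collect the data and render an HTML report
    lcov --capture --directory . --output-file coverage.info
    genhtml coverage.info --output-directory coverage-html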
### Other useful information Repo: https://github.com/gluster/gluster-block To start with, we can build like we do with [travis](https://github.com/gluster/gluster-block/blob/master/extras/docker/Dockerfile.fedora29) Need to install run time dependencies: # yum install glusterfs-server targetcli tcmu-runner then start glusterd # systemctl start glusterd Then run simple test case # [./tests/basic.t](https://github.com/gluster/gluster-block/blob/master/tests/basic.t) -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 12:12:18 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 12:12:18 +0000 Subject: [Gluster-infra] [Bug 1703435] New: gluster-block: Upstream Jenkins job which get triggered at PR level Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703435 Bug ID: 1703435 Summary: gluster-block: Upstream Jenkins job which get triggered at PR level Product: GlusterFS Version: mainline Status: NEW Component: project-infrastructure Assignee: bugs at gluster.org Reporter: prasanna.kalever at redhat.com CC: bugs at gluster.org, gluster-infra at gluster.org Target Milestone: --- Classification: Community Description of problem: ### Kind of issue Infra request: Need a Jenkins job for gluster-block project which should run per PR (and may be more events like, refresh of PR) to help figure out any possible regressions and to help build overall confidence of upstream master branch. ### Other useful information Repo: https://github.com/gluster/gluster-block To start with, we can build like we do with [travis](https://github.com/gluster/gluster-block/blob/master/extras/docker/Dockerfile.fedora29) Need to install run time dependencies: # yum install glusterfs-server targetcli tcmu-runner then start glusterd # systemctl start glusterd Then run simple test case # [./tests/basic.t](https://github.com/gluster/gluster-block/blob/master/tests/basic.t) -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Fri Apr 26 15:39:28 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Fri, 26 Apr 2019 15:39:28 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #4 from M. Scherer --- I created the repo, but neither atumball at redhat.com nor aspandey at redhat.com are valid email on github. I rather not guess that, so I would need either the gthub username, or the email used for the account to add them. -- You are receiving this mail because: You are on the CC list for the bug. From ykaul at redhat.com Sat Apr 27 19:18:41 2019 From: ykaul at redhat.com (Yaniv Kaul) Date: Sat, 27 Apr 2019 22:18:41 +0300 Subject: [Gluster-infra] Do we have a monitoring system on our builders? Message-ID: I'd like to see what is our status. Just had CI failures[1] because builder26.int.rht.gluster.org is not available, apparently. TIA, Y. [1] https://build.gluster.org/job/devrpm-el7/15846/console -------------- next part -------------- An HTML attachment was scrubbed... URL: From sankarshan.mukhopadhyay at gmail.com Sun Apr 28 05:06:26 2019 From: sankarshan.mukhopadhyay at gmail.com (Sankarshan Mukhopadhyay) Date: Sun, 28 Apr 2019 10:36:26 +0530 Subject: [Gluster-infra] Do we have a monitoring system on our builders? 
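Putting the steps from the two gluster-block requests above together, the per-PR run boils down to something like the script below. This is a sketch only: the job name, node label and PR-refresh triggers still have to be decided, and the build step assumes the usual autotools flow used in the referenced Dockerfile:

    #!/bin/bash
    # gluster-block smoke sketch: install runtime deps, start glusterd, run the basic test
    set -e

    # runtime dependencies and the management daemon, as listed in the reports
    yum -y install glusterfs-server targetcli tcmu-runner
    systemctl start glusterd

    # build the PR checkout (assumption: autogen/configure as in the Dockerfile)
    ./autogen.sh && ./configure && make -j && make install

    # the test case both reports point at
    ./tests/basic.t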
In-Reply-To: References: Message-ID: On Sun, Apr 28, 2019 at 12:49 AM Yaniv Kaul wrote: > > I'd like to see what is our status. is what we have at this point > Just had CI failures[1] because builder26.int.rht.gluster.org is not available, apparently. > > TIA, > Y. > > [1] https://build.gluster.org/job/devrpm-el7/15846/console From amukherj at redhat.com Sun Apr 28 13:13:22 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Sun, 28 Apr 2019 18:43:22 +0530 Subject: [Gluster-infra] [Gluster-devel] Weekly Untriaged Bugs In-Reply-To: <307671157.25.1555897502768.JavaMail.jenkins@jenkins-el7.rht.gluster.org> References: <307671157.25.1555897502768.JavaMail.jenkins@jenkins-el7.rht.gluster.org> Message-ID: While I understand this report captured bugs filed since last 1 week and do not have ?Triaged? keyword, does it make better sense to exclude bugs which aren?t in NEW state? I believe the intention of this report is to check what all bugs haven?t been looked at by maintainers/developers yet. BZs which are already fixed or in ASSIGNED/POST state need not to feature in this list is what I believe as otherwise it gives a false impression that too many bugs are getting unnoticed which isn?t the reality. Thoughts? On Mon, 22 Apr 2019 at 07:15, wrote: > [...truncated 6 lines...] > https://bugzilla.redhat.com/1699023 / core: Brick is not able to detach > successfully in brick_mux environment > https://bugzilla.redhat.com/1695416 / core: client log flooding with > intentional socket shutdown message when a brick is down > https://bugzilla.redhat.com/1695480 / core: Global Thread Pool > https://bugzilla.redhat.com/1694943 / core: parallel-readdir slows down > directory listing > https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed > immediately even with O_SYNC in glfs_create or with > glfs_fsync/glfs_fdatasync after glfs_write. > https://bugzilla.redhat.com/1698861 / disperse: Renaming a directory when > 2 bricks of multiple disperse subvols are down leaves both old and new dirs > on the bricks. > https://bugzilla.redhat.com/1697293 / distribute: DHT: print hash and > layout values in hexadecimal format in the logs > https://bugzilla.redhat.com/1701039 / distribute: gluster replica 3 > arbiter Unfortunately data not distributed equally > https://bugzilla.redhat.com/1697971 / fuse: Segfault in FUSE process, > potential use after free > https://bugzilla.redhat.com/1694139 / glusterd: Error waiting for job > 'heketi-storage-copy-job' to complete on one-node k3s deployment. 
> https://bugzilla.redhat.com/1695099 / glusterd: The number of glusterfs > processes keeps increasing, using all available resources > https://bugzilla.redhat.com/1692349 / project-infrastructure: > gluster-csi-containers job is failing > https://bugzilla.redhat.com/1698716 / project-infrastructure: Regression > job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ > https://bugzilla.redhat.com/1698694 / project-infrastructure: regression > job isn't voting back to gerrit > https://bugzilla.redhat.com/1699712 / project-infrastructure: regression > job is voting Success even in case of failure > https://bugzilla.redhat.com/1693385 / project-infrastructure: request to > change the version of fedora in fedora-smoke-job > https://bugzilla.redhat.com/1695484 / project-infrastructure: smoke fails > with "Build root is locked by another process" > https://bugzilla.redhat.com/1693184 / replicate: A brick > process(glusterfsd) died with 'memory violation' > https://bugzilla.redhat.com/1698566 / selfheal: shd crashed while > executing ./tests/bugs/core/bug-1432542-mpx-restart-crash.t in CI > https://bugzilla.redhat.com/1699309 / snapshot: Gluster snapshot fails > with systemd autmounted bricks > https://bugzilla.redhat.com/1696633 / tests: GlusterFs v4.1.5 Tests from > /tests/bugs/ module failing on Intel > https://bugzilla.redhat.com/1697812 / website: mention a pointer to all > the mailing lists available under glusterfs project( > https://www.gluster.org/community/) > [...truncated 2 lines...]_______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- - Atin (atinm) -------------- next part -------------- An HTML attachment was scrubbed... URL: From dkhandel at redhat.com Mon Apr 29 05:22:26 2019 From: dkhandel at redhat.com (Deepshikha Khandelwal) Date: Mon, 29 Apr 2019 10:52:26 +0530 Subject: [Gluster-infra] [Gluster-devel] Weekly Untriaged Bugs In-Reply-To: References: <307671157.25.1555897502768.JavaMail.jenkins@jenkins-el7.rht.gluster.org> Message-ID: This list also captures the BZs which are in NEW state. The search description goes like this: - *Status:* NEW - *Product:* GlusterFS - *Changed:* (is greater than or equal to) -4w - *Creation date:* (changed after) -4w - *Keywords:* (does not contain the string) Triaged On Sun, Apr 28, 2019 at 6:44 PM Atin Mukherjee wrote: > While I understand this report captured bugs filed since last 1 week and > do not have ?Triaged? keyword, does it make better sense to exclude bugs > which aren?t in NEW state? > > I believe the intention of this report is to check what all bugs haven?t > been looked at by maintainers/developers yet. BZs which are already fixed > or in ASSIGNED/POST state need not to feature in this list is what I > believe as otherwise it gives a false impression that too many bugs are > getting unnoticed which isn?t the reality. Thoughts? > > On Mon, 22 Apr 2019 at 07:15, wrote: > >> [...truncated 6 lines...] 
>> https://bugzilla.redhat.com/1699023 / core: Brick is not able to detach >> successfully in brick_mux environment >> https://bugzilla.redhat.com/1695416 / core: client log flooding with >> intentional socket shutdown message when a brick is down >> https://bugzilla.redhat.com/1695480 / core: Global Thread Pool >> https://bugzilla.redhat.com/1694943 / core: parallel-readdir slows down >> directory listing >> https://bugzilla.redhat.com/1700295 / core: The data couldn't be flushed >> immediately even with O_SYNC in glfs_create or with >> glfs_fsync/glfs_fdatasync after glfs_write. >> https://bugzilla.redhat.com/1698861 / disperse: Renaming a directory >> when 2 bricks of multiple disperse subvols are down leaves both old and new >> dirs on the bricks. >> https://bugzilla.redhat.com/1697293 / distribute: DHT: print hash and >> layout values in hexadecimal format in the logs >> https://bugzilla.redhat.com/1701039 / distribute: gluster replica 3 >> arbiter Unfortunately data not distributed equally >> https://bugzilla.redhat.com/1697971 / fuse: Segfault in FUSE process, >> potential use after free >> https://bugzilla.redhat.com/1694139 / glusterd: Error waiting for job >> 'heketi-storage-copy-job' to complete on one-node k3s deployment. >> https://bugzilla.redhat.com/1695099 / glusterd: The number of glusterfs >> processes keeps increasing, using all available resources >> https://bugzilla.redhat.com/1692349 / project-infrastructure: >> gluster-csi-containers job is failing >> https://bugzilla.redhat.com/1698716 / project-infrastructure: Regression >> job did not vote for https://review.gluster.org/#/c/glusterfs/+/22366/ >> https://bugzilla.redhat.com/1698694 / project-infrastructure: regression >> job isn't voting back to gerrit >> https://bugzilla.redhat.com/1699712 / project-infrastructure: regression >> job is voting Success even in case of failure >> https://bugzilla.redhat.com/1693385 / project-infrastructure: request to >> change the version of fedora in fedora-smoke-job >> https://bugzilla.redhat.com/1695484 / project-infrastructure: smoke >> fails with "Build root is locked by another process" >> https://bugzilla.redhat.com/1693184 / replicate: A brick >> process(glusterfsd) died with 'memory violation' >> https://bugzilla.redhat.com/1698566 / selfheal: shd crashed while >> executing ./tests/bugs/core/bug-1432542-mpx-restart-crash.t in CI >> https://bugzilla.redhat.com/1699309 / snapshot: Gluster snapshot fails >> with systemd autmounted bricks >> https://bugzilla.redhat.com/1696633 / tests: GlusterFs v4.1.5 Tests from >> /tests/bugs/ module failing on Intel >> https://bugzilla.redhat.com/1697812 / website: mention a pointer to all >> the mailing lists available under glusterfs project( >> https://www.gluster.org/community/) >> [...truncated 2 lines...]_______________________________________________ >> Gluster-devel mailing list >> Gluster-devel at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-devel > > -- > - Atin (atinm) > _______________________________________________ > Gluster-infra mailing list > Gluster-infra at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-infra -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bugzilla at redhat.com Mon Apr 29 05:26:38 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 29 Apr 2019 05:26:38 +0000 Subject: [Gluster-infra] [Bug 1703435] gluster-block: Upstream Jenkins job which get triggered at PR level In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703435 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com Assignee|bugs at gluster.org |dkhandel at redhat.com -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 29 05:27:39 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 29 Apr 2019 05:27:39 +0000 Subject: [Gluster-infra] [Bug 1698694] regression job isn't voting back to gerrit In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1698694 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CLOSED Resolution|--- |CURRENTRELEASE Last Closed| |2019-04-29 05:27:39 --- Comment #5 from Deepshikha khandelwal --- It is fixed now. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 29 05:28:59 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 29 Apr 2019 05:28:59 +0000 Subject: [Gluster-infra] [Bug 1703433] gluster-block: setup GCOV & LCOV job In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703433 Deepshikha khandelwal changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dkhandel at redhat.com Assignee|bugs at gluster.org |dkhandel at redhat.com -- You are receiving this mail because: You are on the CC list for the bug. From mscherer at redhat.com Mon Apr 29 08:17:37 2019 From: mscherer at redhat.com (Michael Scherer) Date: Mon, 29 Apr 2019 10:17:37 +0200 Subject: [Gluster-infra] Do we have a monitoring system on our builders? In-Reply-To: References: Message-ID: <35e33533bab777af9820703f783cb7a95cf49604.camel@redhat.com> Le samedi 27 avril 2019 ? 22:18 +0300, Yaniv Kaul a ?crit : > I'd like to see what is our status. > Just had CI failures[1] because builder26.int.rht.gluster.org is not > available, apparently. We have nagios too. Web interface is password protected so I can't give it (need to do a guest account, and so far, no one has expressed interest into that). This failure is weird, cause the builder is pretty much up and running, but it seems the jenkins agent crashed. This exact process is not monitored by nagios, as I was under the impression that jenkins was smart enough to start it on demand (seems I was wrong), and/or see it crashed and put the server out of rotation (seems I was wrong on that one too) I suspect this was related to the openjdk upgrade on the 20 on builder26. Since jenkins do not support that on the main server, I guess it also may be unstable on the agent side :/ I disconnected/reconnected the builder, this should fix for this one, but we definitely need to dig a bit more to see what happened and how to prevent that. Adding supervision of the agent should be quick (*cough* famous last words *cough*), so let's do that as a first step. -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From mscherer at redhat.com Mon Apr 29 09:30:09 2019 From: mscherer at redhat.com (Michael Scherer) Date: Mon, 29 Apr 2019 11:30:09 +0200 Subject: [Gluster-infra] Do we have a monitoring system on our builders? In-Reply-To: <35e33533bab777af9820703f783cb7a95cf49604.camel@redhat.com> References: <35e33533bab777af9820703f783cb7a95cf49604.camel@redhat.com> Message-ID: Le lundi 29 avril 2019 ? 10:17 +0200, Michael Scherer a ?crit : > Le samedi 27 avril 2019 ? 22:18 +0300, Yaniv Kaul a ?crit : > > I'd like to see what is our status. > > Just had CI failures[1] because builder26.int.rht.gluster.org is > > not > > available, apparently. > > We have nagios too. Web interface is password protected so I can't > give > it (need to do a guest account, and so far, no one has expressed > interest into that). So, if folks want to see nagios, that's on nagios.gluster.org, login is "guest", password is "gluster" (cf https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/nagios/tasks/httpd.yml#L24 ). That's a guest account, so readonly (or so it should be). > > This failure is weird, cause the builder is pretty much up and > running, > but it seems the jenkins agent crashed. This exact process is not > monitored by nagios, as I was under the impression that jenkins was > smart enough to start it on demand (seems I was wrong), and/or see it > crashed and put the server out of rotation (seems I was wrong on that > one too) > > > I suspect this was related to the openjdk upgrade on the 20 on > builder26. Since jenkins do not support that on the main server, I > guess it also may be unstable on the agent side :/ > > I disconnected/reconnected the builder, this should fix for this one, > but we definitely need to dig a bit more to see what happened and how > to prevent that. > > Adding supervision of the agent should be quick (*cough* famous last > words *cough*), so let's do that as a first step. Ok so after discussing with Deepshika who did fix a few servers already, it seems the issue is not something that can be seen externally, but only from within jenkins. The agent is running, but failling, which make it tricky. We have a few options, and I think that is one that could work (until we move to on demand for all builders). 1) do not automatically upgrade the jdk (easy, drop a file to skip that file) 2) every week, run a jenkins job that - do the upgrade of openjdk (permit to make sure we do not reboot a builder building something at random) - reboot the builders The trick part is that I do not know how to write that in a way that is run: - on all builders - serially -- Michael Scherer Sysadmin, Community Infrastructure -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: This is a digitally signed message part URL: From bugzilla at redhat.com Mon Apr 29 09:34:18 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 29 Apr 2019 09:34:18 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #5 from Ashish Pandey --- Hi, Please find the details as follows - 1 - Ashish Pandey github user name - aspandey github email id - ashishpandey.cdac at gmail.com 2 - Amar Tumballi github user name - amarts email - amarts at gmail.com 3 - Vijay Bellur github user name - vbellur email - vbellur at redhat.com I am extremely sorry for the inconvenience. I just did not think about the github account details and sent the redhat email. --- Ashish -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Mon Apr 29 10:14:25 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Mon, 29 Apr 2019 10:14:25 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #6 from M. Scherer --- No need to be sorry, if we do not say what we need, people can't know. We should have a form or some way to tell people what we need, you shouldn't have to guess that :/ I have added folks to the repo, tell me if anything is missing, otherwise, i will close the bug later. -- You are receiving this mail because: You are on the CC list for the bug. From bugzilla at redhat.com Tue Apr 30 11:57:19 2019 From: bugzilla at redhat.com (bugzilla at redhat.com) Date: Tue, 30 Apr 2019 11:57:19 +0000 Subject: [Gluster-infra] [Bug 1703329] [gluster-infra]: Please create repo for plus one scale work In-Reply-To: References: Message-ID: https://bugzilla.redhat.com/show_bug.cgi?id=1703329 --- Comment #7 from Ashish Pandey --- Hi, I think you may close the bug as repo has been created and can be used. Thanks!! --- Ashish -- You are receiving this mail because: You are on the CC list for the bug.