[automated-testing] reboot_nodes_and_wait_to_come_online bug

Jonathan Holloway jhollowa at redhat.com
Fri Jun 15 16:21:51 UTC 2018


Hey Vitalii,

If Vijay hasn't already plugged closing the DeployedServer into the reboot
function, I'll try to do that today and kick one off.
It has worked in my local testing.

Cheers,
Jonathan

On Fri, Jun 15, 2018 at 3:07 AM, Vitalii Koriakov <vkoriako at redhat.com>
wrote:

> Hello, guys.
> How are things going with this issue?
> Is there any hope that it will work?
>
> Regards,
> Vitalii.
>
> ----- Original Message -----
> From: "Jonathan Holloway" <jhollowa at redhat.com>
> To: "Nigel Babu" <nigelb at redhat.com>
> Cc: automated-testing at gluster.org
> Sent: Thursday, June 14, 2018 6:56:41
> Subject: Re: [automated-testing] reboot_nodes_and_wait_to_come_online bug
>
> I ran a variety of tests today, and what is happening is that the RPyC
> server is taken down and cannot resume its connection after the restart.
> Glusto maintains a cache of the SSH connection, the RPyC DeployedServer,
> and the RPyC connection.
> SSH handles it gracefully, but the DeployedServer (the piece that manages
> the remote RPyC server) is lost along with the RPyC connection that sits on
> top of it.
>
> As the test scripts are written, it is possible to add a
> g.rpyc_close_deployed_server(host) in the library function either before
> the reboot (assuming nothing else has to use RPyC before the system comes
> down) or immediately following the reboot and before the next RPyC call is
> made. The next RPyC connection made will essentially build a new connection
> and things will work as expected.
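
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the actual glustolibs code: the function name `reboot_and_reset_rpyc` and the `reboot` callable are hypothetical, and `g` is passed in as a parameter so the sketch does not depend on a Glusto install. The only Glusto call used is `g.rpyc_close_deployed_server(host)`, as named in the message.

```python
def reboot_and_reset_rpyc(g, host, reboot):
    """Reboot `host`, then drop the cached RPyC DeployedServer for it.

    `g` is assumed to be the Glusto object; `reboot` is a hypothetical
    callable that reboots the host and returns once it is back up.
    """
    reboot(host)
    # Discard the stale DeployedServer so that the next RPyC connection
    # request builds a new one instead of reusing the dead cache entry.
    g.rpyc_close_deployed_server(host)
```

Closing after the reboot, rather than before it, leaves RPyC usable right up to the point where the node goes down.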
>
> Because the RPyC connection is typically used directly after Glusto sets
> it up and caches it, it is not straightforward to inject a connection check
> when the connection object is called directly, which is how we typically
> use it:
>     conn = g.rpyc_get_connection(...)
>     conn.modules.sys.platform   <-- this call goes directly to the RPyC
>                                     connection object.
>
> I am testing addition(s) to Glusto that will help manage the connection in
> these cases, but closing the DeployedServer, as described above, is the
> immediate solution.
>
> Cheers,
> Jonathan
>
>
> On Wed, Jun 13, 2018 at 12:07 PM, Jonathan Holloway < jhollowa at redhat.com
> > wrote:
>
>
>
> Hey Nigel,
>
> The RPyC server does not restart automatically, nor does it reconnect
> after the server process is restarted.
> There are a couple of ways to handle it.
> I'll send steps that can be used without a change to Glusto, but right now
> I'm testing something that can be quickly injected into Glusto to make this
> seamless.
>
> Cheers,
> Jonathan
>
>
> On Wed, Jun 13, 2018 at 5:15 AM, Nigel Babu < nigelb at redhat.com > wrote:
>
>
>
> Jonathan,
>
> After a machine reboot, will rpyc reconnect automatically? Or are the
> communication issues a symptom of a larger problem that you can't restart a
> client and expect the connection to exist when it comes back online?
>
> On Mon, Jun 11, 2018 at 7:36 PM, Jonathan Holloway < jhollowa at redhat.com
> > wrote:
>
>
>
> Hey Vijay,
>
> In the AFR test run I started on Saturday, it looks like the 039 system
> had that communication issue we were tracking down on Friday, and it had
> just been rebooted as part of the test.
> Definitely worth re-running AFR after the fix.
>
> Cheers,
> Jonathan
>
> On Mon, Jun 11, 2018 at 6:49 AM, Vijay Bhaskar Reddy Avuthu <
> vavuthu at redhat.com > wrote:
>
>
>
> I will take a look.
>
> Regards,
> Vijay A
>
> On Mon, Jun 11, 2018 at 5:05 PM, Nigel Babu < nigelb at redhat.com > wrote:
>
>
>
> Oh dear. That's a problem. Vijay, I think you wrote the original code? Can
> you take a look?
>
> On Mon, Jun 11, 2018 at 1:58 PM, Vitalii Koriakov < vkoriako at redhat.com >
> wrote:
>
>
> Hello all,
> I noticed the following behavior:
>
> Reboot nodes with the method reboot_nodes_and_wait_to_come_online. When
> the nodes are not online after the timeout, it still reports that all
> nodes are online.
> The logs are:
>
>
> 2018-06-08 18:28:00,210 INFO (are_nodes_online) 172.19.2.122 is offline
> 2018-06-08 18:28:00,211 INFO (reboot_nodes_and_wait_to_come_online) Nodes
> are offline, Retry after 5 seconds .....
> 2018-06-08 18:28:05,216 INFO (reboot_nodes_and_wait_to_come_online) All
> nodes ['172.19.2.86', '172.19.2.126', '172.19.3.113', '172.19.2.122'] are
> up and running
>
> So it doesn't check whether the nodes are online after the 5 seconds and
> just reports that all nodes are online.
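
For comparison, a retry loop that re-checks node status after every wait, and only reports success when the check actually passes, might look like the sketch below. It is not the glustolibs implementation: `wait_for_nodes_online` and the `check_online` callable are hypothetical names, and the timeout and interval defaults are arbitrary assumptions.

```python
import time


def wait_for_nodes_online(nodes, check_online, timeout=300, interval=5,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until every node is online or `timeout` seconds have passed.

    `check_online` is a hypothetical callable taking one node and
    returning True when it is reachable. Returns (all_online, offline).
    """
    deadline = clock() + timeout
    while True:
        # Re-evaluate every node on each pass; never assume success.
        offline = [node for node in nodes if not check_online(node)]
        if not offline:
            return True, []
        if clock() >= deadline:
            # Report exactly which nodes never came back.
            return False, offline
        sleep(interval)
```

Returning the list of offline nodes lets the caller log them instead of printing a blanket "all nodes are up" message.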
>
> Regards,
> Vitalii
> _______________________________________________
> automated-testing mailing list
> automated-testing at gluster.org
> http://lists.gluster.org/mailman/listinfo/automated-testing
>
>
>
> --
> nigelb
> --
> nigelb