[automated-testing] reboot_nodes_and_wait_to_come_online bug

Jonathan Holloway jhollowa at redhat.com
Thu Jun 14 03:56:41 UTC 2018


I ran a variety of tests today. What is happening is that the RPyC server is
taken down by the reboot, and the connection cannot be resumed after the
restart.
Glusto maintains a cache of SSH connections, RPyC DeployedServer objects, and
RPyC connections.
SSH handles the reboot gracefully, but the DeployedServer (the piece that
manages the remote RPyC server) is lost along with the RPyC connection that
sits on top of it.

As the test scripts are written, it is possible to add a
g.rpyc_close_deployed_server(host) call to the library function, either
before the reboot (assuming nothing else has to use RPyC before the system
comes down) or immediately following the reboot and before the next RPyC
call is made. The next RPyC connection request will then build a new
connection, and things will work as expected.
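
A minimal sketch of the "close before reboot" variant described above. The
helper name reboot_and_reset_rpyc and the reboot_cmd parameter are
hypothetical, not part of Glusto or the test libraries. The Glusto object is
passed in as g so the snippet can be read and exercised without a cluster;
it relies only on g.rpyc_close_deployed_server(host) and g.run(host, command).

```python
def reboot_and_reset_rpyc(g, host, reboot_cmd="reboot"):
    """Drop Glusto's cached RPyC state for `host`, then reboot it.

    `g` is expected to be the Glusto class. Hypothetical helper: the name
    and the `reboot_cmd` parameter are illustrative only.
    """
    # Close the DeployedServer first, so that the next
    # g.rpyc_get_connection(host) builds a fresh server and connection
    # instead of reusing the cached ones that the reboot invalidates.
    g.rpyc_close_deployed_server(host)

    # g.run returns a (return code, stdout, stderr) tuple. The SSH session
    # may drop while the host goes down, so callers should not assume a
    # zero return code here.
    return g.run(host, reboot_cmd)


# Typical use inside a test library (requires Glusto to be installed):
#   from glusto.core import Glusto as g
#   reboot_and_reset_rpyc(g, "server1.example.com")
```

This only works if nothing else needs RPyC on that host between the close
and the reboot; otherwise the close has to move to just after the reboot.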

Because the RPyC connection object is typically used directly once Glusto
has set it up and cached it, there is no straightforward place to inject a
connection check into direct calls made the way we usually make them:

    conn = g.rpyc_get_connection(...)
    conn.modules.sys.platform   <-- this call goes directly to the RPyC
                                    connection object.
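
For illustration only, a check can be placed at the point where the
connection is fetched rather than on each direct call. The sketch below is
not the Glusto change mentioned in this mail; get_live_rpyc_connection is a
hypothetical helper. It assumes the object returned by
g.rpyc_get_connection(host) is an RPyC connection, which provides ping(),
and that closing the DeployedServer clears the cached connection as
described above.

```python
def get_live_rpyc_connection(g, host):
    """Return an RPyC connection to `host`, rebuilding it if it is stale.

    Hypothetical helper: `g` is expected to be the Glusto class.
    """
    conn = g.rpyc_get_connection(host)
    if conn is not None:
        try:
            # ping() sends a round trip to the remote RPyC server and
            # raises if the server is gone or does not answer.
            conn.ping()
            return conn
        except Exception:
            # Broad on purpose for a sketch: any failure here is treated
            # as "the cached connection is dead".
            pass

    # Discard the cached DeployedServer so the next request builds a
    # new server and a new connection on top of it.
    g.rpyc_close_deployed_server(host)
    return g.rpyc_get_connection(host)


# Typical use (requires Glusto to be installed):
#   from glusto.core import Glusto as g
#   conn = get_live_rpyc_connection(g, "server1.example.com")
#   conn.modules.sys.platform
```

A connection that was valid when fetched can still go stale if the host is
rebooted afterwards, so this does not cover direct calls on a connection
object a test is already holding.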

I am testing addition(s) to Glusto that will help manage the connection in
these cases, but closing the DeployedServer, as described above, is the
immediate solution.

Cheers,
Jonathan


On Wed, Jun 13, 2018 at 12:07 PM, Jonathan Holloway <jhollowa at redhat.com>
wrote:

> Hey Nigel,
>
> The RPyC server does not automatically restart nor does it reconnect after
> the server process is restarted.
> There are a couple of ways to handle it.
> I'll send steps that can be used without a change to Glusto, but right now
> I'm testing something that can be quickly injected into Glusto to make this
> seamless.
>
> Cheers,
> Jonathan
>
>
> On Wed, Jun 13, 2018 at 5:15 AM, Nigel Babu <nigelb at redhat.com> wrote:
>
>> Jonathan,
>>
>> After a machine reboot, will rpyc reconnect automatically? Or are the
>> communication issues a symptom of a larger problem that you can't restart a
>> client and expect the connection to exist when it comes back online?
>>
>> On Mon, Jun 11, 2018 at 7:36 PM, Jonathan Holloway <jhollowa at redhat.com>
>> wrote:
>>
>>> Hey Vijay,
>>>
>>> In the AFR test run I started on Saturday, it looks like the 039 system
>>> had that communication issue we were tracking down on Friday, and it had
>>> just been rebooted as part of the test.
>>> Definitely worth re-running AFR after the fix.
>>>
>>> Cheers,
>>> Jonathan
>>>
>>> On Mon, Jun 11, 2018 at 6:49 AM, Vijay Bhaskar Reddy Avuthu <
>>> vavuthu at redhat.com> wrote:
>>>
>>>> I will take a look.
>>>>
>>>> Regards,
>>>> Vijay A
>>>>
>>>> On Mon, Jun 11, 2018 at 5:05 PM, Nigel Babu <nigelb at redhat.com> wrote:
>>>>
>>>>> Oh dear. That's a problem. Vijay, I think you wrote the original code?
>>>>> Can you take a look?
>>>>>
>>>>> On Mon, Jun 11, 2018 at 1:58 PM, Vitalii Koriakov <vkoriako at redhat.com
>>>>> > wrote:
>>>>>
>>>>>> Hello all
>>>>>> I noticed the following behavior:
>>>>>>
>>>>>> Reboot nodes with the method reboot_nodes_and_wait_to_come_online.
>>>>>> When the nodes are still not online after the timeout, it reports that
>>>>>> all nodes are online.
>>>>>> The logs are:
>>>>>>
>>>>>>
>>>>>> 2018-06-08 18:28:00,210 INFO (are_nodes_online) 172.19.2.122 is offline
>>>>>> 2018-06-08 18:28:00,211 INFO (reboot_nodes_and_wait_to_come_online) Nodes are offline, Retry after 5 seconds .....
>>>>>> 2018-06-08 18:28:05,216 INFO (reboot_nodes_and_wait_to_come_online) All nodes ['172.19.2.86', '172.19.2.126', '172.19.3.113', '172.19.2.122'] are up and running
>>>>>>
>>>>>> So it doesn't check whether the nodes are online after 5 seconds and
>>>>>> just returns that all nodes are online.
>>>>>>
>>>>>> Regards,
>>>>>> Vitalii
>>>>>> _______________________________________________
>>>>>> automated-testing mailing list
>>>>>> automated-testing at gluster.org
>>>>>> http://lists.gluster.org/mailman/listinfo/automated-testing
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> nigelb
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> nigelb
>>
>
>