[Gluster-devel] [ovirt-users] Hosted-Engine HA problem

Jiri Moskovcak jmoskovc at redhat.com
Tue Nov 11 07:53:26 UTC 2014


On 11/11/2014 05:56 AM, Jaicel wrote:
> Hi Jirka,
>
> the patch works. it stabilized the status of my two hosts. the engine
> migration during failover also works fine. thanks guys!

Hi Jaicel,
I'm glad it works for you! Enjoy the hosted engine ;)

--Jirka

>
> Jaicel
>
> ------------------------------------------------------------------------
> *From: *"Jiri Moskovcak" <jmoskovc at redhat.com>
> *To: *"Jaicel" <jaicel at asti.dost.gov.ph>
> *Cc: *"Niels de Vos" <ndevos at redhat.com>, "Vijay Bellur"
> <vbellur at redhat.com>, users at ovirt.org, "Gluster Devel"
> <gluster-devel at gluster.org>
> *Sent: *Monday, November 3, 2014 3:33:16 PM
> *Subject: *Re: [ovirt-users] Hosted-Engine HA problem
>
> On 11/01/2014 07:43 AM, Jaicel wrote:
>  > Hi,
>  >
>  > my engine runs on Host1. current status and agent logs below.
>  >
>  > Host 1
>
> Hi,
> it seems like you ran into [1]. You can either zero out the metadata
> file or apply the patch from [1] manually.
>
> --Jirka
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1158925
>
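
The Host 1 traceback quoted below ends in json.loads(status).iteritems()
raising AttributeError on NoneType, which suggests that the engine-status
field read from the shared metadata decoded to JSON null. Here is a minimal
sketch of a parser that tolerates that case, assuming this reading of the
traceback is right. It is not the patch from [1]; parse_engine_status and
UNKNOWN_STATUS are hypothetical names, and the code targets current Python
rather than the 2.6 shown in the logs.

```python
import json

# Hypothetical fallback; the real agent may report something different.
UNKNOWN_STATUS = {"vm": "unknown", "health": "unknown", "detail": "unknown"}


def parse_engine_status(raw):
    """Decode an engine-status JSON string into a dict of strings.

    Returns a copy of UNKNOWN_STATUS when the input is missing, is not
    valid JSON, or decodes to something other than a JSON object
    (for example null, which json.loads turns into None).
    """
    try:
        decoded = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw was not a string; ValueError: malformed JSON.
        return dict(UNKNOWN_STATUS)
    if not isinstance(decoded, dict):
        return dict(UNKNOWN_STATUS)
    return dict((str(key), str(value)) for key, value in decoded.items())
```

With this, a stale or half-written metadata block degrades to an "unknown"
status instead of stopping the agent; whether that is the right policy for
the HA state machine is a separate question.
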
>  >
>  > MainThread::INFO::2014-10-31
> 16:55:39,918::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
> ovirt-hosted-engi
>  > ne-ha agent 1.1.6 started
>  > MainThread::INFO::2014-10-31
> 16:55:39,985::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_get_hostname) Found certificate common name: 192.168.12.11
>  > MainThread::INFO::2014-10-31
> 16:55:40,228::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_broker) Initializing ha-broker connection
>  > MainThread::INFO::2014-10-31
> 16:55:40,228::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor ping, options {'addr': '192.168.12.254'}
>  > MainThread::INFO::2014-10-31
> 16:55:40,231::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 140634215107920
>  > MainThread::INFO::2014-10-31
> 16:55:40,231::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true',
> 'bridge_name': 'ovirtmgmt', 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:40,237::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 140634215108432
>  > MainThread::INFO::2014-10-31
> 16:55:40,237::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor mem-free, options {'use_ssl': 'true',
> 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:40,240::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 39956688
>  > MainThread::INFO::2014-10-31
> 16:55:40,240::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor cpu-load-no-engine, options {'use_ssl':
> 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f
>  > 9', 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:40,243::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 140634215107664
>  > MainThread::INFO::2014-10-31
> 16:55:40,244::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor engine-health, options {'use_ssl': 'true',
> 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', '
>  > address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:40,249::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 140634006879632
>  > MainThread::INFO::2014-10-31
> 16:55:40,249::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_broker) Broker initialized, all submonitors started
>  > MainThread::INFO::2014-10-31
> 16:55:40,298::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_sanlock) Ensuring lease for lockspace hosted-engine,
> host id 1 is acquired (file: /rhev/data-center/mnt/g
>  >
> luster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
>  > MainThread::INFO::2014-10-31
> 16:55:40,322::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(refresh) Global metadata: {'maintenance': False}
>  > MainThread::INFO::2014-10-31
> 16:55:40,322::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(refresh) Host 192.168.12.12 (id 2): {'live-data': False, 'extra':
> 'metadata_parse_version=1\nmetadata_feature_version
>  > =1\ntimestamp=1413882675 (Tue Oct 21 17:11:15
> 2014)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n',
> 'hostname': '192.168.12.12', 'host-id': 2, 'engine-status': {'reason':
> 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail':
> 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 1413882675}
>  > MainThread::INFO::2014-10-31
> 16:55:40,322::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh)
> Local (id 1): {'engine-health': None, 'bridge': True, 'mem-free': None,
> 'maintenance': False, 'cpu-load': None, 'gateway': True}
>  > MainThread::INFO::2014-10-31
> 16:55:40,323::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745740.32 type=state_transition
> detail=StartState-ReinitializeFSM hostname='ovirt1'
>  > MainThread::INFO::2014-10-31
> 16:55:40,392::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition
> (StartState-ReinitializeFSM) sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:55:40,675::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state ReinitializeFSM (score: 0)
>  > MainThread::INFO::2014-10-31
> 16:55:50,710::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>  > Trying: notify time=1414745750.71 type=state_transition
> detail=ReinitializeFSM-EngineUp hostname='ovirt1'
>  > MainThread::INFO::2014-10-31
> 16:55:50,710::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (ReinitializeFSM-EngineUp)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:55:51,001::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineUp (score: 2400)
>  > MainThread::CRITICAL::2014-10-31
> 16:56:01,033::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could
> not start ha-agent
>  > Traceback (most recent call last):
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
> 97, in run
>  >      self._run_agent()
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
> 154, in _run_agent
>  >
>   hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 307, in start_monitoring
>  >      for old_state, state, delay in self.fsm:
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py",
> line 125, in next
>  >      new_data = self.refresh(self._state.data)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py",
> line 77, in refresh
>  >      stats.update(self.hosted_engine.collect_stats())
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 700, in collect_stats
>  >      stats = self.process_remote_metadata(host_id, remote_data)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 747, in process_remote_metadata
>  >      md['engine-status'] = engine_status(md["engine-status"])
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 79, in engine_status
>  >      in json.loads(status).iteritems()])
>  > AttributeError: 'NoneType' object has no attribute 'iteritems'
>  > [root at ovirt1 ~]# hosted-engine --vm-status
>  >
>  >
>  > --== Host 1 status ==--
>  >
>  > Status up-to-date                  : False
>  > Hostname                           : 192.168.12.11
>  > Host ID                            : 1
>  > Engine status                      : unknown stale-data
>  > Score                              : 2400
>  > Local maintenance                  : False
>  > Host timestamp                     : 1414745750
>  > Extra metadata (valid at timestamp):
>  >          metadata_parse_version=1
>  >          metadata_feature_version=1
>  >          timestamp=1414745750 (Fri Oct 31 16:55:50 2014)
>  >          host-id=1
>  >          score=2400
>  >          maintenance=False
>  >          state=EngineUp
>  >
>  >
>  > --== Host 2 status ==--
>  >
>  > Status up-to-date                  : False
>  > Hostname                           : 192.168.12.12
>  > Host ID                            : 2
>  > Engine status                      : unknown stale-data
>  > Score                              : 2400
>  > Local maintenance                  : False
>  > Host timestamp                     : 1414745821
>  > Extra metadata (valid at timestamp):
>  >          metadata_parse_version=1
>  >          metadata_feature_version=1
>  >          timestamp=1414745821 (Fri Oct 31 16:57:01 2014)
>  >          host-id=2
>  >          score=2400
>  >          maintenance=False
>  >          state=EngineStart
>  > [root at ovirt1 ~]# service ovirt-ha-agent status
>  > ovirt-ha-agent dead but subsys locked
>  >
>  > Host2
>  >
>  > MainThread::INFO::2014-10-31
> 16:55:59,642::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
> ovirt-hosted-engi
>  > ne-ha agent 1.1.6 started
>  > MainThread::INFO::2014-10-31
> 16:55:59,678::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_get_hostname) Found certificate common name: 192.168.12.12
>  > MainThread::INFO::2014-10-31
> 16:55:59,918::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_broker) Initializing ha-broker connection
>  > MainThread::INFO::2014-10-31
> 16:55:59,919::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor ping, options {'addr': '192.168.12.254'}
>  > MainThread::INFO::2014-10-31
> 16:55:59,922::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 25353488
>  > MainThread::INFO::2014-10-31
> 16:55:59,922::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true',
> 'bridge_name': 'ovirtmgmt', 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:59,928::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 25354128
>  > MainThread::INFO::2014-10-31
> 16:55:59,928::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor mem-free, options {'use_ssl': 'true',
> 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:59,931::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 25353552
>  > MainThread::INFO::2014-10-31
> 16:55:59,931::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor cpu-load-no-engine, options {'use_ssl':
> 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f
>  > 9', 'address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:59,934::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 139976608389584
>  > MainThread::INFO::2014-10-31
> 16:55:59,934::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Starting monitor engine-health, options {'use_ssl': 'true',
> 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', '
>  > address': '0'}
>  > MainThread::INFO::2014-10-31
> 16:55:59,939::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_mo
>  > nitor) Success, id 139976608447760
>  > MainThread::INFO::2014-10-31
> 16:55:59,939::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_broker) Broker initialized, all submonitors started
>  > MainThread::INFO::2014-10-31
> 16:55:59,983::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(_initialize_sanlock) Ensuring lease for lockspace hosted-engine,
> host id 2 is acquired (file: /rhev/data-center/mnt/g
>  >
> luster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
>  > MainThread::INFO::2014-10-31
> 16:56:00,001::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(refresh) Global metadata: {'maintenance': False}
>  > MainThread::INFO::2014-10-31
> 16:56:00,001::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(refresh) Host 192.168.12.11 (id 1): {'live-data': True, 'extra':
> 'metadata_parse_version=1\nmetadata_feature_version=
>  > 1\ntimestamp=1414745750 (Fri Oct 31 16:55:50
> 2014)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostn
>  > ame': '192.168.12.11', 'host-id': 1, 'engine-status': {'health':
> 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'm
>  > aintenance': False, 'host-ts': 1414745750}
>  > MainThread::INFO::2014-10-31
> 16:56:00,001::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(refresh) Local (id 2): {'engine-health': None, 'bridge': True,
> 'mem-free': None, 'maintenance': False, 'cpu-load': No
>  > ne, 'gateway': True}
>  > MainThread::INFO::2014-10-31
> 16:56:00,002::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>  > Trying: notify time=1414745760.0 type=state_transition
> detail=StartState-ReinitializeFSM hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:00,045::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>  > Success, was notification of state_transition
> (StartState-ReinitializeFSM) sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:00,325::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:
>  > :(start_monitoring) Current state ReinitializeFSM (score: 0)
>  > MainThread::INFO::2014-10-31
> 16:56:10,352::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745770.35 type=state_transition
> detail=ReinitializeFSM-EngineDown hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:10,353::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition
> (ReinitializeFSM-EngineDown) sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:10,638::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineDown (score: 2400)
>  > MainThread::INFO::2014-10-31
> 16:56:20,663::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> The engine is not running, but we do not have enough data to decide
> which hosts are alive
>  > MainThread::INFO::2014-10-31
> 16:56:20,663::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745780.66 type=state_transition
> detail=EngineDown-EngineDown hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:20,664::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (EngineDown-EngineDown)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:20,943::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineDown (score: 2400)
>  > MainThread::INFO::2014-10-31
> 16:56:30,968::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> The engine is not running, but we do not have enough data to decide
> which hosts are alive
>  > MainThread::INFO::2014-10-31
> 16:56:30,969::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745790.97 type=state_transition
> detail=EngineDown-EngineDown hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:30,969::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (EngineDown-EngineDown)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:31,248::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineDown (score: 2400)
>  > MainThread::INFO::2014-10-31
> 16:56:41,274::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> The engine is not running, but we do not have enough data to decide
> which hosts are alive
>  > MainThread::INFO::2014-10-31
> 16:56:41,275::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745801.28 type=state_transition
> detail=EngineDown-EngineDown hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:41,276::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (EngineDown-EngineDown)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:41,555::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineDown (score: 2400)
>  > MainThread::INFO::2014-10-31
> 16:56:51,583::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> The engine is not running, but we do not have enough data to decide
> which hosts are alive
>  > MainThread::INFO::2014-10-31
> 16:56:51,584::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745811.58 type=state_transition
> detail=EngineDown-EngineDown hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:56:51,584::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (EngineDown-EngineDown)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:56:51,864::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineDown (score: 2400)
>  > MainThread::INFO::2014-10-31
> 16:57:01,897::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> Engine down and local host has best score (2400), attempting to start
> engine VM
>  > MainThread::INFO::2014-10-31
> 16:57:01,898::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Trying: notify time=1414745821.9 type=state_transition
> detail=EngineDown-EngineStart hostname='ovirt2'
>  > MainThread::INFO::2014-10-31
> 16:57:01,906::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
> Success, was notification of state_transition (EngineDown-EngineStart)
> sent? ignored
>  > MainThread::INFO::2014-10-31
> 16:57:02,189::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
> Current state EngineStart (score: 2400)
>  > MainThread::CRITICAL::2014-10-31
> 16:57:02,207::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could
> not start ha-agent
>  > Traceback (most recent call last):
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
> 97, in run
>  >      self._run_agent()
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
> 154, in _run_agent
>  >
>   hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 307, in start_monitoring
>  >      for old_state, state, delay in self.fsm:
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py",
> line 125, in next
>  >      new_data = self.refresh(self._state.data)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py",
> line 77, in refresh
>  >      stats.update(self.hosted_engine.collect_stats())
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 662, in collect_stats
>  >      constants.SERVICE_TYPE)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 171, in get_stats_from_storage
>  >      result = self._checked_communicate(request)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 199, in _checked_communicate
>  >      .format(message or response))
>  > RequestError: Request failed: <type 'exceptions.OSError'>
>  >
>  > [root at ovirt2 ~]# hosted-engine --vm-status
>  > Traceback (most recent call last):
>  >    File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
>  >      "__main__", fname, loader, pkg_name)
>  >    File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>  >      exec code in run_globals
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py",
> line 111, in <module>
>  >      if not status_checker.print_status():
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py",
> line 58, in print_status
>  >      all_host_stats = ha_cli.get_all_host_stats()
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py",
> line 137, in get_all_host_stats
>  >      return self.get_all_stats(self.StatModes.HOST)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py",
> line 86, in get_all_stats
>  >      constants.SERVICE_TYPE)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 171, in get_stats_from_storage
>  >      result = self._checked_communicate(request)
>  >    File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 199, in _checked_communicate
>  >      .format(message or response))
>  > ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed:
> <type 'exceptions.OSError'>
>  > [root at ovirt2 ~]# service ovirt-ha-agent status
>  > ovirt-ha-agent dead but subsys locked
>  >
>  >
>  > Thanks,
>  > Jaicel
>  >
>  > ----- Original Message -----
>  > From: "Jiri Moskovcak" <jmoskovc at redhat.com>
>  > To: "Jaicel" <jaicel at asti.dost.gov.ph>
>  > Cc: "Niels de Vos" <ndevos at redhat.com>, "Vijay Bellur"
> <vbellur at redhat.com>, users at ovirt.org, "Gluster Devel"
> <gluster-devel at gluster.org>
>  > Sent: Friday, October 31, 2014 11:05:32 PM
>  > Subject: Re: [ovirt-users] Hosted-Engine HA problem
>  >
>  > On 10/31/2014 10:26 AM, Jaicel wrote:
>  >> i've increased the limit and then restarted agent and broker. status
>  >> normalized, but right now it went to "False" state again, with both
>  >> hosts still having a 2400 score. agent logs remain the same, with
>  >> "ovirt-ha-agent dead but subsys locked" status. ha-broker logs below
>  >>
>  >> Thread-138::INFO::2014-10-31
> 17:24:22,981::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >> Thread-138::INFO::2014-10-31
> 17:24:22,991::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >> Thread-139::INFO::2014-10-31
> 17:24:38,385::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >> Thread-139::INFO::2014-10-31
> 17:24:38,395::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >> Thread-140::INFO::2014-10-31
> 17:24:53,816::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >> Thread-140::INFO::2014-10-31
> 17:24:53,827::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >> Thread-141::INFO::2014-10-31
> 17:25:09,172::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >> Thread-141::INFO::2014-10-31
> 17:25:09,182::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >> Thread-142::INFO::2014-10-31
> 17:25:24,551::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >> Thread-142::INFO::2014-10-31
> 17:25:24,562::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >>
>  >> Thanks,
>  >> Jaicel
>  >
>  > ok, now it seems that broker runs fine, so I need the recent agent.log
>  > to debug it more.
>  >
>  > --Jirka
>  >
>  >>
>  >> ----- Original Message -----
>  >> From: "Jiri Moskovcak" <jmoskovc at redhat.com>
>  >> To: "Jaicel R. Sabonsolin" <jaicel at asti.dost.gov.ph>, "Niels de Vos"
> <ndevos at redhat.com>
>  >> Cc: "Vijay Bellur" <vbellur at redhat.com>, users at ovirt.org, "Gluster
> Devel" <gluster-devel at gluster.org>
>  >> Sent: Friday, October 31, 2014 4:32:02 PM
>  >> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>  >>
>  >> On 10/31/2014 03:53 AM, Jaicel R. Sabonsolin wrote:
>  >>> Hi guys,
>  >>>
>  >>> these logs appear on both hosts just like the result of --vm-status.
>  >>> tried to tcpdump on ovirt hosts and gluster nodes but only packet
>  >>> exchanges with my monitoring VM (zabbix) appeared.
>  >>>
>  >>> agent.log
>  >>>        new_data = self.refresh(self._state.data)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py",
> line 77, in refresh
>  >>>        stats.update(self.hosted_engine.collect_stats())
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 662, in collect_stats
>  >>>        constants.SERVICE_TYPE)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 171, in get_stats_from_storage
>  >>>        result = self._checked_communicate(request)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 199, in _checked_communicate
>  >>>        .format(message or response))
>  >>> RequestError: Request failed: <type 'exceptions.OSError'>
>  >>>
>  >>> broker.log
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
> line 165, in handle
>  >>>        response = "success " + self._dispatch(data)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
> line 261, in _dispatch
>  >>>        .get_all_stats_for_service_type(**options)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
> line 41, in get_all_stats_for_service_type
>  >>>        d = self.get_raw_stats_for_service_type(storage_dir,
> service_type)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
> line 74, in get_raw_stats_for_service_type
>  >>>        f = os.open(path, direct_flag | os.O_RDONLY)
>  >>> OSError: [Errno 24] Too many open files:
> '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>  >>
>  >> - ah, there we go ^^^^^^ you might need to tweak the limit of allowed
>  >> open files as described here [1], or find the app that keeps so many
>  >> files open
>  >>
>  >>
>  >> --Jirka
>  >>
>  >> [1]
>  >>
> http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
>  >>
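
Before raising the limit it can help to see how close a process actually
is to it. Here is a minimal sketch, assuming a Linux host with /proc
mounted; open_fd_count and nofile_limits are hypothetical helper names, and
reading another process's /proc entries usually requires root.

```python
import os


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid`."""
    # Each entry in /proc/<pid>/fd is one open descriptor. When inspecting
    # the current process, the count includes the descriptor that
    # os.listdir itself holds while reading the directory.
    return len(os.listdir("/proc/%d/fd" % pid))


def nofile_limits(pid):
    """Return the (soft, hard) 'Max open files' limits of `pid` as strings.

    The values come straight from /proc/<pid>/limits, so each one is
    either a decimal number or the word 'unlimited'.
    """
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft, hard = line.split()[3:5]
                return soft, hard
    raise RuntimeError("no 'Max open files' entry for pid %d" % pid)


if __name__ == "__main__":
    pid = os.getpid()  # substitute the PID of the process under suspicion
    soft, hard = nofile_limits(pid)
    print("pid %d: %d descriptors open, soft limit %s, hard limit %s"
          % (pid, open_fd_count(pid), soft, hard))
```

Sampling the count a few times while the process runs shows whether it
keeps climbing. A steady climb points to a descriptor leak, which a higher
limit would only postpone.
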
>  >>> Thread-38160::INFO::2014-10-31
> 10:28:37,989::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >>> Thread-38161::INFO::2014-10-31
> 10:28:53,656::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
> Connection established
>  >>> Thread-38161::ERROR::2014-10-31
> 10:28:53,657::listener::190::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Error handling request, data: 'get-stats
> storage_dir=/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent
> service_type=hosted-engine'
>  >>> Traceback (most recent call last):
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
> line 165, in handle
>  >>>        response = "success " + self._dispatch(data)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py",
> line 261, in _dispatch
>  >>>        .get_all_stats_for_service_type(**options)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
> line 41, in get_all_stats_for_service_type
>  >>>        d = self.get_raw_stats_for_service_type(storage_dir,
> service_type)
>  >>>      File
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py",
> line 74, in get_raw_stats_for_service_type
>  >>>        f = os.open(path, direct_flag | os.O_RDONLY)
>  >>> OSError: [Errno 24] Too many open files:
> '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>  >>> Thread-38161::INFO::2014-10-31
> 10:28:53,658::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
> Connection closed
>  >>>
>  >>> Thanks,
>  >>> Jaicel
>  >>>
>  >>> ----- Original Message -----
>  >>> From: "Niels de Vos" <ndevos at redhat.com>
>  >>> To: "Vijay Bellur" <vbellur at redhat.com>
>  >>> Cc: "Jiri Moskovcak" <jmoskovc at redhat.com>, "Jaicel R. Sabonsolin"
> <jaicel at asti.dost.gov.ph>, users at ovirt.org, "Gluster Devel"
> <gluster-devel at gluster.org>
>  >>> Sent: Friday, October 31, 2014 4:11:25 AM
>  >>> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>  >>>
>  >>> On Thu, Oct 30, 2014 at 09:07:24PM +0530, Vijay Bellur wrote:
>  >>>> On 10/30/2014 06:45 PM, Jiri Moskovcak wrote:
>  >>>>> On 10/30/2014 09:22 AM, Jaicel R. Sabonsolin wrote:
>  >>>>>> Hi Guys,
>  >>>>>>
>  >>>>>> I need help with my ovirt Hosted-Engine HA setup. I am running on 2
>  >>>>>> ovirt hosts and 2 gluster nodes with replicated volumes. i already
>  >>>>>> have VMs running on my hosts and they can migrate normally once i,
>  >>>>>> for example, power off the host that they are running on. the
>  >>>>>> problem is that the engine can't migrate once i switch off the host
>  >>>>>> that hosts the engine.
>  >>>>>>
>  >>>>>>       oVirt        3.4.3-1.el6
>  >>>>>>       KVM         0.12.1.2 - 2.415.el6_5.10
>  >>>>>>       LIBVIRT   libvirt-0.10.2-29.el6_5.9
>  >>>>>>       VDSM      vdsm-4.14.17-0.el6
>  >>>>>>
>  >>>>>>
>  >>>>>> right now, i have this result from hosted-engine --vm-status.
>  >>>>>>
>  >>>>>>          File "/usr/lib64/python2.6/runpy.py", line 122, in
>  >>>>>>       _run_module_as_main
>  >>>>>>            "__main__", fname, loader, pkg_name)
>  >>>>>>          File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>  >>>>>>            exec code in run_globals
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py",
>  >>>>>>
>  >>>>>>       line 111, in <module>
>  >>>>>>            if not status_checker.print_status():
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py",
>  >>>>>>
>  >>>>>>       line 58, in print_status
>  >>>>>>            all_host_stats = ha_cli.get_all_host_stats()
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py",
>  >>>>>>
>  >>>>>>       line 137, in get_all_host_stats
>  >>>>>>            return self.get_all_stats(self.StatModes.HOST)
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py",
>  >>>>>>
>  >>>>>>       line 86, in get_all_stats
>  >>>>>>            constants.SERVICE_TYPE)
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>  >>>>>>
>  >>>>>>       line 171, in get_stats_from_storage
>  >>>>>>            result = self._checked_communicate(request)
>  >>>>>>          File
>  >>>>>>
>  >>>>>>
> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
>  >>>>>>
>  >>>>>>       line 199, in _checked_communicate
>  >>>>>>            .format(message or response))
>  >>>>>>       ovirt_hosted_engine_ha.lib.exceptions.RequestError:
> Request failed:
>  >>>>>>       <type 'exceptions.OSError'>
>  >>>>>>
>  >>>>>>
>  >>>>>> restarting ha-broker and ha-agent normalizes the status but
>  >>>>>> eventually it would become "false" and then return to the result
>  >>>>>> above. hope you guys could help me with this.
>  >>>>>>
>  >>>>>
>  >>>>> Hi Jaicel,
>  >>>>> please attach agent.log and broker.log from the host where you are
>  >>>>> trying to run hosted-engine --vm-status. I have a feeling that you
>  >>>>> ran into a known problem on gluster: a stalled file descriptor. In
>  >>>>> that case the only known solution at this time is to restart the
>  >>>>> broker & agent, as you have already found out.
>  >>>>>
>  >>>>
>  >>>> Adding Niels and gluster-devel to troubleshoot from Gluster NFS
>  >>>> perspective.
>  >>>
>  >>> I'd welcome any details on this "stalled file descriptor" problem. Is
>  >>> there a bug filed with some details like logs, sysrq-t and maybe even
>  >>> tcpdumps? If there is an easy way to reproduce this behaviour, I can
>  >>> surely look into it and hopefully come up with some advice or a fix.
>  >>>
>  >>> Thanks,
>  >>> Niels
>  >>>
>


