Hi. I'm posting to this kind of old thread because we're seeing a problem with SYS_HOST_ALIVE and just wondering if anyone else has seen / experienced it.
I implemented that process to run every 15 minutes to check the status of all the agents in the system. If SYS_HOST_ALIVE returns N, it then submits a job to try and restart the agent using the ucybsmgr program. This works great 99% of the time. Of course, if the service manager is down, it doesn't work, but otherwise, it does.
However, what we've started seeing recently (or maybe it isn't new and has only recently been reported) is that people sometimes get alerts for an agent being down, but when they go to check, the agent is up. So basically, SYS_HOST_ALIVE returned an N and the job to restart the agent failed (because the agent was already running), even though the agent was in fact up.
In an attempt to get around what appears to be an intermittent network issue, I implemented a couple of additional checks / waits in my process. So now I:
- Check the status of an agent using SYS_HOST_ALIVE.
- If it returns an N, it waits 15 seconds.
- Check the status of the agent using SYS_HOST_ALIVE.
- If it still returns an N, it submits the job to restart the agent.
- In its PreProcess tab, the restart job first waits 120 seconds before doing anything.
- Check the status of the agent using SYS_HOST_ALIVE.
- If it still returns an N, the command to restart the agent via ucybsmgr will run.
- Check the status of the agent using SYS_HOST_ALIVE.
- If it still returns an N, it will send an email / open up a ticket.
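The steps above can be sketched roughly as follows. This is only a language-neutral illustration in Python, not Automic script: `is_alive` stands in for SYS_HOST_ALIVE, `restart` for the job that calls ucybsmgr, and `alert` for the email / ticket step. All three names, and the returned status strings, are hypothetical.

```python
import time
from typing import Callable


def check_agent(
    agent: str,
    is_alive: Callable[[str], bool],   # hypothetical stand-in for SYS_HOST_ALIVE
    restart: Callable[[str], None],    # hypothetical stand-in for the restart job
    alert: Callable[[str], None],      # hypothetical stand-in for email / ticket
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Walk through the check / wait / restart / alert sequence for one agent."""
    # First check: nothing to do if the agent reports as up.
    if is_alive(agent):
        return "up"

    # Wait 15 seconds and check again before submitting the restart job.
    sleep(15)
    if is_alive(agent):
        return "up_on_recheck"

    # Inside the restart job: wait 120 seconds, then check once more.
    sleep(120)
    if is_alive(agent):
        return "up_before_restart"

    # Still reported down, so try the restart and check the result.
    restart(agent)
    if is_alive(agent):
        return "restarted"

    # Restart did not help (or the check is still wrong): raise the alert.
    alert(agent)
    return "alerted"
```

Note that every branch relies on the same check, so if that check itself returns a false N four times in a row, the sequence still ends in an alert, which matches what we are seeing.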
Even doing this, we're still experiencing the issue with the agent being up and SYS_HOST_ALIVE returning an N. It's not the end of the world, but it does cause a little extra work for the support team as they have to go close these tickets that are being erroneously opened and do troubleshooting on an agent that is technically fine.
Any thoughts / ideas? I've reached out to the Linux / Networking team to see if there is high CPU and/or anything else going on, but I'm not 100% clear on what exactly SYS_HOST_ALIVE checks or how it works.
Thanks in advance.
------------------------------
Enterprise Scheduling Lead
Takeda
------------------------------
Original Message:
Sent: 06-29-2016 04:11 PM
From: Eric Felker
Subject: Monitoring an agent's status
Just curious - I looked at the Agent Restarter. It seems like that is a manual process. So you run the script, approve which agents to restart, etc. I'm assuming we could modify the process slightly so that it could run every x minutes and just go ahead and start any downed agents.
I wrote something like this that runs periodically (I think mine runs every 10 minutes) and restarts any offline agents that are 'marked' for monitoring in the "VARA.AGENT.MONITOR" variable object. The first column of this VARA holds the exact agent name, and the second column is basically a boolean for whether to monitor that agent or omit it from the process. "Y" means the agent is monitored; anything else makes the process ignore that agent's line in the VARA object.
The trouble I found is that the system is not always able to restart an agent. Depending on the agent's state, I've found they can get hung in such a way that the system's MODIFY_SYSTEM function fails to restart the agent. So it's not entirely reliable.
Here is my script -- not guaranteed to be the best way to do this!
(syntax highlighting is erroneous)
: SET &HND# = PREP_PROCESS_VAR(VARA.AGENT.MONITOR)
: PROCESS &HND#
:   P " "
:   PSET &AGENT_NAME# = GET_PROCESS_LINE(&HND#,1)
:   PSET &MONITOR_FLAG# = GET_PROCESS_LINE(&HND#,2)
! Check if restart flag is set in vara object
:   IF &MONITOR_FLAG# = "Y"
! Check agent's state/health
:     PSET &AGENT_STATE# = SYS_HOST_ALIVE(&AGENT_NAME#)
! Is agent online?
:     IF &AGENT_STATE# = "N"
:       P "Trying to restart &AGENT_NAME#"
! Restart agent
:       PSET &MOD_SYS_RC# = MODIFY_SYSTEM('STARTUP', &AGENT_NAME#)
:       P "Return code for restart was: &MOD_SYS_RC#"
! Send notification of this event
:       PSET &CALL_RC# = ACTIVATE_UC_OBJECT(CALL.MAIL.HTML.AGENT.MONITOR,,,,,PASS_VALUES)
:     ENDIF
:   ENDIF
:   P " "
: ENDPROCESS
: CLOSE_PROCESS &HND#
More recently I've tended to use the service manager dialog via CLI commands to try a graceful shutdown, followed by an abnormal kill if needed, of any agent that is no longer connected from the platform's standpoint, and then restart it using the service manager dialog CLI as well. This seems to be more reliable, but there are a lot of moving parts to get exactly right. So far I've only messed with this for Linux agents.
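The order of operations in that approach looks roughly like this. It is a minimal sketch in Python rather than the real CLI calls: `graceful_stop`, `force_kill`, `start` and `is_running` are hypothetical callables standing in for whatever service manager commands and process checks are actually used, and none of them is a real Automic API.

```python
from typing import Callable, List


def recover_agent(
    agent: str,
    graceful_stop: Callable[[str], None],  # hypothetical: ask the agent to shut down
    force_kill: Callable[[str], None],     # hypothetical: abnormal kill of the process
    start: Callable[[str], None],          # hypothetical: start the agent again
    is_running: Callable[[str], bool],     # hypothetical: is the agent process still there?
) -> List[str]:
    """Stop a disconnected agent (forcefully if needed), then start it again."""
    steps = []

    # Always try the graceful shutdown first.
    graceful_stop(agent)
    steps.append("graceful_stop")

    # Only escalate to a kill if the process survived the graceful stop.
    if is_running(agent):
        force_kill(agent)
        steps.append("force_kill")

    # Start the agent again once the old process has been dealt with.
    start(agent)
    steps.append("start")
    return steps
```

The returned list of steps is only there so the caller can log what was actually done for each agent.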