DX Infrastructure Management

Availability in UIM - Let's discuss and solve

  • 1.  Availability in UIM - Let's discuss and solve

    Posted 01-31-2018 04:52 PM

    I've been thinking about this more recently, and I'd like to come up with a solution to track, calculate, and report on availability in UIM. Here is the public GitHub repository where I will be tracking this project:

    GitHub - BryanKMorrow/uim-availability_monitor: Monitor the availability of different CIs in CA UIM  

     

    UIM AVAILABILITY MONITOR

    Monitor the availability of different CIs in CA UIM

    Download the latest version here: https://github.com/BryanKMorrow/uim-availability_monitor/blob/master/releases/availability_monitor.zip

    DESCRIPTION

    The goal of this probe is to improve the collection of availability and reachability data for different CIs in the UIM environment. It currently runs only against the local hub, so it detects and reports only on that hub's directly connected robots.

    PREREQUISITES

    • You will need to insert the following two Metric Definitions into the database before using this probe. This can be done from the SLM portlet -> Tools -> SQL Query:
        insert into CM_CONFIGURATION_ITEM_METRIC_DEFINITION VALUES ('10.2:98', 'Robot Availability', 'state', '10.2', NULL);
        insert into CM_CONFIGURATION_ITEM_METRIC_DEFINITION VALUES ('10.2:99', 'Robot Reachability', 'state', '10.2', NULL);
    • The probe needs to be deployed on a hub

    CURRENT FEATURES

    • Creates two QOS metrics to track robot availability and reachability -> QOS_AVAILABILITY_AVAILABILITY (QAA) and QOS_REACHABILITY_REACHABILITY (QRR)
    • The metric values are either 1 (online) or 0 (offline)
    • Sends an alarm if the robot's IP address is equal to 127.0.0.1
    • Sends an alarm if hdb or spooler is in an inactive state
    • Current alarms are NOT automatically cleared

    TODO

    • CABI Report
    • Alarms automatically cleared
    • Automated SLA/SLO Creation
    • Robot List Management by Callback(s)
    • HTML 5 Dashboard

    TECHNICAL DETAILS

    1. Runs once per interval, configured in seconds (setup->interval)
    2. Runs getrobots on its controller probe via callback
    3. Loops through each robot and does the following:
      1. Sends an alarm if the robot's configured IP address is 127.0.0.1
      2. Runs probe_list to return the list of probes running on the machine
      3. Sends an alarm if hdb and/or spooler are in a red state
        1. Sends QAA as 0 and QRR as 1
      4. If the robot is active, sends QAA and QRR as 1
      5. If the robot is not active, attempts to ping the robot's IP address
        1. If pingable, QAA is set to 0 and QRR is set to 1
        2. If not pingable, QAA is set to 0 and QRR is set to 0
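
    The decision logic in the steps above can be sketched roughly as follows. This is only an illustration in Python, not the probe's actual code: the `RobotStatus` class, its fields, and the `qos_values` function are hypothetical names, and the sketch assumes that a red hdb/spooler takes precedence over the "active" check for a robot that is otherwise up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RobotStatus:
    """Hypothetical snapshot of what the probe learns about one robot."""
    active: bool          # robot is reported as active by the hub
    core_probes_ok: bool  # hdb and spooler are both healthy (not red)
    pingable: bool        # the robot's IP address answered a ping


def qos_values(status: RobotStatus) -> Tuple[int, int]:
    """Return (QAA, QRR), each 1 (online) or 0 (offline)."""
    if status.active:
        if not status.core_probes_ok:
            # Host answers, but hdb/spooler is red: reachable, not available.
            return (0, 1)
        return (1, 1)
    if status.pingable:
        # Robot is down but the host still answers ping.
        return (0, 1)
    # Neither the robot nor the host responds.
    return (0, 0)
```

    For example, `qos_values(RobotStatus(active=False, core_probes_ok=True, pingable=True))` gives `(0, 1)`: the host is reachable but the robot is not available.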


  • 2.  Re: Availability in UIM - Let's discuss and solve

    Posted 01-31-2018 06:01 PM

    Reachability and availability typically go together when determining why a device may not be responding. Was the robot unavailable because the network was down (reachability), because the host itself was down, or because the robot (agent) was down? For example, the host may have been reachable and available while the robot/agent software was not operating. These are all distinctly different issues, and it would be nice if the product handled, tracked, and alerted on each of them separately.

     

    The eHealth/SystemEDGE products use these same concepts and metrics today (at least until March!), and it would be great if this functionality could be included to fill this particular gap being left there as those products are going EOL.



  • 3.  Re: Availability in UIM - Let's discuss and solve

    Posted 01-31-2018 06:08 PM

    Hello Bryan Morrow, I think the Spectrum approach might give great results, because for the availability report (our goal) a device can be in one of three states: up, down, or maintenance, and it can be tricky to handle the maintenance schedule in a probe.
    So if we track specific alerts (the ones that characterize downtime in each scenario) and use them as outage intervals, it may solve the problem.



  • 4.  Re: Availability in UIM - Let's discuss and solve

    Posted 02-01-2018 02:21 PM

    Hi Bryan,

     

    There is an existing community "idea" on this topic that we're currently working on. As you know, one of the biggest issues here, at least for locally monitored devices (using CDM), is that we don't have any metrics readily available that we can use to create these reports. We do have a system uptime metric; the problem is that it is reset to zero on a system reboot. As a result, engineering is adding a "system state" metric (0 or 1) for each polling interval that will be generated at the hub. Initially, this will be based solely on robot connectivity. As Chris points out, this doesn't solve the reachability scenario, and I've discussed with engineering enhancing this check to validate reachability when we have a down robot. I do want to point out that we already have this data available as "power state" for AWS, Azure, VMware and our virtualization probes.

     

    Once we have this data available, we can easily create CABI reports that work for cloud, virtualization, and on-premise physical devices alike. Of course, we'll also have to factor maintenance periods into this. Unfortunately, I can't commit to timelines and releases, but I want you to know I am pushing for this new metric to be available in UIM 9.0. The beauty of the CABI reports is that we can deliver them off cycle.
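
    As a rough illustration of how a per-interval 0/1 state metric could be turned into an availability figure with maintenance periods factored out, here is a minimal Python sketch. It is not how CABI or UIM computes anything; the function names, the (timestamp, state) sample shape, and the maintenance window format are all assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[int, int]  # (epoch_seconds, state), state is 0 or 1
Window = Tuple[int, int]  # (start, end) in epoch seconds, end exclusive


def in_maintenance(ts: int, windows: Iterable[Window]) -> bool:
    """True if the timestamp falls inside any maintenance window."""
    return any(start <= ts < end for start, end in windows)


def availability_percent(
    samples: Iterable[Sample],
    maintenance: Iterable[Window] = (),
) -> Optional[float]:
    """Percentage of non-maintenance samples whose state is 1.

    Returns None when every sample falls inside maintenance (or there
    are no samples), since no availability can be stated in that case.
    """
    windows = list(maintenance)  # allow generators; we iterate repeatedly
    counted = [state for ts, state in samples
               if not in_maintenance(ts, windows)]
    if not counted:
        return None
    return 100.0 * sum(counted) / len(counted)
```

    With four one-minute samples `[(0, 1), (60, 1), (120, 0), (180, 0)]` this gives 50.0, and it gives 100.0 if the window `(120, 240)` is declared as maintenance, because the two down samples are then excluded from the calculation.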

     

    Cheers,

     

    Ben Nelson



  • 5.  Re: Availability in UIM - Let's discuss and solve

    Posted 02-01-2018 02:31 PM

    Thanks Ben, this is good information. Even though you have this planned for 9.0, I will still start working on this for backwards compatibility for all the previous versions.



  • 6.  Re: Availability in UIM - Let's discuss and solve

    Posted 02-08-2018 11:07 AM

    Updated the original post, added a link to the first version of the probe download.  Please let me know if you have any questions/concerns/thoughts.

     

    Bryan



  • 7.  Re: Availability in UIM - Let's discuss and solve

    Posted 02-12-2018 04:28 PM

    Hi Bryan,

     

    Some of my past work is similar:

     

    GitHub - fraxken/robots_checker: CA UIM Robots_checker (check probes, and do callback on it) 

    GitHub - fraxken/selfmonitoring: CA UIM Self monitoring probe 

    GitHub - UIM-Community/checkconfig4: Checkconfig 4 - Retrieve UIM Configuration 

     

    I would surely do it differently today... (I have accumulated a lot of experience about what the product already handles, etc.)

     

    Best Regards,

    Thomas



  • 8.  Re: Availability in UIM - Let's discuss and solve

    Posted 03-12-2018 05:12 AM

    Hi Bryan,

     

    I've tried this probe and the results are good. However, I'm having a problem on CentOS 7: does this probe not support Linux? It does not generate any QoS there, while it works well on Windows 2008 R2 and Windows 2012.

     

    Regards,

    Andre



  • 9.  Re: Availability in UIM - Let's discuss and solve

    Posted 03-12-2018 10:45 AM

    I'm running it on multiple systems that run CentOS 7, and there is nothing in the probe that should stop QoS generation. If you set the loglevel to 4, can you post the logs from that system?

     

    Thanks,

     

    Bryan



  • 10.  Re: Availability in UIM - Let's discuss and solve

    Posted 03-12-2018 10:30 PM

    Here are the logs:

     

    Mar 13 09:23:48:194 [main, availability_monitor] Error when running on timer method in main thread. Reason is (1) error, Not able to callback for timer checkAvailability.60000. Reason is java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:24:48:203 [main, availability_monitor] Checking hub for robot availability
    Mar 13 09:24:50:722 [main, availability_monitor] Failed to resolve host name "CENTOSSVR" for remote device: java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:24:50:728 [main, availability_monitor] Error when running on timer method in main thread. Reason is (1) error, Not able to callback for timer checkAvailability.60000. Reason is java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:24:54:091 [main, availability_monitor] ****************[ Stopped ]****************
    Mar 13 09:24:55:106 [main, availability_monitor] Login to NimBUS is OK
    Mar 13 09:24:55:118 [main, availability_monitor] Checking hub for robot availability
    Mar 13 09:24:55:382 [main, availability_monitor] Remote dev created for hostname 'CENTOSSVR', probe 'availability_monitor'
    Mar 13 09:24:57:654 [main, availability_monitor] Failed to resolve host name "CENTOSSVR" for remote device: java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:24:57:654 [main, availability_monitor] Error while running checkAvailability on probe startup.
    Mar 13 09:24:57:654 [main, availability_monitor] ****************[ Starting ]****************
    Mar 13 09:24:57:654 [main, availability_monitor] 1.00
    Mar 13 09:24:57:654 [main, availability_monitor] CA Technologies
    Mar 13 09:24:57:674 [main, availability_monitor] port=48021
    Mar 13 09:24:57:728 [main, availability_monitor] Has already a SID
    Mar 13 09:24:57:729 [main, availability_monitor] Login to NimBUS is OK
    Mar 13 09:25:57:731 [main, availability_monitor] Checking hub for robot availability
    Mar 13 09:25:57:858 [main, availability_monitor] Remote dev created for hostname 'CENTOSSVR', probe 'availability_monitor'
    Mar 13 09:26:00:114 [main, availability_monitor] Failed to resolve host name "CENTOSSVR" for remote device: java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:26:00:126 [main, availability_monitor] Error when running on timer method in main thread. Reason is (1) error, Not able to callback for timer checkAvailability.60000. Reason is java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:27:00:127 [main, availability_monitor] Checking hub for robot availability
    Mar 13 09:27:00:252 [main, availability_monitor] Remote dev created for hostname 'CENTOSSVR', probe 'availability_monitor'
    Mar 13 09:27:02:520 [main, availability_monitor] Failed to resolve host name "CENTOSSVR" for remote device: java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:27:02:529 [main, availability_monitor] Error when running on timer method in main thread. Reason is (1) error, Not able to callback for timer checkAvailability.60000. Reason is java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:28:02:530 [main, availability_monitor] Checking hub for robot availability
    Mar 13 09:28:02:660 [main, availability_monitor] Remote dev created for hostname 'CENTOSSVR', probe 'availability_monitor'
    Mar 13 09:28:04:918 [main, availability_monitor] Failed to resolve host name "CENTOSSVR" for remote device: java.net.UnknownHostException: CENTOSSVR
    Mar 13 09:28:04:923 [main, availability_monitor] Error when running on timer method in main thread. Reason is (1) error, Not able to callback for timer checkAvailability.60000. Reason is java.net.UnknownHostException: CENTOSSVR

     

    Thanks,

     

    Andre



  • 11.  Re: Availability in UIM - Let's discuss and solve

    Posted 03-12-2018 10:35 PM

    As a test, please place an /etc/hosts entry for that robot name before any local host entry on that machine. I’m guessing the probe is trying to create a remote CI and can’t resolve the robot name properly.
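
    A quick way to check whether a name such as CENTOSSVR resolves on the machine, before and after editing /etc/hosts, is a small script like the one below. This is a generic Python sketch, not part of the probe; `can_resolve` is a hypothetical helper, and the probe's Java runtime may not resolve names in exactly the same way as Python does on a given system.

```python
import socket
from typing import Callable


def can_resolve(
    hostname: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if the hostname resolves to an address.

    The resolver argument exists only so the check can be exercised
    without touching the real system resolver.
    """
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror and socket.herror subclass OSError
        return False


if __name__ == "__main__":
    name = "CENTOSSVR"  # robot name taken from the logs in this thread
    print(name, "resolves" if can_resolve(name) else "does NOT resolve")
```

    If the script reports that the name does not resolve, the UnknownHostException in the logs is the likely consequence, and adding the hosts entry should change the result.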





  • 12.  Re: Availability in UIM - Let's discuss and solve

    Posted 03-12-2018 11:36 PM

    Thanks Bryan, it worked