DX Unified Infrastructure Management


Detecting Malfunctioning Robots

  • 1.  Detecting Malfunctioning Robots

    Posted Feb 22, 2016 02:09 PM

    We occasionally have robots whose probes get into an Error state.  For example, we just had one where all probes except controller were in the Error state.  Restarting the robot fixed it.  Is there a way to detect this kind of condition and alarm on it?



  • 2.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 10:10 AM

    We also face the same problem. I think it happens when something is wrong with the network connection.

    I suppose a Lua script could detect it, but I don't have enough time to dedicate to this.

    I periodically start the probe report utility. It shows a list of all probes and their status.



  • 3.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 10:53 AM

    Do you think it would help to uncheck the "Suspend all probes when no network connection is available" box in the config for controller -> Setup -> Advanced?  Or would that have negative consequences?



  • 4.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 11:32 AM

    Hi

     

    I didn't test it yet...

     

     

     




  • 5.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 11:36 AM

    We are also facing these errors. Sometimes validating the probes fixes the issue. It happens mostly with the spooler probe.

     

    https://communities.ca.com/thread/241736645?q=Robot%20not%20responding



  • 6.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 11:09 AM

    We find that with this we generally get "Max. restarts reached for probe 'probe name'".

    Do you get any of these?



  • 7.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 11:33 AM

    No, the max restarts alarm comes only for single probe errors, not in this case.

     

     

     




  • 8.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 01:27 PM

    I disabled the network on a test system, then re-enabled it and found the same condition: all probes red except controller, and I could not contact the controller when I tried to open its config.  Restarting the robot restored it to working order.

     

    Next test, I unchecked the "Suspend all probes when no network connection is available".  This time, when I brought the network back online, all probes showed green, but I still could not contact the controller.

     

    So, it seems the robot is not smart enough to recover from a loss of network after the network is fixed.



  • 9.  Re: Detecting Malfunctioning Robots

    Posted Feb 23, 2016 02:49 PM

    I suspect that suspending the probes when the network is not available leads to no QoS data being collected for that time frame, since the probes are inactive.

    Can anyone clarify?



  • 10.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 05:40 AM

    I tend to uncheck the "Suspend all probes when no network connection is available" option as part of the base robot build, as I have seen this 'all probe red' situation in the past go undetected.

     

    A way of detecting spooler issues (and all red probe issues) would be to run a net_connect service check on port 48001 for each robot.

     

    wbdeford - When you say 'I disabled the network on a test system', do you mean the NIC on the server? If so, I don't really think this is a fair test.

    Robots most certainly should recover from upstream network unavailability.



  • 11.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 08:24 AM

    Yes, I disabled the NIC. I agree that this may not be a fair test, but it does duplicate the situation we have seen on systems which have required manual intervention to recover.



  • 12.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 10:37 AM

    We have complained about this issue since before CA owned UIM/Nimbus.  Sometimes the robots are so hung that restarting them through UIM does not work.  In those cases we have found that dropping a robot_update on them sometimes fixes the issue.  Otherwise we have to contact the Windows server team to log on to the server and manually bounce the service.



  • 13.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 11:13 AM

    Wow, that doesn't sound good. What have Nimsoft/CA said about this?



  • 14.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 11:29 AM

    Support agent is looking into it.



  • 15.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 11:12 AM

    It would be interesting to know how prolific this issue is.

    Anyone who reads this thread could do a Tools->Find, Probes, 'spooler' from IM

    Sort by the 'Active' column and highlight the red ones to reveal the count in the bottom right.

    See how many red spoolers there are out there.....

    I'm on a brand new implementation project at the moment so my stats are:

    Robots = 22

    Red Spooler = 0



  • 16.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 01:10 PM

    Our current count is 8 red using your query.  We have several hundred robots.



  • 17.  Re: Detecting Malfunctioning Robots

    Posted Mar 01, 2016 09:32 AM

    robots = 313

    Red spoolers = 0



  • 18.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 12:29 PM

    Hi,

     

    Normally all of these probes end up in an invalid state when the magic key in the controller.cfg file gets corrupted or changed for any reason. These are alerted in the console as well. They appear with controller as the probe, meaning the controller generates these alerts. They critically need to be addressed, because until they are you will not be getting any alerts.

    Now, how to do this? Just select all these probes > right-click > Security > Validate > Validate all.

    This will force the controller to accept the new magic key, and then it should be back to normal. I know there was a Lua script around somewhere that does this automatically. It simply validates all the probes which are in an error state.

     

    -kag



  • 19.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 12:38 PM

    I know how to fix this condition in the GUI.  But we want to automate the detection and correction.



  • 20.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 01:16 PM

    There are scripts on the forum that can do the detection. I've also made a probe that I install on every hub; it checks that each robot on that hub has a configured set of probes by checking the probe port list on the controller. Additionally, it can do callbacks and check the RC on any of the probes if necessary. Unfortunately this is proprietary stuff and I can't share it here, but I can't see why people couldn't create similar probes, and I'm pretty sure other people already have.

     

    Remedying the situation might be a little trickier, though, and will naturally depend on the error. There's some stuff you could do though:

    1. If probes need to be validated, they can be (I can post a script for this too, though it would have to be tomorrow as it's my last day).

    2. If the robot reports 127.0.0.1, you can still do callbacks to it if you know its IP, so you can change it.

    3. If the controller is altogether unresponsive but the robot is alive, you could issue a WMI, WSMAN, SSH or whatever command to restart the service.
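
    As a rough illustration of how the three remedies above could be selected in a script, here is a minimal Lua sketch. The function name choose_remedy and the fields on the state table are hypothetical and are not part of any UIM API; how each field gets populated is left to whatever detection script you already run.

```lua
-- Hypothetical helper, not a UIM API: pick one of the three remedies above.
-- state.controller_responds    : true if the controller answers a callback
--                                when addressed by the robot's known real IP
-- state.reported_ip            : IP address the robot registered with the hub
-- state.probes_need_validation : true if probes are flagged for validation
local function choose_remedy(state)
  if not state.controller_responds then
    return "restart_service"   -- remedy 3: restart via WMI, SSH, etc.
  end
  if state.reported_ip == "127.0.0.1" then
    return "fix_ip"            -- remedy 2: callbacks still work via real IP
  end
  if state.probes_need_validation then
    return "validate_probes"   -- remedy 1
  end
  return "none"
end

print(choose_remedy({ controller_responds = true, reported_ip = "127.0.0.1" }))  -- fix_ip
```

    The unresponsive-controller case is checked first because the other two remedies rely on callbacks reaching the controller.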

     

    -jon



  • 21.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 01:15 PM

    That solution does not work for us.  We have tried it.



  • 22.  Re: Detecting Malfunctioning Robots

    Broadcom Employee
    Posted Feb 25, 2016 05:39 PM

    The hub should be keeping track of how often the robot is checking in and raise an alarm if the check-in has not occurred in an interval of the robot's check-in time.

     

    Is this action not occurring for you?



  • 23.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 06:07 PM

    The robot is able to check in to the hub.  But it tells the hub its address is 127.0.0.1, so the hub's attempts to contact it fail, such as when I try to open the Configure interface.



  • 24.  Re: Detecting Malfunctioning Robots

    Broadcom Employee
    Posted Feb 25, 2016 06:36 PM

    Right, so the robot checks in with the hub and the hub records that check-in time in Nimsoft\hub\robots.sds.  At the same time it records the robot's configured check-in interval.  Let's say it's 10 minutes.  The hub knows that the robot should check in once every 10 minutes and if it doesn't hear from the robot in 10*1.5, it (the hub) should raise an alarm.

     

    But the question I have is, is the 127.0.0.1 address reporting a desired configuration or is this a side effect of something else?
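
    The check-in rule described in this post (alarm when nothing has been heard for 1.5 times the configured interval) can be written down as a small Lua sketch. This only illustrates the arithmetic; the function name is made up, and how the hub actually implements the check is not shown in this thread.

```lua
-- Grace factor taken from the post above: 1.5 check-in intervals.
local GRACE_FACTOR = 1.5

-- last_checkin and now are epoch seconds; interval_sec is the robot's
-- configured check-in interval in seconds.
local function is_checkin_overdue(last_checkin, interval_sec, now)
  return (now - last_checkin) > interval_sec * GRACE_FACTOR
end

-- Example with a 10-minute interval (alarm threshold is 15 minutes).
local now = os.time()
print(is_checkin_overdue(now - 16 * 60, 10 * 60, now))  -- true
print(is_checkin_overdue(now - 14 * 60, 10 * 60, now))  -- false
```

    Note that a robot in the state discussed here still checks in, so this rule alone would not flag it.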



  • 25.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 07:21 PM

    127.0.0.1 is not desired.  When it gets in the error state, the IP address gets changed to that.  If we restart the robot, it returns to what it should be.
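
    Since the symptom here is a robot registering with a loopback address, one possible detection is to scan the robot list for such addresses. The sketch below assumes each robot entry is a table with name and ip fields; that shape is an assumption, so check it against the getrobots output in your own environment before relying on it.

```lua
-- robots: array of tables with `name` and `ip` fields (assumed shape).
-- Returns the names of robots that registered with a loopback address.
local function find_loopback_robots(robots)
  local suspects = {}
  for _, robot in ipairs(robots) do
    -- Any 127.x.x.x address is loopback and cannot be reached from the hub.
    if type(robot.ip) == "string" and robot.ip:match("^127%.") then
      suspects[#suspects + 1] = robot.name
    end
  end
  return suspects
end

local sample = {
  { name = "robot-a", ip = "10.1.2.3" },
  { name = "robot-b", ip = "127.0.0.1" },
}
print(table.concat(find_loopback_robots(sample), ", "))  -- robot-b
```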



  • 26.  Re: Detecting Malfunctioning Robots

    Posted Mar 01, 2016 09:31 AM

    The 127.0.0.1 issue is more common on Unix boxes - it's caused by an 'incorrect' /etc/hosts file when the robot tries to figure out what IP address to use - the install guide for robots does mention this known issue.

     

    I've been told that the way to ensure the robot always uses the right address is to define the robotip parameter in robot.cfg.

    If the robot is already registered that is not going to be possible, but a solution exists: see "Adjust IP address of robots from the Hub".



  • 27.  Re: Detecting Malfunctioning Robots

    Posted Feb 25, 2016 08:39 PM

    As to why the robots are going into a failed state, I'm not sure, but below is a way to detect when a robot or hub goes into an error state and stops sending QoS/alarms.

    I hope it helps you!

     

    ------------------------------------------------------------------------------------

     

    I had a customer that wanted to send a heartbeat from every robot on their entire system every 15 minutes, check every 30 minutes whether it had been received, and alarm if it had not. This was because on a few occasions they have had robots disappear without any specific message saying that the robot had gone. There were always other clues, like tunnel errors and queue errors, but nothing simple like "robot has missed sending a heartbeat".

    Basically, Dirscan checks every 15 minutes for a single file to be there 99 times (a deliberately silly test that always fails, which is what I want) and then sends an informational alarm to the NAS. The NAS AO looks for that alarm and auto-closes it on arrival. I then have SQL_Response run a script against the closed-alarms table (NAS_TRANSACTION_SUMMARY) every 30 minutes to look for missed heartbeat alarms and send an alarm for each server that didn't send one. I also created an HTML5 dashboard using the SQL_Table widget to run the same (or a very similar) query, in case the NAS or the queues also broke.

     

    More details can be found in attached "Robot Heartbeat Project.pdf"
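
    The comparison at the heart of this heartbeat scheme can be sketched in a few lines of Lua. This is not the SQL from the attached document; it only illustrates the "who has not reported within 30 minutes" logic, with hypothetical function and parameter names.

```lua
-- expected_robots : array of robot names that should be sending heartbeats
-- last_heartbeat  : table mapping robot name -> epoch seconds of the last
--                   heartbeat alarm seen (nil if none was ever seen)
-- Returns the names of robots whose heartbeat is missing or too old.
local function find_missed_heartbeats(expected_robots, last_heartbeat, now, max_age_sec)
  local missing = {}
  for _, name in ipairs(expected_robots) do
    local seen = last_heartbeat[name]
    if seen == nil or (now - seen) > max_age_sec then
      missing[#missing + 1] = name
    end
  end
  return missing
end

local now = os.time()
local seen = { web01 = now - 10 * 60, db01 = now - 45 * 60 }
local missed = find_missed_heartbeats({ "web01", "db01", "app01" }, seen, now, 30 * 60)
print(table.concat(missed, ", "))  -- db01, app01
```

    Keeping the expected robot list separate from the heartbeat data matters: a robot that never sent anything would otherwise be invisible to the check.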

     




  • 28.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 01:29 AM

    A sample Lua script that can detect robot and probe status is attached.

    It generates a level 1 alarm for inactive robots and probes.




  • 29.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 04:59 AM

    Please refer to the link below. It has a script that runs on any controller alert and validates the affected probes. It works on all probes.

     

    Automatically validate hdb and spooler probes via script

     

    http://www.ca.com/us/support/ca-support-online/product-content/knowledgebase-articles/tec000002647.aspx?intcmp=searchresultclick&resultnum=2

     

    -kag



  • 30.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 05:35 AM

    Thanks kanandaguberan.

    The relevant part of this article's script is '-- Now run the probe_verify callbacks on each probe which FAILED to start'

    As you can see we can send a 'verify' and 'activate' callback to a probe which is in an error state.

    If we amend Luc Christiaen's script to look for probes in an error state, rather than inactive, then combine this 'verify and activate piece', we have a working solution to detect and remediate the 'red probe' situation 
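
    A minimal sketch of that "verify, then activate" step might look like the following. It is written so it runs outside the NAS: the request argument is a hypothetical wrapper that sends one callback to a controller and returns true on success. The callback names probe_verify and probe_activate come from this thread; everything else (function names, table shapes, return-code handling) is an assumption.

```lua
-- red_probes : array of { controller = "/dom/hub/robot/controller", name = "spooler" }
-- request    : function(controller, callback, args) -> true on success
local function verify_and_activate(red_probes, request)
  local results = {}
  for _, probe in ipairs(red_probes) do
    local args = { name = probe.name }
    local ok = request(probe.controller, "probe_verify", args)
    if ok then
      -- Only try to activate a probe that was verified successfully.
      ok = request(probe.controller, "probe_activate", args)
    end
    results[#results + 1] = {
      controller = probe.controller,
      name = probe.name,
      fixed = ok and true or false,
    }
  end
  return results
end

-- Inside the NAS, `request` could wrap nimbus.request, for example:
--   local function nas_request(controller, callback, args)
--     local p = pds.create()
--     pds.putString(p, "name", args.name)
--     local _, rc = nimbus.request(controller, callback, p)
--     pds.delete(p)
--     return rc == 0   -- assumption: a return code of 0 means success
--   end
```

    Injecting the request function keeps the loop testable with a stub before pointing it at production robots.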



  • 31.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 06:53 AM

    To amend Luc Christiaen's script to look for probes in an error state rather than inactive we can look at the process_state.

    Values are 'running' (active), 'stopped' (Inactive) and 'none' (Error)

    Change: if p_value.active == 0 then

    To: if p_value.process_state == "none" then
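
    For readability, the process_state mapping above can be kept in one lookup table rather than in scattered string comparisons. This is only a sketch with hypothetical names; a missing or unrecognised value is reported as "unknown" rather than guessed.

```lua
-- Mapping from the post above: process_state value -> state shown in IM.
local STATE_LABELS = {
  running = "active",
  stopped = "inactive",
  none    = "error",
}

-- probe: one entry from a probe_list response (a table with process_state).
local function classify_probe(probe)
  if probe.process_state == nil then
    return "unknown"
  end
  return STATE_LABELS[probe.process_state] or "unknown"
end

print(classify_probe({ name = "spooler", process_state = "none" }))     -- error
print(classify_probe({ name = "cdm", process_state = "running" }))      -- active
```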

     



  • 32.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 07:25 AM

    But why do we need to run a script to monitor robot functionality? Can we get a permanent fix for this on the agent side?



  • 33.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 08:55 AM

    Good question, Issac08. This is an interim workaround which I personally will be looking to use for my current project, and it will hopefully help wbdeford and others get better visibility, availability and coverage in their UIM environments in the short term too.

    I know of CA customers who have written dedicated probes to perform this type of functionality, so we are most definitely not the first to embark on this journey.

    CA are aware of this shortfall but are probably waiting for an 'idea' to be created for this product enhancement.

     

    my 2 cents: If I were going to create a permanent solution, this functionality would be added at a Hub level to distribute the callbacks and make it scalable.



  • 34.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 09:11 AM

    If customers have created probes to do this, why doesn't CA just buy one of those probes from the customer?  Or give them free licenses for stuff in exchange?  CA would make a lot of other customers happier and save themselves development time. 



  • 35.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 09:56 AM

    Just speculating:

    *The probe may not have been written in Java. I think all marketplace probes now need to be developed this way.

    *The customer may have hardcoded environment specific attributes or used cmdb lookups.



  • 36.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 10:11 AM

    Nimboss - I created an idea. Please add anything that has been missed.

     

    hdb and spooler probes Not Responding -Need Fix



  • 37.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 10:06 AM

    I started with Luc's script, then added NimBOSS's suggestion of if p_value.process_state == "none" to create a level 2 alarm (without deleting Luc's original check and level 1 alarm).

    The level 1 alarms get created as expected for Inactive probes and robots.  But only two Level 2 alarms were created--one for a probe that is Inactive, and one for a probe that seems to be up.  And there is one server with a spooler probe in the Error state that did not get an alarm.  Here is the script as I ran it:

     

    --
    -- check_robot_probe.lua
    --
    print('Robot & Probe Status')
    print('====================')
    print(' ')

    hublist = nimbus.request("hub","gethubs");
    hubs = hublist.hublist
    args = pds.create()
    for hub_key,hub_table in pairs(hubs) do
       hub = hubs[hub_key]
       print ("Processing hub: " .. hub.name .. "\n")
       robots = nimbus.request(hub.addr,"getrobots")
       if robots ~= nil then
          for r_key,r_value in pairs(robots.robotlist) do
             controller = r_value.addr.."/controller"
    --         print ("  Processing robot: " .. r_value.name .. "\n")
             probes = nimbus.request(controller,"probe_list")
             if probes ~= nil then
                for p_key,p_value in pairs(probes) do
                      if p_value.active == 0 then
                        print ("    * Probe: " .. p_value.name .. " is Inactive on robot: " .. r_value.name .. " *\n")
                        local resp,rc = nimbus.alarm(1, "Check_robot_probe_status - Probe: " .. p_value.name .. " is Inactive on robot: " .. r_value.name .. "","check_probe_"..r_value.name.."_"..p_value.name)
                      end
                      if p_value.process_state == "none" then
                        print ("    * Probe: " .. p_value.name .. " is not working on robot: " .. r_value.name .. " *\n")
                        local resp,rc = nimbus.alarm(2, "Check_robot_probe_status - Probe: " .. p_value.name .. " is not working on robot: " .. r_value.name .. "","check_probe_"..r_value.name.."_"..p_value.name)
                      end
                end
             else
                print("  ** Robot: " .. r_value.name.." is Inactive **\n")
                local resp,rc = nimbus.alarm(1, "Check_robot_probe_status - Robot: " .. r_value.name.." is Inactive","check_robot_"..r_value.name)
             end
          end
       end
    end

     

     



  • 38.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 11:10 AM

    A couple of notes:

    1) You should have got alerts generated from all Hub robot spooler probes as they always have a process_state of 'none'. (using hub/robot 7.80) I think we will need to exclude hub robots from our check anyway.

    2) I tested the script in my dev environment by disabling the data_engine probe and restarting the robot to cause dependent probes to go into an error state. This is not exactly the same scenario as described, but the script did seem to pick up all the red probes.

     

    wbdeford - Please can you post the output of a manual PU - list_probes on a controller with the 'red probes issue'? I want to see if there is any different key/values we can use for our check which are unique to this issue.

    I assume in the 'active' column in IM the red probes display with 'error'?



  • 39.  Re: Detecting Malfunctioning Robots

    Posted Feb 26, 2016 01:31 PM

    I assume you mean probe_list...see output below for, first, the host that alarms that the spooler is down when it shows up, and, second, the host where it shows red but there is no alarm.

     

    F:\Program Files (x86)\Nimsoft\bin>"F:\Program Files (x86)\Nimsoft\bin\pu.exe" -u administrator -p xxxxxxxx /CC-CLNIMSOFT51dom/CC-CLNIMSOFT51hub/CC-CLNIMSOFT51/controller probe_list
    Feb 26 13:25:18:074 pu: SSL - init: mode=0, cipher=DEFAULT, context=OK
    Feb 26 13:25:18:075 pu: nimCharsetSet() - charset=
    Command requires data:
    (s) name       = spooler
    (s) robot      = cc-clnimsoft51
    ======================================================
    Address: /CC-CLNIMSOFT51dom/CC-CLNIMSOFT51hub/CC-CLNIMSOFT51/controller Request: probe_list
    ======================================================
    spooler         PDS_PDS         445
    name            PDS_PCH           8 spooler
    description     PDS_PCH          22 Robot Message Spooler
    group           PDS_PCH          15 Infrastructure
    active          PDS_I             2 1
    type            PDS_I             2 3
    command         PDS_PCH          12 spooler.exe
    arguments       PDS_PCH           1
    config          PDS_PCH          12 spooler.cfg
    logfile         PDS_PCH          12 spooler.log
    workdir         PDS_PCH           6 robot
    timespec        PDS_PCH           1
    times_activated PDS_I             2 0
    last_action     PDS_I             2 0
    pid             PDS_I             3 -1
    times_started   PDS_I             2 0
    last_started    PDS_I             2 0
    pkg_name        PDS_PCH          13 robot_update
    pkg_version     PDS_PCH           5 7.80
    process_state   PDS_PCH           5 none
    port            PDS_I             6 48001
    is_marketplace  PDS_I             2 0
    marketpl_block  PDS_I             2 0

     

     

    F:\Program Files (x86)\Nimsoft\bin>"F:\Program Files (x86)\Nimsoft\bin\pu.exe" -u administrator -p xxxxxxxx /CC-CLNIMSOFT51dom/CC-CLNIMSOFT51hub/cc-cldepesql53/controller probe_list
    Feb 26 13:20:10:673 pu: SSL - init: mode=0, cipher=DEFAULT, context=OK
    Feb 26 13:20:10:674 pu: nimCharsetSet() - charset=
    Command requires data:
    (s) name       = spooler
    (s) robot      = cc-cldepesql53
    ======================================================
    Address: /CC-CLNIMSOFT51dom/CC-CLNIMSOFT51hub/cc-cldepesql53/controller Request: probe_list
    ======================================================
    spooler         PDS_PDS         386
    name            PDS_PCH           8 spooler
    description     PDS_PCH          22 Robot Message Spooler
    group           PDS_PCH          15 Infrastructure
    active          PDS_I             2 2
    type            PDS_I             2 2
    command         PDS_PCH          12 spooler.exe
    arguments       PDS_PCH           1
    config          PDS_PCH          12 spooler.cfg
    logfile         PDS_PCH          12 spooler.log
    workdir         PDS_PCH           6 robot
    timespec        PDS_PCH           1
    times_activated PDS_I             2 1
    last_action     PDS_I            11 1454818300
    pid             PDS_I             3 -1
    times_started   PDS_I             3 11
    last_started    PDS_I            11 1454819052
    pkg_name        PDS_PCH          13 robot_update
    pkg_version     PDS_PCH           5 7.63
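
    To compare two probe_list dumps like the ones in this post without eyeballing them, a small Lua sketch can parse the "key TYPE size value" lines and list which keys one response has that the other lacks. The function names are hypothetical and the line pattern is inferred only from the output shown here, so treat it as a starting point rather than a complete pu parser.

```lua
-- Parse pu output lines of the form "key  PDS_xxx  size  value".
-- Lines that do not match (timestamps, separators, prompts) are ignored.
local function parse_pu_fields(text)
  local fields = {}
  for line in text:gmatch("[^\r\n]+") do
    local key, value = line:match("^%s*([%w_]+)%s+PDS_%u+%s+%d+%s*(.-)%s*$")
    if key then
      fields[key] = value
    end
  end
  return fields
end

-- Keys present in `reference` but absent from `other`, sorted by name.
local function missing_keys(reference, other)
  local missing = {}
  for key in pairs(reference) do
    if other[key] == nil then
      missing[#missing + 1] = key
    end
  end
  table.sort(missing)
  return missing
end

local first = parse_pu_fields([[
name            PDS_PCH           8 spooler
process_state   PDS_PCH           5 none
port            PDS_I             6 48001
]])
local second = parse_pu_fields([[
name            PDS_PCH           8 spooler
pid             PDS_I             3 -1
]])
print(table.concat(missing_keys(first, second), ", "))  -- port, process_state
```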



  • 40.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 04:53 AM

    Yep. That's the one Thanks.

    OK. So the first one is a Hub. That's why the spooler probe reports its process state as 'none'. The script only works for standard robots. (This was not intentional.)

    Second one is not reporting process state at all. This could be indicative of the 'red probe' problem!?

    In that case we could change the script to look for something like 'if p_value.process_state == "none" or  p_value.process_state == nil then'



  • 41.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 08:59 AM

    I gave that a try and it fired off an alarm for every probe in my system.



  • 42.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 09:23 AM

    Oh. That didn't happen in my test environment.

    OK. here's my script with the nimbus.alarm statements commented out. Also it only looks for the spooler probes.

    --

    print ("Robot_Spooler_Status: Script Begin \n")

    hublist = nimbus.request("hub","gethubs");
    hubs = hublist.hublist
    args = pds.create()
    argsI = pds.create()
    for hub_key,hub_table in pairs(hubs) do
       hub = hubs[hub_key]
       print ("Processing hub: " .. hub.name .. "\n")
       robots = nimbus.request(hub.addr,"getrobots")
       if robots ~= nil then
          for r_key,r_value in pairs(robots.robotlist) do
             controller = r_value.addr.."/controller"
             probe_error_state = 0
             --print ("  Processing robot: " .. r_value.name .. "\n")
             pds.putString(argsI,"name", "spooler")
             probes = nimbus.request(controller,"probe_list", argsI)
             if probes ~= nil then
                for p_key,p_value in pairs(probes) do
                   --print (p_value.name..": "..p_value.active)
                   if p_value.process_state == nil then
                      --print ("probe error count: "..probe_error_state.." /n")
                      print ("    * Probe: " .. p_value.name .. " is in an error state on robot: " .. r_value.name .. " *\n")
                      --local resp,rc = nimbus.alarm(1, "Robot_Probe_Status - Probe: " .. p_value.name .. " is in an error state on robot: " .. r_value.name .. "","check_probe_"..r_value.name.."_"..p_value.name)
                   end  
                end  
             else         
                print("  ** Robot: " .. r_value.name.." is Inactive **\n")
                --local resp,rc = nimbus.alarm(1, "Robot_Probe_Status - Robot: " .. r_value.name.." is Inactive","check_robot_"..r_value.name)
             end
          end
       end
    end
    pds.delete(args)
    pds.delete(argsI)
    print ("Robot_Spooler_Status: Script Complete \n")



  • 43.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 09:43 AM

    I ran this version and it again has every probe in an error state, inside of if p_value.process_state == nil then....end



  • 44.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 10:01 AM

    It was running against all probes because there was a mistake passing 'args' instead of 'argsI'. I have now amended the script in the post above.

    However, I'm still not sure why they are all returning an error state as this does not happen in my environment. I'm running robot 7.8, what robot versions are you using?



  • 45.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 10:29 AM

    We have mostly 7.63.  CA broke things in 7.8 that it so far has refused to fix--in particular, the ability to stop and start a probe via pu.exe.



  • 46.  Re: Detecting Malfunctioning Robots

    Posted Mar 03, 2016 08:54 AM

    wbdeford wrote:

     

    We have mostly 7.63. CA broke things in 7.8 that it so far has refused to fix--in particular, the ability to stop and start a probe via pu.exe.

    Have you tried using the callbacks '_restart' (retains same PID) and/or '_stop' (new PID) instead of -R ??



  • 47.  Re: Detecting Malfunctioning Robots

    Posted Mar 03, 2016 09:27 AM

    This is how I've been doing it via Lua

     

    --Restart a Robot
    controller = robot.addr.."/controller"
    nimbus.request(controller, "_restart")

    --Start a probe (the probe name is passed as a string, here "cdm")
    local args = pds.create()
    pds.putString(args,"name","cdm")
    nimbus.request(robot.addr,"probe_activate",args)

    --where robot.addr is the full robot address

     

     

     

     



  • 48.  Re: Detecting Malfunctioning Robots

    Posted Mar 07, 2016 11:03 AM

    I  tried using nimbus.request(controller, "_restart") and nimbus.request(controller, "_stop"), but neither of those does anything on my systems where all probes except controller are red.  However, I was able to get this to work:

     

    nimbus.request(robot.addr,"_stop")

     

    With more testing, this may turn out to be the solution.  Thanks!



  • 49.  Re: Detecting Malfunctioning Robots

    Posted Mar 08, 2016 03:51 AM

    Good to hear.

    My understanding was that the 'robot.addr' variable must be the full path of the robot in question, including the probe you would like to perform the action on,

    i.e. /domain/hub/robot/controller.

    I might have to test it your way too.
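
    Because the difference between the robot address and the controller address matters in this exchange, here is a small Lua sketch that splits an address of the form /domain/hub/robot/probe and builds the controller address from it. The helper names are hypothetical; which of the two addresses a given callback should target is left open in this thread.

```lua
-- Split "/domain/hub/robot[/probe]" into its named parts.
local function split_nimbus_address(addr)
  local parts = {}
  for part in addr:gmatch("[^/]+") do
    parts[#parts + 1] = part
  end
  return { domain = parts[1], hub = parts[2], robot = parts[3], probe = parts[4] }
end

-- Build "/domain/hub/robot/controller" from a robot or probe address.
-- Returns nil if the address has fewer than three components.
local function controller_address(addr)
  local a = split_nimbus_address(addr)
  if not (a.domain and a.hub and a.robot) then
    return nil
  end
  return "/" .. a.domain .. "/" .. a.hub .. "/" .. a.robot .. "/controller"
end

print(controller_address("/dom/hub01/web01"))          -- /dom/hub01/web01/controller
print(controller_address("/dom/hub01/web01/spooler"))  -- /dom/hub01/web01/controller
```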



  • 50.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 10:30 AM

    but our hubs are 7.8



  • 51.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 10:35 AM

    I re-ran it with your change and now it only reports all spooler probes as in an error state.



  • 52.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 11:05 AM

    Just downgraded a robot to 7.63. The callback does not return process state. This is why they are all in an error state.

    There's probably an in-between robot version which still allows the PU.exe restart and provides the process state.

     

    With this in mind, for customers on a robot version that does return the process state, we should try the following:

    if hub.robotname ~= r_value.name and p_value.process_state == "none" then

    This goes back to my original idea, but excludes hub robots.
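
    Pulled out into a function, that condition might look like the sketch below. It skips the hub's own robot, and it reports a missing process_state as "unknown" instead of as an error, since the 7.63 robots discussed here do not return that field at all. The function name and the returned labels are hypothetical.

```lua
-- hub_robotname : name of the hub's own robot (hub.robotname in the script)
-- robot_name    : name of the robot being checked (r_value.name)
-- probe         : one probe_list entry (p_value)
local function check_probe(hub_robotname, robot_name, probe)
  if hub_robotname == robot_name then
    return "skipped"   -- hub robots are excluded from the check
  end
  if probe.process_state == nil then
    return "unknown"   -- robot version does not report process_state
  end
  if probe.process_state == "none" then
    return "error"
  end
  return "ok"
end

print(check_probe("hub01", "hub01", { process_state = "none" }))  -- skipped
print(check_probe("hub01", "web01", { process_state = "none" }))  -- error
print(check_probe("hub01", "web01", {}))                          -- unknown
```

    Treating "unknown" separately avoids alarming on every probe of an older robot, which is what happened when nil was treated as an error earlier in the thread.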



  • 53.  Re: Detecting Malfunctioning Robots

    Posted Feb 29, 2016 11:38 AM

    I put  if hub.robotname ~= r_value.name and p_value.process_state == "none" then  in the script.  It found one spooler that it said was in an error state, but it shows green in the GUI (though hdb is missing completely...not sure what that means).  The script did not find the one spooler probe that I know is in a red state in the GUI.



  • 54.  Re: Detecting Malfunctioning Robots

    Posted Mar 01, 2016 03:46 AM

    I probably wasn't clear before wbdeford, this will not work in your environment using 7.63 robots as they don't return the necessary value.

    As for the hdb probe issue, you will probably need to redeploy the robot to the server, as I doubt it's working properly.



  • 55.  Re: Detecting Malfunctioning Robots

    Posted Feb 13, 2017 12:22 PM

    Nimboss, where does the attached script need to be executed to get the output?



  • 56.  Re: Detecting Malfunctioning Robots

    Posted Feb 14, 2017 04:45 AM

    You can execute the Lua scripts from the NAS.



  • 57.  Re: Detecting Malfunctioning Robots

    Posted Feb 14, 2017 09:34 AM

    Hi,

     

    If you want to do the same thing with Perl : 

     

    GitHub - fraxken/checkconfig2: CA UIM Checkconfig2 (not created to be a probe).

    Starter guide · fraxken/perluim Wiki · GitHub  (or use my framework to do something similar).

     

    My checkconfig3 works with my perluim framework, but it contains too much custom code to publish.

    I am also working on a new Checkconfig with my NodeJS binding of pu.exe:

     

    GitHub - fraxken/NodeUIM: CA UIM NodeJS interface to work with pu.exe in a full async way. Maybe the first release in few week-end (not a probe too).

     

    I have a complete lua version too (with SQLite & MSSQL support).

     

    Best Regards,

    Thomas



  • 58.  Re: Detecting Malfunctioning Robots

    Posted Feb 14, 2017 12:58 PM

    My LUA NAS Version : 

     

    GitHub - fraxken/checkconfig_lua: CA UIM Checkconfig LUA for NAS 

     

    - Hub information

    - Robot information

    - Probe information

    - Probe configuration parsing to find a specific key

     

    With chunk optimization to reduce execution time by 10%  



  • 59.  Re: Detecting Malfunctioning Robots

    Broadcom Employee
    Posted May 02, 2017 06:17 AM

    Hi

     

    I have marked this as assumed answered, as a temporary solution/workaround has been found and an idea raised.

     

    regards

    Rich