DX Unified Infrastructure Management

UIM HUB failover issues

  • 1.  UIM HUB failover issues

    Posted Jun 24, 2019 10:22 AM
    Hi All,
    We have an issue where a secondary server went down due to infrastructure changes and the robots connected to that hub moved over to an HA hub, working as planned. However, after the secondary hub was back up and working, all the robots that moved to the HA hub remained on the HA hub. When will the robots return to the secondary hub? What is the "best practice" for getting those robots to return to the secondary hub as quickly as possible?


    TIA


  • 2.  RE: UIM HUB failover issues

    Broadcom Employee
    Posted Jun 24, 2019 12:52 PM
    https://docops.ca.com/ca-unified-infrastructure-management-probes/ga/en/alphabetical-probe-articles/ha-high-availability/ha-high-availability-release-notes


    The ha probe allows you to manage queues, probes and the NAS AutoOperator in a High Availability setup. The probe runs on the standby Hub. If it loses contact with the primary Hub, it initiates a failover after a defined interval. When the primary Hub comes back online, the probe reverses the failover (failback).

    Check all the prerequisites and, if necessary, open a Support Ticket so we can help troubleshoot and find the root cause of this behavior.

    ------------------------------
    Senior Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 10:47 AM
    Hi Alex, I guess I wasn't clear in the original post. This doesn't involve the primary servers and their HA probes; we do run HA probes on our 2 primary hub servers. This is about our 2 secondary hubs, each of which has a redundant hub assigned as its "specified hub" in the controller probe (Setup tab). When the main secondary hubs went down, all the devices moved to the "specified hub" shown in the controller. When the main secondary hubs came back up, all the robots remained on those redundant "specified hubs" and did not return to the main secondary hubs.

    The only way I have found to get the robots that failed over back to the main secondary hubs is to restart the controller on every one of those robots. After the restart they appear back on the main secondary hubs. This does not seem very efficient. Is restarting all the robots the only way to get them back to their main secondary hubs, or is there a more sophisticated way to make this automatic?


  • 4.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 10:55 AM
    If the process is reproduced in reverse by stopping the secondary hub's robot watcher service, the robots should go back to the original hub.

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 5.  RE: UIM HUB failover issues

    Broadcom Employee
    Posted Jun 25, 2019 11:17 AM
    Please review:
    https://ca-broadcomcsm.wolkenservicedesk.com/wolken/esd/knowledgebase_search?articleId=34303

    ------------------------------
    Gene Howard
    Principal Support Engineer
    Broadcom
    ------------------------------



  • 6.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 12:46 PM

    Hi David,

    I did do that, but it was just a quick stop and start. Is there a certain amount of time I need to leave that hub down before starting the watcher service back up?

     

    Patrick






  • 7.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 01:58 PM
    Did you check out the KB Gene referenced? 
    It indicates the need for broadcast to be enabled at the hub. 
    "When the primary is back up it will send a broadcast to all robots. All robots which receive this broadcast will switch back immediately.
    Then the hub starts to poll the rest of the robots one by one. This may take some time."

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 8.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 02:25 PM

    Got an error when trying to access the doc.

     






  • 9.  RE: UIM HUB failover issues

    Posted Jun 25, 2019 04:43 PM
    Sorry about that, we just discovered the link problem.
    This link will work:
    https://ca-broadcom.wolkenservicedesk.com/external/article?articleId=34303

    FYI, save that link and use it for any KB article by replacing the KB ID number at the end.
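
As a small illustration of the tip above, here is a minimal Python sketch that builds the link for a given KB ID. The function name `kb_url` and the constant `KB_BASE` are made up for this example; only the base URL and the `articleId` parameter come from the link in this post.

```python
from urllib.parse import urlencode

# Base URL taken from the working link in this post.
KB_BASE = "https://ca-broadcom.wolkenservicedesk.com/external/article"


def kb_url(article_id: int) -> str:
    """Return the external KB link for the given article ID (hypothetical helper)."""
    return f"{KB_BASE}?{urlencode({'articleId': article_id})}"


print(kb_url(34303))
```

For ID 34303 this reproduces the link given above.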


    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 10.  RE: UIM HUB failover issues

    Posted Jun 27, 2019 10:19 AM

    Hi David,

    I read the doc. Yes, in theory it sounds good. But we are now days past the infrastructure event, and hundreds of devices still reside on the assigned secondary "specified hub". They don't appear to want to return to the assigned primary on their own, even after restarting the robot on the assigned primary. The "assigned primary" is the hub and hubip named in the robot.cfg file of all the robots. I have restarted several robots manually, and they then return to the assigned hub, but only if I do that.

     

    Patrick

     

     






  • 11.  RE: UIM HUB failover issues
    Best Answer

    Posted Jun 27, 2019 11:01 AM
    So let's see:
    main hub goes down
    by design and as configured, the robots switch to the designated hub
    main hub comes back up and broadcast is enabled
    some robots switch back to the main hub
    some robots stay on the failover hub

    Just some thoughts on this:
    due to the high number of robots at the hub, it is taking longer than expected
    it could be related to the robot version or OS

    Recommendations:
    Notepad++ > open \nimsoft\hub\robot.sds > Find > name > Count. Does the count match the number of robots?
    It could be that over time the file has not cleaned itself up automatically, as it should when robots are removed or reconfigured.
    The rebuild process is to rename the file and then restart the hub. It takes time, so it should be scheduled for after hours.

    Deploy robot & hub 7.97.
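
For anyone who would rather script the robot.sds check than use Notepad++, here is a rough Python sketch. It is only an approximation of the manual steps: it counts raw, case-sensitive occurrences of the text `name` in the file, which assumes the file stores that text as plain single-byte characters (the layout of robot.sds is not described in this thread). The helper names `count_occurrences` and `set_aside` are invented for this example. The rename helper covers only the "rename the file" half of the rebuild; restarting the hub is left as a manual step.

```python
import sys
from datetime import datetime
from pathlib import Path


def count_occurrences(path: Path, needle: bytes = b"name") -> int:
    """Count non-overlapping, case-sensitive occurrences of needle in the raw file bytes."""
    return path.read_bytes().count(needle)


def set_aside(path: Path) -> Path:
    """Rename the file with a timestamp suffix and return the new path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = path.with_name(f"{path.name}.{stamp}.bak")
    path.rename(target)
    return target


# Hypothetical usage: python check_sds.py <path to robot.sds> <expected robot count>
if __name__ == "__main__" and len(sys.argv) == 3:
    sds_path = Path(sys.argv[1])
    expected = int(sys.argv[2])
    found = count_occurrences(sds_path)
    print(f"'name' occurrences: {found}, expected robots: {expected}")
    if found != expected:
        print("Counts differ; consider the rename-and-restart rebuild after hours.")
```

The script itself never renames anything; `set_aside` is there to call by hand during the maintenance window, just before restarting the hub.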


    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------