DX Unified Infrastructure Management


Upgrade to 20.3 fails on configuring data_engine

  • 1.  Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 03, 2020 04:48 AM
    Hi all,

    Can anyone shed any light on why this install is failing when it attempts to configure the data_engine probe?

    The log is attached. It suggests a connectivity issue, but I've checked connectivity between the servers and all seems fine.

    ------------------------------
    CA - UIM administrator
    ------------------------------

    Attachment(s)

    uimserver_ia_install.log   3.31 MB


  • 2.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 04, 2020 03:36 PM
    It looks like the database update completed correctly and the installer then failed while deploying the data_engine:
    strResultString=Failed, strInstState=Not Deployed
    Session error, Unable to open a client session for :48000: Connection refused (Connection refused)

    If this is what it seems, it's a strange case: the connection from ADE running on the primary hub was refused when ADE tried to deploy to that same primary hub.
    I'm not sure of the cause, so here are some general ideas:
    If you have any custom packages in the Nimsoft archive folder, move them out.
    Make sure the virus scanner is disabled.
    I don't know enough to say whether the Linux firewall could be involved, but it's worth checking, along with SELinux.
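
    To test the "connection refused" theory directly, the controller port can be probed from the primary hub while the installer is at the data_engine step. This is only a minimal sketch: 48000 is the port from the error above, the host value is a placeholder, and `port_is_open` is a hypothetical helper name, not part of UIM.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: run on the primary hub itself; replace the host as needed.
    host = "127.0.0.1"
    state = "open" if port_is_open(host, 48000) else "closed or unreachable"
    print(f"{host}:48000 is {state}")
```

    If the port reports closed only for a short window during the deployment, that would point at the service restarting rather than at a firewall or SELinux rule.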




    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 05, 2020 07:05 AM
    Thanks David,

    I've noticed that during the ADE configuration the contents of /opt/nimsoft/jre/java_folder are removed and not replaced, and the ADE probe then errors when loading.  I have to manually extract the JRE files and quickly restart the ADE probe in IM before the install times out.  After that the data_engine configuration starts and it deploys, but then UIM restarts itself.  I'm guessing the connection is refused because the service isn't listening on that port at that moment.

    ------------------------------
    CA - UIM administrator
    ------------------------------



  • 4.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 05, 2020 07:56 AM
    Yep, the controller is restarting during the deployment of the data_engine.  Controller log extract:


    Nov 5 12:42:39:167 [139936862943040] 0 Controller: Selecting robotip from configuration. config_robotip = Primary_NMS_IP, cglob robotip = Primary_NMS_IP, local_ip_validation = 1, validate_ip_suggestion = 0, strict_ip_binding = 0
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: --------------------------------------------------------------------------------------------------------
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: ----- Robot controller 9.31 [Build 9.31.1501, Sep 17 2020] started -----
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: Name = Primary_NMS, IP = Primary_NMS_IP, Port = 48000
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: OS = UNIX / Linux / Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: Domain = MDS
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: Primary HUB = /MDS/Primary_NMS/Primary_NMS Primary_NMS_IP
    Nov 5 12:42:39:168 [139936862943040] 0 Controller: Loglevel = 0, Logfile = controller.log
    Nov 5 12:42:39:198 [139936862943040] 0 Controller: Running as user root (0)
    Nov 5 12:42:39:198 [139936862943040] 0 Controller: -----
    Nov 5 12:42:39:199 [139936862943040] 0 Controller: Controller on Primary_NMS port 48000 started
    Nov 5 12:42:40:139 [139936862943040] 0 Controller: _ProcStart - Probe 'hub' - starting
    Nov 5 12:42:42:101 [139936862943040] 0 Controller: Hub localhost(Primary_NMS_IP) contact established
    Nov 5 12:42:42:126 [139936862943040] 0 Controller: _ProcStart - Probe 'distsrv' - starting
    Nov 5 12:42:43:462 [139936862943040] 0 Controller: _ProcStart - Probe 'hdb' - starting
    Nov 5 12:42:44:586 [139936862943040] 0 Controller: _ProcStart - Probe 'mpse' - starting
    Nov 5 12:42:45:629 [139936862943040] 0 Controller: _ProcStart - Probe 'alarm_enrichment' - starting
    Nov 5 12:42:46:058 [139936862943040] 0 Controller: _ProcStart - Probe 'baseline_engine' - starting
    Nov 5 12:42:47:014 [139936862943040] 0 Controller: _ProcStart - Probe 'prediction_engine' - starting
    Nov 5 12:42:48:008 [139936862943040] 0 Controller: _ProcStart - Probe 'discovery_agent' - starting
    Nov 5 12:42:49:331 [139936862943040] 0 Controller: _ProcStart - Probe 'cm_data_import' - starting
    Nov 5 12:42:50:210 [139936862943040] 0 Controller: _ProcStart - Probe 'ppm' - starting
    Nov 5 12:42:51:112 [139936862943040] 0 Controller: _ProcStart - Probe 'ems' - starting
    Nov 5 12:42:51:533 [139936862943040] 0 Controller: login - unauthorized probe (Primary_NMS_IP/35024)
    Nov 5 12:42:52:030 [139936862943040] 0 Controller: _ProcStart - Probe 'automated_deployment_engine' - starting
    Nov 5 12:42:53:310 [139936862943040] 0 Controller: _ProcStart - Probe 'nas' - starting
    Nov 5 12:42:54:114 [139936862943040] 0 Controller: _ProcStart - Probe 'data_engine' - starting
    Nov 5 12:42:55:451 [139936862943040] 0 Controller: _ProcStart - Probe 'ems' - starting
    Nov 5 12:43:04:185 [139936862943040] 0 Controller: _ProcStart - Probe 'udm_manager' - starting
    Nov 5 12:43:05:008 [139936862943040] 0 Controller: _ProcStart - Probe 'maintenance_mode' - starting
    Nov 5 12:43:06:172 [139936862943040] 0 Controller: _ProcStart - Probe 'sla_engine' - starting
    Nov 5 12:43:07:002 [139936862943040] 0 Controller: _ProcStart - Probe 'qos_processor' - starting
    Nov 5 12:43:08:051 [139936862943040] 0 Controller: _ProcStart - Probe 'nis_server' - starting
    Nov 5 12:43:09:309 [139936862943040] 0 Controller: _ProcStart - Probe 'discovery_server' - starting
    Nov 5 12:43:10:534 [139936862943040] 0 Controller: _ProcStart - Probe 'mon_config_service' - starting
    Nov 5 12:43:11:156 [139936862943040] 0 Controller: _ProcStart - Probe 'ace' - starting
    Nov 5 12:43:15:039 [139936862943040] 0 Controller: _ProcStart - Probe 'trellis' - starting
    Nov 5 12:44:03:253 [140582783477568] 0 Controller: Selecting robotip from configuration. config_robotip = Primary_NMS_IP, cglob robotip = Primary_NMS_IP, local_ip_validation = 1, validate_ip_suggestion = 0, strict_ip_binding = 0
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: --------------------------------------------------------------------------------------------------------
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: ----- Robot controller 9.31 [Build 9.31.1501, Sep 17 2020] started -----
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: Name = Primary_NMS, IP = Primary_NMS_IP, Port = 48000
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: OS = UNIX / Linux / Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: Domain = MDS
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: Primary HUB = /MDS/Primary_NMS/Primary_NMS Primary_NMS_IP
    Nov 5 12:44:03:254 [140582783477568] 0 Controller: Loglevel = 0, Logfile = controller.log
    Nov 5 12:44:03:285 [140582783477568] 0 Controller: Running as user root (0)
    Nov 5 12:44:03:285 [140582783477568] 0 Controller: -----
    Nov 5 12:44:03:285 [140582783477568] 0 Controller: Stopping processes from previous run
    Nov 5 12:44:03:285 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to hub (27545)...
    Nov 5 12:44:09:286 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:09:286 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to distsrv (27638)...
    Nov 5 12:44:10:286 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:10:286 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to hdb (27643)...
    Nov 5 12:44:12:287 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:12:287 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to mpse (27644)...
    Nov 5 12:44:13:287 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:13:287 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to alarm_enrichment (27657)...
    Nov 5 12:44:14:287 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:14:287 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to baseline_engine (27671)...
    Nov 5 12:44:15:287 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:15:287 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to prediction_engine (27702)...
    Nov 5 12:44:16:287 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:16:287 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to discovery_agent (27723)...
    Nov 5 12:44:17:288 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:17:288 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to cm_data_import (27748)...
    Nov 5 12:44:18:288 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:18:288 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to ppm (27781)...
    Nov 5 12:44:19:288 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:19:288 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to automated_deployment_engine (27861)...
    Nov 5 12:44:20:288 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:20:288 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to nas (27887)...
    Nov 5 12:44:27:289 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:27:289 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to ems (27928)...
    Nov 5 12:44:28:289 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:28:289 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to udm_manager (27971)...
    Nov 5 12:44:29:290 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:29:290 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to maintenance_mode (27995)...
    Nov 5 12:44:30:290 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:30:290 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to sla_engine (28021)...
    Nov 5 12:44:31:290 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:31:290 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to qos_processor (28043)...
    Nov 5 12:44:41:291 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:41:291 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to nis_server (28082)...
    Nov 5 12:44:42:291 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:42:292 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to discovery_server (28104)...
    Nov 5 12:44:43:292 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:43:292 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to mon_config_service (28127)...
    Nov 5 12:44:44:292 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:44:292 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to ace (28143)...
    Nov 5 12:44:45:292 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:45:292 [140582783477568] 0 Controller: ProcessControl: Sending SIGTERM signal to trellis (28198)...
    Nov 5 12:44:46:292 [140582783477568] 0 Controller: ProcessControl: Child exited
    Nov 5 12:44:46:293 [140582783477568] 0 Controller: Controller on Primary_NMS port 48000 started
    Nov 5 12:44:46:991 [140582783477568] 0 Controller: _ProcStart - Probe 'hub' - starting
    Nov 5 12:44:49:078 [140582783477568] 0 Controller: Hub localhost(Primary_NMS_IP) contact established
    Nov 5 12:44:49:138 [140582783477568] 0 Controller: _ProcStart - Probe 'distsrv' - starting
    Nov 5 12:44:50:417 [140582783477568] 0 Controller: _ProcStart - Probe 'hdb' - starting
    Nov 5 12:44:51:054 [140582783477568] 0 Controller: _ProcStart - Probe 'mpse' - starting
    Nov 5 12:44:52:001 [140582783477568] 0 Controller: _ProcStart - Probe 'alarm_enrichment' - starting
    Nov 5 12:44:59:257 [140243720660800] 0 Controller: Selecting robotip from configuration. config_robotip = Primary_NMS_IP, cglob robotip = Primary_NMS_IP, local_ip_validation = 1, validate_ip_suggestion = 0, strict_ip_binding = 0
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: --------------------------------------------------------------------------------------------------------
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: ----- Robot controller 9.31 [Build 9.31.1501, Sep 17 2020] started -----
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: Name = Primary_NMS, IP = Primary_NMS_IP, Port = 48000
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: OS = UNIX / Linux / Linux 3.10.0-693.17.1.el7.x86_64 #1 SMP Sun Jan 14 10:36:03 EST 2018 x86_64
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: Domain = MDS
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: Primary HUB = /MDS/Primary_NMS/Primary_NMS Primary_NMS_IP
    Nov 5 12:44:59:258 [140243720660800] 0 Controller: Loglevel = 0, Logfile = controller.log
    Nov 5 12:44:59:289 [140243720660800] 0 Controller: Running as user root (0)
    Nov 5 12:44:59:289 [140243720660800] 0 Controller: -----
    Nov 5 12:44:59:289 [140243720660800] 0 Controller: Stopping processes from previous run
    Nov 5 12:44:59:289 [140243720660800] 0 Controller: ProcessControl: Sending SIGTERM signal to hub (28590)...
    Nov 5 12:45:05:290 [140243720660800] 0 Controller: ProcessControl: Child exited
    Nov 5 12:45:05:290 [140243720660800] 0 Controller: ProcessControl: Sending SIGTERM signal to distsrv (28683)...
    Nov 5 12:45:06:290 [140243720660800] 0 Controller: ProcessControl: Child exited
    Nov 5 12:45:06:290 [140243720660800] 0 Controller: ProcessControl: Sending SIGTERM signal to hdb (28691)...
    Nov 5 12:45:08:290 [140243720660800] 0 Controller: ProcessControl: Child exited
    Nov 5 12:45:08:291 [140243720660800] 0 Controller: ProcessControl: Sending SIGTERM signal to mpse (28693)...
    Nov 5 12:45:08:291 [140243720660800] 0 Controller: ProcessControl: Unable to send stop signal to process mpse (28693)
    Nov 5 12:45:09:291 [140243720660800] 0 Controller: ProcessControl: Child exited
    Nov 5 12:45:09:291 [140243720660800] 0 Controller: ProcessControl: Sending SIGTERM signal to alarm_enrichment (28706)...
    Nov 5 12:45:09:291 [140243720660800] 0 Controller: ProcessControl: Unable to send stop signal to process alarm_enrichment (28706)
    Nov 5 12:45:10:291 [140243720660800] 0 Controller: ProcessControl: Child exited
    Nov 5 12:45:10:291 [140243720660800] 0 Controller: Controller on Primary_NMS port 48000 started
    Nov 5 12:45:11:000 [140243720660800] 0 Controller: _ProcStart - Probe 'hub' - starting
    Nov 5 12:45:13:121 [140243720660800] 0 Controller: Hub localhost(Primary_NMS_IP) contact established
    Nov 5 12:45:13:172 [140243720660800] 0 Controller: _ProcStart - Probe 'distsrv' - starting
    Nov 5 12:45:14:454 [140243720660800] 0 Controller: _ProcStart - Probe 'hdb' - starting
    Nov 5 12:45:15:192 [140243720660800] 0 Controller: _ProcStart - Probe 'mpse' - starting
    Nov 5 12:45:16:643 [140243720660800] 0 Controller: _ProcStart - Probe 'alarm_enrichment' - starting
    Nov 5 12:45:17:006 [140243720660800] 0 Controller: _ProcStart - Probe 'baseline_engine' - starting
    Nov 5 12:45:18:147 [140243720660800] 0 Controller: _ProcStart - Probe 'prediction_engine' - starting
    Nov 5 12:45:19:197 [140243720660800] 0 Controller: _ProcStart - Probe 'discovery_agent' - starting
    Nov 5 12:45:20:541 [140243720660800] 0 Controller: _ProcStart - Probe 'cm_data_import' - starting
    Nov 5 12:45:21:006 [140243720660800] 0 Controller: _ProcStart - Probe 'ppm' - starting
    Nov 5 12:45:22:000 [140243720660800] 0 Controller: _ProcStart - Probe 'ems' - starting
    Nov 5 12:45:22:063 [140243720660800] 0 Controller: login - unauthorized probe (Primary_NMS_IP/37938)
    Nov 5 12:45:23:063 [140243720660800] 0 Controller: _ProcStart - Probe 'automated_deployment_engine' - starting
    Nov 5 12:45:24:178 [140243720660800] 0 Controller: _ProcStart - Probe 'nas' - starting
    Nov 5 12:45:25:560 [140243720660800] 0 Controller: _ProcStart - Probe 'ems' - starting
    Nov 5 12:45:25:621 [140243720660800] 0 Controller: _ProcStart - Probe 'data_engine' - starting
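
    To see how often the controller restarted, the startup banners in controller.log can be counted. This is a minimal sketch that assumes the log format matches the extract above; `controller_starts` and the regular expression are illustrative and not part of UIM.

```python
import re
from typing import Iterable, List, Tuple

# Matches the startup banner, e.g.
# "Nov 5 12:42:39:168 [1399...] 0 Controller: ----- Robot controller 9.31 [...] started -----"
START_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}:\d{3})\s+"
    r"\[(?P<tid>\d+)\]\s+\d+\s+"
    r"Controller: -+ Robot controller .* started -+\s*$"
)


def controller_starts(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, thread id) for every controller startup banner."""
    starts = []
    for line in lines:
        match = START_RE.match(line)
        if match:
            starts.append((match.group("ts"), match.group("tid")))
    return starts


if __name__ == "__main__":
    import sys

    # Usage: python count_starts.py /path/to/controller.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for timestamp, thread_id in controller_starts(handle):
            print(timestamp, thread_id)
```

    Run against the extract above, this should pick up the three banners at 12:42:39, 12:44:03 and 12:44:59, each with a different thread id, which fits a controller that is being relaunched rather than one that is only restarting its probes.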

    ------------------------------
    CA - UIM administrator
    ------------------------------



  • 5.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 05, 2020 09:03 AM
    Update the primary hub to java_jre 2.05 before doing the upgrade.

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 6.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 05, 2020 09:49 AM
    Hi David,

    Just done that and reattempted the upgrade, but the service still restarts when configuring the data_engine...

    ------------------------------
    CA - UIM administrator
    ------------------------------



  • 7.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 05, 2020 04:48 PM
    Is ADE updating to 20.3?
    Is it the Nimsoft Robot Watcher service that's restarting when it tries to update data_engine?
    Check the ADE logs for it, under:
    $\Nimsoft\probes\service\automated_deployment_engine
    The relevant section seems to start with:
    INFO ProbeActor - Processing dependencies for package 'data_engine' from section 'scripts'

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 8.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 06, 2020 03:52 AM
    Yes, ADE is updating to 20.3. The install then moves on to the data_engine, the robot watcher service restarts, and the install fails.

    I've attached the ADE log; the service restart occurred at 08:43:30,571.

    ------------------------------
    CA - UIM administrator
    ------------------------------

    Attachment(s)

    ADE.log   158 KB


  • 9.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 06, 2020 10:35 AM
    I thought I'd check /var/log/messages while trying to deploy the data_engine, and got this (see attachment)...

    ------------------------------
    CA - UIM administrator
    ------------------------------



  • 10.  RE: Upgrade to 20.3 fails on configuring data_engine

    Broadcom Employee
    Posted Nov 06, 2020 10:43 AM
    There must be one or more core dumps.

    I found one case and KB Article with this type of error but it was for robot v7.70.

    https://knowledge.broadcom.com/external/article/34334/robot-770-crashing-repeatedly-wont-star.html

    Steve

    ------------------------------
    Support Engineer
    Broadcom
    US
    ------------------------------



  • 11.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 06, 2020 10:49 AM
    Thanks Stephen,

    Just checked those settings and log level is already 0, and ip_binding is already set to no.

    Worth checking though!

    ------------------------------
    CA - UIM administrator
    ------------------------------



  • 12.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 09, 2020 10:25 AM
    The ADE log shows it is still version 20.1. Weren't ADE and java_jre updated prior to running the upgrade installer?

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------



  • 13.  RE: Upgrade to 20.3 fails on configuring data_engine

    Posted Nov 13, 2020 07:07 AM
    Thanks for all the suggestions everyone.

    This was resolved with support by removing the LD_LIBRARY_PATH variable from within the controller.
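
    For anyone hitting the same thing, one way to see whether a running process was started with LD_LIBRARY_PATH set on Linux is to read /proc/<pid>/environ. This is a minimal sketch only: the thread does not say where the variable was defined, so the code just inspects a live process, and `parse_environ` and `process_env` are hypothetical helper names.

```python
import sys
from pathlib import Path
from typing import Dict


def parse_environ(raw: bytes) -> Dict[str, str]:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip empty trailing entries and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_env(pid: int) -> Dict[str, str]:
    """Read the environment a Linux process was started with."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


if __name__ == "__main__":
    # Usage: python check_env.py <pid of the process to inspect>
    # Reading another user's process normally requires root.
    value = process_env(int(sys.argv[1])).get("LD_LIBRARY_PATH")
    print("LD_LIBRARY_PATH is not set" if value is None else f"LD_LIBRARY_PATH={value}")
```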

    ------------------------------
    CA - UIM administrator
    ------------------------------