DX Unified Infrastructure Management

  • 1.  Robot inactive in AIX OS and it activates again without intervention

    Posted Jan 13, 2020 10:32 AM
    Hi, dear community,

    I have many servers running AIX, and the robot on them frequently shows as inactive. When the administrator checks, the nimbus service is running, and within a few minutes the robot inactive alarm clears on its own.

    I checked controller.log and found these events:

    Jan 1 19:30:26:360 [0001] Controller: --------------------------------------------------------------------------------------------------------
    Jan 1 19:30:26:360 [0001] Controller: ----- Robot controller 7.97 [Build 7.97.283, Nov 30 2018] started -----
    Jan 1 19:30:26:360 [0001] Controller: Name = serversecnw, IP = 172.19.202.220, Port = 48000
    Jan 1 19:30:26:360 [0001] Controller: OS = UNIX / AIX / AIX 2 7 00C577F64C00
    Jan 1 19:30:26:360 [0001] Controller: Domain = SERVER370_domain
    Jan 1 19:30:26:360 [0001] Controller: Primary HUB = /SERVER370_domain/hubsec373/hubsec373 172.16.1.63
    Jan 1 19:30:26:360 [0001] Controller: Loglevel = 0, Logfile = controller.log
    Jan 1 19:30:26:367 [0001] Controller: Running as user root (0)
    Jan 1 19:30:26:367 [0001] Controller: -----
    Jan 1 19:30:26:368 [0001] Controller: Stopping processes from previous run
    Jan 1 19:30:26:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to spooler (4915572)...
    Jan 1 19:30:32:368 [0001] Controller: ProcessControl: Child exited
    Jan 1 19:30:32:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to hdb (3932514)...
    Jan 1 19:30:33:368 [0001] Controller: ProcessControl: Child exited
    Jan 1 19:30:33:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to processes (2752792)...
    Jan 1 19:30:34:368 [0001] Controller: ProcessControl: Child exited
    Jan 1 19:30:34:369 [0001] Controller: ProcessControl: Sending SIGTERM signal to cdm (17170916)...
    Jan 1 19:30:40:369 [0001] Controller: ProcessControl: Child exited
    Jan 1 19:30:40:369 [0001] Controller: ProcessControl: Sending SIGTERM signal to net_connect (20054460)...
    Jan 1 19:30:41:369 [0001] Controller: ProcessControl: Child exited
    Jan 1 19:30:41:369 [0001] Controller: ProcessControl: Sending SIGTERM signal to logmon (13369732)...
    Jan 1 19:30:51:370 [0001] Controller: ProcessControl: Process logmon (13369732) still running - terminating
    Jan 1 19:30:51:372 [0001] Controller: Controller on serversecnw port 48000 started
    Jan 1 19:30:52:390 [0001] Controller: Hub hubsec373(172.16.1.63) contact established
    Jan 10 17:01:08:530 [0001] Controller: ReadCacheFile - no return pds specified (file /UIM/nimsoft/niscache/D4BD1B90086414BE093B7B4F76C238CBE.dev, max age 360, min age 0), not fetching information
    Jan 10 17:01:08:530 [0001] Controller: ReadCacheFile - no return pds specified (file /UIM/nimsoft/niscache/DD8F276F07A2C4741C4A31DCCE578CBF6.dev, max age 360, min age 0), not fetching information
    Jan 10 17:01:08:530 [0001] Controller: ReadCacheFile - no return pds specified (file /UIM/nimsoft/niscache/C023B681483F07346C398728B613794EA.ci, max age 360, min age 0), not fetching information
    Jan 10 17:01:08:530 [0001] Controller: ReadCacheFile - no return pds specified (file /UIM/nimsoft/niscache/C030D6448985C5C6F3B1D0EB2CEEA69F4.ci, max age 360, min age 0), not fetching information
    Jan 10 17:01:08:530 [0001] Controller: ReadCacheFile - no return pds specified (file /UIM/nimsoft/niscache/C04E0BE77118B74E771E1A563327C8AEE.ci, max age 360, min age 0), not fetching information

    Another excerpt from the same log:

    Jan 10 18:58:46:184 [0001] Controller: nimGetIpList - getifaddrs failed
    Jan 10 18:58:47:209 [0001] Controller: nimGetIpList - getifaddrs failed
    Jan 10 18:58:48:209 [0001] Controller: nimGetIpList - getifaddrs failed
    Jan 10 18:58:48:209 [0001] Controller: hub hubsec373(172.16.1.63) NO CONTACT (out of resources)
    Jan 10 18:58:49:209 [0001] Controller: nimGetIpList - getifaddrs failed
    Jan 10 18:58:50:330 [0001] Controller: nimGetIpList - getifaddrs failed
    Jan 10 18:58:51:330 [0001] Controller: nimGetIpList - getifaddrs failed

    I have attached the robot logs for more detail.

    Does anyone have an idea why this is happening? My Windows servers do not show this issue, only the UNIX ones.
    Note: I checked communication, firewall, and antivirus, and none of them is the problem.

    Attachment(s)

    logsuim.tar.gz   103 KB 1 version


  • 2.  RE: Robot inactive in AIX OS and it activates again without intervention
    Best Answer

    Posted Jan 13, 2020 10:40 AM
    Usually, when you see lines like these:

    Jan 1 19:30:26:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to spooler (4915572)...
    Jan 1 19:30:32:368 [0001] Controller: ProcessControl: Child exited

    it means the controller terminated unexpectedly. Normally the controller shuts the probes down before it exits; on startup it checks whether any are still running and, if so, stops them.

    I will also presume that the restart at Jan 1 19:30:26:360 was unexpected.

    I suggest looking for core files and opening a support case.
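
    To check how often this happens across many robots, the pattern described above can be searched for automatically. The following is only a rough sketch in Python, not a Broadcom tool: the function name and the regular expressions are assumptions derived from the log lines posted in this thread, and other controller versions may format their messages differently. It reports each controller start that had to stop probes left over from the previous run.

```python
import re

# Line layout as seen in the posted controller.log, e.g.
# "Jan 1 19:30:26:360 [0001] Controller: <message>"
LINE_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) \[\d+\] Controller: (?P<msg>.*)$"
)
SIGTERM_RE = re.compile(r"Sending SIGTERM signal to (?P<probe>\S+) \((?P<pid>\d+)\)")


def find_unclean_restarts(lines):
    """Return [(start_timestamp, [probe, ...])] for every controller start
    that sent SIGTERM to probes still running from the previous run."""
    starts = []          # one (timestamp, probes) entry per start banner
    in_cleanup = False   # True between "Stopping processes..." and "Controller on ... started"
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        msg = match.group("msg")
        if "Robot controller" in msg and "started" in msg:
            starts.append((match.group("ts"), []))
            in_cleanup = False
        elif "Stopping processes from previous run" in msg:
            in_cleanup = bool(starts)
        elif in_cleanup:
            hit = SIGTERM_RE.search(msg)
            if hit:
                starts[-1][1].append(hit.group("probe"))
            elif msg.startswith("Controller on "):
                in_cleanup = False
    # Keep only the starts that actually had leftover probes to stop.
    return [entry for entry in starts if entry[1]]


# A few lines copied from the log excerpt in the first post.
SAMPLE = [
    "Jan 1 19:30:26:360 [0001] Controller: ----- Robot controller 7.97 [Build 7.97.283, Nov 30 2018] started -----",
    "Jan 1 19:30:26:368 [0001] Controller: Stopping processes from previous run",
    "Jan 1 19:30:26:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to spooler (4915572)...",
    "Jan 1 19:30:32:368 [0001] Controller: ProcessControl: Child exited",
    "Jan 1 19:30:32:368 [0001] Controller: ProcessControl: Sending SIGTERM signal to hdb (3932514)...",
    "Jan 1 19:30:33:368 [0001] Controller: ProcessControl: Child exited",
    "Jan 1 19:30:51:372 [0001] Controller: Controller on serversecnw port 48000 started",
]

if __name__ == "__main__":
    for timestamp, probes in find_unclean_restarts(SAMPLE):
        print(timestamp, "stopped leftover probes:", ", ".join(probes))
```

    To run it against a real log, pass the open file instead of SAMPLE, for example `find_unclean_restarts(open("controller.log"))`. Any timestamp it reports is a candidate time to look for a core file.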



  • 3.  RE: Robot inactive in AIX OS and it activates again without intervention

    Broadcom Employee
    Posted Jan 13, 2020 11:53 AM
    Miller, before opening the support ticket, please try setting the following in the controller configuration:
    local_ip_validation = no
    strict_ip_binding = yes

    Also make sure logging is set to loglevel 5 and logsize 95000, so the resulting log can be attached to the support ticket.
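
    As a sketch of where these settings might go: assuming they are edited directly in the robot's robot.cfg, and that controller settings live in a <controller> section (both are assumptions, not stated in this post), the fragment could look like the following. The keys and values are the ones given above; every other key already present in that section stays as it is.

```
<controller>
   local_ip_validation = no
   strict_ip_binding = yes
   loglevel = 5
   logsize = 95000
</controller>
```

    The robot most likely has to be restarted before the change takes effect, so verify this on one server before rolling it out widely.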

    ------------------------------
    Technical Support Engineer
    Broadcom
    ------------------------------



  • 4.  RE: Robot inactive in AIX OS and it activates again without intervention

    Posted Jan 14, 2020 12:59 PM
    Thanks @Garin Walsh and @Alex Yasuda

    I will apply Alex's recommendation on 181 UNIX/Linux servers and monitor the behavior.


  • 5.  RE: Robot inactive in AIX OS and it activates again without intervention

    Posted Jan 14, 2020 03:51 PM
    For what it's worth, the spooler log shows:
    FlushMessages - failed to flush message (communication error)
    Not a direct hit, but this article might still be helpful:
    https://ca-broadcom.wolkenservicedesk.com/external/article?articleId=35021
    spooler: nimSession - failed to connect session to x.x.x.x:48001, error code 79
    spooler: QueueAdmin - no contact with the hub last 15 minutes
    Alex's recommendation may help with that.

    The controller log also has errors indicating connection failures to the hub,
    and NO CONTACT (out of resources).
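
    To see whether these controller errors cluster around the times of the robot inactive alarms, they can be counted per minute. This is a minimal Python sketch, assuming the log line layout shown in the first post; the function name and the two labels are made up for illustration.

```python
import re
from collections import Counter

# Captures the timestamp down to the minute, e.g. "Jan 10 18:58", and the message.
TS_RE = re.compile(
    r"^\s*(\w{3}\s+\d{1,2} \d{2}:\d{2}):\d{2}:\d{3} \[\d+\] Controller: (.*)$"
)

# Label -> substring to look for; both substrings are taken from the posted log.
PATTERNS = {
    "getifaddrs_failed": "nimGetIpList - getifaddrs failed",
    "hub_no_contact": "NO CONTACT",
}


def count_events_per_minute(lines):
    """Return a Counter keyed by (minute, label) for the error patterns above."""
    counts = Counter()
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue
        minute, msg = match.groups()
        for label, needle in PATTERNS.items():
            if needle in msg:
                counts[(minute, label)] += 1
    return counts


# Lines copied from the second log excerpt in the first post.
SAMPLE = [
    "Jan 10 18:58:46:184 [0001] Controller: nimGetIpList - getifaddrs failed",
    "Jan 10 18:58:47:209 [0001] Controller: nimGetIpList - getifaddrs failed",
    "Jan 10 18:58:48:209 [0001] Controller: nimGetIpList - getifaddrs failed",
    "Jan 10 18:58:48:209 [0001] Controller: hub hubsec373(172.16.1.63) NO CONTACT (out of resources)",
    "Jan 10 18:58:49:209 [0001] Controller: nimGetIpList - getifaddrs failed",
    "Jan 10 18:58:50:330 [0001] Controller: nimGetIpList - getifaddrs failed",
    "Jan 10 18:58:51:330 [0001] Controller: nimGetIpList - getifaddrs failed",
]

if __name__ == "__main__":
    for (minute, label), n in sorted(count_events_per_minute(SAMPLE).items()):
        print(minute, label, n)
```

    Comparing the minutes with the highest counts against the alarm timestamps should show whether the alarms follow the getifaddrs failures.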

    It is probably a good idea to update to:
    robot_update-7.97HF7.zip release_notes/robot_update-7.97HF7.txt  

    https://techdocs.broadcom.com/us/product-content/recommended-reading/technical-document-index/ca-unified-infrastructure-management-hotfix-index.html?r=2&r=1

    ------------------------------
    Support Engineer
    Broadcom
    ------------------------------