DX Unified Infrastructure Management

  • 1.  Spooler message delays after hub restart

    Posted Dec 04, 2019 08:17 AM

    I've noticed that all robots since version 7.93 have a new spooler feature (I know it's a feature because support assures me it is intentional design and therefore definitely not a bug).  Following any loss of connectivity between a robot's spooler and the hub, such as when the hub restarts, it can take 30-40 minutes for the spooler to reconnect and start sending messages again.

    The issue looks like this in the spooler log:
    Feb 21 10:02:39:144 [140244922230528] spooler: nimSession - failed to connect session to 10.xxx.xx.xxx:48001, error code 111
    Feb 21 10:07:34:174 [140244922230528] spooler: QueueAdmin - no contact with the hub last 5 minutes
    Feb 21 10:12:34:202 [140244922230528] spooler: QueueAdmin - no contact with the hub last 10 minutes
    Feb 21 10:17:34:228 [140244922230528] spooler: QueueAdmin - no contact with the hub last 15 minutes
    Feb 21 10:22:34:254 [140244922230528] spooler: QueueAdmin - no contact with the hub last 20 minutes
    Feb 21 10:27:34:280 [140244922230528] spooler: QueueAdmin - no contact with the hub last 25 minutes
    Feb 21 10:32:34:306 [140244922230528] spooler: QueueAdmin - no contact with the hub last 30 minutes
    Feb 21 10:33:25:636 [140245022664448] spooler: flushing is set to ON, hub=10.xxx.xx.xx
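
    To put a number on the gap in an excerpt like the one above, a short script can subtract the timestamp of the first failed connection from the timestamp of the line where flushing resumes. This is only a sketch: the function names are made up for illustration, the log lines do not carry a year so 2019 is assumed, and the two marker strings are simply the ones that appear in this excerpt.

```python
import re
from datetime import datetime, timedelta

# Three lines copied from the spooler log excerpt in this post.
LOG = """\
Feb 21 10:02:39:144 [140244922230528] spooler: nimSession - failed to connect session to 10.xxx.xx.xxx:48001, error code 111
Feb 21 10:32:34:306 [140244922230528] spooler: QueueAdmin - no contact with the hub last 30 minutes
Feb 21 10:33:25:636 [140245022664448] spooler: flushing is set to ON, hub=10.xxx.xx.xx
"""

# Timestamp layout seen in the excerpt: "Feb 21 10:02:39:144" (milliseconds after the last colon).
STAMP = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d:\d{3}) ")


def parse_stamp(line, year=2019):
    """Return the timestamp at the start of a log line, or None if there is none.

    The year is an assumption because the log format does not include one.
    """
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S:%f")


def outage_duration(log_text, year=2019):
    """Time between the first failed connection and the next resumed flush."""
    start = None
    for raw in log_text.splitlines():
        line = raw.strip()
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if start is None and "failed to connect session" in line:
            start = stamp
        elif start is not None and "flushing is set to ON" in line:
            return stamp - start
    return None  # no complete outage found in this text


if __name__ == "__main__":
    print(outage_duration(LOG))
```

    For the excerpt above this gives a little under 31 minutes between the failed connection and the point where messages start flowing again.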


    The length of time it takes to reconnect is determined not by some setting in the spooler.cfg, but by the hub_update_interval setting in the robot.cfg.  The documentation describes the hub_update_interval as "Interval, in seconds, at which the controller should send alive or probelist information to the hub".  It mentions nothing to do with the spooler.  It seems to me to be a fairly unhelpful reuse of a config setting.  If I want the spooler to reconnect quickly to the hub after a disruption, I need to set it low, but the lower it is set, the more stress it puts on the hubs for functionality I don't want.

    Whilst Broadcom see this as a feature, I really don't want unnecessary message delays in the environment.  Why delay a critical alert for 30 mins for no reason?

    Has anyone else noticed this issue?  If so, do you have anything in place to mitigate?

    I have raised an idea suggesting that this bug (I mean feature) be fixed.  I'd appreciate it if you could lend your support if you think that this is something that should be fixed.
    https://community.broadcom.com/participate/ideation-home/viewidea?IdeationKey=8b6b62cf-086d-4d25-a85f-a939ce47ea80

    Thanks
    Iain

    ------------------------------
    Iain Randall
    ------------------------------


  • 2.  RE: Spooler message delays after hub restart

    Broadcom Employee
    Posted Dec 04, 2019 10:28 AM
    I would go back to support and ask them to engage dev. 
    Having to wait 10-40 minutes for alarms and messages to flow after a hub restart could be very problematic.

    ------------------------------
    Gene Howard
    Principal Support Engineer
    Broadcom
    ------------------------------



  • 3.  RE: Spooler message delays after hub restart

    Broadcom Employee
    Posted Dec 04, 2019 10:55 AM
    I would definitely suggest testing the latest hotfix and seeing whether you still have the same problem before going back to support.
    ftp://UIMuser:CnIa24uJ@ftp.ca.com/UIM_Probe_Hotfixes/robot_update7.93_HF18.zip
    ftp://UIMuser:CnIa24uJ@ftp.ca.com/UIM_Probe_Hotfixes/release_notes/robot_update793_HF18.txt


    ------------------------------
    Gene Howard
    Principal Support Engineer
    Broadcom
    ------------------------------



  • 4.  RE: Spooler message delays after hub restart

    Posted Dec 05, 2019 07:47 AM
    Thanks for the links, unfortunately I can't seem to get to them.  I have however tested against 7.93, 7.96, 7.97, and 9.20 and they all have the issue.  If we're going to go to the trouble of upgrading 55k robots I want to go to the latest one so it doesn't need to be repeated.


  • 5.  RE: Spooler message delays after hub restart

    Posted Dec 05, 2019 07:32 AM
    Hi Gene

    Thanks for the reply.   I've raised two cases with support for this.  The first, back in Feb (1306288), confirmed the issue and that it was a feature, so I would need to raise an enhancement request, which I did.  It was mentioned that it would be prioritized by development but this does not seem to be the case.  I followed up with a second case recently and they said I just need to wait for product management to schedule it.

    Without it being fixed we're stuck on the downlevel robot 7.70.  However, we're looking at RHEL 8, and if it's not fixed by the time the robot is released to support it then I'll be forced to use the new robot and come up with a custom fix.

    I assume that more customers would be unhappy about this if they realised that it was happening.



  • 6.  RE: Spooler message delays after hub restart
    Best Answer

    Broadcom Employee
    Posted Dec 05, 2019 08:25 AM

    So I reviewed the defect. It looks like the default OOTB value for hub_update_interval is 900 seconds (15 minutes).
    In your case you have raised this to 1800 seconds (30 minutes).

    From the dev team:
    The design of the spooler is such that if the spooler's connection to its hub is broken (as happens in this case when a hub is restarted), the spooler waits to reconnect until it gets the set_hub callback request from the robot controller.

    The set_hub callback is called by the controller after each hub_update_interval. So in the present case, the spooler gets the set_hub callback request after 1800 s (30 min), and only then does it try to connect to its hub to send messages.
    This is the reason why the spooler takes up to 45 min to connect to its hub.
    So to reduce the time the spooler takes to connect to its hub, reduce the hub_update_interval value in robot.cfg to 300, i.e. 5 minutes. The controller will then send more send_alive messages to the hub and also make more set_hub callbacks.

    So it is a trade-off between the hub_update_interval value and network bandwidth. For testing purposes, the hub_update_interval value should be lower, maybe 2 min (120 s), but in a production environment 30 minutes (1800 s) is also fine, as the hub does not crash under normal conditions.
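
    As a rough illustration of that trade-off, the sketch below compares a few hub_update_interval values. It rests on simplifying assumptions that are not from the dev team: the wait for the next set_hub callback is taken to be at most one full interval (ignoring any extra delay on top of that), each robot is assumed to send one alive message per interval, and the 55,000 robot count is the estate size mentioned earlier in this thread, spread across many hubs in practice. The function names are hypothetical.

```python
def worst_case_wait_seconds(hub_update_interval):
    """Longest wait for the next set_hub callback under the simple model.

    Assumes the callback fires once per interval, so a hub that comes back
    just after a callback has to wait almost one full interval for the next.
    """
    return hub_update_interval


def alive_messages_per_hour(robots, hub_update_interval):
    """Alive messages per hour, assuming one message per robot per interval."""
    return robots * 3600 / hub_update_interval


def compare(robots, intervals):
    """Return (interval, worst-case wait in minutes, messages per hour) rows."""
    return [
        (
            interval,
            worst_case_wait_seconds(interval) / 60,
            alive_messages_per_hour(robots, interval),
        )
        for interval in intervals
    ]


if __name__ == "__main__":
    # 120 s, 300 s, 900 s and 1800 s are the values discussed in this thread.
    for interval, wait_min, per_hour in compare(55_000, [120, 300, 900, 1800]):
        print(f"{interval:>5} s  wait <= {wait_min:>4.0f} min  {per_hour:>12,.0f} msgs/hour")
```

    Under these assumptions, dropping from 1800 s to 300 s cuts the modelled wait from 30 minutes to 5 but multiplies the alive traffic by six across the estate.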

    Having a high value is why I had not heard of this before. Unfortunately, I would think this will take a substantial code change to the spooler, which is likely the reason it has been delayed.

    You are correct that this currently behaves the same way, even through the 9.x versions.
    The only option you have is to stay at the 7.70 version or reduce the update interval until this is taken up by the dev team.

    Sorry I could not be of more help.


    ------------------------------
    Gene Howard
    Principal Support Engineer
    Broadcom
    ------------------------------