We have 8 secondary hubs. Two of them get connection problems from time to time; it usually looks as if communication has been down for a short while.
The problem is that the hubs do not reconnect afterwards, even though we can reach them with ping.
Sometimes they stay connected only for hours, but usually it takes a few days before they disconnect.
I forgot to look in the log last time, but will try to capture it next time.
The resolution is to restart the "Nimsoft Robot Watcher" service on the secondary hub, or to restart the hub probe from the primary hub.
My question is whether anyone has had similar problems, and what the causes could be.
Are there any default settings in the hubs (both primary and secondary) that should be changed?
All hubs are version 7.80, and all hubs have the same default settings.
Yes, at any point in time I have about 5% of my 1200 hubs in this state. There is a defect filed for this that's "deferred for future release". I can say that hub 7.72 (which, despite the number, is newer than 7.80) has been the best so far at not having this problem as much as other versions.
Wow, 1200 hubs!? That's a lot. How do you detect and resolve this problem each time it happens?
Interesting that 7.72 is newer than 7.80! I don't have that one in my archive though, only 7.71 from before. Do you know where I can download it?
You've got to have a reason to ask for 7.72. The way CA does versioning is that, generally speaking, the larger the version number the more recent the code. But occasionally a current version (7.80 in this case) attempts to address a defect/functionality issue from a prior version (7.71 in this case) and is unsuccessful. So instead of creating 7.81, they create a 7.72 (same starting point, but with a different approach to the fix). There's nothing wrong with the practice; it's fairly common. What's wrong is inferring release order from the version number string.
As far as detecting this: I watch for the "failed to connect to" queue messages. That's a dead-on indicator of the problem, especially if you also filter on count > 20 or something like that.
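The detection above can be sketched as a small log filter. This is a hedged example only: the exact message format and the hub-name capture are assumptions, not the actual UIM alarm format, so adapt the regex to what your logs really contain.

```python
# Sketch: flag hubs in the hung state by counting "failed to connect to"
# messages per hub and applying the count > 20 filter described above.
# The line format matched here is an assumption; adjust FAIL_RE to taste.
import re
from collections import Counter

FAIL_RE = re.compile(r"failed to connect to (\S+)")

def hung_hub_candidates(log_lines, threshold=20):
    """Return hub names that appear in more than `threshold` connect failures."""
    counts = Counter()
    for line in log_lines:
        m = FAIL_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return sorted(h for h, n in counts.items() if n > threshold)
```

The threshold keeps one-off blips (a brief VPN renegotiation, say) from paging anyone; only a hub that keeps failing for many consecutive intervals shows up.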
There is no easy solution. I did have a hack: a batch file that restarted Nimsoft on each hub every 24 hours. That limited my maximum outage to that period. The problem is that with 1200 hubs, that's a hub disappearing from the environment every 72 seconds on average, and reappearing just as often. My network became saturated with hub-up and hub-down broadcasts: with each of the 1200 hubs telling all the others about the appearance or disappearance of some hub every 36 seconds, there was nothing else my environment could do. It had my central firewall cluster glowing red hot from all the traffic.
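The broadcast-storm arithmetic above checks out, assuming the restarts are spread evenly across the day:

```python
# Sanity check: 1200 hubs, each restarted once per 24 hours, spread evenly,
# with every restart producing two broadcasts (one hub-down, one hub-up).
hubs = 1200
seconds_per_day = 24 * 60 * 60                   # 86400

seconds_per_restart = seconds_per_day / hubs     # a restart every 72 s
events_per_restart = 2                           # down broadcast + up broadcast
seconds_per_event = seconds_per_restart / events_per_restart

print(seconds_per_restart, seconds_per_event)    # 72.0 36.0
```

And since every one of those events is rebroadcast among all the hubs, the per-event cost scales with the hub count as well, which is why the firewalls suffered.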
Then Nimsoft released a controlled, scheduled-in-the-hub restart option that behaved more gracefully with the tunnels, but it only shut down successfully maybe 50% of the time. I had to stop using it because this particular tunnel hang was much more likely to manifest during shutdown, so the scheduled restart just made the situation worse.
So at this point the "solution" is to employ a person who watches for these connect-failed errors, RDPs into the offending device, and restarts the Windows service. Job satisfaction is not high in this position.
I was thinking about deploying WhatsUp Gold to monitor for this and issue the restart... only half kidding, really.
I can't even imagine what problems this would cause if I had as many hubs as you :/ I hope CA solves this as soon as possible!
Do you also have problems with robots that won't fail over to the secondary hub when this hub problem occurs?
This really sucks!
"As soon as possible"? Well, it has been in the vicinity of a year, and still waiting.
As for failover: no go, because the hub hangs on shutdown. The tunnels are down, but the process and connectivity are still there, so HA still sees the hub as up...
Yep, it sucks, but it's not the worst badness in UIM today. For instance, alarm_enrichment only runs for about 15 minutes before requiring a restart. Thankfully, logmon runs longer than that between restarts, so I can restart alarm_enrichment with a log file watcher.
I opened a new case on this with Nimsoft support today. We'll see what they respond...
Garin, do you still have this issue?
Absolutely do. I have 2700 hubs now, too, if you're counting... 100 tunnel proxies and 4 "concentrator hubs", because I have too many tunnel hubs to terminate in one place.
My temporary solution is a Windows scheduled task that restarts the Nimsoft service once every 24 hours. It's ugly and crude, but instead of losing roughly 7 hubs a day to the software hang, I'm down to 1-2, and those only because of hardware faults.
On the other hand, that really plays havoc with the hub address tables; so much chatter.
And then there are all those probes that reset open alarms at startup rather than checking whether the issue is still happening, so I get a lot of unnecessary churn on those.
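For anyone wanting to try the same hack, a minimal sketch of that scheduled task might look like the following. The service name is the "Nimsoft Robot Watcher" mentioned earlier in this thread, and the script path, task name, and run time are placeholders; adjust all of them for your environment.

```bat
rem restart_nimsoft.bat -- crude nightly restart of the Nimsoft service
net stop "Nimsoft Robot Watcher"
net start "Nimsoft Robot Watcher"

rem Register it as a daily task (run once, from an elevated prompt):
rem schtasks /Create /TN "RestartNimsoft" /TR "C:\scripts\restart_nimsoft.bat" /SC DAILY /ST 03:00 /RU SYSTEM
```

If you have many hubs, stagger the /ST start times across hubs rather than restarting them all at once, for exactly the broadcast-storm reasons described above.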
I will say that CA is working diligently towards a resolution but I am unfortunately experiencing a gap between intent and success.
Wow! I only have 8 remote hubs.
I don't get random disconnects anymore, only when there is a problem with the VPN, like a renegotiation or the link going down for whatever reason. But the result is the same: when the connection between the remote and the primary hub goes down, it can't reconnect. The only solution is to restart the Nimsoft service.
I have a case we have been working on for what feels like forever: sending tons of logs to CA, and I've installed maybe five different hub versions. Same problem.
The thing is, after the VPN has been down for maybe a few seconds, I can open any other kind of session between the hubs, like RDP or telnet. It's only the hub that can't reconnect its session.
I'm now trying to figure out how to send a remote command when the hub can't reconnect but the VPN is up again: restart the service on the remote hub from the primary hub whenever the GET queue on the primary hub can't find the remote queue.
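The decision logic for that watcher is simple enough to sketch. This is a hedged outline, not a working integration: the ping, queue-lookup, and restart callbacks are injected placeholders, since the actual checks (ICMP ping, querying the GET queue on the primary hub, and whatever remote-execution mechanism you use to restart the service, e.g. psexec or sc \\host) are environment-specific assumptions.

```python
# Sketch: restart the remote hub only when the network is reachable again
# (ping OK) but the hub's queue still cannot be found on the primary.
# All three probe/action functions are injected so this logic is testable.
def should_restart(ping_ok, queue_found):
    """True only in the stuck state: VPN back up, hub session still dead."""
    return ping_ok and not queue_found

def watch_hub(hub, ping, find_queue, restart):
    """Check one hub; trigger the remote restart if it is in the stuck state."""
    if should_restart(ping(hub), find_queue(hub)):
        restart(hub)
        return True
    return False
```

Guarding on both conditions matters: restarting while the VPN is still down would just add churn, and restarting while the queue is healthy would cause the very outage you're trying to avoid.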
It looks like I've found a solution. Not sure if it will help you, but it should at least be an early step in the troubleshooting.
The fix is in our firewalls behind the VPN tunnel, and we've only tried it with Cisco firewalls.
There is a parameter that is disabled by default and needs to be activated. It keeps TCP sessions "alive" if the tunnel goes down, and reconnects them to the same session when it comes back up. It has worked perfectly so far.
sysopt connection preserve-vpn-flows
You can read about it here:
ASA 8.X: Allow the User Application to Run with the Re-establishment of the L2L VPN Tunnel - Cisco
Important: this parameter needs to be turned on on both sides of the tunnel, and then the tunnel needs to be restarted twice before the parameter takes effect.
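For reference, a minimal sketch of applying this on an ASA; verify the exact syntax against the Cisco document linked above, and note that the hostname here is just a placeholder.

```
! Enable preservation of TCP flows across L2L tunnel drops (disabled by default)
ciscoasa(config)# sysopt connection preserve-vpn-flows

! Verify it took; defaults are hidden unless you ask for the full config
ciscoasa# show running-config all sysopt
```

Remember to apply it on the firewall at both ends of the tunnel, then bounce the tunnel twice as noted above.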