First of all, below is an example of what I mean by Multi Layer Environment:
Hub Connection Hub Connection Hub
PrimaryHub -> (NameServices) ProxyHub1 -> (Tunnel) CustomerHubA
PrimaryHub -> (NameServices) ProxyHub1 -> (Tunnel) CustomerHubB
PrimaryHub -> (NameServices) ProxyHub2 -> (Tunnel) CustomerHubC
PrimaryHub -> (NameServices) ProxyHub2 -> (Tunnel) CustomerHubD
The intention is to segregate the connections through the Proxy Hubs and avoid overloading the Primary Hub, in addition to dealing with the different networks involved.
The communication between the PrimaryHub and the Proxies is made via NameServices because they are on the same network (internal environment), while the CustomerHubs need a Tunnel to reach the Proxies since they are in different environments.
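As a rough sketch of the customer-side leg, the tunnel client section in a CustomerHub's hub.cfg might look like the fragment below. This is an illustrative assumption, not the actual configuration from this environment: the hostname is a placeholder, 48003 is the usual default tunnel port, and the certificate/password settings are omitted for brevity:

```
<tunnel>
   active = yes
   <client>
      <1>
         active = yes
         server = proxyhub1.internal.example
         port = 48003
      </1>
   </client>
</tunnel>
```

The PrimaryHub-to-Proxy leg needs no such section, since NameServices discovery works within the shared internal network.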
We actually have this configured already, but we ran into a problem related to the way the discovery server works.
The article below states the following:
Technical Details regarding discovery_server communications
- If there is NO tunnel, nametoip will return the actual IP/port of the robot itself on port 48000 and the discovery_server will try to connect directly to the robot - so this may fail if a) there is no route to the robot from the primary or b) there is no tunnel between hubs.
So basically, if there is no direct communication from the CustomerHub to the PrimaryHub, we need a Tunnel configured? If so, it makes no sense to build a Multi Layer environment like we did, since it is impossible to communicate with all the customers' networks directly without the ProxyHubs.
We bumped into this situation after finding many monitored devices not displaying any metrics within the USM portlet on the UMP Portal; the reason is that the Discovery Server probe seems to be trying to reach the robots through a direct connection instead of using the tunnel (see the logs in the article mentioned above).
Could anyone share any thoughts?
Your understanding is perfectly correct.
In your scenario, the discovery_server probe should connect to any robot devices (at the customer site) via the proxy hubs.
Please note: your customer site hubs (and the underlying robot devices) do not need a direct connection to the primary hub.
Your customer site hubs need to talk to proxy hubs via tunnel.
That is what I expected to happen; however, the article says exactly the opposite: that I need a direct connection between the Customer Site Hub and the Primary Hub.
Only after I configured a Tunnel between the Customer Site Hub and the Primary Hub was the discovery server able to actually find the robots and display them correctly on the UMP Portal.
So the Proxy Hubs are only being used for the alert and QoS metric queues.
Note that the documentation is wrong in this case, or at least misleading. The discovery server does get the IP and port of the tunnel hub managing the tunnel to the client hub, but that tunnel hub is there to take this traffic directed at the local IP and port, put it on the tunnel, and forward it on to the client hub. That is one of the core functions of the hub probe.
Where you will run into problems with this scenario is with how the network of hubs figures out how to get a message to an arbitrary hub.
You will notice that UIM has no real concept of static routing and DNS. It has some features that address small aspects of this, static name list or NAT, but nothing that addresses it network wide.
The current approach is that each hub, when it learns something new, sends its current block of knowledge about the hub network to every system it knows about. And every hub is doing this all the time.
Couple that with the extremely unreliable traversal of multiple hubs and you create a possibility for inaccuracies.
Consider what happens if ProxyHub2 is down for a period of time. HubA continues to know about HubB, Proxy1 and Primary but sees HubC and HubD as down because they are not reachable. Proxy2 is also seen as down. Similarly HubC and HubD see everything as down except themselves.
When Proxy2 comes back online, it immediately learns about HubC when it connects its tunnel, so it sends out Proxy2 and HubC as being up. Since Proxy2 only knows about HubC, that's where it sends its info. HubC ignores that packet of data because it matches its version of reality.
Along with this, Proxy2 is broadcasting this information via UDP, so Primary and Proxy1 might learn that Proxy2 is live before Proxy2 can send its packet of information containing the correct status of HubC.
So presume that Primary gets the message that Proxy2 is up before the tunnel from HubC starts. Primary now learns that Proxy2 is up, and since that is new information, Primary starts sending its package of hub information, which includes HubC being down. It sends to Proxy1, to HubA, and to HubB. In the meantime, HubC gets its tunnel going, and Proxy2 sends out its packet of information, including HubC being up. The problem is that Primary could, at this instant, send Proxy2 its packet of information that includes HubC being down. Because this is new information, Proxy2 marks HubC as down and then sends this information out to every hub it knows about.
But HubC is up...
And there's an insidious defect here in the hub: if you time things just right, HubC will remove itself from the list of hubs that are up. At that point, HubC can't even resolve itself locally.
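The race described above can be sketched as a toy model. This is an illustrative simplification, not the actual hub protocol: each hub keeps a map of hub name to (status, version) and accepts any entry whose version beats its own, even when the payload is stale in wall-clock terms.

```python
# Toy model of the hub-status gossip race: a newer-versioned but
# factually stale packet overwrites fresh knowledge.

def merge(view, packet):
    """Accept entries from `packet` whose version beats the local one."""
    for hub, (status, version) in packet.items():
        if hub not in view or version > view[hub][1]:
            view[hub] = (status, version)

# Proxy2 just reconnected its tunnel and learned HubC is up:
proxy2 = {"Proxy2": ("up", 6), "HubC": ("up", 4)}

# Race: Primary's packet, built while HubC was unreachable but carrying
# a freshly bumped version number, arrives at Proxy2 first.
merge(proxy2, {"HubC": ("down", 5)})

print(proxy2["HubC"])  # -> ('down', 5): HubC marked down although it is up
```

Proxy2 would then gossip this wrong "HubC down" state onward, which matches the behavior described above.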
That explains why some hubs so often become unreachable (red in the IM Console).
Thank you for sharing your thoughts on the matter!
I'm sorry if the doc is confusing you.
However, the doc does not say anything about any necessity of direct communication between the primary hub and the customer site hubs.
To put it simply: a tunnel is only unnecessary if all devices (all hub devices and all robot devices in a UIM domain) are in the same network (assuming there are no network/port challenges).
Are you creating the discovery get/attach queues on the client hub and its respective tunnel server, and then from the tunnel server to your primary hub? You also have to deploy the discovery_agent to the client hub, which sends the client robot info back to the UIM database. We have multiple levels of hubs in our environment and it works just fine.
Yes, we have the discovery, message and alert queues configured and also the discovery_agent probe with the same version as the Primary Hub:
PrimaryHub (GET queues) -> (NameServices) ProxyHub1 (GET/attach queues) -> (Tunnel) CustomerHubA (attach queues)
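For reference, a GET queue of this kind on ProxyHub1 would appear in its hub.cfg roughly like the fragment below. The queue name, hub address, and bulk size are illustrative assumptions, not values from this environment:

```
<queues>
   <discovery_customerA>
      active = yes
      type = get
      remote_queue_name = discovery
      address = /Domain/CustomerHubA/customerhuba-robot/hub
      bulk_size = 100
   </discovery_customerA>
</queues>
```

The matching attach queue on CustomerHubA holds the probe_discovery messages until ProxyHub1 pulls them over the tunnel, and a similar get/attach pair moves them from ProxyHub1 up to the PrimaryHub.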
Could you please check whether your Primary Hub can reach any robot on the farthest Hub in the environment through a direct connection? If not, check whether those robots are correctly discovered on the UMP Portal.
Thank you for the answer!
If by reach you mean pinging the remote client boxes successfully from the primary hub, then no, not at all. They are on different networks, with firewalls in between, etc. But we get all robots created, and all their QoS shows up in our UMP. Yes, they have the correct info in the UMP, as the local robot picks up the IP, OS, MAC, etc. and passes that back as part of the discovery process via the queues.
I meant a telnet connection on port 48000. Anyway, it seems to work properly in your environment; I might as well open a support case to check it out.
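The "telnet to port 48000" check above can also be scripted. A minimal sketch, where the hostname in the usage example is a placeholder for an actual robot address:

```python
# Check whether a robot's controller port (48000 by default) is
# reachable directly, mimicking a manual telnet test.
import socket

def port_open(host: str, port: int = 48000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_open("customer-robot.example.net")` returning False when run from the Primary Hub would confirm there is no direct route to that robot.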
Thanks for the answer!
You also don't mention what version of code you're on. At this point in time, you really want to be on 8.51 with all the relevant hot fixes from here: CA Unified Infrastructure Management Hotfix Index - CA Technologies
At a minimum, the robot_update, hub, and discovery_server patches.
Also note that IP addresses and hostnames are assumed to be unique. Since you are using tunnels, I would assume that is not guaranteed.
As a result, make sure that you change the robot names to be unique, reset the device IDs, and delete the niscache contents.
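Clearing the niscache can be scripted; a minimal sketch is below. The install path varies per system (an assumption you must fill in), and the robot should be stopped before clearing the cache so it regenerates its device ID cleanly on restart.

```python
# Delete everything under <nimsoft_root>/niscache so the robot
# rebuilds its identity data on next start.
import shutil
from pathlib import Path

def clear_niscache(nimsoft_root: str) -> int:
    """Remove all entries under <root>/niscache; return the count removed."""
    cache = Path(nimsoft_root) / "niscache"
    removed = 0
    for entry in cache.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)  # some entries may be directories
        else:
            entry.unlink()
        removed += 1
    return removed
```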
Also make sure to select the "Set QoS source to robot name instead of hostname" in the controller.
And finally, restarting discovery_server can make a world of difference in the accuracy of your data. There's an additional defect in discovery_server (in my opinion; not yet acknowledged by CA) where it forgets about systems over time. For me, a handful stop getting checked each day, so when I come across a system without newly added QoS, I restart discovery. Nine times out of ten, the new values start showing up in a day or two.
The environment is running version 8.51 with Service Pack 1. All probes were fully up to date until this latest release of version 9.
Thank you for the tips; I will check all of that and probably open a support case to investigate this issue further.