DX Unified Infrastructure Management

 How to clear removed robots from the DB

Mike Bowling posted Jun 18, 2021 11:13 AM
So here's my issue:

Running 20.3.3 with 2372 managed robots. However, we have 3403 discovered systems.

These 900+ systems only show up in the OC and are listed as 'device'.  These are servers that are long gone but for some reason....

- Only show up after the mgmt server has been restarted
- Are not referenced in ANY of the robot.sds files on the parent hubs (yes...looked at that article about deleting sds files)
- Do NOT show up when a discovery is run (we do not run scheduled discoveries)
- Do appear in the DB in the CM_COMPUTER_SYSTEM table, but as a device with cs_type of A, da_id of NULL, and a state of 0 (apparently deleting them didn't clean this out properly)

How can I clean just these systems out cleanly and quickly?

TIA

Mike
David Michel
When deleting them, was Prevent Rediscovery selected?
If so, they should have been added to CM_BLACKLIST_COMPUTER_SYSTEM and should not show up again.

As for quickly removing them, I'm not sure. There is the old way using the callback:
In IM select the discovery_server probe and press Ctrl+P to open the Probe Utility.
Click the drop down arrow for the 'Probe commandset' and select remove_master_devices_by_cs_keys.
At the bottom for csKeys enter the device cs_key.
Separate multiple cs_keys with a comma.
Click the green arrow at the top to run the callback.
Note it will take some time for the removal process to complete and for the difference to show up in front-end reporting.

The device cs_key is available in the CM_COMPUTER_SYSTEM table.

However downside with that is they are not added to CM_BLACKLIST_COMPUTER_SYSTEM, and there isn't a callback for that either.
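If there are hundreds of devices, the comma-separated csKeys argument for that callback can be generated from a query export rather than typed by hand. A minimal sketch (the chunk size is an assumption; check whether your callback has an argument length limit before batching):

```python
# Build comma-separated csKeys arguments for the
# remove_master_devices_by_cs_keys callback, chunked so no single
# argument grows unmanageably long. Chunk size of 50 is an assumption.
def build_cskeys_args(cs_keys, chunk_size=50):
    """Return a list of comma-separated cs_key strings."""
    return [
        ",".join(cs_keys[i:i + chunk_size])
        for i in range(0, len(cs_keys), chunk_size)
    ]

# Example: cs_key values as exported from CM_COMPUTER_SYSTEM
keys = ["A1B2", "C3D4", "E5F6"]
print(build_cskeys_args(keys, chunk_size=2))  # ['A1B2,C3D4', 'E5F6']
```

Each resulting string can then be pasted into the csKeys field of the Probe Utility, one batch at a time.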
Garin Walsh
Also note that you will have two types of entries - host and device. Host will be tied to a robot. Device is something that's detected (like a network adapter, net_connect profile, or database datasource) but isn't a robot.

Typically discovery_server finds these when it pulls the niscache entries. 

As such you may need to find where these are coming from and remove the triggering data too.
Mike Bowling
Since we do not run automated or scheduled discoveries I'm not sure the whole blacklist option would mean anything.  Finding the triggering data I believe is what I am looking for.  Since these were deleted (most likely with the discoverable option not selected) they shouldn't be there and they shouldn't even show up on reboot\restart.  Would like to know where they are coming from.

I can delete them again from the OC and run the query again to see if the deletion actually removes them from the CM_COMPUTER_SYSTEM table initially, and then whether something at a restart places them back in there. Not sure of the whole theory of operation on restarts\discovery\db population, but my feeling has been that it is getting these from somewhere, and that's what I want to address.
Mike Bowling
ok.  So I went to the OC, identified 8 systems that no longer exist.  They were in as 'Devices'.  I deleted them from the OC and they did indeed disappear from the DB.  I then restarted the primary hub and within 10 minutes they were back, both in the OC and the DB.

So where are these systems being discovered from and how do I clear them without having to build a big exception list?
David Michel
As the old story goes, there is the long short way and the short long way.
Since it seems the long short way of deleting the devices with Prevent Rediscovery selected isn't a good fit...
...so, the short long way:
You can query CM_DEVICE for dev_ip or dev_name; probe_name will show the probe it is coming from.
If it is niscache:
Navigate to the robot via Infrastructure Manager and select the controller probe and press Ctrl-P. The probe utility window will pop up.
Click the Options icon (second icon from the right), select the 'Expert Mode' checkbox, and click OK.
Now, in the "Probe Commandset" select _nis_cache_clean and click the Green Arrow to send the command request.
Now, in the "Probe Commandset" select _reset_device_id_and_restart and then click the Green Arrow to send the command request.
The niscache is now clean and the robot is restarting.

If it is a probe: in IM go to Tools > Find, enter the probe name, then check the configuration of each instance to see which one is monitoring the device.
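The CM_DEVICE lookup described above can be illustrated with a self-contained sqlite3 mock (the table and column names follow the thread, but this is a simplified stand-in for the real UIM schema, and the sample rows and IP are placeholders):

```python
import sqlite3

# Simplified stand-ins for the UIM tables; the real schemas have
# many more columns. Data below is invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CM_COMPUTER_SYSTEM (cs_id INTEGER, name TEXT, dedicated TEXT)")
con.execute("CREATE TABLE CM_DEVICE (cs_id INTEGER, dev_ip TEXT, dev_name TEXT, probe_name TEXT)")
con.execute("INSERT INTO CM_COMPUTER_SYSTEM VALUES (17098, 'cne-ecl-xa-01', 'device')")
con.execute("INSERT INTO CM_DEVICE VALUES (17098, '10.0.0.1', 'cne-ecl-xa-01', 'niscache')")

# Join a ghost 'device' entry to the probe that reported it.
row = con.execute("""
    SELECT d.dev_name, d.probe_name
    FROM CM_COMPUTER_SYSTEM cs
    JOIN CM_DEVICE d ON d.cs_id = cs.cs_id
    WHERE cs.dedicated = 'device'
""").fetchone()
print(row)  # ('cne-ecl-xa-01', 'niscache')
```

When probe_name comes back as niscache, the entry is being re-fed by a robot's niscache directory rather than by a monitoring probe's configuration.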
Garin Walsh
Discovery_server is always doing this regardless of your discovery_agent and discovery scope settings. 

There are a handful of tables that record where some of this info came from - I don't have the names off hand so maybe someone else can fill that in - you can also run a test discovery through the probe callback and that will indicate where some of this data is coming from.

But ultimately it's kind of a guessing game where it came from.

Mike Bowling
So I have no discovery profiles configured that have these servers in them.  In fact I only have two and both of them were single device discoveries for one server.  So if it is indeed the discovery_server that's causing these to come back in...where is it getting its list from?  

Sorry if I keep repeating myself....just not quite understanding this yet.  I keep thinking to myself that it is finding a list "somewhere" and this list is what needs to be purged.

The niscache info is going to be a robot by robot solution won't it?  How would that apply if these are servers\robots that don't even exist anymore...or would this be the niscache on each parent hub?
David Michel
Unless something custom and totally unexpected was set up in your environment, there is no list adding these systems back.
If they were added via remote monitoring, then the niscache involved is on the robot doing that monitoring.
Garin Walsh
The "list" you are referring to is the niscache directory on each robot. This directory gets a small file with some identifying information about each resource being monitored. The discovery server is constantly traversing all robots and checking the contents of niscache to make sure that it knows about everything in each robot's niscache. So if you had set up net_connect to ping one of these since-removed robots, there are at least two ways for that robot to have been introduced into the database: when the robot was running and connected, or when the discovery server polled the other robot running net_connect and pulled the niscache entry indicating the first robot was being polled.

If you then decommissioned and removed that robot, there's no trigger to add it back when it connects up, but there is still the niscache entry (even if you removed the robot from the net_connect config), and so every time the discovery server pulls that niscache and sees the entry for the removed robot, it will add it back in.

The callback _reset_device_id_and_restart mentioned by David will clean these entries, but you have to do that on any of the remaining systems that might have, across all time, ever done anything with that removed system - net_connect and the database probes (because they can use the data source name as a source) are prime places to correct this.

Or you can just ignore them still being in your network - these unreachable systems don't cause any harm - generally speaking.

Mike Bowling
The fact that these systems keep showing up and they are on several different hubs....baffling!!

Just not understanding why....when these systems have been completely shut down, removed from UIM, and don't exist in the DB, AND we have no discovery profile configured with them in it....how are they showing back up...???
Garin Walsh
Please go back and read the posts again. This is not baffling - it is expected behavior.

You, in the description of what you have done, don't ever indicate clearing niscache on any related robots, and that's where this is almost assuredly coming from. There is no harm in clearing niscache, and I have actually found that scheduling this as a weekly or monthly job on each robot helps eliminate issues like this.

Look at CM_COMPUTER_SYSTEM - get the cs_id for the system that keeps coming back, then look up that cs_id in CM_DEVICE. That should give you an idea where it is coming from.

If not, query CM_DEVICE by the device IP and the device name. It can be a chore to find this because the discovery server is also using its rules to either combine or separate this information.
Mike Bowling
Very familiar with the niscache folder...I have to clear it out periodically for a bunch of Citrix systems that re-provision themselves every night...

ok....so I ran this command to find one of the devices that keeps showing up (on one of the hubs):
select * from CM_COMPUTER_SYSTEM where da_id is null and origin = 'dcr-cahub-ms-02_hub' and dedicated = 'device' order by name

Picked the first one and noted its cs_id then ran this query:
select * from CM_DEVICE where cs_id = '17098'

(17098 is the cs_id of server cne-ecl-xa-01)

results show
dev_id:     D1CE567FF5D581D526CF368D8EE33467E
cs_id:      17098
dev_src_id: D4B7675B00A7DE1E73B38DDD859E7D280
dev_ip:     xx.x.xx.xxx
dev_name:   cne-ecl-xa-01
probe_name: niscache

So, according to previous posts, niscache seems to be providing this information.  Instructions were to "navigate to the robot in IM...."

I can't....these 'devices' do not show up in IM or the Admin Console.  These devices only come back into the Operator Console when they are "discovered".

Apologies if I'm simply misunderstanding the instructions provided...maybe I'm just missing something??


Garin Walsh
"navigate to the robot in IM...." where the niscache entry came from. 

There's not a lot of information supplied about which robot that is but at least you have confirmation about why these are showing up.
Andrew Cooper
Mike

Following on from Garin's answer: this niscache phantom-object issue is a real problem and a pain in our dev and QA infrastructures, as servers are created for a project and then removed afterwards. So we have a process which goes across a hub (and all the associated robots) and forces a niscache clean with the controller callbacks that David mentioned (_nis_cache_clean and optionally _reset_device_id_and_restart) to remove all these phantom entries. We use the SDK to run the command, but other people use Excel to create the list of pu commands to do the same thing. In the long run, don't think of it as a one-off exercise; the only difference is the frequency at which you need to do it.
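A sketch of that list-generation step, turning a robot inventory into pu command lines (the robot names, domain/hub names, and credential placeholders here are all assumptions; adjust the pu syntax to your installation):

```python
# Emit one pu command per robot to invoke the niscache-clean callback
# on its controller probe. Addresses and credential placeholders below
# are illustrative only, not real values.
def pu_commands(robots, domain="myDomain", hub="myHub",
                callback="_nis_cache_clean"):
    return [
        f"pu -u $UIM_USER -p $UIM_PASS /{domain}/{hub}/{robot}/controller {callback}"
        for robot in robots
    ]

# Example: print the commands for two (hypothetical) robots
for cmd in pu_commands(["robot-a", "robot-b"]):
    print(cmd)
```

The printed lines can be dropped into a batch or shell script, which gives the same result as building the list in a spreadsheet.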

We do all the hubs (including production) at different rates based on how many robot/server changes are going on. This process (along with the standard device/robot removal in AC/OC) has significantly reduced the number of "phantom" devices, to the point that the operators believe that if it is in UIM then it really exists, rather than "may be real", which is important.

Regards, Andrew
Mike Bowling
Thanks Andrew, Garin, and David....for helping me to understand this a little better.  I have a few steps to try now to see if I can get this cleaned up.  Would be interested in the SDK or spreadsheet "process" that Andrew mentioned if possible.

I had also opened a ticket on this and Samer with Support provided some great responses too....this one being probably what I needed the most to understand the process: niscache - what it is, what it does (broadcom.com)

This appears to be pretty much what was being said all along...just couldn't wrap my head around all the info until I saw it laid out like this.

Thanks again!!
Garin Walsh
If you want to automate the cleanup of niscache, this is a possibility in dirscan: essentially, once any file in niscache reaches 30 days old, the cache is cleared and rebuilt. This also fixes issues if the host name or IP of the server is changed - especially useful if you have systems in a dev environment that are turned off for extended periods of time and are subject to losing DHCP leases (it doesn't fix all issues related to that, but some):

<watchers> overwrite

<NisCache> overwrite
active = yes
name = NisCache
description =
schedules = NisCacheSched
directory = ../../../niscache
pattern = *.[cmr][eio]*
user =
password =
age_of_oldest = yes
age_check_all = no
check_dir = no
exclude_directory_pattern =
dir_age_check = no
recurse_dirs = no
qos_dir_exists = no
qos_number = yes
qos_space = yes
qos_age = yes
qos_response_time = no
number_command = ../../../bin/pu controller _reset_device_id_and_restart "" ""
age_command = ../../../bin/pu controller _reset_device_id_and_restart "" ""
file_size_type = individual
response_time_type = individual
<number_condition> overwrite
limit = 15000
type = le
</number_condition>
<age_condition> overwrite
limit = 30
type = le
unit = days
usercreationtime = No
</age_condition>
<space_condition> overwrite
unit = Mb
</space_condition>
<space_condition_delta> overwrite
unit = Mb
</space_condition_delta>
<file_size_condition> overwrite
unit = Kb
</file_size_condition>
<response_time_condition> overwrite
unit = milliseconds
</response_time_condition>
<message> overwrite
file_number_alarm = FileNumberAlarm
file_age_alarm = FileAgeAlarm
file_size_alarm = FileSizeAlarm
file_space_alarm = FileSpaceAlarm
file_delta_space_alarm = FileDeltaSpaceAlarm
response_time_alarm = ResponseTimeAlarm
directory_check_alarm = DirectoryCheckAlarm
file_error = FileError
dir_age_alarm = DirAge
</message>
</NisCache>
</watchers>

<schedules> overwrite
<NisCacheSched> overwrite
type = Recurring Event
time_type = once
updated = yes
recurrent_event_spec = RRULE:FREQ=DAILY;INTERVAL=1;BYHOUR=04;BYMINUTE=00;BYSECOND=00
</NisCacheSched>
</schedules>

Note that this needs pu installed too. And not that I want to see this change, because it makes things easy, but the fact that pu doesn't require credentials to delete files and restart the robot is kind of a security hole...
David Michel
This KB doc provides an example of how to use an external file with a list of targets for a pu command.
After removing a robot.sds file from a UIM hub machines are not populating in IM
Article Id: 189420
https://knowledge.broadcom.com/external/article?articleId=189420
Mike Bowling
We have a very, very simple implementation of UIM. Quite simply, robots reporting to a parent hub only; no net_connect, dirscan, or other probes scanning other robots for info. I'm hoping that performing the niscache clean process on the hubs will take care of most of this.
Mike Bowling
Wanted to give you an update...

Thanks to the information you provided along with some direction from Support as well as other resources, I have gained a better understanding of the theory of operation of the discovery process. 

As a result I have taken the following steps (in the order listed) to rectify the issue:

  • Executed the _nis_cache_clean callback, on all the hubs in the environment only
  • Deleted the ‘devices’ in the Operator Console...deselecting the ‘Prevent rediscovery’ option (do not want a blacklist of 1000+ servers)
  • Restarted the primary hub

 All “ghost” devices are now gone except for two FS mounts that are still showing up. Investigating where these are coming from (my guess is a Linux system that is being monitored).

Thanks again for the direction\info.