We are beginning to plan our upgrade from 8.4 to 8.5.1 (Linux hubs and Oracle DB) and are interested to see if there are gotchas or things we need to look out for that are not in the documentation.
I'm not aware of any major common issues with the move up to 8.5.1 - seems like it's been pretty stable.
Be sure to check out the Release Notes.
The big thing I stress with any upgrade is to take backups. Take FULL backups of, at least, the following:
> full primary /Nimsoft/ directory
> full UMP /Nimsoft/ directory
> full database
It's probably a good idea to include any other 'important' servers in your environment, like snmpcollector or HA robot/hubs, etc.
I am on a Windows central hub with MS SQL as the DB and went 8.31 -> 8.51.
Issues I've run into:
- Rules in discovery_server change - hopefully for the better. If you use USM and haven't yet cleaned up after the 8.31 GA discovery_server, expect many previously working web pages to show incomplete information as the device id relationships are reworked.
- The upgrade does a lousy job checking the completion of the SQL upgrade scripts. Expect that several of these will fail silently.
- The new alarm_server/nas combo takes more RAM. You may need to increase the JVM memory settings.
- The password portlet is gone. You need to replace pages that used it with the account admin portlet. Make sure your ACLs are correct.
- You need to manually upgrade the REST interface if you use that
- USM is much, much slower than in 8.31 - by two orders of magnitude in my case. The worst case that eventually loaded took 17 minutes; nothing loaded in less than 30 seconds or so. More often than not your login expires before the load completes.
- Wasp startup is slow. The official comment from support is that 15 minutes for startup - activate to first successful page load - is expected and normal.
- Make sure that policy_engine stays disabled. The upgrade will inactivate it on the central hub, but since it is now replaced, you have to manually disable and remove it wherever else it might be installed.
- EMS is in this new release - it appears to only support a single central hub. Bad news if you use HA or nas replication.
- The Wasp upgrade may fail to install the UMP root. You need to delete the webapps/ROOT directory and the ROOT.war (could be wrong about the name) and redeploy to get it to create.
- The listviewer portlet didn't deploy correctly on the first attempt - apparently the .war file didn't deploy, so the old version was left in place at the end of the upgrade.
- Expect the UIM and UMP upgrades together to take roughly five hours - longer if you have to call support. And you have to do both at the same time: there are several database changes that break UMP. If you have customers using the portal pages, they will experience an outage for the whole duration.
So this is my short list - there are many more annoyances.
In order to compare the upgrade with what I might experience, what's the size of the environment that you upgraded (monitored devices, hubs, db size, etc.)?
7,000ish robots, 3,050 of which are hubs too.
The Nimsoft database is roughly one TB in size at the moment. 5,500 active alerts, 16 million rows in the alarm transaction table.
I use no discovery agents or snmp collector features.
Garin, just wondering: are you doing any QoS roll-up on the data_engine? We used to have a DB that big, over 1 TB, yet after we adjusted the roll-up periods we're now down to < 200 GB. Our numbers: 0-14 days raw, 15-140 days hourly roll-up, 141-490 days daily roll-up.
I looked into using the rollup, but at least in older versions nothing had access to the data in the hourly and daily tables, so it wasn't easily usable. Instead I just threw RAM at the problem and have been reasonably OK with the speed of reporting against it. In my testing, the views that combine the various historic tables just weren't fast enough to be useful.
I also enabled table partitioning. It didn't speed anything up from what I could tell but it did seem to stop the slowdowns that happen with growth. And the nightly data pruning maintenance was able to finish since it moves from deleting by date to just dropping the oldest partitions.
Today I'm keeping 105 weeks of data for most things where that makes sense (disk usage, database usage, etc. - things you would forecast hardware demand from off a steady trend). Things like net_connect pings I keep for only a couple of weeks. CPU and memory I keep for 10 weeks. So far that seems to satisfy my customers pretty well.
There's always that one person who demands the metric values between 10:00AM and 2:00PM on 2/14/2016 and won't leave you alone until it happens.
"- The upgrade does a lousy job checking the completion of the SQL upgrade scripts. Expect that several of these will fail silently. "
How did you notice this and how do you fix?
In USM, the group counts were zero. There's a probe callback on the nis server to force the migration to happen.
SQL insert errors in the nas log: deploy the older version, then redeploy the newer version. Seems that it needed a couple of pops to get it right.
USM had some behavioral issues. Redeploying the USM probes helped that.
"Deploy older version then redeploy newer version. Seems that it needed a couple pops to get it right."
Just to be clear: if I see the SQL insert errors in the nas log, redeploy the older version of the nis_server probe (in 8.4 it's version 3.5.1) and then try deploying the newer 8.5.1 version?
SQL errors in the NAS log: deploy older nas version, then redeploy newer nas version.
Count of zero systems in the group display of USM: on the nis_server there is a callback called migrate_groups - run that.
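For reference, a probe callback like that is usually fired from the command line with the `pu` utility that ships with UIM (the install path, credentials, and hub/robot address below are placeholders for your environment; the callback name is as given above):

```
/opt/nimsoft/bin/pu -u <admin_user> -p <password> /<domain>/<hub>/<robot>/nis_server migrate_groups
```

You can also invoke the same callback from the probe utility window in Infrastructure Manager (Ctrl+P on the selected probe).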
I have now updated two production UIM environments with lots of hubs and tunnels, and both have had issues when hub 7.90 is used: tunnels going up and down all the time. It seems to have issues on both Windows and Linux platforms; support confirmed that a downgrade to the latest 7.80 HF is needed.
And also the UMP ROOT folder corruption was seen in one portal server.
Vaguely reassuring to know that I wasn't alone in some of my experiences.
Based on my usage, I'd argue that the hub 7.72 release (ftp://UIMuser:CnIa24uJ@ftp.ca.com/UIM_Probe_Hotfixes/hub772.zip) is probably a better choice than the 7.80HF21 release. It depends, though, on which defects cause more pain.
So you'd say stay away from the 7.90 hub version and continue using 7.80HF22 now? We've been using HF21 this whole time.
Funny how the HF22 notes say it will all be fixed in 7.90.
ttahkapaa, was there a specific defect for the 7.90 hub that they said to use 7.80HF22 for? We're planning on upgrading to 8.5.1 next week, and I have a few 7.90 hubs but haven't seen any issues yet.
Well, the defect we saw was that the whole infrastructure became pretty much useless. Both envs had tunnel hubs that connected to all "customer hubs". The one updated today lit up like a Christmas tree as soon as we updated those tunnel hubs. It was even a bit difficult to get them downgraded, because the IM connections stayed up for such a short time that we had to be pretty quick to distribute the hub package. We still have the primary hub running the 7.90 hub version, and at least it looked to function pretty well that way.
The other env had some issues with (customer) Windows hubs opening so many handles in the OS that the whole OS died.
Both envs have lots of hubs and tunnels. No issues were found in test envs with only one tunnel or so.
One additional issue found today was with robot_update v7.90 on two Windows servers: it failed to run the pre-install command, something like "rename_library_files.bat" (I can't remember the name correctly). After renaming those three files manually, everything works OK. Have not yet opened a case for this; possibly something strange with those servers.
Just got info from support that hub 7.91 should come out, possibly already this week.
Okay, hub 7.91 is out. I'm now waiting for info from real tunneled environments on how it works before I start updating those.
Kind of reassuring to know that I wasn't the only one with this issue. 7.9 caused queues to back up and the primary to become unresponsive - the OS was up but the robot goes offline, causing all secondary queues to back up. Restarting the robot gets things moving again. I have since downgraded to 7.80HF21, which has improved things; however, I have been noticing that randomly the bulk size of the secondary queues changes from 100 down to 1, causing some queues to back up.
HUB bulk size suddenly dropping from 100 to 1 is a known defect.
The problem happens for GET queues that have the default bulk size assigned (greyed out - not specifically declared).
The workaround is to declare a bulk_size of 100 in the GET queue definition.
What exactly do we need to declare in hub.cfg, and in what section, to fix this?
I'm running hub v7.80HF22, and we've seen weird issues where alarm flow just stops since I upgraded to 8.4SP2. I see that my nas queue on the primary hub is at bulk size = 1 at the moment. In my Queue tab the nas definition shows Bulk Size = 60, grayed out.
Do I have to manually edit it and set it to 100 in raw mode?
nas will always be a bulk size of 1 as the nas probe is single threaded.
Currently there is no way to make nas process multiple alarms at the same time.
Thanks Gene. Btw, the other queue I see at size 1 is the audit probe. I have it defined with bulk size 50, but is that also single threaded? Thank you.
I should have said nas processes alarms asynchronously rather than single threaded as Garin so kindly pointed out
As to your question about audit: the audit probe has not been updated in quite some time, so currently, yes, it only has a bulk size of one.
This probe can have an issue in some environments with the new hub version; there is a KB article on this.
Because I'm a fan of semantics, I need to offer a correction to the statement "nas probe is single threaded". Nas is definitely not single threaded - on Windows mine shows 23 active threads. It does, though, process work that is sequential in nature. Consider what would happen if your nas queue was backed up, you had both an alert open and an alert close in that queue, and nas read records out of that queue in something other than arrival order: nothing good would happen.
So, nas is required to process records out of its inbound queue one at a time in order to maintain the chronology of the events.
That does not mean that when nas is processing these events it's not also doing other things. If you have a scheduled job for instance it will run simultaneously with reading events from queue.
Also consider what happens when a probe crashes. If it has not in some way cached that block of records so that it can reprocess on startup, and is also able to figure out what was partially done, completed, or not started, then you lose the whole block of data. If you read one at a time, your worst-case loss is the content of that single message. And since it caused the crash, you probably don't want to keep retrying it over and over and repeating the crash.
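The ordering argument above can be shown with a toy example (the event names and state model here are invented for illustration, not actual nas internals): replaying an open followed by a close leaves the alarm cleared, while the reversed order leaves it stuck active.

```python
# Toy illustration of why alarm events must be processed in arrival order.
# Event names and structure are invented for the example, not nas internals.

def replay(events):
    """Apply open/close events sequentially; return the set of active alarm ids."""
    active = set()
    for action, alarm_id in events:
        if action == "open":
            active.add(alarm_id)
        elif action == "close":
            active.discard(alarm_id)
    return active

in_order = [("open", "a1"), ("close", "a1")]
out_of_order = [("close", "a1"), ("open", "a1")]

print(replay(in_order))      # alarm correctly cleared -> set()
print(replay(out_of_order))  # alarm stuck active -> {'a1'}
```

Same two messages, different arrival order, different final state - which is why the queue has to be drained chronologically even though other nas work can run in parallel.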
Regarding alarm flow stopping, is that happening in the alarm_enrichment part of nas or the nas probe itself? For myself, every upgrade has required me to increase the amount of memory allocated for alarm_enrichment. My current setting starts with 2GB RAM and the limit is set to 4GB.
Garin, my issue was that the alarm_enrichment probe would stop processing anything. We would then notice that, hey, there had been no alerts for 20 minutes, and have to do a full stop/start on the nas probe to get it back up and running. I have my memory on that probe set to 2GB max and, from the support case, also threw in the options for the alarm_enrichment probe to automatically restart itself if it gets to 90% of the allocated 2GB. But I've hit situations where it just stops processing while only using 100MB of memory. It never restarted, so it's a full stop/start again to fix.
Right now running the nas v4.94 version and with these options specified on this page
Tech Tip: Alarm Enrichment probe not processing alarms
but only the following set in the nas:nas.cfg,
lower_memory_usage_threshold_percentage = 0.90
upper_memory_usage_threshold_percentage = 0.90
memory_usage_exceeded_threshold = 1
post_max_age = 60
Does sound a lot like what I was facing. The way it was explained to me was that when the outstanding queue of events for alarm_enrichment gets backed up, it moves from processing blocks of data to processing single pieces. The justification for this logic escaped me when it was originally explained, and I'm sure it hasn't improved since. The trick to keeping things working was to make sure that alarm_enrichment never got to the point where it determined it was backing up. Doing that was a mix of making sure the block read size was big enough (but not too big), that there was way more than what appeared to be enough RAM, and adjusting the memory usage thresholds. On top of that, I have a logmon probe that watches the size of the alarm_enrichment process and boots it if it's too big.
The other thing I've noticed is that alarm enrichment gets progressively slower over time. Periodic restarts can be therapeutic for it apparently.
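The watchdog idea described above (restart alarm_enrichment when it grows too big) boils down to a simple threshold check. A minimal sketch, assuming a 2048 MB cap and a 90% trigger like the thresholds quoted earlier; in a real deployment the check would run from logmon or a scheduled job and the restart would go through the controller or `pu`, not the placeholder comment shown here:

```python
def should_restart(rss_mb, limit_mb, threshold=0.9):
    """Return True when resident memory crosses the restart threshold.

    rss_mb    - current resident set size of the alarm_enrichment process, in MB
    limit_mb  - configured memory cap (e.g. the 2048 MB -Xmx setting)
    threshold - fraction of the cap at which to recycle the probe
    """
    return rss_mb >= limit_mb * threshold

# With a 2048 MB cap and a 0.9 threshold, the trigger point is 1843.2 MB.
print(should_restart(1900, 2048))  # True  -> recycle the probe (e.g. via pu/controller)
print(should_restart(1000, 2048))  # False -> leave it running
```

A periodic restart on a schedule, as mentioned above, is the even simpler variant of the same idea when the probe degrades gradually rather than ballooning.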
I'm sorry for having you confused.
Your problem with NAS (probably alarm_enrichment) is not related to my post.
The defect in my post only affects HUB-to-HUB subscribers (a GET queue in one hub subscribing to an ATTACH queue in another hub).
Due to the defect, I suggest that you specify bulk_size=100 in the GET queue definition in hub.cfg.
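In practice that means editing the queue definition in the <queues> section of hub.cfg (via Raw Configure or the file itself). A sketch of what that looks like - the queue name, remote queue name, and address below are placeholders, and the key names follow the usual hub.cfg layout:

```
<queues>
   <my_get_queue>
      active = yes
      type = get
      remote_queue_name = attach_queue_on_remote_hub
      address = /DOMAIN/REMOTE_HUB/hub
      bulk_size = 100
   </my_get_queue>
</queues>
```

Restart the hub probe after the change so the explicit bulk_size takes effect instead of the greyed-out default.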
Thanks for all the great info in these replies! To the CA guys on this board, what is CA's plan to fix these issues?
Let me know if anyone has faced any issues with device discovery. I have integrated with Spectrum, and the upgrade must not cause any impact there.
Depending on what version of discovery you are coming from, this version has a bunch of new rules and flexibility in the correlation process. Everyone seems to be learning how to use it and not everything is as easy as it would be hoped. If you are unfamiliar with the new rules format you will probably find the documentation unsatisfying.
The issue I was facing with the old discovery is that it was unable to recognize information from net_connect as belonging to a robot and so you'd get multiple entries in cm_computer_systems, one for the host and one for the device. This then broke anything that assumed robot name would be unique in cm_computer_systems.
The new version allows one to accommodate this to some extent, but now what I am seeing is that it seems to oscillate between the several device ids as being the master.
I've not had much success in figuring out why.
Upgraded from 8.4 SP2 to 8.5.1.
I also had a couple of secondary hubs have issues with discovery agent. I get the following error when starting discovery in USM -
"An error occurred while starting discovery on .Please check the agent status and try again." The discovery still goes on to run however.
If I delete a device using the remove_master_devices_by_cskeys method, it doesn't get rediscovered (even though it is a robot connected to the hub). After running discovery in USM it finds the device but doesn't link its QoS.