I want to see what other companies are doing with regard to Spectrum health checks. A few things that we manually check are:
I am wondering what else we should be doing, or whether there is a better way to make sure that everything is fine with Spectrum and that it is running healthily. Please let me know.
A few more daily checks:
- CPU/Memory/Disk of the SpectroSERVER and the Spectrum/OC processes
- Document the changes you make in Spectrum on a daily basis, e.g. discoveries, configurations, etc.
- Database maintenance. This is described in the Database Management Guide on the Spectrum bookshelf
- Size of SSDB/DDMDB/SRMDB and proper retention periods and policies
- Model count
- Events on the SSPerformance model in OC and alarms on the VNM model in OneClick for each landscape. Look for alarms on the landscape model in OC as well
- In the Spectrum Control Panel there is a tab to generate a Spectrum health report and schedule it. Check SCP -> control -> Spectrum performance tab
- Check the DDMDB/SRMDB sync at OC -> Administration -> Report Manager -> SpectroSERVER status tab
- Check all the log files, especially vnm.out/stdout.log/archmgr.out/Notifier.out/locserv.out/rcpd.out
- Status of all Spectrum services on all servers
- Database Backups
- Infoview Reports
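The log check from the list above can be partly automated. Here is a minimal Python sketch, not a supported tool: it walks a directory tree, finds the log files named in the list, and prints nothing by itself but returns the lines matching a keyword pattern. The keyword pattern, the function names, and the idea that all logs sit below one root directory are assumptions for illustration; adjust them to your installation.

```python
import re
from pathlib import Path

# Log file names are taken from the checklist above.
LOG_NAMES = ["vnm.out", "stdout.log", "archmgr.out",
             "Notifier.out", "locserv.out", "rcpd.out"]
# Assumption: these keywords are only a starting point, not an official list.
PATTERN = re.compile(r"error|fatal|exception|failed", re.IGNORECASE)


def scan_log(path, pattern=PATTERN, max_hits=20):
    """Return up to max_hits (line_number, text) pairs that match pattern."""
    hits = []
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if pattern.search(line):
                hits.append((number, line.rstrip("\n")))
                if len(hits) >= max_hits:
                    break
    return hits


def scan_directory(root, names=LOG_NAMES):
    """Search root recursively for the named logs and scan each one found."""
    wanted = set(names)
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name in wanted:
            report[str(path)] = scan_log(path)
    return report
```

Calling `scan_directory` with your Spectrum installation directory returns a dictionary that maps each log path to its matching lines, which a cron job could mail out or compare against the previous run.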
Nice list. But for most of the points above, we expect to have a Spectrum alarm if something goes wrong (for example if the Online backup fails, we have an alarm: we don't need to check it every day).
Something I have implemented, because we were hit severely several times, is a check on the content of the Global Collections: I have a cron job that diffs the previous and current content of each GC twice a day, to make sure that they don't get emptied or that we don't lose half of the entries. It has happened several times (for example when the Gen_If_Port type was changed to the LACP_ip_port type for LAG interfaces, or when a new Spectrum version brought a bug that caused "case sensitive yes/no" to fail).
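For anyone who wants to build something similar, here is a minimal Python sketch of the comparison step only. It is not the actual cron job. It assumes that the members of each GC have already been exported by some other means to a plain text file with one model name per line; the function names and the 50% loss threshold are hypothetical.

```python
from pathlib import Path


def load_members(snapshot_path):
    """Read one member name per line, ignoring blank lines."""
    text = Path(snapshot_path).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_snapshots(previous, current, max_loss_ratio=0.5):
    """Compare two membership sets and flag suspicious shrinkage."""
    removed = previous - current
    added = current - previous
    # A GC that had members before and has none now is always suspicious.
    emptied = bool(previous) and not current
    loss_ratio = len(removed) / len(previous) if previous else 0.0
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "emptied": emptied,
        "suspicious": emptied or loss_ratio >= max_loss_ratio,
    }
```

A run would load the snapshot from the previous run and the current one with `load_members`, pass both sets to `compare_snapshots`, and send a report whenever `suspicious` is true. Small changes in a dynamic GC stay below the threshold and only show up in the `added` and `removed` lists.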
Checks to prevent devices from being forgotten in the "maintenance" state.
Hi "VL", losing GC members is definitely a malfunction, and you should raise it in a CA Support case. Of course, if the GC is set up with a dynamic search, the number of members may vary depending on the search used (e.g. when honoring maintenance mode, IF/port status, etc.). I would encourage you to always open a CA Spectrum support case once you see a malfunction; maybe it has already been addressed and a patch/correction is available. Cheers, Joerg
Sure, I have opened a case each time and got a fix. But what I meant is that when something like that happens, you don't see it easily or early enough, and that is where the cron job report helps in noticing that kind of malfunction.
As Joerg suggested, we need to keep an eye on the SSPerformance model / landscape model events to watch server performance. In most scenarios we have alarms, and a few have events, to tell you if something goes wrong.
A few might have neither, in which case you can raise an ER on the IDEA wall. In case you see any malfunction, open a ticket and we will sort it out. In your case it is a dynamic GC, which updates its models based on the rules. If models don't adhere to these rules, they are removed from the GC. I guess you see a few events on the GC model when this happens.
CA Spectrum currently does not include an internal "sizing" tool that automatically checks workload conditions. Still, there are some internal (VNM) events and alarms that will be raised and may indicate an overloaded SpectroSERVER.
Have a look at the SSPerformance application model and find there the most important values: "SS_Idle_Time" and the latency for polling and notification. These 3 graphs, e.g. plotted over 6 hours, will give you a good impression of the workload condition of the SpectroSERVER. Most important: seeing SS_Idle_Time always at 100 indicates that any new thread could launch immediately; seeing it at 0% means that no additional thread can be launched. In that case we expect to see an increase in the polling and notification latency, which then indicates that the SpectroSERVER is seriously delayed in monitoring the devices.
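If you export the values behind these three graphs, the rule of thumb above can be written down roughly as follows. This is only a Python sketch of the reasoning, not something Spectrum provides: the function name, the idle floor of 10 and the growth factor of 1.5 are hypothetical values chosen for illustration, and the samples are assumed to be equally spaced with the oldest first.

```python
from statistics import mean


def assess_load(idle_pct, polling_latency, notification_latency,
                idle_floor=10.0, growth_factor=1.5):
    """Classify load from equally spaced samples (oldest first)."""

    def rising(samples):
        # Compare the average of the newer half with the older half.
        half = len(samples) // 2
        if half == 0:
            return False
        first, second = mean(samples[:half]), mean(samples[half:])
        return second > first * growth_factor

    if not idle_pct:
        return "no data"
    # Idle time near 0 means hardly any additional thread can be launched.
    starved = mean(idle_pct) <= idle_floor
    delayed = rising(polling_latency) or rising(notification_latency)
    if starved and delayed:
        return "overloaded"
    if starved or delayed:
        return "watch"
    return "healthy"
```

With idle time near 100 and flat latencies the function returns "healthy"; with idle time near 0 and growing polling or notification latency it returns "overloaded", which matches the situation described above where the SpectroSERVER falls behind in monitoring the devices.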
CA Spectrum, SpectroSERVER and OneClick server sizing is a technical Services task. Have a look here in the forum for the discussions about "sizing" / Sizer for CA Spectrum. The CA Services team can also help a lot in the case of a big distributed Spectrum installation. Overall, a very simple way is to have a look at the SSPerformance "Performance" tab.
The OC server behaves differently, but there are also "significant" criteria here; see the forum discussion about OC server performance:
Hth, cheers, Joerg
In my experience as a consultant, the implementation of Spectrum differs from customer to customer. I think you outline the basic checks, but in some cases we would also look at the integrations we had set up. We would check failover by forcing certain errors and seeing whether the system reacted properly (e.g. kill the Spectrum process on the Primary and see if the failover happened correctly). We also asked the customer if they had any issues and worked through that list, as well as a list of 'things you would like to know', e.g. if they had questions about monotonous tasks, we tried to help them with these. For us, a health check involved the customer's overall happiness with Spectrum too. If a person isn't happy to use the tool, then that is a big issue.
We also offered a deployment service if customers had many devices to deploy with specific requirements on container structure, etc. We would automate this and help them do this. We would also use the Performance tool to see if any of the landscapes needed upgrading or if devices needed moving around so that the landscapes could better handle the load.
Common things we checked were trap configuration / device templates (e.g. for Cisco config requirements), device certification, event certification, customisation of OneClick, and custom reporting (since SRM / BOXI is overengineered and they wanted something simpler).
Our customers used to purchase a number of days of support per year, and if they hadn't used them by a certain time of the year before renewal, we would run a health check and also provide some one-to-one training to use up the time. This way they would learn something too.