Spectrum v10.1.1 stopped working this morning (fortunately on our test instance) with this message:
Aug 04 08:48:17 WARNING at CsWorkSched.cc(173): Low thread resources detected. Work scheduler requesting 2min/40max threads, but not available.
Has anyone seen this before?
(I have opened a case and will post the answer here when I get one.)
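For anyone who wants to check how often this warning appears in their own logs, here is a minimal Python sketch. It assumes the warning lines look exactly like the one quoted above; the pattern is built from that single line, so other Spectrum versions may word it differently. The function name and the sample list are only illustrative.

```python
import re

# Pattern derived from the one warning line quoted in this thread (assumption:
# other versions or locales may format the line differently).
LOW_THREADS = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) WARNING at "
    r"(?P<source>[\w.]+)\((?P<line>\d+)\): Low thread resources detected\. "
    r"Work scheduler requesting (?P<min>\d+)min/(?P<max>\d+)max threads"
)


def find_low_thread_warnings(lines):
    """Return one dict per 'Low thread resources' warning found in lines."""
    hits = []
    for text in lines:
        match = LOW_THREADS.match(text.strip())
        if match:
            hits.append({
                "stamp": match["stamp"],      # timestamp as printed, no year
                "source": match["source"],    # source file named in the warning
                "min": int(match["min"]),     # minimum threads requested
                "max": int(match["max"]),     # maximum threads requested
            })
    return hits


# Sample input: the line from this thread plus an unrelated line.
sample = [
    "Aug 04 08:48:17 WARNING at CsWorkSched.cc(173): Low thread resources "
    "detected. Work scheduler requesting 2min/40max threads, but not available.",
    "Aug 04 08:48:18 some other message",
]
hits = find_low_thread_warnings(sample)
for hit in hits:
    print(hit["stamp"], hit["source"], hit["min"], hit["max"])
```

To run it against a real file, pass an open file object instead of the sample list; counting hits per day gives a rough idea of when the thread shortage started.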
The internal SS threads are low due to a lot of processing. The only way to know what it's processing is to review <SPECROOT>/SS/.moot.trace for clues (when you receive this message, the code automatically generates a thread stack dump). I don't believe we have any "general" reason for this at this time. What's the case #? I'll take a look...
Thanks! The case ID is 00470238.
No problem ☺ Can you upload a copy of the /SS/.moot.trace for me?
Yes. I have already uploaded part of it, and I will add the full file.
The trace file is showing a ton of external polling. It looks like the polling is for these 3 threads:
- Do you have a device with a very large IP address table? You can run a sniffer to see which device Spectrum is querying for this, then review that device's ipAdEntAddr table
- Did you turn on port polling? If so, on what, and on how many interfaces?
- Did you create any SpectroWATCHes? If so, how extensive are they, and how many models are they applied to?
I think the problem is most likely the reading of the address table. I would suggest we work this through the ticket and update this post once we have a solution.
One more question: are you going to take on the case? No one has replied to it since I opened it this morning as a level-2 case. I have killed Spectrum and restarted it. It seems to be running fine so far, but I really need to understand what happened: I was planning to upgrade our production system to v10.1.1 as well, and after this experience on our test instance I cannot continue blindly. Thanks again, Veronique
Sorry, but this is not the correct answer, as I still have no final answer in the case I opened.
I restarted the Spectrum instance from a backup and it crashed again about 14 days later.
Could someone please remove the "correct answer" flag?
The number of threads started to grow on the 1st of August, as you can see in the snapshot. As this is a test Spectrum instance, the set of models it contains is quite stable. So I suspect that the problem is more on the Spectrum side.
I hope I can get help to dig deeper. Thanks, Veronique
My apologies for "dropping off" yesterday, as I had meetings to attend and an S1 case to assist on. I cannot take ownership at this time due to a 10.2 project that I am working on; however, I will try to continue to assist. I see that you have been working with one of my colleagues on the issue and have provided further data. I'll take a look and circle back with my colleague...
Cool. Thanks for the help! Cheers, Veronique
We have the same problem, but in production. I already opened a ticket, but it is still without a solution.
Our Spectrum instance freezes during an incident.
All I can do is 'kill -9' the SpectroSERVER process and load an SSdb from the last known backup.
Sometimes I try waiting 10-20 minutes, but the SpectroSERVER is still offline.
The stack dumps for your problem are different. They are showing an issue with fault isolation calculations. This could either be due to a modeling problem/configuration or an issue with the Spectrum code. Generally, we can take a look at the db to see if there is a modeling issue. I see from the case you are unable to provide the SSdb so I would suggest looking for modeling issues that are known to cause performance problems:
1. WA_Links connected to more than 2 models
2. Fanouts with more than 50 connections
3. Modeling loops – you can check the tomcat log file and the VNM.OUT to see if there are modeling loops
4. Is there a major outage occurring when this hang happens? If so, is there something different or special about the device, ports, connections, etc.? For example, is there a device with thousands of ports going down, or a WLC going down that affects thousands of APs? Things like that.
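If you can get a list of models with their type and connection count out of your landscape by whatever means you have, the first two checks above can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the model names, and the function name are all hypothetical, and nothing here calls a Spectrum API. The two thresholds are the ones from the list.

```python
# Thresholds taken from items 1 and 2 of the list above.
WA_LINK_MAX_MODELS = 2
FANOUT_MAX_CONNECTIONS = 50


def flag_risky_models(models):
    """Flag WA_Links and Fanouts that exceed the thresholds.

    models: iterable of (name, model_type, connection_count) tuples.
    This layout is a made-up export format, not something Spectrum produces.
    """
    findings = []
    for name, model_type, connections in models:
        if model_type == "WA_Link" and connections > WA_LINK_MAX_MODELS:
            findings.append((name, f"WA_Link connected to {connections} models"))
        elif model_type == "Fanout" and connections > FANOUT_MAX_CONNECTIONS:
            findings.append((name, f"Fanout with {connections} connections"))
    return findings


# Hypothetical inventory, for illustration only.
inventory = [
    ("wan-link-a", "WA_Link", 2),
    ("wan-link-b", "WA_Link", 3),
    ("fanout-core", "Fanout", 120),
    ("fanout-lab", "Fanout", 50),
]
findings = flag_risky_models(inventory)
for name, reason in findings:
    print(f"{name}: {reason}")
```

Models sitting exactly at the limit (2 models, 50 connections) are not flagged, since the list says "more than".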
Otherwise we can have our engineering team review the code to see if there is an inefficiency.
Hope that helps
Sorry for the late answer. I already checked items 1 to 3 from your list. We use two landscapes, and we are going to migrate the second landscape into the first so that only one remains. I think fault isolation calculation will improve once we finally merge the infrastructure, which was divided a few years back, into one landscape. The interesting thing is that it happens only when we have a small network outage. But if the monitoring system is frozen during a failure, it's hard to say what really happened on the monitored network.
Thank you for your interest and your help.
Your issue is probably the same problem with the SNMP stack. If you haven't already applied one of the patches, please open a case to get the patch for the version of Spectrum you are running:
I have the same problem with Spectrum version 10. The SpectroSERVER application consumes 100% of the CPU, and the only visible problem is that events and alarms are no longer displayed.
I asked the following question in the community "How to improve CA Spectrum 10.1 performance" and my case ID is 00668450.
Were you able to solve the problem?
Note: I think the problem is related to the number of CPUs and threads of the SpectroSERVER. See the manual "CA Spectrum - 10.1 to 10.1.2_ENU - 20161012", page 562, "Threads and Thread Latency".
Hi, The problem seems to have been solved in v10.2. So you need to upgrade. Cheers, Veronique
We updated to version 10.2 more than three weeks ago.
I noticed the following in the Performance View when the CPU rises.
Poll Threads In Use rises to the maximum.
There are only 188 devices.
When I stop the SS it looks like the next image.
Thanks very much Veronique
Most likely you’ll need Spectrum_10.02.00.PTF_10.2.020.
It took a little while to get the tech doc done as I needed to confirm the patch info was finalized:
Please open a case and request this patch and one of us will upload it to the case for you.
Perfect, I will test it and report back on the results.
Thank you very much.