What factors are included in the "Overall Capacity (%)" for the enterprise manager, Manager of Managers (MOM) in 10.5.2.92?
I've searched the docs and the community site for what this metric actually shows.
We are on 10.5.2.92 and have received capacity alerts (above 80%). The value holds for 10 to 45 minutes and then drops down to a normal level. The only related change I can find under the custom host (MOM) is that the JVM memory after GC jumped to a bit over half (6.99 GB) of the allocated JVM memory (10 GB) and then, after about 40 minutes, dropped back down to a normal level of around 1.7 GB.
My best guess at a response would be to run "kill -3" on the JVM a few times during the event and a few times afterward, send the resulting dumps to CA Support, and see if they can tell what is going on. Before I do that, I wanted to see if there is anything I can review, change, or capture that might shed some light on why the Overall Capacity (%) jumps from around 20% to over 80%.
Thank you in advance,
So the GC Capacity was the one that spiked in our implementation, but why did the memory after GC jump from 1.7 GB to around 7 GB?
Can you share a screenshot of the Enterprise Manager metrics from *SuperDomain*|Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|Custom Metric Agent (Virtual)|Enterprise Manager?
You can also check the logs from when the problem started and see if you can find anything unusual.
The EM:MOM logs looked pretty normal for 10.5.2.
7/06/18 08:48:24.407 AM EDT [INFO] [PO:main Mailman 2] [Manager.SessionBean] Workstation User "bcole" connected successfully from host "Node=Workstation_1822, Address=hq127010.pheaalan.pheaa.org/10.10.55.37:3029, Type=socket"
7/06/18 08:48:42.538 AM EDT [INFO] [PO:main Mailman 3] [Manager] Logging out user "bcole" from host "Node=Workstation_1822, Address=hq127010.pheaalan.pheaa.org/10.10.55.37:3029, Type=socket"
7/06/18 08:48:42.555 AM EDT [INFO] [PO Route Down Executor] [Manager.SessionBean] Workstation User "bcole" disconnected successfully from host "Node=Workstation_1822, Address=<redacted>, Type=socket"
7/06/18 08:52:25.063 AM EDT [INFO] [Alarm Pooled Worker] [Manager.Action] Action "enterprise_monitoring_admins" successfully sent SMTP mail to "firstname.lastname@example.org"
Jul 06, 2018 9:00:01 AM org.hibernate.event.def.AbstractFlushingEventListener performExecutions
SEVERE: Could not synchronize database state with session
org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect): [com.timestock.tess.data.objects.Monitor#700000000000000560]
At 8:48 am, I logged into the MOM with workstation 10.5.2.15 to validate that it was able to log into the EM. Currently the EM is 10.5.2.92, and we are having an issue with one of our other environments (test): it should also be 10.5.2.92 but, unlike our other 10.5.2.92 EMs, we cannot connect to it with the 10.5.2.15 workstation.
Can you click on Enterprise Manager and send the screenshots? I want to see all the other corresponding metrics related to Overall Capacity, such as GC Duration, SmartStor, etc.
Here is also a KB on this topic: What does the Overall Capacity (%) metric consist - CA Knowledge
What does the Overall Capacity (%) metric include?
Overall Capacity is the highest of the four numbers in Enterprise Manager.
This number is generated by a heuristic that looks at other metrics to see where the weak point is in the system. For example, if the CPU is running well but the EM has filled up 95% of memory, then the overall capacity will be 95%. This metric will likely be the "one" health metric to pay attention to as an EM administrator.
The following metrics are examined when computing the overall capacity:
Health: Harvest Capacity (%) - how is the EM doing when it processes data every 15 seconds.
Health: Heap Capacity (%) - how much memory is being used. Takes a running average of the last 4 minutes.
Health: Incoming Data Capacity (%) - how well is the EM processing the incoming flow of data from Agents.
Health: SmartStor Capacity (%) - how well is the EM writing metric data to disk.
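The "highest of the contributing numbers" description above can be sketched in a few lines of Python. This only illustrates the idea as the KB words it and is not CA's actual heuristic; the function name `overall_capacity` and the sample values are made up for the example.

```python
def overall_capacity(health):
    """Return (value, metric name) for the highest contributing capacity metric.

    `health` maps a Health metric name to its current percentage.
    Assumption: Overall Capacity is simply the maximum of these values.
    """
    name = max(health, key=health.get)
    return health[name], name


# Hypothetical readings: CPU-style metrics are fine, but heap is at 95%.
sample = {
    "Harvest Capacity (%)": 12.0,
    "Heap Capacity (%)": 95.0,
    "Incoming Data Capacity (%)": 20.0,
    "SmartStor Capacity (%)": 8.0,
}

value, source = overall_capacity(sample)
print(f"Overall Capacity: {value}% (driven by {source})")
```

Returning the metric name along with the value makes it obvious which contributor is the current weak point.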
The Enterprise Manager Overall Capacity (%) metric estimates the percentage of the Enterprise Manager capacity that is being consumed.
The Overall Capacity (%) metric is computed in part from the following contributing metrics, which are shown in the metric browser tree under Enterprise Manager | Health:
- CPU Capacity (%)
- GC Capacity (%)
- Harvest Capacity (%)
- Heap Capacity (%)
- Incoming Data Capacity (%)
- SmartStor Capacity (%)
The Overall Capacity (%) metric is more valuable over a long period rather than for a specific 15-second time slice. Because the Overall Capacity metric is based on real-time metrics, the Overall Capacity value can spike quite a bit higher than 100 percent. The spike can occur, for example, because the hardware I/O subsystem overloads briefly.
However, the Enterprise Manager tends to recover from these spike situations automatically when they are not long-lasting. In general, a spike (for example, to 200 percent) is not cause for concern if it is transient.
However, over a long period the ideal average Overall Capacity is 75 percent or less.
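As a rough illustration of why the long-period average matters more than a single time slice, here is a small Python sketch. The function name `summarize_capacity`, the sample window, and the way the 75 percent figure is applied as a simple threshold are all assumptions made for the example.

```python
from statistics import mean


def summarize_capacity(samples, target_avg=75.0):
    """Summarize Overall Capacity (%) samples collected over a long window."""
    avg = mean(samples)
    return {
        "average": avg,
        "peak": max(samples),
        # Assumption: "ideal" means the window average is at or below the target.
        "within_target": avg <= target_avg,
    }


# Hypothetical window: steady 20% load with one transient spike to 200%.
window = [20.0] * 19 + [200.0]
summary = summarize_capacity(window)
print(summary)
```

In this made-up window the single spike barely moves the average, which matches the guidance that a transient spike is not a concern on its own.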
During time periods when the Overall Capacity (%) metric spikes to high values, at least one of the other contributing metrics probably also shows a spike. Investigating and understanding the source of the secondary spike can help pinpoint the root cause of the resource issue. For example, you might find the problem by looking at the Heap Capacity (%) metric, which feeds into the Overall Capacity (%) metric.
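One way to look for the secondary spike is to compare the contributing metrics over the same time window as the alert. The Python sketch below assumes the metric values have already been exported as lists of samples; the function name `spiking_contributors`, the 80% threshold, and the data are hypothetical.

```python
def spiking_contributors(series, threshold=80.0):
    """Return the names of metrics whose peak reaches `threshold`, highest peak first.

    `series` maps a metric name to a list of samples from the alert window.
    """
    peaks = {name: max(values) for name, values in series.items() if values}
    hot = [(peak, name) for name, peak in peaks.items() if peak >= threshold]
    return [name for peak, name in sorted(hot, reverse=True)]


# Hypothetical samples taken around a capacity alert.
window = {
    "CPU Capacity (%)": [10.0, 12.0, 11.0, 10.0],
    "GC Capacity (%)": [15.0, 85.0, 90.0, 20.0],
    "Heap Capacity (%)": [20.0, 40.0, 45.0, 25.0],
}

print(spiking_contributors(window))
```

With these made-up numbers only GC Capacity (%) crosses the threshold, so that would be the metric to investigate first.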
Viewing the Overall Capacity metric in Historical mode is useful for a general, comparative view of Enterprise Manager capacity status. However, the Enterprise Manager workload is complex, and various aspects of the workload affect the Overall Capacity metric in different, nonlinear ways.
For instance, the duration of SmartStor maintenance tasks (spool to data conversion and reperiodization) can be an important indicator of Enterprise Manager capacity. However, these maintenance tasks do not directly participate in the Overall Capacity calculation. The SmartStor maintenance tasks cause an increase in CPU and heap utilization. The increased utilization results in an increase in capacity percentage, but the magnitude of the increase does not reflect the full impact of SmartStor maintenance issues.
The Overall Capacity metric is focused primarily on how an Enterprise Manager handles the agent metrics workload. This metric does not directly evaluate capacity with respect to the application triage map or CA CEM data. For example, the Overall Capacity metric does not reflect overloaded Enterprise Manager services or APM database I/O issues.
So per Junaid Wahab's comment it would be good to review all the other metrics under the Enterprise Manager | Health node.
Exhaustive, clear, useful. Thank you.
The EM Capacity % appears to be related to the number of workstations. Normally the heap after GC stays around 1.8 to 2 GB, and the heap in use follows a steady pattern after GC. When a 10.5.2.92 workstation connects to the 10.5.2.92 Enterprise Manager (MOM), there is a very large increase in the heap use after GC. Workstation logouts also appear to correlate with the heap use after GC, implying that the workstation is driving the memory use on the MOM.
The same behavior is not seen with WebView.
I have opened a support case on this topic, since the cause does not appear to be other factors such as an increase in agents, metrics, errors, or transaction traces during the time the MOM memory jumps.
01135833 - EM Capacity % - Increases when workstation is logged in.
Thanks Billy. I took the case.
We applied hotfix 46 (10.5.2.99) to our Enterprise Managers and workstations, and the capacity issues caused by the workstation appear to be fixed. The "SEVERE: Could not synchronize database state with session" messages in the EM logs also appear to have been corrected.