Hi Osvaldo,
Indeed, this looks like a capacity/resource issue.
If the problem persists after increasing the memory, open a support case; we would need to do a log analysis.
Below is the list of the top 5 performance issues we commonly see here in support.
In the EM (MOM and Collector) logs, search for the keywords below:
1- reported Metric clamp hit
Example:
[INFO] [PO:client_main Mailman 2] [Manager] Collector jhbpsr020000011@25318 reported Metric clamp hit.
[WARN] [Harvest Engine Pooled Worker] [Manager.Agent] [The EM has too many historical metrics reporting from Agents and will stop accepting new metrics from Agents. Current count
Recommendation: Increase the clamp in the EM/Collectors apm-events-thresholds-config.xml; however, any increase over the default value will have an impact on overall performance (see the sketch after this item).
Also remember that once a Collector has hit its metric clamp, the MOM won't redirect Agents to that Collector any more. And if all Collectors hit the clamp at some point, the MOM won't find any Collector for incoming Agents.
When we are in that situation:
- If it's a 9.1+ Agent, the MOM will keep the Agent connected to it in disallowed mode. When this list grows too big, the MOM will have trouble keeping up with the existing connections and accepting any new ones.
- If it's a pre-9.1 Agent, the MOM will reject the Agent and the Agent will come back again later. This also adds connection load to the MOM.
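For reference, this is roughly what the clamp entries look like in apm-events-thresholds-config.xml. The clamp IDs and values below are what you would typically see in a 9.x/10.x install; please verify the IDs and current defaults in your own file before changing anything:
<clamp id="introscope.enterprisemanager.metrics.live.limit">
    <threshold value="500000"/>
</clamp>
<clamp id="introscope.enterprisemanager.metrics.historical.limit">
    <threshold value="1200000"/>
</clamp>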
2- reached
Example:
[WARN] [Dispatcher 1] [Manager] Timed out adding to outgoing message queue. Limit of 3000 reached. Terminating connection: Node=Workstation_29, Address=22.240.96.38/22.240.96.38:46896, Type=socket
Recommendation:
Add/increase the following properties in the MOM and Collector properties files. The impact of these changes will be on memory.
transport.outgoingMessageQueueSize=10000
transport.override.isengard.high.concurrency.pool.min.size=10
transport.override.isengard.high.concurrency.pool.max.size=10
3- slowly
Example:
[VERBOSE] [PO:main Mailman 6] [Manager.Cluster] Outgoing message queue is moving slowly: Node=Server, Address=/22.240.96.8:25318, Type=socket
Recommendation:
This could be due to huge SmartStor metadata / historical data.
Have you increased the live/historical metric clamp?
If you are running CLWorkstation queries, it could be due to huge queries; set:
introscope.enterprisemanager.query.datapointlimit=5000000
introscope.enterprisemanager.query.returneddatapointlimit=1000000
4- too many
Example:
java.io.IOException: Too many open files
Recommendation: Make sure the maximum open file limit is at least 4096 on both the MOM and Collectors. You can check the current limit by running "ulimit -n" as the user who starts the EM processes. You might need to increase the maximum number of open files allowed for that user, as shown below.
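For example, on Linux you can raise the limit persistently in /etc/security/limits.conf. The user name "caapm" below is just a placeholder for whatever account starts your EM processes:
# /etc/security/limits.conf (takes effect on the next login of that user)
caapm    soft    nofile    8192
caapm    hard    nofile    8192
# then verify as the EM user
su - caapm -c "ulimit -n"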
5- outofmemory
Example:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Recommendation: Increase the heap size by 2 GB.
Or
java.lang.OutOfMemoryError: PermGen space
Recommendation: Increase -XX:MaxPermSize (see the example below).
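As an illustration only: on most installs the EM JVM options are set in the Introscope_Enterprise_Manager.lax file via lax.nl.java.option.additional. Your existing line will contain other options and different sizes, so edit the values in place rather than copying this one verbatim:
lax.nl.java.option.additional=-Xms8192m -Xmx8192m -XX:MaxPermSize=512m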
And of course, search for any [ERROR] messages, for example with the quick scan below.
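If it helps, here is a quick way to scan for all of the above keywords at once on Linux. IntroscopeEnterpriseManager.log is the default log name and <EM_HOME> is a placeholder for your install directory, so adjust both to your environment:
cd <EM_HOME>/logs
grep -iE "metric clamp hit|reached|moving slowly|too many open files|outofmemory|\[ERROR\]" IntroscopeEnterpriseManager*.log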
I hope this helps,
Regards,
Sergio