CA Tech Tip: Cascading APM Problems
Introduction
While the current and forthcoming APM troubleshooting material covers common problems and their resolutions, it doesn't always address the complex area of
cascading issues. I define the term as follows:
A chain of usually sequential events across multiple servers that results in various problems and symptoms. These may recur over time.
For APM CE (CEM), this is one typical sequence:
- A TIM Collector, either underpowered or having Introscope Agents connected to it, becomes overloaded
- The TIM Collector stops communicating with the TIM and gets a 4xx/5xx error on the Monitors tab
- Defects, btstats (RTTM), stats, and other files back up on the TIM in /etc/wily/cem/tim/data/out/...
- CEM Reports are not being produced
- A call comes to APM Support
For Introscope, there are similar sequences between MOMs, Collectors, Agents, and other components (such as load-balancing or overloaded-EM issues).
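The file backlog in the CEM sequence above is one symptom that is easy to measure. The following is a minimal sketch, not an official CA tool: count_backlog is a hypothetical helper, and the directory passed to it is a placeholder that you would replace with the actual output directory on your TIM. It counts the files waiting under each top-level subdirectory, optionally only those older than a given age.

```python
import os
import time


def count_backlog(root, older_than_seconds=0, now=None):
    """Count files under each top-level subdirectory of root.

    Hypothetical helper: only files whose modification time is at least
    older_than_seconds in the past are counted. Files sitting directly
    in root are reported under the key ".".
    """
    now = time.time() if now is None else now
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                # The file was picked up or removed between listing and stat.
                continue
            if now - mtime >= older_than_seconds:
                counts[top] = counts.get(top, 0) + 1
    return counts


if __name__ == "__main__":
    # Placeholder path: substitute the output directory used on your TIM.
    backlog = count_backlog("/path/to/tim/output", older_than_seconds=600)
    for subdir, n in sorted(backlog.items()):
        print(subdir, n)
```

Running this a few minutes apart shows whether the counts are growing (the Collector is not picking files up) or draining (the backlog is clearing).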
How to work this issue
- Although there are various approaches that can be used, I like the functional-workflow approach that I have described in earlier tips.
If I know the function of a server and the APM components it communicates with, then typically I can quickly home in on an issue.
- By gathering the logs from all the impacted servers, one can perform an event correlation to determine which events were happening on each server and in what order.
- Break the problem into multiple issues and prioritize them. For this case, I would break it into two tickets/issues:
* Get the files off the TIM by restarting the TIM Collector or disabling/re-enabling the TIM object
* Clean up the Stats Aggregation issues
If relevant, I would include performance/architecture recommendations that the customer should address. These could include upgrades or hot fixes
that usually resolve the issue. Once these concerns are addressed, the issue should be less likely to recur.
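The event correlation step described above can be sketched as merging the log lines from every server into a single timeline. This is a minimal illustration under stated assumptions: the timestamp format (TS_FORMAT) is assumed and must be adjusted to match your actual APM logs, the function names are hypothetical, and the sample log lines are made up for the example rather than taken from real product output.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log line; adjust to your logs.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LEN = 19  # length of a timestamp rendered with TS_FORMAT


def parse_events(server, lines):
    """Return (timestamp, server, message) tuples for lines that start with a timestamp."""
    events = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LEN], TS_FORMAT)
        except ValueError:
            # Continuation lines (for example stack traces) carry no timestamp.
            continue
        events.append((ts, server, line[TS_LEN:].strip()))
    return events


def correlate(logs_by_server):
    """Merge events from all servers into one list ordered by timestamp."""
    merged = []
    for server, lines in logs_by_server.items():
        merged.extend(parse_events(server, lines))
    merged.sort(key=lambda event: event[0])
    return merged


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    logs = {
        "tim": ["2014-03-01 12:00:09 sample event C"],
        "collector": [
            "2014-03-01 12:00:01 sample event A",
            "    continuation line without a timestamp",
            "2014-03-01 12:00:05 sample event B",
        ],
    }
    for ts, server, message in correlate(logs):
        print(ts, server, message)
```

A merged timeline like this is only as reliable as the server clocks, so confirm that the servers are time-synchronized (or note the offsets) before drawing conclusions about which event came first.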
Questions for Discussion:
1) What cascading issues have you encountered, or are you encountering now?
2) Which overall approach did you use to resolve them?
3) What other troubleshooting topics would you like to see covered in Tech Tips?