DX Application Performance Management

APM Tech Tip: Cascading Issues

    Broadcom Employee
    Posted 12-06-2014 10:17 AM

    CA Tech Tip: Cascading APM Problems


    Introduction
     
    While current and forthcoming APM troubleshooting material covers common problems and their resolutions, it doesn't always address the complex area of
    cascading issues. I define the term as follows:

     

    A chain of usually sequential events across multiple servers that results in various problems and symptoms. These may recur over time.


    For APM CE (CEM), this is one typical sequence:

    - A TIM Collector that is underpowered, or that also has Introscope Agents connecting to it, becomes overloaded
    - The TIM Collector stops communicating with the TIM and gets a 4xx/5xx error on the Monitors tab
    - Defects, btstats (RTTM), stats and other files back up on the TIM in /etc/wily/cem/tim/data/out/...
    - CEM reports are not produced
    - A call comes in to APM Support
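
    The file backlog in the third step is often the easiest symptom to confirm. The following is only a minimal sketch of such a check, not a CA/Broadcom tool: it counts the files under each subdirectory of an output directory and reports the age of the oldest one. The function names and the thresholds (1000 files, one hour) are my own assumptions for illustration, so tune them to your environment.

```python
import os
import time


def backlog_summary(out_dir, now=None):
    """Return {subdirectory: (file_count, oldest_file_age_in_seconds)}.

    Only directories that actually contain files are reported.
    """
    now = time.time() if now is None else now
    summary = {}
    for root, _dirs, files in os.walk(out_dir):
        if not files:
            continue
        mtimes = [os.path.getmtime(os.path.join(root, name)) for name in files]
        rel = os.path.relpath(root, out_dir)
        summary[rel] = (len(files), max(0.0, now - min(mtimes)))
    return summary


def is_backing_up(summary, max_files=1000, max_age_seconds=3600):
    """Flag a backlog when any directory holds too many or too old files.

    Both thresholds are illustrative assumptions, not product defaults.
    """
    return any(
        count > max_files or age > max_age_seconds
        for count, age in summary.values()
    )
```

    Point backlog_summary() at the TIM output directory for your installation and run it a few minutes apart. A count that keeps growing, or an oldest file that keeps aging, suggests the Collector is not picking the files up.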

     

    For Introscope, there are similar sequences between MOMs, Collectors, Agents, and other components (such as load-balancing or overloaded-EM issues).

     


    How to work this issue


    - Although various approaches can be used, I like the functional-workflow approach that I have described in earlier tips.
       If I know the function of a server and the APM components it communicates with, then I can typically home in on an issue quickly.


    - By gathering the logs from all the impacted servers, one can perform an event correlation to determine which events were happening on each server.
    - Break the problem into multiple issues and prioritize them. For this case, I would break it into two tickets/issues:

       * Get the files off the TIM by restarting the TIM Collector or disabling/re-enabling the TIM object
       * Clean up the Stats Aggregation issues
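
    The event correlation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes every log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp, which is a hypothetical format and not the actual TIM or EM log layout, so adjust the parsing to match your logs. The function names are my own as well.

```python
from datetime import datetime, timedelta


def parse_line(server, line, fmt="%Y-%m-%d %H:%M:%S"):
    """Split 'YYYY-MM-DD HH:MM:SS message' into (timestamp, server, message).

    The timestamp format is an assumption for this sketch.
    """
    stamp, message = line[:19], line[19:].strip()
    return datetime.strptime(stamp, fmt), server, message


def merge_timeline(logs_by_server):
    """Merge log lines from several servers into one time-ordered list."""
    events = []
    for server, lines in logs_by_server.items():
        for line in lines:
            try:
                events.append(parse_line(server, line))
            except ValueError:
                continue  # skip lines without a leading timestamp
    return sorted(events, key=lambda event: event[0])


def events_near(timeline, anchor, window_seconds=300):
    """Return the events within window_seconds of the anchor time."""
    window = timedelta(seconds=window_seconds)
    return [event for event in timeline if abs(event[0] - anchor) <= window]
```

    Pick the time of the first visible symptom (for example, when the error appeared on the Monitors tab) as the anchor, then read what every other server logged in the minutes around it. The five-minute window is an arbitrary default.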
      
    If relevant, I would include performance/architecture recommendations for the customer to address. These could include upgrades or hot fixes
    that usually resolve the issue. Addressing these concerns should make the issue less likely to recur.

     

    Questions for Discussion:
    1) What cascading issues have you encountered, or are you encountering now?
    2) Which overall approach did you use to resolve them?
    3) What other troubleshooting topics would you like to see covered in Tech Tips?