DX Application Performance Management


Uncaught Exception on MoM in Threadpool

  • 1.  Uncaught Exception on MoM in Threadpool

    Posted Jan 19, 2017 10:27 AM

    I need help with this ERROR; it keeps being generated repeatedly.

     

    1/19/17 10:16:55.182 AM EST [ERROR] [pool-11-thread-286204] [Manager] Uncaught Exception in Enterprise Manager: In thread pool-11-thread-286204 and the message is com.wily.util.exception.UnexpectedExceptionError: Tranport for the registry service at address: {1} is down
    1/19/17 10:17:09.726 AM EST [WARN] [PO:main Mailman 2] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"
    com.wily.isengard.messageprimitives.ConnectionException: Tranport for the registry service at address: {1} is down
    at com.wily.isengard.postofficehub.ClonedRegistry.getEntry(ClonedRegistry.java:141)
    at com.wily.isengard.messageprimitives.service.MessageServiceFactory.internalGetServiceInterface(MessageServiceFactory.java:317)
    at com.wily.isengard.messageprimitives.service.MessageServiceFactory.internalGetServiceInterface(MessageServiceFactory.java:270)
    at com.wily.isengard.messageprimitives.service.MessageServiceFactory.getService(MessageServiceFactory.java:132)
    at com.wily.introscope.server.beans.loadbalancer.ClusteredLoadRebalancer.getLoadBalancerAdmin(ClusteredLoadRebalancer.java:65)
    at com.wily.introscope.server.beans.loadbalancer.ClusteredLoadBalancer.clearDeniedAgentsOnAllCollectors(ClusteredLoadBalancer.java:704)
    at com.wily.introscope.server.beans.loadbalancer.ClusteredLoadRebalancer.rebalance(ClusteredLoadRebalancer.java:275)
    at com.wily.introscope.server.beans.loadbalancer.ClusteredLoadRebalancer$RebalanceTask.run(ClusteredLoadRebalancer.java:1095)
    at com.wily.EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java:728)
    at java.lang.Thread.run(Thread.java:745)

    Thanks in advance for any/all assistance. 

    Manish



  • 2.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 19, 2017 10:41 AM

    I can provide logs 1:1 if anyone requires them. Please reach out to me via private message and I can provide them to you.



  • 3.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 19, 2017 11:04 AM

    Hi Manish,

     

    Looks like the MOM lost connection to the Collector.  Check the MOM logs to see if this particular collector was disconnected or running slow around or before the time listed above.



  • 4.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 19, 2017 11:45 AM

    Hi Matt,

    This is what I see when I search on that IP:

     

    Line 478842: 1/19/17 10:07:09.416 AM EST [WARN] [PO:main Mailman 7] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"
    Line 478897: 1/19/17 10:17:09.726 AM EST [WARN] [PO:main Mailman 2] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"
    Line 478950: 1/19/17 10:27:10.900 AM EST [WARN] [PO:main Mailman 6] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"
    Line 479012: 1/19/17 10:28:37.274 AM EST [WARN] [ClusterManager Async Executor] [Manager] Unable to update load balancing for collector "10.60.168.40@5001"
    Line 479116: 1/19/17 10:28:37.931 AM EST [WARN] [ClusterManager Async Executor] [Manager] Unable to update load balancing for collector "10.60.168.40@5001"
    Line 479158: 1/19/17 10:28:46.819 AM EST [WARN] [Collector 10.60.168.40@5001] [Manager.Cluster] Lost contact with the Introscope Enterprise Manager at 10.60.168.40@5001
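
A search like the one above can also be scripted. The following is a minimal sketch in Python, assuming the MoM log uses the line format shown in this thread. The function name `summarize_collector` and the `SAMPLE` list are illustrative and not part of any APM tooling; the sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches lines such as:
# 1/19/17 10:07:09.416 AM EST [WARN] [PO:main Mailman 7] [Manager] Unable to ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d+/\d+/\d+ \d+:\d+:\d+\.\d+ [AP]M) \w+ "
    r"\[(?P<level>\w+)\] \[(?P<thread>[^\]]+)\] \[(?P<logger>[^\]]+)\] "
    r"(?P<msg>.*)$"
)

def summarize_collector(lines, collector):
    """Count (level, message) pairs for log lines that mention `collector`."""
    counts = Counter()
    for line in lines:
        if collector not in line:
            continue
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("msg"))] += 1
    return counts

# Sample lines taken from the search results in this thread.
SAMPLE = [
    '1/19/17 10:07:09.416 AM EST [WARN] [PO:main Mailman 7] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"',
    '1/19/17 10:17:09.726 AM EST [WARN] [PO:main Mailman 2] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"',
    '1/19/17 10:27:10.900 AM EST [WARN] [PO:main Mailman 6] [Manager] Unable to send signal for clearing denied agents to collector "10.60.168.40@5001"',
    '1/19/17 10:28:37.274 AM EST [WARN] [ClusterManager Async Executor] [Manager] Unable to update load balancing for collector "10.60.168.40@5001"',
    '1/19/17 10:28:37.931 AM EST [WARN] [ClusterManager Async Executor] [Manager] Unable to update load balancing for collector "10.60.168.40@5001"',
    '1/19/17 10:28:46.819 AM EST [WARN] [Collector 10.60.168.40@5001] [Manager.Cluster] Lost contact with the Introscope Enterprise Manager at 10.60.168.40@5001',
]

if __name__ == "__main__":
    for (level, msg), n in summarize_collector(SAMPLE, "10.60.168.40@5001").most_common():
        print(f"{n}x [{level}] {msg}")
```

Grouping by message makes it easier to see which warning dominates for a given collector before reading the full log.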



  • 5.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 19, 2017 11:47 AM

    I went ahead and restarted the MoM when I posted this question. I will continue to monitor to see if these ERRORs come back.



  • 6.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 19, 2017 11:55 AM

    Thanks Manish. If this issue does not reoccur by end of today, may we mark this as answered? (Knowing you may post additional questions/comments as needed.)



  • 7.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 19, 2017 12:04 PM

    Hal, 

    Sounds good. I am hoping for the best.

    Thanks



  • 8.  Re: Uncaught Exception on MoM in Threadpool
    Best Answer

    Broadcom Employee
    Posted Jan 19, 2017 12:08 PM

    After further research: the MOM logs this warning when it cannot deliver the notification to a collector to clear denied agents upon load rebalancing. This usually indicates that the collector has a connectivity issue with the MOM or that its message queue is already full. The MOM retries the notification on each load balancing cycle. What is the value of your introscope.enterprisemanager.loadbalancing.interval property?

     

    This can happen occasionally if the collector was restarted and had not yet reconnected to the MOM at the time of load balancing. However, if these warnings occur repeatedly, they are likely a side effect of other connectivity or performance issues.



  • 9.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 20, 2017 09:18 AM

    introscope.enterprisemanager.loadbalancing.interval=600



  • 10.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 20, 2017 09:20 AM

    Ok, not ridiculously low as we had in one issue a while back. 10 minutes should be sufficient.
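
The 600-second interval lines up with the spacing of the "clearing denied agents" warnings posted earlier in the thread. A minimal sketch in Python that computes the gaps, assuming the timestamp format shown in the MoM log (the timezone suffix is dropped, and the name `gaps_in_seconds` is illustrative):

```python
from datetime import datetime

# Value of introscope.enterprisemanager.loadbalancing.interval reported in this thread.
INTERVAL_SECONDS = 600

# Timestamps of the three "Unable to send signal for clearing denied agents"
# warnings quoted earlier in this thread (timezone suffix removed).
WARNING_TIMES = [
    "1/19/17 10:07:09.416 AM",
    "1/19/17 10:17:09.726 AM",
    "1/19/17 10:27:10.900 AM",
]

def gaps_in_seconds(timestamps):
    """Return the time difference in seconds between consecutive timestamps."""
    parsed = [datetime.strptime(t, "%m/%d/%y %I:%M:%S.%f %p") for t in timestamps]
    return [(later - earlier).total_seconds() for earlier, later in zip(parsed, parsed[1:])]

if __name__ == "__main__":
    for gap in gaps_in_seconds(WARNING_TIMES):
        print(f"{gap:.3f} s between warnings (configured interval: {INTERVAL_SECONDS} s)")
```

Both gaps come out at roughly 600 seconds, which is consistent with the warning being logged once per load balancing cycle rather than continuously.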



  • 11.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 20, 2017 09:57 AM

    In one of the tuning guides that we had for 9.0.5.6, and also back in 9.1.1.1, it was suggested to set the outgoing message queue size to 6000:

     

    transport.outgoingMessageQueueSize=6000

     

    We have added this to every one of our enterprise managers in our 9.6 and now our 10.0 environments.

     

    Would this help the communication queue buffer issue between the MOM and the collectors?



  • 12.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 20, 2017 10:11 AM

    Yes, that would help the communication buffer queue. It should never need to go above 8000, for two reasons. First, the higher you set the number, the more resources it consumes. Second, if you need more than that, something underlying is contributing to the problem, and that is what we would troubleshoot.



  • 13.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 20, 2017 10:20 AM

    transport.outgoingMessageQueueSize=10000
    transport.override.isengard.high.concurrency.pool.min.size=20
    transport.override.isengard.high.concurrency.pool.max.size=20

     

    I have these set on all of my collectors and the MoM. I had a case open about it (separate issue) and these values were suggested.
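
Settings like these can be checked against the 8000 figure mentioned earlier in the thread. The following is a minimal sketch in Python, assuming the properties are plain `key=value` lines. The helper names and the `SUGGESTED_CEILING` constant are illustrative; 8000 is the figure quoted in this thread, not a documented limit.

```python
QUEUE_KEY = "transport.outgoingMessageQueueSize"
SUGGESTED_CEILING = 8000  # figure quoted in this thread, not an official limit

def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def queue_size_note(props):
    """Return a short note comparing the queue size with the suggested ceiling."""
    raw = props.get(QUEUE_KEY)
    if raw is None:
        return f"{QUEUE_KEY} is not set"
    size = int(raw)
    if size > SUGGESTED_CEILING:
        return f"{QUEUE_KEY}={size} exceeds {SUGGESTED_CEILING}; look for an underlying cause"
    return f"{QUEUE_KEY}={size} is within {SUGGESTED_CEILING}"

# The values posted in this thread.
SAMPLE_PROPERTIES = """\
transport.outgoingMessageQueueSize=10000
transport.override.isengard.high.concurrency.pool.min.size=20
transport.override.isengard.high.concurrency.pool.max.size=20
"""

if __name__ == "__main__":
    print(queue_size_note(parse_properties(SAMPLE_PROPERTIES)))
```

A check like this only flags the value for review; whether 10000 is acceptable depends on the host's resources, as discussed in the replies.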



  • 14.  Re: Uncaught Exception on MoM in Threadpool

    Posted Jan 20, 2017 10:42 AM

    My little APM voices are having a field day.  Something looks wrong, but I am not really sure what it is.  I would agree with musma03: with anything above 8000, there is more than likely something underlying contributing to the cause.

     

    mparikh72, could you provide some APM environment details?

     

    What version of APM are you running?

     

    Do you have the "MOM_Infra_Monitoring_MM.jar" deployed and adjusted for your environment?

    Do you have the "Collector.jar" management modules deployed for each of your collectors?

     

    These dashboards have been very useful to us in understanding what is going on within the APM cluster.

     

    What OS are your enterprise managers running on?

    Are the hosts physical or virtual?

    How is the CPU/Memory/Disk/Network performing on the enterprise managers?

     

    How many collectors are you running?

    Are the collectors pretty well balanced, metric wise?

     

    Do you have any other metrics that are not out of the box, such as an environment performance agent with plugins to gather other OS metrics?

     

    How many agents, and metrics (live/historic) are your collectors dealing with?

    How many workstations, end users are typically logged into APM?

     

    Are your collectors running lots of traces?

    What do your harvest and SmartStor durations look like?

    How is your MetaData write duration?

     

    Way back in 9.0.5.6 and also in 9.1.1.1 we had someone from CA Services come in and do a review of the health of the APM cluster, which turned into a professional services engagement to help us tune our hardware (virtual servers) to contend with what and how APM operates.

     

    Sorry for so many questions,

     

    Billy



  • 15.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 20, 2017 10:49 AM

    And that is fine.  As long as your server is beefy and can handle a queue size of 10000 and pool sizes of 20 and 20, you should be OK.  But if you wanted to raise the queue size to 20000, I would advise against it for the reasons I specified above.

     

    Some other environments cannot handle 10000, 20, and 20 because they are not powerful enough.



  • 16.  Re: Uncaught Exception on MoM in Threadpool

    Broadcom Employee
    Posted Jan 20, 2017 08:00 AM

    Dear Manish:

        I was hoping to hear back good news from you. As previously agreed, since there was no response by end of yesterday, I am marking this as answered. Matt's last note gives some good leads on what the issue is likely to be. You are more than welcome to post any status updates and further questions as needed, and I will do what I can to get you a response.

     

    Happy Friday

    Hal German