We are facing really wobbly issues with the Wily MOM dashboard greying out, with all of the collectors getting disconnected from the MOM.
Here is what we have tried to date, but nothing seems to be leading us anywhere:
I have been keeping close track of the NTP offset and, though I have noticed a few collectors drifting roughly 1.3 seconds from the MOM, the drift does not persist for long.
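One way to keep that "close track" continuous is to poll `ntpq -pn` on each collector and flag any peer whose offset exceeds a threshold. Below is a minimal sketch; the column positions assume the classic `ntpq -pn` peers billboard (offset in milliseconds in the ninth column), which can vary by ntp version, and the 500 ms threshold is an arbitrary choice, not a Wily requirement.

```python
# Sketch: flag NTP peers whose clock offset exceeds a threshold.
# Assumes classic `ntpq -pn` tabular output (offset column in ms);
# column layout varies by ntp version, so treat as illustrative.

OFFSET_LIMIT_MS = 500.0  # assumed threshold, not an official value

def parse_ntpq(output: str):
    """Return (peer, offset_ms) pairs from `ntpq -pn` output."""
    rows = []
    for line in output.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) >= 9:
            peer = fields[0].lstrip('*+#o-x ')  # drop the tally code
            offset_ms = float(fields[8])        # offset column, in ms
            rows.append((peer, offset_ms))
    return rows

def check_offsets(output: str, limit_ms: float = OFFSET_LIMIT_MS):
    """Peers whose absolute offset exceeds limit_ms."""
    return [(p, o) for p, o in parse_ntpq(output) if abs(o) > limit_ms]
```

Running this from cron on every collector and the MOM (feeding it the stdout of `ntpq -pn`) would give a timestamped record of exactly when the 1.3-second drifts occur, which can then be correlated against the disconnect times in the EM logs.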
We are using datastores from a SAN (a single LUN is shared by multiple datastores), but the VMware report shows that we are not doing that much I/O-intensive activity.
I also asked about vMotion, to which they said it is enabled for the MOM and collectors, but that they have not been vMotioned for a while.
So here are my queries:
Is there a way to check from a VMware guest (collector & MOM) whether it has been vMotioned?
What is ping time, and can it differ from the NTP offset status? If I configure a simple ping for round-trip packets, will that data also account for ping time? I mean, ping time is a Wily term, but we have to make the infrastructure team understand that there can be something wrong with the network (though the MOM and collectors are in the same subnet).
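On the ping-time-versus-NTP-offset question: they measure different things. Ping (round-trip time) measures network delay; NTP offset measures the clock difference between two machines. NTP in fact uses four timestamps per exchange to separate the two. A small sketch of that standard arithmetic (RFC 5905), with entirely hypothetical timestamps, shows how a machine can have a 1.3 s clock offset while the network round trip is only 40 ms:

```python
# Sketch: why ping RTT and NTP offset are different quantities.
# t0: client send, t1: server receive, t2: server send, t3: client receive
# (all hypothetical timestamps, in seconds).

def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire arithmetic (RFC 5905)."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # estimated clock difference
    delay = (t3 - t0) - (t2 - t1)            # round-trip network delay
    return offset, delay

# Example: server clock 1.3 s ahead, 20 ms one-way network delay each way
offset, delay = ntp_offset_and_delay(100.000, 101.320, 101.321, 100.041)
# offset comes out ~1.3 s even though the round trip is only ~40 ms
```

So a clean ping does not rule out a clock problem, and a clean NTP offset does not rule out network latency or packet loss; for the infrastructure team it is worth graphing both independently.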
Following are the error messages I usually see when Wily is wobbly:
[error] [PO Route Down Executor] [Manager] Uncaught Exception in Enterprise Manager: In thread PO Route Down executor and the message is java.lang.NullPointerException
[Manager Cluster] Waited 15000 ms but did not receive a response for the message com.wily.isengard.messageprimitives.service.MessageServiceCallMessage.
Any help will be useful, especially if we can find out whether the MOM was vMotioned.
Any help here for Sumit?
Do not use vMotion to move a "live" EM. Shut down the EM first. That should not cause an issue if your cluster has been scaled properly for load balancing.
Are the EMs in the cluster on the same subnet?
Are your VMs able to keep up with I/O activity for SmartStor? http://goo.gl/RTWRJM
Using a single LUN for multiple SmartStor instances does NOT meet the hardware requirements, so I would suggest you read through all of the factors affecting EM performance: http://goo.gl/RTWRJM
Also consider the topics on minimizing the impact of running multiple EMs on the same ESX server: http://goo.gl/RTWRJM
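As a quick sanity check on whether the shared LUN can keep up, you can do a back-of-envelope estimate of SmartStor's sustained write rate: every live metric produces one data point per 15-second harvest cycle. The bytes-per-data-point figure below is an assumption for illustration, not a vendor number; the real sizing factors are in the link above.

```python
# Back-of-envelope SmartStor write-rate estimate. The per-data-point
# cost is an assumption for illustration, not a documented figure.

HARVEST_PERIOD_S = 15      # SmartStor stores one data point per metric per cycle
BYTES_PER_DATAPOINT = 50   # assumed on-disk cost per metric data point

def smartstor_write_rate(live_metrics: int) -> float:
    """Approximate sustained SmartStor write rate in KB/s."""
    return live_metrics * BYTES_PER_DATAPOINT / HARVEST_PERIOD_S / 1024.0

# e.g. a collector carrying 300,000 live metrics:
rate = smartstor_write_rate(300_000)
```

Even if the average rate looks modest, remember that SmartStor reperiodization and heavy historical queries add bursty I/O on top of it, which is why the docs call for a dedicated spindle/LUN per SmartStor.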
Thanks for your reply.
I have worked on all the factors you mentioned before, and here are the updates:
1.) I have checked with the VM team and also told them: we do not vMotion EMs. They have not found any vMotion log in the last few months, but the problem of disconnected collectors still exists.
2.) Though we are using a datastore on a shared LUN rather than a dedicated LUN, we have reviewed the I/O read/write stats. They look all right for each collector. Further, SmartStor and harvest durations never spike to any alarming level and remain as normal as ever.
3.) NTP offset and clock skew look within control as well.
4.) We have recently pruned the collectors' SmartStor data, and Java heap consumption is now pretty low. Further, we have increased the heap memory in the .lax file as well, so Java heap consumption has gone even lower.
5.) Yes, all EMs are in the same subnet.
What I have learned from reviewing the logs, though, is that at times the message queue fills up above 6000 (the setting we have for the message queue on the collectors), and around the same time we sometimes notice a high number of historical queries.
Can that be the cause? I heard there is a bug in 9.1.0 where, if the message queue fills up, the collector takes 2 minutes to empty the queue and might not respond to the MOM. Is that happening to us?
We need to know if we can put a clamp on Workstation-generated historical queries, and also whether there is any metric or formula to calculate what volume of historical queries will fill up the message queue.
Marking as answered.