I have a problem with the MOM connected to 10 collectors.
I have seen the MOM's harvest time drop from a 1 s average to 200 ms, and at the same time the number of metrics also drops, which sets off a flood of alarms.
Can you suggest some settings to avoid this drop? Is this related to the MOM's heap setup? It has an 18 GB heap on a box with 24 GB of RAM.
Thanks in advance
Luca Razzi CA customer from Italy
A drop in harvest time is usually a good thing; you want the harvest time to be as quick as possible.
A lot happens during the harvest process. Among other things, the live status of metrics is checked so that each alert's status can be verified as changed or unchanged.
There is probably a correlation here: fewer metrics to harvest means a shorter harvest time.
The thing is that MOM harvest time shouldn't really be that high anyway, because metrics are not reported directly to the MOM.
If you have metrics generated locally on the MOM, that might cause the harvest duration on the MOM to rise. One second is not too high (we would be worried above five seconds), but I would not normally expect a MOM to take one second to harvest.
With a MOM and 10 collectors, I would suggest you are more likely to have performance issues in the cluster itself; maybe one or more collectors are running with a heavy agent load.
Can you see any pattern to this problem? Is it happening at regular intervals?
It feels like something we might need to investigate here in Support: review logs across the cluster, and understand how much load is on each collector and how much heap each collector has.
I would look for any regular peaks in Harvest Duration on the collectors, for example. Depending on which version of APM you are using, there are known issues with spikes in Harvest Duration on collectors due to a SOA deviation metric calculation that, by default, runs hourly. That can cause metrics/alerts to grey out.
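As a rough illustration of what "regular peaks" could look like in exported data, here is a minimal Python sketch. It assumes you have already pulled Harvest Duration samples for one collector into a list of (seconds, milliseconds) pairs; the sample values, the 1000 ms spike threshold, and the function names are all made up for the example and are not part of APM.

```python
from statistics import median

def spike_times(samples, threshold_ms):
    """Return the timestamps (seconds) whose harvest duration exceeds threshold_ms."""
    return [t for t, duration in samples if duration > threshold_ms]

def dominant_interval(times, tolerance_s=60):
    """Return the median gap between spikes if all gaps are roughly equal, else None."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        return None  # too few spikes to call it a pattern
    mid = median(gaps)
    if all(abs(g - mid) <= tolerance_s for g in gaps):
        return mid
    return None

# Made-up samples: one reading every 15 minutes, with a spike on each full hour.
samples = [(t, 3500 if t % 3600 == 0 else 250) for t in range(0, 4 * 3600 + 1, 900)]

spikes = spike_times(samples, threshold_ms=1000)  # threshold is an assumption
print(dominant_interval(spikes))
```

If the returned interval is close to 3600 seconds, that would be consistent with something running hourly, such as the calculation mentioned above, but it does not prove the cause on its own.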
An 18 GB heap is normally more than enough for a MOM. As long as you don't intend to run any other Enterprise Managers (collectors) on that machine, you won't be short of physical memory.
Thanks for your kind reply, David.
Going further with the log analysis, I have found that the MOM is suffering from slow connections with the collectors,
at least with one that is taking on lots of agents while we go on with our endless installations (1400+ agents now).
I think the problem lies in the messaging between the MOM and the collectors, in both directions.
Can you suggest a robust thread setup in the messaging area for the 10 collectors and the MOM?
Although the agent count can contribute to slowness, 1400+ agents across 10 collectors will not cause any issues if your metric count is well below the threshold. The maximum agent count per collector is 400, and a collector can handle up to 400K metrics. Either limit can apply. For example, 100 agents can report 400K metrics, or 400 agents can report 100K metrics.
What is your current agent count for 10 collectors?
What is your metrics count for all 10 collectors?
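Once you have those two numbers per collector, the rule above can be checked with a small Python sketch like the following. The two limits are the ones quoted in this thread; the collector names and counts are hypothetical placeholders, so substitute your own figures.

```python
MAX_AGENTS_PER_COLLECTOR = 400        # limit quoted in this thread
MAX_METRICS_PER_COLLECTOR = 400_000   # limit quoted in this thread

def over_limit(collectors):
    """Return {name: [reasons]} for every collector exceeding either limit."""
    problems = {}
    for name, (agents, metrics) in collectors.items():
        reasons = []
        if agents > MAX_AGENTS_PER_COLLECTOR:
            reasons.append(f"agents {agents} > {MAX_AGENTS_PER_COLLECTOR}")
        if metrics > MAX_METRICS_PER_COLLECTOR:
            reasons.append(f"metrics {metrics} > {MAX_METRICS_PER_COLLECTOR}")
        if reasons:
            problems[name] = reasons
    return problems

# Hypothetical per-collector figures: (agent count, metric count).
collectors = {
    "collector01": (120, 150_000),
    "collector02": (450, 90_000),    # too many agents
    "collector03": (100, 420_000),   # too many metrics
}

for name, reasons in over_limit(collectors).items():
    print(name, "->", "; ".join(reasons))
```

A cluster can be well under the limits in total (1400 agents against 10 x 400) and still have a single collector over one of them, which is why the check is done per collector.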