We are trying to create an APM cluster (v9.7.1) with 10 collectors that will accommodate as many agents as possible. Our plan is to have 800 agents per collector, each sending at most 500 KB of live metrics per interval, for a total of 400 000 KB of live metrics per collector and 8000 agents per APM cluster. Is it possible to run this many agents while still sticking to the officially recommended 400 000 KB of live metrics per collector? Does anyone have field experience with deployments this big? I know the standard recommendation is to keep at most 400 agents per collector. Please advise.
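The arithmetic behind the plan can be checked with a short sketch. All figures come from the question and keep its units; the variable names are only illustrative.

```python
# Figures as stated in the question (units as given in the thread).
AGENTS_PER_COLLECTOR = 800
LOAD_PER_AGENT_KB = 500
COLLECTORS = 10
RECOMMENDED_LOAD_PER_COLLECTOR_KB = 400_000
RECOMMENDED_AGENTS_PER_COLLECTOR = 400

# Per-collector live metric load and cluster-wide agent count.
load_per_collector_kb = AGENTS_PER_COLLECTOR * LOAD_PER_AGENT_KB
agents_in_cluster = AGENTS_PER_COLLECTOR * COLLECTORS

print(load_per_collector_kb)  # 400000
print(agents_in_cluster)      # 8000
# The load sits exactly at the recommended limit, with no headroom.
print(load_per_collector_kb <= RECOMMENDED_LOAD_PER_COLLECTOR_KB)  # True
# The agent count is twice the standard recommendation.
print(AGENTS_PER_COLLECTOR / RECOMMENDED_AGENTS_PER_COLLECTOR)     # 2.0
```

So the plan meets the load recommendation only at its upper bound, while doubling the recommended agent count per collector.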
Disclaimer: I am in Presales AND I like to follow the rules. Yes, it's true. So my concerns would be:
I wish you well and would like to hear how well the plan works out!
I implemented an EM cluster with 4000+ agents, and there are a few things to mention so far:
- the initial loading of all the agents into the client can be a problem
- when the agents start load balancing, that is a massive load as well and could lead to problems
- if you let the MOM load balance freely, you might end up with 10x the number you calculated from a historical count perspective per collector
- the agents really need to be controlled so they do not suddenly deliver a massive amount of metrics
- the historical metric count has an impact as well, depending on how long you store the metric data, and over time the environment could get slower
- as with every sizing, give the collectors and the MOM enough power and headroom to work with
But in general it works well. Depending on the issue you are facing, you might have to analyze, configure, and test things no one else has done before, not even CA.
The big problem is: when you run an env that close to capacity, what happens if one or two collectors need restarting?
The load will be spread across the rest of the collectors and might cause them to fail, causing more outages, until the whole thing falls over like dominoes.
My rule of thumb is to keep the env below capacity, so that in a failure event the rest of the collectors can sustain the increased load until the failing collectors can be accessed again.
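To put rough numbers on that rule of thumb, here is a minimal sketch. It assumes an idealized rebalance in which the surviving collectors share the total load evenly, which a real MOM may not do. The function names are hypothetical, and the figures are the ones from the original question.

```python
def load_after_failures(collectors, load_per_collector, failed):
    """Per-collector load once `failed` collectors drop out and the
    survivors share the total evenly (assumed, idealized rebalance)."""
    survivors = collectors - failed
    if survivors <= 0:
        raise ValueError("no collectors left to take the load")
    return collectors * load_per_collector / survivors


def max_safe_load(collectors, capacity, tolerated_failures):
    """Highest steady-state per-collector load that still fits within
    `capacity` after `tolerated_failures` collectors are lost."""
    survivors = collectors - tolerated_failures
    if survivors <= 0:
        raise ValueError("cannot tolerate losing every collector")
    return capacity * survivors / collectors


# 10 collectors, each planned at the full 400 000 limit.
print(round(load_after_failures(10, 400_000, 1)))  # 444444
print(load_after_failures(10, 400_000, 2))         # 500000.0
# Steady-state load that would survive losing two collectors.
print(max_safe_load(10, 400_000, 2))               # 320000.0
```

Under these assumptions, a cluster planned at exactly the limit exceeds it as soon as one collector is down, while planning at about 80% of the limit leaves room to lose two.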
This is what I wanted to say with "if you let the MOM load balance freely, you might end up with 10x the number you calculated from a historical count perspective per collector".
One way around it, at the cost of losing the active/active capability, is to not allow load balancing at all and force each agent to stick to one collector.
Or you force them onto, e.g., a collector pair, so you have limited active/active capabilities.
In any case, you need to calculate the historical metric count per collector based on the way you load balance the agents.
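A rough way to compare the three options is a worst-case estimate. The sketch below is an assumption, not a documented sizing formula: it supposes that, over time, every agent ends up connecting to every collector it is allowed to reach, and that each collector keeps history for every agent it has ever served. The function name and the per-collector figure are hypothetical.

```python
def worst_case_historical_count(planned_per_collector, collectors_reachable):
    """Worst-case historical metric count on one collector, assuming each
    agent eventually visits all `collectors_reachable` collectors in its
    group and every collector retains history for each agent it served."""
    if collectors_reachable < 1:
        raise ValueError("an agent must be able to reach at least one collector")
    return planned_per_collector * collectors_reachable


planned = 400_000  # hypothetical planned metric count per collector

print(worst_case_historical_count(planned, 1))   # pinned to one collector: 400000
print(worst_case_historical_count(planned, 2))   # collector pair: 800000
print(worst_case_historical_count(planned, 10))  # free balancing, 10 collectors: 4000000
```

With free load balancing across 10 collectors, this estimate gives the 10x figure mentioned above; pinning or pairing caps the multiplier at 1x or 2x.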