Sorry for the vague title, but I was wondering how the community deals with this issue. There are times when a team comes to us and asks us to tweak one of our alerting policies, for example a disk space alert policy. This is business as usual and we can make these policy changes on the fly, but the issue shows up further downstream in Spectrum and SOI.
For example, disk space alerts come into Spectrum/SOI, and our command center cuts tickets to the appropriate teams. Sometimes those teams come back and say "yes, we know that our disk space is X% full, but we can't clean up any further, plus we are replacing this server". Now imagine the same for a cluster of servers: suddenly you have hundreds of servers in the same situation, all waiting to be sunset.
What happens next is that the command center eventually clears the alerts from their queues/global collections. Because they are disk space alerts, they don't fire again until the next threshold breach. This is where the issue comes into play: when we modify a policy and re-push the templates to the servers, all of those devices re-fire all of those alerts.
In the grand scheme of things it makes sense why the alerts are firing (the server is being re-checked after a policy update), but is there any way that we can say "push this policy, but do not re-fire alarms"? Our goal here is to make policy changes without flooding our queues with alarms that have previously been cleared.
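To make the ask concrete, here is a rough Python sketch of the behavior I'm after. It is not a Spectrum or SOI API; the `Alarm` and `RefireFilter` names, the fields, and the idea of remembering the highest cleared threshold per host/metric are all assumptions made up for illustration. The idea is that a re-fired alarm is dropped if the command center already cleared the same (or a higher) threshold for that host and metric, while a genuinely new, higher breach still gets through.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm shape, not taken from any product schema.
    host: str
    metric: str      # e.g. "disk:/var"
    threshold: int   # breached threshold, in percent


@dataclass
class RefireFilter:
    # Highest threshold already cleared, keyed by (host, metric).
    cleared: dict = field(default_factory=dict)

    def mark_cleared(self, alarm: Alarm) -> None:
        """Record that the command center cleared this alarm."""
        key = (alarm.host, alarm.metric)
        self.cleared[key] = max(alarm.threshold, self.cleared.get(key, 0))

    def should_forward(self, alarm: Alarm) -> bool:
        """Forward only alarms that are new or breach a higher threshold."""
        acked = self.cleared.get((alarm.host, alarm.metric))
        return acked is None or alarm.threshold > acked


# The command center clears a 90% alarm on srv01, then a policy re-push
# makes the servers re-fire their alarms.
refire_filter = RefireFilter()
refire_filter.mark_cleared(Alarm("srv01", "disk:/var", 90))

same_alarm = refire_filter.should_forward(Alarm("srv01", "disk:/var", 90))
higher_breach = refire_filter.should_forward(Alarm("srv01", "disk:/var", 95))
other_host = refire_filter.should_forward(Alarm("srv02", "disk:/var", 90))
```

With this logic the re-fired 90% alarm on srv01 would be suppressed, while the 95% breach and the alarm from srv02 would still reach the queue. Whether something equivalent can be configured in the policy push itself, or has to live in Spectrum/SOI, is exactly what I'm asking.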