Is there a corresponding graph of application pause for the DCs? Is there a spike in app pause too?
What does "ps -ef | grep java" show for the karaf process on the 2 DCs that dip? The STIME column should show when the process was started.
The last graph, "Polling stopped due to prior timeouts", means that the DC sent 15 requests for 1 Metric Family on a device and got no response from the device. Judging from the spike, the same thing happened on 170k devices.
Maybe there was a network issue, or maybe a long app pause due to Java garbage collection kept the responses from being processed.
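
If the STIME column from "ps -ef" is hard to read (it shows only a date once the process is more than a day old), something like the following may be easier. This is only a sketch, assuming a Linux box with procps "ps"; the function name proc_start_info is made up for illustration, and "karaf" is just the pattern I'd expect to match the DC process.

```shell
# Hypothetical helper: print PID, full start timestamp (lstart) and
# elapsed run time (etime) for processes whose command line matches a pattern.
proc_start_info() {
  pattern="$1"
  # Keep the header row plus matching rows; skip the awk process itself,
  # whose own command line also contains the pattern.
  ps -eo pid,lstart,etime,args |
    awk -v pat="$pattern" 'NR == 1 || ($0 ~ pat && $0 !~ /awk/)'
}

# Assumed pattern for the DC's karaf JVM
proc_start_info karaf
```

If the start time lines up with the dip, the process was restarted around then; if it is much older, the process stayed up and a pause or network problem is more likely.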