Is there any pattern you can see, such as the same agents always failing? Perhaps agents with a different configuration policy?
If there is no obvious pattern, this will probably require inspecting the logs to determine why the jobs are not running. You should probably wait until the next time it occurs and then collect the SD Agent logs from the affected Agents and the SD Server logs from the Scalability Servers. Make sure you grab the logs before they are overwritten, or increase the number and size of the logs (on the SS) so they are not overwritten.
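If it helps, grabbing the logs quickly can be scripted. Below is a minimal Python sketch that copies every matching file from a log folder into a new timestamped folder. The function name `snapshot_logs`, the `*.log` pattern and the example paths are my own assumptions for illustration, not product defaults, so substitute the actual log directory of your install.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, dest_root, pattern="*.log"):
    """Copy files in log_dir that match pattern into a new timestamped folder.

    Returns the destination folder and the list of copied file names.
    """
    log_dir = Path(log_dir)
    # A timestamped folder name keeps successive snapshots apart.
    dest = Path(dest_root) / time.strftime("logs_%Y%m%d_%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(log_dir.glob(pattern)):
        if src.is_file():
            # copy2 also preserves the file's modification time,
            # which is useful when lining the logs up with the failed job.
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return dest, copied


# Example call (both paths are hypothetical placeholders):
# snapshot_logs(r"C:\path\to\agent\logs", r"C:\temp\sd_snapshots")
```

Run it on an affected Agent and on its Scalability Server as soon as the problem shows up, so the relevant entries are captured before the logs roll over.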