Hey Pavel. I ran through an exercise like this not too long ago. I agree with everything you've mentioned and with what others have suggested too. In our case my prober (Prometheus, written in Python) also checks:
- How far behind the DB is: the age of the oldest unprocessed event in ujo_event, and how many unprocessed events there are
- Skew time for each event: for events handled since the last scan, how long each one took in ujo_proc_event to go from init_status_stamp to que_status_stamp. Build a histogram from that and track the skew as an SLI over time. Good for an SLO.
- Discrepancy between what the config file thinks about DB status and what alamode in each DB thinks about its own status
- Blackbox user journey: how long it takes to go end-to-end from force-starting a /bin/true job to reaching an end state
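For the blackbox user journey, here is a minimal sketch of the timing loop in Python. It is not the actual prober code: time_journey, start_job, get_status and the END_STATES set are hypothetical names, and the two callables are left for you to wire up (for example start_job could shell out to sendevent -E FORCE_STARTJOB, and get_status could read the job status however you normally do).

```python
import time

# Assumed set of terminal job statuses; adjust to what counts as "done" for you.
END_STATES = {"SUCCESS", "FAILURE", "TERMINATED"}


def time_journey(start_job, get_status, timeout=300.0, poll_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Force-start a job and time how long it takes to reach an end state.

    start_job  -- hypothetical callable that triggers the probe job
    get_status -- hypothetical callable that returns the job's current status
    Returns (elapsed_seconds, status), with status None if the timeout hit.
    """
    started = clock()
    start_job()
    while True:
        status = get_status()
        elapsed = clock() - started
        if status in END_STATES:
            return elapsed, status
        if elapsed >= timeout:
            # The job never finished in time; report that as its own signal.
            return elapsed, None
        sleep(poll_interval)
```

The elapsed value can go straight into a gauge or histogram, and a None status is worth counting separately since a journey that never finishes is usually the more interesting failure.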
We're also using a dashboard that reports the migration ratio (jobs in the old instance / jobs in the new instance) for instances under migration.
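On the skew check, a rough sketch of the histogram and SLI math, assuming you have already pulled (init_status_stamp, que_status_stamp) pairs out of ujo_proc_event as datetime objects. The bucket bounds, the function names and the "within threshold" definition of the SLI are all assumptions for illustration, not what the prober actually uses.

```python
from bisect import bisect_left
from datetime import datetime

# Assumed bucket upper bounds in seconds; tune these to your own SLO.
BUCKETS = [1, 5, 15, 60, 300]


def skew_seconds(init_stamp, que_stamp):
    """Seconds an event spent between init_status_stamp and que_status_stamp."""
    return (que_stamp - init_stamp).total_seconds()


def histogram(skews, buckets=BUCKETS):
    """Count skews per bucket (value <= upper bound).

    The last slot holds everything above the top bound.
    """
    counts = [0] * (len(buckets) + 1)
    for s in skews:
        counts[bisect_left(buckets, s)] += 1
    return counts


def sli(skews, threshold):
    """Fraction of events whose skew is within threshold (1.0 if no events)."""
    if not skews:
        return 1.0
    return sum(1 for s in skews if s <= threshold) / len(skews)


# Made-up sample rows: (init_status_stamp, que_status_stamp)
rows = [
    (datetime(2022, 7, 18, 13, 0, 0), datetime(2022, 7, 18, 13, 0, 2)),
    (datetime(2022, 7, 18, 13, 0, 0), datetime(2022, 7, 18, 13, 1, 10)),
]
skews = [skew_seconds(a, b) for a, b in rows]
print(histogram(skews), sli(skews, threshold=60))
```

An SLO then reads something like "99% of events have skew under 60 seconds", with sli() giving the measured side of that statement for each scan window.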
I'm sure there's stuff you can monitor for your GUI as well, like authentication failures and the usual lot.
Good luck!
Scott
Original Message:
Sent: Jul 18, 2022 01:40 PM
From: Pavel Vaynshtok
Subject: autosys KPI
i'm looking for KPI ideas for autosys.
- EP uptime
- EP latency
- number of jobs
- number of runs
- percentage of failed vs total runs
- number of machines
- number of job owners
what else can we measure?