Rally took a scheduled maintenance window on Friday night to move system writes to a new database server. The new configuration had run in parallel with the existing system for the previous two weeks, and its performance was at parity with the existing production database. The cutover succeeded and all traffic appeared normal. We continued to monitor traffic over the weekend and the system remained healthy. Under Monday morning load, however, the new system appeared unable to handle inbound connections at the rate we had previously tested. After some troubleshooting, we decided to roll back to the prior configuration to ensure service availability.
The change has been rolled back and we are running in a known stable configuration. We will continue to monitor the system as usual and communicate any issues.
Follow-up information:
On Monday, December 21, 2020, Rally services were unavailable for 417 minutes. We have identified the recent database configuration changes as the source of the issues. The Rally system is back in a stable configuration and we do not anticipate any further disruption to the service. The engineering team has significant work ahead to understand the failure, and will make no further database infrastructure changes until it is understood and resolved. No additional system maintenance will be planned for the remainder of 2020.
Our operations team monitors the health of our system 24/7. Our current 90-day uptime is 99.65% per our SLA-defined measurements. We take any incident that affects the availability and reliability of the service for our customers extremely seriously, and we apologize for the impact this incident had on you, our customers, and your businesses.
#critical-incidents #rootcauseanalysis