Rally Software

Rally 12/21/2020 System Outage Communication

1. Rally 12/21/2020 System Outage Communication

0 Recommend
Broadcom Employee

David LeDeaux
Posted Dec 21, 2020 05:32 PM
Edited by David LeDeaux Dec 28, 2020 02:42 PM

Rally took a scheduled maintenance window on Friday night to move to a new database server for processing system writes. The new configuration had run in parallel with the existing system for the previous two weeks, and its performance was at parity with the existing production database. We made the cutover successfully and all traffic appeared normal. We continued to monitor traffic over the weekend and the system was healthy. Under Monday morning load, however, the new system appeared unable to handle inbound connections at the rate we had previously tested. After some troubleshooting, we decided to roll back to the prior configuration to ensure service availability.

The change has been rolled back and we are running in a known stable configuration. We will continue to monitor the system as usual and communicate any issues.

Follow-up information:

On Monday, December 21, 2020, Rally services were unavailable for 417 minutes. We have identified the recent database configuration changes as the source of the issues experienced. The Rally system is back in a stable configuration and we do not anticipate any further disruption to the service. The engineering team has significant work ahead to understand the failure that occurred, and will make no further database infrastructure changes until the failure is understood and resolved. No additional system maintenance is planned for the remainder of 2020.

Our operations team monitors the health of our system 24/7. Our current 90-day uptime is 99.65% per our SLA-defined measurements. We take any incident that affects availability and reliability for our customers extremely seriously, and we apologize for the impact this incident had on you, our customers, and your business.

#critical-incidents #rootcauseanalysis
2. RE: Rally 12/21/2020 System Outage Communication

1 Recommend
Broadcom Employee

Jim Tykal
Posted Dec 28, 2020 10:41 PM

@David LeDeaux Any updates on the root cause of this outage? This had a major impact on our business, and our leadership is insisting on a thorough RCA and mitigation plan.

3. RE: Rally 12/21/2020 System Outage Communication

1 Recommend
Broadcom Employee

Jim Tykal
Posted Dec 28, 2020 10:41 PM

My leadership team is looking for a detailed RCA and mitigation plan for this outage. Rally was unavailable for almost 10 hours -- this has a major impact on our business.
4. RE: Rally 12/21/2020 System Outage Communication

0 Recommend
David Spalding
Posted Dec 29, 2020 10:47 AM

We are also looking forward to a full RCA report.

------------------------------
SM, Application Support
Cigna
------------------------------

5. RE: Rally 12/21/2020 System Outage Communication

0 Recommend
Broadcom Employee

David LeDeaux
Posted Jan 04, 2021 11:30 AM

Please find the RCA document attached.

Attachment(s)

RCA - System Outage (1).pdf 122 KB 1 version
