Rally Software


Rally 12/21/2020 System Outage Communication

  • 1.  Rally 12/21/2020 System Outage Communication

    Broadcom Employee
    Posted 12-21-2020 05:32 PM
    Edited by Dave LeDeaux 12-28-2020 02:42 PM

    Rally took a scheduled maintenance window on Friday night to move to a new database server for processing system writes. The new configuration had been run in parallel with the existing system for the previous two weeks, and its performance was at parity with the existing production database. We made the cutover successfully and all traffic appeared normal. We continued to monitor traffic over the weekend and the system was healthy. When we hit Monday morning loads, the new system appeared unable to handle inbound connections at the same rate we had previously tested. After some troubleshooting, we made the decision to roll back to the prior configuration to ensure service availability.

    The change has been rolled back and we are running in a known stable configuration. We will continue to monitor the system as usual and communicate any issues.

    Follow-up information:

    On Monday, December 21, 2020, Rally services were unavailable for 417 minutes. We’ve identified the recent database configuration changes as the source of the issues experienced. The Rally system is back in a stable configuration and we do not anticipate any further disruption to the service. The engineering team has significant work ahead to understand the failure that occurred and will not make further database infrastructure changes until the failure is understood and resolved. No additional system maintenance will be planned for the remainder of 2020.

    Our operations team monitors the health of our system 24/7. Our current 90-day uptime is 99.65% per our SLA-defined measurements. We take any incident that affects the availability and reliability of the service for our customers extremely seriously, and we would like to apologize for the impact this incident had on you, our customers, and your business.
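
    For readers who want to relate the outage duration to the uptime figure, the arithmetic can be sketched as below. This is only an illustration, not how Rally's SLA is actually measured: it assumes a plain 90-day calendar window in which every minute counts, and the function names are hypothetical. Real SLA definitions may exclude scheduled maintenance or measure availability differently.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes: float, window_days: int = 90) -> float:
    """Uptime percentage for a window, assuming every calendar minute counts."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_minutes_for(uptime_pct: float, window_days: int = 90) -> float:
    """Downtime in minutes implied by an uptime percentage over the window."""
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_pct) / 100.0


if __name__ == "__main__":
    # 417 minutes is the outage duration stated in the post.
    print(f"Uptime with 417 min down: {uptime_percent(417):.3f}%")
    # 99.65 is the 90-day uptime figure stated in the post.
    print(f"Downtime implied by 99.65%: {downtime_minutes_for(99.65):.1f} min")
```

    Under these assumptions, a 90-day window contains 129,600 minutes, so a 417-minute outage on its own corresponds to roughly 99.68% uptime, and 99.65% corresponds to roughly 454 minutes of total downtime.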

     #critical-incidents  #rootcauseanalysis



  • 2.  RE: Rally 12/21/2020 System Outage Communication

    Posted 12-28-2020 10:41 PM

     @Dave LeDeaux  Any updates on the root cause of this outage? This had a major impact on our business, and our leadership is insisting on a thorough RCA and mitigation plan.




  • 3.  RE: Rally 12/21/2020 System Outage Communication

    Posted 12-28-2020 10:41 PM

    My leadership team is looking for a detailed RCA and mitigation plan for this outage. Rally was unavailable for almost 10 hours -- this has a major impact on our business.



    ------------------------------
    Jim Tykal
    Discover Financial Services
    ------------------------------



  • 4.  RE: Rally 12/21/2020 System Outage Communication

    Posted 12-29-2020 10:47 AM

    We are also looking forward to a full RCA report. 



    ------------------------------
    SM, Application Support
    Cigna
    ------------------------------



  • 5.  RE: Rally 12/21/2020 System Outage Communication

    Broadcom Employee
    Posted 01-04-2021 11:30 AM
    Please find the RCA document attached

    Attachment(s)

    RCA - System Outage (1).pdf   122 KB 1 version