CA Client Automation

Tales from the Front: How a biological virus (COVID-19) brought down a computer system.

    Broadcom Employee
    Posted Mar 17, 2020 11:24 PM
    Hello Everyone,

    I thought I'd share a cautionary tale with the technical folk and those who look after IT systems here.

    Yesterday, I encountered my first technology issue triggered by the coronavirus (COVID-19).
    That's right, a biological virus had a severe impact on a computer system - and I suspect that it will not be the last time that this happens.

    The cause was a little like the panic buying of toilet paper here in Australia, even though logic suggests that this is not a commodity which will be in short supply - or even needed for this type of outbreak.  Fear itself - or precautionary measures, however you wish to frame it - had a direct impact on the real world.

    The impacted site made a business decision to have more people work from home, away from the central offices, as a precautionary measure to slow the impact of COVID-19. So far, so good; this is a recommended practice where possible. Thanks to the information revolution, they had the technology to offer this option, and they had used it successfully with their employees for years. It was a familiar process.

    The client uses one of our products (CA Process Automation) to enable VPN authentication for end users. A login that usually took only a small amount of time ballooned into one that took several hours.

    The first problem encountered was one of scale. There were simply far more people working from home than at any time previously - by a significant factor. Bear in mind that this is a large site, so the numbers are large. They had already taken the correct step of scaling the system with extra server nodes to handle the usual business load. However, a load of this size had never been experienced before.

    The second problem encountered was one of data maintenance. Archiving had not been run often enough, and a key table had ballooned to just under 20 million rows. Around 50% or more of the records were for closed items that were no longer needed. This hadn't caused an issue under "normal use."

    Combined, though, the two issues caused system performance to collapse.

    For this site, the resolution is still under way. It involves direct database removals to get the system back to at least a functioning state, followed by the implementation of a long-term archive process to meet the business needs. (I'll put the links for this process for CA PAM at the end.)
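
    To make the idea of those batched clean-up removals concrete, here is a minimal sketch in Python, using SQLite as a stand-in database. The table and column names ("work_items", "status", "closed_date") are hypothetical, not the CA PAM schema, and this is not the supported archive procedure - for that, see the Archive Policy documentation in the references below.

        # Minimal sketch: remove old, closed records in small batches.
        # "work_items", "status" and "closed_date" are hypothetical names.
        import sqlite3

        BATCH_SIZE = 10_000       # keep each transaction small
        RETENTION_DAYS = 90       # business-defined retention window

        conn = sqlite3.connect("example.db")    # stand-in for the real database
        conn.execute("""CREATE TABLE IF NOT EXISTS work_items (
                            id INTEGER PRIMARY KEY,
                            status TEXT,
                            closed_date TEXT)""")

        total_removed = 0
        while True:
            # Delete closed items past the retention window a batch at a time,
            # so locks and the transaction log stay manageable on a huge table.
            cur = conn.execute(
                """DELETE FROM work_items
                   WHERE id IN (SELECT id FROM work_items
                                WHERE status = 'CLOSED'
                                  AND closed_date < date('now', ?)
                                LIMIT ?)""",
                (f"-{RETENTION_DAYS} days", BATCH_SIZE))
            conn.commit()
            total_removed += cur.rowcount
            if cur.rowcount == 0:
                break

        print(f"Removed {total_removed} closed records")
        conn.close()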

    One takeaway from this is that even well-run systems can be undone by unusual events. It falls to the system administrator to ask as many "What if?" questions as possible. What if the database were lost? What if load spiked to 200%, 500%, or to the whole employee base? What technological impact can we expect from a business decision? There are testing tools available to help with these kinds of questions (we even make some of them).
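
    As a rough illustration of rehearsing one of those "What if?" questions, here is a small Python sketch that steps up the number of concurrent simulated logins and reports the worst latency at each level. The URL is a hypothetical test endpoint and this is not one of our testing products - just the kind of check a proper load-testing tool does far more thoroughly.

        # Step up concurrent simulated logins and watch how latency behaves.
        import time
        import urllib.request
        from concurrent.futures import ThreadPoolExecutor

        TEST_URL = "https://vpn-test.example.com/login"   # hypothetical endpoint

        def timed_login(_):
            """Fire one simulated login; return its duration, or None on failure."""
            start = time.perf_counter()
            try:
                urllib.request.urlopen(TEST_URL, timeout=30)
            except Exception:
                return None
            return time.perf_counter() - start

        # 1x, 2x and 5x the "normal" number of concurrent remote users.
        for multiplier in (1, 2, 5):
            users = 50 * multiplier
            with ThreadPoolExecutor(max_workers=users) as pool:
                results = list(pool.map(timed_login, range(users)))
            ok = [r for r in results if r is not None]
            if ok:
                print(f"{users} logins: {len(ok)} succeeded, worst latency {max(ok):.2f}s")
            else:
                print(f"{users} logins: all failed")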

    Another is that system maintenance is not a "one and done" deal; it is an ongoing activity. Sometimes the need is obvious, such as when a new division is brought on board. Sometimes it should be part of an "annual check-up" on the health of the system.


    Anyway, I hope this case was of interest. If you have a system that depends on the outside network or a VPN, or that may see a large change in user volume, you may wish to ask yourself a few "What if?" questions at this point and see if you like the answers.


    Thanks, Kyle_R.

    References

    Documentation: CA Process Automation 4.3.04 - Archive Policy
    Knowledge Article: Why is my ITPAM database so large?