Automating Site Reliability Engineering

By Yann Guernion posted Aug 04, 2020 07:07 AM


Being a Site Reliability Engineer is not an easy job. You have to handle application deployments, configuration, monitoring, and more, so that everything works effortlessly in production. Triage, troubleshooting, remediation, and support are, for the most part, undertaken manually. No matter how good you are, these processes are error-prone and require significant effort.

Organizations looking to adopt SRE models typically rely on either tools developed in-house from open-source components or loosely connected toolchains. This can quickly lead to a sprawl of siloed tools, and the resulting heterogeneity makes it harder for staff to find a single source of truth.

To ensure their complex, hybrid, interrelated, and dynamic environments deliver the best possible user experience, SRE teams have to achieve fundamental breakthroughs in scale and efficiency. It’s no longer sufficient simply to react a little faster when issues arise. Teams must gain the visibility needed to identify potential problems – and address them before they affect service levels.

Introducing Site Reliability Automation

Now an integral part of AIOps from Broadcom, Automic® Automation delivers contextual automation and seamless integration of root cause alarms with remedial workflows. This contextual awareness lets SRE teams easily automate standard operating procedures that can be reused across environments. Contextual automation not only reduces operations toil, it also enables teams to handle a larger volume of events.

Our AIOps solution also includes a recommendation engine that leverages cross-domain insight to help staff choose the most effective course of action for issue remediation. ML algorithms rank remediation workflows by how successful they have been in a given context. That continuous learning helps resolve more issues faster and reduces mean time to repair (MTTR).

The combination of contextual automation and intelligent recommendation enables efficient Site Reliability Automation (SRA). It addresses the two major challenges SRE teams face by reducing operations toil and improving MTTR. Ultimately, SRA empowers SRE teams with automation that supports decision-making and ensures more consistent operations by:

  • Knowing the root cause of a problem by pulling data together from disparate sources and applying AI and ML to find the source of the problem – fast.
  • Doing what is needed to solve the problem, using out-of-the-box remediation and diagnostic workflows.
  • Learning what works and what doesn’t, so that the optimal response can be made when similar problems occur in the future.
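
The "learning" step above can be illustrated with a small sketch. This is not Automic's API or Broadcom's algorithm, which the post does not describe; it is a toy example that assumes remediation outcomes are recorded per context and that workflows are ranked by a smoothed success rate. The class name, method names, context labels, and workflow names are all hypothetical.

```python
from collections import defaultdict


class RemediationRecommender:
    """Toy recommender (hypothetical): ranks workflows by smoothed success rate."""

    def __init__(self):
        # (context, workflow) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])
        # context -> set of workflows that have been tried in that context
        self._workflows = defaultdict(set)

    def record(self, context, workflow, succeeded):
        """Learn: store the outcome of one remediation attempt."""
        stats = self._stats[(context, workflow)]
        if succeeded:
            stats[0] += 1
        stats[1] += 1
        self._workflows[context].add(workflow)

    def score(self, context, workflow):
        """Smoothed success rate, so a single lucky run does not dominate."""
        successes, attempts = self._stats.get((context, workflow), (0, 0))
        # Laplace (add-one) smoothing: an assumption, not the product's method.
        return (successes + 1) / (attempts + 2)

    def rank(self, context):
        """Recommend: workflows for this context, best score first."""
        candidates = self._workflows.get(context, ())
        # Ties are broken alphabetically to keep the ordering deterministic.
        return sorted(candidates, key=lambda w: (-self.score(context, w), w))


# Hypothetical history: two successful purges, one failed restart.
recommender = RemediationRecommender()
recommender.record("disk_full", "purge_temp_files", True)
recommender.record("disk_full", "purge_temp_files", True)
recommender.record("disk_full", "restart_service", False)
ranking = recommender.rank("disk_full")
```

With this history, `purge_temp_files` ranks ahead of `restart_service` for the `disk_full` context. A real recommendation engine would draw on far richer context than a single label, but the feedback loop is the same: every remediation outcome updates the ranking used the next time a similar problem occurs.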

[Figure: Site Reliability Automation]

The New Reality for Automation

The reality is that many organizations looking to adopt SRE models are employing loosely connected toolchains. After introducing a multitude of tools, staff typically find it harder to manage the infrastructure efficiently and to expedite problem-solving. SRE teams need modern orchestration and automation to drive agility into their new digital initiatives, integrating cloud, big data, and artificial intelligence platforms.

Broadcom, which has long been committed to helping companies implement efficient automation and orchestration, has been referenced as a representative vendor in the latest “Market Guide for Service Orchestration and Automation Platforms (SOAP)” from Gartner. SOAP is a new category of automation solutions that can act as a single point of orchestration and meet the needs of heterogeneous IT environments that include cloud-native infrastructure and big data workloads. SOAPs do not necessarily replace existing automation tools, but rather aim to centralize, route, and delegate automation tasks as needed. Broadcom enables end-to-end visibility and orchestration that simplifies the deployment and management of digital services across on-prem systems, big data, and multi-cloud infrastructures. The bottom line? You can enhance infrastructure monitoring with intelligent recommendations and auto-remediation that help site reliability engineers create more resilient production environments.

Yesterday’s automation tools are now too complex to use and to scale. Connecting large numbers of heterogeneous tools, systems, and data sources is a technology hurdle that has to be addressed. As a result, proper support for massive, fast-changing, and event-driven environments sets the new standard for automation in the digital age. If you’re driving your digital transformation and building innovative new services, it is a perfect time to look at tomorrow’s automation solutions.