Latest Developments in Site Reliability Engineering, 2023
Introduction
Gartner recently published its Hype Cycle for Site Reliability Engineering, 2023 report (July 2023). Inspired by this report, OnPage is sharing its predictions about the future of site reliability engineering. In this blog post, OnPage reviews evolving tools that can improve site reliability engineering practices.
What is Site Reliability Engineering?
Site reliability engineering (SRE) is the practice of keeping infrastructure and operations (I&O) smooth, resilient, and reliable. To optimize I&O, site reliability engineers apply software engineering principles to maintain scalable, high-performing, reliable systems. Investing in SRE can lead to longer periods of consistent network uptime, stronger cybersecurity, higher customer satisfaction, better observability and monitoring, and quicker software deployment.
Trends in Site Reliability Engineering
While creating its report, Gartner® identified strategic planning assumptions. OnPage agrees with the following predictions made by Gartner®:
- “By 2027, 75% of enterprises will use site reliability engineering practices across their organizations to optimize product design, cost and operations to meet customer expectations, up from 10% in 2022.
- By 2025, 40% of organizations will implement chaos engineering practices as part of site reliability engineering initiatives, improving mean time to repair (MTTR) by an average of 90%.
- By 2026, 70% of organizations that successfully applied observability will achieve shorter latency for decision making, enabling competitive advantage for target business or IT processes.
- By 2025, organizations that invest in building digital immunity will increase customer satisfaction by decreasing downtime by 80%.
- By the end of 2025, 30% of enterprises will establish new roles focused on IT resilience and boost end-to-end reliability, tolerability and recoverability by at least 45%.”
(Gartner® 1, 2023)
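To make the MTTR prediction above concrete, here is a minimal sketch of the arithmetic. The incident durations and the 30-day failure interval are made-up numbers for illustration, not figures from the Gartner® report. Availability uses the standard steady-state formula MTBF / (MTBF + MTTR).

```python
from statistics import mean


def mttr_minutes(repair_durations):
    """Mean time to repair: average of per-incident repair durations (minutes)."""
    return mean(repair_durations)


def availability(mtbf_minutes, mttr):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_minutes / (mtbf_minutes + mttr)


# Hypothetical repair times for four incidents, in minutes.
baseline_repairs = [30, 90, 45, 75]
baseline_mttr = mttr_minutes(baseline_repairs)

# A 90% improvement leaves 10% of the original repair time.
improved_mttr = baseline_mttr * (100 - 90) / 100

# Assumption: one failure every 30 days on average.
mtbf = 30 * 24 * 60

if __name__ == "__main__":
    print(f"baseline MTTR: {baseline_mttr} min, "
          f"availability {availability(mtbf, baseline_mttr):.5f}")
    print(f"improved MTTR: {improved_mttr} min, "
          f"availability {availability(mtbf, improved_mttr):.5f}")
```

With these assumed inputs, a 60-minute MTTR drops to 6 minutes, and availability rises accordingly because less of each failure cycle is spent in repair.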
Tools & Practices for Site Reliability Engineering
Focus
In this section, we summarize SRE practices and automated tools and explain their impact on site reliability engineering. The first, monitoring as code (MaC), is a practice that is emerging in the market, and developers are swiftly iterating on the first-generation software. The second, automated incident response (AIR), is a tool category approaching mainstream commercialization; vendors are working to better understand the software’s capabilities so they can move toward mainstream adoption. The third, DevSecOps (development, security, operations), has been fully accepted, and organizations benefit from its low-risk, easy implementation as mainstream adoption rapidly increases.
Monitoring as Code
OnPage believes MaC is growing in relevance across the industry. Here is the Gartner® definition of MaC:
“Monitoring as code (MaC) is the process of applying software principles to monitoring, meaning the configuration of monitoring is designed to enable its management, like software. With MaC, the configuration of monitoring is codified, version-controlled, tested and automated. This flexibility offers DevOps teams the option to apply a shift-left approach for fast and consistent monitoring across systems.” (Gartner® 31, 2023)
Before MaC, traditional monitoring required large amounts of manual, inflexible configuration, which increased the risk of human error and extended the mean time to detect failures. Because MaC codifies, versions, and tests that configuration, it can serve as an SRE monitoring solution that supports engineers’ responsibility to maintain I&O reliability.
OnPage advises organizations to use MaC to tailor their monitoring practices to their DevOps and site reliability engineering needs. Organizations should also seek operational feedback so they can better adapt MaC to their I&O environment.
Remember that MaC is still in the early stages of innovation. If your organization is unhappy with its current SRE methods for monitoring key performance indicators (KPIs) and is willing to experiment, consider adopting MaC and investing in tools that support these new monitoring efforts.
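As a rough illustration of what "codified, version-controlled, tested" monitoring configuration can look like, here is a minimal Python sketch. It does not represent any particular vendor's MaC product. The Check fields, the example.com URLs, and the JSON output format are all assumptions chosen for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Check:
    """One monitoring check, declared in code instead of a UI (hypothetical schema)."""
    name: str
    target: str               # URL or host to probe
    interval_seconds: int     # how often to run the probe
    latency_threshold_ms: int # alert when latency exceeds this value


def validate(checks):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen = set()
    for check in checks:
        if check.name in seen:
            problems.append(f"duplicate check name: {check.name}")
        seen.add(check.name)
        if check.interval_seconds <= 0:
            problems.append(f"{check.name}: interval must be positive")
        if check.latency_threshold_ms <= 0:
            problems.append(f"{check.name}: latency threshold must be positive")
    return problems


def render(checks):
    """Serialize checks to deterministic JSON so version-control diffs stay small."""
    ordered = sorted(checks, key=lambda check: check.name)
    return json.dumps([asdict(check) for check in ordered], indent=2, sort_keys=True)


# Placeholder targets; a real repository would list the organization's own services.
CHECKS = [
    Check("login-page", "https://example.com/login", 60, 800),
    Check("checkout-api", "https://example.com/health", 30, 500),
]

if __name__ == "__main__":
    errors = validate(CHECKS)
    if errors:
        raise SystemExit("\n".join(errors))
    print(render(CHECKS))
```

Because the checks live in a source file, they can be reviewed in pull requests, and validate can run as a unit test in the same pipeline that deploys the application, which is the shift-left approach the definition describes.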
Automated Incident Response
AIR centralizes alert management and incident routing, enabling organizations to streamline IT operations, reduce incident response time, and improve incident resolution. One example of AIR is OnPage’s incident alert management platform, which offers on-call scheduling, escalation policies, and notifications that persist until an on-call specialist addresses them. The OnPage team is pleased to announce that OnPage is listed as a Sample Vendor in the Automated Incident Response category of the Gartner® Hype Cycle™ for Site Reliability Engineering, 2023 report. We believe it is an honor to be mentioned in the report and are proud to continue serving the IT community.
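The escalation behavior described above, paging a responder repeatedly and moving to the next on-call tier if nobody acknowledges, can be sketched generically as follows. This is not OnPage's API or implementation. EscalationPolicy, Alert, dispatch, and the injected send and is_acknowledged callables are hypothetical names used only for illustration, and the sketch omits the wait between retries that a real system would need.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class EscalationPolicy:
    tiers: List[str]           # ordered on-call responders
    retries_per_tier: int = 3  # pages sent to a tier before escalating


@dataclass
class Alert:
    message: str
    acknowledged_by: Optional[str] = None
    log: List[Tuple[str, int]] = field(default_factory=list)


def dispatch(alert: Alert,
             policy: EscalationPolicy,
             send: Callable[[str, str], None],
             is_acknowledged: Callable[[str], bool]) -> bool:
    """Page each tier repeatedly until someone acknowledges or tiers run out.

    The callables are injected so the routing logic can be exercised
    without a real paging service.
    """
    for responder in policy.tiers:
        for attempt in range(1, policy.retries_per_tier + 1):
            send(responder, alert.message)
            alert.log.append((responder, attempt))
            if is_acknowledged(responder):
                alert.acknowledged_by = responder
                return True
    return False


# Usage: the primary never responds, so the alert escalates to the secondary.
sent = []
policy = EscalationPolicy(tiers=["primary", "secondary"], retries_per_tier=2)
alert = Alert("database latency above threshold")
handled = dispatch(
    alert,
    policy,
    send=lambda responder, message: sent.append(responder),
    is_acknowledged=lambda responder: responder == "secondary",
)
```

Keeping the policy as data and the delivery mechanism as a parameter makes it possible to test escalation paths before an incident exercises them.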
OnPage believes organizations can decrease mean time to acknowledge (MTTA) and expedite disaster recovery by providing DevOps and site reliability engineers with a centralized AIR solution. Many AIR tools integrate with existing software in DevOps toolchains and with ChatOps tools. For example, OnPage’s integration with Slack can simplify communication and collaboration across teams with automated workflows during SRE incident response.
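Since MTTA is the metric this recommendation targets, it helps to be explicit about how it is computed. The sketch below assumes incidents are available as pairs of creation and acknowledgement timestamps. The sample data is invented, and skipping unacknowledged incidents is one possible convention, not a standard.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """Average gap between alert creation and acknowledgement.

    `incidents` is an iterable of (created_at, acknowledged_at) datetime
    pairs. Incidents with acknowledged_at set to None are skipped.
    Returns a timedelta, or None if nothing was acknowledged.
    """
    gaps = [ack - created for created, ack in incidents if ack is not None]
    if not gaps:
        return None
    return sum(gaps, timedelta()) / len(gaps)


# Invented sample data: two acknowledged incidents and one that never was.
t0 = datetime(2023, 7, 1, 9, 0)
incidents = [
    (t0, t0 + timedelta(minutes=4)),
    (t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=10)),
    (t0 + timedelta(hours=5), None),
]
mtta = mean_time_to_acknowledge(incidents)

if __name__ == "__main__":
    print(f"MTTA: {mtta}")
```

Tracking this value before and after adopting a centralized AIR tool gives a direct way to check whether acknowledgement times actually improve.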
With AIR expected to become standard in site reliability engineering practices, OnPage recommends investing in a centralized AIR solution. For more information about automated incident response tools, schedule a free 30-minute demo with a team member to learn about OnPage’s incident alert management platform.
DevSecOps
OnPage believes the future of SRE lies in part with DevSecOps. Gartner® defines DevSecOps as:
“…the integration and automation of security and compliance testing into agile IT and DevOps development pipelines, as seamlessly and transparently as possible, without reducing the agility or speed of developers or requiring them to leave their development toolchain. Ideally, offerings provide security visibility and protection at runtime as well.” (Gartner® 86, 2023)
Within the typical software development life cycle (SDLC), SRE methodologies for identifying cybersecurity threats can frustrate developers and slow down the SDLC. DevSecOps supports SRE’s goal of ensuring software reliability by identifying vulnerabilities during development, and it strengthens cybersecurity without obstructing developers who already face tight deadlines and managerial pressure.
OnPage suggests making security testing available early in the SDLC so developers can catch and fix mistakes early without deviating from their CI/CD pipelines. Integrating security testing with the SDLC and the developer’s existing workflows and toolsets can help maintain the expected rapid pace of development.
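One way to picture early security testing inside an existing pipeline is a small scan step that fails the build when it finds obvious problems. The sketch below is illustrative only. The two regular expressions are simplified examples (strings shaped like AWS access key IDs and hardcoded password assignments), and RULES, scan, and gate are hypothetical names. Production scanners use far larger rule sets.

```python
import re

# Illustrative patterns only; not a complete or authoritative rule set.
RULES = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded-password": re.compile(
        r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def scan(files):
    """Return (path, line_number, rule) for each finding.

    `files` maps a path to its text content, so the function can be
    tested without touching the file system.
    """
    findings = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in RULES.items():
                if pattern.search(line):
                    findings.append((path, lineno, rule))
    return findings


def gate(files):
    """Exit code for a CI step: 0 when clean, 1 when any finding exists."""
    return 1 if scan(files) else 0


# Usage with in-memory sample files; the key below is a documentation-style
# placeholder, not a real credential.
sample = {
    "app.py": 'db_password = "hunter2"\nprint("ok")\n',
    "cfg.py": 'KEY = "AKIAIOSFODNN7EXAMPLE"\n',
}
findings = scan(sample)
exit_code = gate(sample)
```

Running a step like this on every commit keeps the feedback inside the developer's existing CI/CD workflow, so the fix happens while the change is still fresh.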
If your organization does not practice DevSecOps, OnPage recommends reevaluating your I&O to see if DevSecOps can contribute to SDLC and SRE optimization.
Conclusion
OnPage believes that MaC, AIR, and DevSecOps can support site reliability engineering duties. Organizations can optimize SRE and I&O by tracking the evolution of MaC, AIR, and DevSecOps and by adopting the right tools and technologies at the right time.