Embracing Site Reliability Engineering (SRE) in US Government.

It has been more than a decade since I first heard of ‘SRE,’ a term fixed in my memory by my involvement with the Healthcare.gov surge team. At the time, some of the Silicon Valley professionals who came out to help us were describing the general outline of SRE, a concept that, on reflection, was far ahead of anything we knew how to offer our operations team.

It was not as if we had no monitoring or operations in place. Rather, what was new was the environment we were working in, particularly the brick-and-mortar data centers that were beginning to change at the time. Add to that the service lifecycle we had to follow from procurement through implementation to operation, legacy red tape included, all of which followed a very traditional government IT approach.

The problem was that we were following a software development lifecycle (SDLC) that was unfit for purpose, especially when we needed to recover from the day-one issues at Healthcare.gov. We needed to be more resilient in order to recover quickly; I recall taking a call at 11:00 p.m. from a few of the tech surge teams asking how they could get a license for an Application Performance Monitoring (APM) tool we had. We knew we could help the surge team’s SREs by giving them some transparency, but overall, we were still in the dark about how to offer a full SRE service.

Where we are now.

A lot has transpired in the last ten years, and I cannot deny the impact SRE has had on my beloved home agency, the Centers for Medicare and Medicaid Services (CMS). Here, I want to share how the SRE cultural role has been adopted and evolved over the last decade.

For starters, SRE has had a wide impact, and numerous value-driven programs and endeavors have been natural beneficiaries of the approach. However, it all comes down to being agile, nimble, and open to evolution in a highly bureaucratic environment where everyone must shift to a culture of inspection, adaptation, and repair.

There are two critical components to this change of approach. The first is reflected in how my fellow federal government officials began to embrace a vital change in their people, processes, and technology. The second has been the industry's support for those changes.

Let me break it down a bit further:

Government Persona: As individuals in our government organization began to recognize that transparency and collaboration are central to SRE, they began to examine the metrics being produced. For instance, when product owners prioritized nonfunctional requirements (made possible by newly adopted APM technologies that provided better visibility and root cause analysis), the customer experience improved, resulting in better outcomes.

The gap between product and operations began to close, collaboration grew, and silos began to break down. Finally, a readiness to educate, and to benefit from the best in workforce resilience, has been a massive development over the last ten years, at least from my vantage point.

Process in Government: The evolution of business functions in government was a significant process shift with considerable impact. Most significantly, procurement teams have begun to take notice. Industry is now being asked explicitly for what is expected of a vendor performing the work, including modern DevOps and SRE duties. Improved Service Level Indicators (SLIs) and Service Level Agreements (SLAs) are being specified at the contract, product, and performance levels, holding diverse organizations accountable.
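To make the contract language concrete, here is a minimal sketch of how an availability SLI might be checked against an agreed target. The request counts and the 99.9% target are illustrative assumptions, not figures from any actual CMS contract, and the names `AvailabilitySLI` and `meets_target` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLI:
    """Availability SLI: share of requests served successfully in a window."""
    good_requests: int
    total_requests: int

    def value(self) -> float:
        # With no traffic there is nothing to violate, so report 100%.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def meets_target(sli: AvailabilitySLI, target: float) -> bool:
    """Return True when the measured SLI is at or above the agreed target."""
    return sli.value() >= target


# Hypothetical monthly numbers and a hypothetical 99.9% contractual target.
monthly = AvailabilitySLI(good_requests=998_700, total_requests=1_000_000)
print(f"SLI = {monthly.value():.4%}, meets target: {meets_target(monthly, 0.999)}")
```

The point of writing the indicator down this precisely is that vendor and agency measure the same thing, which is what makes the agreement enforceable.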

Shared ownership has also become a significant factor. No more throwing the problem over the fence; it is now an integrated challenge that must be solved with product and engineering working together. Many product roadmaps have begun to emphasize the importance of SRE functions in their agile processes and planning sessions.

Technology in Government: With the appropriate investment in technology, the government's use of SRE functions becomes possible. As data-center-to-cloud migration, adoption, and maturation took off, we have seen that SREs are more effective when they have the right tools to work with and mature with.

Investment in tools that can perform APM, log analysis, and processing is now common. Tools are used to inform customers of the status and performance of their services. This is significant for informing citizens about any service disruptions and improving the customer experience. 

Artificial Intelligence (AI) and Machine Learning (ML) projects have begun, and the capability to act proactively rather than reactively is being developed. Many cloud-native and SaaS-based products and services that support SRE functions have been FedRAMP-authorized, and adoption and investment are underway.

Industry Partners: This has been a major task for everyone, yet it was well accepted, welcomed, and delivered. It was a partnership built on trust, empathy, and alignment around generating value: building a better customer experience and ensuring citizens have the best experience possible.

From a partnership point of view, helping government partners think through the value of, and investment in, the SRE function has also been critical to the overall process. The aim is to raise awareness of the processes and technologies required and then, most crucially, to bring in people and resources who are industry leaders.

Now there are alliances with like-minded groups and tool partners who want to be part of the change. Industry partners have learned to be open about the difficulties that are uncovered and to take ownership of them rather than work in denial mode, and to work with government product owners to prioritize features as needed to deliver the value required.

Where can we go now?

As government and industry partners collaborate to make SRE a primary function, it becomes part of delivering value-focused, resilient, secure, and scalable systems. It also helps us consider how we can develop some basic SRE playbooks, starting from the premise that SRE never stops evolving.

We could continue to define the role of SRE in platform engineering, product management, and modern security practices, all of which are in desperate need of assistance. Research on generative AI for SRE functions shows how quickly AI is evolving and how it may now be applied in compliance management.

We continue to hear that cloud spending is increasing, partly due to poor engineering practices and a lack of control over cloud resources. For example, look at how FinOps can be run as an SRE function, giving SRE a seat at the table, just like many other important roles. The process can also be used to keep fostering a blameless culture rather than blaming the contractor or the Government Task Lead (GTL).

SRE as a Force for Good

The culture of traditional operations teams is changing. Operations teams are often concerned with keeping systems operational, whereas SRE teams are concerned with making systems resilient. Through all iterations, however, it’s the citizens who deserve high-quality services and are the whole point of any management system. SRE can help organizations ensure that their systems are available and responsive, thereby improving the overall customer experience.

Agencies can also innovate more swiftly if they can release modifications and new features quickly and safely. SRE teams can design and implement processes that allow for rapid experimentation, while metrics can be used to assess success. Agencies should use metrics like uptime, mean time to recovery, and user satisfaction to evaluate the effectiveness of their SRE programs and identify areas for improvement, and they should collaborate with other government and community organizations.
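As a rough illustration of two of those metrics, the sketch below derives uptime and mean time to recovery from a list of outage windows. The incident timestamps are invented for the example, the function name `uptime_and_mttr` is hypothetical, and user satisfaction is left out because it typically comes from surveys rather than system logs.

```python
from datetime import datetime, timedelta


def uptime_and_mttr(incidents, period_start, period_end):
    """Return (uptime fraction, mean time to recovery) for a reporting period.

    `incidents` is a list of (outage_start, service_restored) datetime pairs,
    assumed not to overlap and to fall inside the reporting period.
    """
    period = period_end - period_start
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = 1 - downtime / period  # timedelta / timedelta gives a float
    mttr = downtime / len(incidents) if incidents else timedelta()
    return uptime, mttr


# Hypothetical incident log for a 30-day reporting period.
incidents = [
    (datetime(2024, 3, 4, 23, 0), datetime(2024, 3, 5, 0, 30)),    # 90 minutes
    (datetime(2024, 3, 18, 9, 15), datetime(2024, 3, 18, 9, 45)),  # 30 minutes
]
uptime, mttr = uptime_and_mttr(incidents, datetime(2024, 3, 1), datetime(2024, 3, 31))
print(f"uptime = {uptime:.3%}, MTTR = {mttr}")
```

Tracking both numbers matters because they can move independently: a program can keep uptime high while individual recoveries remain slow, or the reverse.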

Finally, it is important to raise awareness of SRE, educate people about its importance, and develop SRE responsibilities, not just in the partner vendor community but also in federal positions.

In retrospect, SRE saved our lives at CMS. I am not here to debate who saved Healthcare.gov but to describe how the government has generally embraced SRE. Healthcare.gov was the starting point for me and many of my fellow government colleagues. Some may consider it a failure or a bad dream. Still, I see it as a good wake-up call, a good dream shared by me and many of my colleagues in government who worked hard, sacrificed family time, and stayed mission-oriented. The battle was about how we could give the SRE team time to get some observability in place so that the SREs could help the surge team and the government deal with the issues that needed to be dealt with.

Conclusion

Despite these obstacles, SRE can benefit government agencies. By implementing SRE, government enterprises can improve the stability of their software systems, cut costs, improve security, and increase agility. If your government agency is considering SRE, I recommend that you do your research and make sure you are prepared to face the hurdles. If you are a Product Owner, please visit our SRE section under Shipping Value to learn more about SRE.