DORA
DORA stands for DevOps Research and Assessment. According to its website, DORA "is the largest and longest running research program of its kind, that seeks to understand capabilities that drive software delivery and operations performance." DORA conducts an annual survey and has reported a strong correlation between high scores on its metrics and better operational performance.
The DORA metrics are used by engineering teams to measure their performance and rate themselves on a continuum from "low performers" to "elite performers". The four metrics are as follows (a rough calculation sketch follows the list):
Deployment Frequency (DF): the frequency of successful software releases to production.
Lead Time for Changes (LT): the time between committing a code change and that change reaching a deployable state.
Mean Time to Recovery (MTTR): the time between an interruption (due to a deployment or system failure) and full recovery.
Change Failure Rate (CFR): how often a team's changes or hotfixes lead to failures after the code has been deployed.
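To make these definitions concrete, here is a minimal sketch of how a team might compute the four metrics from its own deployment and incident history. The `Deployment` and `Incident` record types are hypothetical stand-ins for whatever your CI/CD and incident-tracking tools actually expose; DORA itself doesn't prescribe any particular calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


# Hypothetical record types; in practice this data comes from your CI/CD
# and incident-tracking systems.
@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    failed: bool             # did this change cause a failure in production?


@dataclass
class Incident:
    started_at: datetime     # interruption begins
    resolved_at: datetime    # full recovery


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], incidents: list[Incident], days: int) -> dict:
    """Rough, illustrative calculations of the four DORA metrics."""
    df = len(deployments) / days                                           # DF: deploys per day
    lt = mean(hours(d.deployed_at - d.committed_at) for d in deployments)  # LT: hours from commit to deployable/deployed
    mttr = mean(hours(i.resolved_at - i.started_at) for i in incidents) if incidents else 0.0
    cfr = sum(d.failed for d in deployments) / len(deployments)            # CFR: share of deploys causing failures
    return {"DF": df, "LT": lt, "MTTR": mttr, "CFR": cfr}
```

The absolute numbers matter less than the trend: you want DF to go up over time while LT, MTTR, and CFR go down.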
Of these four metrics, two make direct references to business value: CFR and MTTR. However, both of these metrics refer only to problems (a "failure" or an "interruption"). Take MTTR, for example. By measuring and improving the time it takes to recover from an interruption, your company can get back to a stable state quicker. This is good, but it is only "arguing to zero." That is, we're going from a bad state to a normal state. I see nothing wrong with this measurement, but I think the real value in the DORA metrics, the positive benefits, can get overlooked.
DORA Context
Let's reflect on all that has to go into achieving exceptional DORA metrics. To hit "elite performer" status, you need many prerequisite milestones, such as:
- Automated Testing
- Continuous Integration / Continuous Deployment (CI/CD)
- Infrastructure as Code (IaC)
- Agile (the real Agile) methodology
- The elimination of vertical silos
Why all this? Good Deployment Frequency (DF) numbers are unachievable without strong automated testing, which gives assurance that your new change didn't break existing functionality. Good Lead Time for Changes (LT) numbers almost always require Continuous Integration. And both DF and LT are key components of MTTR; without either, your recovery time suffers.
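Here is a purely illustrative sketch of the commit-gated loop that trustworthy automated tests enable; the `pytest` and `./deploy.sh` commands are placeholders for whatever your test suite and deployment tooling happen to be. When every commit that passes the suite ships immediately, good DF and LT numbers follow almost by definition.

```python
import subprocess
import sys
from datetime import datetime, timezone


def test_and_deploy(commit_sha: str) -> None:
    """Gate a commit on the automated test suite, then deploy it."""
    started = datetime.now(timezone.utc)

    # 1. Automated testing: the safety net that makes deploying every commit sane.
    #    ("pytest" is a placeholder for your team's actual test command.)
    if subprocess.run(["pytest", "--quiet"]).returncode != 0:
        sys.exit(f"Tests failed for {commit_sha}; nothing was deployed.")

    # 2. Continuous deployment: ship the change as soon as it is proven safe.
    #    ("./deploy.sh" stands in for your real deployment step.)
    subprocess.run(["./deploy.sh", commit_sha], check=True)

    # 3. Record the timestamps; DF and LT fall straight out of this log.
    finished = datetime.now(timezone.utc)
    print(f"{commit_sha} deployed at {finished.isoformat()} (pipeline took {finished - started})")
```

The same loop is what keeps MTTR low: when a fix is just another small commit, recovery is only as slow as your next deployment.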
More importantly, all these metrics are negatively affected if you have to "silo hop." What is a silo hop? It's when responsibility for different parts of the software development lifecycle is passed from team to team. For example, the Engineering team writes code, the QA team tests it, the Deployment team deploys it, and the Operations/SRE team monitors and runs it. This misaligns feedback loops and puts the focus on sub-optimal, localized, siloed measurements like "It was 'Dev-Complete' in just one day! (I don't know why it wasn't deployed sooner)," or "We deployed the app in 20 minutes! (I don't know why it got stuck in QA for a week)," or "After an all-nighter, we now have the app stable in Production! (Don't you developers dare make any changes)."
Instead, I think we should highlight some of the positives that go along with good DORA metrics (and all the prerequisite milestones necessary for good scores). At the "elite performer" level, you can deliver features faster. With all testing automated, including tests for the new functionality, you know immediately that you successfully added the functionality and did no harm. That frees up time to clean/tidy/refactor your code. Clean code makes the next feature easier to deliver, which, in turn, gives you more time to clean/tidy/refactor.
This last bit is the "virtuous cycle" I mentioned in the title. It is also referred to as the "flywheel effect" in that doing a small good thing makes the next good thing a bit easier. Rinse and repeat until your team is at the "elite performer" level.
Case Study
Let's look at a (slightly anonymized) case study that I've seen play out at multiple companies. In each case, I was asked to take a team (or a subset of teams) out of an Engineering department and help them excel. For this article, let's call this example team "Delta".
I obtained cover for team Delta and helped them solve problems as they arose. "Cover" in this sense means I got permission from upstream leadership to let this team own everything: the code, the build pipelines, the testing, the deployment, the monitoring; the works. The other teams had a separate QA team to test, a DevOps team to build and maintain pipelines, and an Operations team to monitor and run their app.
Within three months, Delta had turned a rotting set of automated post-deploy tests into something they trusted. As a consequence, Delta adopted continuous integration via trunk-based development and a "ship/show/ask" approach to merging. The other teams continued to rely on QA for testing and quality sign-off, and they kept using their open-source-style pull-request model for code integration.
Within six months, Delta was deploying all of its applications to all lower (non-production) environments on every commit. They constantly tuned their build pipelines for speed, accuracy, and simplicity. They knew how their pipelines contributed to their deployment speed and had the knowledge to fix problems quickly if they arose. The other teams, whose pipelines were created and managed by DevOps, often reported "The pipeline is broken; we're blocked" and had issues getting their apps deployed someplace where QA could begin to test.
Within nine months, Delta was deploying to production on every commit. Their Change Failure Rate (CFR) was hovering near zero. The other teams batched up all their changes from a two-week sprint into a bi-weekly release. Those deployments often had problems and their apps almost always had bugs.
At this point, I started noticing the "virtuous cycle" or "flywheel effect" in action. The Delta team started completing their sprint work within the first week of a two-week cycle. They would then dedicate time to cleaning/refactoring/tidying their code, even taking on large architectural improvements. The other teams began to miss sprint commitments and started begging Product for "refactor stories" or a "clean-up sprint," which, although considered, never materialized.
In short, the Delta team produced:
- Faster, safer deliveries
- Cleaner code
- Pride in their work
The team was happy, performant, and engaged. They asked questions like "What else can we do to make it even better?" as opposed to "Why are things so bad?"
When I talk to tech leaders about DevOps and the DORA metrics, I tell them these kinds of stories. I want them to know the positives that come along with elite-performing teams, not just the negatives they can avoid. Which team would you rather lead?
Have fun.
References
Jim Collins - The Flywheel Effect
(real) Agile - Manifesto for Agile Software Development
LeanIX - The Definitive Guide to DORA Metrics
Forbes - Mean Time To Recovery - A DORA Metric Explained
DORA - DevOps Research and Assessment
Martin Fowler - Ship/Show/Ask