Rethinking Metrics
How to turn “good vibes” into meaningful signals.
Right now, a lot of testing runs on VDD. Vibe-Driven Development.
And no, I don’t mean AI! I mean that testers are doing plenty of work that looks great on the surface. Everyone feels confident. But dig a little deeper?
Vibes.
People1 have a hard time understanding:
The state of the product
What testing we did to arrive at that conclusion
And just as importantly, what testing we did not do
How the testing we did was impacted by outside factors (environments, absence, testability, etc)
At best, that’s not very compelling. At worst, it’s risky. It leaves gaps, confusion, and plenty of room for bad surprises.
I’ve been thinking about what to do instead.
Let’s dive in.
The Question That Always Comes Up
Vibes Aren’t Enough
Vibes can get you pretty far. But sooner or later, every team faces that question:
How effective is the work we’re doing?
Testing is no exception, especially when testing’s treated as a separate team (but that’s a whole other topic). Rich and I have been talking about this on the podcast, and I’ve been thinking about it since then. And I’ll be honest, one of the big reasons is self-preservation.
In 2025, it feels risky not to have evidence of our effectiveness.
And that’s where DORA comes in.
DORA Metrics to the rescue
DORA metrics are a great starting point.
There are four of them, and they come in two flavours: Throughput and Stability. From the DORA article “DORA’s software delivery metrics: the four keys”:
Throughput
Throughput measures the velocity of changes that are being made. DORA assesses throughput using the following metrics:
Change lead time - This metric measures the time it takes for a code commit or change to be successfully deployed to production. It reflects the efficiency of your software delivery process.
Deployment frequency - This metric measures how often application changes are deployed to production. Higher deployment frequency indicates a more efficient and responsive delivery process.
Stability
Stability measures the quality of the changes delivered and the team’s ability to repair failures. DORA assesses stability using the following metrics:
Change fail percentage - This metric measures the percentage of deployments that cause failures in production, requiring hotfixes or rollbacks. A lower change failure rate indicates a more reliable delivery process.
Failed deployment recovery time - This metric measures the time it takes to recover from a failed deployment. A lower recovery time indicates a more resilient and responsive system.
Thanks to the work of Dr Nicole Forsgren et al, we have ways to help us assess delivery performance.
But “delivery performance” is only part of the story.
The Gaps We Don’t Talk About
We Need To Talk About DORA
Delivery performance tells us how good we are at doing engineering “stuff”2.
But what about the impact of doing all that “stuff”? Is it possible that we’re really good at building things that the market doesn’t want? Or that the market wants, but we can’t deliver profitably?
Sadly, it is. I’ve been on a couple of those teams :(
The DORA metrics have another challenge too.
Consider change fail percentage. It tells us the percentage of deployments that cause a failure in prod. And when do we find out about that?
AFTER we’ve had a failure in prod 😬
Yikes!
Don’t Look So Smug Testing Metrics!
Testing metrics suffer from the same problems.
Number of test cases executed. The classic! Sure, we measure it before release, but what does it tell us about impact? Nothing. Yikes.
Defect leakage (defects found after release). Measured after the event. Plus, were these defects things people care about? No idea. More yikes!
Average test suite execution time. On its own, I have no idea what makes this important. Moar yikes!
It seems like there’s a pattern here.
Also, that’s a lot of yikes!
We need a way to see what’s coming before the event, instead of assessing things after it.
Measure The System, Not The Stuff
When I started my testing career, I had one thing, and one thing only on my mind.
Find bugs. As many as possible. As quickly as possible. That was the measure of success back then: a long list of bugs sent to the dev team. These days, I’m a manager, and success is measured very differently.
It’s less about doing “engineering stuff” and more about managing the system of work it happens within. And understanding the impact of that work. That shift was hard to make at first. Success stopped being about what I produced and started being about what the system produced.
To borrow from Dan Sullivan, it’s less “how do I fix this?” and more “who or what can make this better?”.3
We need to create an environment where the most likely outcome is the thing you want. If we over-focus on testing activity, we lose sight of the outcomes that matter. So maybe it’s time to look at what signals can tell us how the system is working, before the results arrive?
That means my next job is to connect the dots between the system I manage, the signals it gives off, and the goals the business actually cares about.
Talk About Impact, Not Activity
In order to pull this off, I’ll need to do three things:
Gather intel on what the company cares about and what its goals are
…and the goals of the people charged with achieving those goals4
…and how those things are measured
Find some leading indicators to complement those things (which are probably lagging indicators)
Start measuring and mapping how testing impacts those metrics, so the story we tell about quality connects directly to what the business values.
Where can I find intel?
Companies throw off TONS of information!
the markets (if they’re publicly traded)
the news (if they have launches about to drop)
shareholders (probably the same thing I wrote earlier)
investors (where they talk about the cool wins the company has had, the kinds of clients they have, the impact they’ve had for those customers, the TAM, ARR, what features are driving those things, etc)
potential buyers (same as above)
So put on your favourite Sherlock Holmes Deer Stalker and start digging!
What might some leading indicators be?
The easiest thing to do is ask!
This one is simple.
If you’re lucky enough to line into a C-Level exec, just ask them! Depending on your relationship, they’ll almost certainly tell you. Especially if you frame it in words like:
What problem do you wish you could stop worrying about?
Or something like that. That could open the door to an insightful conversation about what they’re dealing with. Another way you could ask is:
What would you like to look back on at the end of the quarter/half-year/year and say “Yep, we crushed that!”?
Please adjust based on your personality, communication style, and relationship though, ok!
And what about you folks that don’t line into them? You’ll have to do a bit more digging, but it’s essentially the same question posed to your manager, and their manager, and so on.
Find the Growth Metric
Your Exec team are being measured against certain numbers and targets.
My first thought was “Clearly we mean money right?”, but I think we can do better! It’s probably based on some kind of Growth Metric. According to the book Leading Quality, there are three flavours.
Attention-based metrics show whether users are engaged, e.g. things like daily active users, time spent in-app, or retention.
Transaction-based metrics capture value exchange, e.g. purchases, bookings, or upgrades.
Productivity-based metrics measure efficiency, e.g. how quickly or effectively customers can get their jobs done.
If we’ve done a good job investigating, we have a handful of things to anchor our testing and quality initiatives around.
But how?
How do we use that information to improve our quality narrative?
I suspect this will boil down to mapping our initiatives to the metrics, goals, or problems we just identified. Then whenever we talk about our work, we always frame it in terms of something they actually care about.
We did X, which helps Y
I think it will look like this:
Describe how our work alleviates the problem your C-Level is dealing with
Show where our work helps the goal they shared with you
Connect any initiatives we’re working on with the growth metric
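The three steps above are really just a mapping exercise. A toy sketch of what that mapping could look like in practice; the initiative names and goals here are invented for illustration:

```python
# Hypothetical mapping from quality initiatives to the goals and growth
# metrics uncovered during the intel-gathering step. Every name below is
# made up for illustration.
INITIATIVES = [
    {"work": "Added contract tests to the checkout API",
     "helps": "transaction-based growth metric (completed purchases)"},
    {"work": "Cut the flaky-test rate in the release pipeline",
     "helps": "exec goal of shipping the next launch without a rollback"},
]

def quality_narrative(initiatives: list[dict]) -> list[str]:
    """Frame each piece of work as 'We did X, which helps Y'."""
    return [f"We did {i['work']}, which helps the {i['helps']}."
            for i in initiatives]

for line in quality_narrative(INITIATIVES):
    print(line)
```

The value isn’t in the code, it’s in the discipline: if a piece of work can’t be phrased as “We did X, which helps Y”, that’s a prompt to go find the Y (or question the X).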
Another important step is discovering or creating complementary leading indicators for the lagging indicators.
A useful prompt to get the brain working might look like this:
What is it that causes [INSERT LAGGING INDICATOR] to change?
What else is going on when [INSERT LAGGING INDICATOR] changes?
And how could we measure that?
Even if I manage to do only a little bit of this, we can position ourselves as people who contribute to the growth of the company rather than as a cost centre.
A quick example
Let’s say we have a team experiencing high turnover, or one with new members still getting up to speed. What impact would we predict this has on the risk of deployment failures?
Or if a release includes a large portion of expedited work that skipped standard processes, would it be more or less prone to issues?
Or if a team improves its testability and release process, how would that impact deployment recovery times if/when failures do happen?
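Signals like these could be rolled up into a simple leading-indicator score for an upcoming release. This is a toy sketch, not an established model: the signal names and weights are assumptions you would calibrate against your own deployment history.

```python
# Hypothetical leading-indicator score for an upcoming release.
# Each signal is a 0..1 value, and the weights (invented for illustration)
# reflect how strongly we believe it predicts deployment failures.
RISK_WEIGHTS = {
    "new_team_members_ratio": 0.4,  # share of the team still ramping up
    "expedited_work_ratio": 0.4,    # share of changes that skipped process
    "untested_change_ratio": 0.2,   # share of changes without test coverage
}

def release_risk(signals: dict[str, float]) -> float:
    """Weighted 0..1 score; higher means watch this release more closely."""
    return sum(RISK_WEIGHTS[name] * value for name, value in signals.items())

risky = release_risk({
    "new_team_members_ratio": 0.5,
    "expedited_work_ratio": 0.75,
    "untested_change_ratio": 0.25,
})
# 0.4*0.5 + 0.4*0.75 + 0.2*0.25 ≈ 0.55
```

The exact numbers matter less than the direction of travel: a score that climbs release over release is a signal you get *before* the change fail percentage moves.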
Wrapping up
This is all about becoming a meaning-maker.
Starting with good industry practices like DORA is a wise thing to do. There are general trends we should be striving for with these delivery performance style metrics. But they won’t tell us the whole story.
Metrics such as change failure percentage and customer-reported issues only make us wise after the event. We also need leading indicators that complement these after-the-fact, activity-based measurements.
We may need to do some digging to understand the goals of the business and stakeholders first. But once we have, we can start tying activities to business impact and start providing meaning to potentially uninspiring measurements that hide the real value of the work we’re doing.
What do you think?
This seems like a solid plan to me.
Is this something you’re already doing? If so, I’d love to hear about it.
If you’re solving this another way, I’d love to hear about that too.
And I definitely want to hear from you if you think I’m way off and should try a different approach!
And by “People”, I primarily mean stakeholders, people who rely on the findings of the testing. I also mean teammates, peers, direct reports, and futureYou who would like to know what happened when pastYou last looked at this feature!
Engineering “stuff” includes (but is not limited to)… Refining, designing, building, testing, releasing, monitoring, observing, supporting, investigating, triaging, etc, your product or service
“Who Not How”, a book by Dan Sullivan
Aka the exec team


