An algorithm for when things go wrong

Over a decade of building AI products has taught me this: when things go wrong in production, the same patterns tend to resurface. The blame game begins, lessons go unlearned, and the cost of mistakes compounds.

Here’s my guide to staying effective, and constructive, when everything seems to be going off the rails:

Stage 1: Investigate with an open mind

The first instinct during a crisis is often to find someone to blame. The polite version is “This isn’t my fault.”

I’ve lost count of the times I was convinced, beyond all doubt, that the issue lay elsewhere, only to discover it was mine.

If you’re still investigating, allow for the uncomfortable possibility that the responsibility might lie with you.

Treat others as you’d hope to be treated if the tables were turned.

Stage 2: Own the problem, don’t justify it

No system is perfect. No team is flawless. Sometimes things slip through the cracks, and those cracks are ours.

Across every organization I’ve worked with, once a root cause is found, a lot of energy gets wasted in defending, explaining, or passing the buck. Leaders squabble, juniors get scapegoated, and everyone retreats into self-preservation mode.

This is the worst possible outcome. Even if I’m responsible for just 1%, I raise my hand, because these conversations are not about blame; they’re about ownership.

And the moment someone says, “I’ve got this, I’ll fix it,” the tension breaks.

Never blame juniors in the heat of the moment. There’s time for private debriefs later. Public blame only erodes trust, and with it their willingness to ever take a risk again.

Stage 3: Fix with others, not alone

Pressure distorts judgment. A fix that seems quick can easily introduce a bigger mess.

Bring a few teammates in to help design the fix. Even if you’re leading it, second and third opinions can save you from tunnel vision.

Sometimes the best fix is the simplest: disabling an endpoint, rolling back a version. A collaborative approach helps you anticipate edge cases and side effects.
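As one illustration of "disabling an endpoint," here is a minimal kill-switch sketch in Python. Everything in it is an assumption made for the example: the KillSwitch class, the handle_request function, and the endpoint paths are hypothetical, and the switch lives in memory, whereas a real system would more likely read it from a config service or feature-flag store.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KillSwitch:
    """In-memory set of disabled endpoints (hypothetical; illustration only)."""

    disabled: set[str] = field(default_factory=set)

    def disable(self, endpoint: str) -> None:
        self.disabled.add(endpoint)

    def enable(self, endpoint: str) -> None:
        # discard() is a no-op if the endpoint was never disabled.
        self.disabled.discard(endpoint)

    def is_enabled(self, endpoint: str) -> bool:
        return endpoint not in self.disabled


def handle_request(switch: KillSwitch, endpoint: str) -> tuple[int, str]:
    """Return (status_code, body).

    A disabled endpoint answers 503 without running the faulty code path.
    """
    if not switch.is_enabled(endpoint):
        return 503, "temporarily disabled"
    return 200, f"handled {endpoint}"


switch = KillSwitch()
switch.disable("/v1/recommendations")  # hypothetical failing endpoint
print(handle_request(switch, "/v1/recommendations"))
print(handle_request(switch, "/v1/health"))
```

The appeal of a switch like this is that it is easy to reason about under pressure: it changes one thing and is undone with a single call to enable().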

Write a checklist. Follow it methodically. Clarity is your ally under pressure.

Stage 4: Reflect, document, and share

Once the fire’s out, it’s time to truly understand what went wrong. With the crisis behind you, people are far more open to honest reflection.

Write a thorough postmortem. Capture what failed, why it failed, and how you’ll prevent it in the future. Share it widely.

If we’re paying the price of a mistake, the entire team should benefit from the lesson, not just the person who stumbled.

Stage 5: Make it impossible to repeat

Finally, put safeguards in place. Add automated tests, update deployment checklists, or tweak workflows: whatever it takes to keep the same failure from happening again.
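To make "add automated tests" concrete, here is a sketch of a regression test that pins down a past failure. The incident is invented for illustration: assume a hypothetical normalize_scores function once crashed with ZeroDivisionError when every score was zero. The function name, the fix, and the test names are all assumptions, not taken from any real system.

```python
import unittest


def normalize_scores(scores: list[float]) -> list[float]:
    """Scale scores so they sum to 1.

    Hypothetical stand-in for whatever code caused the incident.
    """
    total = sum(scores)
    if total == 0:
        # Guard added after the (hypothetical) incident: all-zero or empty
        # input used to raise ZeroDivisionError on the division below.
        return [0.0 for _ in scores]
    return [s / total for s in scores]


class NormalizeScoresRegressionTest(unittest.TestCase):
    """Each test encodes one lesson from the (hypothetical) postmortem."""

    def test_all_zero_input_does_not_crash(self):
        self.assertEqual(normalize_scores([0.0, 0.0]), [0.0, 0.0])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(normalize_scores([]), [])

    def test_normal_input_still_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize_scores([1.0, 3.0])), 1.0)
```

Saved as a test module, this runs with python -m unittest. If someone later removes the guard, the first two tests fail before the change reaches production.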

Standards aren’t bureaucracy. They’re the insurance we pay against failure.

Mistakes are opportunities for growth. Approach them with responsibility, honesty, and a shared commitment to learning as a team.