There is a particular kind of pitch I get more of every month. A vendor at a conference, a LinkedIn DM, a system-wide email about a transformative new platform. The product is always elegant. The dashboard is always beautiful. The pilot data is always promising. And almost none of it has been validated against the population of patients I actually take care of in Duluth.
I don’t say this as a Luddite. I say it as someone who has watched well-meaning hospital systems deploy clinical-decision tools that quietly worsened care for the patients they were supposed to help. The Epic Sepsis Model, deployed at hundreds of hospitals before an external validation in 2021 put its sensitivity at around 33%. Risk calculators built on cohorts that don’t look like rural Minnesota. Sleep-apnea screening tools that flagged demographic groups they were never trained on.
The pattern is consistent: a product looks impressive in its training environment, gets purchased on the strength of a slide deck, and meets a real population whose data it was not designed for. The patient at the bedside does not know the model was trained at an academic medical center in California. They only know that something happened — or didn’t — because of a number on a screen.
So my rule is simple. Before I rely on any clinical tool — AI or otherwise — I ask three questions:
- Was it validated on a population that resembles mine?
- Has the validation been published, peer-reviewed, and replicated?
- Who owns the workflow when the model is wrong?
If a vendor cannot answer all three plainly, the answer is not yet. Not no — not necessarily — but not yet, and not in the room with my patient.
There is nothing courageous about waiting for evidence. It is the baseline of how medicine has always advanced. We do not adopt a new chemotherapy because the molecule is interesting. We adopt it because a randomized trial showed it works, and then we watch the post-market data closely. The bar for software in the clinical encounter should be no lower.
The argument I sometimes hear is that AI moves too fast for traditional validation — that by the time a study is published, the model has been retrained twice. I have sympathy for the engineering reality there. But the answer is not to lower the bar; it is to design new validation methods that match the cadence of the technology. Continuous evaluation, transparent versioning, real-world performance dashboards, model cards in the chart. None of this is exotic. It just requires that the people building the tools take the responsibility seriously.
What I refuse to do is accept the trade where my patient becomes part of someone’s silent A/B test.
There is a quieter cost to early adoption that vendors do not mention: trust. Patients are not stupid. They notice when the screen knows more than the doctor. They notice when the doctor stops looking at them. They notice when the medication recommendation has a logo on it. Every time a clinical tool fails them — even invisibly — it costs us a little bit of the relationship that the entire practice rests on.
Validated tools, used carefully, can return that trust with interest. An imaging algorithm that catches a small bleed I might have missed is a good thing. A risk score that sharpens an already-considered decision is a good thing. A documentation assistant that lets me look up at the patient instead of down at the keyboard could be a very good thing — when it works, when it’s been validated, when its limits are documented.
That last one, ambient AI scribes, is where my line is currently drawn. I’ll write more about that next.
Jeremy Tabernero, MD