Javlon Baxtiyorov

When the Government Runs ChatGPT on Medicaid Audits

HHS reportedly used ChatGPT to scan all 50 states' Medicaid audits, hunting for an estimated $100B to $200B in fraud. As an engineer, I find that headline raises more questions than it answers.


According to reporting by AI News Today, the Department of Health and Human Services used ChatGPT to scan all 50 states' Medicaid audits, looking for an estimated $100 billion to $200 billion in fraud. As a headline it is striking: a general-purpose chatbot turned loose on one of the largest, most sensitive datasets the federal government holds. As an engineer who has built async pipelines over regulated data, my reaction is not awe. It is a list of questions I would need answered before I trusted a single number that came out the other end.

The interesting part is the data plumbing, not the model

The phrase "used ChatGPT to scan all 50 state audits" hides every hard decision. The model is the easy, visible piece. The hard, invisible pieces are how the data got to the model and what happened to the results. Medicaid audit data is exactly the category of regulated, identifiable information you do not move casually.

The questions I would insist on answering:

  • Where did the data run? Scanning 50 states' worth of Medicaid audits very likely means touching protected health information at scale; the deployment boundary and data-handling terms matter more than the model's cleverness.
  • Detection or decision? There is a world of difference between using AI to flag candidates for human review and using it to conclude that fraud occurred. The first is a triage tool; the second is a liability.
  • Where does the dollar figure come from? An estimate of $100B to $200B is an enormous range: the high end is double the low end. A range that wide usually means the result depends heavily on assumptions the headline does not state.
  • What is the false-positive cost? In fraud detection, flagging an honest provider has real consequences. The model's precision is not an academic metric here; it is people's livelihoods.
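
To make the false-positive question concrete, here is a minimal sketch of the base-rate arithmetic. Every input is a number I made up for illustration (the provider count, the fraud rate, the recall, and the false-positive rate); none of them comes from the reporting, and `flag_outcomes` is a hypothetical helper, not anything HHS is known to use.

```python
def flag_outcomes(n_providers, fraud_rate, recall, false_positive_rate):
    """Return (true_positives, false_positives, precision) for one flagging pass."""
    fraudulent = n_providers * fraud_rate
    honest = n_providers - fraudulent
    true_positives = fraudulent * recall            # fraud the model catches
    false_positives = honest * false_positive_rate  # honest providers flagged anyway
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical inputs, chosen only to show the arithmetic.
tp, fp, precision = flag_outcomes(
    n_providers=100_000,
    fraud_rate=0.02,           # assume 2% of providers commit fraud
    recall=0.90,               # assume the model catches 90% of them
    false_positive_rate=0.05,  # assume it wrongly flags 5% of honest ones
)

print(f"correctly flagged:        {tp:.0f}")
print(f"honest providers flagged: {fp:.0f}")
print(f"precision:                {precision:.1%}")
```

Under those assumed inputs, the model flags 1,800 fraudulent providers and 4,900 honest ones, so only about 27 percent of flags are correct. A model that sounds accurate in isolation can still produce mostly false alarms when the thing it is looking for is rare, which is why the flags need human review before anyone acts on them.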

AI as a triage layer, not an oracle

If I were building this, I would frame the model as exactly one thing: a way to prioritize human attention across a haystack too big to read by hand. That is a legitimate and genuinely useful application. Large language models are good at surfacing patterns and anomalies in unstructured text, and audit documents are full of unstructured text. Used as a first-pass filter that routes suspicious cases to trained investigators, it can make a slow process faster.

What it must not become is the thing that decides. The moment a model's output is treated as a finding rather than a lead, you have outsourced judgment to a system that cannot explain itself, over data where being wrong is expensive in both dollars and trust. The headline-grabbing fraud estimate is only as good as the human verification behind it, and the reporting does not tell me how much of that there was.
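
The lead-versus-finding distinction can be enforced in the data model itself. The sketch below is one way I might structure it, not a description of any real system: `Case`, `TriageQueue`, `record_review`, and the `model_score` field are all hypothetical names. The model's score only decides the order in which a human sees a case, and the status can only leave `LEAD` through a review that names a human reviewer.

```python
import heapq
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    LEAD = "lead"            # flagged by the model, not yet reviewed
    CONFIRMED = "confirmed"  # a human investigator agreed
    DISMISSED = "dismissed"  # a human investigator cleared it


@dataclass
class Case:
    case_id: str
    model_score: float       # hypothetical anomaly score in [0, 1]
    status: Status = Status.LEAD
    reviewer: Optional[str] = None


class TriageQueue:
    """Orders cases for human review by model score, highest first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, case):
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(self._heap, (-case.model_score, next(self._counter), case))

    def next_for_review(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def record_review(case, reviewer, confirmed):
    """Only a named human reviewer can turn a lead into a finding."""
    if not reviewer:
        raise ValueError("a named human reviewer is required")
    case.status = Status.CONFIRMED if confirmed else Status.DISMISSED
    case.reviewer = reviewer
    return case


queue = TriageQueue()
queue.add(Case("A-001", 0.35))
queue.add(Case("A-002", 0.91))
queue.add(Case("A-003", 0.62))

first = queue.next_for_review()
print(first.case_id, first.status.value)  # A-002 lead

record_review(first, reviewer="investigator_17", confirmed=False)
print(first.status.value)                 # dismissed
```

The design choice that matters is that nothing in the model path can write `CONFIRMED`. A high score buys a case an earlier look, and nothing more.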

The take

I am not reflexively against governments using AI on hard problems. Medicaid fraud is real, the dataset is genuinely too large for manual review, and triage is a reasonable job for a model. What makes me cautious is the framing. "We ran ChatGPT over everything and found $100B to $200B" is a sentence engineered for a press release, not for a courtroom or an appeals process. The responsible version of this story has a human in the loop on every consequential decision, a defensible data-handling posture, and a fraud estimate with its methodology attached. Until I see those, I read the number as a hypothesis, not a result. The technology is plausible. It is the rigor around it that I would want to audit.


Sources: AI News Today, June 6, 2026.

