Why most 'AI for compliance' demos fall apart on contact with reality
Slick demos pass review meetings and lose money. The four predictable failure modes of compliance AI tools — and what the rare good ones do differently.
The demo always works. The salesperson types a question into the chat interface — "summarize the customer due diligence obligations under the Bank Secrecy Act" — and the model produces a fluent, well-organized, three-paragraph response that cites the statute. Heads nod. The chief compliance officer looks at the CFO. The CFO looks at procurement. There's a sense in the room that the future has arrived.
Then a pilot starts. Two weeks in, the actual compliance team is filing complaints with the project manager. The tool has cited statutes that don't exist. It has confidently answered a state-law question with federal-law guidance. It has summarized a SAR filing requirement and missed the timing rules. The procurement team is now in an uncomfortable position with the vendor. The chief compliance officer has stopped using the tool entirely and gone back to the binder.
I have watched some version of this happen at three different firms. The pattern is so consistent it deserves a name. I'll call it the demo-reality gap, and the goal of this piece is to explain where it comes from, why it persists, and what the rare good compliance AI tools do differently.
Where the gap comes from
The fundamental problem is that compliance work, despite looking like an information-retrieval problem, is not one.
Surface-level compliance questions have surface-level answers. "What is the federal definition of a beneficial owner under the Corporate Transparency Act?" is a question a model can answer, because the answer is approximately stable, approximately well-documented, and approximately the same in any law-firm associate's outline.
But that is not what compliance officers do all day. What they do all day is take a specific fact pattern in front of them and ask: given everything I know about this client, this account, this transaction history, this regulator's recent guidance, and this specific firm's policy, does this trigger a reporting obligation, and if so, which one, and on what timeline, and what's my paper trail?
The model can recite the rule. It cannot apply the rule. Application depends on context that is almost never in the prompt window, and the gap shows up as four predictable failure modes.
Failure mode 1: Citing law that doesn't exist or has been superseded
This is the most common failure and the easiest to detect, which makes it the source of most early-pilot complaints. The model will confidently cite "31 CFR 1020.220" and describe what it requires. The citation is real but the description has drifted from the actual regulation — sometimes meaningfully, sometimes catastrophically.
The newer the rule, the worse this gets. The SECURE 2.0 Act, passed in late 2022 with rules that have been clarified and re-clarified through 2024 and 2025, is a particularly painful example. Models trained on documents from before the relevant guidance dropped will give you confident answers about distribution rules that no longer apply. The fluency of the wrong answer is what makes it dangerous.
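One way to surface this risk mechanically is to compare the date each cited rule last changed with the model's training cutoff. The sketch below assumes the firm maintains some registry of rules and amendment dates; the `RuleRecord` type, the rule identifiers, and every date are hypothetical values made up for illustration, not real regulatory data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RuleRecord:
    """Hypothetical registry entry for a rule the tool may cite."""
    citation: str
    last_amended: date


def stale_citations(cited, registry, training_cutoff):
    """Flag citations that are missing from the registry or whose rule
    was amended after the model's training cutoff."""
    flagged = []
    for citation in cited:
        record = registry.get(citation)
        if record is None:
            flagged.append((citation, "not in registry"))
        elif record.last_amended > training_cutoff:
            flagged.append((citation, "amended after training cutoff"))
    return flagged


# Illustrative values only: identifiers and dates are invented.
registry = {
    "RULE-A": RuleRecord("RULE-A", date(2021, 5, 1)),
    "RULE-B": RuleRecord("RULE-B", date(2024, 9, 15)),
}
flags = stale_citations(
    ["RULE-A", "RULE-B", "RULE-C"], registry, date(2023, 12, 31)
)
for citation, reason in flags:
    print(f"{citation}: {reason}")
```

A flag here does not mean the answer is wrong. It means a human should read the current text before relying on the model's description of it.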
The same failure happens with regulator-specific interpretations. The OCC, the FRB, and the state banking regulators do not interpret federal rules identically. A model that has read all three sets of guidance will blend them into a single response that resembles none of them.
Failure mode 2: Not understanding jurisdictional variance
Most compliance work is both federal and state. A model that gives you a clean federal answer to a question with state-law dimensions has just produced something worse than nothing. It has produced confident misdirection.
I worked on a project last year involving an entity registered in Georgia, with operations in three states and an account at a federally regulated institution. The first AI tool we tested confidently told us the entity's CTA reporting obligations. The answer was federally correct, state-blind, and entirely missed the relevant Georgia disclosure requirements layered on top.
A junior associate at a law firm would have known to ask. The tool didn't.
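The check the associate performs in their head can be approximated as a set difference: which jurisdictions appear in the fact pattern but not in the answer? This is a minimal sketch that assumes both the facts and the tool's answer can be tagged with jurisdiction labels, which is itself a nontrivial step; the labels `US-FED` and `US-GA` and the function name are hypothetical.

```python
def jurisdiction_gaps(fact_pattern_jurisdictions, answer_jurisdictions):
    """Return jurisdictions in the fact pattern that the answer never addressed."""
    return sorted(set(fact_pattern_jurisdictions) - set(answer_jurisdictions))


# Hypothetical labels: federal obligations plus the state of registration.
facts = {"US-FED", "US-GA"}
answer_covers = {"US-FED"}

gaps = jurisdiction_gaps(facts, answer_covers)
if gaps:
    print("Answer is incomplete; unaddressed jurisdictions:", ", ".join(gaps))
```

The point of the design is that an incomplete answer gets labeled incomplete before anyone reads it as final.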
Failure mode 3: Failing on the documents that compliance actually has
Compliance teams do not work primarily with clean, digital, well-indexed source material. They work with:
- Scanned PDFs of older paper records
- Handwritten margin notes on policy documents
- Excel files that someone built in 2014 and that no one understands anymore
- Audit findings that exist only as printouts of emails
- Older client agreements with provisions that don't translate to current systems
A demo using a freshly-typed text prompt against a model with a clean knowledge base looks impressive. The same model handed an actual compliance team's actual document inventory often produces transcription-quality errors, OCR-induced confusion, and high-confidence summaries of documents it has half-read.
The vendor will tell you "we support OCR and document ingestion." This is true. It is also misleading. There is a difference between a tool that can read scanned documents and a tool that produces reliable compliance work product from scanned documents. The gap is usually a 30-point accuracy difference, and the 30 points are where the legal risk lives.
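Measuring that gap on your own documents is straightforward if you hand-label a small sample. The sketch below compares field-level extraction accuracy on a clean document against a scanned copy of the same document; the field names and values are invented for illustration, and exact-match scoring is an assumption that may be too strict or too lenient for your records.

```python
def field_accuracy(extracted, ground_truth):
    """Fraction of hand-labeled fields the tool extracted exactly right."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1 for key, value in ground_truth.items() if extracted.get(key) == value
    )
    return correct / len(ground_truth)


# Invented sample: one document, labeled by hand, processed two ways.
truth = {"name": "Acme LLC", "state": "GA", "opened": "2014-03-02", "amount": "10500"}
from_clean = {"name": "Acme LLC", "state": "GA", "opened": "2014-03-02", "amount": "10500"}
from_scan = {"name": "Acme LLC", "state": "GA", "opened": "2014-03-02", "amount": "1050O"}

clean_acc = field_accuracy(from_clean, truth)
scan_acc = field_accuracy(from_scan, truth)
gap_points = round((clean_acc - scan_acc) * 100)
print(f"clean={clean_acc:.2f} scanned={scan_acc:.2f} gap={gap_points} points")
```

A real evaluation would run this over many documents and report the spread, not a single pair.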
Failure mode 4: Confident answers where a human would refuse
This is the deepest failure mode and the one that scares me most. Ask a senior compliance officer an ambiguous question and you get back a careful list of considerations, a probability-weighted recommendation, and three references to escalate further. Ask an AI tool the same question and you get a clean, confident, three-paragraph answer.
The cleanness is the bug. Real compliance practice is full of "it depends," "we'd want to ask the regulator," "let me see how the auditor read this last year," "the policy is silent on this case." A tool that flattens this nuance into a confident response has produced output that looks like compliance work and is actually a liability.
The technical name for this is calibration — does the model's confidence in its answer track the actual probability that the answer is right? Almost no compliance AI tool I've evaluated has well-calibrated outputs. Almost all of them present "I'm pretty sure" and "this is settled law" with identical visual weight.
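Calibration can be measured during a pilot if the tool exposes a confidence score and a reviewer marks each answer right or wrong. A common summary statistic is expected calibration error: group answers into confidence bins, then take the size-weighted average gap between stated confidence and observed accuracy. The sketch below is a plain-Python version with equal-width bins; the sample scores are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average gap between stated confidence and observed
    accuracy, computed over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Invented pilot data: the tool claims 95% confidence but is right half the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# A tool that says 50% and is right half the time is well calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(f"overconfident tool: {overconfident:.2f}, calibrated tool: {calibrated:.2f}")
```

A score near zero means stated confidence can be taken roughly at face value. A large score means the tool's "I'm sure" carries little information.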
What the rare good tools do differently
I have seen a small number of compliance AI tools that handle reality reasonably well. They share four properties.
They show their work. Every claim cites a specific source — not "the BSA," but "31 CFR 1020.220(a)(2)(ii), updated by 2023 NPRM XYZ." Every cited source is a click away. If you click and the source doesn't say what the model said it said, you find out immediately. This single feature catches more errors than any other.
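The click-through check can also be partly automated. A minimal sketch, assuming each claim carries a citation and the exact passage it relies on: confirm that the passage actually appears in the cited source. The dictionary keys, the `SOURCE-1` identifier, and the source text are hypothetical placeholders, and verbatim matching will miss legitimate paraphrases.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_claims(claims, sources):
    """Return claims whose quoted support is absent from the cited source.

    claims: list of dicts with hypothetical keys "citation" and "quote".
    sources: mapping from citation to the full text of that source.
    """
    problems = []
    for claim in claims:
        source_text = sources.get(claim["citation"])
        if source_text is None or normalize(claim["quote"]) not in normalize(source_text):
            problems.append(claim)
    return problems


# Placeholder source text, invented for the example.
sources = {"SOURCE-1": "Reports must be filed\nwithin the period the rule specifies."}
claims = [
    {"citation": "SOURCE-1", "quote": "must be filed within the period"},
    {"citation": "SOURCE-1", "quote": "must be filed within ten days"},
    {"citation": "SOURCE-2", "quote": "anything at all"},
]
problems = unsupported_claims(claims, sources)
for claim in problems:
    print("Unsupported:", claim["citation"], "/", claim["quote"])
```

This catches fabricated quotes and citations to sources that were never loaded. It does not catch a real quote used to support the wrong conclusion, which is why the human click-through still matters.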
They refuse on ambiguity. Asked a question the model doesn't have a confident answer for, the good tools say so. "I don't have enough information about this fact pattern's jurisdictional posture to give you a reliable answer. Here's what I would need to refine this: [list]." This is the answer a senior compliance officer gives. It is the answer an AI tool should give. Most don't.
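In code, refusal is a gate that runs before the answer is shown. The sketch below refuses when required facts are missing or when confidence falls below a threshold, and it always says what it would need. The `REQUIRED_FACTS` checklist and the 0.8 threshold are hypothetical choices for illustration; a real tool would derive both from the type of question being asked.

```python
# Hypothetical checklist of facts needed before answering.
REQUIRED_FACTS = ("jurisdiction", "entity_type", "transaction_date")


def answer_or_refuse(facts, confidence, threshold=0.8):
    """Refuse, and say what is missing, instead of answering on thin facts."""
    missing = [name for name in REQUIRED_FACTS if not facts.get(name)]
    if missing:
        return {"status": "refused", "needs": missing}
    if confidence < threshold:
        return {"status": "refused", "needs": ["human review: low confidence"]}
    return {"status": "answered", "needs": []}


# High model confidence does not override a missing fact.
result = answer_or_refuse({"jurisdiction": "US-GA"}, confidence=0.95)
print(result)
```

Checking the facts before the confidence score is deliberate: a model can be very confident about a question it has not been given enough information to answer.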
They have a model of risk, not just law. Compliance is not what's legal; it's what won't get you sued, fined, or examined. The good tools are aware that some technically correct answers are bad answers because of how they'd play in front of a regulator or in litigation discovery. The bad tools optimize for legal correctness and miss that practical correctness is what compliance is actually trying to achieve.
They put the human at the right step. "Human-in-the-loop" is now a marketing phrase, which means the demo has a human approval button somewhere. The good tools put the human at the step where the human's judgment is highest-value — usually the step of applying the rule to the specific facts, not the step of generating the summary. The bad tools have the human approving model output after the fact, which is not review; it is rubber-stamping.
What to ask a vendor
If you are evaluating a compliance AI tool for your firm, three questions cut through most demo theater:
- "Show me the tool's behavior when it doesn't know the answer." If the demo doesn't include this, you're being sold confidence, not capability.
- "What happens when I hand it our own actual scanned document inventory, not your demo set?" Get a real pilot on real documents before signing.
- "Walk me through a case where the tool's first answer was wrong and how the user found out." If the vendor can't tell this story, they haven't seen the tool fail enough to know its failure modes.
The right tool for your firm exists. The chance it's the first vendor in your meeting room is low. The chance you'll find it by trusting a demo over a pilot is zero.