A retrieval-augmented generation pipeline will happily tell you that "Article 47 of NIS2 requires a 48-hour breach notification." It will format the citation cleanly. It will sound certain. And it can be wrong in three independent ways at once: the article number may not exist, the obligation may live in a different instrument, and the 48-hour figure may be a number the model recalled rather than retrieved. None of those errors leaves a trace in the output. The citation looks identical whether it is grounded or fabricated.
For internal Q&A that a human double-checks, that is an acceptable failure mode. For a gap analysis you hand to an auditor, a DPIA that goes in the file, or a tender response that a public buyer scores, it is disqualifying. The problem is not that the model is bad at retrieval. The problem is that RAG over a document dump has no mechanism to tell a grounded citation from a decorative one — and regulated work is exactly the case where that distinction is the whole job.
This post is about why that gap is structural, not a tuning problem, and what we built instead.
What RAG actually guarantees, and what it doesn't
Strip a RAG pipeline to its mechanism. You embed a corpus into vectors, embed the query, return the k nearest chunks by cosine similarity, and paste them into the model's context with an instruction to answer using the provided passages. The model writes prose. Somewhere in that prose it produces citation strings.
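As a minimal sketch of that mechanism, assuming toy three-dimensional vectors in place of a real embedding model, the retrieval and prompt-assembly steps look roughly like this. The names `cosine`, `top_k`, and `build_prompt` are illustrative, not part of any particular library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the k chunks whose vectors are closest to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Paste the retrieved chunks into the context with an instruction."""
    passages = "\n".join(f"- {c['text']}" for c in retrieved)
    return f"Answer using only the provided passages.\n{passages}\n\nQuestion: {question}"

# Toy "embeddings" (assumption): a real pipeline would call an embedding model.
chunks = [
    {"text": "chunk A", "vec": [1.0, 0.0, 0.0]},
    {"text": "chunk B", "vec": [0.8, 0.6, 0.0]},
    {"text": "chunk C", "vec": [0.0, 0.0, 1.0]},
]
retrieved = top_k([1.0, 0.1, 0.0], chunks, k=2)
prompt = build_prompt("What is the notification deadline?", retrieved)
```

Nothing in this loop knows what a provision is. The ranking is purely geometric, and whatever the model writes after reading `prompt` is ordinary generated text.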
Three properties follow from that mechanism, and all three matter for compliance.
Retrieval is probabilistic, not exhaustive. Nearest-neighbour search returns the chunks that are semantically closest to your query phrasing — not the provisions that govern the question. If the controlling article uses different vocabulary than your query, it can rank below a chunk that merely sounds relevant. There is no completeness guarantee: a RAG retriever cannot tell you "these are all the provisions that apply," only "these were the closest vectors." For a question like "which articles impose breach-notification timelines on us," missing one is the failure you most need to avoid, and the architecture cannot detect the miss.
The citation is text, not a reference. When the generator writes "Article 30 GDPR," that string is generated the same way every other token is generated — by predicting what comes next. Nothing in a standard RAG pipeline checks that the cited article corresponds to a chunk that was actually retrieved, or that the article exists, or that it says what the surrounding sentence claims. The citation is decorative: it decorates the answer with the appearance of grounding without the fact of it. This is the mechanism behind the hallucinated-citation problem that has put more than one lawyer in front of a judge explaining a brief full of invented cases.
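To make the missing check concrete, here is a rough sketch of the weakest form it could take: pull citation strings out of the generated answer and compare them with the identifiers of the chunks that were actually retrieved. The regular expression, the `(law, article)` tuple format, and the function names are assumptions for illustration. Even when this check passes, it says nothing about whether the article exists or supports the sentence around it.

```python
import re

# Matches strings such as "Article 30 GDPR" or "Article 47 of NIS2" (illustrative pattern only).
CITATION_RE = re.compile(r"Article\s+(\d+)\s+(?:of\s+)?([A-Z][A-Za-z0-9]+)")

def cited_provisions(answer):
    """Extract (law, article_number) pairs from generated prose."""
    return {(law, int(num)) for num, law in CITATION_RE.findall(answer)}

def ungrounded_citations(answer, retrieved_ids):
    """Citations in the answer that match no retrieved chunk identifier."""
    return cited_provisions(answer) - set(retrieved_ids)

answer = (
    "Article 30 GDPR requires records of processing. "
    "Article 47 of NIS2 requires a 48-hour breach notification."
)
retrieved_ids = {("GDPR", 30)}  # hypothetical identifiers attached to the retrieved chunks
flagged = ungrounded_citations(answer, retrieved_ids)
# flagged == {("NIS2", 47)}
```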
There is no provenance contract. A provenance contract would let you answer, for any claim in the output: did this come from the corpus or from the model's training weights, and can I re-validate it against the live source today? RAG cannot answer either question. The retrieved chunks and the model's parametric memory are blended in the same context window, and the output is a single stream of tokens with no per-claim lineage. You cannot diff it, you cannot re-validate it, and you cannot prove to a third party that a given sentence is grounded.
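One way to picture per-claim lineage is as a data structure in which every claim carries either a stable source identifier or an explicit absence of one. This is a sketch under assumed names (`Claim`, `source_id`, `revalidate`) and a made-up identifier format, not a description of any existing system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    text: str
    # Stable identifier of the provision behind the claim, or None when the
    # claim has no corpus source (for example, it came from model memory).
    source_id: Optional[str]

def is_grounded(claim):
    """A claim is grounded only if it names a corpus source."""
    return claim.source_id is not None

def revalidate(claim, live_corpus_ids):
    """True if the claim's source still resolves in the live corpus."""
    return claim.source_id is not None and claim.source_id in live_corpus_ids

claims = [
    Claim("Records of processing must be kept.", "example-act/art-5"),
    Claim("Notification is due within 48 hours.", None),
]
live_corpus_ids = {"example-act/art-5"}  # hypothetical identifiers
report = [(c.text, is_grounded(c), revalidate(c, live_corpus_ids)) for c in claims]
```

A single stream of generated tokens carries no equivalent of `source_id`, which is why neither question can be answered after the fact.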
Where RAG is genuinely fine
We are not arguing that RAG is bad technology. It is the right tool for a large class of problems, and pretending otherwise would be dishonest.
If you are searching your own contracts to find the three that mention a specific indemnity clause, RAG is excellent — a human reads the three hits and judges them. If you are building internal Q&A over your policy library, where the cost of a wrong answer is an employee asking a follow-up, RAG is fine. If you are drafting a first-pass memo and the model surfaces candidate passages for a lawyer to verify and rewrite, RAG earns its keep. In all three, a human stands between the retrieval and the consequence, and the citation is a starting point, not a fact on the record.
The dividing line is simple: does anyone downstream treat the model's citation as true without checking it? If the answer is no — if a human verifies every reference before it carries weight — RAG's probabilistic retrieval is a productivity tool and its decorative citations are harmless scaffolding. If the answer is yes — if the output is the deliverable — you have moved into territory where the architecture has to change.
Regulated compliance work lives almost entirely on the far side of that line, where the output is the deliverable. The point of a cited gap analysis is that the reader doesn't re-derive every article. The citation has to be load-bearing.
The alternative: typed corpus tools instead of an embedded dump
We took a different architecture. There is no document dump and no embedding index standing in for the law. Instead, the customer's AI agent — Claude, Copilot in VS Code or Studio, Cursor, any MCP client — calls typed tools through the Ansvar gateway, and those tools return structured provisions with stable identifiers.
The two that matter most are search and get_provision. search takes a scoped query — a jurisdiction, a framework, a sector — and returns provisions, not chunks. get_provision takes a law and an article number and returns the exact text of that provision with its citation metadata. Behind them sit the audited law corpora for 27 jurisdictions (119 built), an EU regulations corpus of 98 instruments and 5,008 provisions, and 262 security frameworks — see /coverage for the live inventory.
The difference from RAG is not "better retrieval." It is a different contract:
- A provision comes back as a structured object with a stable identifier, not a free-text chunk. The agent knows it is holding "GDPR Article 30," not "a passage that scored 0.83 on cosine similarity."
- The corpus is the law as published, segmented at the provision level, not your documents re-chunked by a splitter that knows nothing about legal structure. A sub-article does not get cut in half because it crossed a 512-token boundary.
- When the agent asks for an article that does not exist, the tool says so. It does not return the nearest neighbour and let the model narrate around it.
The model still does the reasoning and the writing. What it no longer does is invent the source material. The law arrives as data, with an identifier, from a tool call that either succeeded or failed.
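The shape of that contract can be sketched in a few lines. Everything here is an assumption for illustration: the `Provision` fields, the identifier format, the `ProvisionNotFound` exception, and the in-memory lookup table stand in for the real tool schema, which this post does not spell out. Only the tool name `get_provision` and its two inputs come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provision:
    provision_id: str  # stable identifier (format is hypothetical)
    law: str
    article: str
    text: str

class ProvisionNotFound(LookupError):
    """Raised when the requested article is not in the corpus."""

# Stand-in corpus with placeholder text; a real corpus holds the law as published.
_CORPUS = {
    ("GDPR", "30"): Provision("gdpr:art-30", "GDPR", "30", "[text of Article 30 as published]"),
}

def get_provision(law, article):
    """Exact lookup: return the provision or fail, never a nearest neighbour."""
    try:
        return _CORPUS[(law, article)]
    except KeyError:
        raise ProvisionNotFound(f"{law} has no Article {article} in this corpus") from None

provision = get_provision("GDPR", "30")
```

The design choice that matters is the failure branch: a request for an article that is not in the corpus raises, so the agent receives an explicit miss rather than a plausible passage to narrate around.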
Citation validation: the check RAG never runs
Returning structured provisions removes one class of error. It does not, by itself, stop a model from writing a citation that drifts from what it retrieved. So we run a second check that RAG pipelines have no equivalent for: every citation is validated against the live corpus before it ships.
The gateway exposes this as validate_citation. Give it a jurisdiction, a law, and an article, and it confirms the provision exists and returns the current text — including whether the article has been amended since it was last cited. A citation that fails validation does not get downgraded to a warning footnote. It gets caught.
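A hedged sketch of what such a check computes, using a fictional "Example Act" and an in-memory table in place of the live corpus. The `ValidationResult` fields, the `cited_on` parameter, and the amendment-date comparison are assumptions made for this example. The tool's actual request and response shapes are not documented here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ValidationResult:
    exists: bool
    current_text: Optional[str]
    amended_since_cited: Optional[bool]

# (jurisdiction, law, article) -> (current text, date of last amendment); fictional data.
_LIVE_CORPUS = {
    ("XX", "Example Act", "5"): ("[current text of Article 5]", date(2024, 1, 1)),
}

def validate_citation(jurisdiction, law, article, cited_on):
    """Confirm the provision exists and report whether it changed after cited_on."""
    entry = _LIVE_CORPUS.get((jurisdiction, law, article))
    if entry is None:
        return ValidationResult(False, None, None)
    text, last_amended = entry
    return ValidationResult(True, text, last_amended > cited_on)

stale = validate_citation("XX", "Example Act", "5", cited_on=date(2023, 6, 1))
fresh = validate_citation("XX", "Example Act", "5", cited_on=date(2024, 6, 1))
missing = validate_citation("XX", "Example Act", "99", cited_on=date(2024, 6, 1))
```

In this sketch `stale.amended_since_cited` is true, `fresh.amended_since_cited` is false, and `missing.exists` is false, which are the three outcomes a report pipeline would need to tell apart.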
For the workflows that produce citation-heavy deliverables — gap analysis, threat modelling, DPIA, tender review — this runs as discipline, not decoration. Every regulatory claim is grounded against a real provision, and the citation that lands in the report is one that re-validates. You, or your auditor, can run the same lookup a year later and get the same provision back, or a clear signal that it changed. That reproducibility is the property RAG cannot offer, because there is nothing stable to re-look-up: the next run embeds, retrieves, and generates afresh, and the citation string is regenerated from scratch each time.
This is also why the foundational obligations in EU regulation map cleanly onto this architecture. The GDPR's accountability principle in Article 5(2), under which the controller must be able to demonstrate compliance, and its records-of-processing obligation under Article 30 both presuppose that you can produce the provision behind a claim on demand. DORA's ICT third-party risk regime, including the register of information required under Article 28(3) and the key contractual provisions in Article 30 that govern ICT outsourcing, presupposes the same. A pipeline that cannot tell you whether its own citation is real is structurally unable to support "demonstrate."
Refusal discipline: a wrong answer is worse than no answer
The last piece is the one most product teams resist, because it makes the demo look weaker. When a claim cannot be grounded, the right output is not a confident guess. It is a refusal.
This is a hard rule on our platform: no silent fallbacks. If a corpus tool is unavailable, the answer is a data-source-unavailable error — never a fluent paragraph reconstructed from the model's training memory. If a requirement produces no grounded citation after the workflow has exhausted its enrichment passes, the requirement is marked regulatory_basis_unresolved rather than fitted with a plausible-looking article number. The model is not permitted to paper over a gap with prose.
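The rule can be illustrated with a small sketch. The status string `regulatory_basis_unresolved` and the idea of a data-source-unavailable error come from the paragraph above. The exception class, the `ground_requirement` function, and the representation of enrichment passes as a list of lookup callables are assumptions for illustration.

```python
class DataSourceUnavailable(RuntimeError):
    """A corpus tool could not be reached."""

UNRESOLVED = "regulatory_basis_unresolved"

def ground_requirement(requirement, lookups):
    """Try each lookup pass in order; never substitute a guess for a citation."""
    for lookup in lookups:
        # DataSourceUnavailable is deliberately not caught here: an outage
        # must surface as an error, not degrade into generated prose.
        citation = lookup(requirement)
        if citation is not None:
            return {"requirement": requirement, "status": "grounded", "citation": citation}
    # Every pass came back empty: record the gap explicitly.
    return {"requirement": requirement, "status": UNRESOLVED, "citation": None}

def empty_pass(requirement):
    return None

def hit_pass(requirement):
    return "example-act/art-5"  # hypothetical identifier

def offline_pass(requirement):
    raise DataSourceUnavailable("corpus offline")

resolved = ground_requirement("Keep processing records", [empty_pass, hit_pass])
unresolved = ground_requirement("Keep processing records", [empty_pass, empty_pass])
```

There is intentionally no third branch that asks the model for its best recollection. Passing `offline_pass` would raise `DataSourceUnavailable` to the caller instead of returning a result.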
RAG's instinct is the opposite. Faced with weak retrieval, the generator's job is to produce something — and it will, because that is what a language model does with an underspecified context. The fluency that makes RAG demo well is exactly what makes it dangerous for regulated work: it never tells you when it is guessing. A compliance officer reading a clean paragraph cannot see that the retriever returned nothing useful and the model improvised.
We chose the inverse default. A wrong answer in a compliance deliverable does not just waste time — it creates a false record, and false records are the thing audit regimes exist to prevent. So we would rather the tool say "I could not ground this" and force a human to look, than hand over a confident sentence that nobody knows to question. Availability is not the goal; correctness is.
Concretely, what changes for your team
If you run regulatory work through a RAG-over-documents pipeline today, the honest move is not to rip it out. It is to split the work along the line that matters.
Keep RAG for what it does well: searching your own corpus — contracts, policies, prior memos, internal guidance — where a human reviews the hits. That is the discovery and drafting layer, and a vector index is a fine engine for it.
Route the regulatory grounding through the gateway. When the question is "what does the law actually require," the answer comes back as a validated, cited provision from search and get_provision, not a retrieved guess. Your agent assembles the gap analysis, the threat model, or the AI Act readiness assessment; the gateway supplies the law with a citation that survives scrutiny. For the EU AI Act specifically, where the obligation that attaches to a system depends on which risk tier it falls in, getting the classification grounded against the actual text — rather than a model's recollection of the risk categories — is the difference between a defensible assessment and a confident one.
The architecture is deliberately boring at the seam. The gateway speaks MCP over OAuth 2.1, and the whole thing is EU-hosted with no server-side model holding your data. You bring your own agent; we supply grounded law. Quickstart is at /docs/quickstart, and the tier matrix — free at 100 searches a day, Premium at €249 a month — is on /pricing.
The summary is one sentence. RAG retrieves the nearest chunk and writes a citation as text; we return the governing provision and validate the citation before it ships — and when we cannot, we say so instead of guessing. For regulated work, that last clause is the product.