
A Law Firm Just Asked Whether Your AI Legal Research Tool Hallucinates: How to Answer the Reliability and Accuracy Section


The questionnaire arrived from a conflicts-and-technology partner at a 600-attorney firm in Frankfurt. Your legaltech company had been in procurement review for three weeks. Section 4 appeared on page 7:

"Please describe your AI system's hallucination rate, citation accuracy, and the mechanism by which your AI signals uncertainty to users. Provide supporting documentation. Response required within 5 business days."

Eight questions. All in Section 4. And your sales rep had no idea what "hallucination rate" meant in a procurement context.

Here is exactly what Section 4 is asking and how to answer it.


Why Law Firms Ask About Hallucinations

Since 2023, multiple attorneys have been sanctioned or embarrassed after submitting AI-generated briefs that cited cases that did not exist. Firms now treat AI reliability as a due-diligence requirement — not a nice-to-have.

Under the EU AI Act, if your legal research tool assists with legal interpretation, case law analysis, or drafting, it likely qualifies as a high-risk system under Annex III, point 8 ("Administration of justice and democratic processes"). That classification triggers Article 13 (transparency), Article 14 (human oversight), and Article 15 (accuracy and robustness) requirements.

But law firms are not asking these questions because of the EU AI Act. They are asking because one of their partners was on the wrong end of a hallucinated citation. Section 4 exists because the reputational and disciplinary risk is concrete and recent.


What "Hallucination Rate" Means in This Context

Law firms are not asking for a number from a research paper. They are asking: "If an attorney uses your tool and submits what it gives them, what is the probability they cite a nonexistent case?"

This breaks into three distinct sub-questions:

1. Citation existence: Does the case cited actually exist in the relevant jurisdiction's law reports? (A minimal verification sketch follows this list.)

2. Citation accuracy: If the case exists, does the AI correctly represent what the case held? Hallucination in legal research often looks like a real case cited for a proposition the case does not support.

3. Uncertainty signaling: When your system is not confident — a novel legal area, sparse case law, a jurisdiction it has less training data for — does it tell the attorney, or does it generate the same confident-sounding output either way?
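
To make sub-question 1 concrete, here is a minimal sketch of a citation-existence check, assuming a simplified German-style docket format. Everything here is illustrative: the regex, the in-memory `KNOWN_DOCKETS` set, and `check_citation_existence` are stand-ins for a real citation parser and a query against an authoritative legal database.

```python
import re

# Illustrative docket pattern for some German federal courts (e.g. "VI ZR 332/21").
# Real citation grammars vary widely by court and jurisdiction.
DOCKET = re.compile(r"\b(?P<court>BGH|BVerfG|BAG)\b.*?(?P<docket>[IVX]+\s+ZR\s+\d+/\d{2})")

# Stand-in for an authoritative lookup; in production this would query a
# licensed legal database rather than an in-memory set.
KNOWN_DOCKETS = {("BGH", "VI ZR 332/21")}

def check_citation_existence(ai_output: str) -> list[tuple[str, bool]]:
    """Return (citation text, exists?) for every docket-like citation found."""
    findings = []
    for m in DOCKET.finditer(ai_output):
        key = (m.group("court"), " ".join(m.group("docket").split()))
        findings.append((m.group(0), key in KNOWN_DOCKETS))
    return findings

print(check_citation_existence("See BGH, Urteil vom 12.01.2023, VI ZR 332/21."))
# -> [('BGH, Urteil vom 12.01.2023, VI ZR 332/21', True)]
```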


How to Answer Each Sub-Question

"Describe your hallucination rate"

Do not give a percentage without context. A number like "0.3% hallucination rate" is almost meaningless — it depends entirely on the evaluation set, the task type, and how "hallucination" was defined in the test.

What law firms actually want to know:

  • What specific evaluation methodology did you use?
  • What task was tested (case retrieval vs. argument drafting vs. statute lookup)?
  • Who ran the evaluation (internal team, independent auditor, academic partner)?
  • How frequently do you re-run evaluations as your model changes?

A strong answer structure:

"We evaluate citation existence accuracy by running [X task type] queries against a benchmark set of [N] queries drawn from [jurisdiction] case law, verified against [law database, e.g., Juris, EUR-Lex, jurisdiction-specific legal databases]. As of [date], citation existence accuracy on this benchmark is [X]%. We re-run this evaluation on each model update. Independent evaluation was last conducted by [partner or internal team] in [month/year]. Full methodology is in our Technical Documentation, available upon NDA request."

If you have not run a formal evaluation, say so — and say what you use instead (e.g., user feedback loops, attorney review panels, citation-checking post-processing). Law firms can accept imperfect systems. They cannot accept systems where you do not know what you do not know.
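
If you do run a formal evaluation, the harness itself is not complicated. A minimal sketch, where `run_query`, `extract_citations`, and `verify_citation` are hypothetical hooks for your model call, your citation parser, and your database lookup:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    model_version: str
    run_date: str
    total_citations: int = 0
    verified_citations: int = 0

    @property
    def existence_accuracy(self) -> float:
        if self.total_citations == 0:
            return 0.0
        return self.verified_citations / self.total_citations

def evaluate(queries, run_query, extract_citations, verify_citation,
             model_version: str, run_date: str) -> BenchmarkResult:
    """Run every benchmark query and verify each citation the system produces."""
    result = BenchmarkResult(model_version, run_date)
    for query in queries:
        answer = run_query(query)
        for citation in extract_citations(answer):
            result.total_citations += 1
            if verify_citation(citation):
                result.verified_citations += 1
    return result
```

The discipline matters more than the code: version the benchmark set, record the model version and run date alongside each result, and re-run on every model update so the figure you quote in Section 4 is never stale.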

"Describe citation accuracy"

This is distinct from citation existence. The case is real, but your AI said it held X when it held Y.

Answer this by describing your retrieval architecture (RAG vs. fine-tuned vs. hybrid) and your post-generation citation verification layer, if you have one. If you fetch the actual case text and constrain generation to that text, say so explicitly — this is the strongest architecture from a citation accuracy standpoint.

If you use a general-purpose LLM without retrieval grounding, acknowledge it and explain what mitigation you have in place. Hiding this in procurement will not help you. The firm's technology counsel will understand the architecture question. Being clear builds trust; being evasive ends deals.
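
If you do have a post-generation verification layer, the core invariant is simple to state in code: every citation in the draft must map back to a document the retriever actually fetched, and any quoted proposition must appear in that document's text. A minimal sketch, with `extract_refs` as a hypothetical parser yielding (citation key, quoted passage) pairs:

```python
def verify_draft(draft: str, retrieved_docs: dict[str, str], extract_refs) -> list[dict]:
    """Flag ungrounded citations in a generated draft.

    retrieved_docs maps a citation key (e.g. a docket number) to the case
    text the retriever actually fetched.
    """
    findings = []
    for citation, quote in extract_refs(draft):
        source = retrieved_docs.get(citation)
        if source is None:
            # The draft cites a case the retriever never fetched: hard failure.
            findings.append({"citation": citation, "status": "no_retrieved_source"})
        elif quote and quote not in source:
            # The case was retrieved, but the quoted proposition is not in its
            # text: route to attorney review rather than silently passing.
            findings.append({"citation": citation, "status": "quote_not_in_source"})
        else:
            findings.append({"citation": citation, "status": "grounded"})
    return findings
```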

"How does your AI signal uncertainty?"

This is Article 15 and Article 13 in practice. The EU AI Act requires high-risk AI systems to be designed so that their outputs are "sufficiently transparent" and that users can "appropriately interpret the system's output."

Your answer should describe the specific UI/UX mechanism:

  • Does your system include a confidence indicator on each retrieved case?
  • Does it flag jurisdictions or time periods where its training data is sparser?
  • Does it recommend attorney review before citing any output in a submission?
  • Does it differentiate between "I found this case" (retrieval) and "I believe this applies here" (reasoning)?

Law firms want to know that your system is designed to make attorneys more careful, not less careful.
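
As a sketch of what that separation can look like in a response payload (all names and thresholds here are illustrative, not a prescribed design): retrieval confidence and corpus coverage live on each case, reasoning caveats live on the answer, and the review flag is derived from both, so the UI can render retrieval and reasoning differently.

```python
from dataclasses import dataclass, field

# Hypothetical threshold below which a jurisdiction or time period counts as
# sparsely covered in the training or retrieval corpus.
SPARSE_COVERAGE_THRESHOLD = 0.2

@dataclass
class RetrievedCase:
    citation: str
    retrieval_score: float   # confidence that this case matches the query
    corpus_coverage: float   # relative data density for this jurisdiction/period

    @property
    def sparse_coverage(self) -> bool:
        return self.corpus_coverage < SPARSE_COVERAGE_THRESHOLD

@dataclass
class Answer:
    cases: list[RetrievedCase] = field(default_factory=list)
    reasoning: str = ""
    reasoning_caveats: list[str] = field(default_factory=list)  # e.g. "novel legal area"

    def requires_attorney_review(self) -> bool:
        # Flag review whenever any supporting case is weakly retrieved or
        # sparsely covered, or the reasoning itself carries caveats.
        return bool(self.reasoning_caveats) or any(
            c.retrieval_score < 0.5 or c.sparse_coverage for c in self.cases
        )
```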


The Documentation They Will Ask to See

Section 4 will almost always be followed by a documentation request. Expect:

  • Technical overview of your AI architecture (retrieval method, model version, update cadence)
  • Accuracy evaluation report (the most recent one, with methodology)
  • A sample of the uncertainty language your system outputs to users
  • Your process for handling user-reported errors

Complizo generates an Evidence Pack from your AI Feature Registry that includes your architecture summary, accuracy claims with methodology references, and human oversight description. Paste the Section 4 questions in; get answers you can send.


What Not to Say

A few answers that kill legaltech deals in procurement review:

  • "Our AI is trained on the full corpus of legal literature" — this is not an answer to the hallucination question and firms know it.
  • "All outputs should be reviewed by an attorney" — true, but insufficient on its own. The question is whether your system helps attorneys review effectively.
  • "We have a 99.9% accuracy rate" — without a methodology, this is not credible. Firms' technology reviewers have seen enough AI vendor pitches to discount unsupported accuracy claims.

The Underlying Concern

Section 4 exists because the firm is trying to answer one question: "If one of our attorneys uses this tool for client work, what is the probability of a professional conduct issue?"

The attorneys signing off on your vendor approval have personal liability exposure. They are not being paranoid. They are doing their job.

Your job in Section 4 is to help them answer that question clearly — with specifics, not assurances.


Try Complizo free at complizo.com — paste your first AI vendor questionnaire and get draft answers you can actually send.
