When you use large language models to analyze legal documents, one question inevitably comes up: How do you know the output is actually good?
In most domains, an imperfect answer is an inconvenience. In law, it's a risk. A deposition summary that misses a key admission. A contract review that overlooks a liability carve-out. An evidence analysis that misstates the timeline. These aren't cosmetic issues — they're the kinds of errors that can materially affect strategy, outcomes, and professional responsibility.
At DecoverAI, we needed a rigorous, repeatable, and defensible way to measure how well our LLMs perform on legal document analysis. So we built one.
The Datasets
We used publicly accessible discovery and court documents from Depp v. Heard, obtained through CourtListener: deposition transcripts, trial testimony, motions, exhibits, evidentiary filings, and judicial rulings. The dataset includes a mix of narrative testimony, adversarial questioning, procedural filings, and factual exhibits — exactly the kinds of heterogeneous, high-stakes documents that legal teams routinely analyze under time pressure.
The materials contain known facts, disputed narratives, and documented inconsistencies, making them well-suited for evaluating deeper legal reasoning tasks such as timeline reconstruction, issue spotting, contradiction detection, and attribution.
The Core Idea: LLM-as-a-Judge
Use LLMs to judge LLMs. One model generates findings from a source document. A separate LLM, acting as a judge, evaluates those findings across multiple dimensions of legal quality. Think of it like a senior attorney reviewing a junior associate's work — except the senior attorney is also an LLM, operating against a precise rubric.
This is not a single thumbs-up score. We evaluate across multiple dimensions, each with its own criteria, scoring scale, and weight.
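The judge step can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: `call_llm`, the rubric excerpt, and the JSON response format are all hypothetical stand-ins for whatever client and prompt the real system uses.

```python
import json

# Hypothetical rubric excerpt; the real rubric has explicit criteria per dimension.
RUBRIC = """Score the findings from 1-5 on each dimension:
legal_accuracy, factual_completeness, evidence_identification, legal_relevance,
procedural_accuracy, argumentation_structure, citation_quality, critical_issues.
Respond with a JSON object mapping dimension name to integer score."""

def judge_findings(findings: str, source_doc: str, call_llm) -> dict[str, int]:
    """Ask a judge model to score one set of findings against the rubric.

    `call_llm` is a stand-in for the actual LLM client: it takes a prompt
    string and returns the model's text response.
    """
    prompt = (
        f"{RUBRIC}\n\nSOURCE DOCUMENT:\n{source_doc}\n\n"
        f"FINDINGS TO EVALUATE:\n{findings}"
    )
    return json.loads(call_llm(prompt))

# Stub judge for illustration; a real call would hit an LLM API.
def fake_judge(prompt: str) -> str:
    return json.dumps({"legal_accuracy": 4, "citation_quality": 5})

scores = judge_findings("The deposition contains...", "Q: ... A: ...", fake_judge)
print(scores["legal_accuracy"])  # 4
```

The key design point is that the judge returns structured per-dimension scores rather than a single verdict, which is what makes the downstream aggregation and statistics possible.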
Eight Dimensions of Legal Quality
- Legal Accuracy — most important, weighted double. If a model misstates the law, nothing else matters.
- Factual Completeness — dates, parties, amounts, contradictions, key admissions.
- Evidence Identification — distinguishing direct evidence, circumstantial evidence, and inference.
- Legal Relevance — whether findings bear on the legal issues at stake, rather than being accurate but beside the point.
- Procedural Accuracy — jurisdiction, statutes of limitation, filing requirements.
- Argumentation Structure — reasoning coherence, conclusions supported by premises.
- Citation Quality — specific references to the source documents rather than vague ones.
- Critical Issue Identification — surfacing the most important issues rather than burying them.
Each dimension is scored 1–5 with explicit criteria. The final score is a weighted aggregate with legal accuracy at double weight.
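The weighted aggregate is straightforward; this sketch shows why doubling legal accuracy matters. The dimension keys are illustrative names, not the framework's exact identifiers.

```python
# Per-dimension weights: legal accuracy counts double, everything else once.
WEIGHTS = {
    "legal_accuracy": 2.0,
    "factual_completeness": 1.0,
    "evidence_identification": 1.0,
    "legal_relevance": 1.0,
    "procedural_accuracy": 1.0,
    "argumentation_structure": 1.0,
    "citation_quality": 1.0,
    "critical_issue_identification": 1.0,
}

def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of 1-5 dimension scores."""
    total_weight = sum(WEIGHTS[d] for d in dimension_scores)
    return sum(s * WEIGHTS[d] for d, s in dimension_scores.items()) / total_weight

# A model that scores 4 everywhere but 2 on legal accuracy:
scores = {d: 4.0 for d in WEIGHTS} | {"legal_accuracy": 2.0}
print(round(weighted_score(scores), 2))  # 3.56, vs 3.75 unweighted
```

A two-point miss on legal accuracy drags the aggregate down almost twice as hard as the same miss on any other dimension.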
Consensus Through Multiple Judges
Multiple LLM judges independently score the same output. We aggregate scores and measure inter-judge agreement. We use four judge personas: a senior attorney (15+ years), a litigation specialist, a legal researcher, and an e-discovery expert.
When agreement drops below 0.8, that disagreement becomes a signal for human review. High inter-judge agreement gives us confidence that the scores reflect genuine quality differences rather than noise. Low agreement tells us something interesting is happening — the output may be in a gray area that warrants closer inspection.
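A minimal sketch of the aggregation step. The agreement metric here (one minus the normalized standard deviation) is an assumption for illustration; the post does not specify which inter-judge agreement statistic the framework computes.

```python
from statistics import mean, stdev

def consensus(judge_scores: list[float], scale_max: float = 5.0,
              threshold: float = 0.8) -> tuple[float, float, bool]:
    """Aggregate one dimension's scores from several judges.

    Agreement is a simple spread-based proxy (1 - std dev / scale max);
    the framework's actual metric is not specified here.
    Returns (mean score, agreement, needs_human_review).
    """
    avg = mean(judge_scores)
    spread = stdev(judge_scores) if len(judge_scores) > 1 else 0.0
    agreement = 1.0 - spread / scale_max
    return avg, agreement, agreement < threshold

# Four judge personas scoring the same output on one dimension.
avg, agreement, flag = consensus([4, 4, 5, 4])
print(round(avg, 2), round(agreement, 2), flag)  # 4.25 0.9 False
```

When the judges split more sharply — say [2, 4, 5, 2] — agreement falls below the threshold and the output is flagged for human review instead of being silently averaged.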
The End-to-End Evaluation Pipeline
- Load source documents — depositions, contracts, expert reports, discovery responses.
- Generate findings — one or more LLMs generate structured findings from the source documents.
- Evaluate — judge models score each set of findings across all eight dimensions.
- Analyze — means, standard deviations, correlations, pairwise statistical significance.
- Report — markdown reports and visualizations (box plots, heatmaps, correlation matrices).
The pipeline runs asynchronously for efficiency, allowing us to evaluate multiple models across multiple documents in parallel.
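The fan-out pattern looks roughly like this. The generate and judge functions are stubs standing in for real LLM calls; only the concurrency structure is the point.

```python
import asyncio

# Stubs standing in for real model calls over the network.
async def generate_findings(model: str, doc: str) -> str:
    await asyncio.sleep(0)  # placeholder for API latency
    return f"findings({model}, {doc})"

async def judge(findings: str) -> dict[str, float]:
    await asyncio.sleep(0)
    return {"legal_accuracy": 4.0}

async def evaluate(models: list[str], docs: list[str]) -> list[dict[str, float]]:
    """Fan out generation + judging for every (model, document) pair."""
    async def one(model: str, doc: str) -> dict[str, float]:
        findings = await generate_findings(model, doc)
        return await judge(findings)

    tasks = [one(m, d) for m in models for d in docs]
    return await asyncio.gather(*tasks)

results = asyncio.run(evaluate(["model-a", "model-b"], ["depo.txt", "motion.pdf"]))
print(len(results))  # 4 evaluations, run concurrently
```

Because each (model, document) pair is independent, `asyncio.gather` lets the slow part — waiting on model APIs — overlap across all pairs.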
What the Results Actually Showed
Performance by dimension:
- Evidence Detection: 0.96 avg, 100% pass
- Completeness Check: 1.00 avg, 100% pass
- Strategy & Planning: 0.91 avg, 100% pass
- Tool Efficiency: 0.85 avg, 100% pass
- Reasoning Transparency: 0.87 avg, 100% pass
- Legal Reasoning: 0.57 avg, 33.3% pass
Legal reasoning was the weak point. The model produced overgeneralized conclusions — correct facts paired with imprecise analysis. It could identify the right evidence but struggled to draw the kind of tightly reasoned legal conclusions that a senior attorney would expect. Without dimension-level scoring, these weaknesses would have been masked by strong performance in other areas. The overall average would have looked fine. The dimension-level breakdown revealed that the model was not yet ready for tasks requiring deep legal reasoning without human oversight.
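The masking effect is easy to verify from the numbers above: averaging the six reported dimension scores hides the one that matters.

```python
from statistics import mean

# Dimension averages reported above.
scores = {
    "evidence_detection": 0.96,
    "completeness_check": 1.00,
    "strategy_planning": 0.91,
    "tool_efficiency": 0.85,
    "reasoning_transparency": 0.87,
    "legal_reasoning": 0.57,
}

overall = mean(scores.values())
print(round(overall, 2))            # 0.86 -- looks healthy
print(min(scores, key=scores.get))  # legal_reasoning -- the real story
```

An 0.86 headline number would pass most sanity checks; only the per-dimension breakdown exposes the 0.57 on legal reasoning.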
What We Learned
- Weighted dimensions matter. Equal weighting produced misleading rankings. A model that scored well on citation quality but poorly on legal accuracy would rank higher than it should under equal weighting. Doubling the weight on legal accuracy corrected this distortion.
- Multiple judges are essential. Single-judge evaluation showed approximately 15% variance in scores. With four judges, the variance dropped significantly, and the scores became stable and reproducible.
- Document type is a confounding variable. One model appeared to perform better than another — but only because the test set was skewed toward document types that favored it. Controlling for document type eliminated the apparent advantage.
- Statistical significance prevents false confidence. We require p < 0.05 before declaring one model better than another. Without this threshold, random variation in small samples could lead to incorrect conclusions about model quality.
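The post does not name the exact test the framework uses; a permutation test on the difference in mean scores is one dependency-free way to get a p-value for small samples, sketched here with made-up score lists.

```python
import random
from statistics import mean

def permutation_p_value(a: list[float], b: list[float],
                        n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in mean scores.

    Repeatedly reshuffle the pooled scores into two groups and count how
    often a random split produces a mean gap at least as large as observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled, n = a + b, len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / n_perm

# Illustrative per-document scores for two models (not real data).
model_a = [0.91, 0.88, 0.93, 0.90, 0.89]
model_b = [0.86, 0.84, 0.88, 0.85, 0.83]

p = permutation_p_value(model_a, model_b)
print(p < 0.05)  # True -- only then do we declare model_a better
```

With five documents per model, a 0.05-point gap can still clear the threshold when the score distributions barely overlap; noisier scores on the same sample size would not, which is exactly the false confidence the threshold guards against.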
Why This Matters
You cannot responsibly deploy AI to analyze depositions, contracts, and evidence without knowing quantitatively how good it is. Not "it seems to work well." Not "the attorneys like the output." Quantitatively — with scores, confidence intervals, and statistical tests.
The framework spans approximately 2,600 lines of Python across six modules. But the underlying patterns — weighted multi-dimension scoring, multi-judge consensus, statistical testing — are broadly applicable to any high-stakes domain where the cost of errors is high and the tolerance for ambiguity is low.
If you are building AI systems for legal work, medical analysis, financial compliance, or any domain where precision matters, you need an evaluation framework that goes beyond "accuracy on a benchmark." You need one that measures the dimensions that actually matter to the professionals who will rely on the output.