Search in eDiscovery has gone through three distinct technology generations in less than twenty years, and the legal doctrine has only caught up to the second of them. Generation one was keyword search — the Boolean query, the wildcard, the proximity operator — carried over from Westlaw and the early hosted-review platforms. By 2008, courts were already warning that keyword search alone was not a defensible methodology. In Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 261–62 (D. Md. 2008), Magistrate Judge Paul Grimm held that the producing party bears the burden of proving the reasonableness of its search methodology, and that an undocumented keyword protocol with no quality assurance and no sampling cannot meet that burden.
Generation two was technology-assisted review (TAR 1.0): passive learning over a static seed set. A senior reviewer would code a sample of a few thousand documents as responsive or not, the classifier would train on that seed, and the resulting model would be applied to the full corpus. This is the technology blessed in Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182, 191 (S.D.N.Y. 2012), where Magistrate Judge Andrew Peck wrote that “computer-assisted review now can be considered judicially-approved for use in appropriate cases.” TAR 2.0, also called continuous active learning or CAL, replaced the static seed set with an iterative ranking loop in which reviewers continuously code the highest-scoring uncoded documents and the model continuously updates.
Generation three is the LLM-based classifier. Instead of training a logistic-regression or SVM model on attorney-coded examples, a large language model is given a natural-language description of what counts as responsive — the issue, the custodians, the date range, the privilege test — and asked to read each document and return a classification with a written justification. A ten-line prompt now does what a six-week TAR project did in 2015, and it does it without the cold-start problem of having to code a seed set first. The change in cost and speed is significant. The change in defensibility doctrine has not yet caught up.
The case-law foundation for any conversation about AI-assisted review begins with two opinions, both from Magistrate Judge Peck, three years apart. Da Silva Moore in 2012 was the first US judicial endorsement of TAR. The court approved the parties’ predictive-coding protocol over the plaintiffs’ objections that the technology lacked “generally accepted reliability standards,” reasoning that no review method — manual, keyword, or machine — can guarantee perfection, and that the well-documented inconsistency of contract-attorney review made TAR a comparatively defensible alternative. Critically, the opinion approved a protocol, not a tool: a 2,399-document random baseline, seven iterative training rounds, disclosure of the seed set, and a final random sample of discarded documents to measure recall. Da Silva Moore, 287 F.R.D. at 187–91.
Three years later, in Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125, 127 (S.D.N.Y. 2015), the same judge declared TAR effectively black-letter law, writing that “it is inappropriate to hold TAR to a higher standard than keywords or manual review.” The English High Court reached the same conclusion the following year in Pyrrho Investments Ltd v. MWB Property Ltd [2016] EWHC 256 (Ch) at [33], where Master Matthews enumerated ten reasons for approving predictive coding on a 3.1-million-document disclosure exercise, including that “no evidence” suggests predictive coding is less accurate than manual review and that consistency across a single classifier’s judgments is itself a defensibility advantage.
The following year, the same court drew the limit. In Hyles v. New York City, 2016 WL 4077114, at *4 (S.D.N.Y. Aug. 1, 2016), Magistrate Judge Peck refused the plaintiff’s request to compel the defendant to use TAR, holding that “Hyles’ application to force the City to use TAR is DENIED” on the strength of Sedona Principle 6: a responding party is best situated to choose its own search methodology, and the standard is reasonableness, not perfection. Together, these three cases settled the proposition that AI-assisted review is permissible, that it cannot be held to a higher standard than the alternatives, and that no party can be forced into it. They did not settle anything about LLMs.
The Da Silva Moore / Rio Tinto framework was built around three structural assumptions that no longer hold for a generative model. The first assumption is the existence of a stable training set. A TAR 1.0 protocol depends on a coded seed set that the parties can examine, sample, and dispute — the 2,399-document baseline in Da Silva Moore is the canonical example. An LLM-based classifier given a natural-language prompt has no seed set in that sense. The “training” the model received happened years earlier, on a corpus the producing party never saw, and the prompt is closer to a jury instruction than to a labeled example. Existing doctrine has no vocabulary for auditing a prompt the way it audits a seed set.
The second assumption is reproducibility. A logistic-regression classifier trained on a fixed seed set will return the same probability for the same document every time it is run. Most production LLM endpoints, by contrast, are stochastic by default. Run the same prompt against the same document twice and you may get two slightly different classifications. The Pyrrho ten-factor analysis explicitly cited consistency — “greater consistency by applying a single senior lawyer’s approach across the entire data set” (Pyrrho [2016] EWHC 256 (Ch) at [33]) — as one of the principal reasons predictive coding should be approved. A non-deterministic classifier inverts that argument unless the validation framework controls for it.
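The stochasticity problem is measurable before it becomes a validation argument. A minimal sketch, in which `stub_classify` is a hypothetical stand-in for whatever model call a platform actually makes, quantifies a classifier’s self-agreement by running the same document through it repeatedly:

```python
from collections import Counter

def self_agreement_rate(classify, document, runs=10):
    """Classify the same document `runs` times and return the fraction
    of runs matching the modal (most common) label. A deterministic
    classifier scores 1.0; a stochastic endpoint scores below it."""
    labels = [classify(document) for _ in range(runs)]
    modal_label, modal_count = Counter(labels).most_common(1)[0]
    return modal_count / len(labels)

# Illustrative stub -- a real pipeline would call an LLM endpoint here.
def stub_classify(document):
    return "responsive" if "merger" in document.lower() else "not responsive"

rate = self_agreement_rate(stub_classify, "Draft merger agreement, rev 3")
```

A validation framework that reports this rate per matter can show on the record how much (or how little) endpoint nondeterminism actually affects classifications, rather than leaving the Pyrrho consistency argument to rhetoric.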
The third assumption is opacity tolerance. Courts have been willing to approve TAR despite its “black box” character because the validation metrics — precision, recall, F1, elusion rate — are mathematically well-defined and can be sampled directly from the production set. An LLM that generates a natural-language justification alongside its classification is, paradoxically, both more transparent (you can read its reasoning) and harder to validate (the reasoning may be a confabulation that has no causal relationship to the classification decision). This is the core defensibility challenge for the next decade of AI-assisted review, and it is why DecoverAI’s technical work in the LLM Evaluation Framework has focused on isolating the sources of disagreement between models on the same document.
The most significant engineering shift in AI-assisted review since Rio Tinto is the move from single-model classification to multi-model consensus. Instead of routing each document through one LLM and accepting the result, modern platforms route the same document through multiple independently trained models — typically three to five — and treat the agreement among them as a confidence signal. Where all models concur, the classification is treated as high-confidence and routed to a sampling-based QC pass. Where the models disagree, the document is escalated to human review. The disagreement rate becomes a directly observable validation metric of a kind that single-model TAR could not produce.
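The routing logic described above is simple enough to state directly. The sketch below is illustrative, not any vendor’s actual implementation; the label and route names are assumptions:

```python
def route_document(labels):
    """Route one document given its per-model classifications.

    `labels` is the list of labels returned by the independent models
    for a single document. Unanimous agreement routes to sampling-based
    QC; any disagreement escalates to human review."""
    if len(set(labels)) == 1:
        return "sampling_qc"
    return "human_review"

def disagreement_rate(per_document_labels):
    """Fraction of documents on which the models did not all agree --
    the directly observable validation metric single-model TAR lacks."""
    disagreements = sum(
        1 for labels in per_document_labels if len(set(labels)) > 1
    )
    return disagreements / len(per_document_labels)
```

For example, `route_document(["responsive", "responsive", "responsive"])` returns `"sampling_qc"`, while a single dissenting model sends the document to `"human_review"`.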
This is not a marketing reframing of voting ensembles. It maps directly onto the validation expectations the courts have already articulated. The Da Silva Moore protocol approved a final random sample of discarded documents to measure recall on the “not responsive” pile (Da Silva Moore, 287 F.R.D. at 187). Multi-model consensus generates the equivalent of that sample continuously, on every document, by treating any inter-model disagreement as a candidate for human review. The Pyrrho consistency rationale — that a single senior lawyer’s approach should be applied across the entire data set — is preserved by using each LLM as an independent “reviewer” and measuring agreement across them, the same way a real review team would measure inter-coder reliability.
The new validation vocabulary that comes out of this is worth learning, because it is the language a Rule 26(g)-certifying attorney will need to use on the record. The four metrics that matter are per-class recall (what fraction of truly responsive documents the consensus catches), per-class precision (what fraction of documents the consensus calls responsive actually are), inter-model agreement rate (the fraction of documents where all classifiers reached the same answer), and elusion rate (the fraction of the “not responsive” pile that, on random sampling, turns out to be responsive). DecoverAI’s Meridian Lab evaluation framework measures all four on every matter, and the white paper accompanying this article reproduces the per-document-type targets DecoverAI uses internally.
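All four metrics are elementary to compute once the consensus labels, the per-model labels, and a human-coded validation sample exist. A minimal sketch (function and label names are illustrative, not DecoverAI’s internal code):

```python
def recall(predicted, actual, cls):
    """Of documents truly in `cls`, the fraction the consensus caught."""
    relevant = [(p, a) for p, a in zip(predicted, actual) if a == cls]
    caught = sum(1 for p, a in relevant if p == cls)
    return caught / len(relevant) if relevant else None

def precision(predicted, actual, cls):
    """Of documents the consensus called `cls`, the fraction truly in it."""
    called = [(p, a) for p, a in zip(predicted, actual) if p == cls]
    correct = sum(1 for p, a in called if a == cls)
    return correct / len(called) if called else None

def agreement_rate(per_document_labels):
    """Fraction of documents on which every model returned the same label."""
    unanimous = sum(
        1 for labels in per_document_labels if len(set(labels)) == 1
    )
    return unanimous / len(per_document_labels)

def elusion_rate(sampled_actuals, cls="responsive"):
    """Fraction of a random sample drawn from the 'not responsive' pile
    that human review shows to actually be responsive."""
    hits = sum(1 for a in sampled_actuals if a == cls)
    return hits / len(sampled_actuals)
```

These are the numbers a certifying attorney would read into the record; the per-document-type targets they are measured against are a separate, matter-specific question.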
Federal Rule of Civil Procedure 26(g) requires that every disclosure, discovery request, response, and objection be signed by an attorney of record, and that the certification be one “formed after a reasonable inquiry”: a disclosure must be “complete and correct as of the time it is made,” and a response must be consistent with the rules, non-frivolous, and not interposed for any improper purpose. The leading application of that requirement to ESI is Magistrate Judge Paul Grimm’s opinion in Mancia v. Mayflower Textile Servs. Co., 253 F.R.D. 354, 357–58 (D. Md. 2008), which held that boilerplate objections are prima facie evidence of a Rule 26(g) violation because they suggest the lawyer did not actually pause to investigate the burden being claimed. The duty to inquire is affirmative, not formal.
The same logic, applied to an LLM-based responsiveness pass, gets uncomfortable quickly. If the certifying attorney has not personally reviewed the documents, what exactly has she “reasonably inquired” into? The answer the doctrine already gives, in the keyword-search context, is that the attorney must inquire into the methodology, not into each individual document — the failure in Victor Stanley I was a failure to document and test the keyword protocol, not a failure to review every hit. Victor Stanley, 250 F.R.D. at 261–62 (the producing party “failed to demonstrate that the keyword search they performed on the text-searchable ESI was reasonable”). The same standard applied to an LLM means the attorney must be able to describe, on the record, what the model is, what the prompt is, what the validation metrics are, and what the sampling protocol is.
This is why the choice of platform is inseparable from the attorney’s reasonable inquiry. If the platform cannot produce a written description of its prompt, its model versions, its consensus protocol, and its validation metrics for the matter at hand, the attorney cannot certify under Rule 26(g) that she has conducted a reasonable inquiry — she has simply outsourced the question. A platform that publishes its validation framework, exposes its agreement metrics on a per-matter basis, and gives the attorney the documentation she needs to defend the methodology in a meet-and-confer is the only kind of platform that lets a Rule 26(g) signature actually mean something. Compare this to the Hyles “reasonableness” standard: courts will not force you to use AI, but if you do use it, the inquiry into how you used it has to be on the record.
DecoverAI’s Relevance Detection product is built on a multi-model consensus architecture in which every document is classified independently by multiple frontier LLMs, and the disagreement rate is exposed to the reviewing attorney as a first-class signal. Documents on which all models agree are routed to sampling-based QC; documents on which they disagree are escalated to human review with the model-by-model justifications surfaced side by side. The validation metrics — per-class recall, per-class precision, inter-model agreement, elusion rate — are recomputed on every matter and made available to the certifying attorney in a form that can be attached to a Rule 26(f) statement or read into the record at a meet-and-confer.
The evaluation framework underneath this is documented in the Meridian Lab post, which is the technical companion to this policy-facing piece. The white paper that accompanies this article goes further: it sets out the GenAI Validation Framework DecoverAI uses internally — per-document-type recall and precision targets, multi-judge consensus thresholds, the sampling methodology, and the documentation packet a Rule 26(g)-certifying attorney needs to take into a defensibility hearing. The framework is reproduced as a one-page structured table in the white paper and is the artifact most readers ask for after the demo.
The pricing model that supports this is the same one that supports the rest of the platform: $60 per gigabyte per month, all-in, no contracts, SOC 2 Type II and HIPAA compliant. There is no surcharge for multi-model classification, no per-document fee, no separate “AI premium.” In the Tax Credit Investigation case study, DecoverAI processed 30,000 documents in three days — including a complete privilege log — with the multi-model consensus protocol described in this article and the validation metrics published to the client. See the pricing page for the full breakdown, or book a 30-minute demo and we will run your own corpus through the consensus pipeline live on the call.