By: Jeff Johnson, Chief Innovation Officer
Last month, the Sedona Conference Working Group 13 Annual Meeting and the ASU Arkfeld Conference on eDiscovery, Law, and Technology each offered a thoughtful look at AI’s evolution in the legal profession. Where it stands today. Where it’s going. How we should be governing use. While those discussions and challenges extend well beyond generative AI in eDiscovery document review, the themes apply directly.
The night before the WG 13 annual meeting, over dinner, someone asked me: “Is it even possible to know that GenAI is correctly reviewing documents?”
It’s a good question and one we hear often.
Recently, I contributed to a multi-author industry article on this exact topic, Ground Truth: The Realities of Generative AI in eDiscovery. It is a great companion piece to this post, and I encourage anyone thinking through these questions to read both. Each piece covers some ground the other does not.
Throughout both conferences and the article, a theme surfaces repeatedly, perhaps best captured in a single idea: Do the right thing and prove it. I heard a panelist say it almost exactly that way. It stuck with me because it describes the work our team does every day.
Prompted by those conversations, this post focuses on a straightforward position:
- Technology-Assisted Review (TAR) is a process, not a specific technology. We may not all place generative AI–enabled workflows under the TAR umbrella today, but we will.
- We already know how to prove defensibility in AI-enabled review. Courts and practitioners have accepted validation of machine learning–driven workflows for years.
- GenAI changes the technology, not the validation standard. These workflows can be tested, measured, and validated using the same statistical frameworks.
- Concerns about missing relevant material while still achieving “acceptable” validation are not new. That risk exists in traditional TAR. GenAI may be better positioned to alleviate it.
TAR Is a Process, Not Technology
This distinction matters more than it might seem. TAR has never been defined by a single technology or specific algorithm. Many products exist, each with its own “secret sauce” built on various algorithms (such as linear regression, support vector machines, and k-nearest neighbor). No one required product-level or algorithm-level benchmarking before deploying them in production. What the profession required was a defensible, documented process, defined criteria, and measured outcomes.
GenAI classification works the same way. A large language model, guided by criteria drafted by qualified subject matter experts, produces a classification output: responsive or non-responsive, privileged or not. The process involves iterating and refining instructions until the model, armed with calibrated input, performs reliably against the defined criteria. The results are measurable. GenAI classifications can be compared against subject matter expert decisions, exactly as we do in validating traditional TAR classifications.
At a workflow level, this is indistinguishable from a TAR workflow.
A common analogy fits here: TAR is the vehicle. The algorithm is the engine. The engines have changed over the years. The vehicle (the workflow, the validation, the oversight component) has not. See Emory, Pickens & Louis, TAR 1 Reference Model, 25 Sedona Conf. J. 109 (March 2024) (analogizing TAR workflows to a vehicle and predictive algorithms to interchangeable engines).
We Know How to Prove Effective TAR-Enabled Review
Starting with Da Silva Moore v. Publicis Groupe in 2012, courts have accepted recall as the primary measure of review quality. The core questions have always been the same: Did the workflow find most of the relevant material? Can you prove it? Can you document process and proof?
GenAI classification does not change those questions, or the processes we use to answer them. It merely changes the mechanism producing the classifications. The validation workflow remains the same: establish the criteria; measure recall, precision, and elusion (the rate of relevant documents left in the discard set) by testing with control or validation samples; document the process; and remediate if measured results do not meet the established criteria.
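To make the measurement step concrete, here is a minimal Python sketch of how recall, precision, and elusion could be computed from a validation sample in which each document carries both the model's call and a subject matter expert's call. The function names, the sample counts, and the use of a Wilson score interval for the recall estimate are illustrative assumptions, not a description of any particular product's or team's validation protocol.

```python
import math
from dataclasses import dataclass


@dataclass
class ValidationMetrics:
    recall: float     # share of expert-relevant documents the model found
    precision: float  # share of model-relevant documents the expert agreed with
    elusion: float    # share of the discard set that is actually relevant


def validation_metrics(sample):
    """Compute metrics from (model_relevant, expert_relevant) boolean pairs.

    Assumes the sample was drawn at random from the whole review population.
    """
    tp = sum(1 for m, e in sample if m and e)
    fp = sum(1 for m, e in sample if m and not e)
    fn = sum(1 for m, e in sample if not m and e)
    tn = sum(1 for m, e in sample if not m and not e)
    nan = float("nan")
    recall = tp / (tp + fn) if (tp + fn) else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    elusion = fn / (fn + tn) if (fn + tn) else nan
    return ValidationMetrics(recall, precision, elusion)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical sample counts, for illustration only.
sample = (
    [(True, True)] * 90      # model and expert both say relevant
    + [(False, True)] * 10   # relevant documents the model missed
    + [(True, False)] * 30   # model said relevant, expert disagreed
    + [(False, False)] * 870 # both say not relevant
)
metrics = validation_metrics(sample)
low, high = wilson_interval(90, 100)  # 90 of 100 relevant documents found
print(metrics, (low, high))
```

The interval matters as much as the point estimate: a recall figure based on a small number of relevant documents in the sample carries a wide margin, which is part of what the documentation should record.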
I’ve yet to see any compelling basis for requiring a separate or heightened validation standard simply because our engine (the AI model) is newer. The novelty of the tool does not change the key question. What matters is whether the producing party can demonstrate, through accepted methods and documented quality controls, that the workflow achieved reliable results.
For example, in a recent review for production involving 51,000 documents, a PurposeXi workflow using Relativity aiR for Review achieved a validated 98% recall with 70% precision. Importantly, this review process saved over 900 hours of attorney time and $24,000 in review expense (compared to realistic estimates for a traditional TAR – Continuous Active Learning workflow).
We’ve guided clients through GenAI workflows on many live matters, involving millions of documents. The results are consistently strong.
With sound workflow design and expert oversight, we see validated recall averaging above 90%, often exceeding 95%, while maintaining precision averaging 84%. These results meaningfully outperform traditional TAR benchmarks.
In another example involving 60,000 documents, our team designed a workflow combining Purpose CaseOptics™ for initial relevance triage with a traditional TAR – Continuous Active Learning workflow. End-to-end process validation resulted in a 93% recall estimate with precision over 95%.
Our clients appreciate these workflows for their improved validated outcomes combined with savings. Some have already established GenAI–enabled review as their default approach for any substantial review and others are following.
The “What Is It Missing?” Concern Is Not New & GenAI May Help Solve It
A concern we hear raised about GenAI review is the possibility of missing relevant material, achieving statistically acceptable validation metrics while still leaving significant gaps.
The concern generally applies to rare relevant documents or low-prevalence subcategories. Specifically, the concern is that a seemingly well-validated review may have failed to find one or more categories of relevant documents because they represent such a small percentage of the corpus.
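A short calculation shows why a rare subcategory can slip past an otherwise sound validation sample. The Python sketch below uses a simple binomial approximation (each sampled document is treated as an independent draw); the function names, the 0.1% prevalence, and the 1,500-document sample size are hypothetical values chosen for illustration.

```python
import math


def prob_sample_misses_subtype(prevalence: float, sample_size: int) -> float:
    """Chance that a random sample contains zero documents of a subtype.

    Binomial approximation: each draw independently misses the subtype
    with probability (1 - prevalence).
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return (1.0 - prevalence) ** sample_size


def sample_size_to_see_subtype(prevalence: float, confidence: float = 0.95) -> int:
    """Smallest sample size with at least `confidence` chance of including
    one or more documents of the subtype."""
    if not 0.0 < prevalence < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("prevalence and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - prevalence))


# Hypothetical: a subtype making up 0.1% of the corpus, a 1,500-document sample.
miss_chance = prob_sample_misses_subtype(0.001, 1500)
needed = sample_size_to_see_subtype(0.001, 0.95)
print(f"chance the sample contains none of the subtype: {miss_chance:.1%}")
print(f"sample size for a 95% chance of seeing at least one: {needed}")
```

Under these assumed numbers, a validation sample has a better than one-in-five chance of containing no documents from the rare subtype at all, so overall recall can look acceptable while saying nothing about that category.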
This is a legitimate concern. But it is not new or unique to GenAI workflows. In fact, a well-executed GenAI-enabled process has the advantage here.
In our matters, GenAI classification has consistently achieved significantly higher recall, making these shortcomings less likely.
Additionally, traditional TAR trains on what it sees. If a relevant document subtype is rare, the model may never develop adequate training to find it consistently. On the other hand, GenAI classification works from criteria, not training examples. You describe what matters — the model finds it, whether it appears ten times or ten thousand. That is a meaningful structural advantage in any matter where you know certain document types are critical but rare.
Of course, this advantage depends on how well the criteria are written. A vague or incomplete prompt won’t reliably surface rare subtypes any more than a poorly seeded TAR model will. The quality of your instructions is the controlling variable — which is exactly why expert oversight in prompt development, targeted searching, and validation by category are core workflow disciplines, not optional steps.
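Validation by category can be sketched in a few lines. The Python example below assumes each validated document is tagged with an issue category by the subject matter expert; the category names, the counts, and the `min_relevant` threshold are hypothetical, and the function is an illustration rather than a prescribed protocol.

```python
from collections import defaultdict


def recall_by_category(sample, min_relevant=5):
    """Per-category recall from (category, model_relevant, expert_relevant) rows.

    Categories with fewer than `min_relevant` expert-relevant documents in
    the sample are flagged, since their recall figure rests on too little
    evidence and may call for targeted searching or a larger sample.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [found, total relevant]
    for category, model_relevant, expert_relevant in sample:
        if expert_relevant:
            counts[category][1] += 1
            if model_relevant:
                counts[category][0] += 1
    return {
        category: {
            "recall": found / total,
            "relevant_in_sample": total,
            "too_few_to_trust": total < min_relevant,
        }
        for category, (found, total) in counts.items()
    }


# Hypothetical validation rows, for illustration only.
sample = (
    [("pricing", True, True)] * 47
    + [("pricing", False, True)] * 3
    + [("board minutes", True, True)] * 1
    + [("board minutes", False, True)] * 1
    + [("pricing", False, False)] * 200
)
for category, stats in recall_by_category(sample).items():
    print(category, stats)
```

A report like this makes thin coverage visible: a category with only a couple of relevant documents in the sample is a prompt for targeted follow-up rather than a number to rely on.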
Existing Rules Are Sufficient – We Do Need to Apply Them
Many calls for new rules for GenAI in eDiscovery miss the point. For classification workflows, the rules we already have are adequate. Professional responsibility, the Federal Rules, and more than a decade of TAR case law give practitioners everything they need to deploy these tools responsibly.
That said, we have noted a tendency among our clients and partners to trust GenAI output more readily than they trusted earlier TAR solutions. The attorney-readable rationale that accompanies each classification is a likely explanation for this unearned confidence.
This is not a good rationale for new rules. It is a judgment problem, and training is the answer. We need to understand that our oversight obligations apply to AI, just as they do to any member of our team whose work we rely on.
We build expert oversight into every workflow as a process requirement. Defined quality controls, documented decisions, and validation sampling are not optional steps. They are how we sign off on results.
Where This Leaves Us
The question asked over dinner, “Is it even possible to know that GenAI is correctly reviewing documents?”, has a clear answer. Yes. We know how to test it, measure it, and prove it. The tools are the same ones the profession has relied on for years.
Do the right thing and prove it. That is not a new standard. It is the only standard that has ever mattered in defensible document review. GenAI does not change that. It just gives us a more capable engine to work with.
The firms and teams that will lead in this space are the ones that combine strong workflow design with rigorous validation, and can explain every step to a client, opposing counsel, or a judge. That is the standard we hold ourselves to on every matter.
If you want to see what PurposeXi’s validated GenAI workflows look like in practice, we’re happy to walk you through them. Reach out and we’ll show you how we do the right thing and prove it!