DecoverAI Blog - An Introduction to Ediscovery

What Ediscovery Is — and Why It Matters

Electronic discovery — ediscovery — is the legally governed process of identifying, preserving, collecting, reviewing, and producing electronically stored information in litigation and investigations. Platforms like DecoverAI apply AI-first workflows to every stage of that process, dramatically reducing the cost and time of review while maintaining the defensibility courts require. This primer covers the EDRM lifecycle, the governing federal rules, the data sources in play in a modern matter, and the practical steps every practitioner needs to get ediscovery right.

For legal practitioners entering the field, it is worth appreciating how dramatically the discovery landscape has changed in the past twenty-five years. When the Federal Rules of Civil Procedure were amended in 2006 to formally recognize ESI as a distinct category of discoverable material, the typical matter involved gigabytes of email and a few network drives. Today, the same matter might involve terabytes of data spread across Microsoft 365, Google Workspace, Slack, Zoom, Salesforce, Jira, GitHub, mobile devices, ephemeral messaging applications, and a dozen SaaS platforms nobody thought to map until a litigation hold issued. Ediscovery is the discipline that manages this complexity without losing defensibility.

Why does it matter? Because the cost, timeline, and outcome of most civil matters are now driven by how well the parties execute discovery. A study by the RAND Institute for Civil Justice found that review costs alone can consume 70 percent or more of total ediscovery spend, and ediscovery itself frequently accounts for the majority of total litigation budgets in complex cases. When discovery goes poorly — when preservation fails, when productions are incomplete, when privileged material is inadvertently disclosed — the consequences range from adverse inferences and monetary sanctions to case-dispositive rulings and waiver of privilege. Getting ediscovery right is not a nice-to-have. It is a core competency of modern litigation practice.

The good news is that ediscovery, despite its technical surface, is governed by a well-developed framework of standards, rules, and best practices. The Electronic Discovery Reference Model (EDRM), maintained today by the independent EDRM organization (which was housed at Duke Law School from 2016 to 2019), provides the canonical process map for the discipline. The Federal Rules of Civil Procedure and their state analogs provide the legal scaffolding. And The Sedona Conference provides authoritative guidance on the principles that should inform any defensible ediscovery workflow. This primer walks through each of these in turn.

The EDRM Framework: Nine Stages of the Discovery Lifecycle

The Electronic Discovery Reference Model was first published in 2005 by a consortium of ediscovery practitioners seeking to create a common vocabulary and process map for the field. It is now the most widely referenced framework in the profession. The EDRM depicts ediscovery as a lifecycle moving left to right, from the upstream information-governance activities that precede any specific matter, through the downstream activities of review and production, to the final presentation of evidence in court. Crucially, the model is iterative rather than strictly linear: earlier stages narrow the volume of data that must be handled in later stages, and findings from later stages can send you back upstream to collect additional custodians or refine your scope.

Stage 1: Information Governance. Information governance precedes litigation. It encompasses the policies, procedures, and technical controls an organization implements to manage its data throughout its lifecycle — retention schedules, records-management policies, data maps, disposal protocols, and the identification of data stewards. Strong information governance reduces ediscovery cost and risk dramatically. If an organization has a defensible retention policy and actually follows it, there is less data to preserve, less data to collect, and less data to review when litigation strikes. The Sedona Conference's Commentary on Information Governance treats IG as the foundation on which every other ediscovery activity depends.

Stage 2: Identification. Once a matter is reasonably anticipated, the identification stage maps the universe of potentially relevant ESI. Who are the custodians? What systems did they use? Where is the data stored? Identification typically involves custodian interviews, IT interviews, a review of the organization's data map, and the issuance of legal hold notices. The output of identification is a defensible inventory of the data sources that must be preserved and, ultimately, collected. Mistakes at this stage — missing a custodian, overlooking a collaboration platform, failing to identify a departing employee's personal device — propagate through every downstream stage.

Stage 3: Preservation. Preservation is the legal and technical act of ensuring that potentially relevant ESI is not altered, deleted, or otherwise lost. It is triggered by the duty to preserve, which arises when litigation is reasonably anticipated, often well before a complaint is filed. Preservation is typically implemented through a litigation hold (a written notice to custodians and data stewards instructing them to preserve relevant materials) combined with technical measures such as suspending auto-deletion policies, placing mailboxes on legal hold in Microsoft 365, and preserving Slack workspaces through vendor-provided hold features. A failure to take reasonable steps to preserve can trigger the curative measures and sanctions of Rule 37(e), discussed below.

Stage 4: Collection. Collection is the acquisition of preserved ESI in a forensically sound manner. The guiding principle is defensibility: the collection process must preserve the authenticity, integrity, and metadata of the source data and must be repeatable and documented. Modern collection tools interface directly with enterprise platforms (Microsoft Graph API, Google Vault, Slack's Discovery API, Box Governance) to pull data with full metadata intact. For custodian-held data (laptops, phones, thumb drives), forensic imaging or targeted collection is used. Chain-of-custody documentation begins here and continues throughout the matter.

Stage 5: Processing. Processing transforms raw collected data into a form suitable for review. It typically includes extraction of text and metadata, file-type identification, deduplication, de-NISTing (removing known system files), language detection, and indexing. Processing normalizes disparate data sources into a common schema so that email, Word documents, spreadsheets, Slack messages, and PDFs can all be searched and reviewed in the same interface. Processing is also where early case assessment culling happens: date filters, custodian filters, keyword filters, and family-threading decisions that dramatically reduce the population that must be reviewed.
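
To make the culling and deduplication steps concrete, here is a minimal Python sketch. It assumes a simplified document record with hypothetical keys (id, custodian, sent, content) and uses a SHA-256 digest of the raw content as the duplicate key. Real processing engines typically hash a normalized set of fields for email and let you choose between global and per-custodian deduplication, so treat this as an illustration of the idea rather than how any particular platform works.

```python
import hashlib
from datetime import date


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the deduplication key."""
    return hashlib.sha256(data).hexdigest()


def cull_and_dedupe(docs, custodians, start, end):
    """Apply custodian and date filters, then drop exact duplicates.

    `docs` is an iterable of dicts with hypothetical keys:
    'id', 'custodian', 'sent' (datetime.date), and 'content' (bytes).
    Deduplication is global: the first copy seen wins, whoever held it.
    """
    seen = set()
    kept = []
    for doc in docs:
        if doc["custodian"] not in custodians:
            continue  # custodian filter
        if not (start <= doc["sent"] <= end):
            continue  # date filter
        digest = content_hash(doc["content"])
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(doc["id"])
    return kept


docs = [
    {"id": "DOC-1", "custodian": "alice", "sent": date(2023, 3, 1), "content": b"Q3 pricing memo"},
    {"id": "DOC-2", "custodian": "bob", "sent": date(2023, 3, 2), "content": b"Q3 pricing memo"},
    {"id": "DOC-3", "custodian": "alice", "sent": date(2019, 1, 5), "content": b"old newsletter"},
    {"id": "DOC-4", "custodian": "carol", "sent": date(2023, 4, 1), "content": b"lunch order"},
]
kept = cull_and_dedupe(docs, {"alice", "bob"}, date(2023, 1, 1), date(2023, 12, 31))
print(kept)
```

In this toy population only DOC-1 survives: DOC-2 is a byte-for-byte duplicate, DOC-3 falls outside the date range, and DOC-4 belongs to a custodian who is out of scope. Whatever filters you apply, record them, because they are part of the process you may later have to defend.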

Stage 6: Review. Review is the most expensive stage of ediscovery and, not coincidentally, the stage where AI has had the most transformative impact. Reviewers assess each document for responsiveness to the requests for production, for privilege, for confidentiality, and for issue tagging. Traditional linear review — where each document is examined by a human attorney — has been supplemented and in many cases replaced by technology-assisted review (TAR), also known as predictive coding, and increasingly by generative AI classification. DecoverAI's Relevance Detection is an example of an AI-first approach to this stage.

Stage 7: Analysis. Analysis runs in parallel with review and involves the substantive examination of the data to identify key facts, build chronologies, test theories of the case, and prepare for deposition and trial. Where review asks "is this document responsive?", analysis asks "what does this document tell us about what happened?" Modern ediscovery platforms provide concept clustering, email-thread analysis, communication network diagrams, and timeline visualization to support this work.

Stage 8: Production. Production is the formal transfer of responsive, non-privileged documents to the requesting party. The format of production — TIFF, native, or hybrid — is typically governed by the parties' ESI protocol. Production packages include image files, extracted text, metadata load files (Concordance DAT, Relativity-compatible formats), and any Bates numbering or confidentiality endorsements required by the protective order. Production is the most visible output of the ediscovery process and the stage where errors are most costly: a defective production must often be remediated at significant expense.
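
Bates numbering is easiest to understand with a small example. The Python sketch below assigns consecutive page-level Bates ranges to a list of documents. The prefix "ABC", the seven-digit width, and the function name are illustrative assumptions; the actual format is whatever your ESI protocol or protective order specifies.

```python
def bates_ranges(page_counts, prefix="ABC", start=1, width=7):
    """Assign consecutive Bates ranges to documents given their page counts.

    Returns one (first_page_label, last_page_label) tuple per document.
    """
    ranges = []
    next_page = start
    for pages in page_counts:
        if pages < 1:
            raise ValueError("each document must have at least one page")
        first = f"{prefix}{next_page:0{width}d}"
        last = f"{prefix}{next_page + pages - 1:0{width}d}"
        ranges.append((first, last))
        next_page += pages  # the next document starts on the following page
    return ranges


# Three documents of 3, 1, and 12 pages.
ranges = bates_ranges([3, 1, 12])
for first, last in ranges:
    print(first, last)
```

The beginning and ending labels for each document are the values that typically populate the Bates fields of the metadata load file, so gaps or overlaps in this sequence are among the first things a receiving party will notice.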

Stage 9: Presentation. Presentation is the use of the produced material in deposition, hearing, mediation, and trial. Documents are loaded into trial-presentation tools, hot documents are organized into exhibit binders, and key facts are woven into the case narrative. Presentation closes the loop: the discovery process exists ultimately to surface the evidence that will be used to tell the client's story.

The Governing Rules: FRCP 26, 34, 37(e), and Proportionality

Ediscovery in federal practice is governed primarily by the Federal Rules of Civil Procedure, as amended in 2006 and again in 2015. Every practitioner handling discoverable ESI should have a working command of the key provisions, because the rules define not only the scope of discovery but also the sanctions available when things go wrong.

Rule 26(b)(1) defines the scope of discovery. As amended in 2015, the rule limits discovery to "any nonprivileged matter that is relevant to any party's claim or defense and proportional to the needs of the case." Proportionality is assessed against six factors: the importance of the issues at stake, the amount in controversy, the parties' relative access to relevant information, the parties' resources, the importance of the discovery in resolving the issues, and whether the burden or expense outweighs the likely benefit. Proportionality was elevated to a core principle in the 2015 amendments precisely because courts were concerned about ediscovery costs spiraling out of control.

Rule 26(f) requires parties to confer early in the case to discuss ESI issues, including the preservation of discoverable information, the form of production, and any issues relating to claims of privilege. This is the meet-and-confer obligation that gives rise to ESI protocols and protective orders. Negotiating a thoughtful ESI protocol at this stage is one of the highest-leverage activities in any matter. Rule 26(g) imposes a certification obligation on counsel. Every disclosure must be signed to certify that it is complete and correct as of the time it is made, and every discovery request, response, or objection must be signed to certify that it is consistent with the rules, not interposed for an improper purpose, and not unreasonable or unduly burdensome. In each case the certification is made to the best of the signer's knowledge, information, and belief formed after a reasonable inquiry. This is the rule that undergirds the defensibility requirement throughout the ediscovery workflow.

Rule 34 governs requests for production of documents and ESI. It gives the requesting party the right to specify the form in which ESI is produced and requires the producing party to produce documents as they are kept in the usual course of business or to organize and label them to correspond to the categories in the request. Rule 34 is the source of the traditional requirement that productions maintain document families (parent emails with their attachments) and preserve metadata. Practitioners should pay particular attention to the form-of-production requirements, which are often the flashpoint in ESI protocol negotiations.

Rule 37(e), as amended in 2015, addresses the failure to preserve ESI. It applies when ESI "that should have been preserved in the anticipation or conduct of litigation is lost because a party failed to take reasonable steps to preserve it, and it cannot be restored or replaced through additional discovery." In that situation, a court that finds prejudice to another party may order measures no greater than necessary to cure the prejudice. Only if the court finds the party acted "with the intent to deprive another party of the information's use in the litigation" may it impose the most severe sanctions: a presumption or adverse-inference instruction that the lost information was unfavorable, dismissal, or default judgment. The 2015 amendment was specifically intended to curb the wide variation in sanctions practice that had developed in the federal courts and to provide a uniform standard. The practical takeaway: destruction with intent to deprive can be case-dispositive, while negligent failures that cause prejudice lead to more modest curative measures.

Alongside the FRCP, The Sedona Conference has issued a series of influential commentaries that have become the de facto standards for the profession. The Sedona Principles, Third Edition sets forth fourteen principles addressing the scope, form, and process of ediscovery. Principle 6, for instance, establishes that "responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information." Principle 11 provides that a responding party may satisfy its good-faith obligations to preserve and produce relevant ESI by using technology and processes such as sampling, searching, or selection criteria. Judges frequently cite the Sedona Principles in discovery rulings, and practitioners should treat them as persuasive authority on any contested ediscovery issue.

The Data Sources of Modern Ediscovery

One of the hardest adjustments for lawyers new to ediscovery is coming to grips with the sheer variety of data sources in play in a modern matter. Twenty years ago, "ediscovery" meant email and documents. Today, a single matter might implicate a dozen or more distinct platforms, each with its own collection mechanics, metadata conventions, and production challenges.

Email remains the workhorse of ediscovery. The dominant systems — Microsoft Exchange and Microsoft 365, Google Workspace, and legacy Lotus Notes environments — support direct API-based collection with full metadata preservation. Email families (parent message plus attachments) are treated as a single logical unit, and modern processing tools thread conversations so that related messages can be reviewed together. Despite the rise of collaboration tools, email still represents the largest single category of ESI in most matters because it has been the primary business communication channel for two decades.

Collaboration platforms like Slack and Microsoft Teams have transformed internal communication at most organizations and created entirely new categories of discovery challenges. Unlike email, which has clear document boundaries, Slack and Teams data is a continuous stream of threaded conversations, channels, direct messages, reactions, file uploads, and application integrations. Collecting this data requires specialized connectors — Slack's Discovery API and Microsoft's Purview eDiscovery are the primary enterprise tools — and producing it in a form that preserves context without overwhelming reviewers is a non-trivial problem. Courts increasingly expect that collaboration-platform data will be addressed explicitly in ESI protocols.

Mobile devices — smartphones and tablets — are another critical and often-overlooked source. Text messages, iMessage, WhatsApp, Signal, Telegram, and platform-specific messaging apps can contain some of the most probative communications in a matter, particularly in employment, trade-secret, and white-collar investigations. Collecting mobile data raises unique challenges around device ownership (BYOD versus corporate-issued), encryption, ephemeral messaging, and cross-border privacy laws. Forensic imaging via tools like Cellebrite or targeted cloud-backup collection via iCloud or Google Backup are the typical methods.

Cloud and SaaS applications round out the picture. Salesforce, Workday, Jira, Confluence, Box, Dropbox, Google Drive, SharePoint, GitHub, Notion, Asana, Zoom recordings — any application that stores business records is potentially in scope. Each platform has its own export mechanism, metadata model, and volume profile. A critical early-case-assessment task is to build a comprehensive data map of every SaaS application used by relevant custodians. Missing a platform is not just an oversight — it is a potential preservation failure and a Rule 26(g) problem.

Finally, practitioners must contend with ephemeral and modern data types: voice messages, video recordings, audio transcripts, structured database records, IoT telemetry, and increasingly the outputs of enterprise AI assistants. Each raises its own preservation and production issues. The Sedona Conference's Commentary on Ephemeral Messaging offers guidance on disappearing-message platforms and is required reading for any matter involving Signal, Snapchat, or similar tools.

Common Challenges: Volume, Cost, Privilege, and Defensibility

The challenges of modern ediscovery can be grouped into four recurring themes, each of which has shaped the evolution of the discipline. Understanding these challenges is essential to managing them.

Volume is the most visible challenge. By common industry estimates, the amount of ESI generated by a typical enterprise doubles roughly every two years, and the number of potentially relevant documents in a mid-sized commercial matter routinely runs into the millions. Volume drives cost, extends timelines, and creates risk: the more documents in the review pool, the greater the probability of inadvertent disclosure, missed responsive material, or missed privilege. Volume is addressed through early culling (date filtering, custodian filtering, file-type filtering, deduplication, email threading) and, more recently, through AI-assisted classification that allows reviewers to focus on the documents most likely to be relevant.

Cost is the constant complaint of general counsel and a principal driver of proportionality analysis under Rule 26(b)(1). Traditional ediscovery pricing models — per-gigabyte processing plus per-gigabyte-per-month hosting plus per-user review-seat fees — have produced matter budgets that shock even sophisticated litigation clients. The economics of document review are particularly punishing: at typical contract-reviewer rates, reviewing one million documents can cost $1 million or more. Modern platforms, including DecoverAI's $60/GB pricing, have moved toward flatter pricing structures that make budgets predictable and reduce the friction of scaling up to meet large matters.
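
The arithmetic behind that $1 million figure is worth seeing once. The Python sketch below uses illustrative assumptions only: a review pace of 50 documents per hour and a blended contract-reviewer rate of $50 per hour. Neither number comes from a study; substitute your own.

```python
def linear_review_cost(doc_count, docs_per_hour, hourly_rate):
    """Estimate first-pass linear review cost in dollars.

    cost = (documents / documents per hour) * hourly rate
    """
    if docs_per_hour <= 0:
        raise ValueError("docs_per_hour must be positive")
    reviewer_hours = doc_count / docs_per_hour
    return reviewer_hours * hourly_rate


# Assumed inputs: 1,000,000 documents, 50 documents/hour, $50/hour.
baseline = linear_review_cost(1_000_000, 50, 50.0)
print(f"Baseline review cost: ${baseline:,.0f}")

# Sensitivity: the same population at different assumed review speeds.
for pace in (40, 50, 60):
    cost = linear_review_cost(1_000_000, pace, 50.0)
    print(f"{pace} docs/hour -> ${cost:,.0f}")
```

Under these assumptions the review takes 20,000 reviewer-hours, and that is before second-level review, privilege review, or quality control. The estimate scales linearly with document count, which is why culling before review has such a large effect on the budget.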

Privilege is the single most consequential substantive issue in review. Producing privileged material risks waiving the privilege for that document, and under Federal Rule of Evidence 502(a) an intentional disclosure can extend the waiver to undisclosed material on the same subject matter; the consequences can be catastrophic. Under Rule 502(b), an inadvertent disclosure avoids waiver only if the holder took reasonable steps to prevent the disclosure and acted promptly to rectify the error, a showing that is itself expensive to litigate. Privilege review requires identifying communications involving attorneys, classifying them as privileged or not, determining whether privilege has been waived through disclosure to third parties, and logging each withheld document in a privilege log that is detailed enough to support the claim but not so detailed that it reveals the privileged content itself. Federal Rule of Evidence 502(d) provides critical protection: a federal court may order that disclosure connected with the pending litigation does not waive the privilege, in which case the disclosure is also not a waiver in any other federal or state proceeding. Every ESI protocol should include or reference a 502(d) order.

Defensibility is the thread that runs through every stage of the workflow. The ediscovery process must be one that counsel can explain and justify to a court. Every decision — what was collected, what was filtered, what search terms were used, what review workflow was applied, what quality-control measures were employed — must be documented and supportable. Defensibility is not about achieving perfection; the Sedona Principles and the federal rules both recognize that perfect recall is impossible at scale. Defensibility is about reasonableness: a process designed in good faith, executed with care, and documented in a way that can be defended in court.

How AI Is Transforming Ediscovery

The most consequential change in ediscovery over the past two decades has been the rise of machine learning and, more recently, large language models. AI has reshaped the economics, the workflow, and the defensibility analysis of the discipline. A practitioner entering the field today will encounter AI at nearly every stage of the workflow, and understanding what AI can and cannot do is now a core competency.

The first wave of ediscovery AI was technology-assisted review (TAR), also known as predictive coding. TAR uses supervised machine learning: human reviewers code a seed set of documents, a classifier learns from their decisions, and the classifier is then used to rank or classify the remaining population. The seminal 2012 decision in Da Silva Moore v. Publicis Groupe established that TAR is an acceptable review methodology under the federal rules, and subsequent case law has reinforced that parties are not required to use the most exhaustive method possible but must use a method reasonably calculated to identify responsive material. TAR significantly reduced review costs in large matters and made predictive coding a standard option in every major ediscovery platform.
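
The seed-set workflow described above can be sketched in a few dozen lines of Python. This toy example uses a naive Bayes classifier with add-one smoothing, chosen only because it fits in a short standard-library example; commercial TAR tools use their own, usually more sophisticated, models. The function names and the four-document seed set are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(seed_set):
    """Learn word counts from (text, is_responsive) pairs coded by reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    if not doc_counts[True] or not doc_counts[False]:
        raise ValueError("seed set needs responsive and non-responsive examples")
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def score(model, text):
    """Return the naive Bayes log-odds that a document is responsive."""
    word_counts, doc_counts, vocab = model
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    log_odds = math.log(doc_counts[True] / doc_counts[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in the seed set carry no evidence
        p_resp = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_non = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_resp / p_non)
    return log_odds


def rank(model, docs):
    """Order unreviewed documents from most to least likely responsive."""
    return sorted(docs, key=lambda d: score(model, d), reverse=True)


seed_set = [
    ("pricing change for q3 customers", True),
    ("discount schedule and pricing approval", True),
    ("team lunch on friday", False),
    ("parking garage closed friday", False),
]
model = train(seed_set)
unreviewed = ["friday lunch menu", "q3 pricing approval draft", "garage door code"]
ranked = rank(model, unreviewed)
print(ranked)
```

The ranked list is what lets reviewers start with the documents most likely to be responsive. In a real matter the seed set would be far larger, and the ranking would be validated by sampling before anyone relied on it.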

The second wave — now underway — is generative AI and large language models. Unlike TAR, which requires per-matter training on a seed set, modern LLMs can perform relevance classification, privilege identification, issue coding, and summarization based on natural-language instructions. A reviewer can literally describe what they are looking for — "find communications discussing the Q3 pricing change that went to customers outside the US" — and the model will classify and rank documents accordingly. The implications for cost and speed are substantial: tasks that previously required weeks of linear review can be completed in hours, and the cost per decision is a fraction of traditional human review.

Generative AI is not limited to classification. It is transforming chronology construction, where models can extract key events from thousands of documents and assemble a timeline in minutes; deposition preparation, where models can surface every document a witness touched and summarize their substantive content; and privilege logging, where models can draft privilege-log descriptions that preserve the privilege claim without revealing the underlying content. DecoverAI's work on commercial-litigation matters illustrates how these capabilities compound: when AI handles the mechanical work of tagging, summarization, and logging, attorneys can focus on the substantive analysis that drives case strategy.

Of course, AI introduces its own defensibility questions. How do you validate that a model's classifications are accurate? How do you demonstrate that the process is reasonable under Sedona Principle 6? How do you handle the risk of hallucination in AI-generated summaries? The emerging best practice is human-in-the-loop validation: AI performs the initial classification, statistical sampling validates accuracy, and human reviewers focus their attention on the close calls and the documents most likely to drive case outcomes. This is not a replacement of human judgment but an amplification of it, and it is consistent with the long-standing principle that ediscovery processes must be reasonable rather than perfect.
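
One common form of the statistical sampling mentioned above is an elusion test: draw a random sample from the documents the model marked non-responsive, have humans review it, and estimate how much responsive material was left behind. The Python sketch below shows the arithmetic under stated assumptions. The Wilson score interval is one reasonable choice of confidence interval, and every number in the example is made up for illustration.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimated_recall(found_responsive, null_set_size, elusion_rate):
    """Recall = found / (found + estimated responsive documents left behind)."""
    missed = elusion_rate * null_set_size
    return found_responsive / (found_responsive + missed)


# Hypothetical matter: 60,000 documents confirmed responsive, 400,000 in the
# null set, and a random sample of 1,500 null-set documents in which human
# reviewers found 12 responsive documents.
sample_size, sample_hits = 1500, 12
elusion = sample_hits / sample_size
low, high = wilson_interval(sample_hits, sample_size)
recall = estimated_recall(60_000, 400_000, elusion)
worst_case_recall = estimated_recall(60_000, 400_000, high)

print(f"Elusion rate: {elusion:.4f} (interval {low:.4f} to {high:.4f})")
print(f"Estimated recall: {recall:.3f}, worst case in interval: {worst_case_recall:.3f}")
```

Reporting the interval alongside the point estimate is what makes the validation defensible: it shows how much uncertainty the sample size leaves, and the sample can be enlarged if that uncertainty is too wide for the matter.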

Getting Started: Practical First Steps

If you are a legal practitioner stepping into an ediscovery matter for the first time, the scope of the undertaking can feel intimidating. The following practical steps will get you oriented and positioned for a defensible workflow. None of them require deep technical expertise; all of them require discipline and attention to documentation.

First, issue a litigation hold promptly and document it. The duty to preserve attaches when litigation is reasonably anticipated, not when it is filed. As soon as you have a credible signal — a demand letter, a regulatory inquiry, internal reports of a significant incident — issue a written hold notice to every custodian and data steward who may have relevant information. The notice should describe the subject matter in plain English, identify the categories of data to preserve, and explicitly instruct recipients to suspend any auto-deletion or disposal activities. Document who received the notice and when, and follow up periodically to reinforce it. A well-executed litigation hold is the single most important defensive measure against Rule 37(e) sanctions.

Second, build a data map. Before you can collect, you have to know what to collect. Sit down with IT and with each custodian and inventory every place their work lives: email, OneDrive, Google Drive, Slack, Teams, SharePoint, Dropbox, Box, Salesforce, Jira, GitHub, personal devices, legacy archives, voicemail systems, and any application-specific stores. For each source, note the responsible administrator, the available collection method, and the approximate data volume. This data map is the foundation for your Rule 26(f) conference, your ESI protocol negotiations, and your collection workflow.
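
A data map does not need special software to start; a structured list is enough. The Python sketch below is one hypothetical way to record the fields mentioned above (administrator, collection method, approximate volume) and to flag sources not yet on hold. The class, field names, and sample entries are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One row of a data map (field names are illustrative, not a standard)."""
    name: str
    administrator: str
    collection_method: str
    approx_gb: float
    custodians: list = field(default_factory=list)
    on_hold: bool = False


def sources_missing_hold(data_map):
    """Return the names of sources where preservation is not yet confirmed."""
    return [source.name for source in data_map if not source.on_hold]


def total_volume_gb(data_map):
    """Sum the approximate volume across all mapped sources."""
    return sum(source.approx_gb for source in data_map)


data_map = [
    DataSource("Email", "IT messaging team", "API export", 120.0, ["alice", "bob"], on_hold=True),
    DataSource("Slack", "Workspace owner", "Vendor export", 35.5, ["alice"], on_hold=False),
    DataSource("Alice's phone", "Alice (BYOD)", "Forensic image", 64.0, ["alice"], on_hold=False),
]

print(sources_missing_hold(data_map))
print(total_volume_gb(data_map))
```

Even a list this simple answers the two questions that come up first at a Rule 26(f) conference: what you have, and whether it is being preserved.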

Third, engage your meet-and-confer obligation seriously. Rule 26(f) is not a box-checking exercise. Use it to surface ESI issues early, to align with opposing counsel on the scope and format of production, and to negotiate a protective order and 502(d) order. The effort you invest in an ESI protocol at this stage is repaid many times over in reduced disputes later. Review the ESI protocol guide for a detailed walkthrough of the key provisions you should negotiate.

Fourth, collect defensibly. Use tools that preserve metadata and maintain chain of custody. Document every collection: what was collected, from which custodian, from which source, on which date, by which person, using which tool. If you use forensic imaging, retain the images. If you use API-based collection, retain the collection logs. If there is ever a question about the authenticity or completeness of your production, this documentation is what saves you.
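
The what, who, where, when, and how of a collection can be captured in a simple log entry. The Python sketch below records those fields together with a SHA-256 hash of the collected item so that integrity can be re-verified later. The field names and the tool name are hypothetical, and forensic tools generate their own, richer logs; this only illustrates the principle.

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_record(data: bytes, custodian, source, collected_by, tool):
    """Build a chain-of-custody entry for one collected item."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "custodian": custodian,
        "source": source,
        "collected_by": collected_by,
        "tool": tool,
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
    }


def verify(data: bytes, record) -> bool:
    """Re-hash the item and confirm it still matches the logged digest."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


item = b"From: alice@example.com\nSubject: Q3 pricing\n\nDraft attached."
record = custody_record(item, "alice", "Email", "J. Examiner", "example-collector 1.0")

print(json.dumps(record, indent=2))
print(verify(item, record))            # unchanged item
print(verify(item + b" ", record))     # altered item
```

If the hash of the item you produce months later still matches the hash logged at collection, you have a concrete answer to any challenge that the data changed in your hands.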

Fifth, think about proportionality at every decision point. The 2015 amendments to Rule 26(b)(1) were a deliberate signal that courts expect counsel to scope discovery reasonably. Before you collect everything from every custodian, ask whether narrower collection would satisfy the proportionality factors. Before you propose a comprehensive manual review, ask whether AI-assisted workflows would be equally defensible at a fraction of the cost. Before you resist a production request, ask whether a negotiated compromise would be more efficient than a motion to compel.

Sixth, invest in tooling that matches the scale of modern data. The era when a firm could manage ediscovery with shared drives and spreadsheets is over. Modern matters require processing capacity, search infrastructure, metadata handling, and AI-assisted review. Choose a platform that provides these capabilities at predictable cost. DecoverAI's flat $60-per-gigabyte pricing is designed precisely for practitioners who need enterprise capabilities without enterprise pricing. Whether you are handling your first matter or your fiftieth, the tools should make defensibility easier, not harder.

Ediscovery rewards preparation, documentation, and rigor. None of its individual steps are intellectually difficult, but all of them must be performed consistently and defensibly across growing volumes of increasingly varied data. Practitioners who internalize the EDRM framework, keep the governing rules in mind at every decision point, and pair disciplined process with modern tooling will find that ediscovery, while never trivial, becomes a manageable and even rewarding part of litigation practice. The cases of the next decade will be won and lost in discovery. It is worth learning to do it well.
