Free Guide — 2026 Edition

The Complete Guide to eDiscovery

Everything you need to run eDiscovery from legal hold to court-ready production — with practical checklists, cost benchmarks, and the strategies used by top litigation teams.

What Is eDiscovery?

Electronic discovery (eDiscovery) is the process of identifying, collecting, reviewing, and producing electronically stored information (ESI) in connection with litigation, regulatory investigations, or internal inquiries. It is a core obligation in virtually every civil lawsuit and many criminal and regulatory matters in the United States.

The scope of ESI is broad: emails, documents, spreadsheets, presentations, text messages, Slack and Teams conversations, voicemails, social media posts, database records, and any other information stored in digital form. The average Fortune 500 company generates 2.5 billion emails per year. When litigation or an investigation arises, the legal team must determine which of these electronic records are relevant, review them for responsiveness and privilege, and produce them to the opposing party or regulator — all within court-imposed deadlines and under the threat of sanctions for non-compliance.

The stakes are significant. Federal Rule of Civil Procedure 37(e) authorizes courts to impose sanctions — including adverse inference instructions and default judgments — for the failure to preserve or produce ESI. In practice, eDiscovery costs typically represent 60-80% of total litigation costs, and production errors can lead to motions to compel, fee-shifting, and waiver of privilege.

Why This Guide Exists

Most eDiscovery guides are written by vendors selling platforms. This guide is written for the practitioner who needs to understand the full workflow, make informed decisions about technology and process, and avoid the specific pitfalls that lead to sanctions, cost overruns, and lost cases. We include cost benchmarks and practical checklists at every stage.

The EDRM Framework

The Electronic Discovery Reference Model (EDRM) is the standard framework for understanding the eDiscovery workflow. It describes nine stages, from the trigger event through presentation at trial. In practice, most litigation teams focus on seven core stages: legal hold, identification, collection, processing, review, analysis, and production.

These stages are not strictly sequential. Review findings often trigger additional collection. Analysis may reveal the need for broader searches. Production may need to be re-run after QC failures. The EDRM is best understood as a reference architecture, not a waterfall process. Each stage feeds back into the others, and effective eDiscovery requires the ability to iterate quickly as the case develops.

The sections that follow walk through each stage in detail, with specific guidance on what to do, what to avoid, and how to control costs at every step.

1. Legal Hold

A legal hold (also called a litigation hold or preservation notice) is a directive issued to custodians and IT departments requiring them to preserve all potentially relevant ESI. The duty to preserve is triggered when litigation is reasonably anticipated — not when the complaint is filed. This means the hold obligation can arise weeks or months before any lawsuit is actually commenced.

The consequences of failing to issue or enforce a legal hold are severe. In Zubulake v. UBS Warburg, the court imposed an adverse inference instruction after the defendant failed to preserve emails despite a clear litigation hold obligation. More recently, courts have imposed monetary sanctions in the six- and seven-figure range for preservation failures, even where the failure was negligent rather than intentional.

What a proper legal hold requires:

  • A written hold notice to every relevant custodian, describing the matter and the scope of the preservation obligation
  • Coordination with IT to suspend auto-deletion and routine destruction policies for affected systems
  • Tracking of custodian acknowledgments, with follow-up for non-responders
  • Periodic reminders for as long as the hold remains in effect
  • Documentation of every preservation step taken

Common Mistake

Issuing a legal hold notice but failing to follow up. Courts have held that a legal hold is ineffective if the issuing party does not take reasonable steps to verify compliance. A hold notice sitting unread in an employee's inbox provides no preservation.

2. Identification

Identification is the process of determining which custodians, systems, and data sources contain potentially relevant ESI. This stage sets the scope for everything that follows: if you identify too narrowly, you risk missing relevant documents and facing sanctions for inadequate search. If you identify too broadly, you incur unnecessary costs in collection, processing, and review.

Start with custodians. Work with the case team to identify every individual who may have created, received, or stored relevant documents. This typically includes the named parties, their direct reports, key decision-makers, and anyone involved in the events at issue. For each custodian, map their data sources: email (which platform?), local files, shared drives, cloud storage (OneDrive, Google Drive, Dropbox), messaging platforms (Slack, Teams, Signal), mobile devices, and any enterprise applications (CRM, ERP, project management tools) they use.

Don't forget non-custodial sources. Shared mailboxes, distribution lists, SharePoint sites, shared drives, and database systems often contain critical documents that are not attributable to any single custodian. Identify these sources early, because they often require different collection methods and may contain unique documents not found in individual custodian collections.

Document your identification decisions. The meet-and-confer process under Federal Rule of Civil Procedure 26(f) requires parties to discuss preservation and discovery issues, including data sources. Being able to articulate why you included or excluded specific custodians and data sources is essential if your search scope is later challenged.

3. Collection

Collection is the process of extracting ESI from identified sources in a forensically defensible manner. The key principle is that collection must preserve the integrity and metadata of the original documents. Metadata — dates, authors, recipients, file properties — is often as important as the document content itself, and collection methods that alter or strip metadata can compromise the entire production.

Defensible collection requires:

  • Forensically sound methods that do not alter source data or metadata
  • Hash values calculated and recorded at the point of collection
  • Complete chain of custody documentation from source to review platform
  • Preservation of all metadata (dates, authors, recipients, file properties)
  • A log of any collection errors or inaccessible files

For modern messaging platforms like Slack and Microsoft Teams, collection presents unique challenges. Messages are stored in cloud environments controlled by the platform provider, conversation threads may span months or years, and attachments may be stored separately from the messages that reference them. See our detailed guide on collecting Slack and Teams data for specific strategies.

For mobile devices, collection typically requires either a mobile forensics tool (Cellebrite, GrayKey) for a full device image, or targeted collection of specific applications. The choice depends on the scope of the discovery obligation and the sensitivity of the device contents. Our mobile data guide covers the decision framework.
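
Chain of custody and hash verification are standard elements of defensible collection. The sketch below shows, in minimal Python, what recording a chain-of-custody manifest entry at the point of collection could look like. The function name, field names, and manifest format are illustrative assumptions; real collections use dedicated forensic tooling rather than ad hoc scripts.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def record_custody_entry(path: Path, custodian: str, manifest: list) -> dict:
    """Hash a file at the point of collection and append a chain-of-custody
    entry to the manifest. Field names here are illustrative, not a standard."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = {
        "file": str(path),
        "custodian": custodian,
        "sha256": digest,  # recorded now so integrity can be re-verified later
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest.append(entry)
    return entry
```

Re-computing the hash at any later stage and comparing it against the manifest entry demonstrates that the file has not been altered since collection.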

4. Processing

Processing transforms raw collected data into a format suitable for review. This includes extracting text and metadata from files, expanding container formats (ZIP, PST, OST, NSF), converting files to reviewable formats (TIFF, PDF), running OCR on image-only documents, and de-duplicating across custodians and data sources.

De-duplication is one of the most impactful processing steps. In a typical multi-custodian collection, 30-60% of documents are duplicates — the same email received by multiple custodians, the same document stored in multiple locations. De-duplication reduces the review population proportionally, directly reducing the cost and time required for document review. The standard approach is global de-duplication by MD5 hash, which removes exact duplicates across all custodians while preserving unique instances.
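
The global de-duplication described above can be sketched in a few lines of Python. This is a simplified illustration (commercial processing tools also normalize email metadata and keep family groups intact), but it shows the core hash-and-keep-first logic:

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Compute the MD5 hash of a file, reading in chunks to bound memory use."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def global_dedupe(paths):
    """Keep the first instance of each unique hash across all custodians;
    record each duplicate alongside the instance it duplicates."""
    seen = {}
    unique, duplicates = [], []
    for p in paths:
        digest = file_md5(p)
        if digest in seen:
            duplicates.append((p, seen[digest]))
        else:
            seen[digest] = p
            unique.append(p)
    return unique, duplicates
```

The duplicate log matters as much as the deduped set: it lets you show, per custodian, which instances were suppressed and why.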

Date and keyword filtering during processing can further reduce the review population. Applying date ranges that correspond to the relevant time period and excluding file types that are categorically non-responsive (system files, executables, font files) can eliminate 20-40% of the collection before review begins. However, filtering decisions should be documented and defensible — overly aggressive filtering can lead to allegations of inadequate search.
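
A minimal sketch of such a culling pass, assuming a simple record schema (`path`, `ext`, `date`) and an illustrative exclusion list (neither is a standard): the point is that every exclusion is logged with its reason, keeping the filtering documented and defensible.

```python
from datetime import date

# Illustrative file-type exclusions; real matters negotiate their own list.
EXCLUDED_EXTENSIONS = {".exe", ".dll", ".sys", ".ttf", ".fon"}

def cull(records, start: date, end: date):
    """Apply a date window and file-type exclusions. Returns the kept
    records plus a log of every exclusion and the reason for it."""
    kept, excluded = [], []
    for rec in records:
        if rec["ext"].lower() in EXCLUDED_EXTENSIONS:
            excluded.append((rec, "file-type exclusion"))
        elif not (start <= rec["date"] <= end):
            excluded.append((rec, "outside date range"))
        else:
            kept.append(rec)
    return kept, excluded
```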

Processing Benchmark

A well-processed dataset typically reduces the review population by 40-70% compared to the raw collection through a combination of de-duplication, date filtering, and file-type exclusions. For a 100GB collection, this can mean the difference between reviewing 500,000 documents and reviewing 175,000 documents.

5. Document Review

Document review is where the legal team examines each document for responsiveness (is it relevant to the discovery request?), privilege (is it protected by attorney-client privilege or work product doctrine?), and confidentiality (does it contain trade secrets, PII, or other sensitive information requiring protection?). Review is traditionally the most expensive phase of eDiscovery, typically accounting for 70-80% of total eDiscovery costs.

Traditional managed review involves teams of contract attorneys reviewing documents one at a time, coding each document for responsiveness, privilege, and other categories. Review rates vary, but a typical contract reviewer processes 50-75 documents per hour at a cost of $25-45 per hour, resulting in an all-in cost of $0.50-$1.50 per document including supervision and quality control.

Technology-assisted review (TAR), also called predictive coding, uses machine learning to prioritize and classify documents based on a set of training documents coded by senior attorneys. TAR 2.0 (continuous active learning) has been widely accepted by courts since Rio Tinto v. Vale (2015) and can reduce the number of documents requiring human review by 60-80%. However, TAR requires careful protocol development, seed set selection, and validation — and the training process itself can take days or weeks.

AI-powered review represents the next generation. Unlike TAR, which requires extensive training on each new matter, modern AI platforms can classify documents using natural language understanding without matter-specific training data. This dramatically reduces the time from data ingestion to review-ready classification. The cost model also shifts: rather than paying per-reviewer-hour, teams pay per-document or per-GB, typically at $0.05-$0.15 per document — a 90%+ reduction compared to managed review.

Metric                     Managed Review                TAR 2.0                 AI-Powered Review
Cost per Document          $0.50–$1.50                   $0.25–$0.75             $0.05–$0.15
Setup Time                 1–2 weeks                     1–3 weeks (training)    Hours
Review Speed (100K docs)   4–8 weeks                     2–4 weeks               Days
Consistency                Variable (reviewer fatigue)   Good (model-based)      High (deterministic)
Court Acceptance           Established                   Widely accepted         Growing (defensible with validation)

Regardless of the review method, quality control is non-negotiable. Sample-based QC (reviewing a random sample of coded documents to measure error rates), inter-reviewer agreement analysis, and senior attorney spot-checks should be built into every review workflow. Courts expect that producing parties can demonstrate the reliability and consistency of their review process. For detailed QC procedures, see our Production QC Checklist.
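
Sample-based QC can be sketched as below, with a hypothetical `recheck` callable standing in for the senior reviewer's second-pass decision on each sampled document. A fixed random seed keeps the sample reproducible, which helps when documenting the QC methodology.

```python
import random

def qc_sample_error_rate(coded_docs, recheck, sample_size, seed=0):
    """Draw a reproducible random QC sample and measure the observed
    coding error rate against a second-pass decision function."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    sample = rng.sample(coded_docs, min(sample_size, len(coded_docs)))
    errors = sum(1 for doc in sample if recheck(doc) != doc["code"])
    return errors / len(sample), sample
```

An observed error rate above the threshold set in the review protocol would trigger retraining or re-review of the affected batch.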

6. Analysis

Analysis goes beyond individual document review to identify patterns, relationships, and strategic insights across the document population. While review asks "is this document responsive?", analysis asks "what does this collection of documents tell us about the case?"

Key analysis workflows include:

  • Timeline construction across documents, communications, and events
  • Communication mapping to show who discussed what, with whom, and when
  • Concept clustering and near-duplicate grouping to surface related documents
  • Cross-referencing documents against pleadings, contracts, and deposition testimony
  • Gap analysis to flag missing custodians, date ranges, or data sources

Effective analysis can transform the strategic position of a case. In one commercial litigation matter, document analysis surfaced contractual clauses and deposition inconsistencies that directly contributed to a $15.4M jury verdict. In a construction defect case, automated cross-referencing of engineering reports and contractor communications identified systemic defect patterns across multiple buildings that would have taken months to uncover manually.

7. Production

Production is the final stage: assembling the reviewed documents into a package that meets the format and content requirements agreed upon in the ESI protocol or ordered by the court. A production typically includes the documents themselves (in native, image, or both formats), a load file containing metadata, Bates numbering, and a privilege log listing all documents withheld on privilege grounds.

The ESI protocol governs the production format. It should be negotiated early in the case and should specify: file formats (native, TIFF, PDF), metadata fields to be produced, Bates numbering conventions, redaction requirements, confidentiality designations, and delivery method. Getting the ESI protocol right upfront prevents costly disputes and re-productions later. See our ESI protocol guide for negotiation strategies.

Bates numbering provides a unique identifier for every page in the production. Numbers must be sequential with no gaps or duplicates, and family groups (parent emails and their attachments) should be numbered consecutively. See our QC checklist for detailed verification steps.
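
The sequential-numbering check can be automated. This Python sketch assumes a simple prefix-plus-digits label format (real productions may add delimiters or volume prefixes) and returns a list of problems found, so an empty list means the sequence passes:

```python
import re

# Assumed label format, e.g. "ABC000123"; adjust the pattern per ESI protocol.
BATES_PATTERN = re.compile(r"^(?P<prefix>[A-Z]+)(?P<number>\d+)$")

def verify_bates_sequence(labels):
    """Check that Bates labels share one prefix and run sequentially
    with no gaps or duplicates. Returns a list of problems (empty = pass)."""
    problems, parsed = [], []
    for label in labels:
        m = BATES_PATTERN.match(label)
        if not m:
            problems.append(f"malformed label: {label}")
            continue
        parsed.append((m["prefix"], int(m["number"])))
    prefixes = {p for p, _ in parsed}
    if len(prefixes) > 1:
        problems.append(f"mixed prefixes: {sorted(prefixes)}")
    numbers = [n for _, n in parsed]
    if len(numbers) != len(set(numbers)):
        problems.append("duplicate Bates numbers")
    for prev, cur in zip(numbers, numbers[1:]):
        if cur != prev + 1:
            problems.append(f"gap or out-of-order at {prev} -> {cur}")
    return problems
```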

Privilege logs must list every document withheld on privilege grounds, with enough detail to support the claimed privilege without revealing the privileged content. Courts have little patience for generic descriptions ("email re: legal matter") and will order production of documents where the privilege log fails to establish the elements of the privilege. For a comprehensive treatment, see our privilege log guide.

Redactions must be applied at the data layer (not visual-only overlays) and flattened so they cannot be removed. Visual-only redactions — where a black box is placed over text but the underlying content remains selectable — are among the most common and most damaging production errors. See our redaction guide for the proper approach.

Production Benchmark

A well-run production of 30,000 documents with Bates numbering, privilege log, and redactions should be deliverable in 3-5 days using modern tooling, versus 3-4 weeks with traditional methods. See the Tax Credit Investigation case study for real-world benchmarks.

See how DecoverAI handles production end-to-end
From document upload to court-ready output — Bates numbered, redacted, privilege-logged — in under an hour.
Book a Demo →

Cost Benchmarks for eDiscovery

Understanding eDiscovery costs is essential for budgeting, vendor negotiations, and making informed decisions about technology investments. Costs vary significantly based on data volume, complexity, and the approach used. The benchmarks below reflect 2026 market rates across the three primary cost models.

Phase                           Traditional (Law Firm)   ALSP / Managed Service   AI-Powered Platform
Collection & Processing         $500–$2,000 / GB         $150–$500 / GB           $50–$150 / GB
Document Review                 $0.50–$1.50 / doc        $0.25–$0.75 / doc        $0.05–$0.15 / doc
Production                      $100–$300 / GB           $50–$150 / GB            Included
Hosting                         $25–$75 / GB / month     $15–$40 / GB / month     $60 / GB / month (all-in)
Total (10GB matter, 50K docs)   $50K–$100K               $20K–$50K                $3K–$8K

The economics of eDiscovery are changing rapidly. AI-powered platforms have compressed costs by 10-20x compared to traditional approaches, while simultaneously improving speed and consistency. For small and mid-size matters (under 50GB), the cost difference is particularly stark: what used to require a $50,000 budget can now be accomplished for under $5,000.

The most important cost lever is reducing the volume that requires human review. Every document that can be accurately classified by AI is a document that does not require a contract reviewer at $25-45/hour. Processing-stage culling (de-duplication, date filtering) and AI-powered first-pass classification are the two highest-ROI investments in any eDiscovery workflow.

AI in eDiscovery: What Works and What Doesn't

AI has transformed eDiscovery, but the technology landscape is still maturing. Understanding what AI can and cannot do reliably is critical for both efficiency and defensibility.

What AI does well today:

  • First-pass responsiveness classification at scale, applied consistently across the population
  • De-duplication, email threading, and clustering of related documents
  • Flagging documents likely to contain PII or potentially privileged content for human review
  • Prioritizing the review queue so likely responsive documents surface first

Where human judgment is still essential:

  • Final privilege determinations and privilege log descriptions
  • Strategic relevance calls tied to case theory
  • Redaction decisions on sensitive or partially privileged content
  • Validation of AI output and final sign-off before production

The defensibility of AI-assisted review is well-established. Courts have recognized that technology-assisted review can be more accurate than exhaustive manual review, and no court has required a party to use manual review where technology-assisted review was available and properly validated. The key to defensibility is transparency and validation: document your methodology, measure your accuracy, and be prepared to explain your process. For a deeper treatment, see our guide on AI review defensibility.

Choosing an eDiscovery Platform

The eDiscovery platform market ranges from legacy enterprise tools to modern AI-powered platforms. The right choice depends on your matter volume, technical capabilities, budget, and workflow requirements. Here are the factors that matter most:

  • End-to-end coverage: processing, review, analysis, and production in one workflow
  • AI capabilities and the validation tooling needed to keep them defensible
  • Pricing model (per-GB, per-document, or per-user) and how it scales with matter volume
  • Security and compliance posture: encryption, access controls, audit trails
  • Data portability: how easily collections, coding decisions, and work product can be exported

For teams evaluating a platform switch, our platform migration guide covers data portability, format compatibility, and the specific steps to migrate without losing data or work product.

Master eDiscovery Checklist

Use this checklist as a starting framework for every new matter. Not every item will apply to every case, but reviewing the full list ensures nothing critical is missed.

Pre-Litigation / Legal Hold
  • Identify trigger event and date preservation duty arose
  • Issue written legal hold notices to all relevant custodians
  • Coordinate with IT to suspend auto-deletion policies
  • Document all preservation steps and custodian acknowledgments
  • Schedule periodic hold reminders (quarterly)
Identification & Scoping
  • Identify all custodians with potentially relevant ESI
  • Map data sources for each custodian (email, files, cloud, mobile, messaging)
  • Identify non-custodial data sources (shared drives, databases, enterprise apps)
  • Document identification decisions and rationale
  • Prepare for Rule 26(f) meet-and-confer on ESI issues
Collection
  • Use forensically defensible collection methods
  • Calculate and record hash values at point of collection
  • Maintain chain of custody documentation
  • Preserve all metadata (dates, authors, recipients)
  • Log any collection errors or inaccessible files
Processing
  • Expand container files (PST, ZIP, OST, NSF)
  • Run global de-duplication by hash value
  • Apply defensible date range and file-type filters
  • Run OCR on image-only documents
  • Verify processing completion rates and error logs
Review
  • Define coding categories (responsive, non-responsive, privilege, confidential)
  • Establish review protocol and coding manual
  • Implement QC sampling and inter-reviewer agreement checks
  • Conduct privilege review with senior attorney oversight
  • Document review methodology for defensibility
Production
  • Verify ESI protocol compliance (format, fields, naming conventions)
  • Confirm Bates numbering is sequential with no gaps or duplicates
  • Verify all redactions are data-layer and flattened
  • Cross-reference privilege log against withheld documents
  • Validate load file metadata fields and file path references
  • Spot-check image quality and native file integrity
  • Final senior attorney sign-off before release
Ready to modernize your eDiscovery workflow?

See how DecoverAI can cut your review costs by 90% and deliver productions in days, not weeks.

Book a Demo →