Ediscovery software is the category of legal technology that automates the identification, preservation, collection, processing, review, analysis, and production of electronically stored information (ESI) in litigation, investigations, and regulatory matters. The clearest way to understand what the software does is to map its capabilities against the Electronic Discovery Reference Model (EDRM), the industry-standard framework that breaks discovery into discrete, repeatable stages. Every platform on the market — from enterprise giants to point solutions — ultimately justifies its existence by how well it handles one or more of those stages.
The EDRM begins with information governance and flows through identification, preservation, collection, processing, review, analysis, production, and presentation. Traditional ediscovery software concentrates on the middle of that workflow — processing through production — because that is where the volume, cost, and legal risk concentrate. A modern platform is expected to ingest data in its native format, normalize it into a reviewable state, surface the documents that matter to a legal theory, let attorneys tag and redact at scale, and then assemble a defensible production that satisfies the governing ESI protocol and Federal Rule of Civil Procedure 34.
What distinguishes ediscovery software from general-purpose document management or enterprise search is defensibility. Every transformation the software applies — hash calculation, text extraction, de-duplication, email threading, predictive coding — must be reproducible and auditable, because opposing counsel and courts can and do challenge the process. The software is not just a productivity tool; it is the system of record for a regulated workflow. That single constraint explains most of the architectural choices, pricing quirks, and vendor behaviors that newcomers find baffling about the market.
Practitioners sometimes use "ediscovery software" and "review platform" interchangeably, but they are not the same thing. A review platform is a subset — the user interface and coding tools attorneys touch when they are looking at documents. A complete ediscovery platform also includes the processing engine that ingests data before it reaches review, the analytics layer that clusters and prioritizes it, and the production engine that outputs load files compliant with Concordance, Relativity, or Opticon specifications. When you evaluate vendors, be precise about which of these layers they actually own versus stitch together from third-party components.
A modern ediscovery platform is expected to handle a long list of technical functions, and a buyer should be suspicious of any vendor that cannot demonstrate each of them end to end. Processing is the foundation: the ingestion of raw custodial data (PSTs, NSFs, OST files, loose files, mobile extractions, Slack and Teams exports, Google Workspace Takeout exports) and its conversion into a normalized, indexed form. Processing must extract text, capture metadata, expand container files, decrypt password-protected documents where lawful, handle embedded objects, and flag exceptions so nothing silently disappears.
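As a concrete, drastically simplified illustration, here is a minimal Python sketch of the bookkeeping a processing stage owes you: hash every file on intake, attempt extraction, and record failures as exceptions rather than dropping them. The `extract_text` function here is a stand-in for a real extraction engine, not any vendor's actual implementation.

```python
# Minimal sketch of processing-stage bookkeeping: hash on intake, extract, flag exceptions.
import hashlib
from pathlib import Path

def sha1_of(path: Path) -> str:
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def extract_text(path: Path) -> str:
    # Stand-in for a real extraction engine (Office, PDF, email, container formats).
    return path.read_text(encoding="utf-8")

def ingest(root: Path):
    processed, exceptions = [], []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        record = {"path": str(path), "sha1": sha1_of(path), "size": path.stat().st_size}
        try:
            record["text"] = extract_text(path)
            processed.append(record)
        except Exception as exc:  # flag the failure; never let the file silently disappear
            record["error"] = repr(exc)
            exceptions.append(record)
    return processed, exceptions
```

The point is the exception list: a defensible workflow reports every file it could not handle rather than quietly skipping it.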
De-duplication and email threading are the first cost-reduction levers the software applies. Global (cross-custodian) and custodian-level dedup using MD5 or SHA-1 hashes can eliminate 30 to 60 percent of a collection before review begins. Email threading identifies the most-inclusive message in a conversation so reviewers read each unique exchange once rather than wading through every forward and reply. Near-duplicate detection extends the same idea to non-identical but textually similar documents, grouping them so batched coding decisions can flow across the set.
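The scope of de-duplication matters as much as the hash algorithm. A minimal sketch, assuming hashes were already computed at processing time, of how global versus custodian-level dedup changes what survives into review:

```python
# Global vs. custodian-level dedup over precomputed MD5 hashes (illustrative only).
docs = [
    {"id": 1, "custodian": "smith", "md5": "a1"},
    {"id": 2, "custodian": "jones", "md5": "a1"},   # cross-custodian duplicate
    {"id": 3, "custodian": "jones", "md5": "b2"},
]

def dedup(docs, scope="global"):
    seen, keep = set(), []
    for d in docs:
        key = d["md5"] if scope == "global" else (d["custodian"], d["md5"])
        if key not in seen:
            seen.add(key)
            keep.append(d)
    return keep

print(len(dedup(docs, "global")))      # 2 documents survive
print(len(dedup(docs, "custodian")))   # 3 documents survive
```

Global dedup removes more volume; custodian-level dedup preserves the fact that each custodian held a copy, which can matter to the legal theory.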
Optical character recognition (OCR) is essential for scanned PDFs, image-only emails, and the mountain of TIFFs that show up in most productions. OCR quality has historically been one of the most variable features between platforms, and it meaningfully affects search recall. Alongside OCR, a credible platform offers full-text and structured search with Boolean, proximity, fuzzy, wildcard, and field-level operators; saved searches; and the ability to run complex queries across tens of millions of documents without timing out.
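For readers who have never seen one, a proximity operator is nothing exotic. This toy sketch shows the semantics of a query like `recall w/5 delay`; a real platform evaluates the same logic against a positional index rather than scanning text document by document.

```python
# Toy proximity operator: are the two terms within n tokens of each other?
import re

def within(text: str, a: str, b: str, n: int) -> bool:
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

print(within("We chose to delay the product recall until June.", "recall", "delay", 5))  # True
```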
Review workflows require technology-assisted review (TAR) in two flavors: the older TAR 1.0 (simple active learning against a training set) and the now-standard TAR 2.0 / Continuous Active Learning (CAL), where the model retrains continuously as reviewers code. On top of that sits the reviewer UI itself, redaction (including automated pattern-based redaction for SSNs, account numbers, and other PII), privilege logging, and the production engine that stamps Bates numbers, burns redactions, renders TIFFs or PDFs, and writes load files that will actually load on the receiving side without remediation. If any of these is missing, weak, or outsourced, you do not have a complete platform.
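To make the CAL idea concrete, here is a minimal sketch of one round of continuous active learning, assuming scikit-learn and TF-IDF features rather than any particular vendor's model. A real platform adds richness estimates, stopping criteria, and an audit log, but the loop is the same: train on what reviewers have coded so far, serve the highest-scoring uncoded documents next, and repeat.

```python
# One round of a continuous active learning (CAL) loop; illustrative, not a vendor's implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_round(texts, coded_idx, labels, batch_size=50):
    """Train on coded docs, return the highest-scoring uncoded docs for reviewers to code next."""
    X = TfidfVectorizer().fit_transform(texts)               # refit each round for simplicity
    model = LogisticRegression(max_iter=1000).fit(X[coded_idx], labels)  # labels must include both classes
    uncoded = [i for i in range(len(texts)) if i not in set(coded_idx)]
    scores = model.predict_proba(X[uncoded])[:, 1]            # P(relevant) for each uncoded doc
    ranked = [uncoded[i] for i in np.argsort(scores)[::-1]]
    return ranked[:batch_size]
```

Each batch of reviewer decisions grows `coded_idx` and `labels`, the model retrains, and the richest remaining documents keep surfacing first until the review hits its stopping point.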
Despite the crowded vendor landscape, every ediscovery tool on the market falls into one of four broad categories, and understanding the category tells you most of what you need to know about the product's strengths, weaknesses, and pricing posture. The first category is legacy enterprise platforms — Relativity Server, Nuix, OpenText Axcelerate, IPRO. These were built in the 2000s for on-premises deployment by large firms and service providers. They are powerful, highly customizable, and deeply entrenched in the workflows of Am Law 100 firms. They are also complex, require specialist administrators, and carry licensing structures that assume you have a dedicated litigation support team.
The second category is modern cloud platforms — RelativityOne, Everlaw, DISCO, Reveal, Nextpoint, Logikcull. These were architected for multi-tenant SaaS delivery, prioritize usability, and target the "self-service" or "lightly supported" buyer. They reduce the administrative overhead that legacy tools demand and update continuously rather than on annual release cycles. The trade-off is less customization, heavier reliance on the vendor for any non-standard workflow, and pricing models that often punish you as data volume grows.
The third category is AI-native platforms, a wave that accelerated sharply after the emergence of capable large language models in 2023 and 2024. These platforms — DecoverAI among them — treat generative AI as the primary engine of review rather than a bolt-on feature. Instead of training a classifier on 2,000 manually coded seed documents, they use LLMs to evaluate documents against a natural-language description of the legal theory, producing first-pass relevance, issue tagging, privilege scoring, and chronology building directly. The economic argument is straightforward: if the model can do the work of a contract reviewer at a fraction of the cost and in a fraction of the time, the fully loaded cost of discovery collapses.
The fourth category is point solutions — specialized tools that handle one narrow slice of the workflow exceptionally well. Examples include collection tools (X1, Exterro FTK, Cellebrite), processing-only engines (LAW PreDiscovery, Venio), analytics layers (Brainspace, which was absorbed by Reveal), and transcript or deposition tools. Point solutions are indispensable when your end-to-end platform is weak in a specific area, but they add integration work and create hand-off points where data and metadata can be corrupted. Knowing which category each vendor in your shortlist belongs to is the single most useful filter you can apply early in an evaluation.
Ediscovery pricing is notoriously opaque, and the opacity is not accidental. Most legacy and modern cloud platforms combine at least four price components: per-gigabyte processing (charged on ingested volume), per-gigabyte hosting (charged monthly on stored volume), per-user review seats (often billed monthly or annually whether used or not), and professional services for project management, custom workflows, and anything the platform cannot self-serve. Layered on top are surcharges for premium features: analytics, TAR, advanced production formats, foreign-language OCR, translation, and audio or video processing.
The problem with this structure is that it penalizes exactly the workflow patterns you want to encourage. Over-collecting to be safe — a defensible and common practice — gets taxed at processing time. Keeping data available through trial in case you need to re-search it incurs hosting fees that compound month after month. Bringing more reviewers in on a surge requires negotiating seat adds with the vendor's sales team. Every decision that protects the matter costs more, which is why the hidden cost of document review so often exceeds the original quote by a multiple.
Processing and hosting rates at the enterprise tier have historically ranged from $75 to $300 per gigabyte per month when all components are totaled, with wide variance based on negotiating leverage, commit volume, and contract length. Per-user seat fees at review-platform pricing can run $100 to $1,200 per month per reviewer. In a matter with 500 GB of data and a 30-person review team running for three months, the platform bill alone can exceed $300,000 before anyone has coded a document. This is the math that has driven in-house legal departments and mid-market firms to demand flatter, simpler pricing models.
A flat per-gigabyte model that bundles processing, hosting, and unlimited users — the approach DecoverAI takes at $60 per GB — is the market's response to this problem. It trades the vendor's ability to meter every feature for predictability and alignment: the buyer knows the cost before the matter begins, and the vendor has no incentive to slow-walk deletions or discourage additional reviewers. When you evaluate platforms, insist on a total cost of ownership comparison that includes every surcharge and every seat, not just the headline processing rate.
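The arithmetic is easy to run yourself. Using the mid-points of the ranges above, the hypothetical 500 GB matter with 30 reviewers over three months looks like this; the rates are the article's illustrative figures, not quotes from any specific vendor, and the flat model is treated as a one-time all-in charge.

```python
# Worked version of the cost math above (illustrative rates, not vendor quotes).
gb        = 500    # collection size after processing
months    = 3      # review through production
reviewers = 30

enterprise_per_gb_month = 200   # mid-range of the $75-$300 blended rate
seat_per_month          = 500   # mid-range of the $100-$1,200 seat fee

enterprise = gb * enterprise_per_gb_month * months + reviewers * seat_per_month * months
flat       = gb * 60            # flat per-GB, all-in, unlimited users

print(f"enterprise estimate: ${enterprise:,}")   # $345,000
print(f"flat per-GB model:   ${flat:,}")         # $30,000
```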
Once you have narrowed the field to a few vendors, the real evaluation turns on five criteria that separate a platform that will hold up in a complex matter from one that will become the problem. The first is defensibility. Can the vendor produce a written, auditable record of every processing decision, every TAR training round, every production specification, and every redaction? Has the processing engine been tested against the standard corpora the ediscovery community uses to benchmark exception handling? Will the vendor stand behind the platform if a production is challenged in court? These are not theoretical questions — they become existential the first time an opponent moves to compel or a court appoints a special master.
The second criterion is scale. How does the platform behave at 1 TB, 10 TB, 100 TB? Processing throughput, search latency, and production-build times degrade nonlinearly on most platforms, and the breaking point often shows up only after you are committed. Ask for customer references at the volume you expect to handle, not at average volume. Review case studies that explicitly cite terabyte-scale handling with document counts, not just storage footprints.
The third criterion is security and compliance. At a minimum, any credible vendor should carry SOC 2 Type II certification with no significant exceptions; HIPAA compliance if you touch protected health information; and clear documentation of encryption at rest and in transit, tenant isolation, key management, data residency options, and incident response. For government and defense-adjacent work, add FedRAMP or CJIS requirements. Review the vendor's security posture before you review the feature list — a platform with great features and weak security is disqualifying in 2026.
The fourth criterion is integration. How does data get into and out of the platform? Does it natively ingest from Microsoft 365, Google Workspace, Slack, Teams, Box, Dropbox, and the cloud collection tools your firm already uses? Does it export cleanly to the file formats downstream tools expect? Does it expose an API for automation? The fifth criterion is AI capability, and it has become the single fastest-moving dimension of the evaluation. Beyond TAR 2.0, does the platform offer LLM-powered relevance classification, issue tagging, chronology construction, privilege detection, summarization, and hallucination-resistant Q&A against the case record? These are no longer aspirational features; they are the difference between a one-week review and a one-day review.
The same mistakes recur in ediscovery software evaluations, and they are expensive. The first is buying based on the demo rather than the data. Every vendor looks impressive when running a 10,000-document sandbox with cherry-picked sample content. The platform's real behavior emerges only under the messy conditions of actual production data — corrupted PSTs, multilingual content, encrypted files, 2 GB PowerPoints, Slack exports with missing permissions, mobile extractions with deleted-but-recoverable fragments. Insist on a paid pilot with your own data before signing a multi-year agreement.
The second pitfall is underestimating migration cost and risk. Ediscovery data is sticky. Moving an active matter between platforms means re-processing (which can alter hash values and break chain of custody arguments), re-tagging (which requires reconciling tag schemas that rarely map one-to-one), and re-validating productions. The platform migration guide walks through the specific failure modes, but the short version is: assume migration will cost at least as much as a year of the incumbent platform, and plan matters on natural transition boundaries whenever possible.
The third pitfall is ignoring the total cost of ownership. The headline price almost never reflects the full bill. Buyers routinely overlook analytics add-ons, TAR licensing, premium production formats, foreign-language support, image redaction tools, user training, certification fees, and the professional services hours the vendor will insist are required to "get the most out of the platform." The hidden cost of document review accumulates in exactly these line items. A rigorous TCO model is the only way to compare apples to apples across vendors.
The fourth pitfall is over-indexing on market share. Buying the market leader feels defensible because "nobody ever got fired for buying Relativity," but that logic collapses in matters where the market leader is demonstrably slower, more expensive, or less capable on the specific workload at hand. The fifth pitfall is treating AI features as marketing rather than infrastructure. A platform whose "AI" is a bolt-on workflow added in 2024 to check a box will not deliver the economics or accuracy of a platform architected around LLMs from the start. Ask how the AI is trained, how it is evaluated, how hallucinations are detected, and how the model decisions are logged. If the vendor cannot answer in technical detail, the feature is theater.
The most consequential change in ediscovery software in twenty years is happening right now, and it is the transition from classifier-based machine learning to LLM-native workflows. TAR 2.0, which dominated the 2015–2023 era, required a human to code seed documents, train a logistic regression or support vector machine classifier, and then iteratively re-train as new documents were coded. It worked, it was defensible, and it was slow. Setting up a TAR project meant days of protocol negotiation and weeks of reviewer seeding before you could trust the cutoffs.
An AI-native platform collapses that workflow. Instead of seed sets and classifiers, the system accepts a natural-language description of the legal theory — "documents discussing the decision to delay the product recall between March and June 2023" — and scores every document in the collection against it using an LLM that has already read the entire corpus. The result is first-pass relevance that is competitive with, and often better than, TAR 2.0 on the same data, delivered in hours instead of weeks. Issue tagging, privilege scoring, and chronology construction follow the same pattern: describe what you want, and the model produces it with citations back to the source documents. DecoverAI's relevance detection is built this way from the ground up.
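Stripped to its essentials, the workflow looks like the sketch below. `ask_model` is a placeholder for whichever LLM endpoint a platform uses; a production system also batches calls, logs every prompt and response for the audit trail, and validates outputs against human-coded samples.

```python
# Minimal sketch of LLM first-pass relevance scoring against a natural-language issue statement.
import json

ISSUE = ("Documents discussing the decision to delay the product recall "
         "between March and June 2023.")

PROMPT = """You are assisting with document review.
Issue: {issue}
Document: {doc}
Return JSON: {{"relevant": true or false, "rationale": "one sentence citing the document"}}"""

def score_document(doc_text: str, ask_model) -> dict:
    # ask_model is a hypothetical callable: prompt string in, model response string out.
    raw = ask_model(PROMPT.format(issue=ISSUE, doc=doc_text[:8000]))  # truncate very long documents
    return json.loads(raw)
```

The rationale field matters as much as the boolean: every call the model makes should be traceable back to language in the document itself.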
Defensibility concerns about LLMs are real and worth taking seriously, but they are tractable. The Sedona Conference and EDRM have published guidance on evaluating generative AI in discovery, and the emerging best practice is familiar to anyone who has defended a TAR 2.0 protocol: document the workflow, measure recall and precision against a statistically valid validation sample, disclose the methodology in the ESI protocol, and produce the audit trail on request. Courts have so far been receptive to AI-assisted review where the process is transparent and measurable, which mirrors how they came to accept predictive coding a decade ago.
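The measurement itself is ordinary statistics. A minimal sketch of the validation step, assuming `model_pred` and `human_label` are callables returning the machine and human relevance calls for a document:

```python
# Validation sketch: sample the collection, have humans code it, measure recall and precision.
import random

def validate(docs, model_pred, human_label, sample_size=500, seed=42):
    random.seed(seed)
    sample = random.sample(docs, sample_size)
    tp = sum(1 for d in sample if model_pred(d) and human_label(d))
    fp = sum(1 for d in sample if model_pred(d) and not human_label(d))
    fn = sum(1 for d in sample if not model_pred(d) and human_label(d))
    recall    = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

The sample must be large enough to support a defensible confidence interval on recall, which is exactly the conversation the ESI protocol should document up front.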
The economic implication of the AI-native shift is enormous. Document review, much of it staffed by contract attorneys, has historically been the single largest line item in a litigation budget — frequently 70 percent or more of total discovery spend. When an LLM can perform first-pass relevance review at a marginal cost that is two orders of magnitude lower than a contract attorney, the entire cost structure of discovery changes. Matters that were economically infeasible to litigate become viable. Review teams shrink and reorient around quality control and privilege calls rather than first-pass triage. Expect the platforms that fail to make this transition to lose ground rapidly over the next two to three years.
DecoverAI is an AI-native ediscovery platform built for legal teams that have outgrown the economics and workflows of legacy and first-generation cloud tools. The platform handles the full EDRM mid-section — processing, analytics, review, production — with LLM-powered relevance detection, confidentiality analysis, privilege log generation, and automated redaction built in rather than bolted on. It is designed to be run by the attorneys handling the matter, without a dedicated litigation support team, and to scale from single-custodian investigations to multi-terabyte bet-the-company litigation.
Pricing is flat: $60 per gigabyte, all-in, with no per-user seat fees, no analytics surcharges, no TAR licensing, and no premium production format upsells. That single change to the pricing model typically reduces total platform spend by 50 to 80 percent compared to the enterprise incumbents, which is why buyers comparing DecoverAI against Relativity, Everlaw, DISCO, and Logikcull generally focus on feature parity and defensibility rather than price.
On the security and compliance front, DecoverAI is SOC 2 Type II certified and HIPAA compliant, with tenant isolation, encryption at rest and in transit, and a public trust center. The platform has been used on federal productions, commercial litigation, regulatory investigations, and matters requiring protected health information handling. Buyers evaluating the platform against the five criteria above — defensibility, scale, security, integration, AI capability — should start with a paid pilot on their own data, exactly as this post recommends for every vendor evaluation. The right ediscovery software is the one that holds up under your data, your deadlines, and your budget, not the one with the best demo.