DecoverAI - Every File Format DecoverAI Now Ingests

PST, MBOX, PDF, DOCX, TIFF, MP3, WhatsApp exports, Discord JSON, iMessage databases, nested ZIP archives — those are the file types a legal case actually arrives as, and all of them are supported. DecoverAI processes each format through a single ingestion pipeline: archives are recursively unpacked, images are OCR'd, audio is transcribed, emails inherit their custodian automatically, and every resulting document emerges Bates-stampable and ready for review — with no format needing a separate vendor or workflow.

If you're wondering whether DecoverAI can ingest a given format, the honest answer is almost certainly yes. But the more interesting answer is what happens between dropping a file and seeing it show up in your workspace: deduplicated, tagged with the right custodian, OCR'd if it needs to be, and ready for review.

This post is the full reference as of April 2026.

Figure: The full ingestion pipeline. Mixed file formats flow through archive extraction, format detection, OCR and transcription, deduplication, and custodian assignment, and emerge as review-ready documents. Every format flows through the same stages.

Archives and containers

Most cases don't start with a loose document. They start with a PST per custodian, a forensic vendor's ZIP of ZIPs, or a TAR archive from an IT department that didn't label anything.

DecoverAI's ingestion pipeline handles all of it — and it's recursive. A PST inside a TAR.GZ inside a ZIP unwinds down to individual messages without anyone stitching things together manually.

Supported formats: .pst, .mbox, .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz
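To make the recursion concrete, here is a minimal sketch of unwinding nested ZIP and TAR containers down to their leaf files. It is an illustration under stated assumptions, not DecoverAI's implementation: the function name iter_leaf_files and the max_depth limit are hypothetical, everything is held in memory for brevity, and PST and MBOX are left out because they need dedicated mailbox parsers.

```python
import io
import tarfile
import zipfile
from typing import Iterator, Tuple

# TAR-family extensions from the supported list above.
TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tbz2", ".tar.xz", ".txz")


def iter_leaf_files(name: str, data: bytes, depth: int = 0,
                    max_depth: int = 10) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, bytes) for every non-archive file, descending into nested archives.

    max_depth is a hypothetical guard against pathological nesting.
    """
    if depth > max_depth:
        raise ValueError(f"archive nesting deeper than {max_depth}: {name}")
    lowered = name.lower()
    if lowered.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from iter_leaf_files(
                        f"{name}/{info.filename}", zf.read(info), depth + 1, max_depth)
    elif lowered.endswith(TAR_SUFFIXES):
        # "r:*" lets tarfile detect gzip, bzip2, or xz compression on its own.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    handle = tf.extractfile(member)
                    yield from iter_leaf_files(
                        f"{name}/{member.name}", handle.read(), depth + 1, max_depth)
    else:
        # Not a container this sketch understands: treat it as a leaf document.
        yield name, data
```

Each leaf keeps its full container path (for example outer.tar.gz/inner.zip/memo.txt), which is one simple way to keep provenance through the unwinding.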

A few things happen automatically during extraction. Junk files like __MACOSX/ folders and .DS_Store are skipped. Symlinks are dropped. Any path that tries to escape the extraction root is rejected. Zip-bomb protection is built in — you don't configure it, and you can't accidentally turn it off.
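The checks described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the product's code: the helper names (is_junk, escapes_root, should_extract, ExtractionBudget) and the specific limits are hypothetical, and a real zip-bomb guard would likely also look at compression ratios.

```python
import posixpath
from pathlib import PurePosixPath


def is_junk(member_path: str) -> bool:
    """True for macOS metadata: anything under __MACOSX/ or a .DS_Store file."""
    parts = PurePosixPath(member_path).parts
    return "__MACOSX" in parts or (len(parts) > 0 and parts[-1] == ".DS_Store")


def escapes_root(member_path: str) -> bool:
    """True if the path would land outside the extraction root."""
    normalized = posixpath.normpath(member_path.replace("\\", "/"))
    if normalized.startswith("/") or normalized[1:2] == ":":
        return True  # absolute path or Windows drive letter
    return normalized == ".." or normalized.startswith("../")


def should_extract(member_path: str, is_link: bool) -> bool:
    """Combine the three filters: links, junk files, and path traversal."""
    return not (is_link or is_junk(member_path) or escapes_root(member_path))


class ExtractionBudget:
    """Hypothetical zip-bomb guard: cap total uncompressed bytes and entry count."""

    def __init__(self, max_total_bytes: int, max_entries: int) -> None:
        self.max_total_bytes = max_total_bytes
        self.max_entries = max_entries
        self.total_bytes = 0
        self.entries = 0

    def charge(self, size: int) -> None:
        """Record one extracted entry; raise once either cap is exceeded."""
        self.total_bytes += size
        self.entries += 1
        if self.total_bytes > self.max_total_bytes or self.entries > self.max_entries:
            raise ValueError("extraction budget exceeded (possible zip bomb)")
```

With tarfile, the is_link flag could come from TarInfo.issym() or TarInfo.islnk(). One budget object would be shared across all nesting levels so that a bomb cannot hide inside an inner archive.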

Email and messages

Supported formats: .eml, .msg

Standard .eml files are handled natively — MIME multipart, inline content, attachments, all of it. Outlook .msg files are converted to PDF for Bates-stampable output.

Messages extracted from a PST or MBOX automatically inherit the custodian of their parent container. No manual mapping required.
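As a rough illustration of that inheritance rule, the sketch below pushes a container's custodian down to every extracted child that does not already have one. The IngestedItem type and its field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IngestedItem:
    """Hypothetical node in the extraction tree: a container or a document."""
    name: str
    custodian: Optional[str] = None
    children: List["IngestedItem"] = field(default_factory=list)


def inherit_custodian(item: IngestedItem, parent_custodian: Optional[str] = None) -> None:
    """Fill in a missing custodian from the parent, then recurse into children."""
    if item.custodian is None:
        item.custodian = parent_custodian
    for child in item.children:
        inherit_custodian(child, item.custodian)


# A PST assigned to one custodian; its messages and attachments start unassigned.
attachment = IngestedItem("budget.xlsx")
message = IngestedItem("RE: Q3 forecast.eml", children=[attachment])
mailbox = IngestedItem("cfo.pst", custodian="Jane Doe", children=[message])
inherit_custodian(mailbox)
```

After the call, both the message and its attachment carry the mailbox's custodian.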

Office documents

Supported formats: .pdf, .docx, .xls, .xlsx, .xlsm, .odt, .rtf

PDFs are Bates-stamped directly — no conversion step. Word documents get real page-break detection, not just a rough word-count split. Excel files (including macro-enabled workbooks) treat each sheet as its own logical page. OpenDocument and RTF files are fully supported as well.

If a Word document has embedded images, those images are automatically checked for text content and OCR'd when needed.

Web content and plain text

Supported formats: .html, .htm, .txt

HTML pages are rendered into a stampable PDF so they can carry Bates numbers like every other production document. Plain text gets the same treatment — laid out cleanly so it participates in the same review workflow as everything else.

Images

Supported formats: .png, .jpg, .jpeg, .gif, .bmp, .webp, .tiff, .tif

Every image is wrapped into a single-page PDF so it can be Bates-stamped and reviewed alongside your documents. Multi-page TIFFs are converted into a single PDF automatically.

When OCR is enabled for a case, these same files also pass through vision-based text extraction — so handwritten notes, scanned letters, and photographed whiteboards all become searchable in your review workspace.

Audio

Supported formats: .wav, .mp3, .ogg

Audio files are increasingly common in white-collar investigations, harassment matters, and any case involving voicemails or recorded calls. When transcription is enabled, DecoverAI processes them through Whisper. Long recordings — even a four-hour deposition — are automatically chunked and stitched back together, so there's no file-size ceiling to worry about.
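The chunk-and-stitch idea can be sketched without any speech model at all. The example below only computes chunk windows and shifts each chunk's segment timestamps back onto the full recording's timeline. The 10-minute chunk length, the function names, and the (start, end, text) segment shape are assumptions for illustration; the post does not describe how DecoverAI picks boundaries.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def chunk_windows(duration_s: float, chunk_s: float = 600.0) -> List[Tuple[float, float]]:
    """Split a recording of duration_s seconds into consecutive windows."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def stitch(chunks: List[Tuple[float, List[Segment]]]) -> List[Segment]:
    """Merge per-chunk transcripts, shifting timestamps by each chunk's offset."""
    stitched: List[Segment] = []
    for window_start, segments in sorted(chunks, key=lambda chunk: chunk[0]):
        for seg_start, seg_end, text in segments:
            stitched.append((window_start + seg_start, window_start + seg_end, text))
    return stitched
```

A four-hour recording (14,400 seconds) splits into 24 ten-minute windows under these assumptions. A production version would probably cut on silence or overlap the windows so that words are not split at a boundary.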

Chat exports

This is the format category where most platforms quietly fall over. There's no universal export shape, the per-message metadata doesn't match email, and the media attachments often live in a sibling folder nobody told you about.

DecoverAI normalizes each supported export into individual documents so the rest of the review pipeline — Bates stamping, custodian tracking, privilege review, redaction — treats chat messages exactly like everything else.

Supported sources:

WhatsApp (.txt export) — each message becomes a reviewable document; sibling images and audio are inlined as attachments
Discord (.json export) — per-message documents with custodian preserved
iMessage (chat.db) — participant metadata is preserved as the custodian set
Google Chat (.json export) — channel and DM messages reconstructed as individual documents
Android SMS/MMS (.xml from SMS Backup & Restore) — MMS binary parts like images are attached to the rendered document
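To show what "each message becomes a reviewable document" might involve, here is a small sketch that splits a WhatsApp .txt export into per-message records. It assumes one common line layout ("1/31/24, 9:15 PM - Alice: text"); real exports vary by platform and locale, and this is not a description of DecoverAI's parser. The function and field names are hypothetical.

```python
import re
from typing import Dict, List

# Assumed layout: "<date>, <time> - <sender>: <text>". Other locales differ.
PREFIX = r"(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}(?:\s?[AP]M)?) - "
MESSAGE_RE = re.compile("^" + PREFIX + r"([^:]+): (.*)$")
PREFIX_RE = re.compile("^" + PREFIX)


def parse_whatsapp_export(text: str) -> List[Dict[str, str]]:
    """Return one dict per message; continuation lines are folded into the message."""
    messages: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        match = MESSAGE_RE.match(line)
        if match:
            current = {"date": match[1], "time": match[2],
                       "sender": match[3], "text": match[4]}
            messages.append(current)
        elif PREFIX_RE.match(line):
            # Timestamped line with no "sender:" part, e.g. a system notice. Skip it.
            current = None
        elif current is not None:
            # No timestamp: this line continues the previous message.
            current["text"] += "\n" + line
    return messages


sample = (
    "1/31/24, 9:00 PM - Messages and calls are end-to-end encrypted.\n"
    "1/31/24, 9:15 PM - Alice: Sending the draft\n"
    "second line of the same message\n"
    "1/31/24, 9:16 PM - Bob: Got it"
)
records = parse_whatsapp_export(sample)
```

Each record could then be rendered as its own document, with the sender available as message-level metadata.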

Custodian tracking — before and after dedup

Every file that lands in a case carries two related fields, and the distinction matters.

The custodian field is strictly 1:1 with the source the document came from — the person whose hard drive, mailbox, or chat export it was extracted from.

The all_custodians field is the post-dedup union. When the engine collapses two byte-identical files into a single surviving record, the custodians of the collapsed copies are merged here.

The pipeline enforces a simple rule: the original custodian is always included in the combined set. The surviving record never loses track of where it came from, even after it absorbs duplicates from other sources.

Figure: Custodian deduplication. Three custodians each hold a copy of the same contract before dedup; after dedup, one surviving record carries all three custodians.
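The two-field model can be sketched as a hash-keyed dedup pass. This is an illustration under assumptions: the post says only that byte-identical files collapse, so the choice of SHA-256, the "first copy ingested survives" policy, and the Record type are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Record:
    """Hypothetical surviving record after dedup."""
    sha256: str
    custodian: str                      # 1:1 with the source of the surviving copy
    all_custodians: List[str] = field(default_factory=list)  # post-dedup union


def dedupe(files: Iterable[Tuple[str, bytes]]) -> List[Record]:
    """Collapse byte-identical files, merging custodians onto the survivor."""
    survivors: Dict[str, Record] = {}
    for custodian, data in files:
        digest = hashlib.sha256(data).hexdigest()
        record = survivors.get(digest)
        if record is None:
            # First copy seen: its custodian seeds the combined set, so the
            # original custodian is always included.
            survivors[digest] = Record(digest, custodian, [custodian])
        elif custodian not in record.all_custodians:
            record.all_custodians.append(custodian)
    return list(survivors.values())


contract = b"%PDF-1.7 master services agreement"
result = dedupe([("Alice", contract), ("Bob", contract), ("Carol", contract),
                 ("Bob", b"unrelated memo")])
```

Here the contract collapses to one record with custodian "Alice" and all_custodians of Alice, Bob, and Carol, while Bob's memo stays a separate record.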

At export time, all_custodians is flattened to a semicolon-delimited string — the format opposing counsel and downstream tools expect. The result: when a document is produced, the metadata accurately reflects every custodian who held a copy, not just the first one ingested.
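A minimal sketch of that flattening step, assuming a bare semicolon as the separator (the post does not say whether a space follows it) and a hypothetical function name:

```python
from typing import Iterable


def flatten_custodians(custodian: str, all_custodians: Iterable[str], sep: str = ";") -> str:
    """Join the custodian set for export, original custodian first, no repeats."""
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    ordered = dict.fromkeys([custodian, *all_custodians])
    return sep.join(ordered)


flattened = flatten_custodians("Alice", ["Bob", "Alice", "Carol"])
```

Putting the original custodian first is a design choice for this example. It guarantees the value is never empty and keeps the exported field consistent with the rule that the surviving record's own custodian is always part of the combined set.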

Why this matters

The point of supporting this many formats isn't to win a feature-comparison checklist. It's that real cases arrive as a mix — a PST from the CFO, a mobile extraction with iMessage, a cloud export full of Word and Excel files, a folder of TIFFs from a process server, and a few audio recordings from a security desk.

If any of those need a separate vendor, a separate workflow, or a separate line item, the case fragments before review even starts.

DecoverAI's ingestion path is one pipeline, with one custodian model, one dedup pass, one OCR and transcription layer, and one set of safeguards. Drop the data in. Get reviewable, searchable, Bates-ready documents back.

If your case involves a format we haven't listed here — particularly something proprietary like a forensic image container or a niche chat platform — reach out. The pipeline is built to grow.
