One of the questions we get most often on demos is also one of the simplest: "Can DecoverAI handle my data?"
The honest answer is almost certainly yes. But the more interesting answer is what happens between dropping a file and seeing it show up in your workspace — deduplicated, tagged with the right custodian, OCR'd if it needs to be, and ready for review.
This post is the full reference as of April 2026.
Archives and containers
Most cases don't start with a loose document. They start with a PST per custodian, a forensic vendor's ZIP of ZIPs, or a TAR archive from an IT department that didn't label anything.
DecoverAI's ingestion pipeline handles all of it — and it's recursive. A PST inside a TAR.GZ inside a ZIP unwinds down to individual messages without anyone stitching things together manually.
Supported formats: .pst, .mbox, .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz
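To make "recursive" concrete, here is a minimal sketch of how nested containers can be unwound. It is an illustration, not DecoverAI's implementation: `container_kind`, `unwind`, the `list_members` callback, and the `max_depth` guard are all hypothetical names and choices, and the suffix table is simply copied from the list above.

```python
from typing import Callable, Iterator, List, Optional

# Suffix table taken from the supported-format list (illustrative only).
CONTAINER_SUFFIXES = (
    ".tar.gz", ".tar.bz2", ".tar.xz",
    ".tgz", ".tbz2", ".txz",
    ".pst", ".mbox", ".zip", ".tar",
)


def container_kind(name: str) -> Optional[str]:
    """Return the matching container suffix, or None for a loose file."""
    lowered = name.lower()
    for suffix in CONTAINER_SUFFIXES:
        if lowered.endswith(suffix):
            return suffix
    return None


def unwind(
    name: str,
    list_members: Callable[[str], List[str]],
    depth: int = 0,
    max_depth: int = 16,  # assumed guard against runaway nesting
) -> Iterator[str]:
    """Yield leaf items, descending into containers of containers."""
    if container_kind(name) is None:
        yield name
        return
    if depth >= max_depth:
        raise ValueError(f"nesting deeper than {max_depth} levels: {name}")
    for child in list_members(name):
        yield from unwind(child, list_members, depth + 1, max_depth)


# A fake production: a PST inside a TAR.GZ inside a ZIP.
tree = {
    "production.zip": ["vendor.tar.gz", "readme.txt"],
    "vendor.tar.gz": ["cfo.pst"],
    "cfo.pst": ["msg1.eml", "msg2.eml"],
}
leaves = list(unwind("production.zip", tree.__getitem__))
```

In a real pipeline `list_members` would open the archive or mailbox; here a dictionary stands in for it so the recursion itself is easy to follow.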
A few things happen automatically during extraction. Junk files like __MACOSX/ folders and .DS_Store are skipped. Symlinks are dropped. Any path that tries to escape the extraction root is rejected. Zip-bomb protection is built in — you don't configure it, and you can't accidentally turn it off.
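The first three safeguards can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the production code: `is_junk`, `escapes_root`, and `plan_extraction` are hypothetical helpers, members are modelled as `(path, is_symlink)` pairs, and zip-bomb protection is left out entirely.

```python
import posixpath
import re


def is_junk(member_path: str) -> bool:
    """True for macOS residue: anything under __MACOSX/ or named .DS_Store."""
    parts = member_path.replace("\\", "/").split("/")
    return "__MACOSX" in parts or parts[-1] == ".DS_Store"


def escapes_root(member_path: str) -> bool:
    """True if the path is absolute or climbs above the extraction root."""
    unified = member_path.replace("\\", "/")
    if unified.startswith("/") or re.match(r"^[A-Za-z]:", unified):
        return True  # absolute POSIX path or Windows drive path
    normalized = posixpath.normpath(unified)
    return normalized == ".." or normalized.startswith("../")


def plan_extraction(members):
    """Split (path, is_symlink) pairs into paths to extract and paths rejected."""
    keep, rejected = [], []
    for path, is_symlink in members:
        if is_junk(path) or is_symlink:
            continue  # skipped or dropped without raising an error
        if escapes_root(path):
            rejected.append(path)
            continue
        keep.append(path)
    return keep, rejected


members = [
    ("docs/a.pdf", False),
    ("__MACOSX/docs/._a.pdf", False),
    ("docs/.DS_Store", False),
    ("shortcut", True),
    ("../../etc/passwd", False),
    ("/abs/x", False),
    ("docs/../b.txt", False),
]
keep, rejected = plan_extraction(members)
```

Note that `docs/../b.txt` is kept: it contains `..` but still resolves inside the extraction root, so only paths that actually escape are rejected.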
Email and messages
Supported formats: .eml, .msg
Standard .eml files are handled natively — MIME multipart, inline content, attachments, all of it. Outlook .msg files are converted to PDF for Bates-stampable output.
Messages extracted from a PST or MBOX automatically inherit the custodian of their parent container. No manual mapping required.
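One way to picture that inheritance is a walk up the extraction tree until a custodian is found. The `Item` class and `effective_custodian` function below are hypothetical names used only for this sketch; the source does not describe how the pipeline stores the relationship.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    name: str
    parent: Optional["Item"] = None
    # Assumption: only top-level containers have a custodian assigned directly.
    custodian: Optional[str] = None


def effective_custodian(item: Item) -> Optional[str]:
    """Return the nearest custodian found on the item or its ancestors."""
    node: Optional[Item] = item
    while node is not None:
        if node.custodian is not None:
            return node.custodian
        node = node.parent
    return None


pst = Item("cfo.pst", custodian="Jane Doe")
message = Item("RE: Q3 forecast.eml", parent=pst)
```

Here `effective_custodian(message)` resolves to the PST's custodian without any per-message mapping.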
Office documents
Supported formats: .pdf, .docx, .xls, .xlsx, .xlsm, .odt, .rtf
PDFs are Bates-stamped directly, with no conversion step. Word documents get real page-break detection, not just a rough word-count split. Excel files (including macro-enabled .xlsm workbooks) treat each sheet as its own logical page. OpenDocument text (.odt) and RTF files are supported as well.
If a Word document has embedded images, those images are automatically checked for text content and OCR'd when needed.
Web content and plain text
Supported formats: .html, .htm, .txt
HTML pages are rendered into a stampable PDF so they can carry Bates numbers like every other production document. Plain text gets the same treatment — laid out cleanly so it participates in the same review workflow as everything else.
Images
Supported formats: .png, .jpg, .jpeg, .gif, .bmp, .webp, .tiff, .tif
Every image is wrapped into a single-page PDF so it can be Bates-stamped and reviewed alongside your documents. Multi-page TIFFs are the exception: all of their pages are flattened into one PDF automatically.
When OCR is enabled for a case, these same files also pass through vision-based text extraction — so handwritten notes, scanned letters, and photographed whiteboards all become searchable in your review workspace.
Audio
Supported formats: .wav, .mp3, .ogg
Audio files are increasingly common in white-collar investigations, harassment matters, and any case involving voicemails or recorded calls. When transcription is enabled, DecoverAI processes them through Whisper. Long recordings — even a four-hour deposition — are automatically chunked and stitched back together, so there's no file-size ceiling to worry about.
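The chunk-and-stitch idea can be sketched without any audio library. The example below is a simplified illustration, not the actual transcription code: the 10-minute chunk length, the non-overlapping spans, and the `(start, end, text)` segment shape are all assumptions, and the transcription call itself is omitted.

```python
def chunk_spans(duration_s: float, chunk_s: float = 600.0):
    """Split a recording into consecutive (start, end) spans in seconds."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


def stitch(chunk_results):
    """Merge per-chunk segments onto one timeline.

    chunk_results is a list of (chunk_start_s, segments), where each segment
    is (start_s, end_s, text) measured from the start of its own chunk.
    """
    merged = []
    for chunk_start, segments in chunk_results:
        for seg_start, seg_end, text in segments:
            merged.append((chunk_start + seg_start, chunk_start + seg_end, text))
    merged.sort(key=lambda seg: seg[0])
    return merged


# A four-hour recording becomes 24 ten-minute chunks.
spans = chunk_spans(4 * 3600)

# Two chunks transcribed independently, then placed back on one timeline.
transcript = stitch([
    (600.0, [(2.0, 5.5, "Let's go back on the record.")]),
    (0.0, [(1.0, 4.0, "Please state your name.")]),
])
```

Because each chunk carries its own start offset, the chunks can be transcribed in any order and still come back together in sequence.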
Chat exports
This is the format category where most platforms quietly fall over. There's no universal export shape, the per-message metadata doesn't match email, and the media attachments often live in a sibling folder nobody told you about.
DecoverAI normalizes each supported export into individual documents so the rest of the review pipeline — Bates stamping, custodian tracking, privilege review, redaction — treats chat messages exactly like everything else.
Supported sources:
- WhatsApp (.txt export) — each message becomes a reviewable document; sibling images and audio are inlined as attachments
- Discord (.json export) — per-message documents with custodian preserved
- iMessage (chat.db) — participant metadata is preserved as the custodian set
- Google Chat (.json export) — channel and DM messages reconstructed as individual documents
- Android SMS/MMS (.xml from SMS Backup & Restore) — MMS binary parts like images are attached to the rendered document
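To show what "each message becomes a reviewable document" involves, here is a minimal sketch for a WhatsApp text export. It assumes one common line shape, `1/5/24, 9:01 AM - Alice: text`; real exports vary by platform and locale, so the `HEADER_RE` pattern and the `parse_whatsapp` function are illustrative assumptions rather than DecoverAI's parser.

```python
import re

# Assumed header shape: "<date>, <time> - <rest of line>".
HEADER_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}(?:\s?[AP]M)?) - (.*)$"
)


def parse_whatsapp(export_text: str):
    """Turn an export into a list of per-message dictionaries."""
    messages = []
    current = None  # message that continuation lines should attach to
    for line in export_text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            # No timestamp: treat as a continuation of a multi-line message.
            if current is not None:
                current["text"] += "\n" + line
            continue
        date, time, rest = match.groups()
        sender, sep, body = rest.partition(": ")
        if not sep:
            current = None  # system notice such as "Alice left"
            continue
        current = {"date": date, "time": time, "sender": sender, "text": body}
        messages.append(current)
    return messages


sample = (
    "1/5/24, 9:01 AM - Alice: Did you see the invoice?\n"
    "It was sent twice.\n"
    "1/5/24, 9:02 AM - Bob: Yes\n"
    "1/5/24, 9:03 AM - Alice left"
)
docs = parse_whatsapp(sample)
```

Each dictionary in `docs` is the raw material for one reviewable document; attaching sibling media files would be a separate step not shown here.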
Custodian tracking — before and after dedup
Every file that lands in a case carries two related fields, and the distinction matters.
The custodian field is strictly 1:1 with the source the document came from — the person whose hard drive, mailbox, or chat export it was extracted from.
The all_custodians field is the post-dedup union. When the engine collapses two byte-identical files into a single surviving record, the custodians of the collapsed copies are merged here.
The pipeline enforces a simple rule: the original custodian is always included in the combined set. The surviving record never loses track of where it came from, even after it absorbs duplicates from other sources.
At export time, all_custodians is flattened to a semicolon-delimited string — the format opposing counsel and downstream tools expect. The result: when a document is produced, the metadata accurately reflects every custodian who held a copy, not just the first one ingested.
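The relationship between the two fields can be sketched as follows. This is an illustration under stated assumptions, not the engine itself: SHA-256 is our choice for detecting byte-identical files, the first copy ingested is assumed to be the surviving record, and `dedupe` and `export_custodians` are hypothetical function names. Only the `custodian` and `all_custodians` field names come from the description above.

```python
import hashlib


def dedupe(files):
    """Collapse byte-identical files; files is (custodian, content) in ingest order."""
    records = {}
    for custodian, content in files:
        digest = hashlib.sha256(content).hexdigest()  # assumed identity check
        record = records.get(digest)
        if record is None:
            # First copy survives; its own custodian seeds the combined set.
            records[digest] = {
                "sha256": digest,
                "custodian": custodian,
                "all_custodians": [custodian],
            }
        elif custodian not in record["all_custodians"]:
            record["all_custodians"].append(custodian)
    return list(records.values())


def export_custodians(record) -> str:
    """Flatten all_custodians to a semicolon-delimited string."""
    return ";".join(record["all_custodians"])


records = dedupe([
    ("Alice", b"Q3 memo"),
    ("Bob", b"Q3 memo"),
    ("Alice", b"Q3 memo"),
    ("Carol", b"Unrelated file"),
])
flattened = export_custodians(records[0])
```

The surviving record keeps `custodian` as "Alice" while `all_custodians` grows to include Bob, which is the rule described above: the original custodian is always part of the combined set.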
Why this matters
The point of supporting this many formats isn't to win a feature-comparison checklist. It's that real cases arrive as a mix — a PST from the CFO, a mobile extraction with iMessage, a cloud export full of Word and Excel files, a folder of TIFFs from a process server, and a few audio recordings from a security desk.
If any of those need a separate vendor, a separate workflow, or a separate line item, the case fragments before review even starts.
DecoverAI's ingestion path is one pipeline, with one custodian model, one dedup pass, one OCR and transcription layer, and one set of safeguards. Drop the data in. Get reviewable, searchable, Bates-ready documents back.
If your case involves a format we haven't listed here — particularly something proprietary like a forensic image container or a niche chat platform — reach out. The pipeline is built to grow.