File formats

Read this to learn exactly what omem ingest can read, which engine handles each format, and how images, OCR, and the curator behave per format. The parser chain is a primary design concern, not an afterthought, so this page spells out what each format gets.

The format matrix

Three independent settings shape what happens to a file: the parser that turns it into Markdown, the image policy for pictures inside it, and the curator mode that writes the final wiki page. Here’s every format and its defaults:

Format	Parser library	Image default	Curator default
PDF	pymupdf4llm (layout) + RapidOCR	`ocr`	`llm-full`
DOCX	python-docx	`vlm`	`llm-full`
PPTX	python-pptx	`vlm`	`llm-full`
XLSX / .xls / .xlsm / .ods	calamine (tables) + openpyxl (images)	`vlm`	`llm-full`
Markdown	pass-through (verbatim)	—	`frontmatter-only`
Plain text / CSV / TSV / JSON	pass-through	—	`frontmatter-only`
HTML	BeautifulSoup + markdownify	—	`llm-full`
Email (.eml / .msg)	stdlib `email` / extract-msg	`vlm` (inline)	— (mail kind)
Calendar (.ics)	icalendar	—	— (calendar kind)
Images (png/jpeg/gif/webp/tiff/emf)	Pillow (+ LibreOffice for EMF/WMF)	`vlm`	—

Two of the controls in that table are configurable per format: the image policy (parser.images.*) and the curator mode (curator.mode.*). The configuration schema page explains each value in full. A third setting, ingest.formats.*, is a master on/off switch for each format across every source. This page is the format-by-format view; the configuration page is the value-by-value view.
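As a rough illustration of how the three settings combine, the sketch below encodes the defaults from the matrix and applies dotted-key overrides on top. The short format keys, the `FormatDefaults` class, and the `effective_settings` helper are hypothetical names for this example, and it assumes every format is enabled unless ingest.formats.* says otherwise; it is not OMem's actual config loader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatDefaults:
    images: Optional[str]   # "ocr", "vlm", or None where the matrix lists no default
    curator: Optional[str]  # "llm-full", "frontmatter-only", or None


# Defaults copied from the format matrix; the short keys are hypothetical.
DEFAULTS = {
    "pdf": FormatDefaults("ocr", "llm-full"),
    "docx": FormatDefaults("vlm", "llm-full"),
    "pptx": FormatDefaults("vlm", "llm-full"),
    "xlsx": FormatDefaults("vlm", "llm-full"),
    "markdown": FormatDefaults(None, "frontmatter-only"),
    "text": FormatDefaults(None, "frontmatter-only"),
    "html": FormatDefaults(None, "llm-full"),
    "mail": FormatDefaults("vlm", None),
    "calendar": FormatDefaults(None, None),
    "image": FormatDefaults("vlm", None),
}


def effective_settings(fmt, overrides=None):
    """Return (enabled, image_policy, curator_mode) for one format.

    `overrides` maps dotted keys such as "ingest.formats.xlsx" to values.
    Assumption: a format is enabled unless explicitly switched off.
    """
    overrides = overrides or {}
    base = DEFAULTS[fmt]
    enabled = overrides.get(f"ingest.formats.{fmt}", True)
    images = overrides.get(f"parser.images.{fmt}", base.images)
    curator = overrides.get(f"curator.mode.{fmt}", base.curator)
    return enabled, images, curator
```

For example, `effective_settings("mail", {"parser.images.mail": "off"})` keeps mail enabled but replaces its image policy with `off`.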

What each format gets

Office documents — keeping the images

Unlike tools that convert a document to text and discard everything visual, OMem keeps the images and describes them.

PDF uses pymupdf4llm with layout preservation. Embedded images run through OCR by default (PDFs are mostly text, so per-page vision would burn tokens for little gain); a page with almost no extractable text is treated as scanned and flagged for vision instead.
DOCX / PPTX / XLSX parse with the deterministic Office libraries, and embedded images default to vlm — a vision model writes a description in place. Slide pictures, chart images, and embedded screenshots become searchable text rather than vanishing.
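The scanned-page check for PDFs can be pictured as a simple text-length test. This is a minimal sketch under stated assumptions: the function names and the 20-character threshold are hypothetical, since this page does not say how "almost no extractable text" is measured.

```python
def is_probably_scanned(page_text: str, min_chars: int = 20) -> bool:
    """True when a page has almost no extractable text.

    min_chars is a hypothetical threshold, not a documented OMem value.
    """
    visible = "".join(page_text.split())  # drop all whitespace before counting
    return len(visible) < min_chars


def image_policy_per_page(page_texts, min_chars: int = 20):
    """Pick "vlm" for pages that look scanned, "ocr" (the PDF default) otherwise."""
    return [
        "vlm" if is_probably_scanned(text, min_chars) else "ocr"
        for text in page_texts
    ]
```

The input here is the text already extracted per page, so the sketch stays independent of any particular PDF library.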

Markdown, text, and data files — left alone

Markdown, plain text, CSV, TSV, and JSON are passed through verbatim — no LLM rewrites the body. They’re already structured, so the curator runs in frontmatter-only mode: it writes an abstract and tags, and copies the body byte-for-byte. (A counter-intuitive payoff: the verbatim body keeps exact numbers and structure that a full rewrite might smooth over.)
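To make the frontmatter-only idea concrete, here is a small sketch that prepends a metadata header and leaves the body bytes untouched. The YAML-style layout and the `frontmatter_only_page` name are assumptions for illustration; only the abstract and tags fields come from the description above.

```python
import json


def frontmatter_only_page(body: bytes, abstract: str, tags: list[str]) -> bytes:
    """Prepend a metadata header; the body is appended without modification."""
    header_lines = [
        "---",
        # JSON strings and lists are also valid YAML flow values.
        f"abstract: {json.dumps(abstract)}",
        f"tags: {json.dumps(tags)}",
        "---",
        "",  # ensures the header ends with a newline
    ]
    header = "\n".join(header_lines).encode("utf-8")
    return header + body  # no decode/encode round trip on the body
```

Because the body is never decoded, line endings and numeric formatting such as `2.50` survive exactly as they were in the source file.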

Email and calendar — rendered as one page

Email and calendar items aren’t treated as ordinary files; they’re the mail and calendar kinds. An email thread is rendered into a single page (headers, body, inline images), and attachments are parsed recursively through this same matrix, so a PDF attached to an email gets the PDF treatment. Calendar events render with time, location, organizer, and attendees.
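The recursive attachment handling can be sketched with the standard library's `email` package, which the matrix lists for .eml files. The suffix table, the `MAX_DEPTH` guard, and the returned dictionary shape are hypothetical; the sketch only routes each attachment to a matrix row and recurses into attached messages, and it does not reproduce OMem's real rendering.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import PurePath

MAX_DEPTH = 3  # hypothetical guard against deeply nested forwards

# Hypothetical suffix -> matrix row mapping; extend as needed.
SUFFIX_TO_FORMAT = {
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx", ".xlsx": "xlsx",
    ".eml": "mail", ".ics": "calendar", ".png": "image",
}


def attachment_bytes(part: EmailMessage) -> bytes:
    """Return an attachment's content as bytes, whatever its MIME type."""
    content = part.get_content()
    if isinstance(content, bytes):   # application/*, image/*, ...
        return content
    if isinstance(content, str):     # text/* parts decode to str
        return content.encode("utf-8")
    return content.as_bytes()        # message/rfc822 yields a message object


def render_mail(raw: bytes, depth: int = 0) -> dict:
    """Parse one message and route each attachment to a matrix row."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    page = {
        "subject": str(msg["subject"]),
        "from": str(msg["from"]),
        "body": body.get_content() if body else "",
        "attachments": [],
    }
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        fmt = SUFFIX_TO_FORMAT.get(PurePath(name).suffix.lower(), "unknown")
        entry = {"name": name, "format": fmt}
        if fmt == "mail" and depth < MAX_DEPTH:
            # An attached email goes back through the same function.
            entry["page"] = render_mail(attachment_bytes(part), depth + 1)
        page["attachments"].append(entry)
    return page
```

In a fuller version each non-mail entry would be handed to the parser for its format; here the format label stands in for that step.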

Images — described, and EMF handled

A standalone image file is described by the vision model. EMF/WMF (the vector format Office loves to embed) is converted to PNG first — Pillow attempts it, and LibreOffice’s headless soffice is the reliable fallback — so those images don’t become dead ends.
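The Pillow-then-LibreOffice order might look roughly like the sketch below. The function names and the 60-second timeout are hypothetical, and the `soffice` flags shown are LibreOffice's standard headless conversion options rather than anything confirmed from OMem's source.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def soffice_png_command(src: Path, out_dir: Path, soffice: str = "soffice") -> List[str]:
    """Build the headless LibreOffice command that converts src to PNG."""
    return [soffice, "--headless", "--convert-to", "png",
            "--outdir", str(out_dir), str(src)]


def vector_to_png(src: Path, out_dir: Path, timeout: int = 60) -> Optional[Path]:
    """Try Pillow first, then fall back to soffice; None if both fail."""
    target = out_dir / (src.stem + ".png")
    try:
        from PIL import Image  # imported lazily so the fallback works without Pillow
        with Image.open(src) as im:
            im.save(target, "PNG")  # rendering EMF/WMF may fail on this platform
        return target
    except Exception:
        pass  # fall through to LibreOffice
    if shutil.which("soffice") is None:
        return None
    subprocess.run(soffice_png_command(src, out_dir),
                   check=True, timeout=timeout, capture_output=True)
    return target if target.exists() else None
```

Keeping the command builder separate makes the fallback easy to inspect or log without actually launching LibreOffice.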

How images get described: OCR vs. VLM

When parser.images.<fmt> is vlm, an image is sent to the vision model only if it passes a size filter that depends on where it appears. OMem filters out decoration first, so you don’t pay to describe icons and spacers:

Context	Sent to the vision model when…
Standalone image file	always (unless either side < 100 px)
PowerPoint slide image	displayed size ≥ 80 px per edge and not a sub-200×200 stamp
Word paragraph image	always (a picture on its own line is content)
Word inline image	both sides ≥ 200 px (smaller = icon)
Excel cell image	always (Excel images are rarely decorative)
Email/calendar inline image	≥ 5 KB or ≥ 100×100 px (skips tracking pixels & signature art)
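The table above can be read as one decision function. The sketch below is an interpretation, not OMem's code: the context labels and the `ImageInfo` class are hypothetical, "5 KB" is taken as 5 × 1024 bytes, and the "sub-200×200 stamp" rule is assumed to mean a native pixel size under 200 on both sides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ImageInfo:
    context: str                     # hypothetical label, e.g. "docx_inline"
    pixel_size: Tuple[int, int]      # native (width, height) in px
    displayed_size: Optional[Tuple[int, int]] = None  # on-slide size, PowerPoint only
    size_bytes: int = 0


def worth_describing(img: ImageInfo) -> bool:
    """Apply the per-context size filter before calling the vision model."""
    w, h = img.pixel_size
    if img.context == "standalone":
        return min(w, h) >= 100
    if img.context == "pptx_slide":
        dw, dh = img.displayed_size or img.pixel_size
        is_small_stamp = w < 200 and h < 200  # assumed reading of "sub-200x200"
        return min(dw, dh) >= 80 and not is_small_stamp
    if img.context in ("docx_paragraph", "xlsx_cell"):
        return True
    if img.context == "docx_inline":
        return min(w, h) >= 200
    if img.context == "mail_inline":
        # Either condition is enough; this skips tiny tracking pixels.
        return img.size_bytes >= 5 * 1024 or (w >= 100 and h >= 100)
    raise ValueError(f"unknown image context: {img.context}")
```

A 1×1 tracking pixel of a few dozen bytes fails both mail conditions, while a 120×120 logo passes on dimensions alone.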

OCR (the default for PDFs) uses RapidOCR, which handles mixed Chinese/English text well. To keep memory bounded on image-heavy documents, OCR runs in a subprocess that restarts every parser.ocr_subprocess_batch images, and a per-document cap (parser.max_images_per_doc, default 200) skips the long tail rather than ballooning memory — anything skipped is logged, never silently dropped.
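The cap and the restart interval together amount to a small planning step, sketched below. The `plan_ocr_batches` helper and the logger name are hypothetical; each returned batch stands for the images one OCR subprocess would handle before being restarted, and the batch size is passed in explicitly because this page does not state a default for parser.ocr_subprocess_batch.

```python
import logging

log = logging.getLogger("ocr_plan")  # hypothetical logger name


def plan_ocr_batches(image_ids, batch_size: int, max_images: int = 200):
    """Split a document's images into per-subprocess batches.

    Returns (batches, skipped). Images beyond max_images are skipped,
    and every skip is logged rather than dropped silently.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    kept = list(image_ids[:max_images])
    skipped = list(image_ids[max_images:])
    for image_id in skipped:
        log.warning("skipping OCR for image %s: over per-document cap of %d",
                    image_id, max_images)
    # One batch per subprocess lifetime; the worker restarts between batches.
    batches = [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
    return batches, skipped
```

With 205 images, a batch size of 50, and the default cap of 200, this yields four batches and five logged skips.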

Turning a format off

To stop ingesting a format everywhere — say you never want spreadsheets indexed:

omem config set ingest.formats.xlsx false

To keep ingesting a format but stop describing its images (e.g. skip email signature art):

omem config set parser.images.mail off

Both take effect on the next run. Existing pages aren’t deleted — they’re just not refreshed for that format.

What’s next

Configuration schema — the full parser.images.*, ingest.formats.*, and curator.mode.* fields.
The ingest lifecycle — where parsing and image description sit in the pipeline.