File formats
Read this if you want to know exactly what omem ingest can read, which engine handles each format, and how images, OCR, and the curator behave per format. The parser chain is a first-order design concern — not an afterthought — so this page is specific about what each format gets.
The format matrix
Three independent settings shape what happens to a file: the parser that turns it into Markdown, the image policy for pictures inside it, and the curator mode that writes the final wiki page. Here’s every format and its defaults:
| Format | Parser library | Image default | Curator default |
|---|---|---|---|
| PDF | pymupdf4llm (layout) + RapidOCR | ocr | llm-full |
| DOCX | python-docx | vlm | llm-full |
| PPTX | python-pptx | vlm | llm-full |
| XLSX / .xls / .xlsm / .ods | calamine (tables) + openpyxl (images) | vlm | llm-full |
| Markdown | pass-through (verbatim) | — | frontmatter-only |
| Plain text / CSV / TSV / JSON | pass-through | — | frontmatter-only |
| HTML | BeautifulSoup + markdownify | — | llm-full |
| Email (.eml / .msg) | stdlib email / extract-msg | vlm (inline) | — (mail kind) |
| Calendar (.ics) | icalendar | — | — (calendar kind) |
| Images (png/jpeg/gif/webp/tiff/emf) | Pillow (+ LibreOffice for EMF/WMF) | vlm | — |
Two controls in that table are configurable per format — see parser.images.* and curator.mode.*, where each value is explained in full. A third, ingest.formats.*, is a master on/off switch for each format across every source. This page is the format-by-format view; the config page is the value-by-value view.
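To make the interplay of the three settings concrete, here is a minimal Python sketch of how per-format defaults and overrides might combine. The defaults are copied from the matrix above; `resolve_settings`, the short format keys such as `pdf`, and the flat override mapping are hypothetical names for illustration, not OMem's actual code.

```python
from typing import Optional

# Defaults copied from the format matrix: (image policy, curator mode).
# None means the setting does not apply to that format.
DEFAULTS: dict[str, tuple[Optional[str], Optional[str]]] = {
    "pdf": ("ocr", "llm-full"),
    "docx": ("vlm", "llm-full"),
    "pptx": ("vlm", "llm-full"),
    "xlsx": ("vlm", "llm-full"),
    "markdown": (None, "frontmatter-only"),
    "html": (None, "llm-full"),
}


def resolve_settings(fmt: str, overrides: dict[str, object]) -> Optional[dict[str, object]]:
    """Return the effective settings for a format, or None if it is switched off.

    `overrides` is a hypothetical flat mapping keyed like the config paths,
    e.g. {"ingest.formats.xlsx": False, "parser.images.pdf": "vlm"}.
    """
    if overrides.get(f"ingest.formats.{fmt}", True) is False:
        return None  # master switch: the format is not ingested at all
    images, curator = DEFAULTS[fmt]
    return {
        "images": overrides.get(f"parser.images.{fmt}", images),
        "curator": overrides.get(f"curator.mode.{fmt}", curator),
    }
```

The master switch is checked first, so a disabled format never reaches the parser or curator settings.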
What each format gets
Office documents: keeping the images
The thing that separates OMem from “convert to text and lose everything visual” tools is that it keeps images and describes them.
- PDF uses pymupdf4llm with layout preservation. Embedded images run through OCR by default (PDFs are mostly text, so per-page vision would burn tokens for little gain); a page with almost no extractable text is treated as scanned and flagged for vision instead.
- DOCX / PPTX / XLSX parse with the deterministic Office libraries, and embedded images default to vlm: a vision model writes a description in place. Slide pictures, chart images, and embedded screenshots become searchable text rather than vanishing.
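The scanned-page rule in the PDF bullet can be pictured as a simple text-density check. This is a sketch under stated assumptions: the function names and the 50-character threshold are invented for illustration, and the page does not document the actual cutoff.

```python
def is_probably_scanned(page_text: str, min_chars: int = 50) -> bool:
    """Treat a page as scanned when it has almost no extractable text.

    `min_chars` is an assumed threshold, not a documented OMem default.
    """
    # Whitespace carries no content, so count only visible characters.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def image_policy_for_page(page_text: str) -> str:
    # Text-rich pages keep the PDF default (OCR on embedded images);
    # near-empty pages are flagged for the vision model instead.
    return "vlm" if is_probably_scanned(page_text) else "ocr"
```

Under this reading, an ordinary text page stays on the cheaper OCR path and only pages that look like pure images are escalated.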
Markdown, text, and data files — left alone
Markdown, plain text, CSV, TSV, and JSON are passed through verbatim, and no LLM rewrites the body. They’re already structured, so the curator runs in frontmatter-only mode: it writes an abstract and tags, and copies the body byte-for-byte. (A counter-intuitive payoff: the verbatim body keeps exact numbers and structure that a full rewrite might smooth over.)
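A rough sketch of what frontmatter-only curation amounts to: generated metadata is prepended and the body bytes are appended untouched. The function name and the YAML-style frontmatter layout are assumptions for illustration; only the "body is copied byte-for-byte" behavior comes from the text above.

```python
def curate_frontmatter_only(body: bytes, abstract: str, tags: list[str]) -> bytes:
    """Prepend generated frontmatter and copy the body without touching it.

    The `abstract` and `tags` keys are an assumed layout. No YAML escaping
    is attempted here, so this is a sketch rather than a safe serializer.
    """
    tag_list = ", ".join(tags)
    header = f"---\nabstract: {abstract}\ntags: [{tag_list}]\n---\n"
    # Work in bytes so line endings and number formatting survive unchanged.
    return header.encode("utf-8") + body
```

Because the body never passes through a text decoder or a model, details such as CRLF line endings and trailing zeros in numbers come out exactly as they went in.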
Email and calendar — rendered as one page
Email and calendar aren’t “files” in the file sense; they’re the mail and calendar kinds. An email thread is rendered into a single page (headers, body, inline images), and attachments are parsed recursively through this same matrix, so a PDF attached to an email gets the PDF treatment. Calendar events render with time, location, organizer, and attendees.
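The recursive attachment handling can be illustrated with the standard library's `email` package. This sketch only plans which handler each attachment would go to; `plan_attachments` and the `HANDLERS` labels are hypothetical and stand in for OMem's real dispatch.

```python
from email.message import EmailMessage

# Hypothetical extension-to-handler labels; the real dispatch lives in OMem.
HANDLERS = {"pdf": "pdf", "docx": "docx", "xlsx": "xlsx", "png": "image"}


def plan_attachments(msg: EmailMessage, depth: int = 0) -> list[tuple[int, str, str]]:
    """Walk a message and list (depth, filename, handler) per attachment.

    A nested message/rfc822 attachment is walked recursively, so a PDF inside
    a forwarded email is still routed to the PDF handler.
    """
    plan: list[tuple[int, str, str]] = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        if part.get_content_type() == "message/rfc822":
            plan.append((depth, name, "mail"))
            # get_content() returns the embedded message for message/rfc822.
            plan.extend(plan_attachments(part.get_content(), depth + 1))
        else:
            suffix = name.rsplit(".", 1)[-1].lower()
            plan.append((depth, name, HANDLERS.get(suffix, "unknown")))
    return plan
```

A forwarded message carrying a PDF would yield one entry for the nested email and a second, one level deeper, for the PDF.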
Images — described, and EMF handled
A standalone image file is described by the vision model. EMF/WMF (the vector format Office loves to embed) is converted to PNG first. Pillow attempts it, and LibreOffice’s headless soffice is the reliable fallback, so those images don’t become dead ends.
How images get described: OCR vs. VLM
When parser.images.<fmt> is vlm, an image is sent to the vision model only if it’s worth it. OMem filters out decoration first, so you don’t pay to describe icons and spacers:
| Context | Sent to the vision model when… |
|---|---|
| Standalone image file | always (unless either side < 100 px) |
| PowerPoint slide image | displayed size ≥ 80 px per edge and not a sub-200×200 stamp |
| Word paragraph image | always (a picture on its own line is content) |
| Word inline image | both sides ≥ 200 px (smaller = icon) |
| Excel cell image | always (Excel images are rarely decorative) |
| Email/calendar inline image | ≥ 5 KB or ≥ 100×100 px (skips tracking pixels & signature art) |
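The table above reads naturally as a single predicate. The sketch below is one possible encoding, with several assumptions: the `context` labels are invented, "sub-200×200 stamp" is read as both edges under 200 px, and 5 KB is taken as 5 × 1024 bytes.

```python
def worth_describing(context: str, width: int, height: int, size_bytes: int = 0) -> bool:
    """Decide whether an image goes to the vision model, per the table above.

    `context` values are hypothetical labels. Width and height are in pixels;
    for slides they stand for the displayed size.
    """
    if context == "standalone":
        return width >= 100 and height >= 100
    if context == "pptx":
        # Assumed reading of "sub-200x200 stamp": both edges under 200 px.
        is_stamp = width < 200 and height < 200
        return width >= 80 and height >= 80 and not is_stamp
    if context in ("docx_paragraph", "xlsx_cell"):
        return True  # these contexts are always treated as content
    if context == "docx_inline":
        return width >= 200 and height >= 200
    if context == "mail_inline":
        # Either a large enough file or large enough dimensions qualifies.
        return size_bytes >= 5 * 1024 or (width >= 100 and height >= 100)
    raise ValueError(f"unknown context: {context}")
```

For example, a 1×1 tracking pixel of a few dozen bytes in an email fails both mail checks, while a 32×32 icon inline in a Word paragraph fails the 200 px test.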
OCR (the default for PDFs) uses RapidOCR, which handles mixed Chinese/English text well. To keep memory bounded on image-heavy documents, OCR runs in a subprocess that restarts every parser.ocr_subprocess_batch images, and a per-document cap (parser.max_images_per_doc, default 200) skips the long tail rather than ballooning memory — anything skipped is logged, never silently dropped.
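The batching and cap described above might be planned like this. It is a sketch only: `plan_ocr_batches` is a hypothetical helper, each inner list stands for the work one OCR subprocess does before it is restarted, and no real OCR or subprocess is started here.

```python
import logging

log = logging.getLogger("ocr_plan")


def plan_ocr_batches(images: list[str], batch_size: int, max_images: int = 200) -> list[list[str]]:
    """Split a document's images into per-subprocess batches, honoring the cap.

    `batch_size` plays the role of parser.ocr_subprocess_batch and
    `max_images` the role of parser.max_images_per_doc.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    kept, skipped = images[:max_images], images[max_images:]
    for name in skipped:
        # Skipped images are logged rather than dropped silently.
        log.warning("skipping %s: over the per-document cap of %d", name, max_images)
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
```

Restarting the worker between batches bounds memory growth by the batch size, while the cap bounds the total work per document.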
Turning a format off
To stop ingesting a format everywhere (say you never want spreadsheets indexed):

    omem config set ingest.formats.xlsx false

To keep ingesting a format but stop describing its images (e.g. skip email signature art):

    omem config set parser.images.mail off

Both take effect on the next run. Existing pages aren’t deleted; they’re just not refreshed for that format.
What’s next
- Configuration schema: the full parser.images.*, ingest.formats.*, and curator.mode.* fields.
- The ingest lifecycle: where parsing and image description sit in the pipeline.