File formats

Read this if you want to know exactly what omem ingest can read, which engine handles each format, and how images, OCR, and the curator behave per format. The parser chain is a first-order design concern — not an afterthought — so this page is specific about what each format gets.

Three independent settings shape what happens to a file: the parser that turns it into Markdown, the image policy for pictures inside it, and the curator mode that writes the final wiki page. Here’s every format and its defaults:

| Format | Parser library | Image default | Curator default |
| --- | --- | --- | --- |
| PDF | pymupdf4llm (layout) + RapidOCR | ocr | llm-full |
| DOCX | python-docx | vlm | llm-full |
| PPTX | python-pptx | vlm | llm-full |
| XLSX / .xls / .xlsm / .ods | calamine (tables) + openpyxl (images) | vlm | llm-full |
| Markdown | pass-through (verbatim) | | frontmatter-only |
| Plain text / CSV / TSV / JSON | pass-through | | frontmatter-only |
| HTML | BeautifulSoup + markdownify | | llm-full |
| Email (.eml / .msg) | stdlib email / extract-msg | vlm (inline) | (mail kind) |
| Calendar (.ics) | icalendar | | (calendar kind) |
| Images (png/jpeg/gif/webp/tiff/emf) | Pillow (+ LibreOffice for EMF/WMF) | vlm | |

Two controls in that table are configurable per format — see parser.images.* and curator.mode.*, where each value is explained in full. A third, ingest.formats.*, is a master on/off switch for each format across every source. This page is the format-by-format view; the config page is the value-by-value view.

The thing that separates OMem from “convert to text and lose everything visual” tools is that it keeps images and describes them.

  • PDF uses pymupdf4llm with layout preservation. Embedded images run through OCR by default (PDFs are mostly text, so per-page vision would burn tokens for little gain); a page with almost no extractable text is treated as scanned and flagged for vision instead.
  • DOCX / PPTX / XLSX parse with the deterministic Office libraries, and embedded images default to vlm — a vision model writes a description in place. Slide pictures, chart images, and embedded screenshots become searchable text rather than vanishing.
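
The scanned-page check for PDFs can be pictured as a simple text-density test. The sketch below is illustrative only: the function name `flag_scanned_pages` and the `min_chars` threshold are assumptions, since this page does not state how "almost no extractable text" is measured.

```python
from typing import List, Sequence


def flag_scanned_pages(page_texts: Sequence[str], min_chars: int = 50) -> List[int]:
    """Return the indexes of pages that look scanned.

    A page counts as scanned when its extracted text, ignoring whitespace,
    is shorter than min_chars. The threshold of 50 is a placeholder, not a
    documented OMem value.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        visible_chars = len("".join(text.split()))
        if visible_chars < min_chars:
            flagged.append(index)
    return flagged
```

Pages returned by a check like this would be the ones handed to the vision model, while every other page stays on the cheaper OCR path.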

Markdown, text, and data files — left alone

Markdown, plain text, CSV, TSV, and JSON are passed through verbatim — no LLM rewrites the body. They’re already structured, so the curator runs in frontmatter-only mode: it writes an abstract and tags, and copies the body byte-for-byte. (A counter-intuitive payoff: the verbatim body keeps exact numbers and structure that a full rewrite might smooth over.)
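
To make the frontmatter-only idea concrete, here is a minimal sketch of a page writer that adds metadata and leaves the body untouched. The function name and the frontmatter keys `abstract` and `tags` are assumptions chosen for illustration; the real page schema is not shown on this page.

```python
import json
from typing import List


def wrap_with_frontmatter(body: bytes, abstract: str, tags: List[str]) -> bytes:
    """Prepend a small frontmatter block and append the body unchanged."""
    # JSON strings and lists are also valid YAML, so json.dumps gives safe quoting.
    header = (
        "---\n"
        f"abstract: {json.dumps(abstract)}\n"
        f"tags: {json.dumps(tags)}\n"
        "---\n"
    )
    # The body is concatenated as raw bytes: no decoding, no newline normalization.
    return header.encode("utf-8") + body
```

Working on bytes rather than decoded text is the design choice that matters here: line endings, trailing whitespace, and number formatting in the source file survive exactly.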

Email and calendar — rendered as one page

Email and calendar aren’t “files” in the file sense; they’re the mail and calendar kinds. An email thread is rendered into a single page (headers, body, inline images), and attachments are parsed recursively through this same matrix — a PDF attached to an email gets the PDF treatment. Calendar events render with time, location, organizer, and attendees.
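
The recursive attachment handling can be sketched with the standard library's `email` package. This is an assumption-laden illustration, not OMem's code: `PARSER_BY_EXT`, `route_attachments`, and the `max_depth` guard are hypothetical names, and the sketch only decides which row of the format table each attachment would go to.

```python
from email import message_from_bytes, policy
from pathlib import PurePath
from typing import List, Tuple

# Hypothetical mapping from file extension to a row of the format table.
PARSER_BY_EXT = {
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx", ".xlsx": "xlsx",
    ".md": "markdown", ".txt": "text", ".html": "html",
    ".eml": "mail", ".ics": "calendar", ".png": "image",
}


def route_attachments(raw: bytes, depth: int = 0, max_depth: int = 5) -> List[Tuple[int, str, str]]:
    """List (depth, filename, format) for every attachment, descending into .eml files."""
    msg = message_from_bytes(raw, policy=policy.default)
    routed = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        fmt = PARSER_BY_EXT.get(PurePath(name).suffix.lower(), "unsupported")
        routed.append((depth, name, fmt))
        if fmt == "mail" and depth < max_depth:
            payload = part.get_content()
            # message/rfc822 parts come back as message objects, other parts as bytes.
            if hasattr(payload, "as_bytes"):
                payload = payload.as_bytes()
            if isinstance(payload, bytes):
                routed.extend(route_attachments(payload, depth + 1, max_depth))
    return routed
```

The depth guard is there so a message that nests copies of itself cannot recurse without bound.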

A standalone image file is described by the vision model. EMF and WMF (the vector formats Office loves to embed) are converted to PNG first: Pillow attempts the conversion, and LibreOffice’s headless soffice is the reliable fallback, so those images don’t become dead ends.
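
For readers who want to reproduce the fallback step by hand, the sketch below shells out to LibreOffice's standard `--headless --convert-to` interface. The helper names are hypothetical and the Pillow attempt is not shown; this is one way to run the conversion, not necessarily how OMem invokes it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def soffice_png_command(src: Path, out_dir: Path, soffice: str = "soffice") -> List[str]:
    """Build the LibreOffice command that converts one file to PNG."""
    return [soffice, "--headless", "--convert-to", "png", "--outdir", str(out_dir), str(src)]


def convert_vector_to_png(src: Path, out_dir: Path) -> Path:
    """Convert an EMF/WMF file to PNG with headless LibreOffice."""
    exe = shutil.which("soffice")
    if exe is None:
        raise FileNotFoundError("soffice was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # The 120 second timeout is an arbitrary safety limit for this sketch.
    subprocess.run(soffice_png_command(src, out_dir, exe), check=True, timeout=120)
    # LibreOffice writes <stem>.png into the output directory.
    return out_dir / (src.stem + ".png")
```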

When parser.images.<fmt> is vlm, an image is sent to the vision model only if it’s worth it. OMem filters out decoration first, so you don’t pay to describe icons and spacers:

| Context | Sent to the vision model when… |
| --- | --- |
| Standalone image file | always (unless either side < 100 px) |
| PowerPoint slide image | displayed size ≥ 80 px per edge and not a sub-200×200 stamp |
| Word paragraph image | always (a picture on its own line is content) |
| Word inline image | both sides ≥ 200 px (smaller = icon) |
| Excel cell image | always (Excel images are rarely decorative) |
| Email/calendar inline image | ≥ 5 KB or ≥ 100×100 px (skips tracking pixels & signature art) |
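
The filter rules above can be read as a single predicate. The sketch below is an interpretation, not OMem's implementation: the `ImageInfo` type and the context names are hypothetical, "sub-200×200 stamp" is assumed to mean both pixel dimensions are under 200, and 5 KB is assumed to mean 5 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageInfo:
    context: str                      # e.g. "standalone", "pptx_slide", "docx_inline"
    width_px: int
    height_px: int
    size_bytes: int = 0
    shown_w_px: Optional[int] = None  # displayed size, used for PowerPoint only
    shown_h_px: Optional[int] = None


def worth_describing(img: ImageInfo) -> bool:
    """Return True when the image would be sent to the vision model."""
    w, h = img.width_px, img.height_px
    if img.context == "standalone":
        return min(w, h) >= 100
    if img.context == "pptx_slide":
        shown_w = img.shown_w_px if img.shown_w_px is not None else w
        shown_h = img.shown_h_px if img.shown_h_px is not None else h
        is_stamp = w < 200 and h < 200  # assumed reading of "sub-200x200 stamp"
        return min(shown_w, shown_h) >= 80 and not is_stamp
    if img.context in ("docx_paragraph", "xlsx_cell"):
        return True
    if img.context == "docx_inline":
        return min(w, h) >= 200
    if img.context == "mail_inline":
        return img.size_bytes >= 5 * 1024 or (w >= 100 and h >= 100)
    raise ValueError(f"unknown image context: {img.context}")
```

Under these assumptions a 1×1 tracking pixel in an email is skipped, while a small but heavy inline image of 6,000 bytes still gets described.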

OCR (the default for PDFs) uses RapidOCR, which handles mixed Chinese/English text well. To keep memory bounded on image-heavy documents, OCR runs in a subprocess that restarts every parser.ocr_subprocess_batch images, and a per-document cap (parser.max_images_per_doc, default 200) skips the long tail rather than ballooning memory — anything skipped is logged, never silently dropped.
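
A rough model of how the batch size and the per-document cap interact is sketched below. The function `plan_ocr_batches` is hypothetical; it only plans the work, with each returned batch standing for one OCR subprocess lifetime. The default for parser.ocr_subprocess_batch is not stated on this page, so the batch size is a required argument.

```python
import logging
from typing import List, Sequence, Tuple

log = logging.getLogger("ocr_plan")


def plan_ocr_batches(
    images: Sequence[str], batch_size: int, max_images: int = 200
) -> Tuple[List[List[str]], List[str]]:
    """Split a document's images into OCR batches and apply the per-document cap."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    kept = list(images[:max_images])
    skipped = list(images[max_images:])
    for name in skipped:
        # Skipped images are logged rather than silently dropped.
        log.warning("skipping OCR for %s: per-document cap of %d reached", name, max_images)
    batches = [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
    return batches, skipped
```

With 205 images, a batch size of 50, and the default cap of 200, this yields four batches of 50 and five logged skips.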

To stop ingesting a format everywhere — say you never want spreadsheets indexed:

omem config set ingest.formats.xlsx false

To keep ingesting a format but stop describing its images (e.g. skip email signature art):

omem config set parser.images.mail off

Both take effect on the next run. Existing pages aren’t deleted — they’re just not refreshed for that format.