
The ingest lifecycle

Read this if you want to understand what OMem is actually doing every few minutes — how a file becomes a wiki page, why a re-run is usually instant, why one flag can get expensive, and what happens when you delete something.

When OMem runs, each new or changed item flows through a four-stage pipeline: parse, vlm + ocr, curate, index. The stages below follow one file from a binary blob to a page your AI can find:

parse

It reads what other tools quietly give up on.

Your work hides inside PowerPoint, Excel, scanned PDFs — formats that don’t turn into clean text on their own. Most converters silently drop the hardest parts: the chart embedded in a spreadsheet, the merged cells, the scanned page. OMem gives each format its own parser, precisely to keep what others lose — and it’s deterministic, so the same file reads the same three years from now.

vlm + ocr

A picture can’t live in a text wiki. So OMem writes down what it sees.

The wiki is plain text — that’s what lets you read it, grep it, version it. But a chart or a scanned page is pixels; it can’t sit inside plain text. So OMem reads every picture and turns it into words: OCR transcribes text-bearing images like scans and screenshots; a vision model describes charts and diagrams. What the picture held now lives in the wiki, and is searchable like everything else.

curate

It tidies — but never rewrites your numbers.

Raw parsed text is still messy. An LLM shapes it into one clean page: a one-line summary, a readable body, tags — while keeping every number and proper noun exactly as written. It will never quietly “round” an 11.3% into 11%. The page is saved as Markdown you can open, edit, and version — and the work is cached, so unchanged items never pay for it twice.
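
The "never rewrites your numbers" guarantee is something you can spot-check yourself. The sketch below is not OMem's code; `missing_numbers` and its regular expression are hypothetical, and it assumes you have the parsed text and the curated page as plain strings. It lists every numeric token in the source that does not appear verbatim in the curated page.

```python
import re

# Hypothetical pattern: integers, decimals, thousands separators, optional percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")

def missing_numbers(source: str, curated: str) -> list:
    """Return numeric tokens from `source` that are absent from `curated`."""
    curated_numbers = set(NUMBER.findall(curated))
    return [n for n in NUMBER.findall(source) if n not in curated_numbers]

source = "Margin rose to 11.3% in Q3"
faithful = "Margin: 11.3% (Q3)"
rounded = "Margin rose to 11% in Q3"

print(missing_numbers(source, faithful))  # nothing was dropped
print(missing_numbers(source, rounded))   # the rounded figure is flagged
```

A check like this only covers numbers; proper nouns would need a separate comparison.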

index

The moment it becomes findable.

Until now the page just sat on disk — written, but unreachable. This step adds it to the search index, and in that instant your agent’s query can find it. The index is only an opinion laid over the wiki: delete it and it rebuilds from the Markdown. The wiki is the truth.

A few steps behave differently per kind (mail aggregates a thread into one page; calendar expands recurring events) — see kinds and sources.

Two things in this pipeline are worth understanding properly, because they explain OMem’s day-to-day behavior: why a re-run is cheap, and why one flag isn’t.

Why re-runs are cheap: content addressing + caching


OMem runs every few minutes, but it almost never repeats work. Two mechanisms make that true:

  • Content addressing (the raw stage): each item is fingerprinted by its content (SHA-256). If the bytes haven’t changed, the fingerprint matches what’s already on disk, and the rest of the pipeline is skipped entirely.
  • Curation cache: the expensive step — the LLM call that writes the wiki page — is cached by an input hash. Add a hundred new documents and only those hundred touch the LLM. The other ten thousand are untouched.
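
The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not OMem's implementation: the `Pipeline` class, its in-memory dictionaries, and the fake "LLM call" are all assumptions made for the example. Only the use of SHA-256 as the fingerprint comes from the text.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()

class Pipeline:
    """Hypothetical stand-in for the raw stage plus the curation cache."""

    def __init__(self):
        self.raw = {}             # fingerprint -> parsed text
        self.curation_cache = {}  # input hash -> curated page
        self.llm_calls = 0

    def curate(self, parsed: str) -> str:
        key = fingerprint(parsed.encode("utf-8"))
        if key not in self.curation_cache:
            self.llm_calls += 1   # the expensive step happens only on a miss
            self.curation_cache[key] = "# page\n\n" + parsed
        return self.curation_cache[key]

    def ingest(self, data: bytes) -> bool:
        """Return True if the item was processed, False if it was skipped."""
        fp = fingerprint(data)
        if fp in self.raw:
            return False          # same bytes: skip the rest of the pipeline
        parsed = data.decode("utf-8")  # stand-in for a real parser
        self.raw[fp] = parsed
        self.curate(parsed)
        return True

pipeline = Pipeline()
pipeline.ingest(b"quarterly report")
pipeline.ingest(b"quarterly report")   # unchanged bytes, skipped
pipeline.ingest(b"new document")
print(pipeline.llm_calls)
```

Three ingests produce two LLM calls here, because the second ingest matches an existing fingerprint and never reaches curation.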

For an inbox of tens of thousands of items, that’s the difference between $10 and $5,000 in LLM spend. The cache is what makes a continuous, every-few-minutes loop economically sane.

Bootstrap vs. incremental, and what a “cursor” is

The first run is a bootstrap: there’s no cursor yet, so the source discovers everything in scope and ingests it. Every run after that is incremental: a per-source cursor records where ingestion left off (a modification-time watermark for files; a sequence position for mail/calendar). The next run only looks at items past that watermark, so a 10,000-file folder with nothing changed exits in well under a second.

The cursor also tracks a failed set — items that errored last time (an LLM rate-limit, a transient read failure) are retried on the next run even if they’re “old”, so a blip doesn’t permanently orphan an item.
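
As a rough sketch of that selection logic for a file source: a missing cursor means bootstrap, otherwise only items past the watermark plus the failed set are picked up. The `Cursor` dataclass, `select_items`, and `run` are hypothetical names for illustration; OMem's actual cursor format is not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class Cursor:
    watermark: float = 0.0                    # newest modification time seen
    failed: set = field(default_factory=set)  # items that errored last run

def select_items(mtimes: Dict[str, float], cursor: Optional[Cursor]) -> List[str]:
    """Pick what this run should look at."""
    if cursor is None:                        # bootstrap: everything in scope
        return sorted(mtimes)
    changed = {p for p, m in mtimes.items() if m > cursor.watermark}
    retry = {p for p in cursor.failed if p in mtimes}
    return sorted(changed | retry)

def run(mtimes: Dict[str, float], cursor: Optional[Cursor],
        process: Callable[[str], None]) -> Tuple[Cursor, List[str]]:
    """Process the selected items and return the next cursor."""
    todo = select_items(mtimes, cursor)
    failed = set()
    for path in todo:
        try:
            process(path)
        except Exception:
            failed.add(path)                  # retried next run, even if "old"
    watermark = max(mtimes.values(), default=0.0)
    if cursor is not None:
        watermark = max(watermark, cursor.watermark)
    return Cursor(watermark=watermark, failed=failed), todo
```

With nothing changed and an empty failed set, `select_items` returns an empty list, which is why an unchanged folder exits almost immediately.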

(There is no --bootstrap flag; that concept was retired. Bootstrap is simply what the first run does when no cursor exists, and scope is always read from your config.)

omem ingest --now forgets the cursor and re-scans everything. It sounds harmless — and if nothing changed, content addressing and the curation cache keep it cheap. But here’s the trap: re-scanning recomputes each item’s content hash, and for mail/calendar that hash includes the item’s metadata (subject line, dates, attendees). Change even one of those and the hash changes → the curation cache misses → the LLM is called again.

On a corpus of thousands of items where many have shifting metadata, a single --now can mean hundreds to thousands of LLM calls — several dollars to tens of dollars. You almost never need it; the normal incremental loop already catches changes.
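
The cache miss described above is easy to reproduce in miniature. The sketch below assumes the hash covers the body plus a metadata dictionary serialized as sorted JSON; `item_hash`, `CurationCache`, and the sample invite are hypothetical, and OMem's real hashing scheme may differ.

```python
import hashlib
import json

def item_hash(body: str, metadata: dict) -> str:
    """Hash the body together with its metadata (assumed serialization)."""
    payload = json.dumps({"body": body, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CurationCache:
    def __init__(self):
        self.pages = {}
        self.llm_calls = 0

    def curate(self, body: str, metadata: dict) -> str:
        key = item_hash(body, metadata)
        if key not in self.pages:
            self.llm_calls += 1               # cache miss: pay for the LLM
            self.pages[key] = "curated: " + body
        return self.pages[key]

cache = CurationCache()
invite = {"subject": "Q3 review", "attendees": ["ana", "bo"]}

cache.curate("Agenda attached.", invite)      # first sight: miss
cache.curate("Agenda attached.", invite)      # unchanged: hit
changed = {**invite, "attendees": ["ana", "bo", "cy"]}
cache.curate("Agenda attached.", changed)     # one new attendee: miss
print(cache.llm_calls)
```

The body text never changed, yet the added attendee alone forces a second LLM call.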

When a file disappears from your disk (or an email is deleted), OMem does not delete the wiki page. It tombstones it, and if the source comes back, the page revives. For example, while Q3-budget-review.pptx is on disk, its wiki page is live and queryable.

This is deliberate safety. If a sync hiccup or an accidental rm made a hundred files vanish, you don’t want a hundred wiki pages instantly destroyed. Tombstoning keeps them, drops them out of normal query results, and waits. Only omem lint --orphans --purge ever truly removes them — and that’s your explicit call.
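
The lifecycle can be modeled as a small state machine: live, tombstoned, revived, and removed only by an explicit purge. The `Wiki` class and its method names below are hypothetical and exist only to illustrate the behavior described here; they are not OMem's API.

```python
from enum import Enum

class State(Enum):
    LIVE = "live"
    TOMBSTONED = "tombstoned"

class Wiki:
    """Hypothetical model of page states, keyed by source name."""

    def __init__(self):
        self.pages = {}

    def sync(self, present_sources: set) -> None:
        for src in present_sources:
            self.pages[src] = State.LIVE            # new page, or a revival
        for src in self.pages:
            if src not in present_sources:
                self.pages[src] = State.TOMBSTONED  # kept, but hidden

    def query(self) -> list:
        """Normal queries see only live pages."""
        return sorted(s for s, st in self.pages.items() if st is State.LIVE)

    def purge_orphans(self) -> list:
        """Explicit removal, analogous in spirit to the purge command."""
        gone = sorted(s for s, st in self.pages.items() if st is State.TOMBSTONED)
        for s in gone:
            del self.pages[s]
        return gone

wiki = Wiki()
wiki.sync({"Q3-budget-review.pptx"})   # file on disk: page is live
wiki.sync(set())                       # file vanished: page is tombstoned
wiki.sync({"Q3-budget-review.pptx"})   # file is back: page revives
print(wiki.query())
```

Nothing in `sync` ever deletes an entry; only `purge_orphans` does, which mirrors the point that removal is always an explicit decision.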

The three storage layers underneath all this

Everything above sits on three layers, in order of authority:

  • raw/ — the immutable, content-addressed archive of parser output (parsed.md + extracted assets). Never deleted. The deterministic parser means the same file produces the same parsed.md years from now — which is why this archive is trustworthy and why the curation cache can hit reliably.
  • wiki/ — the curated Markdown pages, generated from raw/. This is the truth: delete the indexes and they rebuild from here; delete the wiki and it rebuilds from raw/.
  • the index — FTS5 (or qmd) sitting on top of the wiki, purely to make queries fast. An opinion, not the source.
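
To make "an opinion, not the source" concrete, here is a toy index that is derived entirely from the wiki text and can be thrown away and rebuilt at any time. It uses a plain in-memory inverted index as a stand-in for FTS5; `build_index`, `search`, and the sample pages are assumptions for illustration only.

```python
import re
from collections import defaultdict

def build_index(wiki: dict) -> dict:
    """Build a token -> page-names index from page name -> Markdown text."""
    index = defaultdict(set)
    for name, text in wiki.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(name)
    return dict(index)

def search(index: dict, term: str) -> list:
    return sorted(index.get(term.lower(), set()))

# The wiki is the source of truth; the index is derived from it.
wiki = {
    "budget.md": "Q3 budget grew 11.3%",
    "notes.md": "Meeting notes on budget",
}

index = build_index(wiki)
print(search(index, "budget"))

index = None                      # "delete" the index
index = build_index(wiki)         # rebuild it from the Markdown alone
print(search(index, "q3"))
```

Because the rebuild needs nothing but the wiki contents, losing the index costs time, not data.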

History is kept too: re-ingest a changed file and the old parsed version is retained (omem raw get … --version N), not overwritten.

You’ve now seen the wiki get built. Next: the plugin architecture — the extension points (source, parser, index) that each stage of this pipeline plugs into, and why none of them lock you in.