
Ingest Pipeline

Summary

The ingest operation is the most important operation in the LLM Wiki pattern — it's where knowledge compounds. When a new source is added, a five-step pipeline reads it, routes to relevant pages, synthesizes updates, embeds, and updates the index and log.

Five-Step Pipeline

Step 0: Resolve Source

Determine the source type and extract text:

  • Local files — read directly (PDFs, markdown, text)
  • YouTube URLs — extract transcript
  • HTTP URLs — fetch and strip to clean text

The system accepts the same ingest command regardless of source type.
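The dispatch above can be sketched as a small classifier. This is a minimal sketch, not the pattern's actual implementation; the function name `resolve_source` and its return labels are illustrative:

```python
from pathlib import Path
from urllib.parse import urlparse

def resolve_source(source: str) -> str:
    """Classify an ingest argument as 'youtube', 'url', or 'file'.

    The caller then hands off to the matching extractor
    (transcript fetcher, HTML stripper, or direct file read).
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if "youtube.com" in host or "youtu.be" in host:
            return "youtube"
        return "url"
    # Anything without an http(s) scheme is treated as a local path
    return "file"
```

Because classification happens up front, the same ingest command works for all three source types.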

Step 1: Route

The LLM reads a compact summary of the schema (one line per page: slug: title — description) alongside the source text, and returns a JSON array of the slugs that are genuinely relevant.

Critical for cost control: without routing, every ingest triggers synthesis for every page. With routing, only relevant pages are touched.
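A sketch of the routing step, assuming pages are held as a dict of `{slug: {"title": ..., "description": ...}}` (an assumed shape, not the pattern's spec). Building the one-line-per-page summary and validating the model's JSON reply are the deterministic halves around the single LLM call:

```python
import json

def schema_summary(pages: dict[str, dict]) -> str:
    """One line per page: 'slug: title - description', sent alongside the source text."""
    return "\n".join(
        f"{slug}: {p['title']} - {p['description']}" for slug, p in pages.items()
    )

def parse_routing(reply: str, known_slugs: set[str]) -> list[str]:
    """Parse the model's JSON array of slugs, dropping anything it hallucinated."""
    try:
        slugs = json.loads(reply)
    except json.JSONDecodeError:
        return []
    return [s for s in slugs if s in known_slugs]
```

Filtering against `known_slugs` matters: a routing reply that names a nonexistent page should not trigger a synthesis call.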

Step 2: Synthesize

For each relevant slug, the LLM receives the existing page body alongside the new source text and rewrites the complete page.

Key invariant: "Preserve and extend existing content — never discard information already on the page." This is what makes knowledge compound rather than overwrite. Each subsequent source makes the page richer, not just different.
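One way to encode the invariant is directly in the synthesis prompt. A hypothetical prompt builder, assuming the exact wording is up to the implementer:

```python
def synthesis_prompt(slug: str, existing_body: str, source_text: str) -> str:
    """Build the per-page rewrite prompt for the synthesis call."""
    return (
        f"Rewrite the wiki page '{slug}' in full, integrating the new source.\n"
        "Preserve and extend existing content - never discard information "
        "already on the page.\n\n"
        f"EXISTING PAGE:\n{existing_body}\n\n"
        f"NEW SOURCE:\n{source_text}\n"
    )
```

Rewriting the complete page (rather than emitting a diff) is what lets the invariant be stated once in the prompt instead of enforced by merge logic.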

Step 3: Embed

The updated page is re-embedded using an embedding model (e.g., OpenAI text-embedding-3-small, 1536-dimensional vectors). The embedding index is upserted in place, keeping the vector store in sync with the wiki.
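The upsert itself is simple if the index is keyed by slug. A sketch with the embedding call injected as a function, so the example stays independent of any one provider's API:

```python
from typing import Callable

def upsert_embedding(
    index: dict[str, list[float]],
    slug: str,
    body: str,
    embed: Callable[[str], list[float]],
) -> None:
    """Re-embed one page and overwrite its vector in place.

    Keying by slug means a page updated by five sources still has
    exactly one vector, always computed from the latest body.
    """
    index[slug] = embed(body)
```

In practice `embed` would wrap a real model call (e.g., OpenAI's text-embedding-3-small, which returns 1536-dimensional vectors).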

Step 4: Update Index and Log

  • Regenerate index.md table to reflect newly updated pages
  • Append a timestamped entry to log.md recording which slugs were touched and from what source
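Both bookkeeping steps are plain string generation. A sketch, assuming a two-column index table and a one-line log format (both illustrative, not prescribed by the pattern):

```python
from datetime import datetime, timezone

def index_table(pages: dict[str, str]) -> str:
    """Regenerate the index.md table from {slug: description}."""
    rows = ["| Page | Description |", "| --- | --- |"]
    rows += [f"| [{slug}]({slug}.md) | {desc} |" for slug, desc in sorted(pages.items())]
    return "\n".join(rows)

def log_entry(slugs: list[str], source: str) -> str:
    """One timestamped log.md line: which slugs were touched, from what source."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    return f"- {ts} ingested {source} -> updated {', '.join(slugs)}"
```

The index is regenerated wholesale each time (it is cheap and avoids drift), while the log is append-only so the ingest history is never rewritten.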

Cost Considerations

  • Each source triggers one routing call plus one synthesis call per relevant page — at least 2 Claude API calls per ingest
  • For a 50-paper corpus with 10 wiki pages → ~100-200 API calls
  • Prompt caching on system prompts reduces costs by ~90% on repeated operations
  • Optimized for deliberate, curated knowledge — not bulk document ingestion
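The 50-paper estimate above follows from a simple formula. A back-of-envelope helper (the function name and parameters are illustrative):

```python
def estimate_calls(n_sources: int, avg_relevant_pages: float) -> int:
    """Rough API-call count: 1 routing call per source,
    plus 1 synthesis call per relevant page routed to."""
    return int(n_sources * (1 + avg_relevant_pages))
```

With 50 papers and 1-3 relevant pages per paper on average, this gives the ~100-200 call range quoted above.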

Manual Compilation Prompt (No Code)

For Claude.ai users without a code pipeline:

"Read these papers and create entity pages in wiki/. For each key concept create a markdown file with a summary, explanation, related links using brackets, and note any contradictions between papers."

For subsequent ingests:

"A new source has been added. Read it alongside the existing wiki pages. Update any existing entity pages affected by this new source. Create new entity pages for any new concepts introduced. Flag any contradictions with previously compiled knowledge."

See Also