
PROJECT.md — book-ingestor

Vision

Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply knows what it has read.


Architecture

Pipeline Overview

File detected (watchdog)
    │
    ▼
[detector.py]
    Pure-Python structure analysis via PyMuPDF:
    - Font size variance → heading detection
    - Bold flags + positioning → chapter boundaries
    - Flat if no structural signals found
    │
    ├─── STRUCTURED PATH ──────────────────────────┐
    │    Extract: book title, chapters, paragraphs  │
    │    Summarize: book (1 Groq call)              │
    │    Summarize: each chapter (N Groq calls)     │
    │    Chunk: paragraphs → content memories       │
    │                                               │
    └─── FLAT PATH ─────────────────────────────────┤
         Semantic/sliding window chunking           │
         Summarize: whole doc (1-3 Groq calls)      │
         Chunk: paragraphs → content memories       │
                                                    │
                                                    ▼
                                        [mem0_writer.py]
                                        POST /memories (layered)
                                                    │
                                                    ▼
                                        [manifest.py]
                                        Save manifest JSON
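
The detector's heading heuristic can be sketched as a pure function over the text spans PyMuPDF yields (via `page.get_text("dict")`). The span field names match PyMuPDF's output; the 30% size threshold and the `classify_spans` helper are illustrative assumptions, not the project's actual values:

```python
from statistics import median

# Bold bit in PyMuPDF's span "flags" field (real spans come from
# page.get_text("dict") in detector.py).
BOLD_FLAG = 1 << 4

def classify_spans(spans):
    """Label each span 'heading' or 'body' from font-size variance + bold flags.

    spans: list of dicts shaped like PyMuPDF spans: {"text", "size", "flags"}.
    An empty result signals a flat document (no structural cues found).
    """
    if not spans:
        return []
    body_size = median(s["size"] for s in spans)
    labels = []
    for s in spans:
        is_big = s["size"] > body_size * 1.3   # assumed 30% threshold
        is_bold = bool(s["flags"] & BOLD_FLAG)
        labels.append("heading" if (is_big or is_bold) else "body")
    # No headings at all -> route the document down the flat path
    return labels if "heading" in labels else []
```

A caller would map `heading` spans to chapter boundaries on the structured path, and fall back to sliding-window chunking when the list comes back empty.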

Memory Schema

Every memory POSTed to mem0 carries structured metadata:

{
  "messages": [{"role": "user", "content": "<memory text>"}],
  "agent_id": "knowledge_base",
  "metadata": {
    "source_file": "sapiens.pdf",
    "source_type": "book",
    "memory_type": "chapter_summary",
    "chapter": 4,
    "chapter_title": "The Storytelling Animal",
    "page_start": 67,
    "page_end": 71,
    "ingested_at": "2026-03-11T10:00:00Z"
  }
}
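
A minimal payload builder matching the schema above might look like this (the `build_memory` helper and its defaults are illustrative, not the actual `mem0_writer.py` API):

```python
from datetime import datetime, timezone

def build_memory(text, source_file, memory_type, **metadata):
    """Assemble one mem0 POST body following the schema above.

    Extra keyword args (chapter, page_start, ...) land in metadata.
    """
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": "knowledge_base",
        "metadata": {
            "source_file": source_file,
            "source_type": "book",
            "memory_type": memory_type,
            "ingested_at": datetime.now(timezone.utc)
                .strftime("%Y-%m-%dT%H:%M:%SZ"),
            **metadata,
        },
    }

# The writer would then POST this dict as JSON to the mem0 server:
# requests.post(f"{MEM0_URL}/memories", json=build_memory(...))
```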

Memory Types

| memory_type     | Count per doc       | Purpose                              |
|-----------------|---------------------|--------------------------------------|
| book_summary    | 1                   | High-level overview, broad questions |
| chapter_summary | N (structured docs) | Mid-level recall by topic            |
| content         | M                   | Specific facts, quotes, details      |

Module Responsibilities

| Module             | Role                                  | LLM?   |
|--------------------|---------------------------------------|--------|
| watchdog_runner.py | Watches inbox/, triggers pipeline     | No     |
| pipeline.py        | Orchestrates the full flow            | No     |
| detector.py        | Detects document structure via PyMuPDF| No     |
| chunker.py         | Splits text into token-sized chunks   | No     |
| summarizer.py      | Generates summaries via Groq/Llama 4  | ✅ Yes |
| mem0_writer.py     | POSTs memories to mem0 REST API       | No     |
| manifest.py        | Tracks ingested files and memory IDs  | No     |
| config.py          | Loads .env, exposes typed settings    | No     |

Rule: Only summarizer.py calls an LLM. Everything else is pure Python.
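
The typed-settings idea behind config.py could be sketched as follows (field names and defaults are assumptions; in the real module, python-dotenv's `load_dotenv()` would populate the environment from `.env` first):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Typed snapshot of the environment (illustrative field names)."""
    mem0_url: str
    groq_api_key: str
    inbox_dir: str

def load_settings(env=os.environ):
    # In the real config.py, python-dotenv would load .env before this runs.
    return Settings(
        mem0_url=env.get("MEM0_URL", "http://localhost:8000"),
        groq_api_key=env.get("GROQ_API_KEY", ""),
        inbox_dir=env.get("INBOX_DIR", "books/inbox"),
    )
```

A frozen dataclass keeps settings immutable for the lifetime of a run, which fits the "stateless between runs" principle below.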


Token Budget

Estimated cost per ~300-page book using Groq/Llama 4:

| Operation                 | Calls | Input tokens | Output tokens |
|---------------------------|-------|--------------|---------------|
| Book summary              | 1     | ~2,000       | ~500          |
| Chapter summaries (20 ch) | 20    | ~20,000      | ~6,000        |
| Flat doc summary          | 1–3   | ~6,000       | ~1,500        |
| Total (structured)        | ~21   | ~22,000      | ~6,500        |

At Groq free tier rates: effectively $0.00 for most books.
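
Budgets like these depend on counting tokens locally before any API call. A chunker sketch with an injectable token counter (in the project, tiktoken would supply `count_tokens`; the 400-token default here is an assumption):

```python
def chunk_paragraphs(paragraphs, count_tokens, max_tokens=400):
    """Greedily pack paragraphs into chunks of at most max_tokens.

    count_tokens: callable str -> int (tiktoken-backed in the real chunker.py).
    Paragraphs are never split; a chunk closes when the next one won't fit.
    """
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the counter is a plain callable, tests can pass a cheap whitespace tokenizer while production code passes `lambda t: len(enc.encode(t))` from tiktoken.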


Manifest Format

books/manifests/sapiens_2026-03-11.json

{
  "source_file": "sapiens.pdf",
  "ingested_at": "2026-03-11T10:23:00Z",
  "document_type": "structured",
  "chapters_detected": 20,
  "memories_created": {
    "book_summary": 1,
    "chapter_summary": 20,
    "content": 187
  },
  "mem0_memory_ids": ["abc123", "def456", "..."],
  "status": "complete"
}

The manifest enables clean deletion: purge all mem0_memory_ids to fully remove a book from memory.
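
The deletion flow the manifest enables could be sketched like this (the `DELETE /memories/{id}` route, the `purge_book` helper, and the injectable `delete_fn` are assumptions about the mem0 REST API, not confirmed behavior):

```python
import json
import urllib.request
from pathlib import Path

def purge_book(manifest_path, mem0_url, delete_fn=None):
    """Delete every memory listed in a manifest, then mark it purged.

    delete_fn(url) is injectable for testing; by default it issues an
    HTTP DELETE (assumed mem0 route: DELETE /memories/{id}).
    """
    if delete_fn is None:
        def delete_fn(url):
            req = urllib.request.Request(url, method="DELETE")
            urllib.request.urlopen(req)
    path = Path(manifest_path)
    manifest = json.loads(path.read_text())
    for memory_id in manifest["mem0_memory_ids"]:
        delete_fn(f"{mem0_url}/memories/{memory_id}")
    manifest["status"] = "purged"
    path.write_text(json.dumps(manifest, indent=2))
    return len(manifest["mem0_memory_ids"])
```

Updating `status` in place keeps the manifest as an audit trail instead of deleting it outright.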


Development Phases

Phase 1 — Core Pipeline (current)

  • Project structure & config
  • detector.py — structure detection
  • chunker.py — token-aware chunking
  • summarizer.py — Groq/Llama 4 summarization
  • mem0_writer.py — mem0 REST integration
  • manifest.py — ingestion tracking
  • pipeline.py — full orchestration
  • watchdog_runner.py — folder watcher + Rich terminal UI

Phase 2 — Extended Formats

  • Markdown and plain text ingestion
  • EPUB support
  • Scanned PDF OCR (via Tesseract or Llama 4 vision)

Phase 3 — Docker

  • Dockerfile
  • docker-compose.yml with books/ volume mount
  • Health check endpoint

Phase 4 — Management

  • CLI tool: book-ingestor delete sapiens.pdf
  • CLI tool: book-ingestor list — show all ingested books
  • Re-ingest on file change (hash-based deduplication)
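
The hash-based deduplication planned for Phase 4 could be as simple as the sketch below (SHA-256 and the `source_hash` manifest field are assumptions; the project may pick a different digest or field name):

```python
import hashlib

def file_hash(path, chunk_size=1 << 16):
    """SHA-256 of a file's bytes, streamed so large PDFs don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def needs_reingest(path, manifest):
    """Re-ingest only when the file's content actually changed."""
    return manifest.get("source_hash") != file_hash(path)
```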

Dependencies (planned)

pymupdf          # PDF text + structure extraction
watchdog         # Folder monitoring
groq             # Groq Python SDK
tiktoken         # Token counting (no LLM)
requests         # mem0 REST calls
python-dotenv    # .env loading
rich             # Terminal UI / progress display

Design Principles

  1. Python does the heavy lifting — structure detection, chunking, and file management are pure Python. LLMs only summarize.
  2. Token frugality — we never send more to Groq than necessary. Chunk boundaries are computed locally.
  3. Idempotent ingestion — manifests make it safe to re-run. Duplicate detection via file hash.
  4. Network-only coupling — the only external dependency at runtime is the mem0 server URL. No shared filesystem required.
  5. Docker-ready by design — folder paths are configurable, stateless between runs.