Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply knows what it has read.
File detected (watchdog)
│
▼
[detector.py]
Pythonic structure analysis via PyMuPDF:
- Font size variance → heading detection
- Bold flags + positioning → chapter boundaries
- Flat if no structural signals found
│
├─── STRUCTURED PATH ──────────────────────────┐
│ Extract: book title, chapters, paragraphs │
│ Summarize: book (1 Groq call) │
│ Summarize: each chapter (N Groq calls) │
│ Chunk: paragraphs → content memories │
│ │
└─── FLAT PATH ─────────────────────────────────┤
Semantic/sliding window chunking │
Summarize: whole doc (1-3 Groq calls) │
Chunk: paragraphs → content memories │
│
▼
[mem0_writer.py]
POST /memories (layered)
│
▼
[manifest.py]
Save manifest JSON
Every memory POSTed to mem0 carries structured metadata:
```json
{
  "messages": [{"role": "user", "content": "<memory text>"}],
  "agent_id": "knowledge_base",
  "metadata": {
    "source_file": "sapiens.pdf",
    "source_type": "book",
    "memory_type": "chapter_summary",
    "chapter": 4,
    "chapter_title": "The Storytelling Animal",
    "page_start": 67,
    "page_end": 71,
    "ingested_at": "2026-03-11T10:00:00Z"
  }
}
```
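A minimal mem0_writer.py sketch that builds and POSTs this payload. The endpoint path and field names come from the example above; the base URL and function names are assumptions for a self-hosted mem0 server.

```python
import datetime

import requests

MEM0_URL = "http://localhost:8000"  # assumed self-hosted mem0 server


def build_payload(text: str, metadata: dict, agent_id: str = "knowledge_base") -> dict:
    """Assemble the layered-memory payload; stamps ingested_at in UTC."""
    ingested_at = (
        datetime.datetime.now(datetime.timezone.utc)
        .isoformat(timespec="seconds")
        .replace("+00:00", "Z")
    )
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": agent_id,
        "metadata": {**metadata, "ingested_at": ingested_at},
    }


def write_memory(text: str, metadata: dict) -> dict:
    """POST one memory to mem0 and return the server's JSON reply."""
    resp = requests.post(
        f"{MEM0_URL}/memories", json=build_payload(text, metadata), timeout=30
    )
    resp.raise_for_status()
    return resp.json()
```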
| memory_type | Count per doc | Purpose |
|---|---|---|
| book_summary | 1 | High-level overview, broad questions |
| chapter_summary | N (structured docs) | Mid-level recall by topic |
| content | M | Specific facts, quotes, details |
| Module | Role | LLM? |
|---|---|---|
| watchdog_runner.py | Watches inbox/, triggers pipeline | No |
| pipeline.py | Orchestrates the full flow | No |
| detector.py | Detects document structure via PyMuPDF | No |
| chunker.py | Splits text into token-sized chunks | No |
| summarizer.py | Generates summaries via Groq/Llama 4 | ✅ Yes |
| mem0_writer.py | POSTs memories to mem0 REST API | No |
| manifest.py | Tracks ingested files and memory IDs | No |
| config.py | Loads .env, exposes typed settings | No |
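The watcher side of watchdog_runner.py can be sketched with the watchdog library's event API. The handler-with-callback shape is my assumption; hooking it to the real pipeline is left as the commented hand-off.

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class InboxHandler(FileSystemEventHandler):
    """Invoke a callback for every new PDF dropped into the watched folder."""

    def __init__(self, callback):
        self.callback = callback  # e.g. pipeline.run (hypothetical hand-off)

    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith(".pdf"):
            self.callback(event.src_path)


def watch(path: str = "inbox/") -> None:
    """Block forever, dispatching each new PDF to the handler's callback."""
    observer = Observer()
    observer.schedule(InboxHandler(print), path, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```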
Rule: Only summarizer.py calls an LLM. Everything else is pure Python.
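That single LLM touchpoint can be sketched with the official Groq SDK. The model id, prompt wording, and function names are assumptions; read the actual model from config, since Groq's catalog names change.

```python
import os


def build_messages(text: str, scope: str) -> list[dict]:
    """Pure helper: assemble the chat messages for one summary call."""
    return [
        {"role": "system",
         "content": f"Summarize the following {scope} in 3-5 dense sentences."},
        {"role": "user", "content": text},
    ]


def summarize(text: str, scope: str = "chapter", max_tokens: int = 400) -> str:
    """One chat completion per summary; the rest of the pipeline stays pure Python."""
    from groq import Groq  # imported lazily so non-LLM modules never need the SDK

    client = Groq()  # reads GROQ_API_KEY from the environment
    model = os.getenv("GROQ_MODEL", "meta-llama/llama-4-scout-17b-16e-instruct")
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(text, scope),
        max_tokens=max_tokens,
        temperature=0.3,
    )
    return resp.choices[0].message.content.strip()
```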
Estimated cost per ~300-page book using Groq/Llama 4:
| Operation | Calls | Input tokens | Output tokens |
|---|---|---|---|
| Book summary | 1 | ~2,000 | ~500 |
| Chapter summaries (20 ch) | 20 | ~20,000 | ~6,000 |
| Flat doc summary | 1–3 | ~6,000 | ~1,500 |
| Total (structured) | ~21 | ~22,000 | ~6,500 |
At Groq free tier rates: effectively $0.00 for most books.
Example manifest, saved as `books/manifests/sapiens_2026-03-11.json`:

```json
{
  "source_file": "sapiens.pdf",
  "ingested_at": "2026-03-11T10:23:00Z",
  "document_type": "structured",
  "chapters_detected": 20,
  "memories_created": {
    "book_summary": 1,
    "chapter_summary": 20,
    "content": 187
  },
  "mem0_memory_ids": ["abc123", "def456", "..."],
  "status": "complete"
}
```
The manifest enables clean deletion: purge all mem0_memory_ids to fully remove a book from memory.
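That purge can be sketched as a loop over the manifest's ids, assuming the mem0 server exposes `DELETE /memories/{id}` and the same base URL as the writer. The function names are illustrative.

```python
import json

import requests

MEM0_URL = "http://localhost:8000"  # assumed self-hosted mem0 server


def load_memory_ids(manifest_path: str) -> list[str]:
    """Pure helper: read the mem0_memory_ids list out of a manifest file."""
    with open(manifest_path) as f:
        return json.load(f)["mem0_memory_ids"]


def purge_book(manifest_path: str) -> int:
    """Delete every memory the manifest recorded; return how many succeeded."""
    deleted = 0
    for memory_id in load_memory_ids(manifest_path):
        resp = requests.delete(f"{MEM0_URL}/memories/{memory_id}", timeout=30)
        if resp.ok:
            deleted += 1
    return deleted
```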
- detector.py — structure detection
- chunker.py — token-aware chunking
- summarizer.py — Groq/Llama 4 summarization
- mem0_writer.py — mem0 REST integration
- manifest.py — ingestion tracking
- pipeline.py — full orchestration
- watchdog_runner.py — folder watcher + Rich terminal UI
- Dockerfile
- docker-compose.yml with books/ volume mount
- `book-ingestor delete sapiens.pdf` — remove a book's memories
- `book-ingestor list` — show all ingested books

Dependencies:

```
pymupdf        # PDF text + structure extraction
watchdog       # Folder monitoring
groq           # Groq Python SDK
tiktoken       # Token counting (no LLM)
requests       # mem0 REST calls
python-dotenv  # .env loading
rich           # Terminal UI / progress display
```