# PROJECT.md — book-ingestor

## Vision

Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply *knows* what it has read.

---

## Architecture

### Pipeline Overview

```
File detected (watchdog)
  │
  ▼
[detector.py] Pythonic structure analysis via PyMuPDF:
  - Font size variance → heading detection
  - Bold flags + positioning → chapter boundaries
  - Flat if no structural signals found
  │
  ├─── STRUCTURED PATH ──────────────────────────┐
  │    Extract: book title, chapters, paragraphs │
  │    Summarize: book (1 Groq call)             │
  │    Summarize: each chapter (N Groq calls)    │
  │    Chunk: paragraphs → content memories      │
  │                                              │
  └─── FLAT PATH ────────────────────────────────┤
       Semantic/sliding window chunking          │
       Summarize: whole doc (1–3 Groq calls)     │
       Chunk: paragraphs → content memories      │
  ┌────────────────────────────────────────────── ┘
  ▼
[mem0_writer.py] POST /memories (layered)
  │
  ▼
[manifest.py] Save manifest JSON
```

---

## Memory Schema

Every memory POSTed to mem0 carries structured metadata:

```json
{
  "messages": [{"role": "user", "content": ""}],
  "agent_id": "knowledge_base",
  "metadata": {
    "source_file": "sapiens.pdf",
    "source_type": "book",
    "memory_type": "chapter_summary",
    "chapter": 4,
    "chapter_title": "The Storytelling Animal",
    "page_start": 67,
    "page_end": 71,
    "ingested_at": "2026-03-11T10:00:00Z"
  }
}
```

### Memory Types

| `memory_type` | Count per doc | Purpose |
|---------------|---------------|---------|
| `book_summary` | 1 | High-level overview, broad questions |
| `chapter_summary` | N (structured docs) | Mid-level recall by topic |
| `content` | M | Specific facts, quotes, details |

---

## Module Responsibilities

| Module | Role | LLM? |
|--------|------|------|
| `watchdog_runner.py` | Watches `inbox/`, triggers pipeline | No |
| `pipeline.py` | Orchestrates the full flow | No |
| `detector.py` | Detects document structure via PyMuPDF | No |
| `chunker.py` | Splits text into token-sized chunks | No |
| `summarizer.py` | Generates summaries via Groq/Llama 4 | ✅ Yes |
| `mem0_writer.py` | POSTs memories to mem0 REST API | No |
| `manifest.py` | Tracks ingested files and memory IDs | No |
| `config.py` | Loads `.env`, exposes typed settings | No |

**Rule:** Only `summarizer.py` calls an LLM. Everything else is pure Python.

---

## Token Budget

Estimated cost per ~300-page book using Groq/Llama 4:

| Operation | Calls | Input tokens | Output tokens |
|-----------|-------|--------------|---------------|
| Book summary | 1 | ~2,000 | ~500 |
| Chapter summaries (20 ch) | 20 | ~20,000 | ~6,000 |
| Flat doc summary | 1–3 | ~6,000 | ~1,500 |
| **Total (structured)** | ~21 | ~22,000 | ~6,500 |

At Groq free-tier rates: effectively **$0.00** for most books.

---

## Manifest Format

`books/manifests/sapiens_2026-03-11.json`

```json
{
  "source_file": "sapiens.pdf",
  "ingested_at": "2026-03-11T10:23:00Z",
  "document_type": "structured",
  "chapters_detected": 20,
  "memories_created": {
    "book_summary": 1,
    "chapter_summary": 20,
    "content": 187
  },
  "mem0_memory_ids": ["abc123", "def456", "..."],
  "status": "complete"
}
```

The manifest enables clean **deletion**: purge all `mem0_memory_ids` to fully remove a book from memory.
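A memory body matching the schema above can be assembled by a small pure helper before `mem0_writer.py` POSTs it. This is a sketch only: `build_memory_payload` and its signature are hypothetical, not part of the codebase yet — only the payload shape comes from the schema.

```python
from datetime import datetime, timezone


def build_memory_payload(text: str, source_file: str, memory_type: str, **meta) -> dict:
    """Build a mem0 POST /memories body matching the project's memory schema.

    Hypothetical helper: the name, signature, and defaults are illustrative.
    Extra keyword args (e.g. chapter, chapter_title, page_start, page_end)
    are merged into the metadata block.
    """
    metadata = {
        "source_file": source_file,
        "source_type": "book",
        "memory_type": memory_type,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    metadata.update(meta)
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": "knowledge_base",
        "metadata": metadata,
    }
```

`mem0_writer.py` would then send each payload with `requests.post(f"{MEM0_URL}/memories", json=payload)`; the exact path and response shape depend on the mem0 server version in use.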
---

## Development Phases

### Phase 1 — Core Pipeline (current)

- [x] Project structure & config
- [ ] `detector.py` — structure detection
- [ ] `chunker.py` — token-aware chunking
- [ ] `summarizer.py` — Groq/Llama 4 summarization
- [ ] `mem0_writer.py` — mem0 REST integration
- [ ] `manifest.py` — ingestion tracking
- [ ] `pipeline.py` — full orchestration
- [ ] `watchdog_runner.py` — folder watcher + Rich terminal UI

### Phase 2 — Extended Formats

- [ ] Markdown and plain text ingestion
- [ ] EPUB support
- [ ] Scanned PDF OCR (via Tesseract or Llama 4 vision)

### Phase 3 — Docker

- [ ] `Dockerfile`
- [ ] `docker-compose.yml` with `books/` volume mount
- [ ] Health check endpoint

### Phase 4 — Management

- [ ] CLI tool: `book-ingestor delete sapiens.pdf`
- [ ] CLI tool: `book-ingestor list` — show all ingested books
- [ ] Re-ingest on file change (hash-based deduplication)

---

## Dependencies (planned)

```
pymupdf        # PDF text + structure extraction
watchdog       # Folder monitoring
groq           # Groq Python SDK
tiktoken       # Token counting (no LLM)
requests       # mem0 REST calls
python-dotenv  # .env loading
rich           # Terminal UI / progress display
```

---

## Design Principles

1. **Python does the heavy lifting** — structure detection, chunking, and file management are pure Python. LLMs only summarize.
2. **Token frugality** — we never send more to Groq than necessary. Chunk boundaries are computed locally.
3. **Idempotent ingestion** — manifests make it safe to re-run. Duplicate detection via file hash.
4. **Network-only coupling** — the only external dependency at runtime is the mem0 server URL. No shared filesystem required.
5. **Docker-ready by design** — folder paths are configurable, stateless between runs.