@@ -0,0 +1,181 @@
+# PROJECT.md — book-ingestor
+
+## Vision
+
+Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply *knows* what it has read.
+
+---
+
+## Architecture
+
+### Pipeline Overview
+
+```
+File detected (watchdog)
+        │
+        ▼
+[detector.py]
+  Pythonic structure analysis via PyMuPDF:
+    - Font size variance → heading detection
+    - Bold flags + positioning → chapter boundaries
+    - Flat if no structural signals found
+        │
+        ├─── STRUCTURED PATH ──────────────────────────┐
+        │    Extract: book title, chapters, paragraphs │
+        │    Summarize: book (1 Groq call)             │
+        │    Summarize: each chapter (N Groq calls)    │
+        │    Chunk: paragraphs → content memories      │
+        │                                              │
+        └─── FLAT PATH ────────────────────────────────┤
+             Semantic/sliding window chunking          │
+             Summarize: whole doc (1-3 Groq calls)     │
+             Chunk: paragraphs → content memories      │
+                                                       │
+                                                       ▼
+                                               [mem0_writer.py]
+                                                 POST /memories (layered)
+                                                       │
+                                                       ▼
+                                               [manifest.py]
+                                                 Save manifest JSON
+```
+
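The structure check described for `detector.py` could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the names `find_headings`, `classify_document`, and `load_spans`, and the 1.15 size ratio, are hypothetical. Each span is assumed to be a dict with `text`, `size`, and `flags` keys, which is the shape PyMuPDF's `page.get_text("dict")` produces; bit 4 of `flags` (value 16) marks bold text. The positioning signal mentioned in the diagram is left out for brevity.

```python
from statistics import median

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def find_headings(spans, size_ratio=1.15):
    """Return spans that look like headings.

    A span counts as a heading when its font size is clearly larger than the
    document's median size, or when it is bold and at least median-sized.
    The 1.15 ratio is an illustrative assumption, not a tuned value.
    """
    sizes = [s["size"] for s in spans if s["text"].strip()]
    if not sizes:
        return []
    body_size = median(sizes)
    headings = []
    for span in spans:
        if not span["text"].strip():
            continue  # ignore whitespace-only spans
        larger = span["size"] >= body_size * size_ratio
        bold = bool(span["flags"] & BOLD_FLAG) and span["size"] >= body_size
        if larger or bold:
            headings.append(span)
    return headings


def classify_document(spans):
    """Return "structured" when any heading signal exists, else "flat"."""
    return "structured" if find_headings(spans) else "flat"


def load_spans(pdf_path):
    """Collect text spans from a PDF with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    spans = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    spans.extend(line["spans"])
    return spans
```

Keeping the heuristics separate from the PDF loading makes them easy to exercise with hand-built spans. This simple rule would also flag inline bold words, so a real detector would likely need the positioning check as well.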
+---
+
+## Memory Schema
+
+Every memory POSTed to mem0 carries structured metadata:
+
+```json
+{
+  "messages": [{"role": "user", "content": "<memory text>"}],
+  "agent_id": "knowledge_base",
+  "metadata": {
+    "source_file": "sapiens.pdf",
+    "source_type": "book",
+    "memory_type": "chapter_summary",
+    "chapter": 4,
+    "chapter_title": "The Storytelling Animal",
+    "page_start": 67,
+    "page_end": 71,
+    "ingested_at": "2026-03-11T10:00:00Z"
+  }
+}
+```
+
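A payload like the one above could be assembled with a small helper. This is a sketch only: `build_memory_payload` and `post_memory` are hypothetical names, the `base_url` parameter is an assumption, and the response shape returned by the mem0 server is not specified in this document, so it is passed through unchanged.

```python
from datetime import datetime, timezone


def build_memory_payload(text, source_file, memory_type, **extra_metadata):
    """Assemble the JSON body for one memory, following the schema above."""
    metadata = {
        "source_file": source_file,
        "source_type": "book",
        "memory_type": memory_type,
        "ingested_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Optional fields such as chapter, chapter_title, page_start, page_end.
    metadata.update(extra_metadata)
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": "knowledge_base",
        "metadata": metadata,
    }


def post_memory(base_url, payload, timeout=30):
    """POST one payload to the mem0 server and return the decoded response."""
    import requests  # imported lazily so the builder needs only the stdlib

    response = requests.post(f"{base_url}/memories", json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

For a chapter summary, a call might look like `build_memory_payload(summary, "sapiens.pdf", "chapter_summary", chapter=4, chapter_title="The Storytelling Animal")`.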
+### Memory Types
+
+| `memory_type` | Count per doc | Purpose |
+|--------------|---------------|---------|
+| `book_summary` | 1 | High-level overview, broad questions |
+| `chapter_summary` | N (structured docs) | Mid-level recall by topic |
+| `content` | M | Specific facts, quotes, details |
+
+---
+
+## Module Responsibilities
+
+| Module | Role | LLM? |
+|--------|------|-------|
+| `watchdog_runner.py` | Watches `inbox/`, triggers pipeline | No |
+| `pipeline.py` | Orchestrates the full flow | No |
+| `detector.py` | Detects document structure via PyMuPDF | No |
+| `chunker.py` | Splits text into token-sized chunks | No |
+| `summarizer.py` | Generates summaries via Groq/Llama 4 | ✅ Yes |
+| `mem0_writer.py` | POSTs memories to mem0 REST API | No |
+| `manifest.py` | Tracks ingested files and memory IDs | No |
+| `config.py` | Loads `.env`, exposes typed settings | No |
+
+**Rule:** Only `summarizer.py` calls an LLM. Everything else is pure Python.
+
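As an illustration of the pure-Python rule, the token-sized splitting assigned to `chunker.py` might look like the sliding-window sketch below. The function names, the 400-token window, and the 50-token overlap are assumptions for illustration. Whitespace splitting stands in for a real tokenizer; with tiktoken one could pass `enc.encode` and `enc.decode` from `tiktoken.get_encoding("cl100k_base")` instead.

```python
def chunk_tokens(tokens, max_tokens=400, overlap=50):
    """Split a token list into overlapping windows of at most max_tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def chunk_text(text, max_tokens=400, overlap=50, encode=str.split, decode=" ".join):
    """Chunk text using pluggable encode/decode functions.

    The whitespace defaults are a stand-in for a real tokenizer.
    """
    windows = chunk_tokens(encode(text), max_tokens, overlap)
    return [decode(window) for window in windows]
```

The overlap keeps a little shared context between neighbouring chunks, so a fact that straddles a boundary still appears whole in at least one content memory.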
+---
+
+## Token Budget
+
+Estimated token usage for a ~300-page book using Groq/Llama 4:
+
+| Operation | Calls | Input tokens | Output tokens |
+|-----------|-------|-------------|---------------|
+| Book summary | 1 | ~2,000 | ~500 |
+| Chapter summaries (20 ch) | 20 | ~20,000 | ~6,000 |
+| Flat doc summary | 1–3 | ~6,000 | ~1,500 |
+| **Total (structured)** | ~21 | ~22,000 | ~6,500 |
+
+On Groq's free tier, this is effectively **$0.00** for most books, provided the run stays within the tier's rate limits.
+
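The structured total in the table is the sum of the book-summary and chapter-summary rows; the flat-document row belongs to the alternative path and is not included. A short check of that arithmetic, using only the estimates from the table (the names `STRUCTURED_BUDGET`, `total_budget`, and `per_call_average` are hypothetical):

```python
# (calls, input tokens, output tokens) per operation, copied from the table above.
STRUCTURED_BUDGET = {
    "book_summary": (1, 2_000, 500),
    "chapter_summaries": (20, 20_000, 6_000),
}


def total_budget(budget):
    """Sum calls, input tokens, and output tokens across all operations."""
    calls = sum(row[0] for row in budget.values())
    tokens_in = sum(row[1] for row in budget.values())
    tokens_out = sum(row[2] for row in budget.values())
    return calls, tokens_in, tokens_out


def per_call_average(row):
    """Average input and output tokens per call for one table row."""
    calls, tokens_in, tokens_out = row
    return tokens_in / calls, tokens_out / calls


print(total_budget(STRUCTURED_BUDGET))  # (21, 22000, 6500)
print(per_call_average(STRUCTURED_BUDGET["chapter_summaries"]))  # (1000.0, 300.0)
```

Read this way, each chapter summary is budgeted at roughly 1,000 input tokens and 300 output tokens.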
+---
+
+## Manifest Format
+
+`books/manifests/sapiens_2026-03-11.json`
+
+```json
+{
+  "source_file": "sapiens.pdf",
+  "ingested_at": "2026-03-11T10:23:00Z",
+  "document_type": "structured",
+  "chapters_detected": 20,
+  "memories_created": {
+    "book_summary": 1,
+    "chapter_summary": 20,
+    "content": 187
+  },
+  "mem0_memory_ids": ["abc123", "def456", "..."],
+  "status": "complete"
+}
+```
+
+The manifest enables clean **deletion**: purge all `mem0_memory_ids` to fully remove a book from memory.
+
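The deletion flow could be sketched as below. This is a hedged sketch, not the planned CLI: `purge_book` and `make_http_deleter` are hypothetical names, and the `DELETE /memories/{id}` route is an assumption about the mem0 server that should be checked against its API before use. The delete call is injected as a callable so the manifest logic can run without a server.

```python
import json
from pathlib import Path


def purge_book(manifest_path, delete_memory):
    """Delete every memory listed in a manifest; return the IDs that failed.

    delete_memory is any callable that takes a memory ID and raises on failure.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    failed = []
    for memory_id in manifest["mem0_memory_ids"]:
        try:
            delete_memory(memory_id)
        except Exception:
            failed.append(memory_id)  # keep going so one failure does not block the rest
    return failed


def make_http_deleter(base_url, timeout=30):
    """Build a delete_memory callable; the DELETE route is an assumption."""
    import requests  # imported lazily so purge_book needs only the stdlib

    def delete_memory(memory_id):
        url = f"{base_url}/memories/{memory_id}"
        response = requests.delete(url, timeout=timeout)
        response.raise_for_status()

    return delete_memory
```

Returning the failed IDs lets a caller retry them or keep the manifest around until the purge is complete.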
+---
+
+## Development Phases
+
+### Phase 1 — Core Pipeline (current)
+- [x] Project structure & config
+- [ ] `detector.py` — structure detection
+- [ ] `chunker.py` — token-aware chunking
+- [ ] `summarizer.py` — Groq/Llama 4 summarization
+- [ ] `mem0_writer.py` — mem0 REST integration
+- [ ] `manifest.py` — ingestion tracking
+- [ ] `pipeline.py` — full orchestration
+- [ ] `watchdog_runner.py` — folder watcher + Rich terminal UI
+
+### Phase 2 — Extended Formats
+- [ ] Markdown and plain text ingestion
+- [ ] EPUB support
+- [ ] Scanned PDF OCR (via Tesseract or Llama 4 vision)
+
+### Phase 3 — Docker
+- [ ] `Dockerfile`
+- [ ] `docker-compose.yml` with `books/` volume mount
+- [ ] Health check endpoint
+
+### Phase 4 — Management
+- [ ] CLI tool: `book-ingestor delete sapiens.pdf`
+- [ ] CLI tool: `book-ingestor list` — show all ingested books
+- [ ] Re-ingest on file change (hash-based deduplication)
+
+---
+
+## Dependencies (planned)
+
+```
+pymupdf          # PDF text + structure extraction
+watchdog         # Folder monitoring
+groq             # Groq Python SDK
+tiktoken         # Token counting (no LLM)
+requests         # mem0 REST calls
+python-dotenv    # .env loading
+rich             # Terminal UI / progress display
+```
+
+---
+
+## Design Principles
+
+1. **Python does the heavy lifting** — structure detection, chunking, and file management are pure Python. LLMs only summarize.
+2. **Token frugality** — we never send more to Groq than necessary. Chunk boundaries are computed locally.
+3. **Idempotent ingestion** — manifests make it safe to re-run. Duplicates are detected via file hash.
+4. **Network-only coupling** — the only external dependencies at runtime are the mem0 server URL and the Groq API. No shared filesystem required.
+5. **Docker-ready by design** — folder paths are configurable, stateless between runs.
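
The hash-based duplicate detection named in principle 3 might be sketched as follows. This is an assumption-laden sketch: `file_sha256` and `already_ingested` are hypothetical helpers, SHA-256 is one reasonable choice of hash, and the manifest format shown earlier does not yet include a hash field, so where the known hashes are stored is left open.

```python
import hashlib


def file_sha256(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def already_ingested(path, known_hashes):
    """True when the file's content hash matches a previously ingested file."""
    return file_sha256(path) in known_hashes
```

Hashing the content rather than the filename means a renamed copy is still recognized as a duplicate, while an edited file with the same name is treated as new.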