# PROJECT.md — book-ingestor

## Vision

Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply *knows* what it has read.

---

## Architecture

### Pipeline Overview

```
File detected (watchdog)
  │
  ▼
[detector.py] Pythonic structure analysis via PyMuPDF:
  - Font size variance → heading detection
  - Bold flags + positioning → chapter boundaries
  - Flat if no structural signals found
  │
  ├─── STRUCTURED PATH ──────────────────────────┐
  │    Extract: book title, chapters, paragraphs │
  │    Summarize: book (1 Groq call)             │
  │    Summarize: each chapter (N Groq calls)    │
  │    Chunk: paragraphs → content memories      │
  │                                              │
  └─── FLAT PATH ────────────────────────────────┤
       Semantic/sliding window chunking          │
       Summarize: whole doc (1–3 Groq calls)     │
       Chunk: paragraphs → content memories      │
  ┌────────────────────────────────────────────── ┘
  ▼
[mem0_writer.py] POST /memories (layered)
  │
  ▼
[manifest.py] Save manifest JSON
```

---

## Memory Schema

Every memory POSTed to mem0 carries structured metadata:

```json
{
  "messages": [{"role": "user", "content": ""}],
  "agent_id": "knowledge_base",
  "metadata": {
    "source_file": "sapiens.pdf",
    "source_type": "book",
    "memory_type": "chapter_summary",
    "chapter": 4,
    "chapter_title": "The Storytelling Animal",
    "page_start": 67,
    "page_end": 71,
    "ingested_at": "2026-03-11T10:00:00Z"
  }
}
```

### Memory Types

| `memory_type` | Count per doc | Purpose |
|---------------|---------------|---------|
| `book_summary` | 1 | High-level overview, broad questions |
| `chapter_summary` | N (structured docs) | Mid-level recall by topic |
| `content` | M | Specific facts, quotes, details |

---

## Module Responsibilities

| Module | Role | LLM? |
|--------|------|------|
| `watchdog_runner.py` | Watches `inbox/`, triggers pipeline | No |
| `pipeline.py` | Orchestrates the full flow | No |
| `detector.py` | Detects document structure via PyMuPDF | No |
| `chunker.py` | Splits text into token-sized chunks | No |
| `summarizer.py` | Generates summaries via Groq/Llama 4 | ✅ Yes |
| `mem0_writer.py` | POSTs memories to mem0 REST API | No |
| `manifest.py` | Tracks ingested files and memory IDs | No |
| `config.py` | Loads `.env`, exposes typed settings | No |

**Rule:** Only `summarizer.py` calls an LLM. Everything else is pure Python.

---

## Token Budget

Estimated cost per ~300-page book using Groq/Llama 4:

| Operation | Calls | Input tokens | Output tokens |
|-----------|-------|--------------|---------------|
| Book summary | 1 | ~2,000 | ~500 |
| Chapter summaries (20 ch) | 20 | ~20,000 | ~6,000 |
| Flat doc summary | 1–3 | ~6,000 | ~1,500 |
| **Total (structured)** | ~21 | ~22,000 | ~6,500 |

At Groq free-tier rates: effectively **$0.00** for most books.

---

## Manifest Format

`books/manifests/sapiens_2026-03-11.json`

```json
{
  "source_file": "sapiens.pdf",
  "ingested_at": "2026-03-11T10:23:00Z",
  "document_type": "structured",
  "chapters_detected": 20,
  "memories_created": {
    "book_summary": 1,
    "chapter_summary": 20,
    "content": 187
  },
  "mem0_memory_ids": ["abc123", "def456", "..."],
  "status": "complete"
}
```

The manifest enables clean **deletion**: purge all `mem0_memory_ids` to fully remove a book from memory.
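A memory body matching the schema above can be assembled by a small pure helper before `mem0_writer.py` POSTs it. This is a sketch only: `build_memory_payload` and its signature are hypothetical, not part of the codebase yet — only the payload shape comes from the schema.

```python
from datetime import datetime, timezone


def build_memory_payload(text: str, source_file: str, memory_type: str, **meta) -> dict:
    """Build a mem0 POST /memories body matching the project's memory schema.

    Hypothetical helper: the name, signature, and defaults are illustrative.
    Extra keyword args (e.g. chapter, chapter_title, page_start, page_end)
    are merged into the metadata block.
    """
    metadata = {
        "source_file": source_file,
        "source_type": "book",
        "memory_type": memory_type,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    metadata.update(meta)
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": "knowledge_base",
        "metadata": metadata,
    }
```

`mem0_writer.py` would then send each payload with `requests.post(f"{MEM0_URL}/memories", json=payload)`; the exact path and response shape depend on the mem0 server version in use.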
---

## Development Phases

### Phase 1 — Core Pipeline (current)

- [x] Project structure & config
- [ ] `detector.py` — structure detection
- [ ] `chunker.py` — token-aware chunking
- [ ] `summarizer.py` — Groq/Llama 4 summarization
- [ ] `mem0_writer.py` — mem0 REST integration
- [ ] `manifest.py` — ingestion tracking
- [ ] `pipeline.py` — full orchestration
- [ ] `watchdog_runner.py` — folder watcher + Rich terminal UI

### Phase 2 — Extended Formats

- [ ] Markdown and plain text ingestion
- [ ] EPUB support
- [ ] Scanned PDF OCR (via Tesseract or Llama 4 vision)

### Phase 3 — Docker

- [ ] `Dockerfile`
- [ ] `docker-compose.yml` with `books/` volume mount
- [ ] Health check endpoint

### Phase 4 — Management

- [ ] CLI tool: `book-ingestor delete sapiens.pdf`
- [ ] CLI tool: `book-ingestor list` — show all ingested books
- [ ] Re-ingest on file change (hash-based deduplication)

---

## Dependencies (planned)

```
pymupdf        # PDF text + structure extraction
watchdog       # Folder monitoring
groq           # Groq Python SDK
tiktoken       # Token counting (no LLM)
requests       # mem0 REST calls
python-dotenv  # .env loading
rich           # Terminal UI / progress display
```

---

## Design Principles

1. **Python does the heavy lifting** — structure detection, chunking, and file management are pure Python. LLMs only summarize.
2. **Token frugality** — we never send more to Groq than necessary. Chunk boundaries are computed locally.
3. **Idempotent ingestion** — manifests make it safe to re-run. Duplicate detection via file hash.
4. **Network-only coupling** — the only external dependency at runtime is the mem0 server URL. No shared filesystem required.
5. **Docker-ready by design** — folder paths are configurable, stateless between runs.