# 📚 book-ingestor

> *"The agent reads the book so you don't have to explain it."*

A standalone Python service that watches a folder for PDFs (and other text documents), intelligently processes them into layered memories, and feeds them into a [mem0](https://github.com/mem0ai/mem0) server via its REST API. The result: your AI agent doesn't *search* for knowledge — it simply *knows* it.

---

## How it works

```
📂 books/inbox/         ← drop a PDF here
        ↓  (watchdog detects new file)
🔍 Structure Detection  ← is this a book with chapters, or a flat doc?
        ↓
✂️ Chunking             ← smart paragraph/semantic chunking (no LLM used)
        ↓
🧠 Summarization        ← Groq/Llama generates book + chapter summaries
        ↓
💾 mem0 /memories       ← layered memories POSTed to your mem0 server
        ↓
📂 books/done/          ← file archived, manifest saved
```

Memories are stored in layers:

- **Book summary** — one high-level memory for the whole document
- **Chapter summaries** — one memory per chapter/section (structured docs)
- **Content chunks** — paragraph-level memories for fine-grained recall

---

## Requirements

- Python 3.11+
- A running [mem0 server](https://github.com/mem0ai/mem0) accessible on your LAN
- A [Groq API key](https://console.groq.com/) (free tier is plenty)

---

## Quick Start

```bash
git clone https://github.com/yourname/book-ingestor.git
cd book-ingestor
cp .env.example .env          # fill in your values
pip install -r requirements.txt
python -m book_ingestor.watchdog_runner
```

Drop a PDF into `books/inbox/` and watch it get ingested.
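The chunking step above runs without an LLM. A minimal sketch of the idea, greedily packing whole paragraphs up to a `CHUNK_SIZE_TOKENS` budget — note that `chunk_paragraphs` and the words-as-tokens approximation are illustrative assumptions, not the actual `chunker.py` implementation:

```python
# Hypothetical sketch of paragraph chunking (the real chunker.py may differ).
# Tokens are approximated as whitespace-separated words; CHUNK_SIZE_TOKENS
# (default 350) caps each chunk. Paragraphs are never split mid-way, so
# recalled memories stay readable.

def chunk_paragraphs(text: str, max_tokens: int = 350) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly max_tokens words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each resulting chunk becomes one "content chunk" memory in mem0; a paragraph larger than the budget still becomes its own chunk rather than being split.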
---

## Configuration

All config lives in `.env`:

```env
MEM0_BASE_URL=http://192.168.0.200:8420
MEM0_AGENT_ID=knowledge_base
GROQ_API_KEY=your_groq_key_here
GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
BOOKS_INBOX=./books/inbox
BOOKS_PROCESSING=./books/processing
BOOKS_DONE=./books/done
BOOKS_MANIFESTS=./books/manifests
CHUNK_SIZE_TOKENS=350
LOG_LEVEL=INFO
```

---

## Folder Structure

```
book-ingestor/
├── books/
│   ├── inbox/          ← drop zone (watched)
│   ├── processing/     ← in-flight (do not touch)
│   ├── done/           ← archived originals
│   └── manifests/      ← JSON record per ingested book
├── book_ingestor/
│   ├── watchdog_runner.py
│   ├── pipeline.py
│   ├── detector.py
│   ├── chunker.py
│   ├── summarizer.py
│   ├── mem0_writer.py
│   ├── manifest.py
│   └── config.py
├── .env.example
├── requirements.txt
├── PROJECT.md
└── README.md
```

---

## Supported File Types

| Format | Status |
|--------|--------|
| PDF (text-based) | ✅ |
| PDF (scanned/image) | 🔜 (OCR planned) |
| Markdown (.md) | 🔜 |
| Plain text (.txt) | 🔜 |
| EPUB | 🔜 |

---

## Notes

- This project is **completely independent** of OpenClaw or any specific AI agent — it only talks to mem0.
- Any machine on the LAN with network access to your mem0 server can run this.
- Docker support is planned for a future release.

---

## License

MIT
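The layered memories are written to the mem0 server's `/memories` endpoint using `MEM0_BASE_URL` and `MEM0_AGENT_ID` from the config above. A rough sketch of how `mem0_writer.py` might shape and send one memory — the payload field names (`messages`, `agent_id`, `metadata`) are assumptions about your mem0 server's API, and `build_memory_payload` / `post_memory` are hypothetical helpers, not the project's actual code:

```python
# Hypothetical sketch of mem0_writer.py. Payload field names are assumptions;
# check your mem0 server's API reference before relying on them.
import json
import urllib.request


def build_memory_payload(text: str, layer: str, book: str,
                         agent_id: str = "knowledge_base") -> dict:
    """Wrap one piece of layered content as a mem0 memory payload.

    layer is one of: book_summary, chapter_summary, chunk (assumed naming).
    """
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": agent_id,
        "metadata": {"layer": layer, "book": book},
    }


def post_memory(base_url: str, payload: dict) -> None:
    """POST one memory to the mem0 server, e.g. http://192.168.0.200:8420."""
    req = urllib.request.Request(
        f"{base_url}/memories",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req).read()
```

Tagging each memory with `layer` and `book` metadata is what lets an agent later filter recall to, say, only chapter summaries of a given title.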