
📚 book-ingestor

"The agent reads the book so you don't have to explain it."

A standalone Python service that watches a folder for PDFs (with other document formats planned), intelligently processes them into layered memories, and feeds them into a mem0 server via its REST API.

The result: your AI agent doesn't search for knowledge — it simply knows it.


How it works

📂 books/inbox/          ← drop a PDF here
        ↓  (watchdog detects new file)
🔍 Structure Detection   ← is this a book with chapters, or a flat doc?
        ↓
✂️  Chunking             ← smart paragraph/semantic chunking (no LLM used)
        ↓
🧠 Summarization         ← Groq/Llama generates book + chapter summaries
        ↓
💾 mem0 /memories        ← layered memories POSTed to your mem0 server
        ↓
📂 books/done/           ← file archived, manifest saved
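The chunking step above can be sketched as a greedy paragraph packer that approximates tokens by whitespace-split words (a simplification; the real chunker.py may count tokens differently):

```python
import re

def chunk_paragraphs(text: str, max_tokens: int = 350) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_tokens.

    Tokens are approximated by word count here; a single paragraph
    longer than max_tokens is kept as one oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        length = len(para.split())
        if current and current_len + length > max_tokens:
            # Current chunk is full; flush it and start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because no LLM is involved, this step is fast and deterministic; only the summarization step calls Groq.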

Memories are stored in layers:

  • Book summary — one high-level memory for the whole document
  • Chapter summaries — one memory per chapter/section (structured docs)
  • Content chunks — paragraph-level memories for fine-grained recall
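A minimal sketch of how one layered memory might be POSTed to the mem0 /memories endpoint. The exact request body depends on your mem0 server version; the field names (messages, agent_id, metadata) and the layer labels below are assumptions, not the confirmed schema:

```python
import json
import urllib.request

MEM0_BASE_URL = "http://192.168.0.200:8420"   # from .env
MEM0_AGENT_ID = "knowledge_base"

def build_memory(text: str, layer: str, book: str) -> dict:
    """Assemble one memory record; 'layer' distinguishes book_summary,
    chapter_summary, and chunk (field names are illustrative)."""
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": MEM0_AGENT_ID,
        "metadata": {"layer": layer, "book": book},
    }

def post_memory(memory: dict) -> None:
    """POST one memory to the mem0 server."""
    req = urllib.request.Request(
        f"{MEM0_BASE_URL}/memories",
        data=json.dumps(memory).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```

The layer label in metadata is what lets the agent later retrieve at the right granularity: a broad question can match the book summary, while a detailed one matches a chunk.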

Requirements

  • Python 3
  • A running mem0 server reachable from this machine (MEM0_BASE_URL)
  • A Groq API key for summarization (GROQ_API_KEY)
  • The Python packages listed in requirements.txt


Quick Start

git clone https://github.com/yourname/book-ingestor.git
cd book-ingestor
cp .env.example .env        # fill in your values
pip install -r requirements.txt
python -m book_ingestor.watchdog_runner

Drop a PDF into books/inbox/ and watch it get ingested.


Configuration

All config lives in .env:

MEM0_BASE_URL=http://192.168.0.200:8420
MEM0_AGENT_ID=knowledge_base
GROQ_API_KEY=your_groq_key_here
GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
BOOKS_INBOX=./books/inbox
BOOKS_PROCESSING=./books/processing
BOOKS_DONE=./books/done
BOOKS_MANIFESTS=./books/manifests
CHUNK_SIZE_TOKENS=350
LOG_LEVEL=INFO
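Loading these values could look like the sketch below, using only the standard library (the real config.py may use a .env loader such as python-dotenv and expose more settings):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """A subset of the service settings, read from the environment.

    Defaults here are illustrative fallbacks, not the project's
    documented defaults.
    """
    mem0_base_url: str
    mem0_agent_id: str
    chunk_size_tokens: int

def load_settings() -> Settings:
    return Settings(
        mem0_base_url=os.getenv("MEM0_BASE_URL", "http://localhost:8420"),
        mem0_agent_id=os.getenv("MEM0_AGENT_ID", "knowledge_base"),
        chunk_size_tokens=int(os.getenv("CHUNK_SIZE_TOKENS", "350")),
    )
```

Keeping all knobs in .env means the same code runs unchanged on any machine that can reach your mem0 server.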

Folder Structure

book-ingestor/
├── books/
│   ├── inbox/          ← drop zone (watched)
│   ├── processing/     ← in-flight (do not touch)
│   ├── done/           ← archived originals
│   └── manifests/      ← JSON record per ingested book
├── book_ingestor/
│   ├── watchdog_runner.py
│   ├── pipeline.py
│   ├── detector.py
│   ├── chunker.py
│   ├── summarizer.py
│   ├── mem0_writer.py
│   ├── manifest.py
│   └── config.py
├── .env.example
├── requirements.txt
├── PROJECT.md
└── README.md

Supported File Types

Format                 Status
PDF (text-based)       ✅ supported
PDF (scanned/image)    🔜 OCR planned
Markdown (.md)         🔜 planned
Plain text (.txt)      🔜 planned
EPUB                   🔜 planned

Notes

  • This project is completely independent of OpenClaw or any specific AI agent — it only talks to mem0.
  • It can run on any machine that has network access to your mem0 server.
  • Docker support is planned for a future release.

License

MIT