
📚 book-ingestor

"The agent reads the book so you don't have to explain it."

A standalone Python service that watches a folder for PDFs (support for other text formats is planned), intelligently processes them into layered memories, and feeds them into a mem0 server via its REST API.

The result: your AI agent doesn't search for knowledge — it simply knows it.


How it works

📂 books/inbox/          ← drop a PDF here
        ↓  (watchdog detects new file)
🔍 Structure Detection   ← is this a book with chapters, or a flat doc?
        ↓
✂️  Chunking             ← smart paragraph/semantic chunking (no LLM used)
        ↓
🧠 Summarization         ← Groq/Llama generates book + chapter summaries
        ↓
💾 mem0 /memories        ← layered memories POSTed to your mem0 server
        ↓
📂 books/done/           ← file archived, manifest saved
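
The chunking step packs whole paragraphs into chunks under a token budget (CHUNK_SIZE_TOKENS, default 350) without calling an LLM. A minimal sketch assuming a crude whitespace-based token estimate — the actual chunker.py may use a smarter semantic split:

```python
def chunk_paragraphs(text: str, max_tokens: int = 350) -> list[str]:
    """Greedy paragraph chunker: pack whole paragraphs into a chunk
    until the (rough, whitespace-based) token budget is exceeded."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        tokens = len(para.split())  # crude token estimate
        if current and count + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Keeping paragraphs intact (rather than splitting mid-sentence at a hard token boundary) is what makes each chunk usable as a standalone memory.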

Memories are stored in layers:

  • Book summary — one high-level memory for the whole document
  • Chapter summaries — one memory per chapter/section (structured docs)
  • Content chunks — paragraph-level memories for fine-grained recall
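
Each layer ends up as a POST to the mem0 server's /memories endpoint. A stdlib-only sketch — the payload shape (messages + agent_id + metadata) follows the mem0 REST API as I understand it; field names may differ on your server version:

```python
import json
import urllib.request

MEM0_BASE_URL = "http://192.168.0.200:8420"  # from .env

def build_memory_payload(text: str, layer: str, book: str,
                         agent_id: str = "knowledge_base") -> dict:
    """Payload for one layered memory. Field names assume the mem0
    REST API's POST /memories shape; adjust to your server version."""
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": agent_id,
        "metadata": {"layer": layer, "book": book},
    }

def store_memory(payload: dict) -> bytes:
    """POST the payload to the mem0 server."""
    req = urllib.request.Request(
        f"{MEM0_BASE_URL}/memories",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

Tagging each memory with its layer in the metadata is what lets the agent later distinguish a book-level summary from a paragraph-level chunk.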

Requirements

  • Docker (recommended), or Python 3 with pip
  • A running mem0 server reachable over the network (MEM0_BASE_URL)
  • A Groq API key for summarization (GROQ_API_KEY)

Quick Start

With Docker (recommended)

git clone https://github.com/yourname/book-ingestor.git
cd book-ingestor
cp .env.example .env        # fill in your values
docker compose up -d --build

Watch logs:

docker compose logs -f

Stop / restart:

docker compose down
docker compose up -d

If a PDF gets stuck in books/processing/ after an interrupted run:

mv books/processing/*.pdf books/inbox/
docker compose restart

Without Docker

git clone https://github.com/yourname/book-ingestor.git
cd book-ingestor
cp .env.example .env        # fill in your values
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python -m book_ingestor.watchdog_runner

Drop a PDF into books/inbox/ and watch it get ingested.
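
Behind the scenes, a new file is claimed by moving it from inbox/ to processing/ before ingestion starts — that is why an interrupted run can leave a PDF stranded in books/processing/. The real runner uses the watchdog library for event-driven detection; here is a dependency-free polling sketch of just the claim step:

```python
import shutil
from pathlib import Path

def scan_inbox(inbox: Path, processing: Path) -> list[Path]:
    """Move any new PDFs from inbox/ to processing/ and return their
    new paths. (Illustrative only; the real service reacts to watchdog
    filesystem events rather than polling.)"""
    moved = []
    for pdf in sorted(inbox.glob("*.pdf")):
        dest = processing / pdf.name
        shutil.move(str(pdf), dest)  # claim the file before processing
        moved.append(dest)
    return moved
```

Moving the file first makes the pipeline safe against the same file being picked up twice.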


Configuration

All config lives in .env:

MEM0_BASE_URL=http://192.168.0.200:8420
MEM0_AGENT_ID=knowledge_base
GROQ_API_KEY=your_groq_key_here
GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
BOOKS_INBOX=./books/inbox
BOOKS_PROCESSING=./books/processing
BOOKS_DONE=./books/done
BOOKS_MANIFESTS=./books/manifests
CHUNK_SIZE_TOKENS=350
LOG_LEVEL=INFO
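
config.py presumably maps these variables onto settings; a sketch of what such a loader might look like (the real module is not shown here, and the service would typically load .env into the environment first, e.g. via python-dotenv):

```python
import os

def load_settings() -> dict:
    """Read the .env-driven settings from the environment, falling back
    to the defaults shown above. (A sketch; the actual config.py may
    differ in names and structure.)"""
    return {
        "mem0_base_url": os.getenv("MEM0_BASE_URL", "http://192.168.0.200:8420"),
        "mem0_agent_id": os.getenv("MEM0_AGENT_ID", "knowledge_base"),
        "groq_api_key": os.getenv("GROQ_API_KEY", ""),
        "books_inbox": os.getenv("BOOKS_INBOX", "./books/inbox"),
        "chunk_size_tokens": int(os.getenv("CHUNK_SIZE_TOKENS", "350")),
        "log_level": os.getenv("LOG_LEVEL", "INFO"),
    }
```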

Folder Structure

book-ingestor/
├── books/
│   ├── inbox/          ← drop zone (watched)
│   ├── processing/     ← in-flight (do not touch)
│   ├── done/           ← archived originals
│   └── manifests/      ← JSON record per ingested book
├── book_ingestor/
│   ├── watchdog_runner.py
│   ├── pipeline.py
│   ├── detector.py
│   ├── chunker.py
│   ├── summarizer.py
│   ├── mem0_writer.py
│   ├── manifest.py
│   └── config.py
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── requirements.txt
├── PROJECT.md
└── README.md
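
Each ingested book leaves a JSON record in books/manifests/. The actual schema lives in manifest.py; this hypothetical writer with illustrative fields shows the idea:

```python
import json
import time
from pathlib import Path

def write_manifest(manifests_dir: Path, book: str, chapters: int,
                   chunks: int, memory_ids: list[str]) -> Path:
    """Write the per-book JSON record kept in books/manifests/.
    (Field names are illustrative; see manifest.py for the real schema.)"""
    record = {
        "book": book,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "chapters": chapters,
        "chunks": chunks,
        "memory_ids": memory_ids,
    }
    path = manifests_dir / f"{book}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

The manifest is what makes a run auditable: you can see exactly which memories a given PDF produced without querying mem0.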

Supported File Types

  Format                Status
  PDF (text-based)      ✅ supported
  PDF (scanned/image)   🔜 planned (OCR)
  Markdown (.md)        🔜 planned
  Plain text (.txt)     🔜 planned
  EPUB                  🔜 planned

Notes

  • This project is completely independent of OpenClaw or any specific AI agent — it only talks to mem0.
  • Any machine that can reach your mem0 server over the network can run it.
  • The books/ folder is mounted into the container — PDFs, manifests and archives survive restarts and rebuilds.
  • network_mode: host is used so the container can reach your LAN mem0 server without extra networking config.

License

MIT