Feed structured knowledge into a mem0 memory server so an AI agent can recall it naturally in conversation — no explicit RAG retrieval, no "search the knowledge base" prompts. The agent simply knows what it has read.
File detected (watchdog)
│
▼
[detector.py]
Pythonic structure analysis via PyMuPDF:
- Font size variance → heading detection
- Bold flags + positioning → chapter boundaries
- Flat if no structural signals found
│
├─── STRUCTURED PATH ──────────────────────────┐
│ Extract: book title, chapters, paragraphs │
│ Summarize: book (1 Groq call) │
│ Summarize: each chapter (N Groq calls) │
│ Chunk: paragraphs → content memories │
│ │
└─── FLAT PATH ─────────────────────────────────┤
Semantic/sliding window chunking │
Summarize: whole doc (1-3 Groq calls) │
Chunk: paragraphs → content memories │
│
▼
[mem0_writer.py]
POST /memories (layered)
│
▼
[manifest.py]
Save manifest JSON
Every memory POSTed to mem0 carries structured metadata:
```json
{
  "messages": [{"role": "user", "content": "<memory text>"}],
  "agent_id": "knowledge_base",
  "metadata": {
    "source_file": "sapiens.pdf",
    "source_type": "book",
    "memory_type": "chapter_summary",
    "chapter": 4,
    "chapter_title": "The Storytelling Animal",
    "page_start": 67,
    "page_end": 71,
    "ingested_at": "2026-03-11T10:00:00Z"
  }
}
```
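A minimal mem0_writer.py sketch that builds and POSTs this payload. The endpoint path and field names come from the example above; the base URL and function names are assumptions for a self-hosted mem0 server.

```python
import datetime

import requests

MEM0_URL = "http://localhost:8000"  # assumed self-hosted mem0 server


def build_payload(text: str, metadata: dict, agent_id: str = "knowledge_base") -> dict:
    """Assemble the layered-memory payload; stamps ingested_at in UTC."""
    ingested_at = (
        datetime.datetime.now(datetime.timezone.utc)
        .isoformat(timespec="seconds")
        .replace("+00:00", "Z")
    )
    return {
        "messages": [{"role": "user", "content": text}],
        "agent_id": agent_id,
        "metadata": {**metadata, "ingested_at": ingested_at},
    }


def write_memory(text: str, metadata: dict) -> dict:
    """POST one memory to mem0 and return the server's JSON reply."""
    resp = requests.post(
        f"{MEM0_URL}/memories", json=build_payload(text, metadata), timeout=30
    )
    resp.raise_for_status()
    return resp.json()
```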
| memory_type | Count per doc | Purpose |
|---|---|---|
| book_summary | 1 | High-level overview, broad questions |
| chapter_summary | N (structured docs) | Mid-level recall by topic |
| content | M | Specific facts, quotes, details |
| Module | Role | LLM? |
|---|---|---|
| watchdog_runner.py | Watches inbox/, triggers pipeline | No |
| pipeline.py | Orchestrates the full flow | No |
| detector.py | Detects document structure via PyMuPDF | No |
| chunker.py | Splits text into token-sized chunks | No |
| summarizer.py | Generates summaries via Groq/Llama 4 | ✅ Yes |
| mem0_writer.py | POSTs memories to mem0 REST API | No |
| manifest.py | Tracks ingested files and memory IDs | No |
| config.py | Loads .env, exposes typed settings | No |
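The watcher side of watchdog_runner.py can be sketched with the watchdog library's event API. The handler-with-callback shape is my assumption; hooking it to the real pipeline is left as the commented hand-off.

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class InboxHandler(FileSystemEventHandler):
    """Invoke a callback for every new PDF dropped into the watched folder."""

    def __init__(self, callback):
        self.callback = callback  # e.g. pipeline.run (hypothetical hand-off)

    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith(".pdf"):
            self.callback(event.src_path)


def watch(path: str = "inbox/") -> None:
    """Block forever, dispatching each new PDF to the handler's callback."""
    observer = Observer()
    observer.schedule(InboxHandler(print), path, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```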
Rule: Only summarizer.py calls an LLM. Everything else is pure Python.
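That single LLM touchpoint can be sketched with the official Groq SDK. The model id, prompt wording, and function names are assumptions; read the actual model from config, since Groq's catalog names change.

```python
import os


def build_messages(text: str, scope: str) -> list[dict]:
    """Pure helper: assemble the chat messages for one summary call."""
    return [
        {"role": "system",
         "content": f"Summarize the following {scope} in 3-5 dense sentences."},
        {"role": "user", "content": text},
    ]


def summarize(text: str, scope: str = "chapter", max_tokens: int = 400) -> str:
    """One chat completion per summary; the rest of the pipeline stays pure Python."""
    from groq import Groq  # imported lazily so non-LLM modules never need the SDK

    client = Groq()  # reads GROQ_API_KEY from the environment
    model = os.getenv("GROQ_MODEL", "meta-llama/llama-4-scout-17b-16e-instruct")
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(text, scope),
        max_tokens=max_tokens,
        temperature=0.3,
    )
    return resp.choices[0].message.content.strip()
```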
Estimated cost per ~300-page book using Groq/Llama 4:
| Operation | Calls | Input tokens | Output tokens |
|---|---|---|---|
| Book summary | 1 | ~2,000 | ~500 |
| Chapter summaries (20 ch) | 20 | ~20,000 | ~6,000 |
| Flat doc summary | 1–3 | ~6,000 | ~1,500 |
| Total (structured) | ~21 | ~22,000 | ~6,500 |
At Groq free tier rates: effectively $0.00 for most books.
Example manifest, saved as `books/manifests/sapiens_2026-03-11.json`:

```json
{
  "source_file": "sapiens.pdf",
  "ingested_at": "2026-03-11T10:23:00Z",
  "document_type": "structured",
  "chapters_detected": 20,
  "memories_created": {
    "book_summary": 1,
    "chapter_summary": 20,
    "content": 187
  },
  "mem0_memory_ids": ["abc123", "def456", "..."],
  "status": "complete"
}
```
The manifest enables clean deletion: purge all mem0_memory_ids to fully remove a book from memory.
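That purge can be sketched as a loop over the manifest's ids, assuming the mem0 server exposes `DELETE /memories/{id}` and the same base URL as the writer. The function names are illustrative.

```python
import json

import requests

MEM0_URL = "http://localhost:8000"  # assumed self-hosted mem0 server


def load_memory_ids(manifest_path: str) -> list[str]:
    """Pure helper: read the mem0_memory_ids list out of a manifest file."""
    with open(manifest_path) as f:
        return json.load(f)["mem0_memory_ids"]


def purge_book(manifest_path: str) -> int:
    """Delete every memory the manifest recorded; return how many succeeded."""
    deleted = 0
    for memory_id in load_memory_ids(manifest_path):
        resp = requests.delete(f"{MEM0_URL}/memories/{memory_id}", timeout=30)
        if resp.ok:
            deleted += 1
    return deleted
```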
- detector.py — structure detection
- chunker.py — token-aware chunking
- summarizer.py — Groq/Llama 4 summarization
- mem0_writer.py — mem0 REST integration
- manifest.py — ingestion tracking
- pipeline.py — full orchestration
- watchdog_runner.py — folder watcher + Rich terminal UI
- Dockerfile
- docker-compose.yml with books/ volume mount
- `book-ingestor delete sapiens.pdf` — remove a book's memories
- `book-ingestor list` — show all ingested books

Dependencies:

```
pymupdf        # PDF text + structure extraction
watchdog       # Folder monitoring
groq           # Groq Python SDK
tiktoken       # Token counting (no LLM)
requests       # mem0 REST calls
python-dotenv  # .env loading
rich           # Terminal UI / progress display
```