1 week ago · e3d27d9fd1
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -29,14 +29,33 @@ This project spans two machines. **Always check which machine you're operating o
 
				 - The local dev copy has its own separate DB — treat it as empty/stale unless explicitly working with it.
			
 
				 
			
 
				 ## Repo Map
			
 
				-- `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, and HTTP health endpoints.
			
 
				+- `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, HTTP health endpoints, REST API.
			
 
				 - `news_mcp/jobs/poller.py`: feed refresh loop, clustering, enrichment, and cache writes.
			
 
				-- `news_mcp/storage/sqlite_store.py`: SQLite schema, cluster/entity metadata, feed hashes, and prune state.
			
 
				-- `news_mcp/dedup/cluster.py`: topic bucketing and the current fuzzy/embedding clustering path.
			
 
				+- `news_mcp/storage/sqlite_store.py`: SQLite schema (payload_ts, junction tables), upsert with junction population, SQL-level read methods. **Single data access layer for MCP tools.**
			
 
				+- `news_mcp/dashboard/dashboard_store.py`: Read-only query layer for the dashboard REST API. Wraps `SQLiteClusterStore` and adds junction-table entity/keyword search. **NOTE: this store duplicates methods from sqlite_store; see Design Flaw in PROJECT.md.**
			
 
				+- `news_mcp/dedup/cluster.py`: topic bucketing and fuzzy/embedding clustering.
			
 
				 - `news_mcp/enrichment/llm_enrich.py`: LLM extraction/summarization and blacklist filtering.
			
 
				-- `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: local Google Trends-based entity resolution and neighborhood lookup.
			
 
				+- `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: entity resolution and neighborhood lookup.
			
 
				 - `news_mcp/config.py`: env-driven defaults and file paths.
			
 
				 
			
 
				+## Query Architecture (READ THIS BEFORE ADDING NEW QUERIES)
			
 
				+
			
 
				+**Time filtering:** Always use `payload_ts >= ?` SQL filter. Never parse JSON timestamps in Python for time ranges.
			
 
				+
			
 
				+**Entity/keyword search:** Use junction tables:
			
 
				+- `cluster_entities` for entity search: `JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id WHERE ce.entity = ?`
			
 
				+- `cluster_keywords` for keyword search: `JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id WHERE ck.keyword = ?`
			
 
				+- Do NOT fetch all clusters and filter entities in Python.
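
A minimal sketch of the junction-table lookups described above, using Python's built-in `sqlite3`. Only the table names `cluster_entities` / `cluster_keywords` and the `cluster_id`, `entity`, `keyword` columns come from the text; the `clusters` layout, the index names, the helper names and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database with a hypothetical, stripped-down schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, headline TEXT NOT NULL);
    CREATE TABLE cluster_entities (cluster_id TEXT NOT NULL, entity TEXT NOT NULL);
    CREATE TABLE cluster_keywords (cluster_id TEXT NOT NULL, keyword TEXT NOT NULL);
    CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
    CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
""")

# Made-up sample data.
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?)",
    [("c1", "sample headline 1"), ("c2", "sample headline 2"), ("c3", "sample headline 3")],
)
conn.executemany(
    "INSERT INTO cluster_entities VALUES (?, ?)",
    [("c1", "BTC"), ("c2", "Federal Reserve"), ("c3", "BTC")],
)
conn.executemany(
    "INSERT INTO cluster_keywords VALUES (?, ?)",
    [("c1", "etf"), ("c2", "rates"), ("c3", "mining")],
)


def cluster_ids_for_entity(conn, entity):
    """Return cluster IDs tagged with `entity`, resolved entirely in SQL."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters c "
        "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
        "WHERE ce.entity = ? ORDER BY c.cluster_id",
        (entity,),
    )
    return [row[0] for row in rows]


def cluster_ids_for_keyword(conn, keyword):
    """Return cluster IDs tagged with `keyword`, resolved entirely in SQL."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters c "
        "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
        "WHERE ck.keyword = ? ORDER BY c.cluster_id",
        (keyword,),
    )
    return [row[0] for row in rows]


btc_clusters = cluster_ids_for_entity(conn, "BTC")
rate_clusters = cluster_ids_for_keyword(conn, "rates")
```

Because the match happens in the `WHERE` clause, no row limit is needed to keep the Python side cheap, and no cluster is skipped for being old.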
			
 
				+
			
 
				+**Backfill:** After schema changes, run `scripts/backfill_junction_tables.py` in the Docker container:
			
 
				+```bash
			
 
				+docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
			
 
				+```
			
 
				+
			
 
				+## Design Flaw: Two Stores
			
 
				+
			
 
				+`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools (`get_events_for_entity`, `get_news_sentiment`) still go through `SQLiteClusterStore`, which matches entities in Python over a row-limited result (top 200), so entities that appear only in older clusters are missed. See PROJECT.md for full analysis and proposed fix.
			
 
				+
			
 
				 ## Docker / Live Server Details
			
 
				 - `docker-compose.yml` mounts `./:/app` with `working_dir: /app`
			
 
				 - Data dir and DB path both hardcoded in docker-compose env: `NEWS_MCP_DB_PATH: ./data/news.sqlite`
			
@@ -52,12 +71,11 @@ This project spans two machines. **Always check which machine you're operating o
 
				 - `include_articles=true` should keep responses compact and only return minimal article fields.
			
 
				 - Timestamps in cluster payloads are normalized to ISO 8601 UTC (`YYYY-MM-DDTHH:MM:SS+00:00`) at write time in `sanitize_cluster_payload()`.
			
 
				 
			
 
				-## Timestamp Contract (READ THIS BEFORE TOUCHING ANY TIMESTAMP CODE)
			
 
				-- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **guaranteed** `YYYY-MM-DDTHH:MM:SS+00:00` for every row written after the normalization migration (backfill script was run on the live server).
			
 
				-- **Read paths**: use `_read_ts()` from `news_mcp.storage.sqlite_store`, or `datetime.fromisoformat()` directly. That is all that is needed.
			
 
				-- **Never** add `parsedate_to_datetime` / RFC 2822 fallbacks to a read path. If `_read_ts` returns None on a stored timestamp, the bug is in the write path — fix `sanitize_cluster_payload()`, don't paper over it.
			
 
				-- `parsedate_to_datetime` is intentionally retained **only** in `sqlite_store._normalize_ts()` (write path) and `dedup/cluster.py` (raw ingest before normalization). Nowhere else.
			
 
				-- **Never query the dev DB** (`news_mcp/data/news.sqlite` on latitude) to check live data. It is empty/stale. The live DB is on thinkcenter-2 in Docker at `/app/data/news.sqlite`.
			
 
				+## Timestamp Contract
			
 
				+- `payload_ts` SQL column (VIRTUAL GENERATED) is the ONLY way to filter by event time. Use `WHERE payload_ts >= ?` in SQL. Never parse JSON timestamps in Python for time ranges.
			
 
				+- `payload.timestamp` in JSON is guaranteed `YYYY-MM-DDTHH:MM:SS+00:00` at write time (enforced by `sanitize_cluster_payload()`).
			
 
				+- `updated_at` in the DB = row modification time, NOT event time. Never use for time-range queries.
			
 
				+- This repo is the **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data.
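
A minimal sketch of how a `payload_ts` generated column can back the `WHERE payload_ts >= ?` filter, assuming a SQLite build with generated-column support (3.31 or later) and the JSON functions available. The `clusters` table layout, the index name and the helper `clusters_since` are assumptions; only `payload_ts`, `payload.timestamp` and the timestamp format come from the contract above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clusters (
        cluster_id TEXT PRIMARY KEY,
        payload    TEXT NOT NULL,
        -- Derived from the JSON payload; never written directly.
        payload_ts TEXT GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
    );
    CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
""")

# Made-up sample rows; timestamps follow YYYY-MM-DDTHH:MM:SS+00:00.
samples = [
    ("old", "2024-01-01T00:00:00+00:00"),
    ("new", "2024-01-02T12:00:00+00:00"),
]
for cluster_id, ts in samples:
    conn.execute(
        "INSERT INTO clusters (cluster_id, payload) VALUES (?, ?)",
        (cluster_id, json.dumps({"timestamp": ts})),
    )


def clusters_since(conn, cutoff_iso):
    """Return cluster IDs whose event time is at or after `cutoff_iso`."""
    rows = conn.execute(
        "SELECT cluster_id FROM clusters WHERE payload_ts >= ? ORDER BY payload_ts",
        (cutoff_iso,),
    )
    return [row[0] for row in rows]


recent = clusters_since(conn, "2024-01-02T00:00:00+00:00")
```

The string comparison is only safe because every stored timestamp shares the same fixed-width format and the same `+00:00` offset, which is why the write-time normalization matters.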
			
 
				 
			
 
				 ## Editing Rules
			
 
				 - Keep changes aligned with the docs in `README.md`, `PROJECT.md`, and `OUTLOOK.md`.
			
--- a/OUTLOOK.md
+++ b/OUTLOOK.md
@@ -1,544 +1,61 @@
 
				+# News MCP Server — Project Vision & Status
			
 
				 
			
 
				-# 📰 News MCP Server — Requirements Spec
			
 
				+> **Current version: v0.4.0** — see PROJECT.md for architecture details.
			
 
				 
			
 
				-> **Current version: v0.3.1** — see [RELEASE_NOTES.md](RELEASE_NOTES.md) for changelog.
			
 
				+## Core Design Principle
			
 
				 
			
 
				-## 🎯 Goal
			
 
				+Raw news is useless to agents. **Processed news is powerful.**
			
 
				 
			
 
				-Provide **structured, deduplicated, topic-aware news signals**
			
 
				-that an agent can use for reasoning about:
			
 
				+- ✅ Clusters are the unit of truth, not raw articles
			
 
				+- ✅ 100 articles → 5–10 clusters, with entities, sentiment, importance
			
 
				+- ✅ SQL-level filtering by time, entity, keyword — no full-table JSON parsing
			
 
				 
			
 
				-* events
			
 
				-* narratives
			
 
				-* sentiment shifts
			
 
				+## Architecture (v0.4.0)
			
 
				 
			
 
				-👉 Not a feed reader
			
 
				-👉 Not a headline dump
			
 
				-👉 A **signal extraction layer**
			
 
				+See PROJECT.md for full schema and architecture. Key points:
			
 
				+- `payload_ts` generated column for indexed time-range queries
			
 
				+- `cluster_entities` and `cluster_keywords` junction tables for O(log n) entity/keyword search
			
 
				+- MCP tools and Dashboard REST API both query the same SQLite DB
			
 
				+- Docker deployment on thinkcenter-2 (192.168.0.200:8506)
			
 
				 
			
 
				----
			
 
				+## Tool Surface
			
 
				 
			
 
				-# 🧠 Core Design Principle
			
 
				+| Tool | Status | Notes |
			
 
				+|---|---|---|
			
 
				+| `get_latest_events` | ✅ | Time-filtered via `payload_ts` SQL index |
			
 
				+| `get_events_for_entity` | ⚠️ | MCP tool still uses Python-side entity matching (top-N limit). Dashboard uses SQL junction table. Known design flaw. |
			
 
				+| `get_event_summary` | ✅ | LLM-written narrative |
			
 
				+| `detect_emerging_topics` | ✅ | entity/keyword/phrase signal types, velocity scoring |
			
 
				+| `get_news_sentiment` | ⚠️ | Same Python-side entity matching limitation as `get_events_for_entity` |
			
 
				+| `get_related_recent_entities` | ✅ | Co-occurrence + Google Trends blend |
			
 
				+| `get_feeds` / `toggle_feed` | ✅ | Feed management |
			
 
				+| `detect_emerging_topics(around=...)` | ✅ | Scope to entity neighborhood |
			
 
				 
			
 
				-> Raw news is useless to agents.
			
 
				-> **Processed news is powerful.**
			
 
				+## Known Design Issues
			
 
				 
			
 
				----
			
 
				+### Two Stores (see PROJECT.md § "Design Flaw")
			
 
				+`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools still use Python-side entity matching with a row limit. Proposed fix: collapse into single data access layer.
			
 
				 
			
 
				-# 🏗️ 1. Internal Architecture
			
 
				+### MCP Tool Entity Search
			
 
				+`get_events_for_entity` and `get_news_sentiment` fetch the top-N clusters by time, then filter entities in Python. Entities in clusters beyond the limit are missed. Fix: use the junction-table method `get_clusters_by_entity()`.
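
The failure mode can be reproduced in a few lines. This is an illustrative sketch, not the project's code: the schema, the `ts` column and both helper names are assumptions, and the per-cluster entity lookup stands in for the Python-side matching described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, ts TEXT NOT NULL);
    CREATE TABLE cluster_entities (cluster_id TEXT NOT NULL, entity TEXT NOT NULL);
""")

# Five made-up clusters; only the oldest one (c0) mentions "Iran".
for i in range(5):
    cid = f"c{i}"
    conn.execute(
        "INSERT INTO clusters VALUES (?, ?)",
        (cid, f"2024-01-0{i + 1}T00:00:00+00:00"),
    )
    conn.execute(
        "INSERT INTO cluster_entities VALUES (?, ?)",
        (cid, "Iran" if i == 0 else "BTC"),
    )


def top_n_then_filter(entity, limit):
    """Take the newest `limit` clusters, then match the entity in Python."""
    newest = [
        row[0]
        for row in conn.execute(
            "SELECT cluster_id FROM clusters ORDER BY ts DESC LIMIT ?", (limit,)
        )
    ]
    hits = []
    for cid in newest:
        entities = {
            row[0]
            for row in conn.execute(
                "SELECT entity FROM cluster_entities WHERE cluster_id = ?", (cid,)
            )
        }
        if entity in entities:
            hits.append(cid)
    return hits


def junction_lookup(entity):
    """Let SQL find every cluster tagged with the entity, newest first."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters c "
        "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
        "WHERE ce.entity = ? ORDER BY c.ts DESC",
        (entity,),
    )
    return [row[0] for row in rows]


limited = top_n_then_filter("Iran", limit=3)
complete = junction_lookup("Iran")
```

With a limit of 3 the only "Iran" cluster falls outside the window, so the first helper returns nothing while the join still finds it.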
			
 
				 
			
 
				-## 🧩 Data Sources Layer (`sources/`)
			
 
				+## Backfill Scripts
			
 
				 
			
 
				-Mix of:
			
 
				-
			
 
				-* RSS feeds (primary)
			
 
				-* optional APIs later
			
 
				-
			
 
				-Examples:
			
 
				-
			
 
				-* Reuters
			
 
				-* Bloomberg
			
 
				-* CoinDesk
			
 
				-
			
 
				-### Responsibilities:
			
 
				-
			
 
				-* fetch articles
			
 
				-* normalize format
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 🔄 Ingestion Pipeline
			
 
				-
			
 
				-Runs periodically (e.g. every few minutes)
			
 
				-
			
 
				-Steps:
			
 
				-
			
 
				-1. fetch articles
			
 
				-2. normalize fields:
			
 
				-
			
 
				-   * title
			
 
				-   * url
			
 
				-   * source
			
 
				-   * timestamp
			
 
				-   * summary (if available)
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 🧹 Deduplication Layer
			
 
				-
			
 
				-### Problem:
			
 
				-
			
 
				-Same story appears across many sources.
			
 
				-
			
 
				-### Solution:
			
 
				-
			
 
				-Cluster articles by similarity:
			
 
				-
			
 
				-Methods:
			
 
				-
			
 
				-* title similarity (fuzzy match / embeddings)
			
 
				-* URL canonicalization
			
 
				-* content similarity (optional later)
			
 
				-
			
 
				-### Output:
			
 
				-
			
 
				-```json id="cluster"
			
 
				-{
			
 
				-  "cluster_id": "...",
			
 
				-  "headline": "Canonical headline",
			
 
				-  "articles": [...],
			
 
				-  "sources": ["Reuters", "Bloomberg"],
			
 
				-  "first_seen": "...",
			
 
				-  "last_updated": "..."
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-👉 This is your **core unit of truth**, not individual articles
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 🧠 Enrichment Layer
			
 
				-
			
 
				-Adds meaning to clusters.
			
 
				-
			
 
				-### 1. Entity extraction
			
 
				-
			
 
				-* assets (BTC, ETH)
			
 
				-* companies
			
 
				-* macro topics (inflation, rates)
			
 
				-
			
 
				----
			
 
				-
			
 
				-### 2. Topic classification
			
 
				-
			
 
				-Examples:
			
 
				-
			
 
				-* crypto
			
 
				-* macro
			
 
				-* regulation
			
 
				-* AI
			
 
				-
			
 
				----
			
 
				-
			
 
				-### 3. Sentiment (lightweight)
			
 
				-
			
 
				-* positive / negative / neutral
			
 
				-* or simple score
			
 
				-
			
 
				-👉 Keep this simple in v1 (don’t over-engineer NLP)
			
 
				-
			
 
				----
			
 
				-
			
 
				-### 4. Importance scoring (VERY useful)
			
 
				-
			
 
				-Heuristic:
			
 
				-
			
 
				-* number of sources covering it
			
 
				-* recency
			
 
				-* source credibility
			
 
				-* keyword weighting
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 🗃️ Storage Layer
			
 
				-
			
 
				-You need short-term memory:
			
 
				-
			
 
				-* clusters (not raw articles)
			
 
				-* TTL: e.g. 24–72h
			
 
				-
			
 
				-Optional:
			
 
				-
			
 
				-* in-memory store (start)
			
 
				-* later: DB
			
 
				-
			
 
				-we have a choice of storage possibilites including qdrant, postgresql, couchdb
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🧰 2. Agent-Facing Tools (IMPORTANT)
			
 
				-
			
 
				-Keep tools **high-level and semantic**
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 1. `get_latest_events`
			
 
				-
			
 
				-> “What is happening right now?”
			
 
				-
			
 
				-Input:
			
 
				-
			
 
				-```json id="n1"
			
 
				-{
			
 
				-  "topic": "crypto",
			
 
				-  "limit": 5,
			
 
				-  "include_articles": false
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-Output:
			
 
				-
			
 
				-```json id="n2"
			
 
				-[
			
 
				-  {
			
 
				-    "headline": "...",
			
 
				-    "summary": "...",
			
 
				-    "entities": ["BTC"],
			
 
				-    "sentiment": "positive",
			
 
				-    "importance": 0.82,
			
 
				-    "sources": ["Reuters", "CoinDesk"],
			
 
				-    "timestamp": "...",
			
 
				-    "articles": [
			
 
				-      {
			
 
				-        "title": "...",
			
 
				-        "url": "...",
			
 
				-        "source": "Reuters",
			
 
				-        "timestamp": "..."
			
 
				-      }
			
 
				-    ]
			
 
				-  }
			
 
				-]
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 2. `get_events_for_entity`
			
 
				-
			
 
				-> “What’s happening with X?”
			
 
				-
			
 
				-```json id="n3"
			
 
				-{
			
 
				-  "entity": "BTC",
			
 
				-  "include_articles": false
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-👉 filters clusters by entity
			
 
				-
			
 
				-Optional:
			
 
				-
			
 
				-* `include_articles` to include article title/url/source/timestamp in the payload
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 3. `get_event_summary`
			
 
				-
			
 
				-> “Explain this event clearly”
			
 
				-
			
 
				-```json id="n4"
			
 
				-{
			
 
				-  "event_id": "cluster_id",
			
 
				-  "include_articles": false
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-Output:
			
 
				-
			
 
				-* merged summary
			
 
				-* key facts
			
 
				-* sources
			
 
				-* optional articles (title/url/source/timestamp)
			
 
				-
			
 
				-👉 This is where you compress multiple articles into one clean narrative
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 4. `get_news_sentiment`
			
 
				-
			
 
				-> “What’s the tone around X?”
			
 
				-
			
 
				-```json id="n5"
			
 
				-{
			
 
				-  "entity": "BTC",
			
 
				-  "timeframe": "24h"
			
 
				-}
			
 
				+After deploying junction table schema changes:
			
 
				+```bash
			
 
				+docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
			
 
				 ```
			
 
				 
			
 
				-Output:
			
 
				-
			
 
				-```json id="n6"
			
 
				-{
			
 
				-  "sentiment": "positive",
			
 
				-  "score": 0.64,
			
 
				-  "article_count": 42
			
 
				-}
			
 
				+For timestamp normalization (already run on live server):
			
 
				+```bash
			
 
				+docker exec -it news-mcp python3 scripts/normalize_cluster_timestamps.py
			
 
				 ```
			
 
				 
			
 
				----
			
 
				-
			
 
				-## 5. `detect_emerging_topics` (very valuable)
			
 
				-
			
 
				-> “What is gaining attention?”
			
 
				-
			
 
				-Output:
			
 
				-
			
 
				-```json id="n7"
			
 
				-[
			
 
				-  {
			
 
				-    "topic": "Ethereum ETF",
			
 
				-    "trend_score": 0.91,
			
 
				-    "related_entities": ["ETH", "BlackRock", "SEC"],
			
 
				-    "count": 8,
			
 
				-    "avg_importance": 0.17
			
 
				-  }
			
 
				-]
			
 
				-```
			
 
				-
			
 
				----
			
 
				-
			
 
				-## 6. `get_related_entities`
			
 
				-
			
 
				-> “What entities tend to appear with X?”
			
 
				-
			
 
				-```json id="n8"
			
 
				-{
			
 
				-  "subject": "Iran",
			
 
				-  "timeframe": "24h",
			
 
				-  "limit": 10
			
 
				-}
			
 
				-```
			
 
				-
			
 
				-Output:
			
 
				-
			
 
				-```json id="n9"
			
 
				-[
			
 
				-  {
			
 
				-    "entity": "United States",
			
 
				-    "count": 5,
			
 
				-    "avg_importance": 0.11,
			
 
				-    "sentiment": "negative",
			
 
				-    "score": -0.2
			
 
				-  }
			
 
				-]
			
 
				-```
			
 
				-
			
 
				-👉 entity-only co-occurrence neighborhood for real-time sense-making
			
 
				-
			
 
				----
			
 
				-
			
 
				-# ⚠️ 3. What NOT to expose
			
 
				-
			
 
				-Avoid:
			
 
				-
			
 
				-* raw RSS feeds
			
 
				-* individual article endpoints
			
 
				-* unprocessed headlines
			
 
				-
			
 
				-❌ Bad:
			
 
				-
			
 
				-```id="bad-news"
			
 
				-get_raw_articles()
			
 
				-```
			
 
				-
			
 
				-👉 This destroys signal quality for agents
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🔁 4. Caching & Freshness Strategy
			
 
				-
			
 
				-## Key difference from crypto:
			
 
				-
			
 
				-* News is **append-only + evolving**
			
 
				-* Not real-time tick data
			
 
				-
			
 
				----
			
 
				-
			
 
				-## Strategy:
			
 
				-
			
 
				-### Fetch layer:
			
 
				-
			
 
				-* poll every few minutes
			
 
				-
			
 
				-### Cluster layer:
			
 
				-
			
 
				-* update clusters incrementally
			
 
				-
			
 
				-### Tool responses:
			
 
				-
			
 
				-* no heavy recomputation
			
 
				-* serve from processed store
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🧠 5. Deduplication Strategy (critical)
			
 
				-
			
 
				-Clustering is the unit of truth, not individual articles.
			
 
				-
			
 
				-**Signal cascade** (cheapest first, short-circuit on match):
			
 
				-1. Cosine similarity (if embeddings enabled) against cluster centroid
			
 
				-2. Fuzzy title similarity (SequenceMatcher, configurable threshold, default 0.87)
			
 
				-3. Token Jaccard over headline+summary (default threshold 0.55)
			
 
				-4. Consensus: cosine ≥ 0.80 AND (jaccard ≥ 0.30 OR title ≥ 0.55)
			
 
				-
			
 
				-Each new article is compared against **all** articles in a candidate cluster; the best signal across all members is used.
			
 
				-
			
 
				-**Stable cluster IDs**: `sha1(topic | min_article_key)` — the same set of articles always maps to the same ID regardless of which article arrived first or which polling cycle created the cluster.
			
 
				-
			
 
				-**Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles can merge into clusters from previous polling cycles.
			
 
				-
			
 
				-**Orphan merge**: a post-clustering Union-Find pass merges clusters that share article keys, catching cases where articles about the same event didn't match during the main loop.
			
 
				-
			
 
				-Planned runtime order:
			
 
				-* when `NEWS_EMBEDDINGS_ENABLED=true`, try Ollama embeddings first
			
 
				-* if Ollama fails, fall back to the existing heuristic cluster path
			
 
				-* keep candidate pre-filtering cheap before any vector compare
			
 
				-
			
 
				----
			
 
				-
			
 
				-# ⚡ 6. Signal Quality Rules
			
 
				-
			
 
				-Your MCP should:
			
 
				-
			
 
				-### ✅ Do:
			
 
				-
			
 
				-* reduce 100 articles → 5–10 clusters
			
 
				-* highlight consensus
			
 
				-* surface importance
			
 
				-
			
 
				-### ❌ Don’t:
			
 
				-
			
 
				-* overwhelm agent with volume
			
 
				-* pass conflicting duplicates
			
 
				-* expose noise
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🧩 7. Relationship to Other MCPs
			
 
				-
			
 
				-This MCP becomes powerful when combined with:
			
 
				-
			
 
				-* crypto MCP → price
			
 
				-* trends MCP → attention
			
 
				-
			
 
				-👉 News MCP provides:
			
 
				-
			
 
				-> **causal narratives**
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🧭 8. Design Philosophy
			
 
				-
			
 
				-Each tool should answer:
			
 
				-
			
 
				-> “What is happening, and why should I care?”
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🚀 9. Suggested Build Order
			
 
				-
			
 
				-1. RSS ingestion
			
 
				-2. normalization
			
 
				-3. basic deduplication
			
 
				-4. clustering
			
 
				-5. simple summarization
			
 
				-6. entity tagging
			
 
				-
			
 
				-👉 Only then expose tools
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🧠 Final takeaway
			
 
				-
			
 
				-> Crypto MCP gives you **facts**
			
 
				-> News MCP gives you **meaning**
			
 
				-
			
 
				-But only if you:
			
 
				-
			
 
				-* aggressively deduplicate
			
 
				-* cluster events
			
 
				-* compress information
			
 
				-
			
 
				----
			
 
				-
			
 
				-# ✅ Completed since this outlook was written
			
 
				-
			
 
				-* v0.1.0 released and tagged
			
 
				-* provider-agnostic LLM extraction/summarization layer added
			
 
				-* prompts moved into separate files for easier updates
			
 
				-* entity blacklist implemented and made case-insensitive
			
 
				-* wildcard blacklist support added for entities/topics/keywords
			
 
				-* live extraction smoke test added
			
 
				-* JSON-backed alias map added for query normalization
			
 
				-* query normalization added so shorthand like `btc` and `trump` still works
			
 
				-* docs updated with the new env vars and workflow
			
 
				-* optional article payloads added to event tools
			
 
				-* blacklist enforcement maintenance script added
			
 
				-* related-entities tool added for co-occurrence neighborhoods
			
 
				-* emerging-topic scoring improved with importance-weighting and co-occurrence
			
 
				-* concurrent RSS/OLLAMA/LLM pipelines added (v0.3.0)
			
 
				-* stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signal comparison added (v0.3.1)
			
 
				-
			
 
				----
			
 
				-
			
 
				-# 🔭 Next high-level steps
			
 
				-
			
 
				-## What is left of v0.1.0
			
 
				-
			
 
				-The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:
			
 
				-
			
 
				-* stabilize extraction quality across a few more real-world samples
			
 
				-* expand the alias map only where usage demands it
			
 
				-* tune emerging-topic noise so repeated source names do not dominate
			
 
				-* keep sentiment labels aligned with scores as the model improves
			
 
				-
			
 
				-## Where v0.2.0 should lead
			
 
				-
			
 
				-### Future plan (worth building slowly): “Emerging entity graph over time”
			
 
				-Right now `detect_emerging_topics()` returns a flat list of emerging *topics/entities*.
			
 
				-Next-level idea: turn it into an **entity graph** that an agent can reason over.
			
 
				-
			
 
				-**Core concept**
			
 
				-- Collapse/group results into canonical entity nodes (e.g. `iran`, `israel`, `donald_trump`, `strait_of_hormuz`, etc.)
			
 
				-- Build weighted edges from co-occurrence in recent clusters:
			
 
				-  - edge weight ~ frequency/co-occurrence strength
			
 
				-  - node weight ~ trend_score + count (+ optional avg_importance)
			
 
				-- Infer communities (graph grouping) so related nodes form stable “story neighborhoods”
			
 
				-
			
 
				-**Over time (the important part)**
			
 
				-- Each refresh window produces a snapshot of the graph
			
 
				-- Store snapshots / deltas to observe:
			
 
				-  - rising/falling node weights (“momentum”)
			
 
				-  - strengthening/weaker relations
			
 
				-  - emerging communities and topic shifts
			
 
				-
			
 
				-**Suggested output for an eventual agent tool**
			
 
				-- `get_emerging_entity_graph(timeframe, limit)` returning:
			
 
				-  - grouped communities
			
 
				-  - top nodes + weights
			
 
				-  - top relations + direction (optional)
			
 
				-  - summary of “what changed since last snapshot”
			
 
				-
			
 
				-This needs extra time to become a real usable MCP tool, so it’s intentionally captured here for later execution.
			
 
				-
			
 
				-
			
 
				-1. **Normalization layer**
			
 
				-
			
 
				-   * canonicalize acronyms and entity variants before storage / querying
			
 
				-   * keep the blacklist as a separate post-processing rule
			
 
				-
			
 
				-2. **Wildcard blacklist support**
			
 
				-
			
 
				-   * allow patterns for entities / topics / keywords
			
 
				-   * keep matching case-insensitive
			
 
				-
			
 
				-3. **Emerging signal quality**
			
 
				-
			
 
				-   * tune what counts as an emerging topic/entity
			
 
				-   * reduce noise from repeated source names and generic terms
			
 
				-
			
 
				-4. **Entity/time tracking and replay (future capability)**
			
 
				-
			
 
				-   * track how important entities evolve over time
			
 
				-   * allow replay of when entities first appeared, how topics shifted, and how sentiment changed
			
 
				-   * useful later for narrative reconstruction and trend timelines
			
 
				-
			
 
				-## Longer-term direction
			
 
				-
			
 
				-The endgame is not just “news search”, but a light narrative memory system:
			
 
				-
			
 
				-* entity histories over time
			
 
				-* topic shifts and turning points
			
 
				-* sentiment arcs
			
 
				-* replayable timelines for a person, company, or event
			
 
				+## Future Directions (v0.5.0+)
			
 
				 
			
 
				-That should stay in mind while keeping the current implementation simple.
			
 
				+### "Emerging entity graph over time"
			
 
				+- Collapse `detect_emerging_topics()` results into canonical entity nodes
			
 
				+- Build weighted edges from co-occurrence in recent clusters
			
 
				+- Infer communities (story neighborhoods)
			
 
				+- Track graph evolution across refresh windows (node momentum, edge strength changes)
			
 
				+- Agent tool: `get_emerging_entity_graph(timeframe, limit)`
			
--- a/POLLER_UPGRADE_PLAN.md
+++ b/POLLER_UPGRADE_PLAN.md
@@ -1,25 +1,22 @@
 
				 # Poller Upgrade Plan
			
 
				 
			
 
				+**Status:** Not yet implemented.
			
 
				+
			
 
				 ## Goal
			
 
				-Remove the poller's direct dependency on `SQLiteClusterStore._conn()` and replace it with a public store method so tests and future store implementations do not need to model a private connection helper.
			
 
				+Remove the poller's direct dependency on `SQLiteClusterStore._conn()` and replace it with a public store method.
			
 
				 
			
 
				 ## Current Problem
			
 
				-- `news_mcp/jobs/poller.py` clears legacy `feed_state` rows with `with store._conn() as conn:`.
			
 
				+- `news_mcp/jobs/poller.py:173` clears legacy `feed_state` rows with `with store._conn() as conn:`.
			
 
				 - That couples the refresh loop to a private SQLite implementation detail.
			
 
				-- Test doubles now need to expose `_conn()`, which is a sign the contract is too low-level.
			
 
				+- Test doubles need to expose `_conn()`.
			
 
				 
			
 
				 ## Proposed Refactor
			
 
				-1. Add a public method to `SQLiteClusterStore` for the legacy cleanup step.
			
 
				+1. Add a public `clear_legacy_feed_state()` method to `SQLiteClusterStore`.
			
 
				 2. Move the `DELETE FROM feed_state WHERE feed_key LIKE 'newsfeeds:%'` logic into that method.
			
 
				-3. Update `news_mcp/jobs/poller.py` to call the public method instead of `_conn()`.
			
 
				-4. Adjust tests to mock the public method, not a private connection handle.
			
 
				-
			
 
				-## Verification
			
 
				-- Re-run the poller-focused tests.
			
 
				-- Run the repo test script if the change stays small enough to keep coverage cheap.
			
 
				-- Confirm no other code paths still depend on `store._conn()` outside the store implementation itself.
			
 
				+3. Update `poller.py` to call the public method.
			
 
				+4. Adjust tests accordingly.
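
One way the refactor steps could look, as a sketch under stated assumptions. `SketchClusterStore` is a hypothetical stand-in, not the real `SQLiteClusterStore`; the `feed_state` columns other than `feed_key` and the shape of `_conn()` are guesses. Only the method name `clear_legacy_feed_state()` and the `DELETE` statement come from the plan.

```python
import sqlite3
from contextlib import contextmanager


class SketchClusterStore:
    """Hypothetical stand-in for the store, reduced to what the sketch needs."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS feed_state (feed_key TEXT PRIMARY KEY, value TEXT)"
        )
        self._db.commit()

    @contextmanager
    def _conn(self):
        # Private helper: commit on success, roll back on error.
        try:
            yield self._db
            self._db.commit()
        except Exception:
            self._db.rollback()
            raise

    def clear_legacy_feed_state(self):
        """Delete legacy feed_state rows and return how many were removed."""
        with self._conn() as conn:
            cur = conn.execute(
                "DELETE FROM feed_state WHERE feed_key LIKE 'newsfeeds:%'"
            )
            return cur.rowcount


# Demo setup with made-up rows (touches the private handle for seeding only).
store = SketchClusterStore()
store._db.executemany(
    "INSERT INTO feed_state VALUES (?, ?)",
    [("newsfeeds:legacy", "x"), ("feed:current", "y")],
)
store._db.commit()

# What the poller would call instead of `with store._conn() as conn:`.
removed = store.clear_legacy_feed_state()
remaining = [row[0] for row in store._db.execute("SELECT feed_key FROM feed_state")]
```

A test double then only needs a `clear_legacy_feed_state()` method, not a connection-like object.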
			
 
				 
			
 
				-## Notes
			
 
				+## Rules
			
 
				 - Keep the change narrow.
			
 
				-- Do not alter the feed-hash or clustering behavior in the same patch.
			
 
				-- Preserve the current legacy-row cleanup behavior exactly; only the access path should change.
			
 
				+- Do not alter feed-hash or clustering behavior in the same patch.
			
 
				+- Preserve current legacy-row cleanup behavior exactly.
			
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -6,271 +6,96 @@ Provide a signal-extraction MCP server that converts RSS into **deduplicated, en
 
				 ## Current architecture (v0.4.0)
			
 
				 - FastMCP SSE server mounted at `/mcp`
			
 
				 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
			
 
				-- **payload_ts** — indexed generated column for SQL-level event-time filtering (no JSON parsing at read time)
			
 
				-- **cluster_entities** and **cluster_keywords** junction tables with indexes for O(log n) entity/keyword search
			
 
				-- All read paths use SQL-level filtering (no full-table JSON parsing)
			
 
				-- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
			
 
				-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
			
 
				+- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. SQLite derives the value from the payload and keeps its index up to date on every write, giving O(log n) time-range queries. No write-path code needed.
			
 
				+- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
			
 
				+- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
			
 
				+- All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
			
 
				+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
			
 
				+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
			
 
				 - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
			
 
				-- Concurrent Ollama embeddings (pre-computed before clustering loop)
			
 
				-- Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
			
 
				-- Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
			
 
				-- All concurrency limits configurable via env vars (`NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_<PROVIDER>`)
			
 
				-- Dashboard REST API (`/api/v1/*`) for clusters, sentiment series, entity frequencies
			
 
				-- `get_latest_events()` defaults to all topics (omit `topic` for unfiltered)
			
 
				+- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
			
 
				+- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables
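
A small sketch of the stable-ID idea, assuming `min_article_key` means the lexicographically smallest article key in the cluster and that keys are plain strings. The function name and the sample keys are hypothetical.

```python
import hashlib


def stable_cluster_id(article_keys):
    """Hash the smallest article key so the ID ignores arrival order and topic."""
    if not article_keys:
        raise ValueError("a cluster needs at least one article key")
    smallest = min(article_keys)
    return hashlib.sha1(smallest.encode("utf-8")).hexdigest()


# Same articles, different arrival order: the ID should not change.
first_cycle = stable_cluster_id(["key-b", "key-a", "key-c"])
later_cycle = stable_cluster_id(["key-c", "key-a", "key-b"])
```

Under this assumption the ID depends only on the smallest key, so it stays fixed as long as that article remains in the cluster.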
			
 
				 
			
 
				-## Previous: v0.2.x architecture
			
 
				-- FastMCP SSE server mounted at `/mcp`
			
 
				-- SQLite cache for clusters + Groq summary caches
			
 
				-- RSS fetch (breakingthenews.net)
			
 
				-- v1 dedup via fuzzy title similarity only, seed-article-only comparison
			
 
				-- optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
			
 
				-- configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
			
 
				-- optional embeddings backfill script for precomputing cluster vectors in SQLite
			
 
				-- optional merge-analysis script for threshold experiments before any DB rewrite
			
 
				-- optional merge pass for destructive consolidation after threshold review
			
 
				-- optional article-dedup cleanup for repeated article variants inside a cluster
			
 
				-- Groq enrichment (topic/entities/sentiment/keywords)
			
 
				-- Tools expose semantic queries over cached clusters
			
 
				-
			
 
				-## MCP tools (current)
			
 
				+## MCP tools
			
 
				 - `get_latest_events(topic, limit, include_articles)`
			
 
				 - `get_events_for_entity(entity, limit, timeframe, include_articles)`
			
 
				 - `get_event_summary(event_id, include_articles)`
			
 
				-- `detect_emerging_topics(limit)`
			
 
				+- `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase)
			
 
				 - `get_news_sentiment(entity, timeframe)`
			
 
				 - `get_related_recent_entities(subject, timeframe, limit, include_trends)`
			
 
				+- `get_feeds()` / `toggle_feed(feed_url, enabled)`
			
 
				 - `get_capabilities()`
			
 
				 
			
 
				-## Refresh & caching
			
 
				-
			
 
				-## Future work (planned): entity graph over time
			
 
				-Instead of treating `detect_emerging_topics()` as a flat list, we want a higher-level representation:
			
 
				-
			
 
				-- Convert emerging topic/entity co-occurrence signals into a **weighted entity graph**
			
 
				-- Group the graph into **communities** (story neighborhoods)
			
 
				-- Track **time evolution** across refresh windows:
			
 
				-  - node “momentum” (trend_score/count changes)
			
 
				-  - edge strength changes (relation tightening/weakening)
			
 
				-  - community emergence/disappearance
			
 
				-
			
 
				-Eventual agent tool shape (later): `get_emerging_entity_graph(timeframe, limit)`.
			
 
				+## REST API
			
 
				+- `GET /` — server info, tools list
			
 
				+- `GET /health` — uptime, version hash
			
 
				+- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
			
 
				+- `GET /api/v1/entities` — top entities via junction table GROUP BY
			
 
				+- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
			
 
				+- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
			
 
				+- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
			
 
				+- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
			
 
				+- `GET /api/v1/cluster/{cluster_id}` — full detail
			
 
				+- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management
			
 
				 
			
 
				-- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 900s)
			
 
				-- Feed-hash skipping to avoid redundant RSS+Groq work
			
 
				-- Cluster TTL (`NEWS_CLUSTERS_TTL_HOURS` via `CLUSTERS_TTL_HOURS`)
			
 
				+## Refresh & caching
			
 
				+- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
			
 
				+- Feed-hash skipping to avoid redundant RSS+LLM work
			
 
				 - Summary caching for `get_event_summary`
			
 
				+- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`
			
 
				 
			
 
				-## Definition of “committable”
			
 
				-- Tests pass offline (dedup/storage unit tests)
			
 
				-- Server exposes tool surface with valid schemas
			
 
				-- Caching prevents repeated Groq calls for unchanged clusters
			
 
				-- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
			
 
				-- Embeddings backfill script exists for older cluster rows before the server restart
			
 
				-- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
			
 
				-- Merge pass exists for destructive consolidation once thresholds look sane
			
 
				-- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
			
 
				-- Entity lookup now respects timeframe as the scan window, with limit acting as a cap
			
 
				-
			
 
				-## Dashboard & REST API (added May 2026)
			
 
				-
			
 
				-### What was added
			
 
				-- **5 REST endpoints** (`/api/v1/*`) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
			
 
				-- **Dashboard SPA** at `/dashboard` — HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
			
 
				-- **Non-blocking startup** — moved from synchronous `@app.on_event("startup")` pruning to `lifespan`-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency
			
 
				-
			
 
				-### Architecture
			
 
				-```
			
 
				-news-mcp/
			
 
				-├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + dashboard mount
			
 
				-├── news_mcp/dashboard/
			
 
				-│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
			
 
				-│   ├── index.html                   ← SPA shell with 5 views
			
 
				-│   ├── style.css                    ← Dark theme, responsive
			
 
				-│   └── dashboard.js                 ← Client-side rendering + Chart.js
			
 
				-```
			
 
				-
			
 
				-### Key design decisions
			
 
				-- Dashboard store wraps `SQLiteClusterStore` with thin read-only methods — no enrichment, no writes
			
 
				-- Single shared store instance (`_shared_store`) avoids repeated DB connections
			
 
				-- Static SPA files are served by FastAPI's `StaticFiles` mount — no Jinja2/templating dependency
			
 
				-- Client-side `fetch()` + Chart.js avoids HTMX raw-JSON-in-DOM issues
			
 
				-- Default lookback matches `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not a hardcoded 24h
			
 
				-
			
 
				-### Known gaps
			
 
				-- No auth (LAN-only, no login)
			
 
				-- Entity detail view in dashboard is minimal (click-to-expand from entity list is stub)
			
 
				-- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)
			
 
				-
			
 
				-## Dashboard & REST API (added May 2026)
			
 
				-
			
 
				-### What was added
			
 
				-- **5 REST endpoints** (`/api/v1/*`) — JSON-only, for programmatic access and the dashboard
			
 
				-- **Dashboard SPA** at `/dashboard` — 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
			
 
				-- **Non-blocking startup** — replaced synchronous `@app.on_event("startup")` with `lifespan`-based fire-and-forget background loop; server responds in <0.3s
			
 
				-- **Async ingestion lock** — `asyncio.Lock` prevents overlapping refresh cycles
			
 
				-- **Hardened LLM calls** — OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation
			
 
				-
			
 
				-### Architecture additions
			
 
				-```
			
 
				-news-mcp/
			
 
				-├── news_mcp/mcp_server_fastmcp.py   ← MCP + REST API + /dashboard static mount
			
 
				-├── news_mcp/dashboard/
			
 
				-│   ├── __init__.py
			
 
				-│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
			
 
				-│   ├── index.html                   ← SPA shell, 5 views
			
 
				-│   ├── style.css                    ← Dark theme, responsive grid
			
 
				-│   └── dashboard.js                 ← Client render, Chart.js, null-safe DOM access
			
 
				-```
			
 
				-
			
 
				-### Design decisions
			
 
				-- **Dashboard store** wraps `SQLiteClusterStore` with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
			
 
				-- **Single shared store** (`_shared_store`) — one DB connection pool for the entire process.
			
 
				-- **Static SPA** served via FastAPI `StaticFiles` — no Jinja2/templating dependency.
			
 
				-- **Client-side rendering** with `fetch()` + Chart.js — avoids HTMX raw-JSON-in-DOM issues.
			
 
				-- **Default lookback** follows `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not hardcoded.
			
 
				-- **Cluster ordering** — always date-descending (SQL `ORDER BY updated_at DESC` + client-side sort as safety net).
			
 
				-
			
 
				-### Known gaps (for future work)
			
 
				-- No auth (LAN-only assumption)
			
 
				-- Entity detail view is functional but minimal
			
 
				-- No alerting/threshold notifications (Phase 2)
			
 
				-- No server-sent events for real-time dashboard updates
			
 
				-
			
 
				-## Keyword Utilization Upgrade (May 2026)
			
 
				-
			
 
				-### Problem
			
 
				-Keywords are extracted by the LLM (`extract_entities.prompt` — "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view — but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.
			
 
				-
			
 
				-### Plan
			
 
				-
			
 
				-#### Phase 1 — Search & Retrieval (done)
			
 
				-- **1a**: Add keywords to `_cluster_entity_haystack()` in `mcp_server_fastmcp.py` so `get_events_for_entity()` and `get_news_sentiment()` match clusters by thematic keywords, not just named entities.
			
 
				-- **1b**: Add `keywords` field to cluster output dicts in `get_latest_events()` and `get_events_for_entity()` so downstream LLM agents see the full semantic picture.
			
 
				-
			
 
				-#### Phase 2 — Emerging Topics (pending)
			
 
				-- **2a**: Count keywords in `detect_emerging_topics()` with parallel `keyword_counts_recent` / `keyword_counts_prior` accumulators, scored with the same velocity/recency/source-diversity formula as entities.
			
 
				-- **2b**: Optionally promote high-velocity keywords to "suggested entities" on the dashboard.
			
 
				-
			
 
				-#### Phase 3 — Relatedness & Dashboard (pending)
			
 
				-- **3a**: Add keyword co-occurrence counting in `_collect_local_related()` in `related_entities.py`.
			
 
				-- **3b**: Add `get_keyword_frequencies()` to `DashboardStore` and a "Keywords" panel on the dashboard.
			
 
				-
			
 
				-#### Phase 4 — Prompt Refinement (optional)
			
 
				-- Split keyword extraction into "theme keywords" (subject matter) and "signal keywords" (what's new/notable) for differential weighting downstream.
			
 
				-
			
 
				-## Timestamp Normalization (May 2026)
			
 
				-
			
 
				-### Problem
			
 
				-Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like `"Sat, 30 May 2026 02:00:12 +00:00"`). Every read path needed fragile format-guessing, and SQL time-range queries on `updated_at` (row modification time, not event time) returned wrong data.
			
 
				-
			
 
				-### Fix
			
 
				-- `_normalize_ts()` helper in `sqlite_store.py`: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform `YYYY-MM-DDTHH:MM:SS+00:00`
			
 
				-- `sanitize_cluster_payload()` now normalizes `timestamp`, `first_seen`, `last_updated`, and all `article[].timestamp` before writing to DB
			
 
				-- `merge_cluster_embeddings.py`: same normalization on merged payloads
			
 
				-- `scripts/normalize_cluster_timestamps.py`: backfill script for existing rows (run on live server with correct `--db` path)
			
 
				-- `get_sentiment_series()` and `get_entity_frequencies()`: filter by `payload.timestamp` in Python, not `updated_at` in SQL
			
 
				-
			
 
				-### Key invariant
			
 
				-`updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.
			
 
				-
			
 
				-## Timestamp Read-Path Cleanup (May 2026)
			
 
				-
			
 
				-### Problem
			
 
				-After normalization, all read paths still contained defensive RFC 2822 / `parsedate_to_datetime` fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.
			
 
				-
			
 
				-### Fix
			
 
				-- Added `_read_ts(ts) -> float | None` to `sqlite_store.py` (module-level, exported). Uses only `datetime.fromisoformat()`. No RFC 2822 fallback. If it fails, the normalization pipeline has a bug — fix that instead.
			
 
				-- All read-path timestamp parsing in `sqlite_store.py`, `dashboard_store.py`, and `mcp_server_fastmcp.py` now uses `_read_ts` or plain `fromisoformat`.
			
 
				-- `parsedate_to_datetime` removed from `dashboard_store.py` and `mcp_server_fastmcp.py` imports entirely.
			
 
				-- `parsedate_to_datetime` is **only** retained in `sqlite_store._normalize_ts()` (the write path) and `dedup/cluster.py` (raw ingest before normalization).
			
 
				-- Test fixtures updated to use ISO 8601 UTC timestamps.
			
 
				-
			
 
				-### Contract (ENFORCE THIS)
			
 
				-- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **always** `YYYY-MM-DDTHH:MM:SS+00:00` for any row written after the normalization migration.
			
 
				-- Read paths: use `_read_ts()` from `sqlite_store` or `datetime.fromisoformat()` directly. **Never** add `parsedate_to_datetime` to a read path.
			
 
				-- Write paths: `sanitize_cluster_payload()` in `sqlite_store.py` is the single normalization point. All writes go through `upsert_clusters()` which calls it.
			
 
				-- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data — the dev DB is stale/empty.
			
 
				-
			
 
				-## Junction Tables + Indexed Timestamp (May 2026)
			
 
				-
			
 
				-### Problem
			
 
				-All read paths deserialize every JSON payload to filter by entity/keyword/time. With 6000+ clusters, `get_clusters_page` returns only the 100 newest — clicking an entity that appears 34x shows only 2 clusters because the other 32 are outside the LIMIT. `get_entity_frequencies` counts correctly but the detail view can't find them. Every query does a full table scan with JSON parsing.
			
 
				-
			
 
				-### Solution: junction tables + generated timestamp column
			
 
				-
			
 
				-**Schema (migrated in `_init_db`, incremental-safe):**
			
 
				-
			
 
				+## Schema (clusters table)
			
 
				 ```sql
			
 
				--- Indexed event timestamp (SQLite generated column — zero write-path cost)
			
 
				-ALTER TABLE clusters ADD COLUMN payload_ts
			
 
				-    GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) STORED;
			
 
				-CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts);
			
 
				+CREATE TABLE clusters (
			
 
				+    cluster_id TEXT PRIMARY KEY,
			
 
				+    topic TEXT NOT NULL,
			
 
				+    payload TEXT NOT NULL,
			
 
				+    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
			
 
				+    summary_payload TEXT,
			
 
				+    summary_updated_at TEXT,
			
 
				+    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
			
 
				+        (json_extract(payload, '$.timestamp')) VIRTUAL
			
 
				+);
			
 
				+CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
			
 
				 
			
 
				--- Entity junction table for SQL-level entity search
			
 
				-CREATE TABLE IF NOT EXISTS cluster_entities (
			
 
				+CREATE TABLE cluster_entities (
			
 
				     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
			
 
				-    entity     TEXT NOT NULL,
			
 
				+    entity     TEXT NOT NULL,          -- lowercased
			
 
				     PRIMARY KEY (cluster_id, entity)
			
 
				 );
			
 
				-CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity ON cluster_entities(entity);
			
 
				+CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
			
 
				 
			
 
				--- Keyword junction table for SQL-level keyword search
			
 
				-CREATE TABLE IF NOT EXISTS cluster_keywords (
			
 
				+CREATE TABLE cluster_keywords (
			
 
				     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
			
 
				-    keyword    TEXT NOT NULL,
			
 
				+    keyword    TEXT NOT NULL,          -- lowercased
			
 
				     PRIMARY KEY (cluster_id, keyword)
			
 
				 );
			
 
				-CREATE INDEX IF NOT EXISTS idx_cluster_keywords_keyword ON cluster_keywords(keyword);
			
 
				+CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
			
 
				 ```
			
 
				 
			
 
				-**Write path (`upsert_clusters`):** Within the existing transaction, after sanitizing the payload and before INSERT/UPDATE:
			
 
				-1. `DELETE FROM cluster_entities WHERE cluster_id = ?`  (handles re-enrichment)
			
 
				-2. `DELETE FROM cluster_keywords WHERE cluster_id = ?`
			
 
				-3. `INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)` for each entity
			
 
				-4. `INSERT OR IGNORE INTO cluster_keywords VALUES (?, ?)` for each keyword
			
 
				-5. `payload_ts` is auto-maintained by SQLite's generated column — no code needed
			
 
				-
			
 
				-**Read paths — all SQL-level, no JSON parsing at query time:**
			
 
				-
			
 
				-- `get_clusters_page`: `WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ? OFFSET ?`
			
 
				-- `get_entity_frequencies`: `JOIN cluster_entities ... WHERE payload_ts >= ? GROUP BY entity ORDER BY cnt DESC`
			
 
				-- `get_keyword_frequencies`: `JOIN cluster_keywords ... WHERE payload_ts >= ? GROUP BY keyword ORDER BY cnt DESC`
			
 
				-- New `get_clusters_by_entity`: `JOIN cluster_entities WHERE payload_ts >= ? AND entity = ?`
			
 
				-- New `get_clusters_by_keyword`: `JOIN cluster_keywords WHERE payload_ts >= ? AND keyword = ?`
			
 
				+## Keyword Utilization (done, May 2026)
			
 
				+Keywords extracted by the LLM are now first-class search signals:
			
 
				+- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
			
 
				+- Cluster output includes `keywords[]` field
			
 
				+- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
			
 
				+- `_collect_local_related()` counts keyword co-occurrence
			
 
				+- Dashboard Keywords panel with SQL frequency counts via junction table
			
 
				+- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
			
 
				 
			
 
				-**Backfill script (`scripts/backfill_junction_tables.py`):**
			
 
				-- Same pattern as `normalize_cluster_timestamps.py`
			
 
				-- Accepts `--db` arg, defaults to config DB_PATH
			
 
				-- Reads all cluster payloads, populates `cluster_entities` and `cluster_keywords`
			
 
				-- `payload_ts` is auto-populated by SQLite's generated column
			
 
				-- Idempotent (`INSERT OR IGNORE` + transaction)
			
 
				-- Reports entity/keyword counts after completion
			
 
				-- Run once on live server: `docker exec -it <container> python3 scripts/backfill_junction_tables.py`
			
 
				+## Timestamp Pipeline (May 2026)
			
 
				+1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three are missing, it falls back to `datetime.now()`.
			
 
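				+2. **Generated column**: SQLite derives `payload_ts` from `payload.timestamp` via `json_extract`, so no write-path code is needed. The column is indexed, and the index is maintained on every write.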
			
 
				+3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
			
 
				+4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
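
A minimal sketch of the write-side normalization in step 1, assuming inputs arrive as ISO 8601 strings, RFC 2822 strings, or epoch seconds. The function name `normalize_ts` is a hypothetical stand-in and not the project's actual helper; naive timestamps are assumed to be UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_ts(value):
    """Return value as 'YYYY-MM-DDTHH:MM:SS+00:00', or None if unparseable.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return None
    if isinstance(value, (int, float)):
        # Epoch seconds are interpreted as UTC.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = str(value).strip()
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            try:
                # RFC 2822 fallback belongs on the write path only.
                dt = parsedate_to_datetime(text)
            except (TypeError, ValueError):
                return None
        if dt.tzinfo is None:
            # Assumption: naive timestamps are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
    # Convert to UTC and drop sub-second precision for a uniform string.
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Because every stored value then shares one fixed-width UTC format, plain string comparison in `payload_ts >= ?` orders rows chronologically.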
			
 
				 
			
 
				-**REST API changes:**
			
 
				-- `GET /api/v1/clusters` — now uses SQL `payload_ts` filter, consistent total
			
 
				-- `GET /api/v1/entities` — SQL `COUNT(*) ... GROUP BY` via junction table
			
 
				-- `GET /api/v1/keywords` — SQL `COUNT(*) ... GROUP BY` via junction table
			
 
				-- **New `GET /api/v1/clusters/by-entity?entity=X&hours=Y&limit=Z`** — SQL entity search
			
 
				-- **New `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y&limit=Z`** — SQL keyword search
			
 
				+## Design Flaw: Two Stores (KNOWN, fix planned)
			
 
				 
			
 
				-**Dashboard JS changes:**
			
 
				-- `showEntityDetail(label)` — calls `/api/v1/clusters/by-entity` instead of fetching all clusters
			
 
				-- `showKeywordDetail(label)` — calls `/api/v1/clusters/by-keyword` instead of fetching all clusters
			
 
				+**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when `DashboardStore` was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still parses JSON payloads in Python for entity/keyword search.
			
 
				 
			
 
				-**Files changed:**
			
 
				-| File | Change |
			
 
				-|---|---|
			
 
				-| `news_mcp/storage/sqlite_store.py` | Schema migration (generated column + junction tables), write-path junction population, new SQL-level read methods |
			
 
				-| `news_mcp/mcp_server_fastmcp.py` | New REST endpoints for entity/keyword cluster search |
			
 
				-| `news_mcp/dashboard/dashboard_store.py` | `get_entity_frequencies`, `get_keyword_frequencies` use SQL junction table counts |
			
 
				-| `dashboard/dashboard.js` | `showEntityDetail`, `showKeywordDetail` call new endpoints |
			
 
				-| `scripts/backfill_junction_tables.py` | New backfill script (same pattern as normalize_cluster_timestamps.py) |
			
 
				+**Current state:**
			
 
				+- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
			
 
				+- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python
			
 
				 
			
 
				-**Migration safety:**
			
 
				-- All DDL uses `IF NOT EXISTS` / `ADD COLUMN IF NOT EXISTS` — safe to re-run
			
 
				-- Backfill script is idempotent (`INSERT OR IGNORE` in transactions)
			
 
				-- Generated column requires no write-path code changes
			
 
				-- Old query methods can coexist during transition (removed after verification)
			
 
				+**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
			
 
				 
			
 
				+**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.
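
The difference between the two read paths can be reproduced with a small in-memory sketch. It assumes a trimmed-down version of the schema above and an SQLite build with generated-column and JSON support (3.31 or newer). The helper names `upsert`, `top_n_then_filter`, and `by_entity` are hypothetical and only illustrate the idea; they are not the project's methods.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
"""


def upsert(conn, cluster_id, timestamp, entities):
    """Write one cluster and repopulate its junction rows."""
    payload = json.dumps({"timestamp": timestamp, "entities": entities})
    conn.execute(
        "INSERT OR REPLACE INTO clusters (cluster_id, payload) VALUES (?, ?)",
        (cluster_id, payload),
    )
    conn.execute("DELETE FROM cluster_entities WHERE cluster_id = ?", (cluster_id,))
    conn.executemany(
        "INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)",
        [(cluster_id, e.lower()) for e in entities],
    )


def top_n_then_filter(conn, entity, since, top_n):
    """Flawed path: take the newest top_n clusters, then filter in Python."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters "
        "WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ?",
        (since, top_n),
    ).fetchall()
    wanted = entity.lower()
    return [
        cid for cid, payload in rows
        if wanted in (e.lower() for e in json.loads(payload)["entities"])
    ]


def by_entity(conn, entity, since, limit):
    """Junction-table path: the entity filter runs in SQL before LIMIT."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters c "
        "JOIN cluster_entities ce ON ce.cluster_id = c.cluster_id "
        "WHERE ce.entity = ? AND c.payload_ts >= ? "
        "ORDER BY c.payload_ts DESC LIMIT ?",
        (entity.lower(), since, limit),
    ).fetchall()
    return [r[0] for r in rows]


# Demo data: ten hourly clusters; the entity appears in c0, c3, and c9.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for h in range(10):
    names = ["Pete Hegseth", "Pentagon"] if h in (0, 3, 9) else ["Other"]
    upsert(conn, f"c{h}", f"2026-05-30T{h:02d}:00:00+00:00", names)
SINCE = "2026-05-30T00:00:00+00:00"

print(top_n_then_filter(conn, "Pete Hegseth", SINCE, 3))
print(by_entity(conn, "Pete Hegseth", SINCE, 50))
```

With a window of three clusters, the Python-side filter sees only `c9`, while the junction-table query returns all three matches. The same effect scales up to the 200-cluster window described above.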