# Project: news-mcp

## Goal

Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.

## Current architecture (v0.3.2)

- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
- Concurrent RSS fetch (async `asyncio.gather` + `httpx`, bounded semaphore)
- **Multi-signal clustering**: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
- **Stable cluster IDs**: `sha1(min_article_key)`, which is topic-independent, order-independent, and consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, so matching works even when the enriched topic has drifted.
- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
- Concurrent Ollama embeddings (pre-computed before clustering loop)
- Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
- Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
- All concurrency limits configurable via env vars (`NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_`)
- Dashboard REST API (`/api/v1/*`) for clusters, sentiment series, entity frequencies
- `get_latest_events()` defaults to all topics (omit `topic` for unfiltered)

## Previous: v0.2.x architecture

- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + Groq summary caches
- RSS fetch (breakingthenews.net)
- v1 dedup via fuzzy title similarity only, seed-article-only comparison
- optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
- configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
- optional embeddings backfill script for precomputing cluster vectors in SQLite
- optional merge-analysis script for threshold experiments before any DB rewrite
- optional merge pass for destructive consolidation after threshold review
- optional article-dedup cleanup for repeated article variants inside a cluster
- Groq enrichment (topic/entities/sentiment/keywords)
- Tools expose semantic queries over cached clusters

## MCP tools (current)

- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit)`
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_capabilities()`

## Refresh & caching

- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 900s)
- Feed-hash skipping to avoid redundant RSS+Groq work
- Cluster TTL (`NEWS_CLUSTERS_TTL_HOURS` via `CLUSTERS_TTL_HOURS`)
- Summary caching for `get_event_summary`

## Future work (planned): entity graph over time

Instead of treating `detect_emerging_topics()` as a flat list, we want a higher-level representation:

- Convert emerging topic/entity co-occurrence signals into a **weighted entity graph**
- Group the graph into **communities** (story neighborhoods)
- Track **time evolution** across refresh windows:
  - node “momentum” (trend_score/count changes)
  - edge strength changes (relation tightening/weakening)
  - community emergence/disappearance

Eventual agent tool shape (later): `get_emerging_entity_graph(timeframe, limit)`.

## Definition of “committable”

- Tests pass offline (dedup/storage unit tests)
- Server exposes tool surface with valid schemas
- Caching prevents repeated Groq calls for unchanged clusters
- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
- Embeddings backfill script exists for older cluster rows before the server restart
- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
- Merge pass exists for destructive consolidation once thresholds look sane
- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
- Entity lookup now respects timeframe as the scan window, with limit acting as a cap

## Dashboard & REST API (added May 2026)

### What was added

- **5 REST endpoints** (`/api/v1/*`) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
- **Dashboard SPA** at `/dashboard`: HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
- **Non-blocking startup**: moved from synchronous `@app.on_event("startup")` pruning to a `lifespan`-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency

### Architecture

```
news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell with 5 views
│   ├── style.css                    ← Dark theme, responsive
│   └── dashboard.js                 ← Client-side rendering + Chart.js
```

### Key design decisions

- Dashboard store wraps `SQLiteClusterStore` with thin read-only methods: no enrichment, no writes
- Single shared store instance (`_shared_store`) avoids repeated DB connections
- Static SPA files are served by FastAPI's `StaticFiles` mount, so there is no Jinja2/templating dependency
- Client-side `fetch()` + Chart.js avoids HTMX raw-JSON-in-DOM issues
- Default lookback matches `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not a hardcoded 24h

### Known gaps

- No auth (LAN-only, no login)
- Entity detail view in dashboard is minimal (click-to-expand from entity list is a stub)
- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)

## Dashboard & REST API (added May 2026)

### What was added

- **5 REST endpoints** (`/api/v1/*`): JSON-only, for programmatic access and the dashboard
- **Dashboard SPA** at `/dashboard`: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
- **Non-blocking startup**: replaced synchronous `@app.on_event("startup")` with a `lifespan`-based fire-and-forget background loop; server responds in <0.3s
- **Async ingestion lock**: `asyncio.Lock` prevents overlapping refresh cycles
- **Hardened LLM calls**: OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation

### Architecture additions

```
news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│   ├── __init__.py
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell, 5 views
│   ├── style.css                    ← Dark theme, responsive grid
│   └── dashboard.js                 ← Client render, Chart.js, null-safe DOM access
```

### Design decisions

- **Dashboard store** wraps `SQLiteClusterStore` with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
- **Single shared store** (`_shared_store`): one DB connection pool for the entire process.
- **Static SPA** served via FastAPI `StaticFiles`: no Jinja2/templating dependency.
- **Client-side rendering** with `fetch()` + Chart.js: avoids HTMX raw-JSON-in-DOM issues.
- **Default lookback** follows `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not hardcoded.
- **Cluster ordering**: always date-descending (SQL `ORDER BY updated_at DESC` + client-side sort as safety net).

### Known gaps (for future work)

- No auth (LAN-only assumption)
- Entity detail view is functional but minimal
- No alerting/threshold notifications (Phase 2)
- No server-sent events for real-time dashboard updates

## Keyword Utilization Upgrade (May 2026)

### Problem

Keywords are extracted by the LLM (`extract_entities.prompt`: "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view, but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.

### Plan

#### Phase 1: Search & Retrieval (done)

- **1a**: Add keywords to `_cluster_entity_haystack()` in `mcp_server_fastmcp.py` so `get_events_for_entity()` and `get_news_sentiment()` match clusters by thematic keywords, not just named entities.
- **1b**: Add `keywords` field to cluster output dicts in `get_latest_events()` and `get_events_for_entity()` so downstream LLM agents see the full semantic picture.

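
Step 1a can be pictured with a minimal sketch. The real `_cluster_entity_haystack()` is not shown in these notes, so the payload field names (`title`, `entities`, `keywords`) and the helper `cluster_matches()` are assumptions made for illustration, not the project's actual code.

```python
from typing import Any


def cluster_entity_haystack(cluster: dict[str, Any]) -> str:
    """Build one lowercase search string from entities, keywords, and title.

    Hypothetical stand-in for `_cluster_entity_haystack()`; field names are assumed.
    """
    parts: list[str] = []
    parts.extend(str(e) for e in cluster.get("entities") or [])
    # Phase 1a: thematic keywords join the haystack alongside named entities.
    parts.extend(str(k) for k in cluster.get("keywords") or [])
    parts.append(str(cluster.get("title") or ""))
    return " | ".join(p.strip().lower() for p in parts if p.strip())


def cluster_matches(cluster: dict[str, Any], query: str) -> bool:
    """Case-insensitive substring match of the query against the haystack."""
    needle = query.strip().lower()
    return bool(needle) and needle in cluster_entity_haystack(cluster)


example = {
    "title": "Fed signals pause",
    "entities": ["Federal Reserve"],
    "keywords": ["rate-cut", "ETF"],
}
print(cluster_matches(example, "etf"))        # matched via a keyword
print(cluster_matches(example, "contagion"))  # not present anywhere
```

With keywords in the haystack, a query such as `"ETF"` finds the cluster even though no named entity carries that term.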
#### Phase 2: Emerging Topics (pending)

- **2a**: Count keywords in `detect_emerging_topics()` with parallel `keyword_counts_recent` / `keyword_counts_prior` accumulators, scored with the same velocity/recency/source-diversity formula as entities.
- **2b**: Optionally promote high-velocity keywords to "suggested entities" on the dashboard.

#### Phase 3: Relatedness & Dashboard (pending)

- **3a**: Add keyword co-occurrence counting in `_collect_local_related()` in `related_entities.py`.
- **3b**: Add `get_keyword_frequencies()` to `DashboardStore` and a "Keywords" panel on the dashboard.

#### Phase 4: Prompt Refinement (optional)

- Split keyword extraction into "theme keywords" (subject matter) and "signal keywords" (what's new/notable) for differential weighting downstream.

## Timestamp Normalization (May 2026)

### Problem

Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like `"Sat, 30 May 2026 02:00:12 +00:00"`). Every read path needed fragile format-guessing, and SQL time-range queries on `updated_at` (row modification time, not event time) returned wrong data.

### Fix

- `_normalize_ts()` helper in `sqlite_store.py`: parses ISO 8601, RFC 2822/HTTP-date, and epoch seconds into a uniform `YYYY-MM-DDTHH:MM:SS+00:00`
- `sanitize_cluster_payload()` now normalizes `timestamp`, `first_seen`, `last_updated`, and all `article[].timestamp` before writing to DB
- `merge_cluster_embeddings.py`: same normalization on merged payloads
- `scripts/normalize_cluster_timestamps.py`: backfill script for existing rows (run on live server with correct `--db` path)
- `get_sentiment_series()` and `get_entity_frequencies()`: filter by `payload.timestamp` in Python, not `updated_at` in SQL

### Key invariant

`updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.

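
The normalization step described in this section could look roughly like the sketch below. It is not the project's `_normalize_ts()`: the name `normalize_ts`, the parse order (epoch, then ISO 8601, then RFC 2822), and the choice to treat timezone-less inputs as UTC are all assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Union


def normalize_ts(value: Union[str, int, float, None]) -> Optional[str]:
    """Return `YYYY-MM-DDTHH:MM:SS+00:00`, or None if the value cannot be parsed."""
    if value is None:
        return None
    text = str(value).strip()
    dt = None
    # 1) Epoch seconds (int, float, or numeric string).
    try:
        dt = datetime.fromtimestamp(float(text), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        dt = None
    # 2) ISO 8601; a trailing "Z" is rewritten for older Python versions.
    if dt is None:
        iso_text = text[:-1] + "+00:00" if text.endswith("Z") else text
        try:
            dt = datetime.fromisoformat(iso_text)
        except ValueError:
            dt = None
    # 3) RFC 2822 / HTTP-date, as found in raw RSS feeds.
    if dt is None:
        try:
            dt = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            return None
    if dt.tzinfo is None:
        # Assumption: inputs without a timezone are already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()


print(normalize_ts("Sat, 30 May 2026 02:00:12 +0000"))
print(normalize_ts("2026-05-30T04:00:12+02:00"))
```

Because every stored value ends up in the same fixed-width UTC form, plain string comparison orders timestamps chronologically, which is what makes later SQL-level range filters possible.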
## Timestamp Read-Path Cleanup (May 2026)

### Problem

After normalization, all read paths still contained defensive RFC 2822 / `parsedate_to_datetime` fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.

### Fix

- Added `_read_ts(ts) -> float | None` to `sqlite_store.py` (module-level, exported). Uses only `datetime.fromisoformat()`. No RFC 2822 fallback. If it fails, the normalization pipeline has a bug; fix that instead.
- All read-path timestamp parsing in `sqlite_store.py`, `dashboard_store.py`, and `mcp_server_fastmcp.py` now uses `_read_ts` or plain `fromisoformat`.
- `parsedate_to_datetime` removed from `dashboard_store.py` and `mcp_server_fastmcp.py` imports entirely.
- `parsedate_to_datetime` is **only** retained in `sqlite_store._normalize_ts()` (the write path) and `dedup/cluster.py` (raw ingest before normalization).
- Test fixtures updated to use ISO 8601 UTC timestamps.

### Contract (ENFORCE THIS)

- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **always** `YYYY-MM-DDTHH:MM:SS+00:00` for any row written after the normalization migration.
- Read paths: use `_read_ts()` from `sqlite_store` or `datetime.fromisoformat()` directly. **Never** add `parsedate_to_datetime` to a read path.
- Write paths: `sanitize_cluster_payload()` in `sqlite_store.py` is the single normalization point. All writes go through `upsert_clusters()`, which calls it.
- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data: the dev DB is stale/empty.

## Junction Tables + Indexed Timestamp (May 2026)

### Problem

All read paths deserialize every JSON payload to filter by entity/keyword/time.
With 6000+ clusters, `get_clusters_page` returns only the 100 newest, so clicking an entity that appears 34x shows only 2 clusters because the other 32 are outside the LIMIT. `get_entity_frequencies` counts correctly but the detail view can't find them. Every query does a full table scan with JSON parsing.

### Solution: junction tables + generated timestamp column

**Schema (migrated in `_init_db`, incremental-safe):**

```sql
-- Indexed event timestamp (SQLite generated column; no write-path code needed).
-- VIRTUAL, not STORED: SQLite cannot add a STORED generated column via ALTER TABLE.
ALTER TABLE clusters ADD COLUMN payload_ts
    GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL;
CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts);

-- Entity junction table for SQL-level entity search.
-- ON DELETE CASCADE only fires when PRAGMA foreign_keys = ON for the connection.
CREATE TABLE IF NOT EXISTS cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity ON cluster_entities(entity);

-- Keyword junction table for SQL-level keyword search
CREATE TABLE IF NOT EXISTS cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX IF NOT EXISTS idx_cluster_keywords_keyword ON cluster_keywords(keyword);
```

**Write path (`upsert_clusters`):**

Within the existing transaction, after sanitizing the payload and before INSERT/UPDATE:

1. `DELETE FROM cluster_entities WHERE cluster_id = ?` (handles re-enrichment)
2. `DELETE FROM cluster_keywords WHERE cluster_id = ?`
3. `INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)` for each entity
4. `INSERT OR IGNORE INTO cluster_keywords VALUES (?, ?)` for each keyword
5. `payload_ts` is auto-maintained by SQLite's generated column, so no code is needed

**Read paths (all SQL-level, no JSON parsing at query time):**

- `get_clusters_page`: `WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ? OFFSET ?`
- `get_entity_frequencies`: `JOIN cluster_entities ... WHERE payload_ts >= ? GROUP BY entity ORDER BY cnt DESC`
- `get_keyword_frequencies`: `JOIN cluster_keywords ... WHERE payload_ts >= ? GROUP BY keyword ORDER BY cnt DESC`
- New `get_clusters_by_entity`: `JOIN cluster_entities WHERE payload_ts >= ? AND entity = ?`
- New `get_clusters_by_keyword`: `JOIN cluster_keywords WHERE payload_ts >= ? AND keyword = ?`

**Backfill script (`scripts/backfill_junction_tables.py`):**

- Same pattern as `normalize_cluster_timestamps.py`
- Accepts `--db` arg, defaults to config DB_PATH
- Reads all cluster payloads, populates `cluster_entities` and `cluster_keywords`
- `payload_ts` is auto-populated by SQLite's generated column
- Idempotent (`INSERT OR IGNORE` + transaction)
- Reports entity/keyword counts after completion
- Run once on live server: `docker exec -it python3 scripts/backfill_junction_tables.py`

**REST API changes:**

- `GET /api/v1/clusters`: now uses SQL `payload_ts` filter, consistent total
- `GET /api/v1/entities`: SQL `COUNT(*) ... GROUP BY` via junction table
- `GET /api/v1/keywords`: SQL `COUNT(*) ... GROUP BY` via junction table
- **New `GET /api/v1/clusters/by-entity?entity=X&hours=Y&limit=Z`**: SQL entity search
- **New `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y&limit=Z`**: SQL keyword search

**Dashboard JS changes:**

- `showEntityDetail(label)`: calls `/api/v1/clusters/by-entity` instead of fetching all clusters
- `showKeywordDetail(label)`: calls `/api/v1/clusters/by-keyword` instead of fetching all clusters

**Files changed:**

| File | Change |
|---|---|
| `news_mcp/storage/sqlite_store.py` | Schema migration (generated column + junction tables), write-path junction population, new SQL-level read methods |
| `news_mcp/mcp_server_fastmcp.py` | New REST endpoints for entity/keyword cluster search |
| `news_mcp/dashboard/dashboard_store.py` | `get_entity_frequencies`, `get_keyword_frequencies` use SQL junction table counts |
| `news_mcp/dashboard/dashboard.js` | `showEntityDetail`, `showKeywordDetail` call new endpoints |
| `scripts/backfill_junction_tables.py` | New backfill script (same pattern as `normalize_cluster_timestamps.py`) |

**Migration safety:**

- `CREATE TABLE` and `CREATE INDEX` use `IF NOT EXISTS`, so they are safe to re-run. SQLite has no `ADD COLUMN IF NOT EXISTS`, so the `payload_ts` addition must be guarded by a column-existence check (for example via `PRAGMA table_xinfo(clusters)`, which also lists generated columns).
- Backfill script is idempotent (`INSERT OR IGNORE` in transactions)
- Generated column requires no write-path code changes
- Old query methods can coexist during transition (removed after verification)
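
To make the migration, write path, and entity lookup concrete, here is a small in-memory sketch using Python's `sqlite3`. It is an illustration under assumptions, not the code in `sqlite_store.py`: the function names `migrate` and `upsert_cluster`, the reduced two-column `clusters` table, and the omission of the keyword table are simplifications. It also writes the cluster row before its junction rows (so the foreign key holds when enforcement is on), which is a choice made for this sketch. It needs an SQLite build with generated columns (3.31+) and the JSON functions.

```python
import json
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Idempotent schema setup; safe to call on every startup."""
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clusters ("
        "cluster_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    # table_xinfo (unlike table_info) also lists generated columns.
    columns = {row[1] for row in conn.execute("PRAGMA table_xinfo(clusters)")}
    if "payload_ts" not in columns:
        conn.execute(
            "ALTER TABLE clusters ADD COLUMN payload_ts TEXT "
            "GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL"
        )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cluster_entities ("
        "cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, "
        "entity TEXT NOT NULL, PRIMARY KEY (cluster_id, entity))"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity "
        "ON cluster_entities(entity)"
    )


def upsert_cluster(conn: sqlite3.Connection, cluster_id: str, payload: dict) -> None:
    """Write the cluster row and rebuild its entity rows in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO clusters (cluster_id, payload) VALUES (?, ?) "
            "ON CONFLICT(cluster_id) DO UPDATE SET payload = excluded.payload",
            (cluster_id, json.dumps(payload)),
        )
        # Delete-then-insert so re-enrichment can drop stale entities.
        conn.execute("DELETE FROM cluster_entities WHERE cluster_id = ?", (cluster_id,))
        conn.executemany(
            "INSERT OR IGNORE INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
            [(cluster_id, entity) for entity in payload.get("entities", [])],
        )


def get_clusters_by_entity(conn, entity: str, since_iso: str, limit: int = 50) -> list:
    """Newest-first cluster IDs for one entity, filtered on the indexed timestamp."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters AS c "
        "JOIN cluster_entities AS ce ON ce.cluster_id = c.cluster_id "
        "WHERE c.payload_ts >= ? AND ce.entity = ? "
        "ORDER BY c.payload_ts DESC LIMIT ?",
        (since_iso, entity, limit),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # second run changes nothing
upsert_cluster(conn, "a", {"timestamp": "2026-05-29T10:00:00+00:00", "entities": ["ECB", "Fed"]})
upsert_cluster(conn, "b", {"timestamp": "2026-05-30T02:00:12+00:00", "entities": ["Fed"]})
# Re-enrichment of "a" drops "Fed"; the junction rows follow the new payload.
upsert_cluster(conn, "a", {"timestamp": "2026-05-29T10:00:00+00:00", "entities": ["ECB"]})
fed_clusters = get_clusters_by_entity(conn, "Fed", "2026-05-01T00:00:00+00:00")
print(fed_clusters)
```

The `payload_ts >= ?` filter relies on the normalized, fixed-width UTC timestamp format, so string comparison matches chronological order.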