# Project: news-mcp

## Goal

Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.

## Current architecture (v1)

- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + Groq summary caches
- RSS fetch (breakingthenews.net)
- v1 dedup via fuzzy title similarity
- optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
- configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
- optional embeddings backfill script for precomputing cluster vectors in SQLite
- optional merge-analysis script for threshold experiments before any DB rewrite
- optional merge pass for destructive consolidation after threshold review
- optional article-dedup cleanup for repeated article variants inside a cluster
- Groq enrichment (topic/entities/sentiment/keywords)
- Tools expose semantic queries over cached clusters

## MCP tools (current)

- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit)`
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_capabilities()`

## Future work (planned): entity graph over time

Instead of treating `detect_emerging_topics()` as a flat list, we want a higher-level representation:

- Convert emerging topic/entity co-occurrence signals into a **weighted entity graph**
- Group the graph into **communities** (story neighborhoods)
- Track **time evolution** across refresh windows:
  - node “momentum” (trend_score/count changes)
  - edge strength changes (relation tightening/weakening)
  - community emergence/disappearance

Eventual agent tool shape (later): `get_emerging_entity_graph(timeframe, limit)`.

## Refresh & caching
- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 900s)
- Feed-hash skipping to avoid redundant RSS+Groq work
- Cluster TTL (`NEWS_CLUSTERS_TTL_HOURS` via `CLUSTERS_TTL_HOURS`)
- Summary caching for `get_event_summary`

## Definition of “committable”

- Tests pass offline (dedup/storage unit tests)
- Server exposes tool surface with valid schemas
- Caching prevents repeated Groq calls for unchanged clusters
- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
- Embeddings backfill script exists for older cluster rows before the server restart
- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
- Merge pass exists for destructive consolidation once thresholds look sane
- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
- Entity lookup now respects timeframe as the scan window, with limit acting as a cap

## Dashboard & REST API (added May 2026)

### What was added

- **5 REST endpoints** (`/api/v1/*`) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
- **Dashboard SPA** at `/dashboard`: HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
- **Non-blocking startup**: moved from synchronous `@app.on_event("startup")` pruning to `lifespan`-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency

### Architecture

```
news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell with 5 views
│   ├── style.css                    ← Dark theme, responsive
│   └── dashboard.js                 ← Client-side rendering + Chart.js
```

### Key design decisions

- Dashboard store wraps `SQLiteClusterStore` with thin read-only methods: no enrichment, no writes
- Single shared store instance
  (`_shared_store`) avoids repeated DB connections
- Static SPA files are served by FastAPI's `StaticFiles` mount, with no Jinja2/templating dependency
- Client-side `fetch()` + Chart.js avoids HTMX raw-JSON-in-DOM issues
- Default lookback matches `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not a hardcoded 24h

### Known gaps

- No auth (LAN-only, no login)
- Entity detail view in dashboard is minimal (click-to-expand from the entity list is a stub)
- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)

## Dashboard & REST API (added May 2026)

### What was added

- **5 REST endpoints** (`/api/v1/*`): JSON-only, for programmatic access and the dashboard
- **Dashboard SPA** at `/dashboard`: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
- **Non-blocking startup**: replaced synchronous `@app.on_event("startup")` with a `lifespan`-based fire-and-forget background loop; server responds in <0.3s
- **Async ingestion lock**: `asyncio.Lock` prevents overlapping refresh cycles
- **Hardened LLM calls**: OpenRouter retry logic with exponential backoff on 429/5xx, plus response shape validation

### Architecture additions

```
news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│   ├── __init__.py
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell, 5 views
│   ├── style.css                    ← Dark theme, responsive grid
│   └── dashboard.js                 ← Client render, Chart.js, null-safe DOM access
```

### Design decisions

- **Dashboard store** wraps `SQLiteClusterStore` with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
- **Single shared store** (`_shared_store`): one DB connection pool for the entire process.
- **Static SPA** served via FastAPI `StaticFiles`, with no Jinja2/templating dependency.
- **Client-side rendering** with `fetch()` + Chart.js, which avoids HTMX raw-JSON-in-DOM issues.
- **Default lookback** follows `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not hardcoded.
- **Cluster ordering**: always date-descending (SQL `ORDER BY updated_at DESC` + client-side sort as safety net).

### Known gaps (for future work)

- No auth (LAN-only assumption)
- Entity detail view is functional but minimal
- No alerting/threshold notifications (Phase 2)
- No server-sent events for real-time dashboard updates
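
### Example sketch: feed-hash skipping with an ingestion lock

The feed-hash skipping and `asyncio.Lock` ingestion lock described in the sections above could fit together roughly as follows. This is a minimal sketch, not the project's actual code: `RefreshGuard`, `fetch_feed`, `process`, and the returned status strings are hypothetical names introduced for illustration, and SHA-256 over the raw feed bytes is an assumed choice of hash.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, List, Optional


class RefreshGuard:
    """Hypothetical helper: skips unchanged feeds and prevents overlapping cycles."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._last_hash: Optional[str] = None

    async def refresh(
        self,
        fetch_feed: Callable[[], Awaitable[bytes]],
        process: Callable[[bytes], Awaitable[None]],
    ) -> str:
        # If a cycle is already running, skip rather than queue behind it.
        if self._lock.locked():
            return "busy"
        async with self._lock:
            raw = await fetch_feed()
            digest = hashlib.sha256(raw).hexdigest()
            if digest == self._last_hash:
                # Same bytes as last time: no clustering or enrichment needed.
                return "unchanged"
            await process(raw)
            # Record the hash only after processing succeeded, so a failed
            # cycle is retried on the next refresh.
            self._last_hash = digest
            return "processed"


async def demo() -> List[str]:
    feed = {"body": b"<rss>v1</rss>"}  # stand-in for the real RSS response

    async def fetch_feed() -> bytes:
        return feed["body"]

    async def process(raw: bytes) -> None:
        await asyncio.sleep(0.01)  # stand-in for clustering + LLM enrichment

    guard = RefreshGuard()
    # Two cycles started together: the second sees the lock held and skips.
    first, second = await asyncio.gather(
        guard.refresh(fetch_feed, process),
        guard.refresh(fetch_feed, process),
    )
    third = await guard.refresh(fetch_feed, process)  # feed bytes unchanged
    feed["body"] = b"<rss>v2</rss>"
    fourth = await guard.refresh(fetch_feed, process)  # feed bytes changed
    return [first, second, third, fourth]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch an overlapping cycle returns immediately instead of waiting for the lock, on the assumption that the next scheduled refresh will pick up any changes; a design that queues cycles would simply drop the `locked()` check.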