Project: news-mcp
Goal
Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
Current architecture (v0.3.1)
- FastMCP SSE server mounted at /mcp
- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
- Concurrent RSS fetch (async: asyncio.gather + httpx, bounded semaphore)
- Multi-signal clustering: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
- Stable cluster IDs: sha1(topic | min_article_key), order-independent and consistent across polling cycles
- Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable via NEWS_CLUSTER_MAX_AGE_HOURS, default 4h)
- Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys
- Concurrent Ollama embeddings (pre-computed before clustering loop)
- Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
- Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
- All concurrency limits configurable via env vars (NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>)
- Dashboard REST API (/api/v1/*) for clusters, sentiment series, entity frequencies
- get_latest_events() defaults to all topics (omit topic for unfiltered results)
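A minimal sketch of two items from the list above: the stable cluster ID and the Union-Find orphan merge. The function names, the literal "|" separator, and the in-memory set representation are assumptions made for illustration; the project's actual code may differ.

```python
import hashlib


def stable_cluster_id(topic: str, article_keys: list[str]) -> str:
    """sha1(topic | min_article_key), assuming "|" is a literal separator."""
    # min() makes the ID independent of the order in which articles arrived.
    seed = f"{topic}|{min(article_keys)}"
    return hashlib.sha1(seed.encode("utf-8")).hexdigest()


def merge_clusters_sharing_keys(clusters: list[set[str]]) -> list[set[str]]:
    """Union-Find pass: clusters that share any article key are merged."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lower index as root so the output order is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Remember which cluster first claimed each article key.
    first_owner: dict[str, int] = {}
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in first_owner:
                union(idx, first_owner[key])
            else:
                first_owner[key] = idx

    # Collect the article keys of every cluster under its root.
    merged: dict[int, set[str]] = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())
```

Because the ID depends only on the topic and the smallest article key, a cluster keeps its ID when later articles with larger keys join it in a subsequent polling cycle.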
Previous: v0.2.x architecture
- FastMCP SSE server mounted at /mcp
- SQLite cache for clusters + Groq summary caches
- RSS fetch (breakingthenews.net)
- v1 dedup via fuzzy title similarity only, seed-article-only comparison
- optional Ollama embeddings path for clustering (when NEWS_EMBEDDINGS_ENABLED=true)
- configurable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
- optional embeddings backfill script for precomputing cluster vectors in SQLite
- optional merge-analysis script for threshold experiments before any DB rewrite
- optional merge pass for destructive consolidation after threshold review
- optional article-dedup cleanup for repeated article variants inside a cluster
- Groq enrichment (topic/entities/sentiment/keywords)
- Tools expose semantic queries over cached clusters
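For contrast with the current multi-signal clustering, the v0.2.x behavior (fuzzy title similarity only, compared against the seed article) could look roughly like the sketch below. It uses difflib.SequenceMatcher from the standard library as a stand-in; the fuzzy matcher the project actually used, the 0.8 threshold, and the function names are assumptions.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_by_seed_title(titles: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy clustering: each title is compared only to a cluster's first (seed) title."""
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if title_similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            # No seed was similar enough, so this title starts a new cluster.
            clusters.append([title])
    return clusters
```

Comparing against the seed only means a title that resembles a later member of the cluster, but not the seed, starts a new cluster. That is the gap the v0.3.1 "compares against ALL cluster articles" change addresses.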
MCP tools (current)
get_latest_events(topic, limit, include_articles)
get_events_for_entity(entity, limit, timeframe, include_articles)
get_event_summary(event_id, include_articles)
detect_emerging_topics(limit)
get_news_sentiment(entity, timeframe)
get_related_recent_entities(subject, timeframe, limit, include_trends)
get_capabilities()
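As a rough illustration of how an agent might invoke one of these tools, the sketch below builds an MCP tools/call JSON-RPC request by hand. The helper name and the argument values (in particular the "24h" timeframe format) are assumptions; get_capabilities() is the authoritative source for accepted values.

```python
import json


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(payload)


# Hypothetical argument values, shown only to illustrate the parameter names.
request = build_tool_call(
    1,
    "get_events_for_entity",
    {"entity": "OpenAI", "limit": 5, "timeframe": "24h", "include_articles": False},
)
```

In practice an MCP client library would normally build and send this message over the SSE transport; the raw payload is shown only to make the tool signature concrete.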
Refresh & caching
- Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
- Feed-hash skipping to avoid redundant RSS+Groq work
- Cluster TTL (NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS)
- Summary caching for get_event_summary
Future work (planned): entity graph over time
Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:
- Convert emerging topic/entity co-occurrence signals into a weighted entity graph
- Group the graph into communities (story neighborhoods)
- Track time evolution across refresh windows:
  - node “momentum” (trend_score/count changes)
  - edge strength changes (relation tightening/weakening)
  - community emergence/disappearance
Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).
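The planned entity graph is not implemented yet, so the following is only one possible starting point: a weighted co-occurrence graph built from per-cluster entity lists, plus a diff of edge weights between two refresh windows. All names and the input shape are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(cluster_entities: list[list[str]]) -> Counter:
    """Edge weight = number of clusters in which both entities appear."""
    edges: Counter = Counter()
    for entities in cluster_entities:
        # sorted(set(...)) gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def edge_strength_changes(previous: Counter, current: Counter) -> dict:
    """Positive values mean a relation tightened; negative values mean it weakened."""
    changes = {}
    for edge in set(previous) | set(current):
        delta = current[edge] - previous[edge]  # a missing edge counts as 0
        if delta != 0:
            changes[edge] = delta
    return changes
```

Community detection (the "story neighborhoods") would sit on top of this edge list and is left out of the sketch.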
Definition of “committable”
- Tests pass offline (dedup/storage unit tests)
- Server exposes tool surface with valid schemas
- Caching prevents repeated Groq calls for unchanged clusters
- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
- Embeddings backfill script exists to precompute vectors for older cluster rows before a server restart
- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
- Merge pass exists for destructive consolidation once thresholds look sane
- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
- Entity lookup now respects timeframe as the scan window, with limit acting as a cap
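The caching criterion above depends on not reprocessing unchanged input. A minimal sketch of feed-hash skipping follows, assuming the hash is taken over the raw feed body and kept in memory. The project keeps feed state in SQLite, and the class name and the choice of SHA-256 are assumptions.

```python
import hashlib


class FeedHashGate:
    """Remembers the last content hash per feed and reports whether it changed."""

    def __init__(self) -> None:
        self._last_hash: dict[str, str] = {}

    def should_process(self, feed_url: str, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if self._last_hash.get(feed_url) == digest:
            # Identical feed body: skip parsing, clustering and LLM enrichment.
            return False
        self._last_hash[feed_url] = digest
        return True
```

A poller would call should_process() right after fetching each feed and continue with clustering and enrichment only when it returns True.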
Dashboard & REST API (added May 2026)
What was added
- 5 REST endpoints (/api/v1/*) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
- Dashboard SPA at /dashboard: HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
- Non-blocking startup: moved from synchronous @app.on_event("startup") pruning to a lifespan-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency
Architecture
news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell with 5 views
│ ├── style.css ← Dark theme, responsive
│ └── dashboard.js ← Client-side rendering + Chart.js
Key design decisions
- Dashboard store wraps SQLiteClusterStore with thin read-only methods: no enrichment, no writes
- Single shared store instance (_shared_store) avoids repeated DB connections
- Static SPA files are served by FastAPI's StaticFiles mount, so there is no Jinja2/templating dependency
- Client-side fetch() + Chart.js avoids HTMX raw-JSON-in-DOM issues
- Default lookback matches NEWS_DEFAULT_LOOKBACK_HOURS (144h), not a hardcoded 24h
Known gaps
- No auth (LAN-only, no login)
- Entity detail view in dashboard is minimal (click-to-expand from the entity list is a stub)
- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)
Dashboard & REST API (added May 2026)
What was added
- 5 REST endpoints (/api/v1/*): JSON-only, for programmatic access and the dashboard
- Dashboard SPA at /dashboard: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
- Non-blocking startup: replaced synchronous @app.on_event("startup") with a lifespan-based fire-and-forget background loop; server responds in <0.3s
- Async ingestion lock: asyncio.Lock prevents overlapping refresh cycles
- Hardened LLM calls: OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation
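One way the ingestion lock could prevent overlapping cycles is sketched below. It assumes a cycle that finds the lock held is skipped rather than queued; awaiting the lock instead would serialize the cycles. The class and counters are hypothetical.

```python
import asyncio


class Ingestor:
    """Runs at most one refresh cycle at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.completed = 0
        self.skipped = 0

    async def run_cycle(self) -> bool:
        # No await between the check and the acquire, so no other task can
        # take the lock in between on a single event loop.
        if self._lock.locked():
            self.skipped += 1
            return False
        async with self._lock:
            await asyncio.sleep(0.01)  # placeholder for fetch + cluster + enrich
            self.completed += 1
            return True
```

Skipping suits a periodic poller, since the next scheduled cycle picks up anything the skipped one would have fetched.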
Architecture additions
news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│ ├── __init__.py
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell, 5 views
│ ├── style.css ← Dark theme, responsive grid
│ └── dashboard.js ← Client render, Chart.js, null-safe DOM access
Design decisions
- Dashboard store wraps SQLiteClusterStore with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
- Single shared store (_shared_store): one DB connection pool for the entire process.
- Static SPA served via FastAPI StaticFiles, with no Jinja2/templating dependency.
- Client-side rendering with fetch() + Chart.js, which avoids HTMX raw-JSON-in-DOM issues.
- Default lookback follows NEWS_DEFAULT_LOOKBACK_HOURS (144h), not hardcoded.
- Cluster ordering: always date-descending (SQL ORDER BY updated_at DESC + client-side sort as safety net).
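A read-only, paginated query in the spirit of the dashboard store might look like the sketch below. The clusters table and its column names are assumptions, since the real schema is not shown here; only the ORDER BY updated_at DESC clause comes from the notes above.

```python
import sqlite3

COLUMNS = ("cluster_id", "headline", "updated_at")  # hypothetical schema


def latest_clusters(conn: sqlite3.Connection, limit: int = 20, offset: int = 0) -> list[dict]:
    """Return one page of clusters, newest first. Performs no writes."""
    rows = conn.execute(
        "SELECT cluster_id, headline, updated_at FROM clusters "
        "ORDER BY updated_at DESC LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return [dict(zip(COLUMNS, row)) for row in rows]
```

Storing updated_at as ISO 8601 text keeps the SQL ordering chronological, so the client-side sort mentioned above only has to act as a safety net.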
Known gaps (for future work)
- No auth (LAN-only assumption)
- Entity detail view is functional but minimal
- No alerting/threshold notifications (Phase 2)
- No server-sent events for real-time dashboard updates