Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v0.3.2)

FastMCP SSE server mounted at /mcp
SQLite cache for clusters + entity metadata + feed state + LLM summary caches
Concurrent RSS fetch (asyncio.gather + httpx, bounded by a semaphore)
Multi-signal clustering: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (normalize_topic_from_title) that new articles use, ensuring matching works even when the enriched topic drifted.
Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys
Concurrent Ollama embeddings (pre-computed before clustering loop)
Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
All concurrency limits configurable via env vars (NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>)
Dashboard REST API (/api/v1/*) for clusters, sentiment series, entity frequencies
get_latest_events() defaults to all topics (omit topic for unfiltered)
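
As a rough illustration of the stable-ID and orphan-merge items above, the sketch below hashes the smallest article key and then runs a Union-Find pass over clusters that share keys. It is a minimal sketch, not the project's code: the names stable_cluster_id and merge_orphans are hypothetical, clusters are modeled as plain sets of article keys, and "min" is assumed to mean the lexicographically smallest key.

```python
import hashlib


def stable_cluster_id(article_keys):
    """sha1 of the smallest article key; the topic is deliberately not hashed."""
    # Assumption: "min" means lexicographic order over string keys.
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_orphans(clusters):
    """Union-Find pass: merge clusters that share at least one article key."""
    parent = list(range(len(clusters)))

    def find(i):
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lower index as root so output order is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    owner = {}  # article key -> index of the first cluster that contained it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in owner:
                union(idx, owner[key])
            else:
                owner[key] = idx

    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())
```

Because the ID depends only on the set of keys, the order in which articles arrive within a polling cycle does not change it.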

Previous: v0.2.x architecture

FastMCP SSE server mounted at /mcp
SQLite cache for clusters + Groq summary caches
RSS fetch (breakingthenews.net)
v1 dedup via fuzzy title similarity only, seed-article-only comparison
Optional Ollama embeddings path for clustering (when NEWS_EMBEDDINGS_ENABLED=true)
Configurable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
Optional embeddings backfill script for precomputing cluster vectors in SQLite
Optional merge-analysis script for threshold experiments before any DB rewrite
Optional merge pass for destructive consolidation after threshold review
Optional article-dedup cleanup for repeated article variants inside a cluster
Groq enrichment (topic/entities/sentiment/keywords)
Tools expose semantic queries over cached clusters

MCP tools (current)

get_latest_events(topic, limit, include_articles)
get_events_for_entity(entity, limit, timeframe, include_articles)
get_event_summary(event_id, include_articles)
detect_emerging_topics(limit)
get_news_sentiment(entity, timeframe)
get_related_recent_entities(subject, timeframe, limit, include_trends)
get_capabilities()

Future work (planned): entity graph over time

Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:

Convert emerging topic/entity co-occurrence signals into a weighted entity graph
Group the graph into communities (story neighborhoods)
Track time evolution across refresh windows:
- node “momentum” (trend_score/count changes)
- edge strength changes (relation tightening/weakening)
- community emergence/disappearance

Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).

Refresh & caching

Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
Feed-hash skipping to avoid redundant RSS+Groq work
Cluster TTL (NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS)
Summary caching for get_event_summary
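
The feed-hash skipping idea can be sketched as follows. This is an illustration under assumptions, not the actual poller: the function name feed_changed is hypothetical, a plain dict stands in for the feed state kept in SQLite, and hashing the raw response body with SHA-256 is an assumed choice (the notes do not say what is hashed or with which algorithm).

```python
import hashlib


def feed_changed(feed_url, body, state):
    """Return True if the feed body differs from the last one seen for this URL.

    `state` maps feed URL -> hex digest of the last processed body. It is
    updated in place when a change is detected.
    """
    digest = hashlib.sha256(body).hexdigest()
    if state.get(feed_url) == digest:
        return False  # unchanged: skip parsing, clustering and LLM enrichment
    state[feed_url] = digest
    return True
```

A refresh cycle would call this once per feed and run the expensive stages only for feeds where it returns True.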

Definition of “committable”

Tests pass offline (dedup/storage unit tests)
Server exposes tool surface with valid schemas
Caching prevents repeated Groq calls for unchanged clusters
Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
Embeddings backfill script exists to precompute vectors for older cluster rows before the server is restarted
Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
Merge pass exists for destructive consolidation once thresholds look sane
Article-dedup cleanup exists for fixing duplicated article records already in SQLite
Entity lookup now respects timeframe as the scan window, with limit acting as a cap

Dashboard & REST API (added May 2026)

What was added

5 REST endpoints (/api/v1/*) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
Dashboard SPA at /dashboard — HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
Non-blocking startup — moved from synchronous @app.on_event("startup") pruning to lifespan-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency

Architecture

news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell with 5 views
│   ├── style.css                    ← Dark theme, responsive
│   └── dashboard.js                 ← Client-side rendering + Chart.js

Key design decisions

Dashboard store wraps SQLiteClusterStore with thin read-only methods — no enrichment, no writes
Single shared store instance (_shared_store) avoids repeated DB connections
Static SPA files are served by FastAPI's StaticFiles mount — no Jinja2/templating dependency
Client-side fetch() + Chart.js avoids HTMX raw-JSON-in-DOM issues
Default lookback matches NEWS_DEFAULT_LOOKBACK_HOURS (144h), not a hardcoded 24h

Known gaps

No auth (LAN-only, no login)
Entity detail view in dashboard is minimal (click-to-expand from the entity list is a stub)
No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)

Dashboard & REST API (added May 2026)

What was added

5 REST endpoints (/api/v1/*) — JSON-only, for programmatic access and the dashboard
Dashboard SPA at /dashboard — 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
Non-blocking startup — replaced synchronous @app.on_event("startup") with lifespan-based fire-and-forget background loop; server responds in <0.3s
Async ingestion lock — asyncio.Lock prevents overlapping refresh cycles
Hardened LLM calls — OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation
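
A minimal sketch of retrying on 429/5xx with exponential backoff is shown below. Everything named here is hypothetical (LLMHTTPError, is_retryable, call_with_backoff); the real OpenRouter client and its backoff schedule are not shown in these notes. The 2s/4s/8s defaults are borrowed from the per-cluster retry item in the architecture list, and the sleep function is injectable so the behaviour can be checked without waiting.

```python
import asyncio


class LLMHTTPError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed LLM call."""

    def __init__(self, status):
        super().__init__(f"LLM call failed with HTTP {status}")
        self.status = status


def is_retryable(status):
    # Retry only on rate limiting (429) and server-side errors (5xx).
    return status == 429 or 500 <= status < 600


async def call_with_backoff(call, retries=3, base_delay=2.0, sleep=asyncio.sleep):
    """Await call(); on a retryable error wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except LLMHTTPError as exc:
            # Give up on the last attempt or on errors that will not go away.
            if attempt == retries or not is_retryable(exc.status):
                raise
            await sleep(base_delay * (2 ** attempt))
```

With the defaults, a call that keeps failing with 503 is attempted four times in total, with waits of 2, 4 and 8 seconds between attempts, and the last error is re-raised.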

Architecture additions

news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│   ├── __init__.py
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell, 5 views
│   ├── style.css                    ← Dark theme, responsive grid
│   └── dashboard.js                 ← Client render, Chart.js, null-safe DOM access

Design decisions

Dashboard store wraps SQLiteClusterStore with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
Single shared store (_shared_store) — one DB connection pool for the entire process.
Static SPA served via FastAPI StaticFiles — no Jinja2/templating dependency.
Client-side rendering with fetch() + Chart.js — avoids HTMX raw-JSON-in-DOM issues.
Default lookback follows NEWS_DEFAULT_LOOKBACK_HOURS (144h), not hardcoded.
Cluster ordering — always date-descending (SQL ORDER BY updated_at DESC + client-side sort as safety net).

Known gaps (for future work)

No auth (LAN-only assumption)
Entity detail view is functional but minimal
No alerting/threshold notifications (Phase 2)
No server-sent events for real-time dashboard updates

Keyword Utilization Upgrade (May 2026)

Problem

Keywords are extracted by the LLM (extract_entities.prompt — "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view — but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.

Plan

Phase 1 — Search & Retrieval (done)

1a: Add keywords to _cluster_entity_haystack() in mcp_server_fastmcp.py so get_events_for_entity() and get_news_sentiment() match clusters by thematic keywords, not just named entities.
1b: Add keywords field to cluster output dicts in get_latest_events() and get_events_for_entity() so downstream LLM agents see the full semantic picture.
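
To make step 1a concrete, here is a small sketch of a haystack that includes keywords. It is not the real _cluster_entity_haystack(); the names build_haystack and cluster_matches are hypothetical, and the cluster dict shape (an "entities" list and a "keywords" list of strings) is assumed for illustration.

```python
def build_haystack(cluster):
    """Lower-cased text blob of named entities plus thematic keywords."""
    parts = list(cluster.get("entities", [])) + list(cluster.get("keywords", []))
    return " ".join(p.lower() for p in parts if isinstance(p, str))


def cluster_matches(cluster, query):
    """Case-insensitive substring match of the query against the haystack."""
    needle = query.strip().lower()
    return bool(needle) and needle in build_haystack(cluster)
```

With keywords in the haystack, a query such as "rate-cut" can match a cluster whose only named entity is a central bank.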

Phase 2 — Emerging Topics (pending)

2a: Count keywords in detect_emerging_topics() with parallel keyword_counts_recent / keyword_counts_prior accumulators, scored with the same velocity/recency/source-diversity formula as entities.
2b: Optionally promote high-velocity keywords to "suggested entities" on the dashboard.
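
One possible shape for the parallel accumulators in 2a is sketched below. The scoring formula used for entities is not reproduced in these notes, so the difference recent minus prior is only a placeholder for it. The function names, the "keywords" and "timestamp" fields, and the two equal-length windows are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def keyword_counts(clusters, now, window):
    """Count keywords per cluster in the recent window and the one before it."""
    recent, prior = Counter(), Counter()
    for cluster in clusters:
        age = now - cluster["timestamp"]
        if age < timedelta(0) or age >= 2 * window:
            continue  # outside both windows
        bucket = recent if age < window else prior
        # A set so that each cluster counts a keyword at most once.
        bucket.update({kw.lower() for kw in cluster.get("keywords", [])})
    return recent, prior


def naive_velocity(recent, prior):
    """Placeholder score: change in cluster count between the two windows."""
    return {kw: recent[kw] - prior[kw] for kw in recent}
```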

Phase 3 — Relatedness & Dashboard (pending)

3a: Add keyword co-occurrence counting in _collect_local_related() in related_entities.py.
3b: Add get_keyword_frequencies() to DashboardStore and a "Keywords" panel on the dashboard.

Phase 4 — Prompt Refinement (optional)

Split keyword extraction into "theme keywords" (subject matter) and "signal keywords" (what's new/notable) for differential weighting downstream.

Timestamp Normalization (May 2026)

Problem

Cluster payloads stored timestamps as raw RSS date strings (RFC 2822-style, e.g. "Sat, 30 May 2026 02:00:12 +00:00"). Every read path needed fragile format-guessing, and SQL time-range queries filtered on updated_at, which is the row modification time rather than the event time, so they returned wrong data.

Fix

_normalize_ts() helper in sqlite_store.py: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform YYYY-MM-DDTHH:MM:SS+00:00
sanitize_cluster_payload() now normalizes timestamp, first_seen, last_updated, and all article[].timestamp before writing to DB
merge_cluster_embeddings.py: same normalization on merged payloads
scripts/normalize_cluster_timestamps.py: backfill script for existing rows (run on live server with correct --db path)
get_sentiment_series() and get_entity_frequencies(): filter by payload.timestamp in Python, not updated_at in SQL
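
The normalization described above could look roughly like the sketch below. It is not the _normalize_ts() from sqlite_store.py; the name normalize_ts, the parsing order, and the choice to treat timezone-less values as UTC are assumptions. It uses only the standard library and rewrites a colon-style offset ("+00:00") to "+0000" before handing the string to email.utils, which would otherwise ignore it.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

_NUMERIC = re.compile(r"-?\d+(\.\d+)?")
_COLON_OFFSET = re.compile(r"([+-]\d{2}):(\d{2})$")


def _parse_text(text):
    """Try ISO 8601 first, then RFC 2822 / HTTP-date; return None on failure."""
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        iso = text[:-1] + "+00:00" if text.endswith("Z") else text
        return datetime.fromisoformat(iso)
    except ValueError:
        pass
    try:
        return parsedate_to_datetime(_COLON_OFFSET.sub(r"\1\2", text))
    except (TypeError, ValueError):
        return None


def normalize_ts(value):
    """Return value as YYYY-MM-DDTHH:MM:SS+00:00, or None if unparseable."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)  # epoch seconds
    else:
        text = str(value).strip()
        if _NUMERIC.fullmatch(text):
            dt = datetime.fromtimestamp(float(text), tz=timezone.utc)
        else:
            dt = _parse_text(text)
            if dt is None:
                return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Returning None for unparseable input leaves the decision (drop, keep raw, log) to the caller; the real helper may behave differently.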

Key invariant

updated_at in the DB = row modification time (set to datetime.now() on every upsert). For time-range queries, always use payload.timestamp parsed from the JSON.
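
The invariant can be illustrated with a small filter that ignores updated_at entirely. This is a sketch under assumptions: rows are modeled as (payload_json, updated_at) tuples, the function name clusters_in_window is hypothetical, and payload timestamps are assumed to be already normalized to the uniform ISO format with an explicit offset.

```python
import json
from datetime import datetime, timedelta, timezone


def clusters_in_window(rows, now, hours):
    """Keep payloads whose event time (payload.timestamp) is within the window."""
    cutoff = now - timedelta(hours=hours)
    kept = []
    for payload_json, _updated_at in rows:  # updated_at is deliberately unused
        payload = json.loads(payload_json)
        ts = payload.get("timestamp")
        if not ts:
            continue
        if datetime.fromisoformat(ts) >= cutoff:
            kept.append(payload)
    return kept
```

A row that was re-upserted a minute ago but describes an event from ten days back is excluded, which is the case the updated_at-based SQL filter got wrong.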
