Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
- /mcp
- asyncio.gather + httpx, bounded semaphore
- sha1(min_article_key): topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id, whether its topic came from the heuristic or from LLM enrichment.
- NEWS_CLUSTER_MAX_AGE_HOURS (default 4h). Existing clusters are re-bucketed by the same heuristic topic function (normalize_topic_from_title) that new articles use, so matching works even when the enriched topic has drifted.
- NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>
- REST API (/api/v1/*) for clusters, sentiment series, entity frequencies
- get_latest_events() defaults to all topics (omit topic for unfiltered results)
- /mcp
- NEWS_EMBEDDINGS_ENABLED=true
- NEWS_EMBEDDING_SIMILARITY_THRESHOLD

- get_latest_events(topic, limit, include_articles)
- get_events_for_entity(entity, limit, timeframe, include_articles)
- get_event_summary(event_id, include_articles)
- detect_emerging_topics(limit)
- get_news_sentiment(entity, timeframe)
- get_related_recent_entities(subject, timeframe, limit, include_trends)
- get_capabilities()

Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:
Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).
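
The cluster ID rule noted earlier, sha1(min_article_key), can be sketched as follows. This is a minimal illustration, not the project's code: the `article_key` helper and the choice of URL (falling back to title) as the key are assumptions made for the example.

```python
import hashlib


def article_key(article: dict) -> str:
    """Hypothetical stable key for one article: its URL, else its title."""
    return (article.get("url") or article.get("title") or "").strip().lower()


def cluster_id(articles: list) -> str:
    """sha1 of the smallest article key in the cluster.

    The topic never enters the hash, and min() ignores ordering, so the same
    set of articles yields the same ID on every polling cycle.
    Raises ValueError if no article has a usable key.
    """
    keys = [k for k in (article_key(a) for a in articles) if k]
    return hashlib.sha1(min(keys).encode("utf-8")).hexdigest()
```

Because only the minimum key is hashed, re-classifying a cluster from a heuristic topic to an LLM-enriched one leaves its ID unchanged.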
- NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
- NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS
- get_event_summary
- REST API (/api/v1/*) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
- /dashboard: HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
- Moved @app.on_event("startup") pruning to a lifespan-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency

news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell with 5 views
│ ├── style.css ← Dark theme, responsive
│ └── dashboard.js ← Client-side rendering + Chart.js
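
The "read-only query layer (no side effects)" role of dashboard_store.py could look roughly like the sketch below. The class name, the `clusters` table, and its columns (`cluster_id`, `payload`, `updated_at`) are assumptions for illustration; the real schema may differ.

```python
import json
import sqlite3


class ReadOnlyClusterQueries:
    """Hypothetical read-only layer: SELECT statements only, no writes."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def count_clusters(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]

    def list_clusters(self, limit: int = 20, offset: int = 0) -> list:
        """One page of clusters, most recently modified rows first."""
        rows = self._conn.execute(
            "SELECT cluster_id, payload FROM clusters "
            "ORDER BY updated_at DESC LIMIT ? OFFSET ?",
            (limit, offset),
        ).fetchall()
        # The payload column is assumed to hold the cluster as a JSON object.
        return [{"cluster_id": cid, **json.loads(payload)} for cid, payload in rows]
```

Keeping this layer free of writes means the dashboard can share a connection with the rest of the process without risking accidental mutation.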
- SQLiteClusterStore with thin read-only methods: no enrichment, no writes
- _shared_store avoids repeated DB connections
- StaticFiles mount: no Jinja2/templating dependency
- fetch() + Chart.js avoids HTMX raw-JSON-in-DOM issues
- NEWS_DEFAULT_LOOKBACK_HOURS (144h), not a hardcoded 24h
- REST API (/api/v1/*): JSON-only, for programmatic access and the dashboard
- /dashboard: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
- Replaced @app.on_event("startup") with a lifespan-based fire-and-forget background loop; server responds in <0.3s
- asyncio.Lock prevents overlapping refresh cycles

news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│ ├── __init__.py
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell, 5 views
│ ├── style.css ← Dark theme, responsive grid
│ └── dashboard.js ← Client render, Chart.js, null-safe DOM access
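
The lifespan-based fire-and-forget loop and the asyncio.Lock guard mentioned above might be shaped like this. It is a framework-free sketch: `refresh_loop`, `lifespan`, and their parameters are hypothetical names, and the real server would presumably hand an equivalent context manager to its web framework's lifespan hook.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Awaitable, Callable


async def refresh_loop(
    refresh: Callable[[], Awaitable[None]],
    lock: asyncio.Lock,
    interval_seconds: float,
) -> None:
    """Run refresh() repeatedly; the lock keeps cycles from overlapping."""
    while True:
        async with lock:  # any other trigger sharing this lock must wait
            try:
                await refresh()
            except Exception as exc:  # keep looping after feed/LLM errors
                print(f"refresh failed: {exc!r}")
        await asyncio.sleep(interval_seconds)


@asynccontextmanager
async def lifespan(refresh: Callable[[], Awaitable[None]], interval_seconds: float = 900.0):
    """Start the loop without awaiting it, so startup returns immediately."""
    lock = asyncio.Lock()
    task = asyncio.create_task(refresh_loop(refresh, lock, interval_seconds))
    try:
        yield lock
    finally:
        task.cancel()  # stop the background loop on shutdown
        try:
            await task
        except asyncio.CancelledError:
            pass
```

Because the task is created but never awaited during startup, request handling does not wait for the first refresh cycle to finish.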
- SQLiteClusterStore with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
- _shared_store: one DB connection pool for the entire process.
- StaticFiles: no Jinja2/templating dependency.
- fetch() + Chart.js: avoids HTMX raw-JSON-in-DOM issues.
- NEWS_DEFAULT_LOOKBACK_HOURS (144h), not hardcoded.
- ORDER BY updated_at DESC + client-side sort as safety net.

Keywords are extracted by the LLM (extract_entities.prompt: "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view, but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.
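
To make the gap concrete, here is a small sketch contrasting an entity-only match with one that also searches keywords. The field names (`entities`, `keywords`) and both functions are assumptions for the example, not the project's actual matching code, and plain substring matching is used only for brevity.

```python
def cluster_haystack(cluster: dict, include_keywords: bool = True) -> str:
    """Lower-cased text that a query is matched against."""
    parts = list(cluster.get("entities", []))
    if include_keywords:
        parts += cluster.get("keywords", [])
    return " | ".join(p.lower() for p in parts)


def matches(cluster: dict, query: str, include_keywords: bool = True) -> bool:
    """Case-insensitive substring match of the query against the haystack."""
    return query.lower() in cluster_haystack(cluster, include_keywords)


cluster = {"entities": ["BlackRock", "SEC"], "keywords": ["ETF", "approval"]}
print(matches(cluster, "ETF", include_keywords=False))  # entity-only search
print(matches(cluster, "ETF"))                          # keywords included
```

With entities alone, a query for "ETF" misses this cluster even though the stored keywords describe it exactly.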
- _cluster_entity_haystack() in mcp_server_fastmcp.py so get_events_for_entity() and get_news_sentiment() match clusters by thematic keywords, not just named entities.
- keywords field to cluster output dicts in get_latest_events() and get_events_for_entity() so downstream LLM agents see the full semantic picture.
- detect_emerging_topics() with parallel keyword_counts_recent / keyword_counts_prior accumulators, scored with the same velocity/recency/source-diversity formula as entities.
- _collect_local_related() in related_entities.py.
- get_keyword_frequencies() to DashboardStore and a "Keywords" panel on the dashboard.

Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like "Sat, 30 May 2026 02:00:12 +00:00"). Every read path needed fragile format-guessing, and SQL time-range queries on updated_at (row modification time, not event time) returned wrong data.
- _normalize_ts() helper in sqlite_store.py: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform YYYY-MM-DDTHH:MM:SS+00:00
- sanitize_cluster_payload() now normalizes timestamp, first_seen, last_updated, and all article[].timestamp before writing to DB
- merge_cluster_embeddings.py: same normalization on merged payloads
- scripts/normalize_cluster_timestamps.py: backfill script for existing rows (run on live server with correct --db path)
- get_sentiment_series() and get_entity_frequencies(): filter by payload.timestamp in Python, not updated_at in SQL

updated_at in the DB = row modification time (set to datetime.now() on every upsert). For time-range queries, always use payload.timestamp parsed from the JSON.
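
A normalizer along the lines described above could be sketched as follows. This is an illustration under stated assumptions rather than the actual _normalize_ts(): the function names are hypothetical, naive values are assumed to be UTC, and offsets written with a colon (as in the RSS example above) are rewritten before RFC 2822 parsing because the standard-library parser expects the "+0000" form.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_ts(value) -> Optional[str]:
    """Return YYYY-MM-DDTHH:MM:SS+00:00, or None if the value is unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    dt = None
    try:  # 1) epoch seconds, given as a number or a numeric string
        dt = datetime.fromtimestamp(float(text), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        pass
    if dt is None:
        try:  # 2) ISO 8601; a trailing "Z" is rewritten as an explicit offset
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            pass
    if dt is None:
        try:  # 3) RFC 2822 / HTTP-date, rewriting "+00:00" as "+0000" first
            dt = parsedate_to_datetime(re.sub(r"([+-]\d{2}):(\d{2})$", r"\1\2", text))
        except (TypeError, ValueError):
            return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()


def in_window(payload: dict, start: datetime, end: datetime) -> bool:
    """Time-range filter on the payload's event time, not the row's updated_at."""
    ts = normalize_ts(payload.get("timestamp"))
    return ts is not None and start <= datetime.fromisoformat(ts) <= end
```

Filtering with `in_window` in Python keeps time-range results tied to event time even after a row has been upserted and its updated_at refreshed.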