Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
- /mcp
- (asyncio.gather + httpx, bounded semaphore)
- sha1(min_article_key): topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id, whether its topic came from the heuristic or from LLM enrichment.
- (NEWS_CLUSTER_MAX_AGE_HOURS, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (normalize_topic_from_title) that new articles use, so matching still works when the enriched topic has drifted.
- (NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>)
- (/api/v1/*) for clusters, sentiment series, entity frequencies
- get_latest_events() defaults to all topics (omit topic for unfiltered results)
- /mcp
- (NEWS_EMBEDDINGS_ENABLED=true)
- (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
- get_latest_events(topic, limit, include_articles)
- get_events_for_entity(entity, limit, timeframe, include_articles)
- get_event_summary(event_id, include_articles)
- detect_emerging_topics(limit)
- get_news_sentiment(entity, timeframe)
- get_related_recent_entities(subject, timeframe, limit, include_trends)
- get_capabilities()

Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:
Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).
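
The cluster ID rule listed earlier, sha1(min_article_key), can be sketched as follows. This is a minimal illustration rather than the project's code: the function name stable_cluster_id and the assumption that article keys are plain strings are hypothetical.

```python
import hashlib
from typing import Iterable


def stable_cluster_id(article_keys: Iterable[str]) -> str:
    """Derive a cluster ID from the smallest article key (hypothetical helper).

    The topic is deliberately not an input, so reclassifying a cluster
    (heuristic topic vs. LLM-enriched topic) does not change its ID.
    """
    keys = list(article_keys)
    if not keys:
        raise ValueError("a cluster needs at least one article key")
    # min() makes the result independent of the order articles arrived in.
    return hashlib.sha1(min(keys).encode("utf-8")).hexdigest()
```

In this sketch the ID stays fixed for as long as the smallest key remains in the cluster; a later member with a smaller key would produce a different ID.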
- NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
- (NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS)
- get_event_summary
- (/api/v1/*) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
- /dashboard: HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
- @app.on_event("startup") pruning to lifespan-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency

news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP tools + REST API + dashboard mount
├── news_mcp/dashboard/
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell with 5 views
│ ├── style.css ← Dark theme, responsive
│ └── dashboard.js ← Client-side rendering + Chart.js
- SQLiteClusterStore with thin read-only methods: no enrichment, no writes
- (_shared_store) avoids repeated DB connections
- StaticFiles mount: no Jinja2/templating dependency
- fetch() + Chart.js avoids HTMX raw-JSON-in-DOM issues
- NEWS_DEFAULT_LOOKBACK_HOURS (144h), not a hardcoded 24h
- (/api/v1/*): JSON-only, for programmatic access and the dashboard
- /dashboard: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
- @app.on_event("startup") with lifespan-based fire-and-forget background loop; server responds in <0.3s
- asyncio.Lock prevents overlapping refresh cycles

news-mcp/
├── news_mcp/mcp_server_fastmcp.py ← MCP + REST API + /dashboard static mount
├── news_mcp/dashboard/
│ ├── __init__.py
│ ├── dashboard_store.py ← Read-only query layer (no side effects)
│ ├── index.html ← SPA shell, 5 views
│ ├── style.css ← Dark theme, responsive grid
│ └── dashboard.js ← Client render, Chart.js, null-safe DOM access
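
The background refresh behaviour noted earlier (a fire-and-forget loop guarded by asyncio.Lock so refresh cycles never overlap) might look roughly like the sketch below. The class name RefreshLoop and its method names are hypothetical, and the sketch uses plain asyncio without the web framework.

```python
import asyncio
from typing import Awaitable, Callable


class RefreshLoop:
    """Periodic refresh that skips a cycle while another one is running."""

    def __init__(self, refresh: Callable[[], Awaitable[None]], interval_seconds: float) -> None:
        # Construct this inside a running event loop (e.g. during startup).
        self._refresh = refresh
        self._interval = interval_seconds
        self._lock = asyncio.Lock()

    async def run_once(self) -> bool:
        """Run one refresh; return False if a refresh is already in progress."""
        if self._lock.locked():
            return False
        async with self._lock:
            await self._refresh()
        return True

    async def run_forever(self) -> None:
        """Loop until cancelled; a failed cycle is logged and does not stop the loop."""
        while True:
            try:
                await self.run_once()
            except Exception as exc:  # CancelledError is not caught here
                print(f"refresh failed: {exc!r}")
            await asyncio.sleep(self._interval)
```

Under these assumptions, startup would call asyncio.create_task(loop.run_forever()) and return immediately, so request handling never waits on feeds or LLM calls; shutdown would cancel that task.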
- SQLiteClusterStore with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
- (_shared_store): one DB connection pool for the entire process.
- StaticFiles: no Jinja2/templating dependency.
- fetch() + Chart.js: avoids HTMX raw-JSON-in-DOM issues.
- NEWS_DEFAULT_LOOKBACK_HOURS (144h), not hardcoded.
- (ORDER BY updated_at DESC + client-side sort as safety net).

Keywords are extracted by the LLM (extract_entities.prompt: "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view, but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.
- _cluster_entity_haystack() in mcp_server_fastmcp.py so get_events_for_entity() and get_news_sentiment() match clusters by thematic keywords, not just named entities.
- keywords field to cluster output dicts in get_latest_events() and get_events_for_entity() so downstream LLM agents see the full semantic picture.
- detect_emerging_topics() with parallel keyword_counts_recent / keyword_counts_prior accumulators, scored with the same velocity/recency/source-diversity formula as entities.
- _collect_local_related() in related_entities.py.
- get_keyword_frequencies() to DashboardStore and a "Keywords" panel on the dashboard.

Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like "Sat, 30 May 2026 02:00:12 +00:00"). Every read path needed fragile format-guessing, and SQL time-range queries on updated_at (row modification time, not event time) returned wrong data.
- _normalize_ts() helper in sqlite_store.py: parses ISO 8601, RFC 2822/HTTP-date, and epoch seconds into a uniform YYYY-MM-DDTHH:MM:SS+00:00
- sanitize_cluster_payload() now normalizes timestamp, first_seen, last_updated, and all article[].timestamp before writing to DB
- merge_cluster_embeddings.py: same normalization on merged payloads
- scripts/normalize_cluster_timestamps.py: backfill script for existing rows (run on live server with correct --db path)
- get_sentiment_series() and get_entity_frequencies(): filter by payload.timestamp in Python, not updated_at in SQL
- updated_at in the DB = row modification time (set to datetime.now() on every upsert). For time-range queries, always use payload.timestamp parsed from the JSON.
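
A write-path normalizer along the lines described above could be sketched like this. It is an illustration under stated assumptions, not the contents of sqlite_store.py: the name normalize_ts, the order in which formats are tried, and the choice to treat naive datetimes as UTC are all hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Union


def normalize_ts(value: Union[str, int, float, None]) -> Optional[str]:
    """Return value as YYYY-MM-DDTHH:MM:SS+00:00, or None if it cannot be parsed."""
    if value is None:
        return None
    text = str(value).strip()
    dt = None
    # 1) Epoch seconds, given as a number or a numeric string.
    try:
        dt = datetime.fromtimestamp(float(text), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        pass
    # 2) ISO 8601 (a trailing "Z" is rewritten for older Python versions).
    if dt is None:
        try:
            dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            pass
    # 3) RFC 2822 dates as found in RSS feeds.
    if dt is None:
        try:
            dt = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            return None
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Because every stored value ends up in one fixed-width UTC format, plain string comparison orders timestamps chronologically, which is what later makes SQL range filters on the stored value workable.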
After normalization, all read paths still contained defensive RFC 2822 / parsedate_to_datetime fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.
- _read_ts(ts) -> float | None to sqlite_store.py (module-level, exported). Uses only datetime.fromisoformat(). No RFC 2822 fallback. If it fails, the normalization pipeline has a bug; fix that instead.
- sqlite_store.py, dashboard_store.py, and mcp_server_fastmcp.py now uses _read_ts or plain fromisoformat.
- parsedate_to_datetime removed from dashboard_store.py and mcp_server_fastmcp.py imports entirely.
- parsedate_to_datetime is only retained in sqlite_store._normalize_ts() (the write path) and dedup/cluster.py (raw ingest before normalization).
- payload.timestamp, payload.first_seen, payload.last_updated are always YYYY-MM-DDTHH:MM:SS+00:00 for any row written after the normalization migration.
- _read_ts() from sqlite_store or datetime.fromisoformat() directly. Never add parsedate_to_datetime to a read path.
- sanitize_cluster_payload() in sqlite_store.py is the single normalization point. All writes go through upsert_clusters() which calls it.

All read paths deserialize every JSON payload to filter by entity/keyword/time. With 6000+ clusters, get_clusters_page returns only the 100 newest, so clicking an entity that appears 34x shows only 2 clusters because the other 32 are outside the LIMIT. get_entity_frequencies counts correctly but the detail view can't find them. Every query does a full table scan with JSON parsing.
Schema (migrated in _init_db, incremental-safe):
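-- Indexed event timestamp (SQLite generated column, maintained by SQLite with no application write-path code)
-- VIRTUAL rather than STORED: SQLite's ALTER TABLE ADD COLUMN cannot add a STORED generated column
ALTER TABLE clusters ADD COLUMN payload_ts
GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL;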
CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts);
-- Entity junction table for SQL-level entity search
CREATE TABLE IF NOT EXISTS cluster_entities (
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
entity TEXT NOT NULL,
PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity ON cluster_entities(entity);
-- Keyword junction table for SQL-level keyword search
CREATE TABLE IF NOT EXISTS cluster_keywords (
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
keyword TEXT NOT NULL,
PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX IF NOT EXISTS idx_cluster_keywords_keyword ON cluster_keywords(keyword);
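
A re-runnable version of this migration can be sketched with Python's sqlite3 module. The function name migrate and the minimal clusters table definition are assumptions for illustration. Two SQLite details shape the sketch: ALTER TABLE has no ADD COLUMN IF NOT EXISTS, so the column is checked through PRAGMA table_xinfo (which, unlike table_info, lists generated columns), and a generated column added through ALTER TABLE has to be VIRTUAL.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Apply the schema changes; safe to call on every startup."""
    # Assumed minimal base table; the real one may have more columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clusters ("
        "cluster_id TEXT PRIMARY KEY, payload TEXT NOT NULL, updated_at TEXT)"
    )
    # Column 1 of each table_xinfo row is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_xinfo(clusters)")}
    if "payload_ts" not in columns:
        conn.execute(
            "ALTER TABLE clusters ADD COLUMN payload_ts TEXT "
            "GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL"
        )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts)"
    )
    # Both junction tables share one shape, so build them in a loop.
    for table, column in (("cluster_entities", "entity"), ("cluster_keywords", "keyword")):
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, "
            f"{column} TEXT NOT NULL, "
            f"PRIMARY KEY (cluster_id, {column}))"
        )
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_{column} ON {table}({column})"
        )
    conn.commit()
```

Generated columns need SQLite 3.31 or newer, and ON DELETE CASCADE only takes effect on connections that have run PRAGMA foreign_keys = ON.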
Write path (upsert_clusters): Within the existing transaction, after sanitizing the payload and before INSERT/UPDATE:
- DELETE FROM cluster_entities WHERE cluster_id = ? (handles re-enrichment)
- DELETE FROM cluster_keywords WHERE cluster_id = ?
- INSERT OR IGNORE INTO cluster_entities VALUES (?, ?) for each entity
- INSERT OR IGNORE INTO cluster_keywords VALUES (?, ?) for each keyword
- payload_ts is auto-maintained by SQLite's generated column, so no code is needed

Read paths (all SQL-level, no JSON parsing at query time):
- get_clusters_page: WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ? OFFSET ?
- get_entity_frequencies: JOIN cluster_entities ... WHERE payload_ts >= ? GROUP BY entity ORDER BY cnt DESC
- get_keyword_frequencies: JOIN cluster_keywords ... WHERE payload_ts >= ? GROUP BY keyword ORDER BY cnt DESC
- get_clusters_by_entity: JOIN cluster_entities WHERE payload_ts >= ? AND entity = ?
- get_clusters_by_keyword: JOIN cluster_keywords WHERE payload_ts >= ? AND keyword = ?

Backfill script (scripts/backfill_junction_tables.py):
- normalize_cluster_timestamps.py
- --db arg, defaults to config DB_PATH
- cluster_entities and cluster_keywords
- payload_ts is auto-populated by SQLite's generated column
- (INSERT OR IGNORE + transaction)
- docker exec -it <container> python3 scripts/backfill_junction_tables.py

REST API changes:
- GET /api/v1/clusters: now uses SQL payload_ts filter, consistent total
- GET /api/v1/entities: SQL COUNT(*) ... GROUP BY via junction table
- GET /api/v1/keywords: SQL COUNT(*) ... GROUP BY via junction table
- GET /api/v1/clusters/by-entity?entity=X&hours=Y&limit=Z: SQL entity search
- GET /api/v1/clusters/by-keyword?keyword=X&hours=Y&limit=Z: SQL keyword search

Dashboard JS changes:
- showEntityDetail(label): calls /api/v1/clusters/by-entity instead of fetching all clusters
- showKeywordDetail(label): calls /api/v1/clusters/by-keyword instead of fetching all clusters

Files changed:
| File | Change |
|---|---|
| news_mcp/storage/sqlite_store.py | Schema migration (generated column + junction tables), write-path junction population, new SQL-level read methods |
| news_mcp/mcp_server_fastmcp.py | New REST endpoints for entity/keyword cluster search |
| news_mcp/dashboard/dashboard_store.py | get_entity_frequencies, get_keyword_frequencies use SQL junction table counts |
| news_mcp/dashboard/dashboard.js | showEntityDetail, showKeywordDetail call new endpoints |
| scripts/backfill_junction_tables.py | New backfill script (same pattern as normalize_cluster_timestamps.py) |
Migration safety:
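- Tables and indexes are created with IF NOT EXISTS, so those statements are safe to re-run. SQLite has no ADD COLUMN IF NOT EXISTS, so adding payload_ts needs an explicit existence check (for example via PRAGMA table_xinfo) to stay re-runnable.
- Junction-table writes use INSERT OR IGNORE inside transactions, so repeating them does not create duplicate rows.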
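
To make the write and read paths above concrete, here is a small end-to-end sketch against an in-memory database. The helper names (open_store, upsert_cluster, clusters_by_entity) and the entities payload field are assumptions for illustration; keywords would follow the same pattern with cluster_keywords. Unlike the ordering described for the write path, this sketch writes the cluster row before its junction rows so that it also works with foreign-key enforcement switched on.

```python
import json
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    payload_ts TEXT GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
"""


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def upsert_cluster(conn: sqlite3.Connection, cluster_id: str, payload: dict) -> None:
    """Write the payload and rebuild its entity rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO clusters (cluster_id, payload) VALUES (?, ?) "
            "ON CONFLICT(cluster_id) DO UPDATE SET payload = excluded.payload",
            (cluster_id, json.dumps(payload)),
        )
        # Drop stale rows first so re-enrichment can remove entities too.
        conn.execute("DELETE FROM cluster_entities WHERE cluster_id = ?", (cluster_id,))
        conn.executemany(
            "INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)",
            [(cluster_id, entity) for entity in payload.get("entities", [])],
        )


def clusters_by_entity(conn: sqlite3.Connection, entity: str, since_iso: str, limit: int = 100) -> List[str]:
    """Return matching cluster IDs, newest first, without parsing JSON in Python."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM clusters AS c "
        "JOIN cluster_entities AS e ON e.cluster_id = c.cluster_id "
        "WHERE c.payload_ts >= ? AND e.entity = ? "
        "ORDER BY c.payload_ts DESC LIMIT ?",
        (since_iso, entity, limit),
    )
    return [row[0] for row in rows]
```

The payload_ts >= ? comparison is a plain string comparison, so it relies on every stored timestamp being in the normalized YYYY-MM-DDTHH:MM:SS+00:00 form.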