# Project: news-mcp ## Goal Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use. ## Current architecture (v0.4.0) - FastMCP SSE server mounted at `/mcp` - SQLite cache for clusters + entity metadata + feed state + LLM summary caches - **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed. - **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search. - **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern. - All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time. - **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. - **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys - Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore - Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables ## MCP tools - `get_latest_events(topic, limit, include_articles)` - `get_events_for_entity(entity, limit, timeframe, include_articles)` - `get_event_summary(event_id, include_articles)` - `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase) - `get_news_sentiment(entity, timeframe)` - `get_related_recent_entities(subject, timeframe, limit, include_trends)` - `get_feeds()` / `toggle_feed(feed_url, enabled)` - `get_capabilities()` ## REST API - `GET /` — server info, tools list - `GET /health` — uptime, version hash - `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index - `GET /api/v1/entities` — top entities via junction table GROUP BY - `GET /api/v1/keywords` — top keywords via junction table GROUP BY - `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW) - `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW) - `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index - `GET /api/v1/cluster/{cluster_id}` — full detail - `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management ## Refresh & caching - Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s) - Feed-hash skipping to avoid redundant RSS+LLM work - Summary caching for `get_event_summary` - Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS` ## Schema (clusters table) ```sql CREATE TABLE clusters ( cluster_id TEXT PRIMARY KEY, topic TEXT NOT NULL, payload TEXT NOT NULL, updated_at TEXT NOT NULL, -- row modification time (set on every upsert) summary_payload TEXT, summary_updated_at TEXT, payload_ts GENERATED ALWAYS AS -- indexed event time (auto-maintained) (json_extract(payload, '$.timestamp')) VIRTUAL ); CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts); CREATE TABLE cluster_entities ( cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, entity TEXT NOT NULL, -- lowercased PRIMARY KEY (cluster_id, entity) ); CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity); CREATE TABLE cluster_keywords ( cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, keyword TEXT NOT NULL, -- lowercased PRIMARY KEY (cluster_id, keyword) ); CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword); ``` ## Keyword Utilization (done, May 2026) Keywords extracted by the LLM are now first-class search signals: - `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes - Cluster output includes `keywords[]` field - `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`) - `_collect_local_related()` counts keyword co-occurrence - Dashboard Keywords panel with SQL frequency counts via junction table - Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time ## Two-Store Collapse (done, May 2026) `DashboardStore` has been eliminated. All of its methods were moved into `SQLiteClusterStore` (the single data access layer), and the REST API routes now use the shared `SQLiteClusterStore` instance directly. All MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` which searches via junction-table SQL joins — no row-limit blind spot. The `cluster_entities` and `cluster_keywords` junction tables are indexed for O(log n) lookup across any time window. ## Timestamp Pipeline (May 2026) 1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`. 2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed. 3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering. 4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite. ## Design Flaw: Two Stores (FIXED, May 2026) **What happened:** `DashboardStore` was a thin read-only query layer that wrapped `SQLiteClusterStore`. The MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) did Python-side entity matching by fetching top-N clusters via `payload_ts` then filtering in Python. Entities in clusters beyond the limit were silently missed. **Fix applied:** - `DashboardStore` was deleted. All its methods are now in `SQLiteClusterStore`. - All MCP tools use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` — SQL-level junction-table search with no row-limit blind spot. - The combined method uses `LEFT JOIN` on `cluster_entities` and `cluster_keywords` with `WHERE (ce.entity IN (...) OR ck.keyword IN (...))`, which matches both named entities and thematic keywords across any time window. - Exact matching (via `IN`) replaced substring matching — more correct, no false positives from partial name matches.