# Project: news-mcp ## Goal Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use. ## Current architecture (v0.4.0) - FastMCP SSE server mounted at `/mcp` - SQLite cache for clusters + entity metadata + feed state + LLM summary caches - **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed. - **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search. - **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern. - All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time. - **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. - **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys - Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore - Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables ## MCP tools - `get_latest_events(topic, limit, include_articles)` - `get_events_for_entity(entity, limit, timeframe, include_articles)` - `get_event_summary(event_id, include_articles)` - `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase) - `get_news_sentiment(entity, timeframe)` - `get_related_recent_entities(subject, timeframe, limit, include_trends)` - `get_feeds()` / `toggle_feed(feed_url, enabled)` - `get_capabilities()` ## REST API - `GET /` — server info, tools list - `GET /health` — uptime, version hash - `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index - `GET /api/v1/entities` — top entities via junction table GROUP BY - `GET /api/v1/keywords` — top keywords via junction table GROUP BY - `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW) - `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW) - `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index - `GET /api/v1/cluster/{cluster_id}` — full detail - `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management ## Refresh & caching - Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s) - Feed-hash skipping to avoid redundant RSS+LLM work - Summary caching for `get_event_summary` - Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS` ## Schema (clusters table) ```sql CREATE TABLE clusters ( cluster_id TEXT PRIMARY KEY, topic TEXT NOT NULL, payload TEXT NOT NULL, updated_at TEXT NOT NULL, -- row modification time (set on every upsert) summary_payload TEXT, summary_updated_at TEXT, payload_ts GENERATED ALWAYS AS -- indexed event time (auto-maintained) (json_extract(payload, '$.timestamp')) VIRTUAL ); CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts); CREATE TABLE cluster_entities ( cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, entity TEXT NOT NULL, -- lowercased PRIMARY KEY (cluster_id, entity) ); CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity); CREATE TABLE cluster_keywords ( cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE, keyword TEXT NOT NULL, -- lowercased PRIMARY KEY (cluster_id, keyword) ); CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword); ``` ## Keyword Utilization (done, May 2026) Keywords extracted by the LLM are now first-class search signals: - `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes - Cluster output includes `keywords[]` field - `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`) - `_collect_local_related()` counts keyword co-occurrence - Dashboard Keywords panel with SQL frequency counts via junction table - Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time ## Timestamp Pipeline (May 2026) 1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`. 2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed. 3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering. 4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite. ## Design Flaw: Two Stores (KNOWN, fix planned) **Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still does full-table JSON parsing for entity/keyword search. **Current state:** - `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓ - `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python **Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed. **Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.