# Project: news-mcp

## Goal
Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.

## Current architecture (v0.4.0)
- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
- All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables

## MCP tools
- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase)
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_feeds()` / `toggle_feed(feed_url, enabled)`
- `get_capabilities()`

## REST API
- `GET /` — server info, tools list
- `GET /health` — uptime, version hash
- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
- `GET /api/v1/entities` — top entities via junction table GROUP BY
- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
- `GET /api/v1/cluster/{cluster_id}` — full detail
- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management

## Refresh & caching
- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
- Feed-hash skipping to avoid redundant RSS+LLM work
- Summary caching for `get_event_summary`
- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`

## Schema (clusters table)
```sql
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
```

## Keyword Utilization (done, May 2026)
Keywords extracted by the LLM are now first-class search signals:
- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
- Cluster output includes `keywords[]` field
- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
- `_collect_local_related()` counts keyword co-occurrence
- Dashboard Keywords panel with SQL frequency counts via junction table
- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time

## Timestamp Pipeline (May 2026)
1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`.
2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.

## Design Flaw: Two Stores (KNOWN, fix planned)

**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still does full-table JSON parsing for entity/keyword search.

**Current state:**
- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python

**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.

**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.