# Project: news-mcp

## Goal
Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.

## Current architecture (v0.4.0)
- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
- All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables

## MCP tools
- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase)
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_feeds()` / `toggle_feed(feed_url, enabled)`
- `get_capabilities()`

## REST API
- `GET /` — server info, tools list
- `GET /health` — uptime, version hash
- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
- `GET /api/v1/entities` — top entities via junction table GROUP BY
- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
- `GET /api/v1/cluster/{cluster_id}` — full detail
- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management

## Refresh & caching
- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
- Feed-hash skipping to avoid redundant RSS+LLM work
- Summary caching for `get_event_summary`
- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`

## Schema (clusters table)
```sql
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
```

## Keyword Utilization (done, May 2026)
Keywords extracted by the LLM are now first-class search signals:
- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
- Cluster output includes `keywords[]` field
- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
- `_collect_local_related()` counts keyword co-occurrence
- Dashboard Keywords panel with SQL frequency counts via junction table
- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time

## Two-Store Collapse (done, May 2026)

`DashboardStore` has been eliminated. All of its methods were moved into `SQLiteClusterStore` (the single data access layer), and the REST API routes now use the shared `SQLiteClusterStore` instance directly.

All MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` which searches via junction-table SQL joins — no row-limit blind spot. The `cluster_entities` and `cluster_keywords` junction tables are indexed for O(log n) lookup across any time window.

## Timestamp Pipeline (May 2026)
1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`.
2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.

## Design Flaw: Two Stores (FIXED, May 2026)

**What happened:** `DashboardStore` was a thin read-only query layer that wrapped `SQLiteClusterStore`. The MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) did Python-side entity matching by fetching top-N clusters via `payload_ts` then filtering in Python. Entities in clusters beyond the limit were silently missed.

**Fix applied:** 
- `DashboardStore` was deleted. All its methods are now in `SQLiteClusterStore`.
- All MCP tools use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` — SQL-level junction-table search with no row-limit blind spot.
- The combined method uses `LEFT JOIN` on `cluster_entities` and `cluster_keywords` with `WHERE (ce.entity IN (...) OR ck.keyword IN (...))`, which matches both named entities and thematic keywords across any time window.
- Exact matching (via `IN`) replaced substring matching — more correct, no false positives from partial name matches.