# Project: news-mcp

## Goal
Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.

## Current architecture (v0.5.0)
- FastMCP SSE server mounted at `/mcp`
- SQLite cache for clusters + entity metadata + feed state + seen_articles
- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. The value is computed from `payload` on read; SQLite keeps its index current on every write.
- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. SQL-level entity search.
- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. SQL-level keyword search.
- **seen_articles** table — one row per `article_key`, storing its `cluster_id` and `content_hash`. Per-article dedup with content-change detection.
- All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 6h).
- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys.
- **Three-layer dedup**: feed-level hash (coarse) → seen_articles by URL (fine) → content hash (detects in-place updates).
- **Dual-signal clustering**: two articles merge when title ≥ 0.75, or jaccard ≥ 0.55, or both title ≥ 0.55 and jaccard ≥ 0.25 (dual). Embedding cosine ≥ 0.885 applies when embeddings are enabled.
- **Content-change detection**: `seen_articles.content_hash` (SHA-1 of title+summary) detects in-place article updates (e.g. "More to come..." → real content). Changed articles are re-clustered and their `enriched_at` is cleared for re-enrichment.
- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore + rate limiter.
- Dashboard with Config page for runtime parameter tuning via `site_config` DB table.
- `article_identity.py` — single source of truth for `article_key()` and `article_content_hash()`.
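
The stable-ID and orphan-merge bullets above can be illustrated with a minimal sketch. It assumes a cluster is a set of article-key strings and that "min" means the lexicographically smallest key; `stable_cluster_id` and `merge_orphans` are hypothetical names, not the project's actual functions.

```python
import hashlib


def stable_cluster_id(article_keys):
    """SHA-1 of the smallest article key, so member order does not matter."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_orphans(clusters):
    """Union-Find pass: merge clusters (sets of article keys) sharing any key."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # article key -> index of the first cluster that held it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in owner:
                parent[find(idx)] = find(owner[key])
            else:
                owner[key] = idx

    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())


# Clusters 0, 2 and 3 are linked through shared keys "b" and "d".
clusters = [{"a", "b"}, {"c"}, {"b", "d"}, {"d", "e"}]
merged = merge_orphans(clusters)
cluster_ids = {stable_cluster_id(keys): sorted(keys) for keys in merged}
```

Because the ID depends only on the smallest member key, a cluster keeps its ID across polling cycles as long as that article stays in it.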

## MCP tools
- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit, timeframe, topic, around)` — returns `signal_type` (entity/keyword/phrase)
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_feeds()` / `toggle_feed(feed_url, enabled)`
- `debug_dedup(url, title?)` — inspect dedup status, similarity signals, match decisions
- `get_capabilities()`
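
For orientation, this is roughly what a call to one of these tools looks like at the protocol level: MCP tool invocations are JSON-RPC 2.0 requests with method `tools/call`. The argument values below (including the `"24h"` timeframe format) are illustrative assumptions, and a real client library would also handle session initialization and the SSE transport.

```python
import json


def tool_call_request(name, arguments, request_id=1):
    """Build the JSON-RPC 2.0 body for an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call_request(
    "get_events_for_entity",
    # Argument values are assumptions; only the parameter names come from the tool list.
    {"entity": "OpenAI", "limit": 5, "timeframe": "24h", "include_articles": False},
)
print(json.dumps(request, indent=2))
```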

## REST API
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: stats, freshness, feed state, pruning, seen_article_count |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/keywords` | Top keywords by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/clusters/by-entity` | SQL entity search via junction table |
| GET | `/api/v1/clusters/by-keyword` | SQL keyword search via junction table |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable a feed |
| GET | `/api/v1/config` | All site config parameters |
| POST | `/api/v1/config/update` | Update a config parameter at runtime |
| POST | `/api/v1/config/reset` | Reset all config to .env/defaults |
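
A small sketch of building a request URL for the paginated clusters endpoint. The base URL and the parameter values are assumptions; only the path and parameter names come from the table above.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def api_url(path, **params):
    """Join the base URL and path, appending any non-None query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = urljoin(BASE_URL, path)
    return f"{url}?{query}" if query else url


clusters_url = api_url("/api/v1/clusters", topic="technology", hours=24, limit=20, offset=0)
health_url = api_url("/api/v1/health")

# To fetch, something like:
#   import json, urllib.request
#   with urllib.request.urlopen(clusters_url) as resp:
#       data = json.load(resp)
```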

## Refresh & caching
- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
- **Three-layer dedup**:
  1. Feed-level content hash — skip entire unchanged feeds (coarse, O(1))
  2. `seen_articles` by `article_key` (URL) — skip already-processed articles (fine)
  3. `content_hash` comparison — detect in-place content updates, re-cluster + re-enrich changed articles
- Enrichment caching via `enriched_at` timestamp in cluster payload
- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`
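
Layer 1 of the dedup list above can be sketched as a hash gate over raw feed bodies. This is a sketch under assumptions: the hash algorithm (SHA-1 here), hashing the raw bytes, and the names `feed_hash` and `feeds_to_process` are all hypothetical.

```python
import hashlib


def feed_hash(raw_feed_bytes):
    """Digest of the whole feed body; one comparison decides skip or process."""
    return hashlib.sha1(raw_feed_bytes).hexdigest()


def feeds_to_process(fetched, last_hashes):
    """Return (changed_feed_urls, updated_hashes) for one polling cycle.

    fetched:     feed URL -> raw response bytes from this cycle
    last_hashes: feed URL -> digest stored after the previous cycle
    """
    changed = []
    updated = dict(last_hashes)
    for url, body in fetched.items():
        digest = feed_hash(body)
        if last_hashes.get(url) != digest:
            changed.append(url)
            updated[url] = digest
    return changed, updated


cycle_1 = {"https://example.com/a.xml": b"<rss>1</rss>", "https://example.com/b.xml": b"<rss>1</rss>"}
changed_1, hashes = feeds_to_process(cycle_1, {})

# Only feed "a" changes before the next poll.
cycle_2 = dict(cycle_1, **{"https://example.com/a.xml": b"<rss>2</rss>"})
changed_2, hashes = feeds_to_process(cycle_2, hashes)
```

Only feeds in the changed list go on to the per-article checks in layers 2 and 3.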

## Schema (clusters table)
```sql
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);

CREATE TABLE seen_articles (
    article_key  TEXT PRIMARY KEY,
    cluster_id   TEXT NOT NULL,
    first_seen   TEXT NOT NULL,
    url          TEXT NOT NULL DEFAULT '',
    content_hash TEXT NOT NULL DEFAULT ''
);

CREATE TABLE site_config (
    key         TEXT PRIMARY KEY,
    value       TEXT NOT NULL,
    type        TEXT NOT NULL DEFAULT 'str',
    category    TEXT NOT NULL DEFAULT 'general',
    description TEXT NOT NULL DEFAULT '',
    source      TEXT NOT NULL DEFAULT 'default'
);
```
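
The following sketch exercises a reduced copy of this schema from Python's `sqlite3` module. It assumes `$.timestamp` holds ISO 8601 strings (so string comparison orders them chronologically) and a SQLite build with generated-column and JSON support (3.31 or later); the sample rows and the helper names are made up for illustration. Note that SQLite only honours `ON DELETE CASCADE` when foreign keys are enabled on the connection.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
""")


def add_cluster(cluster_id, topic, timestamp, entities):
    """Insert a cluster and its entity rows; payload_ts is derived by SQLite."""
    payload = json.dumps({"timestamp": timestamp})
    conn.execute(
        "INSERT INTO clusters (cluster_id, topic, payload, updated_at) VALUES (?, ?, ?, ?)",
        (cluster_id, topic, payload, timestamp),
    )
    conn.executemany(
        "INSERT INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
        [(cluster_id, entity) for entity in entities],
    )


def clusters_for_entity(entity, since_ts):
    """Entity search joined with a time-range filter on the generated column."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM cluster_entities AS ce
        JOIN clusters AS c ON c.cluster_id = ce.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        """,
        (entity, since_ts),
    ).fetchall()
    return [row[0] for row in rows]


add_cluster("c1", "tech", "2026-06-01T10:00:00Z", ["Acme", "Globex"])
add_cluster("c2", "tech", "2026-06-02T10:00:00Z", ["Acme"])
add_cluster("c3", "world", "2026-06-03T10:00:00Z", ["Initech"])
recent_acme = clusters_for_entity("Acme", "2026-06-02T00:00:00Z")
```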

## Clustering thresholds (v0.5.0)
All configurable via `site_config` DB table (dashboard Config page or REST API):

| Parameter | Default | Signal |
|-----------|---------|--------|
| `title_threshold` | 0.75 | Min title similarity (SequenceMatcher) |
| `jaccard_threshold` | 0.55 | Min Jaccard token overlap |
| `dual_title_floor` | 0.55 | Dual-signal: min title |
| `dual_jaccard_floor` | 0.25 | Dual-signal: min jaccard |
| `embedding_similarity_threshold` | 0.885 | Cosine threshold (embeddings enabled) |
| `cluster_max_age_hours` | 6 | Cross-cycle merge window |
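
The first four thresholds combine into a single merge decision, sketched below with the defaults from the table. How the project tokenizes text for Jaccard and normalizes titles is not stated here, so lowercasing and whitespace tokenization are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.75
JACCARD_THRESHOLD = 0.55
DUAL_TITLE_FLOOR = 0.55
DUAL_JACCARD_FLOOR = 0.25


def title_similarity(a, b):
    """SequenceMatcher ratio over lowercased titles (normalization assumed)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Jaccard overlap of whitespace tokens (tokenization assumed)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def should_merge(title_sim, jaccard_sim):
    """Either strong signal alone, or both weaker signals together."""
    if title_sim >= TITLE_THRESHOLD:
        return True
    if jaccard_sim >= JACCARD_THRESHOLD:
        return True
    return title_sim >= DUAL_TITLE_FLOOR and jaccard_sim >= DUAL_JACCARD_FLOOR


def same_story(title_a, title_b):
    return should_merge(title_similarity(title_a, title_b), jaccard(title_a, title_b))
```

The dual rule lets two moderately similar signals merge where neither alone clears its threshold: `should_merge(0.60, 0.30)` is true, while `should_merge(0.60, 0.20)` is false.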

## Content-change detection (v0.5.0)
1. On each poll, `filter_already_seen()` computes `content_hash = SHA1(title|summary)` for each article
2. If `article_key` seen but `content_hash` differs → article is "changed"
3. Changed articles are re-clustered into their existing cluster (same `article_key` → same cluster)
4. `enriched_at` is cleared in the cluster payload → next enrichment cycle re-processes it
5. Empty stored hashes (pre-migration rows) are treated as unchanged — hash is populated on next upsert
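
The steps above can be sketched as a per-article classification. These are stand-ins, not the real `filter_already_seen()` or `article_content_hash()`: the `|` separator is read literally from step 1, the article dict shape is assumed, and "clearing" `enriched_at` is modelled as removing the key.

```python
import hashlib


def content_hash(title, summary):
    """SHA-1 over title and summary joined with '|' (separator assumed)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify(article, seen):
    """Return 'new', 'changed' or 'unchanged'.

    seen maps article_key -> stored content_hash ('' for pre-migration rows).
    """
    key = article["key"]
    if key not in seen:
        return "new"
    stored = seen[key]
    if stored == "":
        return "unchanged"  # step 5: hash gets populated on the next upsert
    current = content_hash(article["title"], article["summary"])
    return "changed" if current != stored else "unchanged"


def mark_for_reenrichment(cluster_payload):
    """Step 4: drop enriched_at so the next enrichment cycle picks it up."""
    payload = dict(cluster_payload)
    payload.pop("enriched_at", None)
    return payload


seen = {
    "https://example.com/story": content_hash("Breaking story", "More to come..."),
    "https://example.com/legacy": "",
}
stub = {"key": "https://example.com/story", "title": "Breaking story", "summary": "More to come..."}
updated = dict(stub, summary="Full report with details.")
legacy = {"key": "https://example.com/legacy", "title": "Old", "summary": "Old"}
fresh = {"key": "https://example.com/new", "title": "New", "summary": "New"}
```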

## Site config (v0.5.0)
- `site_config` DB table seeded from `.env` on first startup
- Dashboard Config page: grouped by category (Clustering, Enrichment, Retention)
- Runtime updates via REST API (`POST /api/v1/config/update`)
- Reset to defaults via `POST /api/v1/config/reset`
- Source tracking: `env` (from .env), `api` (runtime update), `default` (built-in)
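
The seed, update, and reset lifecycle with source tracking might look like the sketch below. It is a minimal model under assumptions: the `NEWS_` plus upper-cased key mapping to environment variables, the two-entry defaults dict, and all function names are hypothetical, and the environment is passed in as a plain dict rather than read from `.env`.

```python
import sqlite3

# Built-in defaults (two entries from the thresholds table, for illustration).
DEFAULTS = {"title_threshold": "0.75", "cluster_max_age_hours": "6"}


def seed(conn, env):
    """Write every known key from env if present, else from built-in defaults."""
    for key, default in DEFAULTS.items():
        env_name = "NEWS_" + key.upper()  # naming convention assumed
        if env_name in env:
            value, source = env[env_name], "env"
        else:
            value, source = default, "default"
        conn.execute(
            "INSERT OR REPLACE INTO site_config (key, value, source) VALUES (?, ?, ?)",
            (key, value, source),
        )


def open_config(env):
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE site_config (
            key         TEXT PRIMARY KEY,
            value       TEXT NOT NULL,
            type        TEXT NOT NULL DEFAULT 'str',
            category    TEXT NOT NULL DEFAULT 'general',
            description TEXT NOT NULL DEFAULT '',
            source      TEXT NOT NULL DEFAULT 'default'
        )""")
    seed(conn, env)
    return conn


def update(conn, key, value):
    """Runtime update: store the new value and mark its source as 'api'."""
    cur = conn.execute(
        "UPDATE site_config SET value = ?, source = 'api' WHERE key = ?",
        (str(value), key),
    )
    if cur.rowcount == 0:
        raise KeyError(key)


def get(conn, key):
    return conn.execute(
        "SELECT value, source FROM site_config WHERE key = ?", (key,)
    ).fetchone()


env = {"NEWS_CLUSTER_MAX_AGE_HOURS": "12"}
conn = open_config(env)
update(conn, "title_threshold", 0.8)
after_update = get(conn, "title_threshold")
seed(conn, env)  # reset: re-seed from env/defaults
after_reset = get(conn, "title_threshold")
```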

## Dashboard (v0.5.0)
- **Health** — stats, freshness, topic distribution, sentiment overview, feed activity
- **Feeds** — toggle feeds on/off
- **Clusters** — filterable table, click for drill-down modal
- **Sentiment** — time-series chart
- **Entities** — top entities, bar chart, click for matching clusters
- **Keywords** — top keywords, bar chart, click for matching clusters
- **Config** — runtime parameter tuning (new in v0.5.0)

## Backfill scripts
After deploying schema changes, run the backfill inside the container:
```bash
docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
```

## Version history
- **v0.5.0** (2026-06-03): seen_articles table, content-change detection, dual-signal clustering, site_config DB + dashboard Config page, debug_dedup tool, article_identity module
- **v0.4.0** (2026-05): junction tables, stable cluster IDs, cross-cycle merge, orphan merge, dashboard