Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS feeds into deduplicated, enriched news clusters that agents can consume directly.

Current architecture (v0.5.0)

FastMCP SSE server mounted at /mcp
SQLite cache for clusters + entity metadata + feed state + seen_articles
payload_ts — indexed VIRTUAL GENERATED column: json_extract(payload, '$.timestamp'). Auto-maintained by SQLite on write.
cluster_entities junction table — (cluster_id, entity) with index on entity. SQL-level entity search.
cluster_keywords junction table — (cluster_id, keyword) with index on keyword. SQL-level keyword search.
seen_articles lookup table: one row per article_key, storing cluster_id and content_hash (plus first_seen and url). Per-article dedup with content-change detection.
All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.
Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 6h).
Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys.
Three-layer dedup: feed-level hash (coarse) → seen_articles by URL (fine) → content hash (detects in-place updates).
Dual-signal clustering: two articles match when title similarity ≥ 0.75, or Jaccard ≥ 0.55, or both title ≥ 0.55 and Jaccard ≥ 0.25 (the dual signal). Embedding cosine ≥ 0.885 applies when embeddings are enabled.
Content-change detection: seen_articles.content_hash (SHA-1 of title+summary) detects in-place article updates (e.g. "More to come..." → real content). Changed articles are re-clustered and their enriched_at is cleared for re-enrichment.
Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore + rate limiter.
Dashboard with Config page for runtime parameter tuning via site_config DB table.
article_identity.py — single source of truth for article_key() and article_content_hash().
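
The stable-ID and orphan-merge items above can be illustrated together. The following is a minimal sketch, not the project's implementation: the function names are hypothetical, and it assumes that "min_article_key" means the lexicographically smallest article key in the cluster.

```python
import hashlib
from collections import defaultdict


def stable_cluster_id(article_keys):
    """SHA-1 of the smallest article key (assumed: lexicographic minimum)."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_overlapping(clusters):
    """Union-Find pass: clusters that share any article key are merged."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_owner = {}  # article key -> index of the first cluster containing it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in first_owner:
                parent[find(idx)] = find(first_owner[key])
            else:
                first_owner[key] = idx

    merged = defaultdict(set)
    for idx, keys in enumerate(clusters):
        merged[find(idx)].update(keys)
    # The ID depends only on the member keys, not on input order or topic.
    return {stable_cluster_id(keys): keys for keys in merged.values()}
```

Because the ID is derived only from the member keys, feeding the same clusters in a different order yields the same IDs, which is the property the "order-independent" claim relies on.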

MCP tools

get_latest_events(topic, limit, include_articles)
get_events_for_entity(entity, limit, timeframe, include_articles)
get_event_summary(event_id, include_articles)
detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)
get_news_sentiment(entity, timeframe)
get_related_recent_entities(subject, timeframe, limit, include_trends)
get_feeds() / toggle_feed(feed_url, enabled)
debug_dedup(url, title?) — inspect dedup status, similarity signals, match decisions
get_capabilities()

REST API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: stats, freshness, feed state, pruning, seen_article_count |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/keywords` | Top keywords by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/clusters/by-entity` | SQL entity search via junction table |
| GET | `/api/v1/clusters/by-keyword` | SQL keyword search via junction table |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable a feed |
| GET | `/api/v1/config` | All site config parameters |
| POST | `/api/v1/config/update` | Update a config parameter at runtime |
| POST | `/api/v1/config/reset` | Reset all config to .env/defaults |
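
As an illustration of the paginated clusters endpoint, here is a small client sketch using only the standard library. The base URL is an assumption (this document does not state the host or port), the helper names are hypothetical, and the response shape is not documented here, so the sketch simply returns the decoded JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host and port are placeholders, not taken from this document.
BASE_URL = "http://localhost:8000"


def clusters_url(topic=None, hours=24, limit=20, offset=0):
    """Build the query URL for GET /api/v1/clusters."""
    params = {"hours": hours, "limit": limit, "offset": offset}
    if topic:
        params["topic"] = topic
    return f"{BASE_URL}/api/v1/clusters?{urlencode(params)}"


def fetch_clusters(**kwargs):
    """Fetch one page of clusters and return the decoded JSON body."""
    with urlopen(clusters_url(**kwargs), timeout=10) as resp:
        return json.load(resp)
```

Paging would then amount to calling `fetch_clusters(offset=...)` with increasing offsets until a page comes back empty, assuming the endpoint behaves like a conventional limit/offset API.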

Refresh & caching

Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 300s)
Three-layer dedup:
1. Feed-level content hash — skip entire unchanged feeds (coarse, O(1))
2. seen_articles by article_key (URL) — skip already-processed articles (fine)
3. content_hash comparison — detect in-place content updates, re-cluster + re-enrich changed articles
Enrichment caching via enriched_at timestamp in cluster payload
Pruning via NEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURS
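
The first, feed-level layer of the dedup described above can be sketched as follows. This is an illustration under assumptions: the hash algorithm used for feeds is not stated in this document (SHA-1 is assumed here), and the function names and the in-memory `last_hashes` dict stand in for whatever feed state the server actually persists.

```python
import hashlib


def feed_hash(raw_body: bytes) -> str:
    """Digest of the raw feed body (algorithm assumed, not confirmed)."""
    return hashlib.sha1(raw_body).hexdigest()


def feeds_to_process(fetched, last_hashes):
    """Return only feeds whose body changed since the previous poll.

    fetched:     feed URL -> raw response body (bytes)
    last_hashes: feed URL -> digest stored from the previous poll (mutated)
    """
    changed = {}
    for url, body in fetched.items():
        digest = feed_hash(body)
        if last_hashes.get(url) != digest:
            changed[url] = body
            last_hashes[url] = digest
    return changed
```

One dictionary lookup per feed is what makes this layer cheap: an unchanged feed is skipped before any of its articles are parsed.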

Schema (clusters table)

CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);

CREATE TABLE seen_articles (
    article_key  TEXT PRIMARY KEY,
    cluster_id   TEXT NOT NULL,
    first_seen   TEXT NOT NULL,
    url          TEXT NOT NULL DEFAULT '',
    content_hash TEXT NOT NULL DEFAULT ''
);

CREATE TABLE site_config (
    key         TEXT PRIMARY KEY,
    value       TEXT NOT NULL,
    type        TEXT NOT NULL DEFAULT 'str',
    category    TEXT NOT NULL DEFAULT 'general',
    description TEXT NOT NULL DEFAULT '',
    source      TEXT NOT NULL DEFAULT 'default'
);
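
To show how the generated column and a junction table combine in a query, here is a self-contained sketch against an in-memory database. It assumes SQLite 3.31 or newer (for generated columns), that `$.timestamp` holds ISO-8601 UTC strings so string comparison matches time order, and that foreign keys are enabled on the connection. The helper names are hypothetical, and only two of the tables are created.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE
conn.executescript("""
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
""")


def add_cluster(cluster_id, topic, timestamp, entities):
    """Insert a cluster; payload_ts is derived by SQLite from the payload."""
    payload = json.dumps({"timestamp": timestamp, "entities": entities})
    conn.execute(
        "INSERT INTO clusters (cluster_id, topic, payload, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (cluster_id, topic, payload, timestamp),
    )
    conn.executemany(
        "INSERT INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
        [(cluster_id, e) for e in entities],
    )


def clusters_for_entity(entity, since):
    """Entity search plus time filter, newest first, without parsing JSON."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM cluster_entities e "
        "JOIN clusters c ON c.cluster_id = e.cluster_id "
        "WHERE e.entity = ? AND c.payload_ts >= ? "
        "ORDER BY c.payload_ts DESC",
        (entity, since),
    ).fetchall()
    return [r[0] for r in rows]
```

Note that SQLite ignores `ON DELETE CASCADE` unless `PRAGMA foreign_keys = ON` is set per connection, so the junction rows are only cleaned up automatically when that pragma is active.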

Clustering thresholds (v0.5.0)

All configurable via site_config DB table (dashboard Config page or REST API):

| Parameter | Default | Signal |
|-----------|---------|--------|
| `title_threshold` | 0.75 | Min title similarity (SequenceMatcher) |
| `jaccard_threshold` | 0.55 | Min Jaccard token overlap |
| `dual_title_floor` | 0.55 | Dual-signal: min title |
| `dual_jaccard_floor` | 0.25 | Dual-signal: min jaccard |
| `embedding_similarity_threshold` | 0.885 | Cosine threshold (embeddings enabled) |
| `cluster_max_age_hours` | 6 | Cross-cycle merge window |
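
The way the first four thresholds combine can be sketched as a single decision function. This is an illustrative reading of the table, not the project's code: the function names are hypothetical, and the tokenization used for Jaccard (lowercased whitespace tokens) is an assumption. The embedding threshold is left out.

```python
from difflib import SequenceMatcher

# Defaults copied from the table above.
THRESHOLDS = {
    "title_threshold": 0.75,
    "jaccard_threshold": 0.55,
    "dual_title_floor": 0.55,
    "dual_jaccard_floor": 0.25,
}


def title_similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio on lowercased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; whitespace tokenization is an assumption."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def should_merge(title_sim: float, jac: float, t=THRESHOLDS) -> bool:
    """Either strong signal alone suffices; otherwise both weaker floors must hold."""
    if title_sim >= t["title_threshold"]:
        return True
    if jac >= t["jaccard_threshold"]:
        return True
    return title_sim >= t["dual_title_floor"] and jac >= t["dual_jaccard_floor"]
```

The dual path is what catches rewritten headlines: a pair with title similarity 0.60 and Jaccard 0.30 fails both single-signal tests but still merges, while 0.60 and 0.20 does not.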

Content-change detection (v0.5.0)

On each poll, filter_already_seen() computes content_hash = SHA1(title|summary) for each article
If article_key seen but content_hash differs → article is "changed"
Changed articles are re-clustered into their existing cluster (same article_key → same cluster)
enriched_at is cleared in the cluster payload → next enrichment cycle re-processes it
Empty stored hashes (pre-migration rows) are treated as unchanged — hash is populated on next upsert
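
The rules above can be sketched as a three-way classification. This is not the actual filter_already_seen(): the function name `classify_articles`, the article dict layout, and the exact `title|summary` serialization fed to SHA-1 are assumptions made for illustration.

```python
import hashlib


def content_hash(title: str, summary: str) -> str:
    """SHA-1 over title and summary; the '|' separator is an assumption."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify_articles(articles, seen):
    """Split polled articles into (new, changed, unchanged) key lists.

    articles: list of dicts with "key", "title", "summary"
    seen:     article_key -> stored content_hash ('' for pre-migration rows)
    """
    new, changed, unchanged = [], [], []
    for art in articles:
        stored = seen.get(art["key"])
        current = content_hash(art["title"], art["summary"])
        if stored is None:
            new.append(art["key"])          # never seen: cluster normally
        elif stored == "" or stored == current:
            unchanged.append(art["key"])    # empty hash counts as unchanged
        else:
            changed.append(art["key"])      # re-cluster and clear enriched_at
    return new, changed, unchanged
```

Treating an empty stored hash as unchanged avoids a one-time flood of re-enrichment after the migration; the real hash is written on the next upsert, so later edits to those articles are still detected.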

Site config (v0.5.0)

site_config DB table seeded from .env on first startup
Dashboard Config page: grouped by category (Clustering, Enrichment, Retention)
Runtime updates via REST API (POST /api/v1/config/update)
Reset to defaults via POST /api/v1/config/reset
Source tracking: env (from .env), api (runtime update), default (built-in)
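
The seeding, runtime-update, and reset behavior above can be sketched with a trimmed version of the site_config table. Everything here is illustrative: the function names are hypothetical, the table omits the category and description columns, and the `NEWS_` + uppercased key mapping to environment variables is an assumption (this document only confirms it for NEWS_CLUSTER_MAX_AGE_HOURS).

```python
import sqlite3

# Built-in defaults: key -> (value stored as text, type tag).
DEFAULTS = {
    "title_threshold": ("0.75", "float"),
    "cluster_max_age_hours": ("6", "int"),
}
CASTS = {"int": int, "float": float, "str": str}


def seed(conn, env):
    """Insert missing keys from env or defaults; existing rows are kept."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_config ("
        " key TEXT PRIMARY KEY, value TEXT NOT NULL,"
        " type TEXT NOT NULL DEFAULT 'str',"
        " source TEXT NOT NULL DEFAULT 'default')"
    )
    for key, (default, type_tag) in DEFAULTS.items():
        env_name = "NEWS_" + key.upper()  # assumed naming convention
        if env_name in env:
            value, source = env[env_name], "env"
        else:
            value, source = default, "default"
        conn.execute(
            "INSERT OR IGNORE INTO site_config (key, value, type, source) "
            "VALUES (?, ?, ?, ?)",
            (key, value, type_tag, source),
        )


def get_param(conn, key):
    """Return (typed value, source) for one parameter."""
    value, type_tag, source = conn.execute(
        "SELECT value, type, source FROM site_config WHERE key = ?", (key,)
    ).fetchone()
    return CASTS[type_tag](value), source


def update_param(conn, key, value):
    """Runtime update: the row's source becomes 'api'."""
    conn.execute(
        "UPDATE site_config SET value = ?, source = 'api' WHERE key = ?",
        (str(value), key),
    )


def reset_all(conn, env):
    """Drop all rows and re-seed from env/defaults."""
    conn.execute("DELETE FROM site_config")
    seed(conn, env)
```

Using `INSERT OR IGNORE` in the seed step is one way to make seeding happen only on first startup: a restart re-runs `seed` without overwriting values changed through the API.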

Dashboard (v0.5.0)

Health — stats, freshness, topic distribution, sentiment overview, feed activity
Feeds — toggle feeds on/off
Clusters — filterable table, click for drill-down modal
Sentiment — time-series chart
Entities — top entities, bar chart, click for matching clusters
Keywords — top keywords, bar chart, click for matching clusters
Config — runtime parameter tuning (new in v0.5.0)

Backfill scripts

After deploying schema changes:

docker exec -it news-mcp python3 scripts/backfill_seen_articles.py

Version history

v0.5.0 (2026-06-03): seen_articles table, content-change detection, dual-signal clustering, site_config DB + dashboard Config page, debug_dedup tool, article_identity module
v0.4.0 (2026-05): junction tables, stable cluster IDs, cross-cycle merge, orphan merge, dashboard
