Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
MCP endpoint: `/mcp`

- `payload_ts`: generated column, `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write.
- `cluster_entities`: `(cluster_id, entity)` with index on `entity`. SQL-level entity search.
- `cluster_keywords`: `(cluster_id, keyword)` with index on `keyword`. SQL-level keyword search.
- `seen_articles`: `(article_key, cluster_id, content_hash)`. Per-article dedup with content-change detection.
- Cluster ID: `sha1(min_article_key)`. Topic-independent, order-independent, consistent across polling cycles.
- Cross-cycle merge window (`NEWS_CLUSTER_MAX_AGE_HOURS`, default 6h).
- `seen_articles.content_hash` (SHA-1 of title+summary) detects in-place article updates (e.g. "More to come..." → real content). Changed articles are re-clustered and their `enriched_at` is cleared for re-enrichment.
- Configuration via the `site_config` DB table.
- `article_identity.py`: single source of truth for `article_key()` and `article_content_hash()`.

MCP tools:

- `get_latest_events(topic, limit, include_articles)`
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
- `get_event_summary(event_id, include_articles)`
- `detect_emerging_topics(limit, timeframe, topic, around)`: returns `signal_type` (entity/keyword/phrase)
- `get_news_sentiment(entity, timeframe)`
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
- `get_feeds()` / `toggle_feed(feed_url, enabled)`
- `debug_dedup(url, title?)`: inspect dedup status, similarity signals, match decisions
- `get_capabilities()`

REST API endpoints:

| Method | Path | Description |
|---|---|---|
| GET | `/api/v1/health` | Extended health: stats, freshness, feed state, pruning, `seen_article_count` |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/keywords` | Top keywords by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/clusters/by-entity` | SQL entity search via junction table |
| GET | `/api/v1/clusters/by-keyword` | SQL keyword search via junction table |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable a feed |
| GET | `/api/v1/config` | All site config parameters |
| POST | `/api/v1/config/update` | Update a config parameter at runtime |
| POST | `/api/v1/config/reset` | Reset all config to `.env`/defaults |
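
A minimal sketch of how a client might call the paginated clusters endpoint. Only the path and the parameter names come from the table above; the base URL `http://localhost:8000` and the helper names `clusters_url` and `fetch_clusters` are assumptions for illustration. The response shape is not documented here, so the sketch returns the decoded JSON unchanged.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL; adjust to wherever the server is actually exposed.
BASE_URL = "http://localhost:8000"


def clusters_url(topic=None, hours=None, limit=None, offset=None, base_url=BASE_URL):
    """Build the /api/v1/clusters URL, skipping parameters left as None."""
    params = {"topic": topic, "hours": hours, "limit": limit, "offset": offset}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url}/api/v1/clusters"
    return f"{url}?{query}" if query else url


def fetch_clusters(**kwargs):
    """Fetch one page of clusters and return the decoded JSON body."""
    with urlopen(clusters_url(**kwargs), timeout=10) as resp:
        return json.load(resp)
```

Paging would then be a matter of calling `fetch_clusters(limit=20, offset=0)`, `fetch_clusters(limit=20, offset=20)`, and so on, assuming `offset` counts clusters rather than pages.
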

- Refresh interval: `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
- `seen_articles` by `article_key` (URL): skip already-processed articles (fine)
- `content_hash` comparison: detect in-place content updates, re-cluster + re-enrich changed articles
- `enriched_at` timestamp in cluster payload
- Pruning: `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`

Database schema:

```sql
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword TEXT NOT NULL,
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);

CREATE TABLE seen_articles (
    article_key TEXT PRIMARY KEY,
    cluster_id TEXT NOT NULL,
    first_seen TEXT NOT NULL,
    url TEXT NOT NULL DEFAULT '',
    content_hash TEXT NOT NULL DEFAULT ''
);

CREATE TABLE site_config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    type TEXT NOT NULL DEFAULT 'str',
    category TEXT NOT NULL DEFAULT 'general',
    description TEXT NOT NULL DEFAULT '',
    source TEXT NOT NULL DEFAULT 'default'
);
```
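
To illustrate what the junction tables are for, here is a rough sketch of an entity lookup against a trimmed, in-memory copy of the schema above (only `clusters` and `cluster_entities`, without the generated column). The helper names `open_db`, `add_cluster`, and `clusters_for_entity` are hypothetical, and exact-match, case-sensitive entity comparison is an assumption; the real query behind `/api/v1/clusters/by-entity` may differ.

```python
import sqlite3

# Trimmed copy of the schema: just enough for an entity lookup.
SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL
);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only honours ON DELETE CASCADE when foreign keys are switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_cluster(conn, cluster_id, topic, updated_at, entities):
    """Insert one cluster plus one junction row per entity."""
    conn.execute(
        "INSERT INTO clusters (cluster_id, topic, payload, updated_at) "
        "VALUES (?, ?, '{}', ?)",
        (cluster_id, topic, updated_at),
    )
    conn.executemany(
        "INSERT INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
        [(cluster_id, entity) for entity in entities],
    )


def clusters_for_entity(conn, entity, limit=10):
    """Return cluster IDs mentioning `entity`, most recently updated first."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM cluster_entities AS e
        JOIN clusters AS c ON c.cluster_id = e.cluster_id
        WHERE e.entity = ?
        ORDER BY c.updated_at DESC
        LIMIT ?
        """,
        (entity, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the lookup filters on the indexed `entity` column, it does not need to parse the JSON payload of every cluster. Deleting a row from `clusters` also removes its junction rows, provided `PRAGMA foreign_keys = ON` is set on the connection.
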
All configurable via site_config DB table (dashboard Config page or REST API):
| Parameter | Default | Signal |
|---|---|---|
| `title_threshold` | 0.75 | Min title similarity (SequenceMatcher) |
| `jaccard_threshold` | 0.55 | Min Jaccard token overlap |
| `dual_title_floor` | 0.55 | Dual-signal: min title |
| `dual_jaccard_floor` | 0.25 | Dual-signal: min jaccard |
| `embedding_similarity_threshold` | 0.885 | Cosine threshold (embeddings enabled) |
| `cluster_max_age_hours` | 6 | Cross-cycle merge window |
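
The sketch below shows one way the lexical thresholds in the table could combine into a match decision. The default values are taken from the table; everything else is an assumption made for illustration: the rule that either strong signal alone is enough, the rule that the dual-signal path needs both floors at once, the whitespace tokenization, and the function names. The embedding threshold and the merge window are left out.

```python
from difflib import SequenceMatcher

# Defaults copied from the table above.
TITLE_THRESHOLD = 0.75
JACCARD_THRESHOLD = 0.55
DUAL_TITLE_FLOOR = 0.55
DUAL_JACCARD_FLOOR = 0.25


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive SequenceMatcher ratio between two titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_decision(title_sim: float, jaccard_sim: float) -> bool:
    """Assumed combination rule: one strong signal, or two moderate ones."""
    if title_sim >= TITLE_THRESHOLD or jaccard_sim >= JACCARD_THRESHOLD:
        return True
    return title_sim >= DUAL_TITLE_FLOOR and jaccard_sim >= DUAL_JACCARD_FLOOR


def is_duplicate(title_a: str, text_a: str, title_b: str, text_b: str) -> bool:
    """Score a pair of articles and apply the decision rule."""
    return match_decision(title_similarity(title_a, title_b), jaccard(text_a, text_b))
```

Keeping `match_decision` separate from the scoring functions makes it easy to check how a threshold change would affect borderline pairs, for example a title score of 0.6 with a Jaccard score of 0.3.
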

Content-change detection:

1. `filter_already_seen()` computes `content_hash = SHA1(title|summary)` for each article.
2. `article_key` seen but `content_hash` differs → article is "changed".
3. Changed articles are re-clustered (`article_key` → same cluster).
4. `enriched_at` is cleared in the cluster payload → next enrichment cycle re-processes it.

Runtime configuration:

- `site_config` DB table seeded from `.env` on first startup.
- Runtime updates (`POST /api/v1/config/update`).
- Reset via `POST /api/v1/config/reset`.
- `source` values: `env` (from `.env`), `api` (runtime update), `default` (built-in).

After deploying schema changes:

```bash
docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
```
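
The content-change detection described earlier in this section can be sketched roughly as follows. This is not the code in `article_identity.py`: the function signatures, the `|` separator with UTF-8 encoding, and the in-memory `seen` dict standing in for the `seen_articles` table are all assumptions made for illustration.

```python
import hashlib


def article_content_hash(title: str, summary: str) -> str:
    """SHA-1 over 'title|summary' (separator and encoding are assumptions)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify(article_key: str, title: str, summary: str, seen: dict) -> str:
    """Return 'new', 'changed', or 'seen' for one article.

    `seen` maps article_key -> stored content_hash and is updated in place,
    mimicking a write to the seen_articles table.
    """
    current = article_content_hash(title, summary)
    stored = seen.get(article_key)
    if stored is None:
        status = "new"
    elif stored != current:
        status = "changed"  # would be re-clustered and have enriched_at cleared
    else:
        status = "seen"  # skipped on this polling cycle
    seen[article_key] = current
    return status
```

For example, an article first polled with the summary "More to come..." is classified as `new`, an identical poll afterwards as `seen`, and a later poll with the full text under the same key as `changed`.
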