
Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v0.5.0)

  • FastMCP SSE server mounted at /mcp
  • SQLite cache for clusters + entity metadata + feed state + seen_articles
  • payload_ts — indexed VIRTUAL GENERATED column: json_extract(payload, '$.timestamp'). Auto-maintained by SQLite on write.
  • cluster_entities junction table — (cluster_id, entity) with index on entity. SQL-level entity search.
  • cluster_keywords junction table — (cluster_id, keyword) with index on keyword. SQL-level keyword search.
  • seen_articles lookup table (article_key, cluster_id, content_hash), keyed by article_key. Per-article dedup with content-change detection.
  • All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
  • Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.
  • Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 6h).
  • Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys.
  • Three-layer dedup: feed-level hash (coarse) → seen_articles by URL (fine) → content hash (detects in-place updates).
  • Dual-signal clustering: two articles match when title similarity ≥ 0.75, or Jaccard overlap ≥ 0.55, or both title ≥ 0.55 and Jaccard ≥ 0.25 (the dual rule). Embedding cosine ≥ 0.885 when enabled.
  • Content-change detection: seen_articles.content_hash (SHA-1 of title+summary) detects in-place article updates (e.g. "More to come..." → real content). Changed articles are re-clustered and their enriched_at is cleared for re-enrichment.
  • Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore + rate limiter.
  • Dashboard with Config page for runtime parameter tuning via site_config DB table.
  • article_identity.py — single source of truth for article_key() and article_content_hash().
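
The stable-ID and orphan-merge bullets above can be illustrated with a minimal sketch. It assumes each cluster is represented as a set of article-key strings; the function names `stable_cluster_id` and `merge_orphans` are hypothetical and are not taken from the project code.

```python
import hashlib


def stable_cluster_id(article_keys):
    """SHA-1 of the smallest article key: independent of order and topic."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_orphans(clusters):
    """Union-Find pass: merge clusters that share at least one article key.

    `clusters` is a list of sets of article keys (an assumed representation).
    Returns a dict mapping stable cluster ID -> merged set of article keys.
    """
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # article_key -> index of the first cluster that held it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in owner:
                root_a, root_b = find(idx), find(owner[key])
                if root_a != root_b:
                    parent[root_a] = root_b  # union the two groups
            else:
                owner[key] = idx

    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return {stable_cluster_id(keys): keys for keys in merged.values()}
```

Because the ID depends only on the minimum key, a cluster keeps its ID across polling cycles as long as its smallest article key stays in it, whatever order the articles arrive in.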

MCP tools

  • get_latest_events(topic, limit, include_articles)
  • get_events_for_entity(entity, limit, timeframe, include_articles)
  • get_event_summary(event_id, include_articles)
  • detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)
  • get_news_sentiment(entity, timeframe)
  • get_related_recent_entities(subject, timeframe, limit, include_trends)
  • get_feeds() / toggle_feed(feed_url, enabled)
  • debug_dedup(url, title?) — inspect dedup status, similarity signals, match decisions
  • get_capabilities()

REST API

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/health | Extended health: stats, freshness, feed state, pruning, seen_article_count |
| GET | /api/v1/clusters | Paginated clusters. Params: topic, hours, limit, offset |
| GET | /api/v1/sentiment-series | Sentiment time-series. Params: topic, hours, bucket_hours |
| GET | /api/v1/entities | Top entities by frequency. Params: hours, limit |
| GET | /api/v1/keywords | Top keywords by frequency. Params: hours, limit |
| GET | /api/v1/clusters/by-entity | SQL entity search via junction table |
| GET | /api/v1/clusters/by-keyword | SQL keyword search via junction table |
| GET | /api/v1/cluster/{cluster_id} | Full cluster detail |
| GET | /api/v1/feeds | Feed state list |
| POST | /api/v1/feeds/toggle | Enable/disable a feed |
| GET | /api/v1/config | All site config parameters |
| POST | /api/v1/config/update | Update a config parameter at runtime |
| POST | /api/v1/config/reset | Reset all config to .env/defaults |
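
As a rough usage sketch for the paginated clusters endpoint, the snippet below builds the query string from the documented parameters (topic, hours, limit, offset). The base URL is an assumption, since this document does not state a host or port, and the response is returned as parsed JSON because its shape is not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: host and port are not documented here


def clusters_url(topic=None, hours=None, limit=None, offset=None, base_url=BASE_URL):
    """Build a GET /api/v1/clusters URL, omitting parameters that are not set."""
    params = {"topic": topic, "hours": hours, "limit": limit, "offset": offset}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url}/api/v1/clusters"
    return f"{url}?{query}" if query else url


def fetch_clusters(**kwargs):
    """Fetch one page of clusters; the JSON structure is whatever the server returns."""
    with urlopen(clusters_url(**kwargs), timeout=10) as resp:
        return json.load(resp)
```

Paging through results would then amount to calling `fetch_clusters(limit=50, offset=0)`, `fetch_clusters(limit=50, offset=50)`, and so on.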

Refresh & caching

  • Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 300s)
  • Three-layer dedup:
    1. Feed-level content hash — skip entire unchanged feeds (coarse, O(1))
    2. seen_articles by article_key (URL) — skip already-processed articles (fine)
    3. content_hash comparison — detect in-place content updates, re-cluster + re-enrich changed articles
  • Enrichment caching via enriched_at timestamp in cluster payload
  • Pruning via NEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURS
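
The first dedup layer can be pictured with a short sketch. It assumes the feed hash is a SHA-1 over the raw response body and that feed state is a plain dict keyed by feed URL; neither detail is confirmed by this document, and `feed_changed` is a hypothetical name.

```python
import hashlib


def feed_changed(feed_state, feed_url, raw_feed_bytes):
    """Layer 1: return False when the feed body matches the hash from the last poll.

    `feed_state` maps feed_url -> last seen hash (an assumed in-memory stand-in
    for the feed state stored in SQLite).
    """
    new_hash = hashlib.sha1(raw_feed_bytes).hexdigest()
    if feed_state.get(feed_url) == new_hash:
        return False  # unchanged feed: skip parsing and the per-article layers
    feed_state[feed_url] = new_hash
    return True
```

Only feeds for which this returns True need to go through the per-article checks in layers 2 and 3.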

Schema (clusters table)

CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);

CREATE TABLE seen_articles (
    article_key  TEXT PRIMARY KEY,
    cluster_id   TEXT NOT NULL,
    first_seen   TEXT NOT NULL,
    url          TEXT NOT NULL DEFAULT '',
    content_hash TEXT NOT NULL DEFAULT ''
);

CREATE TABLE site_config (
    key         TEXT PRIMARY KEY,
    value       TEXT NOT NULL,
    type        TEXT NOT NULL DEFAULT 'str',
    category    TEXT NOT NULL DEFAULT 'general',
    description TEXT NOT NULL DEFAULT '',
    source      TEXT NOT NULL DEFAULT 'default'
);
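
To show how the generated column and the entity junction table support indexed queries, here is a small in-memory sketch using a subset of the schema above. It assumes payload timestamps are ISO-8601 UTC strings (so string comparison orders them correctly) and that the connection enables `PRAGMA foreign_keys`, which SQLite needs for ON DELETE CASCADE to take effect. The helper names are hypothetical.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # cascades are off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def add_cluster(conn, cluster_id, topic, timestamp, entities):
    # payload_ts is never written directly; SQLite derives it from the payload.
    payload = json.dumps({"timestamp": timestamp})
    conn.execute(
        "INSERT INTO clusters (cluster_id, topic, payload, updated_at) VALUES (?, ?, ?, ?)",
        (cluster_id, topic, payload, timestamp),
    )
    conn.executemany(
        "INSERT INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
        [(cluster_id, entity) for entity in entities],
    )


def clusters_for_entity(conn, entity, since):
    """Entity search plus time filter, newest first, without parsing JSON in Python."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM clusters c
        JOIN cluster_entities e ON e.cluster_id = c.cluster_id
        WHERE e.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        """,
        (entity, since),
    ).fetchall()
    return [row[0] for row in rows]


conn = open_db()
add_cluster(conn, "c1", "tech", "2026-06-01T10:00:00Z", ["Acme", "Globex"])
add_cluster(conn, "c2", "tech", "2026-06-02T10:00:00Z", ["Acme"])
```

Generated columns require SQLite 3.31 or later, so an older system library would reject the CREATE TABLE statement.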

Clustering thresholds (v0.5.0)

All configurable via site_config DB table (dashboard Config page or REST API):

| Parameter | Default | Signal |
| --- | --- | --- |
| title_threshold | 0.75 | Min title similarity (SequenceMatcher) |
| jaccard_threshold | 0.55 | Min Jaccard token overlap |
| dual_title_floor | 0.55 | Dual-signal: min title |
| dual_jaccard_floor | 0.25 | Dual-signal: min jaccard |
| embedding_similarity_threshold | 0.885 | Cosine threshold (embeddings enabled) |
| cluster_max_age_hours | 6 | Cross-cycle merge window |
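
A minimal sketch of how the text thresholds might combine, using the defaults listed above. The tokenization (lowercased whitespace split) and the choice of which text feeds the Jaccard score are assumptions, and `decide` and `is_match` are hypothetical names rather than the project's functions. The embedding signal is left out.

```python
from difflib import SequenceMatcher

DEFAULTS = {
    "title_threshold": 0.75,
    "jaccard_threshold": 0.55,
    "dual_title_floor": 0.55,
    "dual_jaccard_floor": 0.25,
}


def title_similarity(a, b):
    """Character-level similarity in [0, 1] via difflib.SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Token-set overlap in [0, 1]; whitespace tokenization is an assumption."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def decide(title_score, jaccard_score, cfg=DEFAULTS):
    """Match on either strong signal alone, or on two moderate signals together."""
    strong_title = title_score >= cfg["title_threshold"]
    strong_jaccard = jaccard_score >= cfg["jaccard_threshold"]
    dual = (title_score >= cfg["dual_title_floor"]
            and jaccard_score >= cfg["dual_jaccard_floor"])
    return strong_title or strong_jaccard or dual


def is_match(title_a, text_a, title_b, text_b, cfg=DEFAULTS):
    return decide(title_similarity(title_a, title_b), jaccard(text_a, text_b), cfg)
```

The dual rule is what lets a pair with title 0.60 and Jaccard 0.30 merge even though neither score clears its standalone threshold.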

Content-change detection (v0.5.0)

  1. On each poll, filter_already_seen() computes content_hash = SHA1(title|summary) for each article
  2. If article_key seen but content_hash differs → article is "changed"
  3. Changed articles are re-clustered into their existing cluster (same article_key → same cluster)
  4. enriched_at is cleared in the cluster payload → next enrichment cycle re-processes it
  5. Empty stored hashes (pre-migration rows) are treated as unchanged — hash is populated on next upsert
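
The steps above can be sketched as follows. This is a simplified stand-in, not the project's filter_already_seen(): it assumes articles are dicts with `key`, `title` and `summary` fields, that `seen` maps article_key to the stored content_hash, and that the hash input is the title and summary joined by `|`.

```python
import hashlib


def sketch_content_hash(title, summary):
    """SHA-1 over title and summary; the '|' separator is an assumption."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify_articles(articles, seen):
    """Split polled articles into (new, changed, unchanged) lists of article keys.

    `seen` maps article_key -> stored content_hash, with '' for pre-migration rows.
    """
    new, changed, unchanged = [], [], []
    for article in articles:
        key = article["key"]
        current = sketch_content_hash(article["title"], article["summary"])
        if key not in seen:
            new.append(key)
        elif seen[key] and seen[key] != current:
            changed.append(key)  # seen before, but the content was edited in place
        else:
            unchanged.append(key)  # same hash, or empty stored hash (step 5)
    return new, changed, unchanged


def clear_enrichment(cluster_payload):
    """Step 4: drop enriched_at so the next enrichment cycle picks the cluster up."""
    cluster_payload.pop("enriched_at", None)
    return cluster_payload
```

Treating an empty stored hash as unchanged avoids re-enriching every pre-migration article on the first poll after an upgrade.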

Site config (v0.5.0)

  • site_config DB table seeded from .env on first startup
  • Dashboard Config page: grouped by category (Clustering, Enrichment, Retention)
  • Runtime updates via REST API (POST /api/v1/config/update)
  • Reset to defaults via POST /api/v1/config/reset
  • Source tracking: env (from .env), api (runtime update), default (built-in)
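
One way to picture the seed, update and reset flow with source tracking is the sketch below. The mapping from config key to environment variable (`NEWS_` plus the upper-cased key) is an assumption extrapolated from NEWS_CLUSTER_MAX_AGE_HOURS, and all function names are hypothetical.

```python
import sqlite3

# Built-in defaults: key -> (value as text, type name). A small illustrative subset.
BUILTIN_DEFAULTS = {
    "title_threshold": ("0.75", "float"),
    "jaccard_threshold": ("0.55", "float"),
    "cluster_max_age_hours": ("6", "int"),
}
CASTS = {"int": int, "float": float, "str": str}


def env_name(key):
    return "NEWS_" + key.upper()  # assumed naming convention


def seed(conn, env):
    """Create site_config if needed and insert any missing keys; existing rows are kept."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS site_config (
            key         TEXT PRIMARY KEY,
            value       TEXT NOT NULL,
            type        TEXT NOT NULL DEFAULT 'str',
            category    TEXT NOT NULL DEFAULT 'general',
            description TEXT NOT NULL DEFAULT '',
            source      TEXT NOT NULL DEFAULT 'default'
        )""")
    for key, (default, type_name) in BUILTIN_DEFAULTS.items():
        from_env = env.get(env_name(key))
        value, source = (from_env, "env") if from_env is not None else (default, "default")
        conn.execute(
            "INSERT OR IGNORE INTO site_config (key, value, type, source) VALUES (?, ?, ?, ?)",
            (key, value, type_name, source),
        )


def update(conn, key, value):
    """Runtime update: store the new value and mark its source as 'api'."""
    conn.execute(
        "UPDATE site_config SET value = ?, source = 'api' WHERE key = ?",
        (str(value), key),
    )


def reset(conn, env):
    """Discard runtime overrides and re-seed from the environment and defaults."""
    conn.execute("DELETE FROM site_config")
    seed(conn, env)


def get(conn, key):
    """Return (typed value, source) for a config key."""
    value, type_name, source = conn.execute(
        "SELECT value, type, source FROM site_config WHERE key = ?", (key,)
    ).fetchone()
    return CASTS[type_name](value), source
```

Using INSERT OR IGNORE in `seed` means a restart does not overwrite values changed through the API, which matches the idea of seeding only on first startup.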

Dashboard (v0.5.0)

  • Health — stats, freshness, topic distribution, sentiment overview, feed activity
  • Feeds — toggle feeds on/off
  • Clusters — filterable table, click for drill-down modal
  • Sentiment — time-series chart
  • Entities — top entities, bar chart, click for matching clusters
  • Keywords — top keywords, bar chart, click for matching clusters
  • Config — runtime parameter tuning (new in v0.5.0)

Backfill scripts

After deploying schema changes:

docker exec -it news-mcp python3 scripts/backfill_seen_articles.py

Version history

  • v0.5.0 (2026-06-03): seen_articles table, content-change detection, dual-signal clustering, site_config DB + dashboard Config page, debug_dedup tool, article_identity module
  • v0.4.0 (2026-05): junction tables, stable cluster IDs, cross-cycle merge, orphan merge, dashboard