PROJECT.md 6.4 KB

Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v0.4.0)

  • FastMCP SSE server mounted at /mcp
  • SQLite cache for clusters + entity metadata + feed state + LLM summary caches
  • payload_ts — indexed VIRTUAL GENERATED column: json_extract(payload, '$.timestamp'). Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
  • cluster_entities junction table — (cluster_id, entity) with index on entity. Populated in upsert_clusters(). SQL-level entity search.
  • cluster_keywords junction table — (cluster_id, keyword) with index on keyword. Same pattern.
  • All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
  • Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.
  • Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 4h).
  • Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys
  • Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
  • Dashboard REST API (/api/v1/*) + Keywords panel + entity/keyword drill-down via junction tables

MCP tools

  • get_latest_events(topic, limit, include_articles)
  • get_events_for_entity(entity, limit, timeframe, include_articles)
  • get_event_summary(event_id, include_articles)
  • detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)
  • get_news_sentiment(entity, timeframe)
  • get_related_recent_entities(subject, timeframe, limit, include_trends)
  • get_feeds() / toggle_feed(feed_url, enabled)
  • get_capabilities()

REST API

  • GET / — server info, tools list
  • GET /health — uptime, version hash
  • GET /api/v1/clusters — paginated, filtered by payload_ts SQL index
  • GET /api/v1/entities — top entities via junction table GROUP BY
  • GET /api/v1/keywords — top keywords via junction table GROUP BY
  • GET /api/v1/clusters/by-entity?entity=X&hours=Y — SQL entity search (NEW)
  • GET /api/v1/clusters/by-keyword?keyword=X&hours=Y — SQL keyword search (NEW)
  • GET /api/v1/sentiment-series — filtered by payload_ts SQL index
  • GET /api/v1/cluster/{cluster_id} — full detail
  • GET /api/v1/feeds / POST /api/v1/feeds/toggle — feed management

Refresh & caching

  • Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 300s)
  • Feed-hash skipping to avoid redundant RSS+LLM work
  • Summary caching for get_event_summary
  • Pruning via NEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURS

Schema (clusters table)

CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);

Keyword Utilization (done, May 2026)

Keywords extracted by the LLM are now first-class search signals:

  • _cluster_entity_haystack() includes keywords → get_events_for_entity() matches themes
  • Cluster output includes keywords[] field
  • detect_emerging_topics() scores keywords with velocity/recency/source-diversity formula (signal_type: "keyword")
  • _collect_local_related() counts keyword co-occurrence
  • Dashboard Keywords panel with SQL frequency counts via junction table
  • Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time

Two-Store Collapse (done, May 2026)

DashboardStore has been eliminated. All of its methods were moved into SQLiteClusterStore (the single data access layer), and the REST API routes now use the shared SQLiteClusterStore instance directly.

All MCP tools (get_events_for_entity, get_news_sentiment, get_latest_events entity mode) now use SQLiteClusterStore.get_clusters_by_entity_or_keyword() which searches via junction-table SQL joins — no row-limit blind spot. The cluster_entities and cluster_keywords junction tables are indexed for O(log n) lookup across any time window.

Timestamp Pipeline (May 2026)

  1. Write: sanitize_cluster_payload() normalizes timestamp/first_seen/last_updated to YYYY-MM-DDTHH:MM:SS+00:00. If all three missing, falls back to datetime.now().
  2. Generated column: payload_ts auto-extracts from JSON on write. Indexed.
  3. Read: All queries filter by payload_ts >= ? in SQL. No JSON parsing for time filtering.
  4. Backfill: One-time scripts/backfill_junction_tables.py populated junction tables from existing payloads. payload_ts was auto-populated by SQLite.

Design Flaw: Two Stores (FIXED, May 2026)

What happened: DashboardStore was a thin read-only query layer that wrapped SQLiteClusterStore. The MCP tools (get_events_for_entity, get_news_sentiment, get_latest_events entity mode) did Python-side entity matching by fetching top-N clusters via payload_ts then filtering in Python. Entities in clusters beyond the limit were silently missed.

Fix applied:

  • DashboardStore was deleted. All its methods are now in SQLiteClusterStore.
  • All MCP tools use SQLiteClusterStore.get_clusters_by_entity_or_keyword() — SQL-level junction-table search with no row-limit blind spot.
  • The combined method uses LEFT JOIN on cluster_entities and cluster_keywords with WHERE (ce.entity IN (...) OR ck.keyword IN (...)), which matches both named entities and thematic keywords across any time window.
  • Exact matching (via IN) replaced substring matching — more correct, no false positives from partial name matches.