Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v0.4.0)

  • FastMCP SSE server mounted at /mcp
  • SQLite cache for clusters + entity metadata + feed state + LLM summary caches
  • payload_ts: indexed VIRTUAL GENERATED column defined as json_extract(payload, '$.timestamp'). SQLite derives the value from payload and updates the index automatically on every write, so time-range queries are O(log n) index lookups and no write-path code is needed.
  • cluster_entities junction table — (cluster_id, entity) with index on entity. Populated in upsert_clusters(). SQL-level entity search.
  • cluster_keywords junction table — (cluster_id, keyword) with index on keyword. Same pattern.
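  • All time-range filters use the payload_ts index. Entity/keyword searches use the junction-table indexes in DashboardStore, with no JSON parsing at query time; the MCP tool path in SQLiteClusterStore still filters entities in Python (see "Design Flaw: Two Stores").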
  • Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.
  • Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 4h).
  • Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys
  • Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
  • Dashboard REST API (/api/v1/*) + Keywords panel + entity/keyword drill-down via junction tables
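
The stable-ID and orphan-merge bullets above can be sketched as follows. This is a minimal illustration, not the project's code: the function names stable_cluster_id and merge_orphans are hypothetical, a cluster is modeled as a plain set of article keys, and min_article_key is assumed to mean the lexicographically smallest key.

```python
import hashlib


def stable_cluster_id(article_keys):
    """Hash the smallest article key (assumed reading of sha1(min_article_key))."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_orphans(clusters):
    """Merge clusters that share at least one article key, using Union-Find.

    `clusters` is a list of sets of article keys (hypothetical shape).
    """
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_owner = {}  # article key -> index of the first cluster containing it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in first_owner:
                union(first_owner[key], idx)
            else:
                first_owner[key] = idx

    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())
```

Because the ID depends only on the minimum key, it does not change with article order or topic label. Under this reading it would change only if an article with a smaller key joined the cluster.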

MCP tools

  • get_latest_events(topic, limit, include_articles)
  • get_events_for_entity(entity, limit, timeframe, include_articles)
  • get_event_summary(event_id, include_articles)
  • detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)
  • get_news_sentiment(entity, timeframe)
  • get_related_recent_entities(subject, timeframe, limit, include_trends)
  • get_feeds() / toggle_feed(feed_url, enabled)
  • get_capabilities()

REST API

  • GET / — server info, tools list
  • GET /health — uptime, version hash
  • GET /api/v1/clusters — paginated, filtered by payload_ts SQL index
  • GET /api/v1/entities — top entities via junction table GROUP BY
  • GET /api/v1/keywords — top keywords via junction table GROUP BY
  • GET /api/v1/clusters/by-entity?entity=X&hours=Y — SQL entity search (NEW)
  • GET /api/v1/clusters/by-keyword?keyword=X&hours=Y — SQL keyword search (NEW)
  • GET /api/v1/sentiment-series — filtered by payload_ts SQL index
  • GET /api/v1/cluster/{cluster_id} — full detail
  • GET /api/v1/feeds / POST /api/v1/feeds/toggle — feed management

Refresh & caching

  • Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 300s)
  • Feed-hash skipping to avoid redundant RSS+LLM work
  • Summary caching for get_event_summary
  • Pruning via NEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURS
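
Feed-hash skipping can be sketched as below. The details are assumptions: the hash algorithm, hashing the raw response body (rather than, say, entry IDs), and the in-memory dict standing in for the feed state that the project keeps in SQLite. The function names are hypothetical.

```python
import hashlib


def feed_hash(body: bytes) -> str:
    """Digest of the raw feed body (assumed input to the hash)."""
    return hashlib.sha256(body).hexdigest()


def should_process(feed_url, body, seen_hashes):
    """Return True when the feed changed since the last poll, and record the new hash."""
    digest = feed_hash(body)
    if seen_hashes.get(feed_url) == digest:
        return False  # unchanged feed: skip parsing, clustering, and LLM enrichment
    seen_hashes[feed_url] = digest
    return True
```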

Schema (clusters table)

CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
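
A quick way to see the schema's behavior is to load it into an in-memory database. The sketch below uses a trimmed copy of the schema above with made-up sample rows. It assumes SQLite 3.31 or later (generated columns) with JSON functions available. Note that SQLite leaves foreign-key enforcement off by default, so ON DELETE CASCADE only fires when the connection enables it; it is worth confirming that the project's connection setup does so.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE to take effect
conn.executescript("""
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
""")

# Sample row (hypothetical data); payload_ts is never written explicitly.
payload = json.dumps({"timestamp": "2026-05-01T12:00:00+00:00", "title": "demo"})
conn.execute(
    "INSERT INTO clusters (cluster_id, topic, payload, updated_at) VALUES (?, ?, ?, ?)",
    ("c1", "macro", payload, "2026-05-01T12:05:00+00:00"),
)
conn.execute("INSERT INTO cluster_entities VALUES (?, ?)", ("c1", "federal reserve"))

# The generated column reflects the JSON timestamp.
ts = conn.execute(
    "SELECT payload_ts FROM clusters WHERE cluster_id = 'c1'"
).fetchone()[0]

# Deleting the cluster removes its junction rows through the cascade.
conn.execute("DELETE FROM clusters WHERE cluster_id = 'c1'")
remaining = conn.execute("SELECT COUNT(*) FROM cluster_entities").fetchone()[0]
```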

Keyword Utilization (done, May 2026)

Keywords extracted by the LLM are now first-class search signals:

  • _cluster_entity_haystack() includes keywords → get_events_for_entity() matches themes
  • Cluster output includes keywords[] field
  • detect_emerging_topics() scores keywords with velocity/recency/source-diversity formula (signal_type: "keyword")
  • _collect_local_related() counts keyword co-occurrence
  • Dashboard Keywords panel with SQL frequency counts via junction table
  • Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
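
The topic-label filter in the last bullet could look roughly like this. The function name clean_keywords is hypothetical, and the whitespace trimming and de-duplication are assumptions; lowercasing follows the schema comment on cluster_keywords.keyword.

```python
TOPIC_LABELS = {"crypto", "macro", "regulation", "ai", "other"}


def clean_keywords(raw_keywords):
    """Lowercase keywords, drop topic labels and duplicates, keep first-seen order."""
    seen = set()
    cleaned = []
    for keyword in raw_keywords:
        norm = keyword.strip().lower()
        if not norm or norm in TOPIC_LABELS or norm in seen:
            continue
        seen.add(norm)
        cleaned.append(norm)
    return cleaned
```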

Timestamp Pipeline (May 2026)

  1. Write: sanitize_cluster_payload() normalizes timestamp/first_seen/last_updated to YYYY-MM-DDTHH:MM:SS+00:00. If all three are missing, it falls back to datetime.now().
  2. Generated column: payload_ts is derived from the JSON payload by SQLite, and its index is updated automatically on write.
  3. Read: All queries filter by payload_ts >= ? in SQL. No JSON parsing for time filtering.
  4. Backfill: One-time scripts/backfill_junction_tables.py populated junction tables from existing payloads. payload_ts was auto-populated by SQLite.
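
The write step can be sketched as follows. This is not the project's sanitize_cluster_payload(): the helper names are hypothetical, naive input values are assumed to be UTC, the fallback uses a timezone-aware "now", and fields that are missing or unparseable are simply left out, since the source does not say how partially missing fields are filled. A single fixed format matters because payload_ts >= ? compares strings, and ISO 8601 strings with the same offset sort chronologically.

```python
from datetime import datetime, timezone

TS_FIELDS = ("timestamp", "first_seen", "last_updated")


def normalize_ts(value):
    """Return value as YYYY-MM-DDTHH:MM:SS+00:00, or None if it cannot be parsed."""
    if not isinstance(value, str) or not value:
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc).replace(microsecond=0).isoformat()


def sanitize_timestamps(payload):
    """Normalize the three timestamp fields; fall back to now if none are usable."""
    normalized = {field: normalize_ts(payload.get(field)) for field in TS_FIELDS}
    if all(value is None for value in normalized.values()):
        now = datetime.now(timezone.utc).replace(microsecond=0)
        normalized["timestamp"] = now.isoformat()
    result = {k: v for k, v in payload.items() if k not in TS_FIELDS}
    result.update({f: v for f, v in normalized.items() if v is not None})
    return result
```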

Design Flaw: Two Stores (KNOWN, fix planned)

Problem: SQLiteClusterStore and DashboardStore are parallel copies of the same data access layer. Methods were duplicated, along with their JSON-parsing approach, when DashboardStore was added. When junction tables were implemented, only DashboardStore was updated. SQLiteClusterStore (used by MCP tools) still runs entity/keyword search by parsing cluster JSON in Python instead of querying the junction tables.

Current state:

  • DashboardStore — uses SQL payload_ts filter + junction tables ✓
  • SQLiteClusterStore — uses SQL payload_ts filter for time ✓, but MCP tool entity search (get_events_for_entity, get_news_sentiment) still fetches top-N clusters by time then filters entities in Python

Consequence: get_events_for_entity("Pete Hegseth", timeframe="72h") fetches the 200 most recent clusters (via payload_ts), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.

Proposed fix: Collapse both stores into one. SQLiteClusterStore should be the single data access layer with proper junction-table methods for entity/keyword search. DashboardStore should be a thin wrapper or removed entirely. MCP tools should call SQLiteClusterStore.get_clusters_by_entity() using junction tables instead of Python-side filtering.
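
The difference between the two approaches can be reproduced on a small in-memory database. The sketch below is illustrative only: the trimmed schema, the entities list inside the payload, the sample data, and the signature of get_clusters_by_entity() are assumptions, and top_n_then_filter() is a hypothetical stand-in for the current MCP path.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
""")

# Ten hourly clusters; the entity appears only in the oldest (c0) and newest (c9).
for hour in range(10):
    cid = f"c{hour}"
    entities = ["Pete Hegseth"] if hour in (0, 9) else ["Other Person"]
    payload = json.dumps(
        {"timestamp": f"2026-05-01T{hour:02d}:00:00+00:00", "entities": entities}
    )
    conn.execute("INSERT INTO clusters (cluster_id, payload) VALUES (?, ?)", (cid, payload))
    conn.executemany(
        "INSERT INTO cluster_entities VALUES (?, ?)",
        [(cid, e.lower()) for e in entities],
    )


def get_clusters_by_entity(conn, entity, since_ts, limit=50):
    """Junction-table search: the entity filter runs in SQL before LIMIT applies."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM cluster_entities AS ce
        JOIN clusters AS c ON c.cluster_id = ce.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        LIMIT ?
        """,
        (entity.lower(), since_ts, limit),
    ).fetchall()
    return [row[0] for row in rows]


def top_n_then_filter(conn, entity, since_ts, n):
    """Current pattern: take the n newest clusters, then filter entities in Python."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters "
        "WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ?",
        (since_ts, n),
    ).fetchall()
    wanted = entity.lower()
    return [
        cid for cid, payload in rows
        if wanted in (e.lower() for e in json.loads(payload).get("entities", []))
    ]


since = "2026-05-01T00:00:00+00:00"
via_junction = get_clusters_by_entity(conn, "Pete Hegseth", since)
via_top_n = top_n_then_filter(conn, "Pete Hegseth", since, n=3)
```

With n=3 the Python-side filter only ever sees c9, c8, and c7, so it misses c0, while the junction-table query returns both matches. This is the same failure mode as the 200-cluster example above, at a smaller scale.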