Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v0.4.0)

FastMCP SSE server mounted at /mcp
SQLite cache for clusters + entity metadata + feed state + LLM summary caches
payload_ts — indexed VIRTUAL GENERATED column: json_extract(payload, '$.timestamp'). Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
cluster_entities junction table — (cluster_id, entity) with index on entity. Populated in upsert_clusters(). SQL-level entity search.
cluster_keywords junction table — (cluster_id, keyword) with index on keyword. Same pattern.
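On the dashboard/REST path, all time-range filters and entity/keyword searches use SQL indexes, with no full-table JSON parsing at query time. The MCP tool path uses the payload_ts index for time filtering but still filters entities in Python (see Design Flaw: Two Stores).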
Stable cluster IDs: sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.
Cross-cycle merge: poller seeds clustering with recent DB clusters (configurable NEWS_CLUSTER_MAX_AGE_HOURS, default 4h).
Orphan merge: post-clustering Union-Find pass merges clusters sharing article keys
Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
Dashboard REST API (/api/v1/*) + Keywords panel + entity/keyword drill-down via junction tables
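
The stable-ID and orphan-merge items above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the names stable_cluster_id and merge_overlapping are hypothetical, a cluster is modeled as a plain set of article-key strings, and "min" is taken to mean the lexicographically smallest key.

```python
import hashlib


def stable_cluster_id(article_keys):
    """sha1 of the smallest article key (assumed: lexicographic minimum)."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_overlapping(clusters):
    """Union-Find pass: merge clusters that share at least one article key."""
    parent = list(range(len(clusters)))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    owner = {}  # article key -> index of the first cluster that held it
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in owner:
                parent[find(idx)] = find(owner[key])
            else:
                owner[key] = idx

    # Collect the article keys of every cluster under its root.
    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())
```

Because the ID depends only on the smallest key, it does not change with article order or topic label. In this sketch a merged cluster would take the ID derived from the smallest key across all of its members.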

MCP tools

get_latest_events(topic, limit, include_articles)
get_events_for_entity(entity, limit, timeframe, include_articles)
get_event_summary(event_id, include_articles)
detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)
get_news_sentiment(entity, timeframe)
get_related_recent_entities(subject, timeframe, limit, include_trends)
get_feeds() / toggle_feed(feed_url, enabled)
get_capabilities()

REST API

GET / — server info, tools list
GET /health — uptime, version hash
GET /api/v1/clusters — paginated, filtered by payload_ts SQL index
GET /api/v1/entities — top entities via junction table GROUP BY
GET /api/v1/keywords — top keywords via junction table GROUP BY
GET /api/v1/clusters/by-entity?entity=X&hours=Y — SQL entity search (NEW)
GET /api/v1/clusters/by-keyword?keyword=X&hours=Y — SQL keyword search (NEW)
GET /api/v1/sentiment-series — filtered by payload_ts SQL index
GET /api/v1/cluster/{cluster_id} — full detail
GET /api/v1/feeds / POST /api/v1/feeds/toggle — feed management

Refresh & caching

Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 300s)
Feed-hash skipping to avoid redundant RSS+LLM work
Summary caching for get_event_summary
Pruning via NEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURS
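
Feed-hash skipping could look something like the sketch below. The details are assumptions: hashing the raw response body with SHA-256, the helper names feed_digest and should_process, and an in-memory dict standing in for the feed state that the project keeps in SQLite.

```python
import hashlib


def feed_digest(body: bytes) -> str:
    """Digest of the raw feed body (assumed: SHA-256 over the bytes)."""
    return hashlib.sha256(body).hexdigest()


def should_process(feed_url, body, seen):
    """Return True if the feed changed since the last poll, and record the new digest.

    `seen` maps feed URL -> last digest; a dict is used here only for illustration.
    """
    digest = feed_digest(body)
    if seen.get(feed_url) == digest:
        return False  # unchanged: skip parsing, clustering, and LLM enrichment
    seen[feed_url] = digest
    return True
```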

Schema (clusters table)

CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
    summary_payload TEXT,
    summary_updated_at TEXT,
    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);

CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);

CREATE TABLE cluster_keywords (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    keyword    TEXT NOT NULL,          -- lowercased
    PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
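
As a rough illustration of how the generated column behaves, the sketch below builds a trimmed copy of the clusters table in memory and runs a time-range query against payload_ts. The helper names open_demo_db and recent_cluster_ids and the sample rows are made up for the example. Generated columns need SQLite 3.31 or later, and json_extract needs the JSON functions to be compiled in (the default since 3.38).

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
"""


def open_demo_db():
    """Create an in-memory database with three sample clusters (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        ("c1", "2026-05-01T00:00:00+00:00"),
        ("c2", "2026-05-02T12:00:00+00:00"),
        ("c3", "2026-05-03T08:30:00+00:00"),
    ]
    for cluster_id, ts in rows:
        # payload_ts is never written directly; SQLite derives it from payload.
        conn.execute(
            "INSERT INTO clusters (cluster_id, topic, payload, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (cluster_id, "other", json.dumps({"timestamp": ts}), ts),
        )
    return conn


def recent_cluster_ids(conn, since_iso):
    """Return cluster IDs with payload_ts >= since_iso, newest first."""
    cur = conn.execute(
        "SELECT cluster_id FROM clusters WHERE payload_ts >= ? "
        "ORDER BY payload_ts DESC",
        (since_iso,),
    )
    return [row[0] for row in cur]
```

The comparison is a plain string comparison, which orders correctly only because every timestamp is stored in the same fixed-width format with a +00:00 offset.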

Keyword Utilization (done, May 2026)

Keywords extracted by the LLM are now first-class search signals:

_cluster_entity_haystack() includes keywords → get_events_for_entity() matches themes
Cluster output includes keywords[] field
detect_emerging_topics() scores keywords with velocity/recency/source-diversity formula (signal_type: "keyword")
_collect_local_related() counts keyword co-occurrence
Dashboard Keywords panel with SQL frequency counts via junction table
Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time

Timestamp Pipeline (May 2026)

Write: sanitize_cluster_payload() normalizes timestamp, first_seen, and last_updated to YYYY-MM-DDTHH:MM:SS+00:00. If all three are missing, it falls back to datetime.now().
Generated column: payload_ts auto-extracts from JSON on write. Indexed.
Read: All queries filter by payload_ts >= ? in SQL. No JSON parsing for time filtering.
Backfill: One-time scripts/backfill_junction_tables.py populated junction tables from existing payloads. payload_ts was auto-populated by SQLite.
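
The write-side normalization might be sketched like this. It is not the actual sanitize_cluster_payload(): the helper names normalize_ts and sanitize_timestamps are hypothetical, naive timestamps are assumed to be UTC, the fallback uses a UTC-aware "now", and filling a missing field from the first valid one is an assumption (the notes above only specify the case where all three are missing).

```python
from datetime import datetime, timezone

TS_FIELDS = ("timestamp", "first_seen", "last_updated")


def normalize_ts(value):
    """Return value as YYYY-MM-DDTHH:MM:SS+00:00, or None if it cannot be parsed."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc).replace(microsecond=0).isoformat()


def sanitize_timestamps(payload, now=None):
    """Normalize the three timestamp fields on a copy of the payload."""
    cleaned = {field: normalize_ts(payload.get(field)) for field in TS_FIELDS}
    fallback = next((v for v in cleaned.values() if v), None)
    if fallback is None:
        # All three missing or unparseable: fall back to the current time.
        fallback = normalize_ts((now or datetime.now(timezone.utc)).isoformat())
    result = dict(payload)
    for field in TS_FIELDS:
        result[field] = cleaned[field] or fallback
    return result
```

Converting everything to a single offset is what makes the string comparison payload_ts >= ? safe in SQL.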

Design Flaw: Two Stores (KNOWN, fix planned)

Problem: SQLiteClusterStore and DashboardStore are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only DashboardStore was updated. SQLiteClusterStore (used by MCP tools) still does entity/keyword search in Python over parsed cluster payloads instead of querying the junction tables.

Current state:

DashboardStore — uses SQL payload_ts filter + junction tables ✓
SQLiteClusterStore — uses SQL payload_ts filter for time ✓, but MCP tool entity search (get_events_for_entity, get_news_sentiment) still fetches top-N clusters by time then filters entities in Python

Consequence: get_events_for_entity("Pete Hegseth", timeframe="72h") fetches the 200 most recent clusters (via payload_ts), then loops in Python checking entities. If the entity appears in 34 clusters within the window but only 15 of them are among those 200, the other 19 are silently missed.

Proposed fix: Collapse both stores into one. SQLiteClusterStore should be the single data access layer with proper junction-table methods for entity/keyword search. DashboardStore should be a thin wrapper or removed entirely. MCP tools should call SQLiteClusterStore.get_clusters_by_entity() using junction tables instead of Python-side filtering.
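
A junction-table version of the entity lookup could look roughly like the sketch below. Only the method name get_clusters_by_entity, the table and column names, and the lowercased entity convention come from this document. The signature (a plain function taking a connection), the hours/limit/now parameters, and the return shape are assumptions.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def get_clusters_by_entity(conn, entity, hours, limit=50, now=None):
    """Return (cluster_id, payload dict) pairs for an entity, newest first.

    The entity match and the time cutoff both run in SQL, so the row limit is
    applied after filtering rather than before it.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=hours)).replace(microsecond=0).isoformat()
    rows = conn.execute(
        """
        SELECT c.cluster_id, c.payload
        FROM cluster_entities AS e
        JOIN clusters AS c ON c.cluster_id = e.cluster_id
        WHERE e.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        LIMIT ?
        """,
        (entity.lower(), cutoff, limit),  # entities are stored lowercased
    ).fetchall()
    return [(cluster_id, json.loads(payload)) for cluster_id, payload in rows]
```

With this shape, a call such as get_clusters_by_entity(conn, "Pete Hegseth", hours=72) would no longer depend on which clusters happen to fall in the most recent 200.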
