Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
/mcpjson_extract(payload, '$.timestamp'). Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.(cluster_id, entity) with index on entity. Populated in upsert_clusters(). SQL-level entity search.(cluster_id, keyword) with index on keyword. Same pattern.sha1(min_article_key) — topic-independent, order-independent, consistent across polling cycles.NEWS_CLUSTER_MAX_AGE_HOURS, default 4h)./api/v1/*) + Keywords panel + entity/keyword drill-down via junction tablesget_latest_events(topic, limit, include_articles)get_events_for_entity(entity, limit, timeframe, include_articles)get_event_summary(event_id, include_articles)detect_emerging_topics(limit, timeframe, topic, around) — returns signal_type (entity/keyword/phrase)get_news_sentiment(entity, timeframe)get_related_recent_entities(subject, timeframe, limit, include_trends)get_feeds() / toggle_feed(feed_url, enabled)get_capabilities()GET / — server info, tools listGET /health — uptime, version hashGET /api/v1/clusters — paginated, filtered by payload_ts SQL indexGET /api/v1/entities — top entities via junction table GROUP BYGET /api/v1/keywords — top keywords via junction table GROUP BYGET /api/v1/clusters/by-entity?entity=X&hours=Y — SQL entity search (NEW)GET /api/v1/clusters/by-keyword?keyword=X&hours=Y — SQL keyword search (NEW)GET /api/v1/sentiment-series — filtered by payload_ts SQL indexGET /api/v1/cluster/{cluster_id} — full detailGET /api/v1/feeds / POST /api/v1/feeds/toggle — feed managementNEWS_REFRESH_INTERVAL_SECONDS (default 300s)get_event_summaryNEWS_RETENTION_DAYS, NEWS_PRUNE_INTERVAL_HOURSCREATE TABLE clusters (
cluster_id TEXT PRIMARY KEY,
topic TEXT NOT NULL,
payload TEXT NOT NULL,
updated_at TEXT NOT NULL, -- row modification time (set on every upsert)
summary_payload TEXT,
summary_updated_at TEXT,
payload_ts GENERATED ALWAYS AS -- indexed event time (auto-maintained)
(json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
entity TEXT NOT NULL, -- lowercased
PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
CREATE TABLE cluster_keywords (
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
keyword TEXT NOT NULL, -- lowercased
PRIMARY KEY (cluster_id, keyword)
);
CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
Keywords extracted by the LLM are now first-class search signals:
_cluster_entity_haystack() includes keywords → get_events_for_entity() matches themeskeywords[] fielddetect_emerging_topics() scores keywords with velocity/recency/source-diversity formula (signal_type: "keyword")_collect_local_related() counts keyword co-occurrencesanitize_cluster_payload() normalizes timestamp/first_seen/last_updated to YYYY-MM-DDTHH:MM:SS+00:00. If all three missing, falls back to datetime.now().payload_ts auto-extracts from JSON on write. Indexed.payload_ts >= ? in SQL. No JSON parsing for time filtering.scripts/backfill_junction_tables.py populated junction tables from existing payloads. payload_ts was auto-populated by SQLite.Problem: SQLiteClusterStore and DashboardStore are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only DashboardStore was updated. SQLiteClusterStore (used by MCP tools) still does full-table JSON parsing for entity/keyword search.
Current state:
DashboardStore — uses SQL payload_ts filter + junction tables ✓SQLiteClusterStore — uses SQL payload_ts filter for time ✓, but MCP tool entity search (get_events_for_entity, get_news_sentiment) still fetches top-N clusters by time then filters entities in PythonConsequence: get_events_for_entity("Pete Hegseth", timeframe="72h") fetches the 200 most recent clusters (via payload_ts), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
Proposed fix: Collapse both stores into one. SQLiteClusterStore should be the single data access layer with proper junction-table methods for entity/keyword search. DashboardStore should be a thin wrapper or removed entirely. MCP tools should call SQLiteClusterStore.get_clusters_by_entity() using junction tables instead of Python-side filtering.