Project: news-mcp
Goal
Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.
Current architecture (v1)
- FastMCP SSE server mounted at /mcp
- SQLite cache for clusters + Groq summary caches
- RSS fetch (breakingthenews.net)
- v1 dedup via fuzzy title similarity
- optional Ollama embeddings path for clustering (when NEWS_EMBEDDINGS_ENABLED=true)
- configurable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
- optional embeddings backfill script for precomputing cluster vectors in SQLite
- optional merge-analysis script for threshold experiments before any DB rewrite
- optional merge pass for destructive consolidation after threshold review
- optional article-dedup cleanup for repeated article variants inside a cluster
- Groq enrichment (topic/entities/sentiment/keywords)
- Tools expose semantic queries over cached clusters
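The v1 fuzzy-title dedup path can be sketched roughly as below. This is a minimal illustration using stdlib `difflib`; the actual similarity function, threshold value, and cluster representation in the repo may differ, and `assign_to_cluster` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

# Assumed threshold for illustration; the real server reads its own config.
TITLE_SIMILARITY_THRESHOLD = 0.85

def titles_match(a: str, b: str, threshold: float = TITLE_SIMILARITY_THRESHOLD) -> bool:
    """v1 dedup check: whitespace-normalized, lowercased titles compared by ratio."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold

def assign_to_cluster(title: str, clusters: list[list[str]]) -> None:
    """Append the title to the first cluster whose representative matches,
    otherwise start a new cluster with this title as representative."""
    for cluster in clusters:
        if titles_match(title, cluster[0]):
            cluster.append(title)
            return
    clusters.append([title])
```

When embeddings are enabled, the Ollama cosine-similarity path is tried first and this heuristic remains the fallback.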
MCP tools (current)
get_latest_events(topic, limit)
get_events_for_entity(entity, limit, timeframe)
get_event_summary(event_id)
detect_emerging_topics(limit)
get_related_entities(subject, timeframe, limit)
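The entity lookup's timeframe/limit semantics (timeframe bounds the scan window, limit caps the results) could look roughly like this over the SQLite cache. The `clusters` table layout and column names here are assumptions for illustration, not the project's actual schema.

```python
import sqlite3
import time

def get_events_for_entity(conn: sqlite3.Connection, entity: str,
                          limit: int = 10, timeframe_hours: int = 24) -> list[tuple]:
    """Scan only clusters updated inside the timeframe window; limit caps results."""
    cutoff = time.time() - timeframe_hours * 3600
    cur = conn.execute(
        # Illustrative schema: id, title, entities (comma-joined), updated_at (epoch).
        "SELECT id, title FROM clusters "
        "WHERE updated_at >= ? AND entities LIKE ? "
        "ORDER BY updated_at DESC LIMIT ?",
        (cutoff, f"%{entity}%", limit),
    )
    return cur.fetchall()
```

The key point is that timeframe filters before limit applies, so a small limit never hides recent matches behind stale ones.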
Future work (planned): entity graph over time
Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:
- Convert emerging topic/entity co-occurrence signals into a weighted entity graph
- Group the graph into communities (story neighborhoods)
- Track time evolution across refresh windows:
  - node “momentum” (trend_score/count changes)
  - edge strength changes (relation tightening/weakening)
  - community emergence/disappearance
Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).
Refresh & caching
- Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
- Feed-hash skipping to avoid redundant RSS+Groq work
- Cluster TTL (NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS)
- Summary caching for get_event_summary
Definition of “committable”
- Tests pass offline (dedup/storage unit tests)
- Server exposes tool surface with valid schemas
- Caching prevents repeated Groq calls for unchanged clusters
- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
- Embeddings backfill script exists to precompute vectors for older cluster rows before a server restart
- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
- Merge pass exists for destructive consolidation once thresholds look sane
- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
- Entity lookup now respects timeframe as the scan window, with limit acting as a cap
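The planned entity-graph direction (see Future work above) might be prototyped like this: weighted edges from per-cluster entity co-occurrence, with naive connected components standing in for a real community-detection algorithm (e.g. Louvain). All names here are hypothetical, and momentum/edge-evolution tracking is not shown.

```python
from collections import defaultdict
from itertools import combinations

def build_entity_graph(clusters: list[list[str]]) -> dict[tuple[str, str], int]:
    """Weighted edges: co-occurrence count of each entity pair within a cluster."""
    edges: dict[tuple[str, str], int] = defaultdict(int)
    for entities in clusters:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return dict(edges)

def communities(edges: dict[tuple[str, str], int], min_weight: int = 1) -> list[set[str]]:
    """Naive 'story neighborhoods': connected components over edges at/above min_weight."""
    adj: dict[str, set[str]] = defaultdict(set)
    for (a, b), w in edges.items():
        if w >= min_weight:
            adj[a].add(b)
            adj[b].add(a)
    seen: set[str] = set()
    comps: list[set[str]] = []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

Raising min_weight across refresh windows is one cheap way to watch edges tighten or weaken before any real momentum scoring exists.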