📰 News MCP Server — Requirements Spec
🎯 Goal
Provide structured, deduplicated, topic-aware news signals
that an agent can use for reasoning about:
- events
- narratives
- sentiment shifts
👉 Not a feed reader
👉 Not a headline dump
👉 A signal extraction layer
🧠 Core Design Principle
Raw news is useless to agents.
Processed news is powerful.
🏗️ 1. Internal Architecture
🧩 Data Sources Layer (sources/)
Mix of:
- RSS feeds (primary)
- optional APIs later
Examples:
- Reuters
- Bloomberg
- CoinDesk
Responsibilities:
- fetch articles
- normalize format
🔄 Ingestion Pipeline
Runs periodically (e.g. every few minutes)
Steps:
- fetch articles
- normalize fields:
- title
- url
- source
- timestamp
- summary (if available)
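The normalization step above could be sketched as follows. This is only an illustration: the `Article` dataclass, the `normalize_entry` helper, and the raw key names (`link`, `published`) are assumptions, not part of this spec, and the date handling assumes RSS-style RFC 822 timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


@dataclass(frozen=True)
class Article:
    """Normalized article with the five fields listed in the spec."""
    title: str
    url: str
    source: str
    timestamp: datetime
    summary: Optional[str] = None


def normalize_entry(raw: dict, source: str) -> Article:
    """Map one raw feed entry (hypothetical key names) to an Article."""
    # RSS dates look like "Tue, 02 Jan 2024 10:00:00 GMT".
    published = parsedate_to_datetime(raw["published"])
    if published.tzinfo is None:
        # Assumption: treat naive timestamps as UTC.
        published = published.replace(tzinfo=timezone.utc)
    # An empty or missing summary becomes None.
    summary = (raw.get("summary") or "").strip() or None
    return Article(
        title=" ".join(raw["title"].split()),  # collapse stray whitespace
        url=raw["link"].strip(),
        source=source,
        timestamp=published.astimezone(timezone.utc),
        summary=summary,
    )


example = normalize_entry(
    {
        "title": "  Bitcoin   rallies ",
        "link": " https://example.com/a ",
        "published": "Tue, 02 Jan 2024 10:00:00 GMT",
        "summary": "",
    },
    source="Reuters",
)
```

Storing every timestamp in UTC keeps later recency comparisons simple.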
🧹 Deduplication Layer
Problem:
Same story appears across many sources.
Solution:
Cluster articles by similarity:
Methods:
- title similarity (fuzzy match / embeddings)
- URL canonicalization
- content similarity (optional later)
Output:
{
"cluster_id": "...",
"headline": "Canonical headline",
"articles": [...],
"sources": ["Reuters", "Bloomberg"],
"first_seen": "...",
"last_updated": "..."
}
👉 This is your core unit of truth, not individual articles
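URL canonicalization, one of the methods listed above, might look like the sketch below. The list of tracking parameters and the choice to ignore scheme, port, `www.` prefix, and fragment are assumptions made for illustration; real feeds may need different rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend as real feeds require.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonicalize_url(url: str) -> str:
    """Return a comparison-friendly form of an article URL."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host.startswith("www."):
        host = host[4:]
    # Keep only query parameters that are not known tracking noise.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # The scheme is forced to https and the fragment dropped, for comparison only.
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))
```

Two articles whose canonical URLs match can be merged into the same cluster before any title comparison runs.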
🧠 Enrichment Layer
Adds meaning to clusters.
1. Entity extraction
- assets (BTC, ETH)
- companies
- macro topics (inflation, rates)
2. Topic classification
Examples:
- crypto
- macro
- regulation
- AI
3. Sentiment (lightweight)
- positive / negative / neutral
- or simple score
👉 Keep this simple in v1 (don’t over-engineer NLP)
4. Importance scoring (VERY useful)
Heuristic:
- number of sources covering it
- recency
- source credibility
- keyword weighting
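One possible way to combine the four heuristics into a single score is a weighted sum. The weights, the credibility table, the keyword set, and the 12-hour half-life below are all placeholder assumptions; the spec names the factors but not the numbers.

```python
from datetime import datetime, timedelta, timezone

# All values below are illustrative assumptions, not tuned parameters.
WEIGHTS = {"coverage": 0.4, "recency": 0.3, "credibility": 0.2, "keywords": 0.1}
SOURCE_CREDIBILITY = {"Reuters": 0.9, "Bloomberg": 0.9, "CoinDesk": 0.7}
DEFAULT_CREDIBILITY = 0.5
KEYWORDS = {"etf", "sec", "hack", "rates"}


def importance(headline, sources, last_updated, now,
               half_life_hours=12.0, max_sources=5):
    """Combine coverage, recency, credibility and keywords into [0, 1]."""
    # Coverage: distinct sources, capped so a few outlets cannot dominate.
    coverage = min(len(set(sources)), max_sources) / max_sources
    # Recency: exponential decay that halves every `half_life_hours`.
    age_hours = max((now - last_updated).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)
    # Credibility: the best source covering the story.
    credibility = max(
        (SOURCE_CREDIBILITY.get(s, DEFAULT_CREDIBILITY) for s in sources),
        default=0.0,
    )
    # Keywords: 1.0 if any watched keyword appears in the headline.
    keyword_hit = 1.0 if set(headline.lower().split()) & KEYWORDS else 0.0
    score = (
        WEIGHTS["coverage"] * coverage
        + WEIGHTS["recency"] * recency
        + WEIGHTS["credibility"] * credibility
        + WEIGHTS["keywords"] * keyword_hit
    )
    return round(score, 3)


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = importance("SEC approves ETF", ["Reuters", "Bloomberg"], now, now)
stale = importance("SEC approves ETF", ["Reuters", "Bloomberg"],
                   now - timedelta(hours=12), now)
```

Because the weights sum to 1 and every factor lies in [0, 1], the result stays in [0, 1], which matches the `importance` values shown in the tool outputs.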
🗃️ Storage Layer
You need short-term memory:
- clusters (not raw articles)
- TTL: e.g. 24–72h
Optional:
- in-memory store (start)
- later: DB
Storage candidates include Qdrant, PostgreSQL, and CouchDB.
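For the in-memory starting point, a TTL-bound store could be as small as the sketch below. The `ClusterStore` class and its method names are hypothetical, and the 48-hour default is just one value inside the 24–72h range mentioned above.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ClusterStore:
    """In-memory cluster store that drops entries after a fixed TTL."""

    def __init__(self, ttl_seconds: float = 48 * 3600,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._items: Dict[str, Tuple[float, dict]] = {}

    def upsert(self, cluster: dict) -> None:
        # Updating a cluster refreshes its expiry, so evolving stories stay alive.
        expires_at = self._clock() + self._ttl
        self._items[cluster["cluster_id"]] = (expires_at, cluster)

    def get(self, cluster_id: str) -> Optional[dict]:
        entry = self._items.get(cluster_id)
        if entry is None or entry[0] <= self._clock():
            self._items.pop(cluster_id, None)
            return None
        return entry[1]

    def purge(self) -> int:
        """Remove expired clusters and return how many were dropped."""
        now = self._clock()
        expired = [cid for cid, (exp, _) in self._items.items() if exp <= now]
        for cid in expired:
            del self._items[cid]
        return len(expired)

    def all(self) -> List[dict]:
        self.purge()
        return [cluster for _, cluster in self._items.values()]
```

Keeping the interface this narrow should make a later move to a database mostly a matter of swapping the class behind the same three methods.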
🧰 2. Agent-Facing Tools (IMPORTANT)
Keep tools high-level and semantic
1. get_latest_events
“What is happening right now?”
Input:
{
"topic": "crypto",
"limit": 5,
"include_articles": false
}
Output:
[
{
"headline": "...",
"summary": "...",
"entities": ["BTC"],
"sentiment": "positive",
"importance": 0.82,
"sources": ["Reuters", "CoinDesk"],
"timestamp": "...",
"articles": [
{
"title": "...",
"url": "...",
"source": "Reuters",
"timestamp": "..."
}
]
}
]
2. get_events_for_entity
“What’s happening with X?”
{
"entity": "BTC",
"include_articles": false
}
👉 filters clusters by entity
Optional:
- include_articles: include article title/url/source/timestamp in the payload
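A rough sketch of the entity filter and the optional article payload, assuming clusters are plain dictionaries shaped like the get_latest_events output and that timestamps are ISO 8601 strings in UTC (so they sort correctly as text). Case-insensitive matching is an assumption here.

```python
ARTICLE_FIELDS = ("title", "url", "source", "timestamp")


def get_events_for_entity(clusters, entity, include_articles=False):
    """Return clusters that mention `entity`, newest first."""
    wanted = entity.casefold()
    hits = [
        c for c in clusters
        if wanted in {e.casefold() for e in c.get("entities", [])}
    ]
    hits.sort(key=lambda c: c["timestamp"], reverse=True)

    events = []
    for cluster in hits:
        # Articles are stripped by default to keep the payload small.
        event = {k: v for k, v in cluster.items() if k != "articles"}
        if include_articles:
            # Expose only the four documented article fields.
            event["articles"] = [
                {field: article[field] for field in ARTICLE_FIELDS}
                for article in cluster.get("articles", [])
            ]
        events.append(event)
    return events
```

Leaving articles out by default keeps the response cluster-centric, in line with the rule that clusters, not articles, are the unit of truth.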
3. get_event_summary
“Explain this event clearly”
{
"event_id": "cluster_id",
"include_articles": false
}
Output:
- merged summary
- key facts
- sources
- optional articles (title/url/source/timestamp)
👉 This is where you compress multiple articles into one clean narrative
4. get_news_sentiment
“What’s the tone around X?”
{
"entity": "BTC",
"timeframe": "24h"
}
Output:
{
"sentiment": "positive",
"score": 0.64,
"article_count": 42
}
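The aggregation behind this tool might be a simple article-weighted mean, as sketched here. The cluster field names (`score`, `article_count`, `last_updated`), the score range of -1 to 1, the `24h`/`7d` timeframe format, and the 0.1 label threshold are assumptions. Deriving the label from the score is one way to keep the two from drifting apart.

```python
from datetime import datetime, timedelta, timezone


def parse_timeframe(value: str) -> timedelta:
    """Parse "24h" or "7d" style strings (assumed format)."""
    units = {"h": "hours", "d": "days"}
    return timedelta(**{units[value[-1]]: int(value[:-1])})


def label_for(score: float, threshold: float = 0.1) -> str:
    """Derive the label from the score so the two cannot disagree."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def get_news_sentiment(clusters, entity, timeframe, now):
    """Article-weighted mean sentiment for one entity within a timeframe."""
    cutoff = now - parse_timeframe(timeframe)
    total, weighted = 0, 0.0
    for c in clusters:
        if entity in c["entities"] and c["last_updated"] >= cutoff:
            total += c["article_count"]
            weighted += c["score"] * c["article_count"]
    score = weighted / total if total else 0.0
    return {
        "sentiment": label_for(score),
        "score": round(score, 2),
        "article_count": total,
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
sample = [
    {"entities": ["BTC"], "score": 0.8, "article_count": 3,
     "last_updated": now - timedelta(hours=1)},
    {"entities": ["BTC"], "score": -0.2, "article_count": 1,
     "last_updated": now - timedelta(hours=5)},
    {"entities": ["BTC"], "score": -1.0, "article_count": 10,
     "last_updated": now - timedelta(hours=30)},  # outside the 24h window
]
result = get_news_sentiment(sample, "BTC", "24h", now)
```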
5. detect_emerging_topics (very valuable)
“What is gaining attention?”
Output:
[
{
"topic": "Ethereum ETF",
"trend_score": 0.91,
"related_entities": ["ETH", "BlackRock", "SEC"],
"count": 8,
"avg_importance": 0.17
}
]
6. get_related_entities
“What entities tend to appear with X?”
{
"subject": "Iran",
"timeframe": "24h",
"limit": 10
}
Output:
[
{
"entity": "United States",
"count": 5,
"avg_importance": 0.11,
"sentiment": "negative",
"score": -0.2
}
]
👉 entity-only co-occurrence neighborhood for real-time sense-making
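A co-occurrence neighborhood can be computed by counting which entities share a cluster with the subject. The sketch below assumes each cluster carries `entities`, `importance`, and a numeric sentiment `score`; it leaves out the timeframe filter and the sentiment label for brevity, and the tie-breaking order is an arbitrary choice.

```python
from collections import defaultdict


def get_related_entities(clusters, subject, limit=10):
    """Rank entities that appear in the same clusters as `subject`."""
    key = subject.casefold()
    stats = defaultdict(lambda: {"count": 0, "importance": 0.0, "score": 0.0})
    for cluster in clusters:
        entities = set(cluster["entities"])
        if key not in {e.casefold() for e in entities}:
            continue  # the subject is not part of this cluster
        for entity in entities:
            if entity.casefold() == key:
                continue  # skip the subject itself
            s = stats[entity]
            s["count"] += 1
            s["importance"] += cluster["importance"]
            s["score"] += cluster["score"]

    rows = [
        {
            "entity": entity,
            "count": s["count"],
            "avg_importance": round(s["importance"] / s["count"], 2),
            "score": round(s["score"] / s["count"], 2),
        }
        for entity, s in stats.items()
    ]
    # Most frequent first, then most important, then alphabetical.
    rows.sort(key=lambda r: (-r["count"], -r["avg_importance"], r["entity"]))
    return rows[:limit]
```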
⚠️ 3. What NOT to expose
Avoid:
- raw RSS feeds
- individual article endpoints
- unprocessed headlines
❌ Bad:
get_raw_articles()
👉 This destroys signal quality for agents
🔁 4. Caching & Freshness Strategy
Key difference from crypto:
- News is append-only + evolving
- Not real-time tick data
Strategy:
Fetch layer:
- poll sources periodically (e.g. every few minutes)
Cluster layer:
- update clusters incrementally
Tool responses:
- no heavy recomputation
- serve from processed store
🧠 5. Deduplication Strategy (critical)
Start simple:
v1:
- normalize titles (lowercase, strip punctuation)
- fuzzy match (threshold ~0.8)
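The v1 steps can be tried out with the standard library alone. This sketch uses `difflib.SequenceMatcher` as the fuzzy matcher, which is an assumption (the spec does not name a library), and a greedy single pass that compares each title against the first title of every existing cluster.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(without_punctuation.split())


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when two normalized titles are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold


def cluster_titles(titles, threshold: float = 0.8):
    """Greedy clustering: join the first cluster whose lead title matches."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if is_duplicate(title, cluster[0], threshold):
                cluster.append(title)
                break
        else:
            # No existing cluster matched, so this title starts a new one.
            clusters.append([title])
    return clusters
```

The greedy pass is order-dependent and quadratic in the worst case, which is usually acceptable for a few hundred recent titles but is one reason to revisit the approach in v2.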
v2:
- embeddings / semantic similarity
Planned runtime order:
- when NEWS_EMBEDDINGS_ENABLED=true, try Ollama embeddings first
- if Ollama fails, fall back to the existing heuristic cluster path
- keep candidate pre-filtering cheap before any vector compare
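The planned runtime order could be wired up roughly as below. The embedding call is passed in as a plain callable (`embed`) so that no particular Ollama client API is assumed; `title_similarity`, the shared-token pre-filter, and the `env` parameter are hypothetical names and rules used only for this sketch.

```python
import math
import os


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def title_similarity(a, b, heuristic, embed=None, env=None) -> float:
    """Try embeddings first when enabled, else use the heuristic path."""
    env = os.environ if env is None else env
    # Cheap pre-filter (assumed rule): titles sharing no token are never compared.
    if not set(a.lower().split()) & set(b.lower().split()):
        return 0.0
    enabled = env.get("NEWS_EMBEDDINGS_ENABLED", "").lower() == "true"
    if enabled and embed is not None:
        try:
            return cosine(embed(a), embed(b))
        except Exception:
            # Any embedding failure falls through to the heuristic path.
            pass
    return heuristic(a, b)
```

Running the pre-filter before the embedding call keeps the expensive path off pairs that are obviously unrelated.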
⚡ 6. Signal Quality Rules
Your MCP should:
✅ Do:
- reduce 100 articles → 5–10 clusters
- highlight consensus
- surface importance
❌ Don’t:
- overwhelm agent with volume
- pass conflicting duplicates
- expose noise
🧩 7. Relationship to Other MCPs
This MCP becomes powerful when combined with:
- crypto MCP → price
- trends MCP → attention
👉 News MCP provides:
causal narratives
🧭 8. Design Philosophy
Each tool should answer:
“What is happening, and why should I care?”
🚀 9. Suggested Build Order
- RSS ingestion
- normalization
- basic deduplication
- clustering
- simple summarization
- entity tagging
👉 Only then expose tools
🧠 Final takeaway
Crypto MCP gives you facts
News MCP gives you meaning
But only if you:
- aggressively deduplicate
- cluster events
- compress information
✅ Completed since this outlook was written
- v0.1.0 released and tagged
- provider-agnostic LLM extraction/summarization layer added
- prompts moved into separate files for easier updates
- entity blacklist implemented and made case-insensitive
- wildcard blacklist support added for entities/topics/keywords
- live extraction smoke test added
- JSON-backed alias map added for query normalization
- query normalization added so shorthand like btc and trump still works
- docs updated with the new env vars and workflow
- optional article payloads added to event tools
- blacklist enforcement maintenance script added
- related-entities tool added for co-occurrence neighborhoods
- emerging-topic scoring improved with importance-weighting and co-occurrence
🔭 Next high-level steps
What is left of v0.1.0
The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:
- stabilize extraction quality across a few more real-world samples
- expand the alias map only where usage demands it
- tune emerging-topic noise so repeated source names do not dominate
- keep sentiment labels aligned with scores as the model improves
Where v0.2.0 should lead
Future plan (worth building slowly): “Emerging entity graph over time”
Right now detect_emerging_topics() returns a flat list of emerging topics/entities.
Next-level idea: turn it into an entity graph that an agent can reason over.
Core concept
- Collapse/group results into canonical entity nodes (e.g. iran, israel, donald_trump, strait_of_hormuz, etc.)
- Build weighted edges from co-occurrence in recent clusters:
- edge weight ~ frequency/co-occurrence strength
- node weight ~ trend_score + count (+ optional avg_importance)
- Infer communities (graph grouping) so related nodes form stable “story neighborhoods”
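As a first approximation of the core concept, the graph can be built with counters, and connected components can stand in for community detection. That substitution is deliberate and much cruder than a real community algorithm. Node weights here are plain cluster counts rather than trend_score, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_entity_graph(clusters):
    """Return (node counts, edge counts) from per-cluster entity lists."""
    nodes, edges = Counter(), Counter()
    for cluster in clusters:
        entities = sorted(set(cluster["entities"]))
        nodes.update(entities)
        # Each unordered pair inside a cluster adds 1 to that edge's weight.
        edges.update(combinations(entities, 2))
    return nodes, edges


def story_neighborhoods(nodes, edges, min_weight=1):
    """Group nodes into connected components over sufficiently strong edges."""
    adjacency = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        groups.append(sorted(group))
    return groups
```

Raising `min_weight` prunes weak, one-off co-occurrences, which is a cheap way to make neighborhoods more stable between refreshes.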
Over time (the important part)
- Each refresh window produces a snapshot of the graph
- Store snapshots / deltas to observe:
- rising/falling node weights (“momentum”)
- strengthening/weakening relations
- emerging communities and topic shifts
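Comparing two snapshots might start out as simple as diffing node weights. The sketch assumes each snapshot is a flat `{node: weight}` mapping; `diff_snapshots` and its output keys are hypothetical, and edge deltas could be handled the same way with edge tuples as keys.

```python
def diff_snapshots(previous, current, min_delta=0.0):
    """Summarize what changed between two {node: weight} snapshots."""
    rising, falling = {}, {}
    # Momentum is only defined for nodes present in both snapshots.
    for node in previous.keys() & current.keys():
        delta = current[node] - previous[node]
        if delta > min_delta:
            rising[node] = delta
        elif delta < -min_delta:
            falling[node] = delta
    return {
        "rising": rising,
        "falling": falling,
        "new": sorted(current.keys() - previous.keys()),
        "gone": sorted(previous.keys() - current.keys()),
    }
```

Storing only these deltas instead of full snapshots would keep history small, at the cost of needing a base snapshot for replay.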
Suggested output for an eventual agent tool
get_emerging_entity_graph(timeframe, limit) returning:
- grouped communities
- top nodes + weights
- top relations + direction (optional)
- summary of “what changed since last snapshot”
This needs extra time to become a real usable MCP tool, so it’s intentionally captured here for later execution.
Normalization layer
- canonicalize acronyms and entity variants before storage / querying
- keep the blacklist as a separate post-processing rule
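Canonicalization through a JSON-backed alias map could work along these lines. The map contents and the helper names are made up for illustration; the real alias file and its layout may differ.

```python
import json

# Hypothetical alias file contents; the real map may look different.
ALIAS_JSON = '{"btc": "BTC", "bitcoin": "BTC", "trump": "Donald Trump"}'


def load_aliases(text: str) -> dict:
    """Parse the alias map and casefold keys for case-insensitive lookup."""
    return {key.casefold(): value for key, value in json.loads(text).items()}


def normalize_query(query: str, aliases: dict) -> str:
    """Return the canonical entity for a query, or the trimmed query itself."""
    cleaned = " ".join(query.split())
    return aliases.get(cleaned.casefold(), cleaned)


aliases = load_aliases(ALIAS_JSON)
```

Applying the same function at storage time and at query time keeps both sides on one canonical form, after which the blacklist can run as a separate pass.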
Wildcard blacklist support
- allow patterns for entities / topics / keywords
- keep matching case-insensitive
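Case-insensitive wildcard matching is available in the standard library through `fnmatch`. The sketch below assumes shell-style patterns (`*`, `?`); whether the actual blacklist uses that syntax is not stated here, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_blacklisted(term: str, patterns) -> bool:
    """True if `term` matches any shell-style pattern, ignoring case."""
    folded = term.casefold()
    # fnmatchcase avoids platform-dependent case rules; casefold both sides.
    return any(fnmatchcase(folded, pattern.casefold()) for pattern in patterns)


def apply_blacklist(terms, patterns):
    """Drop every entity, topic, or keyword that matches the blacklist."""
    return [term for term in terms if not is_blacklisted(term, patterns)]
```

For example, the patterns `reuters*` and `*newsletter*` would remove "Reuters Staff" and "Daily Newsletter" while keeping "BTC".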
Emerging signal quality
- tune what counts as an emerging topic/entity
- reduce noise from repeated source names and generic terms
Entity/time tracking and replay (future capability)
- track how important entities evolve over time
- allow replay of when entities first appeared, how topics shifted, and how sentiment changed
- useful later for narrative reconstruction and trend timelines
Longer-term direction
The endgame is not just “news search”, but a light narrative memory system:
- entity histories over time
- topic shifts and turning points
- sentiment arcs
- replayable timelines for a person, company, or event
That should stay in mind while keeping the current implementation simple.