📰 News MCP Server — Requirements Spec
🎯 Goal
Provide structured, deduplicated, topic-aware news signals
that an agent can use for reasoning about:
- events
- narratives
- sentiment shifts
👉 Not a feed reader
👉 Not a headline dump
👉 A signal extraction layer
🧠 Core Design Principle
Raw news is useless to agents.
Processed news is powerful.
🏗️ 1. Internal Architecture
🧩 Data Sources Layer (sources/)
Mix of:
- RSS feeds (primary)
- optional APIs later
Examples:
- Reuters
- Bloomberg
- CoinDesk
Responsibilities:
- fetch articles
- normalize format
🔄 Ingestion Pipeline
Runs periodically (e.g. every few minutes)
Steps:
- fetch articles
- normalize fields:
  - title
  - url
  - source
  - timestamp
  - summary (if available)
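A minimal sketch of the normalization step, assuming each fetched entry arrives as a plain dict from an RSS parser. The `Article` class, the `normalize_article` helper, and the input key names (`title`, `link`, `published`, `summary`) are illustrative assumptions, not part of this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


@dataclass(frozen=True)
class Article:
    """Common schema for one normalized article (hypothetical name)."""
    title: str
    url: str
    source: str
    timestamp: datetime  # always timezone-aware, in UTC
    summary: Optional[str] = None


def normalize_article(raw: dict, source: str) -> Article:
    """Map one raw feed entry (assumed key names) onto the common schema."""
    published = raw.get("published")
    if published:
        # RSS 2.0 dates use the RFC 822 style, e.g. "Tue, 02 Jan 2024 12:00:00 +0200".
        ts = parsedate_to_datetime(published)
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        ts = ts.astimezone(timezone.utc)
    else:
        # Fall back to the fetch time when the feed gives no date.
        ts = datetime.now(timezone.utc)
    summary = (raw.get("summary") or "").strip() or None
    return Article(
        title=" ".join(raw.get("title", "").split()),  # collapse whitespace
        url=raw.get("link", "").strip(),
        source=source,
        timestamp=ts,
        summary=summary,
    )
```

Storing every timestamp in UTC keeps later recency and TTL comparisons simple.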
🧹 Deduplication Layer
Problem:
Same story appears across many sources.
Solution:
Cluster articles by similarity:
Methods:
- title similarity (fuzzy match / embeddings)
- URL canonicalization
- content similarity (optional later)
Output:
{
"cluster_id": "...",
"headline": "Canonical headline",
"articles": [...],
"sources": ["Reuters", "Bloomberg"],
"first_seen": "...",
"last_updated": "..."
}
👉 This is your core unit of truth, not individual articles
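As one illustration of the URL canonicalization method listed above, the sketch below strips fragments and common tracking parameters so that two links to the same article compare equal. The function name and the list of tracking parameters are assumptions; tune them per source.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend this list as needed.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonicalize_url(url: str) -> str:
    """Return a normalized form of the URL suitable for equality checks."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep only non-tracking query parameters, in a stable order.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped entirely.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))
```

Usage: `canonicalize_url("https://WWW.Example.com/news/btc/?utm_source=x&id=7#top")` returns `"https://example.com/news/btc?id=7"`.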
🧠 Enrichment Layer
Adds meaning to clusters.
1. Entity extraction
- assets (BTC, ETH)
- companies
- macro topics (inflation, rates)
2. Topic classification
Examples:
- crypto
- macro
- regulation
- AI
3. Sentiment (lightweight)
- positive / negative / neutral
- or simple score
👉 Keep this simple in v1 (don’t over-engineer NLP)
4. Importance scoring (VERY useful)
Heuristic:
- number of sources covering it
- recency
- source credibility
- keyword weighting
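The heuristic above could be combined into a single score roughly as follows. This is a sketch only: the weights (0.4 / 0.3 / 0.2 / 0.1), the 12-hour half-life, the credibility table, and the keyword set are all placeholder assumptions to be tuned.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical credibility weights in [0, 1]; unknown sources get a default.
SOURCE_CREDIBILITY = {"Reuters": 1.0, "Bloomberg": 1.0, "CoinDesk": 0.8}
DEFAULT_CREDIBILITY = 0.5
# Hypothetical keywords that raise importance.
KEYWORDS = {"etf", "hack", "sec", "rates"}


def importance_score(headline, sources, last_updated, now, half_life_hours=12.0):
    """Blend coverage, recency, credibility and keywords into a 0..1 score."""
    # Coverage: saturates once five distinct sources report the story.
    coverage = min(len(set(sources)) / 5.0, 1.0)
    # Recency: exponential decay, halving every `half_life_hours`.
    age_hours = max((now - last_updated).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)
    # Credibility: the most credible source covering the story.
    credibility = max(
        (SOURCE_CREDIBILITY.get(s, DEFAULT_CREDIBILITY) for s in sources),
        default=0.0,
    )
    # Keyword weighting: 1.0 if any watched keyword appears in the headline.
    keyword = 1.0 if set(headline.lower().split()) & KEYWORDS else 0.0
    score = 0.4 * coverage + 0.3 * recency + 0.2 * credibility + 0.1 * keyword
    return round(score, 3)
```

A fresh story covered by five sources, including a top-credibility one, with a watched keyword scores 1.0; the score decays as the cluster stops receiving updates.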
🗃️ Storage Layer
You need short-term memory:
- clusters (not raw articles)
- TTL: e.g. 24–72h
Options:
- in-memory store (to start)
- later: a database (candidates include Qdrant, PostgreSQL, and CouchDB)
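For the in-memory starting point, a TTL store can be very small. The sketch below is an assumption about shape, not a prescribed design: the class name `TTLClusterStore` is hypothetical, clusters are plain dicts keyed by `cluster_id`, and the 48h default sits inside the 24–72h range mentioned above.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLClusterStore:
    """In-memory cluster store that forgets clusters after a TTL."""

    def __init__(self, ttl_seconds: float = 48 * 3600,
                 clock: Callable[[], float] = time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._items: Dict[str, Tuple[float, dict]] = {}

    def put(self, cluster: dict) -> None:
        # Writing a cluster again refreshes its TTL.
        self._items[cluster["cluster_id"]] = (self._clock(), cluster)

    def get(self, cluster_id: str) -> Optional[dict]:
        self.purge()
        entry = self._items.get(cluster_id)
        return entry[1] if entry else None

    def all(self) -> List[dict]:
        self.purge()
        return [cluster for _, cluster in self._items.values()]

    def purge(self) -> None:
        # Drop every cluster whose last write is older than the TTL.
        cutoff = self._clock() - self._ttl
        expired = [cid for cid, (ts, _) in self._items.items() if ts < cutoff]
        for cid in expired:
            del self._items[cid]
```

Because the TTL is refreshed on every write, a story that keeps receiving new articles stays available while stale stories age out.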
🧰 2. Agent-Facing Tools (IMPORTANT)
Keep tools high-level and semantic
1. get_latest_events
“What is happening right now?”
Input:
{
"topic": "crypto",
"limit": 5
}
Output:
[
{
"headline": "...",
"summary": "...",
"entities": ["BTC"],
"sentiment": "positive",
"importance": 0.82,
"sources": ["Reuters", "CoinDesk"],
"timestamp": "..."
}
]
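A possible core for this tool, assuming clusters are already enriched dicts served from the processed store. The `topics` field on each cluster and the ordering (importance first, then recency) are assumptions; timestamps are assumed to be ISO 8601 UTC strings so they sort correctly as text.

```python
from typing import Iterable, List, Optional

# Fields exposed to the agent, matching the output example above.
EVENT_FIELDS = ("headline", "summary", "entities", "sentiment",
                "importance", "sources", "timestamp")


def get_latest_events(clusters: Iterable[dict],
                      topic: Optional[str] = None,
                      limit: int = 5) -> List[dict]:
    """Return the top clusters for a topic, most important first."""
    matching = [
        c for c in clusters
        if topic is None or topic in c.get("topics", [])
    ]
    # Sort by importance, breaking ties with the most recent timestamp.
    matching.sort(key=lambda c: (c["importance"], c["timestamp"]), reverse=True)
    # Project onto the public fields so internal data never leaks out.
    return [{k: c.get(k) for k in EVENT_FIELDS} for c in matching[:limit]]
```

The function does no recomputation: it only filters, sorts, and projects what the pipeline has already produced.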
2. get_events_for_entity
“What’s happening with X?”
Input:
{
"entity": "BTC"
}
👉 filters clusters by entity
3. get_event_summary
“Explain this event clearly”
Input:
{
"event_id": "cluster_id"
}
Output:
- merged summary
- key facts
- sources
👉 This is where you compress multiple articles into one clean narrative
4. get_news_sentiment
“What’s the tone around X?”
Input:
{
"entity": "BTC",
"timeframe": "24h"
}
Output:
{
"sentiment": "positive",
"score": 0.64,
"article_count": 42
}
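One way to produce this output is an article-weighted average over matching clusters. The sketch assumes each cluster carries a `sentiment_score` in [-1, 1], a timezone-aware `last_updated` datetime, and an `articles` list; the ±0.1 label thresholds and the supported timeframe units (`m`, `h`, `d`) are also assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_timeframe(timeframe: str) -> timedelta:
    """Parse strings such as "24h" or "7d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", timeframe.strip().lower())
    if not match:
        raise ValueError(f"unsupported timeframe: {timeframe!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def get_news_sentiment(clusters, entity: str, timeframe: str, now: datetime) -> dict:
    """Aggregate sentiment for one entity over a recent time window."""
    cutoff = now - parse_timeframe(timeframe)
    wanted = entity.upper()
    total, count = 0.0, 0
    for c in clusters:
        if c["last_updated"] < cutoff:
            continue  # outside the requested window
        if wanted not in (e.upper() for e in c.get("entities", [])):
            continue  # cluster is not about this entity
        n = len(c.get("articles", []))
        total += c["sentiment_score"] * n  # weight by article count
        count += n
    score = total / count if count else 0.0
    label = "positive" if score > 0.1 else "negative" if score < -0.1 else "neutral"
    return {"sentiment": label, "score": round(score, 2), "article_count": count}
```

Weighting by article count means a widely covered story influences the tone more than a single-source one.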
5. detect_emerging_topics (very valuable)
“What is gaining attention?”
Output:
[
{
"topic": "Ethereum ETF",
"trend_score": 0.91,
"related_entities": ["ETH"]
}
]
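A simple way to approximate `trend_score` is to compare topic mentions in a recent window against the preceding window. The formula below, `(recent - previous) / (recent + previous + 1)`, is an assumed heuristic rather than a defined part of this spec, and the sketch omits `related_entities` for brevity.

```python
from collections import Counter
from typing import Iterable, List


def detect_emerging_topics(recent_topics: Iterable[str],
                           previous_topics: Iterable[str],
                           min_mentions: int = 3,
                           limit: int = 5) -> List[dict]:
    """Rank topics whose mention count grew between two time windows."""
    recent = Counter(recent_topics)
    previous = Counter(previous_topics)
    results = []
    for topic, n in recent.items():
        if n < min_mentions:
            continue  # too few mentions to be a reliable signal
        p = previous.get(topic, 0)
        # Relative growth in (0, 1); the +1 damps scores for small counts.
        score = (n - p) / (n + p + 1)
        if score > 0:
            results.append({"topic": topic, "trend_score": round(score, 2)})
    results.sort(key=lambda r: r["trend_score"], reverse=True)
    return results[:limit]
```

With this formula, a topic with ten recent mentions and none before scores 0.91, while a topic with flat coverage is not reported at all.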
⚠️ 3. What NOT to expose
Avoid:
- raw RSS feeds
- individual article endpoints
- unprocessed headlines
❌ Bad:
get_raw_articles()
👉 This destroys signal quality for agents
🔁 4. Caching & Freshness Strategy
Key difference from the crypto MCP:
- News is append-only + evolving
- Not real-time tick data
Strategy:
Fetch layer:
- poll sources periodically (e.g. every few minutes)
Cluster layer:
- update clusters incrementally
Tool responses:
- no heavy recomputation
- serve from processed store
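Incremental cluster updates can be sketched as a merge function over the cluster shape from the Deduplication Layer. The function name is hypothetical, articles are assumed to be dicts with `url`, `source`, and `timestamp` keys, and timestamps are assumed to be ISO 8601 UTC strings so that `min` and `max` order them correctly.

```python
def merge_article_into_cluster(cluster: dict, article: dict) -> dict:
    """Add one article to an existing cluster without rebuilding it."""
    known_urls = {a["url"] for a in cluster["articles"]}
    if article["url"] in known_urls:
        return cluster  # already merged; keep the operation idempotent
    cluster["articles"].append(article)
    if article["source"] not in cluster["sources"]:
        cluster["sources"].append(article["source"])
    # Widen the time range covered by the cluster.
    cluster["first_seen"] = min(cluster["first_seen"], article["timestamp"])
    cluster["last_updated"] = max(cluster["last_updated"], article["timestamp"])
    return cluster
```

Because the merge is idempotent, re-fetching the same feed item on the next polling cycle does not inflate the cluster.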
🧠 5. Deduplication Strategy (critical)
Start simple:
v1:
- normalize titles (lowercase, strip punctuation)
- fuzzy match (threshold ~0.8)
v2:
- embeddings / semantic similarity
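The v1 approach fits in a few lines using only the standard library. This sketch assumes `difflib.SequenceMatcher` as the fuzzy matcher and a greedy strategy that compares each title against the first title of every existing cluster; both are illustrative choices.

```python
import string
from difflib import SequenceMatcher
from typing import Iterable, List

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(title.lower().translate(_PUNCT_TABLE).split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def cluster_titles(titles: Iterable[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose lead title is similar."""
    clusters: List[List[str]] = []
    for title in titles:
        for cluster in clusters:
            if title_similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])  # no match: start a new cluster
    return clusters
```

Greedy matching is O(articles × clusters), which is acceptable at RSS volumes; the v2 embedding approach can replace `title_similarity` without changing the clustering loop.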
⚡ 6. Signal Quality Rules
Your MCP should:
✅ Do:
- reduce 100 articles → 5–10 clusters
- highlight consensus
- surface importance
❌ Don’t:
- overwhelm agent with volume
- pass conflicting duplicates
- expose noise
🧩 7. Relationship to Other MCPs
This MCP becomes powerful when combined with:
- crypto MCP → price
- trends MCP → attention
👉 News MCP provides:
causal narratives
🧭 8. Design Philosophy
Each tool should answer:
“What is happening, and why should I care?”
🚀 9. Suggested Build Order
- RSS ingestion
- normalization
- basic deduplication
- clustering
- simple summarization
- entity tagging
👉 Only then expose tools
🧠 Final takeaway
Crypto MCP gives you facts
News MCP gives you meaning
But only if you:
- aggressively deduplicate
- cluster events
- compress information
✅ Completed since this outlook was written
- provider-agnostic LLM extraction/summarization layer added
- prompts moved into separate files for easier updates
- entity blacklist implemented and made case-insensitive
- live extraction smoke test added
- docs updated with the new env vars and workflow
🔭 Next high-level steps
Normalization layer
- canonicalize acronyms and entity variants before storage / querying
- keep the blacklist as a separate post-processing rule
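Canonicalization could start as a plain alias table applied before storage and querying. The alias entries and function names below are hypothetical examples; a real table would be maintained alongside the prompts and blacklist.

```python
from typing import Iterable, List

# Hypothetical alias table: lowercase variant -> canonical entity name.
ALIASES = {
    "bitcoin": "BTC",
    "btc": "BTC",
    "xbt": "BTC",
    "ethereum": "ETH",
    "ether": "ETH",
    "eth": "ETH",
    "federal reserve": "Fed",
    "the fed": "Fed",
}


def canonicalize_entity(name: str) -> str:
    """Map a variant to its canonical form; unknown names pass through."""
    key = " ".join(name.casefold().split())
    return ALIASES.get(key, name.strip())


def canonicalize_entities(names: Iterable[str]) -> List[str]:
    """Canonicalize a list of entities, dropping duplicates in order."""
    seen = set()
    result = []
    for name in names:
        canonical = canonicalize_entity(name)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result
```

Usage: `canonicalize_entities(["Bitcoin", "BTC", " ethereum ", "Solana"])` returns `["BTC", "ETH", "Solana"]`. Blacklist filtering would run afterwards as its own step.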
Wildcard blacklist support
- allow patterns for entities / topics / keywords
- keep matching case-insensitive
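Wildcard matching can reuse shell-style patterns from the standard library. This sketch assumes `*` and `?` wildcards are sufficient and that patterns are stored as a plain list of strings; the function names are illustrative.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def is_blacklisted(term: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive wildcard match of a term against blacklist patterns."""
    folded = term.casefold()
    # fnmatchcase compares exactly, so both sides are case-folded first.
    return any(fnmatchcase(folded, p.casefold()) for p in patterns)


def filter_blacklisted(terms: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Drop every entity, topic, or keyword that matches the blacklist."""
    patterns = list(patterns)
    return [t for t in terms if not is_blacklisted(t, patterns)]
```

Usage: with patterns `["reuters*", "*news", "BREAKING"]`, the list `["BTC", "Reuters Staff", "Crypto News", "breaking", "ETH"]` is reduced to `["BTC", "ETH"]`.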
Emerging signal quality
- tune what counts as an emerging topic/entity
- reduce noise from repeated source names and generic terms
Entity/time tracking and replay (future capability)
- track how important entities evolve over time
- allow replay of when entities first appeared, how topics shifted, and how sentiment changed
- useful later for narrative reconstruction and trend timelines