|
|
@@ -0,0 +1,390 @@
|
|
|
+
|
|
|
+# 📰 News MCP Server — Requirements Spec
|
|
|
+
|
|
|
+## 🎯 Goal
|
|
|
+
|
|
|
+Provide **structured, deduplicated, topic-aware news signals**
|
|
|
+that an agent can use for reasoning about:
|
|
|
+
|
|
|
+* events
|
|
|
+* narratives
|
|
|
+* sentiment shifts
|
|
|
+
|
|
|
+👉 Not a feed reader
|
|
|
+👉 Not a headline dump
|
|
|
+👉 A **signal extraction layer**
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧠 Core Design Principle
|
|
|
+
|
|
|
+> Raw news is useless to agents.
|
|
|
+> **Processed news is powerful.**
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🏗️ 1. Internal Architecture
|
|
|
+
|
|
|
+## 🧩 Data Sources Layer (`sources/`)
|
|
|
+
|
|
|
+Mix of:
|
|
|
+
|
|
|
+* RSS feeds (primary)
|
|
|
+* optional APIs later
|
|
|
+
|
|
|
+Examples:
|
|
|
+
|
|
|
+* Reuters
|
|
|
+* Bloomberg
|
|
|
+* CoinDesk
|
|
|
+
|
|
|
+### Responsibilities:
|
|
|
+
|
|
|
+* fetch articles
|
|
|
+* normalize format
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🔄 Ingestion Pipeline
|
|
|
+
|
|
|
+Runs periodically (e.g. every few minutes)
|
|
|
+
|
|
|
+Steps:
|
|
|
+
|
|
|
+1. fetch articles
|
|
|
+2. normalize fields:
|
|
|
+
|
|
|
+ * title
|
|
|
+ * url
|
|
|
+ * source
|
|
|
+ * timestamp
|
|
|
+ * summary (if available)
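A minimal sketch of the normalization step, assuming each raw feed entry arrives as a plain dict with `title`, `link`, `published` (an RFC 822 date string, as RSS uses), and an optional `summary`. The `Article` class and `normalize_entry` function are hypothetical names for illustration, not part of this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


@dataclass(frozen=True)
class Article:
    """Normalized article with the five fields listed above."""
    title: str
    url: str
    source: str
    timestamp: datetime
    summary: Optional[str] = None


def normalize_entry(entry: dict, source: str) -> Article:
    """Map one raw feed entry (assumed dict shape) to an Article."""
    published = parsedate_to_datetime(entry["published"])
    if published.tzinfo is None:
        # Assumption: dates without an offset are treated as UTC.
        published = published.replace(tzinfo=timezone.utc)
    # Empty or whitespace-only summaries become None ("if available").
    summary = (entry.get("summary") or "").strip() or None
    return Article(
        title=" ".join(entry["title"].split()),  # collapse whitespace
        url=entry["link"].strip(),
        source=source,
        timestamp=published.astimezone(timezone.utc),
        summary=summary,
    )
```

Storing every timestamp in UTC keeps later recency comparisons independent of each source's local offset.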
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🧹 Deduplication Layer
|
|
|
+
|
|
|
+### Problem:
|
|
|
+
|
|
|
+Same story appears across many sources.
|
|
|
+
|
|
|
+### Solution:
|
|
|
+
|
|
|
+Cluster articles by similarity:
|
|
|
+
|
|
|
+Methods:
|
|
|
+
|
|
|
+* title similarity (fuzzy match / embeddings)
|
|
|
+* URL canonicalization
|
|
|
+* content similarity (optional later)
|
|
|
+
|
|
|
+### Output:
|
|
|
+
|
|
|
+```json id="cluster"
|
|
|
+{
|
|
|
+ "cluster_id": "...",
|
|
|
+ "headline": "Canonical headline",
|
|
|
+ "articles": [...],
|
|
|
+ "sources": ["Reuters", "Bloomberg"],
|
|
|
+ "first_seen": "...",
|
|
|
+ "last_updated": "..."
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+👉 This is your **core unit of truth**, not individual articles
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🧠 Enrichment Layer
|
|
|
+
|
|
|
+Adds meaning to clusters.
|
|
|
+
|
|
|
+### 1. Entity extraction
|
|
|
+
|
|
|
+* assets (BTC, ETH)
|
|
|
+* companies
|
|
|
+* macro topics (inflation, rates)
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 2. Topic classification
|
|
|
+
|
|
|
+Examples:
|
|
|
+
|
|
|
+* crypto
|
|
|
+* macro
|
|
|
+* regulation
|
|
|
+* AI
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 3. Sentiment (lightweight)
|
|
|
+
|
|
|
+* positive / negative / neutral
|
|
|
+* or simple score
|
|
|
+
|
|
|
+👉 Keep this simple in v1 (don’t over-engineer NLP)
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### 4. Importance scoring (VERY useful)
|
|
|
+
|
|
|
+Heuristic:
|
|
|
+
|
|
|
+* number of sources covering it
|
|
|
+* recency
|
|
|
+* source credibility
|
|
|
+* keyword weighting
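One way to combine the four heuristics into a single score in the 0 to 1 range is sketched below. The weights, the credibility table, the keyword list, and the 12-hour half-life are all illustrative assumptions, not values defined by this spec.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables; tune these for your own sources.
SOURCE_CREDIBILITY = {"Reuters": 1.0, "Bloomberg": 1.0, "CoinDesk": 0.7}
KEYWORD_WEIGHTS = {"etf": 0.3, "sec": 0.2, "hack": 0.3}


def importance(sources, headline, last_updated, now, half_life_hours=12.0):
    """Weighted blend of coverage, recency, credibility, and keywords."""
    # Coverage saturates at five distinct sources (assumed cap).
    coverage = min(len(set(sources)) / 5.0, 1.0)
    # Recency halves every `half_life_hours`.
    age_hours = max((now - last_updated).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)
    # Credibility of the best source; unknown sources default to 0.5.
    credibility = max((SOURCE_CREDIBILITY.get(s, 0.5) for s in sources), default=0.0)
    words = set(headline.lower().split())
    keywords = min(sum(w for k, w in KEYWORD_WEIGHTS.items() if k in words), 1.0)
    score = 0.4 * coverage + 0.3 * recency + 0.2 * credibility + 0.1 * keywords
    return round(score, 3)
```

Because every component is bounded by 1 and the weights sum to 1, the result stays in the same range as the `importance` value shown in the tool output.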
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 🗃️ Storage Layer
|
|
|
+
|
|
|
+You need short-term memory:
|
|
|
+
|
|
|
+* clusters (not raw articles)
|
|
|
+* TTL: e.g. 24–72h
|
|
|
+
|
|
|
+Backend options:
|
|
|
+
|
|
|
+* in-memory store (start)
|
|
|
+* later: DB
|
|
|
+
|
|
|
+Candidate stores for the later DB stage include Qdrant, PostgreSQL, and CouchDB.
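For the in-memory starting point, a TTL-bound cluster store might be sketched as follows. `ClusterStore` and its methods are hypothetical names, the 48-hour default is just one point inside the 24–72h range above, and the injectable `clock` exists only to make expiry testable.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ClusterStore:
    """In-memory store that forgets clusters after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float = 48 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def put(self, cluster: dict) -> None:
        """Insert or update a cluster; updating refreshes its expiry."""
        expires_at = self._clock() + self._ttl
        self._items[cluster["cluster_id"]] = (expires_at, cluster)

    def get(self, cluster_id: str) -> Optional[dict]:
        item = self._items.get(cluster_id)
        if item is None:
            return None
        expires_at, cluster = item
        if expires_at <= self._clock():
            del self._items[cluster_id]  # lazily evict on read
            return None
        return cluster

    def all(self) -> List[dict]:
        """Return live clusters, evicting expired ones first."""
        now = self._clock()
        self._items = {k: v for k, v in self._items.items() if v[0] > now}
        return [cluster for _, cluster in self._items.values()]
```

Keeping the interface this small should make it straightforward to swap in a database-backed implementation later.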
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧰 2. Agent-Facing Tools (IMPORTANT)
|
|
|
+
|
|
|
+Keep tools **high-level and semantic**
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 1. `get_latest_events`
|
|
|
+
|
|
|
+> “What is happening right now?”
|
|
|
+
|
|
|
+Input:
|
|
|
+
|
|
|
+```json id="n1"
|
|
|
+{
|
|
|
+ "topic": "crypto",
|
|
|
+ "limit": 5
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+Output:
|
|
|
+
|
|
|
+```json id="n2"
|
|
|
+[
|
|
|
+ {
|
|
|
+ "headline": "...",
|
|
|
+ "summary": "...",
|
|
|
+ "entities": ["BTC"],
|
|
|
+ "sentiment": "positive",
|
|
|
+ "importance": 0.82,
|
|
|
+ "sources": ["Reuters", "CoinDesk"],
|
|
|
+ "timestamp": "..."
|
|
|
+ }
|
|
|
+]
|
|
|
+```
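A possible handler for this tool is sketched below. It assumes enriched clusters are dicts carrying a `topics` list (a hypothetical field, not shown in the output above) and an ISO 8601 UTC `timestamp` string, so that plain string ordering matches chronological ordering.

```python
def get_latest_events(clusters, topic=None, limit=5):
    """Return the newest clusters, optionally restricted to one topic."""
    matching = [
        c for c in clusters
        if topic is None or topic in c.get("topics", [])
    ]
    # ISO 8601 UTC strings of equal format sort chronologically.
    matching.sort(key=lambda c: c["timestamp"], reverse=True)
    return matching[:limit]
```

Sorting by `importance` instead of recency is an equally plausible reading of "latest events"; the spec does not settle which one applies.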
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 2. `get_events_for_entity`
|
|
|
+
|
|
|
+> “What’s happening with X?”
|
|
|
+
|
|
|
+```json id="n3"
|
|
|
+{
|
|
|
+ "entity": "BTC"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+👉 Filters clusters by entity
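The filter itself can be very small. This sketch assumes each cluster dict has an `entities` list like the one in the `get_latest_events` output, and that matching should ignore case; both are assumptions.

```python
def get_events_for_entity(clusters, entity):
    """Return clusters whose entity list contains `entity` (case-insensitive)."""
    wanted = entity.strip().upper()
    return [
        c for c in clusters
        if wanted in {e.upper() for e in c.get("entities", [])}
    ]
```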
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 3. `get_event_summary`
|
|
|
+
|
|
|
+> “Explain this event clearly”
|
|
|
+
|
|
|
+```json id="n4"
|
|
|
+{
|
|
|
+ "event_id": "cluster_id"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+Output:
|
|
|
+
|
|
|
+* merged summary
|
|
|
+* key facts
|
|
|
+* sources
|
|
|
+
|
|
|
+👉 This is where you compress multiple articles into one clean narrative
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 4. `get_news_sentiment`
|
|
|
+
|
|
|
+> “What’s the tone around X?”
|
|
|
+
|
|
|
+```json id="n5"
|
|
|
+{
|
|
|
+ "entity": "BTC",
|
|
|
+ "timeframe": "24h"
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+Output:
|
|
|
+
|
|
|
+```json id="n6"
|
|
|
+{
|
|
|
+ "sentiment": "positive",
|
|
|
+ "score": 0.64,
|
|
|
+ "article_count": 42
|
|
|
+}
|
|
|
+```
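The spec does not define how the aggregate is computed. One plausible sketch, assuming each cluster carries a `score` in the range -1 to 1 and an `article_count`, is an article-weighted mean with a small neutral band (the 0.1 band width is an assumption):

```python
def aggregate_sentiment(clusters, neutral_band=0.1):
    """Article-weighted mean sentiment over the given clusters."""
    total = sum(c["article_count"] for c in clusters)
    if total == 0:
        return {"sentiment": "neutral", "score": 0.0, "article_count": 0}
    mean = sum(c["score"] * c["article_count"] for c in clusters) / total
    if mean > neutral_band:
        label = "positive"
    elif mean < -neutral_band:
        label = "negative"
    else:
        label = "neutral"
    return {"sentiment": label, "score": round(mean, 2), "article_count": total}
```

Weighting by article count lets a widely covered story outweigh a single outlier piece.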
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## 5. `detect_emerging_topics` (very valuable)
|
|
|
+
|
|
|
+> “What is gaining attention?”
|
|
|
+
|
|
|
+Output:
|
|
|
+
|
|
|
+```json id="n7"
|
|
|
+[
|
|
|
+ {
|
|
|
+ "topic": "Ethereum ETF",
|
|
|
+ "trend_score": 0.91,
|
|
|
+ "related_entities": ["ETH"]
|
|
|
+ }
|
|
|
+]
|
|
|
+```
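How `trend_score` is derived is left open here. A simple sketch, assuming topic labels are counted over a recent window and an equally long previous window, scores each topic by its share of recent mentions with add-one smoothing. The function name, `min_mentions` cutoff, and formula are assumptions, and `related_entities` is omitted for brevity.

```python
from collections import Counter


def detect_emerging_topics(recent_topics, previous_topics, min_mentions=3, limit=5):
    """Rank topics by how much of their attention is recent."""
    recent = Counter(recent_topics)
    previous = Counter(previous_topics)
    results = []
    for topic, count in recent.items():
        if count < min_mentions:
            continue  # too few mentions to call a trend
        # Add-one smoothing keeps brand-new topics below a perfect 1.0.
        score = count / (count + previous[topic] + 1)
        results.append({"topic": topic, "trend_score": round(score, 2)})
    results.sort(key=lambda r: (-r["trend_score"], r["topic"]))
    return results[:limit]
```

A topic that is mentioned equally often in both windows lands just under 0.5, so scores well above that suggest growing attention.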
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# ⚠️ 3. What NOT to expose
|
|
|
+
|
|
|
+Avoid:
|
|
|
+
|
|
|
+* raw RSS feeds
|
|
|
+* individual article endpoints
|
|
|
+* unprocessed headlines
|
|
|
+
|
|
|
+❌ Bad:
|
|
|
+
|
|
|
+```id="bad-news"
|
|
|
+get_raw_articles()
|
|
|
+```
|
|
|
+
|
|
|
+👉 This destroys signal quality for agents
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🔁 4. Caching & Freshness Strategy
|
|
|
+
|
|
|
+## Key difference from crypto price data:
|
|
|
+
|
|
|
+* News is **append-only + evolving**
|
|
|
+* Not real-time tick data
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+## Strategy:
|
|
|
+
|
|
|
+### Fetch layer:
|
|
|
+
|
|
|
+* poll every few minutes
|
|
|
+
|
|
|
+### Cluster layer:
|
|
|
+
|
|
|
+* update clusters incrementally
|
|
|
+
|
|
|
+### Tool responses:
|
|
|
+
|
|
|
+* no heavy recomputation
|
|
|
+* serve from processed store
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧠 5. Deduplication Strategy (critical)
|
|
|
+
|
|
|
+Start simple:
|
|
|
+
|
|
|
+### v1:
|
|
|
+
|
|
|
+* normalize titles (lowercase, strip punctuation)
|
|
|
+* fuzzy match (threshold ~0.8)
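The v1 approach can be sketched with the standard library alone. Here `difflib.SequenceMatcher` stands in for "fuzzy match" (other similarity measures would work too), and the greedy comparison against each cluster's first title is an assumption made to keep the example short.

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    return " ".join(title.lower().translate(_PUNCT).split())


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold


def cluster_titles(titles, threshold: float = 0.8):
    """Greedy clustering: join the first cluster whose seed title matches."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if similar(title, cluster[0], threshold):
                cluster.append(title)
                break
        else:
            clusters.append([title])  # no match: start a new cluster
    return clusters
```

Note that `string.punctuation` covers ASCII only, so typographic quotes in headlines survive normalization; this pass is also quadratic in the number of clusters, which should be acceptable at the volumes a 24–72h window implies.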
|
|
|
+
|
|
|
+### v2:
|
|
|
+
|
|
|
+* embeddings / semantic similarity
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# ⚡ 6. Signal Quality Rules
|
|
|
+
|
|
|
+Your MCP should:
|
|
|
+
|
|
|
+### ✅ Do:
|
|
|
+
|
|
|
+* reduce 100 articles → 5–10 clusters
|
|
|
+* highlight consensus
|
|
|
+* surface importance
|
|
|
+
|
|
|
+### ❌ Don’t:
|
|
|
+
|
|
|
+* overwhelm agent with volume
|
|
|
+* pass conflicting duplicates
|
|
|
+* expose noise
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧩 7. Relationship to Other MCPs
|
|
|
+
|
|
|
+This MCP becomes powerful when combined with:
|
|
|
+
|
|
|
+* crypto MCP → price
|
|
|
+* trends MCP → attention
|
|
|
+
|
|
|
+👉 News MCP provides:
|
|
|
+
|
|
|
+> **causal narratives**
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧭 8. Design Philosophy
|
|
|
+
|
|
|
+Each tool should answer:
|
|
|
+
|
|
|
+> “What is happening, and why should I care?”
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🚀 9. Suggested Build Order
|
|
|
+
|
|
|
+1. RSS ingestion
|
|
|
+2. normalization
|
|
|
+3. basic deduplication
|
|
|
+4. clustering
|
|
|
+5. simple summarization
|
|
|
+6. entity tagging
|
|
|
+
|
|
|
+👉 Only then expose tools
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+# 🧠 Final takeaway
|
|
|
+
|
|
|
+> Crypto MCP gives you **facts**
|
|
|
+> News MCP gives you **meaning**
|
|
|
+
|
|
|
+But only if you:
|
|
|
+
|
|
|
+* aggressively deduplicate
|
|
|
+* cluster events
|
|
|
+* compress information
|
|
|
+
|