# 📰 News MCP Server: Requirements Spec

## 🎯 Goal

Provide **structured, deduplicated, topic-aware news signals** that an agent can use for reasoning about:

* events
* narratives
* sentiment shifts

👉 Not a feed reader

👉 Not a headline dump

👉 A **signal extraction layer**

---

# 🧠 Core Design Principle

> Raw news is useless to agents.
> **Processed news is powerful.**

---

# 🏗️ 1. Internal Architecture

## 🧩 Data Sources Layer (`sources/`)

Mix of:

* RSS feeds (primary)
* optional APIs later

Examples:

* Reuters
* Bloomberg
* CoinDesk

### Responsibilities:

* fetch articles
* normalize format

---

## 🔄 Ingestion Pipeline

Runs periodically (e.g. every few minutes)

Steps:

1. fetch articles
2. normalize fields:
   * title
   * url
   * source
   * timestamp
   * summary (if available)

---

## 🧹 Deduplication Layer

### Problem:

Same story appears across many sources.

### Solution:

Cluster articles by similarity.

Methods:

* title similarity (fuzzy match / embeddings)
* URL canonicalization
* content similarity (optional later)

### Output:

```json
{
  "cluster_id": "...",
  "headline": "Canonical headline",
  "articles": [...],
  "sources": ["Reuters", "Bloomberg"],
  "first_seen": "...",
  "last_updated": "..."
}
```

👉 This is your **core unit of truth**, not individual articles

---

## 🧠 Enrichment Layer

Adds meaning to clusters.

### 1. Entity extraction

* assets (BTC, ETH)
* companies
* macro topics (inflation, rates)

---

### 2. Topic classification

Examples:

* crypto
* macro
* regulation
* AI

---

### 3. Sentiment (lightweight)

* positive / negative / neutral
* or simple score

👉 Keep this simple in v1 (don’t over-engineer NLP)

---

### 4. Importance scoring (VERY useful)

Heuristic:

* number of sources covering it
* recency
* source credibility
* keyword weighting

---

## 🗃️ Storage Layer

You need short-term memory:

* clusters (not raw articles)
* TTL: e.g. 24–72h

Options:

* in-memory store (start)
* later: a database; candidates include Qdrant, PostgreSQL, and CouchDB

---

# 🧰 2. Agent-Facing Tools (IMPORTANT)

Keep tools **high-level and semantic**

---

## 1. `get_latest_events`

> “What is happening right now?”

Input:

```json
{
  "topic": "crypto",
  "limit": 5
}
```

Output:

```json
[
  {
    "headline": "...",
    "summary": "...",
    "entities": ["BTC"],
    "sentiment": "positive",
    "importance": 0.82,
    "sources": ["Reuters", "CoinDesk"],
    "timestamp": "..."
  }
]
```

---

## 2. `get_events_for_entity`

> “What’s happening with X?”

```json
{
  "entity": "BTC"
}
```

👉 filters clusters by entity

---

## 3. `get_event_summary`

> “Explain this event clearly”

```json
{
  "event_id": "cluster_id"
}
```

Output:

* merged summary
* key facts
* sources

👉 This is where you compress multiple articles into one clean narrative

---

## 4. `get_news_sentiment`

> “What’s the tone around X?”

```json
{
  "entity": "BTC",
  "timeframe": "24h"
}
```

Output:

```json
{
  "sentiment": "positive",
  "score": 0.64,
  "article_count": 42
}
```

---

## 5. `detect_emerging_topics` (very valuable)

> “What is gaining attention?”

Output:

```json
[
  {
    "topic": "Ethereum ETF",
    "trend_score": 0.91,
    "related_entities": ["ETH"]
  }
]
```

---

# ⚠️ 3. What NOT to expose

Avoid:

* raw RSS feeds
* individual article endpoints
* unprocessed headlines

❌ Bad:

```
get_raw_articles()
```

👉 This destroys signal quality for agents

---

# 🔁 4. Caching & Freshness Strategy

## Key difference from crypto:

* News is **append-only + evolving**
* Not real-time tick data

---

## Strategy:

### Fetch layer:

* poll every few minutes

### Cluster layer:

* update clusters incrementally

### Tool responses:

* no heavy recomputation
* serve from processed store

---

# 🧠 5. Deduplication Strategy (critical)

Start simple:

### v1:

* normalize titles (lowercase, strip punctuation)
* fuzzy match (threshold ~0.8)

### v2:

* embeddings / semantic similarity

---

# ⚡ 6. Signal Quality Rules

Your MCP should:

### ✅ Do:

* reduce 100 articles → 5–10 clusters
* highlight consensus
* surface importance

### ❌ Don’t:

* overwhelm agent with volume
* pass conflicting duplicates
* expose noise

---

# 🧩 7. Relationship to Other MCPs

This MCP becomes powerful when combined with:

* crypto MCP → price
* trends MCP → attention

👉 News MCP provides:

> **causal narratives**

---

# 🧭 8. Design Philosophy

Each tool should answer:

> “What is happening, and why should I care?”

---

# 🚀 9. Suggested Build Order

1. RSS ingestion
2. normalization
3. basic deduplication
4. clustering
5. simple summarization
6. entity tagging

👉 Only then expose tools

---

# 🧠 Final takeaway

> Crypto MCP gives you **facts**
> News MCP gives you **meaning**

But only if you:

* aggressively deduplicate
* cluster events
* compress information
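
---

# 🧪 Appendix: v1 deduplication sketch

The v1 deduplication strategy (normalize titles, then fuzzy match at a threshold of about 0.8) could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the names `normalize_title`, `Cluster`, and `cluster_articles` are hypothetical, the article dictionaries assume the normalized fields `title`, `url`, and `source`, and the standard library's `difflib.SequenceMatcher` is used here only as one possible fuzzy matcher, since the spec does not name one.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


@dataclass
class Cluster:
    """Hypothetical in-memory cluster; a subset of the cluster JSON fields."""
    headline: str    # first title seen, used as the canonical headline
    normalized: str  # normalized form of that headline, used for matching
    sources: list[str] = field(default_factory=list)
    urls: list[str] = field(default_factory=list)


def cluster_articles(articles: list[dict], threshold: float = 0.8) -> list[Cluster]:
    """Greedily assign each article to the first cluster whose headline
    is similar enough; otherwise start a new cluster."""
    clusters: list[Cluster] = []
    for article in articles:
        norm = normalize_title(article["title"])
        match = None
        for cluster in clusters:
            ratio = SequenceMatcher(None, norm, cluster.normalized).ratio()
            if ratio >= threshold:
                match = cluster
                break
        if match is None:
            match = Cluster(headline=article["title"], normalized=norm)
            clusters.append(match)
        if article["source"] not in match.sources:
            match.sources.append(article["source"])
        match.urls.append(article["url"])
    return clusters


# Illustrative input only; titles and URLs are made up for the example.
sample = [
    {"title": "Bitcoin ETF approved by SEC!", "url": "https://example.com/a", "source": "Reuters"},
    {"title": "Bitcoin ETF approved by the SEC", "url": "https://example.com/b", "source": "CoinDesk"},
    {"title": "Fed holds interest rates steady", "url": "https://example.com/c", "source": "Bloomberg"},
]
for cluster in cluster_articles(sample):
    print(cluster.headline, cluster.sources)
```

The greedy first-match loop compares every article against every existing cluster, which is likely fine for a few hundred articles per polling cycle. It compares only against the first headline in each cluster, so the result can depend on arrival order; the v2 embedding approach would be the natural place to address that.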