# 📰 News MCP Server — Requirements Spec

## 🎯 Goal

Provide **structured, deduplicated, topic-aware news signals** that an agent can use for reasoning about:

* events
* narratives
* sentiment shifts

👉 Not a feed reader
👉 Not a headline dump
👉 A **signal extraction layer**

---

# 🧠 Core Design Principle

> Raw news is useless to agents.
> **Processed news is powerful.**

---

# 🏗️ 1. Internal Architecture

## 🧩 Data Sources Layer (`sources/`)

Mix of:

* RSS feeds (primary)
* optional APIs later

Examples:

* Reuters
* Bloomberg
* CoinDesk

### Responsibilities:

* fetch articles
* normalize format

---

## 🔄 Ingestion Pipeline

Runs periodically (e.g. every few minutes).

Steps:

1. fetch articles
2. normalize fields:
   * title
   * url
   * source
   * timestamp
   * summary (if available)

---

## 🧹 Deduplication Layer

### Problem:

The same story appears across many sources.

### Solution:

Cluster articles by similarity.

Methods:

* title similarity (fuzzy match / embeddings)
* URL canonicalization
* content similarity (optional later)

### Output:

```json id="cluster"
{
  "cluster_id": "...",
  "headline": "Canonical headline",
  "articles": [...],
  "sources": ["Reuters", "Bloomberg"],
  "first_seen": "...",
  "last_updated": "..."
}
```

👉 This is your **core unit of truth**, not individual articles.

---

## 🧠 Enrichment Layer

Adds meaning to clusters.

### 1. Entity extraction

* assets (BTC, ETH)
* companies
* macro topics (inflation, rates)

---

### 2. Topic classification

Examples:

* crypto
* macro
* regulation
* AI

---

### 3. Sentiment (lightweight)

* positive / negative / neutral
* or a simple score

👉 Keep this simple in v1 (don't over-engineer NLP)

---

### 4. Importance scoring (VERY useful)

Heuristic:

* number of sources covering it
* recency
* source credibility
* keyword weighting

---

## 🗃️ Storage Layer

You need short-term memory:

* clusters (not raw articles)
* TTL: e.g. 24–72h

Optional:

* in-memory store (to start)
* later: a DB — storage possibilities include Qdrant, PostgreSQL, CouchDB

---
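The storage layer above can start as a plain in-memory dict with TTL eviction. A minimal sketch, assuming a 48h default inside the 24–72h window; the names `Cluster` and `ClusterStore` are illustrative, not from the codebase:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: str
    headline: str
    articles: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    first_seen: float = field(default_factory=time.time)
    last_updated: float = field(default_factory=time.time)


class ClusterStore:
    """In-memory cluster store with TTL eviction (24-72h window)."""

    def __init__(self, ttl_seconds: float = 48 * 3600):
        self.ttl = ttl_seconds
        self._clusters: dict[str, Cluster] = {}

    def upsert(self, cluster: Cluster) -> None:
        # Any update refreshes the cluster's lifetime.
        cluster.last_updated = time.time()
        self._clusters[cluster.cluster_id] = cluster

    def sweep(self) -> None:
        """Drop clusters whose last update is older than the TTL."""
        cutoff = time.time() - self.ttl
        stale = [cid for cid, c in self._clusters.items() if c.last_updated < cutoff]
        for cid in stale:
            del self._clusters[cid]

    def all(self) -> list[Cluster]:
        self.sweep()
        return list(self._clusters.values())
```

Swapping this for a real DB later only means replacing `ClusterStore` behind the same interface.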
# 🧰 2. Agent-Facing Tools (IMPORTANT)

Keep tools **high-level and semantic**.

---

## 1. `get_latest_events`

> "What is happening right now?"

Input:

```json id="n1"
{
  "topic": "crypto",
  "limit": 5,
  "include_articles": false
}
```

Output:

```json id="n2"
[
  {
    "headline": "...",
    "summary": "...",
    "entities": ["BTC"],
    "sentiment": "positive",
    "importance": 0.82,
    "sources": ["Reuters", "CoinDesk"],
    "timestamp": "...",
    "articles": [
      {
        "title": "...",
        "url": "...",
        "source": "Reuters",
        "timestamp": "..."
      }
    ]
  }
]
```

---

## 2. `get_events_for_entity`

> "What's happening with X?"

```json id="n3"
{
  "entity": "BTC",
  "include_articles": false
}
```

👉 filters clusters by entity

Optional:

* `include_articles` to include article title/url/source/timestamp in the payload

---

## 3. `get_event_summary`

> "Explain this event clearly"

```json id="n4"
{
  "event_id": "cluster_id",
  "include_articles": false
}
```

Output:

* merged summary
* key facts
* sources
* optional articles (title/url/source/timestamp)

👉 This is where you compress multiple articles into one clean narrative.

---

## 4. `get_news_sentiment`

> "What's the tone around X?"

```json id="n5"
{
  "entity": "BTC",
  "timeframe": "24h"
}
```

Output:

```json id="n6"
{
  "sentiment": "positive",
  "score": 0.64,
  "article_count": 42
}
```

---

## 5. `detect_emerging_topics` (very valuable)

> "What is gaining attention?"

Output:

```json id="n7"
[
  {
    "topic": "Ethereum ETF",
    "trend_score": 0.91,
    "related_entities": ["ETH", "BlackRock", "SEC"],
    "count": 8,
    "avg_importance": 0.17
  }
]
```

---
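The `trend_score` above can come from something as simple as the share of a topic's mentions that landed in a recent window, with average importance carried along. A rough sketch; the window sizes and the scoring formula are assumptions, not the shipped logic:

```python
from collections import defaultdict


def emerging_topics(clusters, now, recent_window=6 * 3600, baseline_window=48 * 3600):
    """Score topics by how concentrated their mentions are in the recent window.

    `clusters` is an iterable of dicts with "topic", "importance", "last_updated".
    """
    recent = defaultdict(list)
    baseline = defaultdict(int)
    for c in clusters:
        age = now - c["last_updated"]
        if age <= baseline_window:
            baseline[c["topic"]] += 1
        if age <= recent_window:
            recent[c["topic"]].append(c["importance"])

    results = []
    for topic, imps in recent.items():
        # Share of the topic's baseline mentions that fall in the recent window:
        # 1.0 means the topic is brand new, small values mean old news.
        trend_score = len(imps) / baseline[topic]
        results.append({
            "topic": topic,
            "trend_score": round(trend_score, 2),
            "count": len(imps),
            "avg_importance": round(sum(imps) / len(imps), 2),
        })
    return sorted(results, key=lambda r: r["trend_score"], reverse=True)
```

Because every recent mention also counts toward the baseline, the ratio is always defined and bounded by 1.0.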
## 6. `get_related_entities`

> "What entities tend to appear with X?"

```json id="n8"
{
  "subject": "Iran",
  "timeframe": "24h",
  "limit": 10
}
```

Output:

```json id="n9"
[
  {
    "entity": "United States",
    "count": 5,
    "avg_importance": 0.11,
    "sentiment": "negative",
    "score": -0.2
  }
]
```

👉 entity-only co-occurrence neighborhood for real-time sense-making

---

# ⚠️ 3. What NOT to expose

Avoid:

* raw RSS feeds
* individual article endpoints
* unprocessed headlines

❌ Bad:

```id="bad-news"
get_raw_articles()
```

👉 This destroys signal quality for agents.

---

# 🔁 4. Caching & Freshness Strategy

## Key difference from crypto:

* News is **append-only + evolving**
* Not real-time tick data

---

## Strategy:

### Fetch layer:

* poll every few minutes

### Cluster layer:

* update clusters incrementally

### Tool responses:

* no heavy recomputation
* serve from the processed store

---

# 🧠 5. Deduplication Strategy (critical)

Start simple:

### v1:

* normalize titles (lowercase, strip punctuation)
* fuzzy match (threshold ~0.8)

### v2:

* embeddings / semantic similarity

---

# ⚡ 6. Signal Quality Rules

Your MCP should:

### ✅ Do:

* reduce 100 articles → 5–10 clusters
* highlight consensus
* surface importance

### ❌ Don't:

* overwhelm the agent with volume
* pass conflicting duplicates
* expose noise

---

# 🧩 7. Relationship to Other MCPs

This MCP becomes powerful when combined with:

* crypto MCP → price
* trends MCP → attention

👉 News MCP provides:

> **causal narratives**

---

# 🧭 8. Design Philosophy

Each tool should answer:

> "What is happening, and why should I care?"

---

# 🚀 9. Suggested Build Order

1. RSS ingestion
2. normalization
3. basic deduplication
4. clustering
5. simple summarization
6. entity tagging

👉 Only then expose tools.

---
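The v1 deduplication strategy in section 5 (normalize titles, fuzzy match at ~0.8) needs nothing beyond the stdlib. A minimal sketch; `assign_cluster` and its cluster-id scheme are illustrative:

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def similar(a: str, b: str) -> float:
    """Fuzzy similarity of two normalized titles, in [0, 1]."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def assign_cluster(title: str, clusters: dict[str, str], threshold: float = 0.8) -> str:
    """Return the id of the best-matching cluster, or create a new one.

    `clusters` maps cluster_id -> canonical headline.
    """
    best_id, best_score = None, 0.0
    for cid, headline in clusters.items():
        score = similar(title, headline)
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is not None and best_score >= threshold:
        return best_id
    new_id = f"cluster_{len(clusters)}"
    clusters[new_id] = title
    return new_id
```

The v2 upgrade keeps the same `assign_cluster` shape and only swaps `similar` for an embedding-based comparison.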
# 🧠 Final takeaway

> Crypto MCP gives you **facts**
> News MCP gives you **meaning**

But only if you:

* aggressively deduplicate
* cluster events
* compress information

---

# ✅ Completed since this outlook was written

* v0.1.0 released and tagged
* provider-agnostic LLM extraction/summarization layer added
* prompts moved into separate files for easier updates
* entity blacklist implemented and made case-insensitive
* wildcard blacklist support added for entities/topics/keywords
* live extraction smoke test added
* JSON-backed alias map added for query normalization
* query normalization added so shorthand like `btc` and `trump` still works
* docs updated with the new env vars and workflow
* optional article payloads added to event tools
* blacklist enforcement maintenance script added
* related-entities tool added for co-occurrence neighborhoods
* emerging-topic scoring improved with importance weighting and co-occurrence

---

# 🔭 Next high-level steps

## What is left of v0.1.0

The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:

* stabilize extraction quality across a few more real-world samples
* expand the alias map only where usage demands it
* tune emerging-topic noise so repeated source names do not dominate
* keep sentiment labels aligned with scores as the model improves

## Where v0.2.0 should lead

1. **Normalization layer**
   * canonicalize acronyms and entity variants before storage / querying
   * keep the blacklist as a separate post-processing rule
2. **Wildcard blacklist support**
   * allow patterns for entities / topics / keywords
   * keep matching case-insensitive
3. **Emerging signal quality**
   * tune what counts as an emerging topic/entity
   * reduce noise from repeated source names and generic terms
4. **Entity/time tracking and replay (future capability)**
   * track how important entities evolve over time
   * allow replay of when entities first appeared, how topics shifted, and how sentiment changed
   * useful later for narrative reconstruction and trend timelines

## Longer-term direction

The endgame is not just "news search", but a light narrative memory system:

* entity histories over time
* topic shifts and turning points
* sentiment arcs
* replayable timelines for a person, company, or event

That endgame should stay in mind while keeping the current implementation simple.
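The replayable-timeline idea can stay simple at first: append `(timestamp, sentiment, importance)` events per entity and replay them oldest-first. A minimal sketch under those assumptions; `EntityTimeline` is a hypothetical name, not part of the current implementation:

```python
from collections import defaultdict


class EntityTimeline:
    """Append-only per-entity history, replayable in time order."""

    def __init__(self):
        # entity -> list of (timestamp, sentiment, importance) tuples
        self._events = defaultdict(list)

    def record(self, entity: str, timestamp: float,
               sentiment: float, importance: float) -> None:
        self._events[entity].append((timestamp, sentiment, importance))

    def replay(self, entity: str):
        """Yield the entity's events oldest-first."""
        yield from sorted(self._events[entity])

    def first_seen(self, entity: str):
        """Timestamp of the entity's first recorded event, or None."""
        events = self._events[entity]
        return min(events)[0] if events else None
```

Sentiment arcs and topic turning points then become read-side queries over `replay`, which keeps the write path as simple as the rest of v0.1.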