
📰 News MCP Server — Requirements Spec

🎯 Goal

Provide structured, deduplicated, topic-aware news signals that an agent can use for reasoning about:

  • events
  • narratives
  • sentiment shifts

👉 Not a feed reader
👉 Not a headline dump
👉 A signal extraction layer


🧠 Core Design Principle

Raw news is useless to agents. Processed news is powerful.


🏗️ 1. Internal Architecture

🧩 Data Sources Layer (sources/)

Mix of:

  • RSS feeds (primary)
  • optional APIs later

Examples:

  • Reuters
  • Bloomberg
  • CoinDesk

Responsibilities:

  • fetch articles
  • normalize format

🔄 Ingestion Pipeline

Runs periodically (e.g. every few minutes)

Steps:

  1. fetch articles
  2. normalize fields:

    • title
    • url
    • source
    • timestamp
    • summary (if available)
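
A minimal normalization sketch, assuming fetched entries arrive as plain dicts with feed-style keys (`title`, `link`, `published`, `summary` — the key names are illustrative, not fixed by this spec):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def normalize_entry(raw: dict, source: str) -> dict:
    """Map a raw feed entry onto the canonical article fields."""
    try:
        # RSS dates are usually RFC 2822 strings; normalize to UTC
        timestamp = parsedate_to_datetime(raw.get("published")).astimezone(timezone.utc)
    except (TypeError, ValueError):
        timestamp = datetime.now(timezone.utc)  # fall back when the feed omits a date
    return {
        "title": (raw.get("title") or "").strip(),
        "url": (raw.get("link") or "").strip(),
        "source": source,
        "timestamp": timestamp.isoformat(),
        "summary": (raw.get("summary") or "").strip() or None,
    }
```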

🧹 Deduplication Layer

Problem:

Same story appears across many sources.

Solution:

Cluster articles by similarity:

Methods:

  • title similarity (fuzzy match / embeddings)
  • URL canonicalization
  • content similarity (optional later)

Output:

{
  "cluster_id": "...",
  "headline": "Canonical headline",
  "articles": [...],
  "sources": ["Reuters", "Bloomberg"],
  "first_seen": "...",
  "last_updated": "..."
}

👉 This is your core unit of truth, not individual articles
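
One way to carry that unit through the pipeline — a sketch using a dataclass with the field names from the JSON above; the merge logic (unique sources, first/last timestamps) is an assumption about how clusters should update:

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    cluster_id: str
    headline: str
    articles: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    first_seen: str = ""
    last_updated: str = ""

    def add_article(self, article: dict) -> None:
        """Merge a new article into the cluster, keeping sources unique."""
        self.articles.append(article)
        if article["source"] not in self.sources:
            self.sources.append(article["source"])
        self.first_seen = self.first_seen or article["timestamp"]
        self.last_updated = article["timestamp"]
```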


🧠 Enrichment Layer

Adds meaning to clusters.

1. Entity extraction

  • assets (BTC, ETH)
  • companies
  • macro topics (inflation, rates)

2. Topic classification

Examples:

  • crypto
  • macro
  • regulation
  • AI

3. Sentiment (lightweight)

  • positive / negative / neutral
  • or simple score

👉 Keep this simple in v1 (don’t over-engineer NLP)


4. Importance scoring (VERY useful)

Heuristic:

  • number of sources covering it
  • recency
  • source credibility
  • keyword weighting
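
The heuristic could combine those four inputs as a weighted sum — a sketch only; the weights, the 5-source cap, and the 72h decay window are illustrative assumptions, not tuned values:

```python
def importance_score(cluster: dict, credibility: dict, hot_keywords: set) -> float:
    """Heuristic importance in [0, 1]: source breadth, recency, credibility, keywords."""
    breadth = min(len(cluster["sources"]) / 5.0, 1.0)      # more sources -> stronger signal
    recency = max(0.0, 1.0 - cluster["age_hours"] / 72.0)  # decays over the 72h TTL
    cred = max((credibility.get(s, 0.5) for s in cluster["sources"]), default=0.5)
    words = set(cluster["headline"].lower().split())
    kw = 1.0 if words & hot_keywords else 0.0
    return round(0.4 * breadth + 0.3 * recency + 0.2 * cred + 0.1 * kw, 2)
```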

🗃️ Storage Layer

You need short-term memory:

  • clusters (not raw articles)
  • TTL: e.g. 24–72h

Optional:

  • in-memory store (start)
  • later: DB

There is a choice of storage backends for later, including Qdrant, PostgreSQL, and CouchDB.
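
The in-memory start could be as small as a dict with lazy TTL eviction — a sketch of the v1 store, to be swapped for a DB later:

```python
import time

class ClusterStore:
    """In-memory cluster store with TTL eviction (v1; swap for a DB later)."""

    def __init__(self, ttl_seconds: float = 72 * 3600):
        self.ttl = ttl_seconds
        self._items: dict = {}  # cluster_id -> (inserted_at, cluster)

    def put(self, cluster_id: str, cluster: dict) -> None:
        self._items[cluster_id] = (time.monotonic(), cluster)

    def get(self, cluster_id: str):
        entry = self._items.get(cluster_id)
        if entry is None:
            return None
        inserted_at, cluster = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._items[cluster_id]  # lazily evict expired clusters
            return None
        return cluster
```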


🧰 2. Agent-Facing Tools (IMPORTANT)

Keep tools high-level and semantic


1. get_latest_events

“What is happening right now?”

Input:

{
  "topic": "crypto",
  "limit": 5,
  "include_articles": false
}

Output:

[
  {
    "headline": "...",
    "summary": "...",
    "entities": ["BTC"],
    "sentiment": "positive",
    "importance": 0.82,
    "sources": ["Reuters", "CoinDesk"],
    "timestamp": "...",
    "articles": [
      {
        "title": "...",
        "url": "...",
        "source": "Reuters",
        "timestamp": "..."
      }
    ]
  }
]

2. get_events_for_entity

“What’s happening with X?”

{
  "entity": "BTC",
  "include_articles": false
}

👉 filters clusters by entity

Optional:

  • include_articles to include article title/url/source/timestamp in the payload

3. get_event_summary

“Explain this event clearly”

{
  "event_id": "cluster_id",
  "include_articles": false
}

Output:

  • merged summary
  • key facts
  • sources
  • optional articles (title/url/source/timestamp)

👉 This is where you compress multiple articles into one clean narrative


4. get_news_sentiment

“What’s the tone around X?”

{
  "entity": "BTC",
  "timeframe": "24h"
}

Output:

{
  "sentiment": "positive",
  "score": 0.64,
  "article_count": 42
}
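
A sketch of how per-article scores could collapse into that response shape; the ±0.15 neutral band is an assumed threshold, not part of the spec:

```python
def aggregate_sentiment(scores: list) -> dict:
    """Collapse per-article scores in [-1, 1] into the tool's response shape."""
    if not scores:
        return {"sentiment": "neutral", "score": 0.0, "article_count": 0}
    avg = sum(scores) / len(scores)
    if avg > 0.15:
        label = "positive"
    elif avg < -0.15:
        label = "negative"
    else:
        label = "neutral"
    return {"sentiment": label, "score": round(avg, 2), "article_count": len(scores)}
```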

5. detect_emerging_topics (very valuable)

“What is gaining attention?”

Output:

[
  {
    "topic": "Ethereum ETF",
    "trend_score": 0.91,
    "related_entities": ["ETH", "BlackRock", "SEC"],
    "count": 8,
    "avg_importance": 0.17
  }
]
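
One simple way to compute `trend_score` is relative growth between a recent window and the window before it — a sketch under that assumption; other definitions (e.g. importance-weighted counts) would also fit:

```python
def trend_score(recent_count: int, prior_count: int) -> float:
    """Score attention growth: 0 for flat/declining topics, near 1 for sudden spikes."""
    growth = (recent_count - prior_count) / (prior_count + 1)  # +1 smooths brand-new topics
    return round(min(max(growth, 0.0), 1.0), 2)
```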

6. get_related_entities

“What entities tend to appear with X?”

{
  "subject": "Iran",
  "timeframe": "24h",
  "limit": 10
}

Output:

[
  {
    "entity": "United States",
    "count": 5,
    "avg_importance": 0.11,
    "sentiment": "negative",
    "score": -0.2
  }
]

👉 entity-only co-occurrence neighborhood for real-time sense-making
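
The co-occurrence neighborhood reduces to counting entities that share clusters with the subject — a sketch, assuming each cluster carries an `entities` list as described in the enrichment layer:

```python
from collections import Counter

def related_entities(clusters: list, subject: str, limit: int = 10) -> list:
    """Count entities that co-occur with the subject across clusters."""
    counts = Counter()
    for cluster in clusters:
        entities = cluster.get("entities", [])
        if subject in entities:
            counts.update(e for e in entities if e != subject)
    return [{"entity": e, "count": c} for e, c in counts.most_common(limit)]
```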


⚠️ 3. What NOT to expose

Avoid:

  • raw RSS feeds
  • individual article endpoints
  • unprocessed headlines

❌ Bad:

get_raw_articles()

👉 This destroys signal quality for agents


🔁 4. Caching & Freshness Strategy

Key difference from crypto:

  • News is append-only + evolving
  • Not real-time tick data

Strategy:

Fetch layer:

  • poll every few minutes

Cluster layer:

  • update clusters incrementally

Tool responses:

  • no heavy recomputation
  • serve from processed store

🧠 5. Deduplication Strategy (critical)

Start simple:

v1:

  • normalize titles (lowercase, strip punctuation)
  • fuzzy match (threshold ~0.8)
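
The v1 rules can be sketched with the standard library alone — normalization plus `difflib` ratio against the ~0.8 threshold suggested above:

```python
import re
from difflib import SequenceMatcher

def normalize_title(title: str) -> str:
    """Lowercase and strip punctuation, per the v1 rules."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def same_story(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
    """Fuzzy-match normalized titles; two articles above the threshold share a cluster."""
    ratio = SequenceMatcher(None, normalize_title(title_a), normalize_title(title_b)).ratio()
    return ratio >= threshold
```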

v2:

  • embeddings / semantic similarity

Planned runtime order:

  • when NEWS_EMBEDDINGS_ENABLED=true, try Ollama embeddings first
  • if Ollama fails, fall back to the existing heuristic cluster path
  • keep candidate pre-filtering cheap before any vector compare
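
The planned order reduces to a try/fallback wrapper — a sketch only: `embed` stands in for an Ollama embeddings call (not a real client API) and `heuristic` for the existing normalized-title path:

```python
def cluster_key(title: str, embed, heuristic) -> str:
    """Try the embedding path first; fall back to the heuristic path on any failure.

    `embed` is a stand-in for an Ollama embeddings call and may raise;
    `heuristic` is the existing cheap cluster-key function. Both are assumptions.
    """
    try:
        vector = embed(title)
        # a real implementation would do a vector compare, not hash the vector
        return f"emb:{hash(tuple(round(v, 3) for v in vector))}"
    except Exception:
        return f"heur:{heuristic(title)}"
```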

⚡ 6. Signal Quality Rules

Your MCP should:

✅ Do:

  • reduce 100 articles → 5–10 clusters
  • highlight consensus
  • surface importance

❌ Don’t:

  • overwhelm agent with volume
  • pass conflicting duplicates
  • expose noise

🧩 7. Relationship to Other MCPs

This MCP becomes powerful when combined with:

  • crypto MCP → price
  • trends MCP → attention

👉 News MCP provides:

causal narratives


🧭 8. Design Philosophy

Each tool should answer:

“What is happening, and why should I care?”


🚀 9. Suggested Build Order

  1. RSS ingestion
  2. normalization
  3. basic deduplication
  4. clustering
  5. simple summarization
  6. entity tagging

👉 Only then expose tools


🧠 Final takeaway

Crypto MCP gives you facts.
News MCP gives you meaning.

But only if you:

  • aggressively deduplicate
  • cluster events
  • compress information

✅ Completed since this outlook was written

  • v0.1.0 released and tagged
  • provider-agnostic LLM extraction/summarization layer added
  • prompts moved into separate files for easier updates
  • entity blacklist implemented and made case-insensitive
  • wildcard blacklist support added for entities/topics/keywords
  • live extraction smoke test added
  • JSON-backed alias map added for query normalization
  • query normalization added so shorthand like btc and trump still works
  • docs updated with the new env vars and workflow
  • optional article payloads added to event tools
  • blacklist enforcement maintenance script added
  • related-entities tool added for co-occurrence neighborhoods
  • emerging-topic scoring improved with importance-weighting and co-occurrence

🔭 Next high-level steps

What is left of v0.1.0

The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:

  • stabilize extraction quality across a few more real-world samples
  • expand the alias map only where usage demands it
  • tune emerging-topic noise so repeated source names do not dominate
  • keep sentiment labels aligned with scores as the model improves

Where v0.2.0 should lead

  1. Normalization layer

    • canonicalize acronyms and entity variants before storage / querying
    • keep the blacklist as a separate post-processing rule
  2. Wildcard blacklist support

    • allow patterns for entities / topics / keywords
    • keep matching case-insensitive
  3. Emerging signal quality

    • tune what counts as an emerging topic/entity
    • reduce noise from repeated source names and generic terms
  4. Entity/time tracking and replay (future capability)

    • track how important entities evolve over time
    • allow replay of when entities first appeared, how topics shifted, and how sentiment changed
    • useful later for narrative reconstruction and trend timelines

Longer-term direction

The endgame is not just “news search”, but a lightweight narrative memory system:

  • entity histories over time
  • topic shifts and turning points
  • sentiment arcs
  • replayable timelines for a person, company, or event

That should stay in mind while keeping the current implementation simple.