# 📰 News MCP Server — Requirements Spec

## 🎯 Goal

Provide **structured, deduplicated, topic-aware news signals** that an agent can use for reasoning about:

* events
* narratives
* sentiment shifts

👉 Not a feed reader
👉 Not a headline dump
👉 A **signal extraction layer**

---

# 🧠 Core Design Principle

> Raw news is useless to agents.
> **Processed news is powerful.**

---

# 🏗️ 1. Internal Architecture

## 🧩 Data Sources Layer (`sources/`)

Mix of:

* RSS feeds (primary)
* optional APIs later

Examples:

* Reuters
* Bloomberg
* CoinDesk

### Responsibilities:

* fetch articles
* normalize format

---

## 🔄 Ingestion Pipeline

Runs periodically (e.g. every few minutes).

Steps:

1. fetch articles
2. normalize fields:
   * title
   * url
   * source
   * timestamp
   * summary (if available)

---

## 🧹 Deduplication Layer

### Problem:

The same story appears across many sources.

### Solution:

Cluster articles by similarity.

Methods:

* title similarity (fuzzy match / embeddings)
* URL canonicalization
* content similarity (optional later)

### Output:

```json id="cluster"
{
  "cluster_id": "...",
  "headline": "Canonical headline",
  "articles": [...],
  "sources": ["Reuters", "Bloomberg"],
  "first_seen": "...",
  "last_updated": "..."
}
```

👉 This is your **core unit of truth**, not individual articles.

---

## 🧠 Enrichment Layer

Adds meaning to clusters.

### 1. Entity extraction

* assets (BTC, ETH)
* companies
* macro topics (inflation, rates)

---

### 2. Topic classification

Examples:

* crypto
* macro
* regulation
* AI

---

### 3. Sentiment (lightweight)

* positive / negative / neutral
* or a simple score

👉 Keep this simple in v1 (don't over-engineer NLP)

---

### 4. Importance scoring (VERY useful)

Heuristic:

* number of sources covering it
* recency
* source credibility
* keyword weighting

---

## 🗃️ Storage Layer

You need short-term memory:

* clusters (not raw articles)
* TTL: e.g. 24–72h

Optional:

* in-memory store (to start)
* later: a DB — storage possibilities include Qdrant, PostgreSQL, CouchDB

---
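The storage layer above can start as a plain in-memory dict with TTL eviction. A minimal sketch, assuming a 48h default inside the 24–72h window; the names `Cluster` and `ClusterStore` are illustrative, not from the codebase:

```python
import time
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: str
    headline: str
    articles: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    first_seen: float = field(default_factory=time.time)
    last_updated: float = field(default_factory=time.time)


class ClusterStore:
    """In-memory cluster store with TTL eviction (24-72h window)."""

    def __init__(self, ttl_seconds: float = 48 * 3600):
        self.ttl = ttl_seconds
        self._clusters: dict[str, Cluster] = {}

    def upsert(self, cluster: Cluster) -> None:
        # Any update refreshes the cluster's lifetime.
        cluster.last_updated = time.time()
        self._clusters[cluster.cluster_id] = cluster

    def sweep(self) -> None:
        """Drop clusters whose last update is older than the TTL."""
        cutoff = time.time() - self.ttl
        stale = [cid for cid, c in self._clusters.items() if c.last_updated < cutoff]
        for cid in stale:
            del self._clusters[cid]

    def all(self) -> list[Cluster]:
        self.sweep()
        return list(self._clusters.values())
```

Swapping this for a real DB later only means replacing `ClusterStore` behind the same interface.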
# 🧰 2. Agent-Facing Tools (IMPORTANT)

Keep tools **high-level and semantic**.

---

## 1. `get_latest_events`

> "What is happening right now?"

Input:

```json id="n1"
{
  "topic": "crypto",
  "limit": 5,
  "include_articles": false
}
```

Output:

```json id="n2"
[
  {
    "headline": "...",
    "summary": "...",
    "entities": ["BTC"],
    "sentiment": "positive",
    "importance": 0.82,
    "sources": ["Reuters", "CoinDesk"],
    "timestamp": "...",
    "articles": [
      {
        "title": "...",
        "url": "...",
        "source": "Reuters",
        "timestamp": "..."
      }
    ]
  }
]
```

---

## 2. `get_events_for_entity`

> "What's happening with X?"

```json id="n3"
{
  "entity": "BTC",
  "include_articles": false
}
```

👉 filters clusters by entity

Optional:

* `include_articles` to include article title/url/source/timestamp in the payload

---

## 3. `get_event_summary`

> "Explain this event clearly"

```json id="n4"
{
  "event_id": "cluster_id",
  "include_articles": false
}
```

Output:

* merged summary
* key facts
* sources
* optional articles (title/url/source/timestamp)

👉 This is where you compress multiple articles into one clean narrative.

---

## 4. `get_news_sentiment`

> "What's the tone around X?"

```json id="n5"
{
  "entity": "BTC",
  "timeframe": "24h"
}
```

Output:

```json id="n6"
{
  "sentiment": "positive",
  "score": 0.64,
  "article_count": 42
}
```

---

## 5. `detect_emerging_topics` (very valuable)

> "What is gaining attention?"

Output:

```json id="n7"
[
  {
    "topic": "Ethereum ETF",
    "trend_score": 0.91,
    "related_entities": ["ETH", "BlackRock", "SEC"],
    "count": 8,
    "avg_importance": 0.17
  }
]
```

---
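The `trend_score` above can come from something as simple as the share of a topic's mentions that landed in a recent window, with average importance carried along. A rough sketch; the window sizes and the scoring formula are assumptions, not the shipped logic:

```python
from collections import defaultdict


def emerging_topics(clusters, now, recent_window=6 * 3600, baseline_window=48 * 3600):
    """Score topics by how concentrated their mentions are in the recent window.

    `clusters` is an iterable of dicts with "topic", "importance", "last_updated".
    """
    recent = defaultdict(list)
    baseline = defaultdict(int)
    for c in clusters:
        age = now - c["last_updated"]
        if age <= baseline_window:
            baseline[c["topic"]] += 1
        if age <= recent_window:
            recent[c["topic"]].append(c["importance"])

    results = []
    for topic, imps in recent.items():
        # Share of the topic's baseline mentions that fall in the recent window:
        # 1.0 means the topic is brand new, small values mean old news.
        trend_score = len(imps) / baseline[topic]
        results.append({
            "topic": topic,
            "trend_score": round(trend_score, 2),
            "count": len(imps),
            "avg_importance": round(sum(imps) / len(imps), 2),
        })
    return sorted(results, key=lambda r: r["trend_score"], reverse=True)
```

Because every recent mention also counts toward the baseline, the ratio is always defined and bounded by 1.0.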
## 6. `get_related_entities`

> "What entities tend to appear with X?"

```json id="n8"
{
  "subject": "Iran",
  "timeframe": "24h",
  "limit": 10
}
```

Output:

```json id="n9"
[
  {
    "entity": "United States",
    "count": 5,
    "avg_importance": 0.11,
    "sentiment": "negative",
    "score": -0.2
  }
]
```

👉 entity-only co-occurrence neighborhood for real-time sense-making

---

# ⚠️ 3. What NOT to expose

Avoid:

* raw RSS feeds
* individual article endpoints
* unprocessed headlines

❌ Bad:

```id="bad-news"
get_raw_articles()
```

👉 This destroys signal quality for agents.

---

# 🔁 4. Caching & Freshness Strategy

## Key difference from crypto:

* News is **append-only + evolving**
* Not real-time tick data

---

## Strategy:

### Fetch layer:

* poll every few minutes

### Cluster layer:

* update clusters incrementally

### Tool responses:

* no heavy recomputation
* serve from the processed store

---

# 🧠 5. Deduplication Strategy (critical)

Start simple:

### v1:

* normalize titles (lowercase, strip punctuation)
* fuzzy match (threshold ~0.8)

### v2:

* embeddings / semantic similarity

---

# ⚡ 6. Signal Quality Rules

Your MCP should:

### ✅ Do:

* reduce 100 articles → 5–10 clusters
* highlight consensus
* surface importance

### ❌ Don't:

* overwhelm the agent with volume
* pass conflicting duplicates
* expose noise

---

# 🧩 7. Relationship to Other MCPs

This MCP becomes powerful when combined with:

* crypto MCP → price
* trends MCP → attention

👉 News MCP provides:

> **causal narratives**

---

# 🧭 8. Design Philosophy

Each tool should answer:

> "What is happening, and why should I care?"

---

# 🚀 9. Suggested Build Order

1. RSS ingestion
2. normalization
3. basic deduplication
4. clustering
5. simple summarization
6. entity tagging

👉 Only then expose tools.

---
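The v1 deduplication strategy in section 5 (normalize titles, fuzzy match at ~0.8) needs nothing beyond the stdlib. A minimal sketch; `assign_cluster` and its cluster-id scheme are illustrative:

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()


def similar(a: str, b: str) -> float:
    """Fuzzy similarity of two normalized titles, in [0, 1]."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def assign_cluster(title: str, clusters: dict[str, str], threshold: float = 0.8) -> str:
    """Return the id of the best-matching cluster, or create a new one.

    `clusters` maps cluster_id -> canonical headline.
    """
    best_id, best_score = None, 0.0
    for cid, headline in clusters.items():
        score = similar(title, headline)
        if score > best_score:
            best_id, best_score = cid, score
    if best_id is not None and best_score >= threshold:
        return best_id
    new_id = f"cluster_{len(clusters)}"
    clusters[new_id] = title
    return new_id
```

The v2 upgrade keeps the same `assign_cluster` shape and only swaps `similar` for an embedding-based comparison.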
# 🧠 Final takeaway

> Crypto MCP gives you **facts**
> News MCP gives you **meaning**

But only if you:

* aggressively deduplicate
* cluster events
* compress information

---

# ✅ Completed since this outlook was written

* v0.1.0 released and tagged
* provider-agnostic LLM extraction/summarization layer added
* prompts moved into separate files for easier updates
* entity blacklist implemented and made case-insensitive
* wildcard blacklist support added for entities/topics/keywords
* live extraction smoke test added
* JSON-backed alias map added for query normalization
* query normalization added so shorthand like `btc` and `trump` still works
* docs updated with the new env vars and workflow
* optional article payloads added to event tools
* blacklist enforcement maintenance script added
* related-entities tool added for co-occurrence neighborhoods
* emerging-topic scoring improved with importance weighting and co-occurrence

---

# 🔭 Next high-level steps

## What is left of v0.1.0

The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:

* stabilize extraction quality across a few more real-world samples
* expand the alias map only where usage demands it
* tune emerging-topic noise so repeated source names do not dominate
* keep sentiment labels aligned with scores as the model improves

## Where v0.2.0 should lead

1. **Normalization layer**
   * canonicalize acronyms and entity variants before storage / querying
   * keep the blacklist as a separate post-processing rule
2. **Wildcard blacklist support**
   * allow patterns for entities / topics / keywords
   * keep matching case-insensitive
3. **Emerging signal quality**
   * tune what counts as an emerging topic/entity
   * reduce noise from repeated source names and generic terms
4. **Entity/time tracking and replay (future capability)**
   * track how important entities evolve over time
   * allow replay of when entities first appeared, how topics shifted, and how sentiment changed
   * useful later for narrative reconstruction and trend timelines

## Longer-term direction

The endgame is not just "news search", but a light narrative memory system:

* entity histories over time
* topic shifts and turning points
* sentiment arcs
* replayable timelines for a person, company, or event

That endgame should stay in mind while keeping the current implementation simple.
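The replayable-timeline idea can stay simple at first: append `(timestamp, sentiment, importance)` events per entity and replay them oldest-first. A minimal sketch under those assumptions; `EntityTimeline` is a hypothetical name, not part of the current implementation:

```python
from collections import defaultdict


class EntityTimeline:
    """Append-only per-entity history, replayable in time order."""

    def __init__(self):
        # entity -> list of (timestamp, sentiment, importance) tuples
        self._events = defaultdict(list)

    def record(self, entity: str, timestamp: float,
               sentiment: float, importance: float) -> None:
        self._events[entity].append((timestamp, sentiment, importance))

    def replay(self, entity: str):
        """Yield the entity's events oldest-first."""
        yield from sorted(self._events[entity])

    def first_seen(self, entity: str):
        """Timestamp of the entity's first recorded event, or None."""
        events = self._events[entity]
        return min(events)[0] if events else None
```

Sentiment arcs and topic turning points then become read-side queries over `replay`, which keeps the write path as simple as the rest of v0.1.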