📰 News MCP Server — Requirements Spec

Current version: v0.3.1 — see RELEASE_NOTES.md for changelog.

🎯 Goal

Provide structured, deduplicated, topic-aware news signals that an agent can use for reasoning about:

events
narratives
sentiment shifts

👉 Not a feed reader
👉 Not a headline dump
👉 A signal extraction layer

🧠 Core Design Principle

Raw news is useless to agents. Processed news is powerful.

🏗️ 1. Internal Architecture

🧩 Data Sources Layer (`sources/`)

Mix of:

RSS feeds (primary)
optional APIs later

Examples:

Reuters
Bloomberg
CoinDesk

Responsibilities:

fetch articles
normalize format

🔄 Ingestion Pipeline

Runs periodically (e.g. every few minutes)

Steps:

fetch articles
normalize fields:
- title
- url
- source
- timestamp
- summary (if available)
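
A minimal sketch of the normalization step, assuming each raw feed entry arrives as a dict with `title`, `link`, `published`, and optional `summary` keys. That dict shape, the `NormalizedArticle` name, and the fetch-time fallback for missing dates are assumptions for illustration, not part of the spec.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


@dataclass(frozen=True)
class NormalizedArticle:
    title: str
    url: str
    source: str
    timestamp: datetime
    summary: Optional[str] = None


def normalize_entry(entry: dict, source: str) -> NormalizedArticle:
    """Map one raw feed entry (hypothetical dict shape) onto the spec's fields."""
    raw_date = entry.get("published")
    if raw_date:
        # RSS 2.0 uses RFC 2822 style dates, which this stdlib helper parses.
        ts = parsedate_to_datetime(raw_date)
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        ts = ts.astimezone(timezone.utc)
    else:
        # Assumption: fall back to fetch time when the feed gives no date.
        ts = datetime.now(timezone.utc)
    summary = (entry.get("summary") or "").strip() or None
    return NormalizedArticle(
        title=" ".join(entry["title"].split()),  # collapse stray whitespace
        url=entry["link"].strip(),
        source=source,
        timestamp=ts,
        summary=summary,
    )
```

Storing every timestamp in UTC keeps `first_seen` / `last_updated` comparisons simple further down the pipeline.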

🧹 Deduplication Layer

Problem:

Same story appears across many sources.

Solution:

Cluster articles by similarity:

Methods:

title similarity (fuzzy match / embeddings)
URL canonicalization
content similarity (optional later)
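
URL canonicalization could look roughly like this. The list of tracking parameters and the choice to drop `www.`, ports, fragments, and trailing slashes are assumptions; tune them per source.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: illustrative tracking parameters, not a list taken from the spec.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonicalize_url(url: str) -> str:
    """Reduce a URL to a form that is equal for trivially different links."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Sort the remaining parameters and drop the fragment.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(query)), ""))
```

Two articles with the same canonical URL can be treated as the same article before any similarity check runs.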

Output:

{
  "cluster_id": "...",
  "headline": "Canonical headline",
  "articles": [...],
  "sources": ["Reuters", "Bloomberg"],
  "first_seen": "...",
  "last_updated": "..."
}

👉 This is your core unit of truth, not individual articles

🧠 Enrichment Layer

Adds meaning to clusters.

1. Entity extraction

assets (BTC, ETH)
companies
macro topics (inflation, rates)

2. Topic classification

Examples:

crypto
macro
regulation
AI

3. Sentiment (lightweight)

positive / negative / neutral
or simple score

👉 Keep this simple in v1 (don’t over-engineer NLP)

4. Importance scoring (VERY useful)

Heuristic:

number of sources covering it
recency
source credibility
keyword weighting
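
One way to combine the four heuristics into a single 0..1 score. All weights, the half-life, the credibility table, and the keyword table below are placeholder assumptions, not values from this spec.

```python
from datetime import datetime, timedelta, timezone

# Assumption: placeholder tables; real values need tuning against live data.
SOURCE_CREDIBILITY = {"Reuters": 1.0, "Bloomberg": 0.95, "CoinDesk": 0.7}
KEYWORD_WEIGHTS = {"etf": 0.3, "sec": 0.2, "hack": 0.4}


def importance(sources, headline, last_updated, now, half_life_hours=12.0):
    """Weighted blend of coverage, recency, credibility, and keyword hits."""
    # Coverage saturates at 5 distinct sources.
    coverage = min(len(set(sources)) / 5.0, 1.0)
    # Recency halves every `half_life_hours`.
    age_hours = max((now - last_updated).total_seconds() / 3600.0, 0.0)
    recency = 0.5 ** (age_hours / half_life_hours)
    # Use the most credible source covering the story; unknown sources get 0.5.
    credibility = max((SOURCE_CREDIBILITY.get(s, 0.5) for s in sources), default=0.0)
    words = set(headline.lower().split())
    keywords = min(sum(w for k, w in KEYWORD_WEIGHTS.items() if k in words), 1.0)
    score = 0.4 * coverage + 0.3 * recency + 0.2 * credibility + 0.1 * keywords
    return round(score, 3)
```

Because every term is capped at 1 and the weights sum to 1, the result stays in the 0..1 range used by the `importance` field in the tool outputs.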

🗃️ Storage Layer

You need short-term memory:

clusters (not raw articles)
TTL: e.g. 24–72h

Optional:

in-memory store (start)
later: DB

Candidate storage backends include Qdrant, PostgreSQL, and CouchDB.
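
For the in-memory starting point, a TTL-bound cluster store might be sketched as follows. `ClusterStore` and its methods are hypothetical names; the 48h default is just a value inside the 24–72h range above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ClusterStore:
    """In-memory cluster store with lazy TTL expiry (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 48 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, cluster_id: str, cluster: Any) -> None:
        # Every write refreshes the expiry, so active clusters stay alive.
        self._items[cluster_id] = (self._clock() + self._ttl, cluster)

    def get(self, cluster_id: str) -> Optional[Any]:
        item = self._items.get(cluster_id)
        if item is None:
            return None
        expires_at, cluster = item
        if self._clock() >= expires_at:
            del self._items[cluster_id]  # expire lazily on read
            return None
        return cluster

    def purge(self) -> int:
        """Drop all expired clusters and return how many were removed."""
        now = self._clock()
        expired = [cid for cid, (exp, _) in self._items.items() if now >= exp]
        for cid in expired:
            del self._items[cid]
        return len(expired)
```

Keeping the interface to `put` / `get` / `purge` should make a later swap to one of the database backends less painful.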

🧰 2. Agent-Facing Tools (IMPORTANT)

Keep tools high-level and semantic

1. `get_latest_events`

“What is happening right now?”

Input:

{
  "topic": "crypto",
  "limit": 5,
  "include_articles": false
}

Output:

[
  {
    "headline": "...",
    "summary": "...",
    "entities": ["BTC"],
    "sentiment": "positive",
    "importance": 0.82,
    "sources": ["Reuters", "CoinDesk"],
    "timestamp": "...",
    "articles": [
      {
        "title": "...",
        "url": "...",
        "source": "Reuters",
        "timestamp": "..."
      }
    ]
  }
]
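
A rough sketch of how this tool could be served from the processed store. It assumes clusters are dicts that carry a `topics` list and ISO-8601 UTC `timestamp` strings in one consistent format (so string order equals time order); both are assumptions, as is sorting by recency first and importance second.

```python
def get_latest_events(clusters, topic=None, limit=5, include_articles=False):
    """Filter stored clusters by topic and return the newest ones."""
    selected = [c for c in clusters if topic is None or topic in c.get("topics", [])]
    # Newest first; importance breaks ties between equal timestamps.
    selected.sort(key=lambda c: (c["timestamp"], c["importance"]), reverse=True)
    events = []
    for c in selected[:limit]:
        # Strip internal fields and the (potentially large) article list.
        event = {k: v for k, v in c.items() if k not in ("articles", "topics")}
        if include_articles:
            event["articles"] = c.get("articles", [])
        events.append(event)
    return events
```

Leaving `articles` out by default keeps the payload small, which matches the `include_articles: false` default in the input above.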

2. `get_events_for_entity`

“What’s happening with X?”

{
  "entity": "BTC",
  "include_articles": false
}

👉 filters clusters by entity

Optional:

include_articles to include article title/url/source/timestamp in the payload

3. `get_event_summary`

“Explain this event clearly”

{
  "event_id": "cluster_id",
  "include_articles": false
}

Output:

merged summary
key facts
sources
optional articles (title/url/source/timestamp)

👉 This is where you compress multiple articles into one clean narrative

4. `get_news_sentiment`

“What’s the tone around X?”

{
  "entity": "BTC",
  "timeframe": "24h"
}

Output:

{
  "sentiment": "positive",
  "score": 0.64,
  "article_count": 42
}
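
The aggregation behind this output can stay very small. The sketch below assumes per-article scores on a -1..1 scale and a neutral band of ±0.1; both are assumptions rather than values from the spec.

```python
from statistics import mean


def aggregate_sentiment(scores, neutral_band=0.1):
    """Average per-article scores and derive the label from the average."""
    if not scores:
        return {"sentiment": "neutral", "score": 0.0, "article_count": 0}
    avg = mean(scores)
    if avg > neutral_band:
        label = "positive"
    elif avg < -neutral_band:
        label = "negative"
    else:
        label = "neutral"
    return {"sentiment": label, "score": round(avg, 2), "article_count": len(scores)}
```

Deriving the label from the score in one place is a cheap way to keep labels and scores aligned.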

5. `detect_emerging_topics` (very valuable)

“What is gaining attention?”

Output:

[
  {
    "topic": "Ethereum ETF",
    "trend_score": 0.91,
    "related_entities": ["ETH", "BlackRock", "SEC"],
    "count": 8,
    "avg_importance": 0.17
  }
]

6. `get_related_entities`

“What entities tend to appear with X?”

{
  "subject": "Iran",
  "timeframe": "24h",
  "limit": 10
}

Output:

[
  {
    "entity": "United States",
    "count": 5,
    "avg_importance": 0.11,
    "sentiment": "negative",
    "score": -0.2
  }
]

👉 entity-only co-occurrence neighborhood for real-time sense-making
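
A co-occurrence count over recent clusters is enough for a first version. The sketch assumes each cluster is a dict with `entities` and `importance`, and that timeframe filtering has already happened; sentiment fields are left out for brevity.

```python
from collections import defaultdict


def related_entities(clusters, subject, limit=10):
    """Rank entities by how often they share a cluster with `subject`."""
    counts = defaultdict(int)
    importance_sum = defaultdict(float)
    for cluster in clusters:
        entities = set(cluster["entities"])
        if subject not in entities:
            continue
        for entity in entities - {subject}:
            counts[entity] += 1
            importance_sum[entity] += cluster.get("importance", 0.0)
    rows = [
        {"entity": e, "count": n, "avg_importance": round(importance_sum[e] / n, 2)}
        for e, n in counts.items()
    ]
    # Most frequent first, then most important, then alphabetical for stability.
    rows.sort(key=lambda r: (-r["count"], -r["avg_importance"], r["entity"]))
    return rows[:limit]
```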

⚠️ 3. What NOT to expose

Avoid:

raw RSS feeds
individual article endpoints
unprocessed headlines

❌ Bad:

get_raw_articles()

👉 This destroys signal quality for agents

🔁 4. Caching & Freshness Strategy

Key difference from crypto:

News is append-only + evolving
Not real-time tick data

Strategy:

Fetch layer:

poll every few minutes

Cluster layer:

update clusters incrementally

Tool responses:

no heavy recomputation
serve from processed store

🧠 5. Deduplication Strategy (critical)

Clustering is the unit of truth, not individual articles.

Signal cascade (cheapest first, short-circuit on match):

Cosine similarity (if embeddings enabled) against cluster centroid
Fuzzy title similarity (SequenceMatcher, configurable threshold, default 0.87)
Token Jaccard over headline+summary (default threshold 0.55)
Consensus: cosine ≥ 0.80 AND (jaccard ≥ 0.30 OR title ≥ 0.55)

Each new article is compared against all articles in a candidate cluster; the best signal across all members is used.
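
The heuristic part of the cascade can be sketched with the standard library alone. The thresholds are the defaults listed above; the function names, the tokenizer, and passing in a precomputed `cosine` value are assumptions. No standalone cosine threshold is stated here, so the sketch uses cosine only inside the consensus rule.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

TITLE_THRESHOLD = 0.87    # default fuzzy title threshold
JACCARD_THRESHOLD = 0.55  # default token Jaccard threshold


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_same_story(new: dict, member: dict, cosine: Optional[float] = None) -> bool:
    """Compare one new article against one cluster member."""
    title = title_similarity(new["title"], member["title"])
    if title >= TITLE_THRESHOLD:
        return True  # short-circuit on a strong title match
    jac = jaccard(new["title"] + " " + new.get("summary", ""),
                  member["title"] + " " + member.get("summary", ""))
    if jac >= JACCARD_THRESHOLD:
        return True
    # Consensus: weaker lexical overlap is accepted when embeddings agree.
    return cosine is not None and cosine >= 0.80 and (jac >= 0.30 or title >= 0.55)


def matches_cluster(new: dict, members: list, cosines: Optional[list] = None) -> bool:
    """A cluster matches if any of its members matches the new article."""
    cosines = cosines or [None] * len(members)
    return any(is_same_story(new, m, c) for m, c in zip(members, cosines))
```

Checking every member rather than only the cluster headline helps when a story's wording drifts between sources.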

Stable cluster IDs: sha1(topic | min_article_key) — the same set of articles always maps to the same ID regardless of which article arrived first or which polling cycle created the cluster.
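
A sketch of the ID scheme, assuming `|` is a literal separator and that article keys are plain strings (for example canonical URLs); neither detail is confirmed here.

```python
import hashlib


def stable_cluster_id(topic: str, article_keys) -> str:
    """sha1 over the topic and the smallest article key in the cluster."""
    min_key = min(article_keys)  # independent of arrival order
    return hashlib.sha1(f"{topic}|{min_key}".encode("utf-8")).hexdigest()
```

Because only the minimum key is hashed, the ID does not depend on the order in which articles arrived.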

Cross-cycle merge: the poller loads recent clusters from the DB (controlled by NEWS_CLUSTER_MAX_AGE_HOURS, default 4h) and seeds them as merge targets before clustering. New articles can merge into clusters from previous polling cycles.

Orphan merge: a post-clustering Union-Find pass merges clusters that share article keys, catching cases where articles about the same event didn't match during the main loop.
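
The orphan pass is a textbook Union-Find over shared article keys. A minimal sketch, assuming clusters are given as a mapping from cluster ID to a set of article keys (the function name and input shape are illustrative):

```python
def merge_overlapping(clusters):
    """clusters: dict of cluster_id -> set of article keys. Returns merged groups."""
    parent = {cid: cid for cid in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Smaller ID becomes the root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    owner = {}  # article key -> first cluster seen holding it
    for cid, keys in clusters.items():
        for key in keys:
            if key in owner:
                union(cid, owner[key])
            else:
                owner[key] = cid

    merged = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return merged
```

Merges are transitive: if A shares a key with B and B shares one with D, all three end up in one group.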

Planned runtime order:

when NEWS_EMBEDDINGS_ENABLED=true, try Ollama embeddings first
if Ollama fails, fall back to the existing heuristic cluster path
keep candidate pre-filtering cheap before any vector compare

⚡ 6. Signal Quality Rules

Your MCP should:

✅ Do:

reduce 100 articles → 5–10 clusters
highlight consensus
surface importance

❌ Don’t:

overwhelm agent with volume
pass conflicting duplicates
expose noise

🧩 7. Relationship to Other MCPs

This MCP becomes powerful when combined with:

crypto MCP → price
trends MCP → attention

👉 News MCP provides:

causal narratives

🧭 8. Design Philosophy

Each tool should answer:

“What is happening, and why should I care?”

🚀 9. Suggested Build Order

RSS ingestion
normalization
basic deduplication
clustering
simple summarization
entity tagging

👉 Only then expose tools

🧠 Final takeaway

Crypto MCP gives you facts. News MCP gives you meaning.

But only if you:

aggressively deduplicate
cluster events
compress information

✅ Completed since this outlook was written

v0.1.0 released and tagged
provider-agnostic LLM extraction/summarization layer added
prompts moved into separate files for easier updates
entity blacklist implemented and made case-insensitive
wildcard blacklist support added for entities/topics/keywords
live extraction smoke test added
JSON-backed alias map added for query normalization
query normalization added so shorthand like btc and trump still works
docs updated with the new env vars and workflow
optional article payloads added to event tools
blacklist enforcement maintenance script added
related-entities tool added for co-occurrence neighborhoods
emerging-topic scoring improved with importance-weighting and co-occurrence
concurrent RSS/Ollama/LLM pipelines added (v0.3.0)
stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signal comparison added (v0.3.1)

🔭 Next high-level steps

What is left of v0.1.0

The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:

stabilize extraction quality across a few more real-world samples
expand the alias map only where usage demands it
tune emerging-topic noise so repeated source names do not dominate
keep sentiment labels aligned with scores as the model improves

Where v0.2.0 should lead

Future plan (worth building slowly): “Emerging entity graph over time”

Right now detect_emerging_topics() returns a flat list of emerging topics/entities. Next-level idea: turn it into an entity graph that an agent can reason over.

Core concept

Collapse/group results into canonical entity nodes (e.g. iran, israel, donald_trump, strait_of_hormuz, etc.)
Build weighted edges from co-occurrence in recent clusters:
- edge weight ~ frequency/co-occurrence strength
- node weight ~ trend_score + count (+ optional avg_importance)
Infer communities (graph grouping) so related nodes form stable “story neighborhoods”

Over time (the important part)

Each refresh window produces a snapshot of the graph
Store snapshots / deltas to observe:
- rising/falling node weights (“momentum”)
- strengthening/weakening relations
- emerging communities and topic shifts
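
To make the snapshot idea concrete, here is a small sketch of edge weights and deltas. It assumes each cluster has already been reduced to a list of canonical entity names; the function names are hypothetical and community detection is left out.

```python
from collections import Counter
from itertools import combinations


def edge_weights(clusters):
    """Count how often each canonical entity pair co-occurs within a cluster."""
    weights = Counter()
    for entities in clusters:
        # Sort so (a, b) and (b, a) land on the same undirected edge.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


def snapshot_delta(previous, current):
    """Edge-weight change between two snapshots; positive means strengthening."""
    edges = set(previous) | set(current)
    return {
        e: current.get(e, 0) - previous.get(e, 0)
        for e in edges
        if current.get(e, 0) != previous.get(e, 0)
    }
```

Storing only the non-zero deltas per refresh window keeps history cheap while still exposing momentum and newly formed relations.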

Suggested output for an eventual agent tool

get_emerging_entity_graph(timeframe, limit) returning:
- grouped communities
- top nodes + weights
- top relations + direction (optional)
- summary of “what changed since last snapshot”

This needs extra time to become a real usable MCP tool, so it’s intentionally captured here for later execution.

Normalization layer
- canonicalize acronyms and entity variants before storage / querying
- keep the blacklist as a separate post-processing rule
Wildcard blacklist support
- allow patterns for entities / topics / keywords
- keep matching case-insensitive
Emerging signal quality
- tune what counts as an emerging topic/entity
- reduce noise from repeated source names and generic terms
Entity/time tracking and replay (future capability)
- track how important entities evolve over time
- allow replay of when entities first appeared, how topics shifted, and how sentiment changed
- useful later for narrative reconstruction and trend timelines
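
For the wildcard blacklist item above, case-insensitive pattern matching can lean on the standard library. This sketch assumes shell-style patterns (`*`, `?`); the actual pattern syntax used by the project is not stated here.

```python
from fnmatch import fnmatchcase


def is_blacklisted(term: str, patterns) -> bool:
    """True if `term` matches any shell-style pattern, ignoring case."""
    lowered = term.lower()
    # Lowercase both sides and use fnmatchcase so behavior is the same on every OS.
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)
```

Running this as a post-processing rule after normalization keeps the blacklist separate from the alias map, as described above.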

Longer-term direction

The endgame is not just “news search”, but a light narrative memory system:

entity histories over time
topic shifts and turning points
sentiment arcs
replayable timelines for a person, company, or event

That should stay in mind while keeping the current implementation simple.
