# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Default SSE mount (FastMCP):

- `http://127.0.0.1:8506/mcp/sse`

Health:

- `http://127.0.0.1:8506/health`

## What this server provides

- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
- Deduplicates articles into clusters (v1 fuzzy title similarity)
- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
- Applies a case-insensitive entity blacklist after extraction
- Caches clusters + LLM fields in SQLite

## Tools (MCP)

1) `get_latest_events(topic, limit, include_articles=false)`
   - `topic` is a coarse category: `crypto | macro | regulation | ai | other`
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
2) `get_events_for_entity(entity, limit, include_articles=false)`
   - substring, case-insensitive match over extracted `entities`
   - uses a shallow recent scan first, then falls back to a wider historical scan if needed
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
3) `get_event_summary(event_id, include_articles=false)`
   - Groq-written compressed narrative for a given `cluster_id`
   - when `include_articles=true`, includes the underlying `articles` list (with `url`) from the stored cluster
4) `detect_emerging_topics(limit)`
   - derives “emerging” signals from recent cached clusters
5) `get_news_sentiment(entity, timeframe)`
   - aggregates sentiment around an entity from cached enriched clusters
6) `get_related_entities(subject, timeframe, limit)`
   - entity-only co-occurrence neighborhood: for a given subject entity, returns related entities with aggregated `count`, `avg_importance`, and `sentiment`

### Entity aliasing

The server keeps a conservative alias map in `config/entity_aliases.json` for
obvious shorthands like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant to reduce false misses, not to rewrite every possible name variant.

## Configuration

See `news-mcp/.env`. Key variables:

- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
- `GROQ_API_KEY`, `OPENAI_API_KEY`
- `ENTITY_BLACKLIST` (comma-separated, case-insensitive exact entity match)
- `NEWS_PROMPTS_DIR` (override prompt directory)
- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
- `NEWS_FEED_URL` (single feed fallback)
- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
- `NEWS_CLUSTERS_TTL_HOURS`
- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.

## Live extraction smoke test

Run a standardized, fabricated extraction test against the currently configured provider/model:

```bash
./live_tests.sh
```

The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
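When embeddings are in play, the clustering decision comes down to a cosine-similarity check against `NEWS_EMBEDDING_SIMILARITY_THRESHOLD`. The following is a minimal, illustrative sketch of that check only; it is not the server's actual code, and the function names are hypothetical:

```python
from math import sqrt

# Default for NEWS_EMBEDDING_SIMILARITY_THRESHOLD (see Configuration above).
SIMILARITY_THRESHOLD = 0.885


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_merge(embedding_a: list[float],
                 embedding_b: list[float],
                 threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """An incoming article joins an existing cluster only when its
    embedding clears the similarity threshold (hypothetical helper)."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold
```

At the default `0.885`, near-duplicate headlines merge while loosely related stories stay separate; lowering the threshold merges more aggressively, which is what the merge-analysis script below lets you explore before committing.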
## mcporter examples (all news-mcp calls)

Use your existing config path:

```bash
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
```

Inspect server + tools:

```bash
mcporter --config "$CONFIG" list news --schema
```

### 1) Latest events

```bash
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
```

### 2) Events for an entity

```bash
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF limit=10
```

### 3) Event summary (by cluster_id)

```bash
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1

# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=
```

### 4) Emerging topics

```bash
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
```

### 5) Sentiment for an entity

```bash
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
```

### 6) Related entities (co-occurrence neighborhood)

```bash
mcporter --config "$CONFIG" call news.get_related_entities subject=iran timeframe=24h limit=8
mcporter --config "$CONFIG" call news.get_related_entities subject="iran war" timeframe=3d limit=8
```

## Blacklist enforcement (optional back-clean)

If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still contain entities/keywords that would now be filtered at extraction time.
For one-off cleanup, run:

```bash
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
```

This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching entries from `payload.entities` and `payload.keywords` and (if needed) setting `payload.topic = "other"`.

## Embeddings backfill (optional)

If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for older rows before restarting the server:

```bash
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
```

This stores a cluster-level `embedding` and `embedding_model` inside the SQLite payload so the Ollama-first clustering path has data ready to use.

## Embedding merge analysis (optional)

To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:

```bash
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
```

This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.

## Embedding merge pass (optional, destructive)

After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with a dry run:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
```

If the groupings look right, run wet:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
```

This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.

## Article dedup cleanup (optional)

Some stored clusters may contain repeated article entries for the same underlying article id / URL path.
To clean existing rows:

```bash
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
```

The live clustering path also deduplicates article entries when new data comes in.
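The dedup idea can be sketched as keeping the first occurrence of each article identity (id when present, otherwise URL) within a cluster. This is an illustrative sketch only; the `articles` dict keys shown here are assumptions, and the real scripts operate on the stored SQLite payloads:

```python
def dedup_articles(articles: list[dict]) -> list[dict]:
    """Drop repeated article entries from a cluster, keeping the first
    occurrence. Identity is the article 'id' when present, otherwise the
    'url' (hypothetical keys; the stored payload schema may differ)."""
    seen: set[str] = set()
    deduped: list[dict] = []
    for article in articles:
        key = str(article.get("id") or article.get("url") or "")
        if key and key in seen:
            continue  # already have this article in the cluster
        if key:
            seen.add(key)
        deduped.append(article)  # keep entries with no usable identity as-is
    return deduped
```

Running this shape of pass over each cluster's `articles` list is all the cleanup script needs to do conceptually; the `--dry-run` flag would just report counts instead of writing the deduplicated list back.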