# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:

```bash
docker compose up --build
```

Default SSE mount (FastMCP):

- `http://127.0.0.1:8506/mcp/sse`

Health:

- `http://127.0.0.1:8506/health`

## What this server provides

- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
- Deduplicates articles into clusters (v1 fuzzy title similarity)
- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
- Applies a case-insensitive entity blacklist after extraction
- Caches clusters + LLM fields in SQLite
- Resolves entities in-process via Google Trends suggestions (no `trends-mcp` hop required for entity resolution)

## Tools (MCP)

1) `get_latest_events(topic, limit, include_articles=false)`
   - `topic` is a coarse category: `crypto | macro | regulation | ai | other`
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
2) `get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)`
   - substring, case-insensitive match over extracted `entities`
   - uses the requested timeframe as the scan window; `limit` is the cap within that window
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
3) `get_event_summary(event_id, include_articles=false)`
   - LLM-written compressed narrative for a given `cluster_id`
   - when `include_articles=true`, includes the underlying `articles` list (with `url`) from the stored cluster
4) `detect_emerging_topics(limit)`
   - derives "emerging" signals from recent cached clusters
5) `get_news_sentiment(entity, timeframe)`
   - aggregates sentiment around an entity from cached enriched clusters
6) `get_related_recent_entities(subject, timeframe, limit, include_trends=true)`
   - merges recent co-occurrence data from cached clusters with Google Trends suggestions and returns related entities (with `mid` when available) plus source/score metadata
7) `get_capabilities()`
   - describes the server's tool surface, composition recipes, and output conventions for agents

### Entity aliasing

The server keeps a conservative alias map in `config/entity_aliases.json` for obvious shorthands like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant to reduce false misses, not to rewrite every possible name variant.

## Configuration

See `news-mcp/.env`. Key variables:

- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
- `GROQ_API_KEY`, `OPENAI_API_KEY`
- `ENTITY_BLACKLIST` (comma-separated, case-insensitive patterns; wildcards are supported)
- `NEWS_PROMPTS_DIR` (override prompt directory)
- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
- `NEWS_FEED_URL` (single feed fallback)
- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
- `NEWS_DEFAULT_LOOKBACK_HOURS` (freshness window for reads; older rows are ignored by queries)
- `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
- `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
- `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing
heuristic clustering path if Ollama is unavailable.

## Persistence and migration

The default database path is project-relative:

- `NEWS_MCP_DATA_DIR=./data`
- `NEWS_MCP_DB_PATH=./data/news.sqlite`

That keeps persistence inside the repository tree in both local and Docker runs.

Recommended workflow:

1. Keep **code** in Git.
2. Keep the **data directory** outside Git but inside the project tree.
3. Use `rsync` for the initial data transfer to a remote server.
4. After that, move code with Git and move data only when you actually need a fresh copy.

Example initial transfer:

```bash
rsync -a ./data/ user@remote:/srv/news-mcp/data/
```

If you change the location later, override the defaults with:

- `NEWS_MCP_DATA_DIR`
- `NEWS_MCP_DB_PATH`

## TTL vs pruning

These are intentionally different:

- `NEWS_DEFAULT_LOOKBACK_HOURS` controls **read freshness** only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- `NEWS_PRUNING_ENABLED` controls whether the server is allowed to **physically delete** old rows.
- `NEWS_RETENTION_DAYS` controls how old rows may get before they are deleted.
- `NEWS_PRUNE_INTERVAL_HOURS` controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:

- on startup
- after refresh cycles (prune-if-due)

If `NEWS_PRUNING_ENABLED=false`, no pruning occurs and old rows are retained indefinitely.

## Live extraction smoke test

Run a standardized, fabricated extraction test against the currently configured provider/model:

```bash
./live_tests.sh
```

The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
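The TTL-versus-pruning split described earlier amounts to two different timestamp filters over the same table. A minimal sketch, assuming a simplified `clusters(id, created_at)` schema (illustrative only, not the server's actual tables):

```python
import sqlite3
import time

LOOKBACK_HOURS = 144   # stand-in for NEWS_DEFAULT_LOOKBACK_HOURS: read freshness only
RETENTION_DAYS = 30    # stand-in for NEWS_RETENTION_DAYS: physical delete threshold

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, created_at REAL)")

now = time.time()
db.executemany("INSERT INTO clusters VALUES (?, ?)", [
    (1, now - 1 * 3600),     # 1 hour old   -> readable, retained
    (2, now - 200 * 3600),   # ~8 days old  -> hidden from reads, but retained
    (3, now - 40 * 86400),   # 40 days old  -> hidden, and prunable
])

# Reads only see rows inside the lookback window; older rows stay on disk.
fresh = db.execute(
    "SELECT id FROM clusters WHERE created_at >= ?",
    (now - LOOKBACK_HOURS * 3600,),
).fetchall()

# Pruning (when enabled) physically deletes rows past the retention threshold.
db.execute("DELETE FROM clusters WHERE created_at < ?", (now - RETENTION_DAYS * 86400,))
remaining = [r[0] for r in db.execute("SELECT id FROM clusters ORDER BY id")]

print(fresh)      # [(1,)]
print(remaining)  # [1, 2]
```

Lookback only narrows what queries return; retention is the only setting that actually deletes rows.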
## mcporter examples (all news-mcp calls)

Use your existing config path:

```bash
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
```

Inspect server + tools:

```bash
mcporter --config "$CONFIG" list news --schema
```

### 1) Latest events

```bash
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
```

### 2) Events for an entity

```bash
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
```

### 3) Event summary (by cluster_id)

```bash
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=
```

### 4) Emerging topics

```bash
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
```

### 5) Sentiment for an entity

```bash
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
```

### 6) Related entities (recent neighborhood + trends blending)

```bash
# Iran: blend local co-occurrence with Google Trends related topics
mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
# Another seed phrase
mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
```

### 7) Capabilities / composition guidance

```bash
mcporter --config "$CONFIG" call news.get_capabilities
```

Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g. `cluster_id`), and how to present sources/timestamps consistently.

## Blacklist enforcement (optional back-clean)

If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still contain entities/keywords that would now be filtered at extraction time. For one-off cleanup, run:

```bash
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
```

This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching entries from `payload.entities` and `payload.keywords` and (if needed) setting `payload.topic = "other"`.

## Embeddings backfill (optional)

If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for older rows before restarting the server:

```bash
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
```

This stores a cluster-level `embedding` and `embedding_model` inside the SQLite payload so the Ollama-first clustering path has data ready to use.

## Embedding merge analysis (optional)

To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:

```bash
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
```

This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.

## Embedding merge pass (optional, destructive)

After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with a dry run:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
```

If the groupings look right, run wet:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
```

This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
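The merge pass above ultimately reduces to a cosine-similarity threshold test between stored cluster embeddings. A minimal sketch of that decision, assuming plain float lists rather than the server's stored payload format (`merge_candidates` is a hypothetical helper, not one of the scripts):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_candidates(embeddings: dict[str, list[float]], threshold: float = 0.90):
    """Yield cluster-id pairs whose embedding similarity clears the threshold."""
    ids = sorted(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                yield (a, b)

vecs = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.99, 0.05, 0.0],   # near-duplicate of c1
    "c3": [0.0, 1.0, 0.0],     # unrelated
}
print(list(merge_candidates(vecs, threshold=0.90)))   # [('c1', 'c2')]
```

Raising the threshold (e.g. from 0.85 to 0.90) trades recall for precision, which is why the analysis script prints candidates at several thresholds before any destructive pass.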
## Article dedup cleanup (optional)

Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:

```bash
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
```

The live clustering path also deduplicates article entries when new data comes in. As of the latest hardening, the server/storage write path also self-heals `payload.articles` by deduplicating before persisting, so historical rows can be fixed via the cleanup script and future writes won't reintroduce duplicates.

## Dashboard

A browser-based monitoring dashboard is available at:

```
http://127.0.0.1:8506/dashboard/
```

**Views:**

- **Health**: cluster/entity counts, freshness indicator, topic distribution (doughnut chart), sentiment overview, feed activity
- **Clusters**: filterable/sortable table with topic, sentiment, importance, entity chips; search by keyword
- **Sentiment**: time-series chart (avg sentiment per configurable time bucket) with cluster count overlay
- **Entities**: top entities by mention frequency, horizontal bar chart, click for detail
- **Detail**: click any cluster row or search by cluster ID for full drill-down (summary, key facts, articles, keywords, entities)

**Tech:** Pure static HTML/JS with Chart.js for visualizations, served from the same FastAPI process at `/dashboard`. No extra dependencies.

**Configuration defaults:** The dashboard's default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default 144h).

## REST API

The following read-only endpoints are available for programmatic access (in addition to the MCP SSE tools):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: cluster/entity counts, freshness, feed state, pruning config |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail with summary, facts, articles |

## Startup behavior

The server uses a lifespan-based startup (FastAPI ≥ 0.111). Background feed refresh and pruning run as fire-and-forget coroutines, so the HTTP API and dashboard are available immediately, with no blocking on the first fetch cycle.

Important env vars controlling background behavior:

- `NEWS_BACKGROUND_REFRESH_ENABLED=true`: enable/disable the background loop
- `NEWS_BACKGROUND_REFRESH_ON_START=true`: fetch immediately on startup (previously `false`; changed to `true` for faster first data)
- `NEWS_REFRESH_INTERVAL_SECONDS=900`: polling interval between refresh cycles
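The `bucket_hours` parameter of `/api/v1/sentiment-series` can be pictured as fixed-width time bucketing with a per-bucket average. A sketch under that assumption (field names and shapes are illustrative, not the endpoint's actual JSON):

```python
from collections import defaultdict

def sentiment_series(rows: list[tuple[int, float]], bucket_hours: int) -> dict[int, float]:
    """Group (unix_ts, sentiment) pairs into fixed-width buckets and average each."""
    width = bucket_hours * 3600
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, sentiment in rows:
        # Bucket key = start timestamp of the bucket containing ts.
        buckets[int(ts // width) * width].append(sentiment)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Two rows in hours 0-2, one row in hours 2-4.
rows = [(0, 0.2), (1800, 0.4), (7200, -0.5)]
print(sentiment_series(rows, bucket_hours=2))
```

The dashboard's Sentiment view consumes exactly this kind of shape: one averaged value per bucket, plus a cluster count overlay.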
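The fire-and-forget startup described above can be sketched with plain asyncio, independent of FastAPI's `lifespan` machinery (names and intervals here are illustrative):

```python
import asyncio

REFRESH_INTERVAL_SECONDS = 0.01   # stand-in for NEWS_REFRESH_INTERVAL_SECONDS

refreshes: list[str] = []

async def refresh_loop() -> None:
    """Background feed refresh: loops forever and is never awaited by startup."""
    while True:
        refreshes.append("refreshed")   # real code would fetch + cluster feeds here
        await asyncio.sleep(REFRESH_INTERVAL_SECONDS)

async def main() -> None:
    # Fire-and-forget: schedule the loop, then continue startup immediately.
    task = asyncio.create_task(refresh_loop())
    # ...the HTTP server would begin serving here, before any refresh finishes...
    await asyncio.sleep(0.05)           # simulate the server running briefly
    task.cancel()                       # shutdown: stop the background loop

asyncio.run(main())
print(len(refreshes))   # several iterations ran while the "server" was up
```

Because the loop is scheduled rather than awaited, a slow feed or LLM call delays only the first data, never the availability of the HTTP endpoints.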