# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:
```bash
cd news-mcp
python3 -m venv .venv          # first run only, if .venv does not exist yet
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:
```bash
docker compose up --build
```

Endpoints:
- MCP: `http://127.0.0.1:8506/mcp/sse`
- Health: `http://127.0.0.1:8506/health`
- Dashboard: `http://127.0.0.1:8506/dashboard/`

## What it does

- Fetches from configured news feeds (`NEWS_FEED_URLS`)
- **Three-layer dedup**: feed hash → article URL → content hash (detects in-place updates)
- Clusters articles by title similarity (≥ 0.75), Jaccard overlap (≥ 0.55), a dual-signal rule with lower floors on both scores, or embedding cosine similarity (≥ 0.885)
- Enriches clusters with LLM: topic, entities, sentiment, keywords, summary
- Resolves entities via Google Trends suggestions
- Dashboard with runtime Config page

## Tools (MCP)

| Tool | Description |
|------|-------------|
| `get_latest_events(topic, limit, include_articles)` | Latest clusters by topic |
| `get_events_for_entity(entity, limit, timeframe, include_articles)` | Clusters matching entity |
| `get_event_summary(event_id, include_articles)` | LLM narrative for cluster |
| `detect_emerging_topics(limit)` | Emerging signals from recent clusters |
| `get_news_sentiment(entity, timeframe)` | Aggregated sentiment |
| `get_related_recent_entities(subject, timeframe, limit)` | Co-occurrence + Trends blend |
| `get_feeds()` / `toggle_feed(url, enabled)` | Feed management |
| `debug_dedup(url, title?)` | Inspect dedup decisions & similarity signals |
| `get_capabilities()` | Tool surface documentation |
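
For illustration, an MCP tool invocation is a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that payload for `get_latest_events`; the argument values are made up, `build_tool_call` is a hypothetical helper, and session setup over SSE is left to whichever MCP client library you use.

```python
import json

def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request in the shape MCP expects."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# Argument names come from the tool table above; the values are illustrative.
payload = build_tool_call(
    "get_latest_events",
    {"topic": "crypto", "limit": 5, "include_articles": False},
)
print(json.dumps(payload, indent=2))
```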

## REST API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Stats, freshness, feed state, pruning |
| GET | `/api/v1/clusters` | Paginated clusters |
| GET | `/api/v1/sentiment-series` | Sentiment time-series |
| GET | `/api/v1/entities` | Top entities by frequency |
| GET | `/api/v1/keywords` | Top keywords by frequency |
| GET | `/api/v1/clusters/by-entity` | Entity search (SQL) |
| GET | `/api/v1/clusters/by-keyword` | Keyword search (SQL) |
| GET | `/api/v1/cluster/{id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable feed |
| GET | `/api/v1/config` | All config parameters |
| POST | `/api/v1/config/update` | Update a parameter |
| POST | `/api/v1/config/reset` | Reset to defaults |
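
A minimal client sketch for the endpoints above, using only the standard library. The base URL is the local default from Quick start. The query parameter names (`entity`, `limit`) are assumptions and are not confirmed by this README; check the server code before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8506"  # local default from Quick start

def api_url(path: str, **params) -> str:
    """Join the base URL, an API path, and optional query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")

def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# Build URLs only; nothing is sent until get_json() is called.
health_url = api_url("/api/v1/health")
# `entity` and `limit` are assumed parameter names (hypothetical).
entity_url = api_url("/api/v1/clusters/by-entity", entity="Nvidia", limit=5)
print(health_url)
print(entity_url)
```

With the server running, `get_json(health_url)` returns the decoded health payload.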

## Configuration

All parameters are stored in the `site_config` DB table and can be edited on the dashboard Config page.
On first startup, the table is seeded from `.env`, falling back to built-in defaults.

Key `.env` variables (seeded into `site_config`):

| Variable | Default | Purpose |
|----------|---------|---------|
| `NEWS_FEED_URLS` | — | Comma-separated feed URLs |
| `NEWS_REFRESH_INTERVAL_SECONDS` | 300 | Polling interval |
| `NEWS_DEFAULT_LOOKBACK_HOURS` | 24 | Read freshness window |
| `NEWS_RETENTION_DAYS` | 10 | Prune threshold |
| `NEWS_PRUNE_INTERVAL_HOURS` | 12 | Prune check interval |
| `NEWS_CLUSTER_MAX_AGE_HOURS` | 6 | Cross-cycle merge window |
| `NEWS_EMBEDDINGS_ENABLED` | true | Enable Ollama embeddings |
| `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` | 0.885 | Cosine threshold |
| `OLLAMA_BASE_URL` | `http://192.168.0.200:11434` | Ollama API URL |
| `NEWS_EXTRACT_PROVIDER` / `NEWS_SUMMARY_PROVIDER` | groq | LLM provider |
| `NEWS_EXTRACT_MODEL` / `NEWS_SUMMARY_MODEL` | llama4-16e | LLM model |
| `GROQ_API_KEY` / `OPENAI_API_KEY` / `OPENROUTER_API_KEY` | — | API keys |
| `ENTITY_BLACKLIST` | — | Comma-separated entity patterns |
| `ENRICHMENT_MAX_PER_REFRESH` | 0 (unlimited) | LLM enrichments per cycle |

Clustering thresholds (also in `site_config`):
- `title_threshold`: 0.75
- `jaccard_threshold`: 0.55
- `dual_title_floor`: 0.55
- `dual_jaccard_floor`: 0.25
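
One plausible reading of how these four thresholds combine is sketched below: either strong signal is enough on its own, and two weaker signals together also qualify. This is an assumption about the merge rule, not the server's actual code. The tokenizer and the use of `difflib.SequenceMatcher` for title similarity are also stand-ins.

```python
import re
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.75
JACCARD_THRESHOLD = 0.55
DUAL_TITLE_FLOOR = 0.55
DUAL_JACCARD_FLOOR = 0.25

def title_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def should_merge(title_sim: float, jaccard_sim: float) -> bool:
    """Merge on one strong signal, or on two weaker signals together."""
    if title_sim >= TITLE_THRESHOLD or jaccard_sim >= JACCARD_THRESHOLD:
        return True
    return title_sim >= DUAL_TITLE_FLOOR and jaccard_sim >= DUAL_JACCARD_FLOOR

print(should_merge(0.80, 0.00))  # strong title match alone
print(should_merge(0.60, 0.30))  # both floors cleared
print(should_merge(0.60, 0.10))  # neither rule satisfied
```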

## Clustering pipeline

1. **Fetch** all feeds concurrently
2. **Feed hash** — skip unchanged feeds entirely
3. **Retention filter** — drop articles older than `NEWS_RETENTION_DAYS`
4. **Seen articles** — `filter_already_seen()` splits into:
   - `new` → never seen, full processing
   - `unchanged` → same URL, same content hash → skip
   - `changed` → same URL, different content hash → re-cluster + re-enrich
5. **Cluster** — title similarity, Jaccard, embeddings, dual-signal merge
6. **Enrich** — LLM extraction, summarization, sentiment
7. **Prune** — delete clusters older than retention window
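
The three-way split in step 4 can be sketched as follows. The real `filter_already_seen()` signature is not shown in this README, so the `Article` type and the `seen` mapping (URL to stored content hash) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Article:
    url: str
    content_hash: str

def filter_already_seen(articles, seen):
    """Split articles into (new, unchanged, changed).

    `seen` maps URL -> previously stored content hash (assumed shape).
    """
    new, unchanged, changed = [], [], []
    for article in articles:
        stored = seen.get(article.url)
        if stored is None:
            new.append(article)        # never seen: full processing
        elif stored == article.content_hash:
            unchanged.append(article)  # same URL, same hash: skip
        else:
            changed.append(article)    # same URL, new hash: re-cluster + re-enrich
    return new, unchanged, changed

seen = {"https://example.com/a": "h1", "https://example.com/b": "h2"}
batch = [
    Article("https://example.com/a", "h1"),
    Article("https://example.com/b", "h2-updated"),
    Article("https://example.com/c", "h3"),
]
new, unchanged, changed = filter_already_seen(batch, seen)
print(len(new), len(unchanged), len(changed))
```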

## Content-change detection

When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):
1. `content_hash = SHA1(title|summary)` is computed
2. Compared against `seen_articles.content_hash`
3. If different → article is re-clustered into its existing cluster
4. `enriched_at` is cleared → next cycle re-enriches with updated content
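
A minimal sketch of these four steps against an in-memory SQLite database. The table layout (`clusters`, plus a `cluster_id` column on `seen_articles`) and the literal `|` separator inside the hash input are assumptions made for illustration; `apply_update` is a hypothetical helper.

```python
import hashlib
import sqlite3

def content_hash(title: str, summary: str) -> str:
    """SHA-1 over 'title|summary' (the '|' separator is assumed)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()

def apply_update(conn, url: str, title: str, summary: str) -> bool:
    """Return True if the article changed; clear enriched_at on its cluster."""
    new_hash = content_hash(title, summary)
    row = conn.execute(
        "SELECT content_hash, cluster_id FROM seen_articles WHERE url = ?", (url,)
    ).fetchone()
    if row is None or row[0] == new_hash:
        return False
    conn.execute(
        "UPDATE seen_articles SET content_hash = ? WHERE url = ?", (new_hash, url)
    )
    # Clearing enriched_at marks the cluster for re-enrichment next cycle.
    conn.execute("UPDATE clusters SET enriched_at = NULL WHERE id = ?", (row[1],))
    return True

# Hypothetical schema, just enough for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clusters (id INTEGER PRIMARY KEY, enriched_at TEXT);
    CREATE TABLE seen_articles (url TEXT PRIMARY KEY, content_hash TEXT, cluster_id INTEGER);
""")
URL = "https://example.com/story"
conn.execute("INSERT INTO clusters (id, enriched_at) VALUES (1, '2024-01-01T00:00:00Z')")
conn.execute(
    "INSERT INTO seen_articles (url, content_hash, cluster_id) VALUES (?, ?, 1)",
    (URL, content_hash("Breaking", "More to come...")),
)

changed = apply_update(conn, URL, "Breaking", "Full report text")
enriched_at = conn.execute("SELECT enriched_at FROM clusters WHERE id = 1").fetchone()[0]
print(changed, enriched_at)  # True None
```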

## Persistence

- SQLite at `./data/news.sqlite` (local) or `/app/data/news.sqlite` (Docker)
- Schema auto-migrates on startup (ALTER TABLE for new columns)
- Backfill script for seeding `seen_articles` from existing clusters:
  ```bash
  docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
  ```
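
An additive column migration of the kind described above can be sketched with `PRAGMA table_info` and `ALTER TABLE ... ADD COLUMN`. `ensure_column` is a hypothetical helper, not the server's actual migration code. Table and column names are interpolated into SQL, so they must come from trusted code and never from user input.

```python
import sqlite3

def ensure_column(conn, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if missing; return True when it was added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen_articles (url TEXT PRIMARY KEY)")
first = ensure_column(conn, "seen_articles", "content_hash", "TEXT")
second = ensure_column(conn, "seen_articles", "content_hash", "TEXT")
print(first, second)  # True False
```

Running the check on every startup is safe because the second call is a no-op.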

## Dashboard

`http://<host>:8506/dashboard/`

- **Health** — stats, charts, feed status
- **Feeds** — toggle on/off
- **Clusters** — filterable table, click for drill-down modal
- **Sentiment** — time-series chart
- **Entities** — top entities, frequency chart
- **Keywords** — top keywords, frequency chart
- **Config** — runtime parameter tuning (new in v0.5.0)

## Version

See `./version-hash.sh` for the current content hash.

## Prompt evaluation (extraction quality)

The extraction prompt (`prompts/extract_entities.prompt`) is evaluated against a curated set of annotated samples to check how well it separates entities from keywords, especially on smaller models such as `llama-3.1-8b-instant`.

### Running the evaluation

```bash
# Run against default prompt with 30 annotated samples (all 5 topics)
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq

# Run with specific prompt file
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --prompt-file prompts/extract_entities.prompt

# Run against larger model for comparison
python scripts/eval_extraction.py --model deepseek/deepseek-v4-flash --provider openrouter

# Verbose per-sample output
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --verbose

# Collect new samples from live DB for manual annotation
python scripts/eval_extraction.py --collect 30 --output new_samples.json
```

### What it measures

| Metric | Target | Description |
|--------|--------|-------------|
| Entity F1 | ≥ 0.65 | Precision/recall of named entities (proper nouns) |
| Keyword F1 | ≥ 0.40 | Precision/recall of thematic keywords (1-2 word tags) |
| Leakage | 0.0 | Entities appearing in keywords (should never happen) |
| Topic Accuracy | ≥ 0.80 | Correct topic classification (crypto/macro/regulation/ai/other) |
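
For reference, set-based precision, recall, and F1 for a single sample can be computed as in the sketch below. How `scripts/eval_extraction.py` normalizes strings and averages across samples is not documented here, so the case-insensitive matching and the leakage definition (share of predicted keywords that also appear among predicted entities) are assumptions.

```python
def prf1(predicted, gold):
    """Return (precision, recall, F1) for two collections of strings."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def leakage(entities, keywords):
    """Fraction of predicted keywords that are also predicted entities."""
    ents = {e.strip().lower() for e in entities}
    kws = {k.strip().lower() for k in keywords}
    return len(ents & kws) / len(kws) if kws else 0.0

p, r, f1 = prf1(["SEC", "Binance", "Coinbase"], ["SEC", "Binance"])
print(round(p, 3), round(r, 3), round(f1, 3))
print(leakage(["SEC", "Binance"], ["lawsuit", "enforcement"]))
```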

### Annotated samples

The 30 golden samples in `data/annotated_samples.json` cover all 5 topics:
- **regulation** (6): SEC lawsuits, OFAC sanctions, House crypto bills, WAMCO settlement, Cuba sanctions, Iran frozen funds
- **macro** (7): Fed/ECB decisions, China stimulus, OPEC+ cuts, India forex/trade, jobs report
- **crypto** (6): Bitcoin ETF flows, memecoins, seller exhaustion, XRP liquidation, Visa stablecoin, Kalshi
- **ai** (5): Nvidia earnings, Anthropic pause, Microsoft AI, AI bubble debate, Morgan Stanley AI funding
- **other** (6): Israel/Iran strikes, Trump intel firings, Boeing 737 Max, Putin/Trump, Paris bridge, Ukraine drones

### Current results (llama-3.1-8b-instant via Groq)

```
Entity F1:     0.665  (P=0.814 R=0.601)
Keyword F1:    0.468  (P=0.572 R=0.400)
Leakage (avg): 0.000
Topic Acc:     0.867
```

The prompt uses six few-shot examples with explicit entity/keyword decision rules and topic classification boundaries, in particular the `regulation` vs `other` distinction for sanctions enforcement.