# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:

```bash
docker compose up --build
```

Endpoints:

- MCP: `http://127.0.0.1:8506/mcp/sse`
- Health: `http://127.0.0.1:8506/health`
- Dashboard: `http://127.0.0.1:8506/dashboard/`

## What it does

- Fetches from configured news feeds (`NEWS_FEED_URLS`)
- **Three-layer dedup**: feed hash → article URL → content hash (detects in-place updates)
- Clusters articles via title similarity (≥0.75), Jaccard (≥0.55), dual-signal, or embeddings
- Enriches clusters with LLM: topic, entities, sentiment, keywords, summary
- Resolves entities via Google Trends suggestions
- Dashboard with runtime Config page

## Tools (MCP)

| Tool | Description |
|------|-------------|
| `get_latest_events(topic, limit, include_articles)` | Latest clusters by topic |
| `get_events_for_entity(entity, limit, timeframe, include_articles)` | Clusters matching entity |
| `get_event_summary(event_id, include_articles)` | LLM narrative for cluster |
| `detect_emerging_topics(limit)` | Emerging signals from recent clusters |
| `get_news_sentiment(entity, timeframe)` | Aggregated sentiment |
| `get_related_recent_entities(subject, timeframe, limit)` | Co-occurrence + Trends blend |
| `get_feeds()` / `toggle_feed(url, enabled)` | Feed management |
| `debug_dedup(url, title?)` | Inspect dedup decisions & similarity signals |
| `get_capabilities()` | Tool surface documentation |

## REST API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Stats, freshness, feed state, pruning |
| GET | `/api/v1/clusters` | Paginated clusters |
| GET | `/api/v1/sentiment-series` | Sentiment time-series |
| GET | `/api/v1/entities` | Top entities by frequency |
| GET | `/api/v1/keywords` | Top keywords by frequency |
| GET | `/api/v1/clusters/by-entity` | Entity search (SQL) |
| GET | `/api/v1/clusters/by-keyword` | Keyword search (SQL) |
| GET | `/api/v1/cluster/{id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable feed |
| GET | `/api/v1/config` | All config parameters |
| POST | `/api/v1/config/update` | Update a parameter |
| POST | `/api/v1/config/reset` | Reset to defaults |

## Configuration

All parameters are stored in the `site_config` DB table and editable via the dashboard Config page. On first startup, seeded from `.env` or built-in defaults.

Key `.env` vars (seeded into site_config):

| Variable | Default | Purpose |
|----------|---------|---------|
| `NEWS_FEED_URLS` | — | Comma-separated feed URLs |
| `NEWS_REFRESH_INTERVAL_SECONDS` | 300 | Polling interval |
| `NEWS_DEFAULT_LOOKBACK_HOURS` | 24 | Read freshness window |
| `NEWS_RETENTION_DAYS` | 10 | Prune threshold |
| `NEWS_PRUNE_INTERVAL_HOURS` | 12 | Prune check interval |
| `NEWS_CLUSTER_MAX_AGE_HOURS` | 6 | Cross-cycle merge window |
| `NEWS_EMBEDDINGS_ENABLED` | true | Enable Ollama embeddings |
| `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` | 0.885 | Cosine threshold |
| `OLLAMA_BASE_URL` | `http://192.168.0.200:11434` | Ollama API URL |
| `NEWS_EXTRACT_PROVIDER` / `NEWS_SUMMARY_PROVIDER` | groq | LLM provider |
| `NEWS_EXTRACT_MODEL` / `NEWS_SUMMARY_MODEL` | llama4-16e | LLM model |
| `GROQ_API_KEY` / `OPENAI_API_KEY` / `OPENROUTER_API_KEY` | — | API keys |
| `ENTITY_BLACKLIST` | — | Comma-separated entity patterns |
| `ENRICHMENT_MAX_PER_REFRESH` | 0 (unlimited) | LLM enrichments per cycle |

Clustering thresholds (also in site_config):

- `title_threshold`: 0.75
- `jaccard_threshold`: 0.55
- `dual_title_floor`: 0.55
- `dual_jaccard_floor`: 0.25

## Clustering pipeline

1. **Fetch** all feeds concurrently
2. **Feed hash** — skip unchanged feeds entirely
3. **Retention filter** — drop articles older than `NEWS_RETENTION_DAYS`
4. **Seen articles** — `filter_already_seen()` splits into:
   - `new` → never seen, full processing
   - `unchanged` → same URL, same content hash → skip
   - `changed` → same URL, different content hash → re-cluster + re-enrich
5. **Cluster** — title similarity, Jaccard, embeddings, dual-signal merge
6. **Enrich** — LLM extraction, summarization, sentiment
7. **Prune** — delete clusters older than retention window

## Content-change detection

When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):

1. `content_hash = SHA1(title|summary)` is computed
2. Compared against `seen_articles.content_hash`
3. If different → article is re-clustered into its existing cluster
4. `enriched_at` is cleared → next cycle re-enriches with updated content

## Persistence

- SQLite at `./data/news.sqlite` (local) or `/app/data/news.sqlite` (Docker)
- Schema auto-migrates on startup (ALTER TABLE for new columns)
- Backfill script for seeding `seen_articles` from existing clusters:

```bash
docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
```

## Dashboard

`http://:8506/dashboard/`

- **Health** — stats, charts, feed status
- **Feeds** — toggle on/off
- **Clusters** — filterable table, click for drill-down modal
- **Sentiment** — time-series chart
- **Entities** — top entities, frequency chart
- **Keywords** — top keywords, frequency chart
- **Config** — runtime parameter tuning (new in v0.5.0)

## Version

See `./version-hash.sh` for the current content hash.

## Prompt Evaluation (extraction quality)

The extraction prompt (`prompts/extract_entities.prompt`) is tested against a curated set of annotated samples to ensure entity/keyword separation quality, especially for smaller models like `llama-3.1-8b-instant`.
### Running the evaluation

```bash
# Run against default prompt with 30 annotated samples (all 5 topics)
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq

# Run with specific prompt file
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --prompt-file prompts/extract_entities.prompt

# Run against larger model for comparison
python scripts/eval_extraction.py --model deepseek/deepseek-v4-flash --provider openrouter

# Verbose per-sample output
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --verbose

# Collect new samples from live DB for manual annotation
python scripts/eval_extraction.py --collect 30 --output new_samples.json
```

### What it measures

| Metric | Target | Description |
|--------|--------|-------------|
| Entity F1 | ≥ 0.65 | Precision/recall of named entities (proper nouns) |
| Keyword F1 | ≥ 0.40 | Precision/recall of thematic keywords (1-2 word tags) |
| Leakage | 0.0 | Entities appearing in keywords (should never happen) |
| Topic Accuracy | ≥ 0.80 | Correct topic classification (crypto/macro/regulation/ai/other) |

### Annotated samples

The 30 golden samples in `data/annotated_samples.json` cover all 5 topics:

- **regulation** (6): SEC lawsuits, OFAC sanctions, House crypto bills, WAMCO settlement, Cuba sanctions, Iran frozen funds
- **macro** (7): Fed/ECB decisions, China stimulus, OPEC+ cuts, India forex/trade, jobs report
- **crypto** (6): Bitcoin ETF flows, memecoins, seller exhaustion, XRP liquidation, Visa stablecoin, Kalshi
- **ai** (5): Nvidia earnings, Anthropic pause, Microsoft AI, AI bubble debate, Morgan Stanley AI funding
- **other** (6): Israel/Iran strikes, Trump intel firings, Boeing 737 Max, Putin/Trump, Paris bridge, Ukraine drones

### Current results (llama-3.1-8b-instant via Groq)

```
Entity F1:     0.665  (P=0.814  R=0.601)
Keyword F1:    0.468  (P=0.572  R=0.400)
Leakage (avg): 0.000
Topic Acc:     0.867
```

The prompt uses 6 few-shot examples with explicit entity/keyword decision rules and topic classification boundaries (especially the regulation vs other distinction for sanctions enforcement).
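
### Metric sketch

To make the metrics in "What it measures" concrete, here is a minimal sketch of set-based precision, recall, F1, and leakage for a single sample. It is not taken from `scripts/eval_extraction.py`. The function names (`prf1`, `leakage`), the case-insensitive exact matching, and the reading of leakage as "the fraction of predicted keywords that also appear among predicted entities" are all assumptions made for illustration.

```python
def _norm(items):
    """Lowercase and trim so that 'SEC ' and 'sec' compare equal (assumed rule)."""
    return {s.strip().lower() for s in items if s.strip()}


def prf1(predicted, expected):
    """Return (precision, recall, F1) for one sample using exact set overlap."""
    pred, gold = _norm(predicted), _norm(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def leakage(entities, keywords):
    """Fraction of predicted keywords that also appear as predicted entities."""
    ents, kws = _norm(entities), _norm(keywords)
    return len(ents & kws) / len(kws) if kws else 0.0


if __name__ == "__main__":
    # Hypothetical model output versus a hypothetical annotation.
    p, r, f1 = prf1(["SEC", "Coinbase", "Gary Gensler"], ["SEC", "Coinbase"])
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
    print("leakage:", leakage(["SEC", "Coinbase"], ["lawsuit", "sec"]))
```

Under these assumptions, a leakage of 0.0 means no predicted keyword duplicates a predicted entity, which matches the target in the table. If scores are averaged per sample, the reported F1 does not have to equal the F1 computed from the averaged P and R.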