# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:

```bash
docker compose up --build
```

Endpoints:

- MCP: `http://127.0.0.1:8506/mcp/sse`
- Health: `http://127.0.0.1:8506/health`
- Dashboard: `http://127.0.0.1:8506/dashboard/`

## What it does

- Fetches from configured news feeds (`NEWS_FEED_URLS`)
- **Three-layer dedup**: feed hash → article URL → content hash (detects in-place updates)
- Clusters articles via title similarity (≥0.75), Jaccard (≥0.55), dual-signal, or embeddings
- Enriches clusters with LLM: topic, entities, sentiment, keywords, summary
- Resolves entities via Google Trends suggestions
- Dashboard with runtime Config page

## Tools (MCP)

| Tool | Description |
|------|-------------|
| `get_latest_events(topic, limit, include_articles)` | Latest clusters by topic |
| `get_events_for_entity(entity, limit, timeframe, include_articles)` | Clusters matching entity |
| `get_event_summary(event_id, include_articles)` | LLM narrative for cluster |
| `detect_emerging_topics(limit)` | Emerging signals from recent clusters |
| `get_news_sentiment(entity, timeframe)` | Aggregated sentiment |
| `get_related_recent_entities(subject, timeframe, limit)` | Co-occurrence + Trends blend |
| `get_feeds()` / `toggle_feed(url, enabled)` | Feed management |
| `debug_dedup(url, title?)` | Inspect dedup decisions & similarity signals |
| `get_capabilities()` | Tool surface documentation |

## REST API

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Stats, freshness, feed state, pruning |
| GET | `/api/v1/clusters` | Paginated clusters |
| GET | `/api/v1/sentiment-series` | Sentiment time-series |
| GET | `/api/v1/entities` | Top entities by frequency |
| GET | `/api/v1/keywords` | Top keywords by frequency |
| GET | `/api/v1/clusters/by-entity` | Entity search (SQL) |
| GET | `/api/v1/clusters/by-keyword` | Keyword search (SQL) |
| GET | `/api/v1/cluster/{id}` | Full cluster detail |
| GET | `/api/v1/feeds` | Feed state list |
| POST | `/api/v1/feeds/toggle` | Enable/disable feed |
| GET | `/api/v1/config` | All config parameters |
| POST | `/api/v1/config/update` | Update a parameter |
| POST | `/api/v1/config/reset` | Reset to defaults |

## Configuration

All parameters are stored in the `site_config` DB table and editable via the dashboard Config page. On first startup, seeded from `.env` or built-in defaults.

Key `.env` vars (seeded into `site_config`):

| Variable | Default | Purpose |
|----------|---------|---------|
| `NEWS_FEED_URLS` | — | Comma-separated feed URLs |
| `NEWS_REFRESH_INTERVAL_SECONDS` | 300 | Polling interval |
| `NEWS_DEFAULT_LOOKBACK_HOURS` | 24 | Read freshness window |
| `NEWS_RETENTION_DAYS` | 10 | Prune threshold |
| `NEWS_PRUNE_INTERVAL_HOURS` | 12 | Prune check interval |
| `NEWS_CLUSTER_MAX_AGE_HOURS` | 6 | Cross-cycle merge window |
| `NEWS_EMBEDDINGS_ENABLED` | true | Enable Ollama embeddings |
| `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` | 0.885 | Cosine threshold |
| `OLLAMA_BASE_URL` | `http://192.168.0.200:11434` | Ollama API URL |
| `NEWS_EXTRACT_PROVIDER` / `NEWS_SUMMARY_PROVIDER` | groq | LLM provider |
| `NEWS_EXTRACT_MODEL` / `NEWS_SUMMARY_MODEL` | llama4-16e | LLM model |
| `GROQ_API_KEY` / `OPENAI_API_KEY` / `OPENROUTER_API_KEY` | — | API keys |
| `ENTITY_BLACKLIST` | — | Comma-separated entity patterns |
| `ENRICHMENT_MAX_PER_REFRESH` | 0 (unlimited) | LLM enrichments per cycle |

Clustering thresholds (also in `site_config`):

- `title_threshold`: 0.75
- `jaccard_threshold`: 0.55
- `dual_title_floor`: 0.55
- `dual_jaccard_floor`: 0.25

## Clustering pipeline

1. **Fetch** all feeds concurrently
2. **Feed hash** — skip unchanged feeds entirely
3. **Retention filter** — drop articles older than `NEWS_RETENTION_DAYS`
4. **Seen articles** — `filter_already_seen()` splits into:
   - `new` → never seen, full processing
   - `unchanged` → same URL, same content hash → skip
   - `changed` → same URL, different content hash → re-cluster + re-enrich
5. **Cluster** — title similarity, Jaccard, embeddings, dual-signal merge
6. **Enrich** — LLM extraction, summarization, sentiment
7. **Prune** — delete clusters older than retention window

## Content-change detection

When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):

1. `content_hash = SHA1(title|summary)` is computed
2. Compared against `seen_articles.content_hash`
3. If different → article is re-clustered into its existing cluster
4. `enriched_at` is cleared → next cycle re-enriches with updated content

## Persistence

- SQLite at `./data/news.sqlite` (local) or `/app/data/news.sqlite` (Docker)
- Schema auto-migrates on startup (ALTER TABLE for new columns)
- Backfill script for seeding `seen_articles` from existing clusters:

  ```bash
  docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
  ```

## Dashboard

`http://<host>:8506/dashboard/`

- **Health** — stats, charts, feed status
- **Feeds** — toggle on/off
- **Clusters** — filterable table, click for drill-down modal
- **Sentiment** — time-series chart
- **Entities** — top entities, frequency chart
- **Keywords** — top keywords, frequency chart
- **Config** — runtime parameter tuning (new in v0.5.0)

## Version

See `./version-hash.sh` for the current content hash.
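
## Example: content-change detection

The `new` / `unchanged` / `changed` split from the content-change detection section can be illustrated with a minimal Python sketch. This is not the server's actual code: `Article`, `content_hash()`, and `classify_articles()` are hypothetical names, the in-memory `seen` dict stands in for the `seen_articles` table, and joining title and summary with a literal `|` before UTF-8 encoding is an assumption about how `SHA1(title|summary)` is built.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical minimal article record (not the server's real model)."""
    url: str
    title: str
    summary: str


def content_hash(article: Article) -> str:
    """SHA-1 hex digest of 'title|summary'.

    The '|' separator and UTF-8 encoding are assumptions.
    """
    payload = f"{article.title}|{article.summary}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def classify_articles(articles, seen):
    """Split articles into 'new', 'unchanged', and 'changed' buckets.

    `seen` maps URL -> previously stored content hash; it stands in for
    a lookup against seen_articles.content_hash.
    """
    buckets = {"new": [], "unchanged": [], "changed": []}
    for article in articles:
        previous = seen.get(article.url)
        if previous is None:
            buckets["new"].append(article)        # URL never seen before
        elif previous == content_hash(article):
            buckets["unchanged"].append(article)  # same URL, same content
        else:
            buckets["changed"].append(article)    # same URL, updated in place
    return buckets


# Usage: a stub article that is later updated at the same URL.
stub = Article("https://example.com/story", "Breaking", "More to come...")
seen = {stub.url: content_hash(stub)}
updated = Article("https://example.com/story", "Breaking", "Full report text.")
fresh = Article("https://example.com/other", "Other story", "Some text.")

buckets = classify_articles([stub, updated, fresh], seen)
print({name: len(items) for name, items in buckets.items()})
```

In this sketch, only `changed` and `new` articles would go on to clustering and enrichment, while `unchanged` ones are skipped, mirroring the three outcomes listed in the pipeline.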