# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:

```bash
docker compose up --build
```

Default SSE mount (FastMCP):

- `http://127.0.0.1:8506/mcp/sse`

Health:

- `http://127.0.0.1:8506/health`

## What this server provides

- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
- Deduplicates articles into clusters (v1 fuzzy title similarity)
- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
- Applies a case-insensitive entity blacklist after extraction
- Caches clusters + LLM fields in SQLite
- Resolves entities in-process via Google Trends suggestions (no `trends-mcp` hop required for entity resolution)

## Tools (MCP)

1) `get_latest_events(topic, limit, include_articles=false)`
   - `topic` is a coarse category: `crypto | macro | regulation | ai | other`
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
2) `get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)`
   - substring, case-insensitive match over extracted `entities`
   - uses the requested timeframe as the scan window; `limit` is the cap within that window
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
3) `get_event_summary(event_id, include_articles=false)`
   - LLM-written compressed narrative for a given `cluster_id`
   - when `include_articles=true`, includes the underlying `articles` list (with `url`) from the stored cluster
4) `detect_emerging_topics(limit)`
   - derives "emerging" signals from recent cached clusters
5) `get_news_sentiment(entity, timeframe)`
   - aggregates sentiment around an entity from cached enriched clusters
6) `get_related_recent_entities(subject, timeframe, limit, include_trends=true)`
   - merges recent co-occurrence data from cached clusters with Google Trends suggestions and returns related entities (with `mid` when available) plus source/score metadata
7) `get_capabilities()`
   - describes the server's tool surface, composition recipes, and output conventions for agents

### Entity aliasing

The server keeps a conservative alias map in `config/entity_aliases.json` for obvious shorthands like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant to reduce false misses, not to rewrite every possible name variant.

## Configuration

See `news-mcp/.env`. Key variables:

- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
- `GROQ_API_KEY`, `OPENAI_API_KEY`
- `ENTITY_BLACKLIST` (comma-separated, case-insensitive patterns; wildcards are supported)
- `NEWS_PROMPTS_DIR` (override prompt directory)
- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
- `NEWS_FEED_URL` (single feed fallback)
- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
- `NEWS_DEFAULT_LOOKBACK_HOURS` (freshness window for reads; older rows are ignored by queries)
- `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
- `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
- `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing
heuristic clustering path if Ollama is unavailable.

## Persistence and migration

The default database path is project-relative:

- `NEWS_MCP_DATA_DIR=./data`
- `NEWS_MCP_DB_PATH=./data/news.sqlite`

That keeps persistence inside the repository tree in both local and Docker runs.

Recommended workflow:

1. Keep **code** in Git.
2. Keep the **data directory** outside Git but inside the project tree.
3. Use `rsync` for the initial data transfer to a remote server.
4. After that, move code with Git and move data only when you actually need a fresh copy.

Example initial transfer:

```bash
rsync -a ./data/ user@remote:/srv/news-mcp/data/
```

If you change the location later, override the defaults with:

- `NEWS_MCP_DATA_DIR`
- `NEWS_MCP_DB_PATH`

## TTL vs pruning

These are intentionally different:

- `NEWS_DEFAULT_LOOKBACK_HOURS` controls **read freshness** only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- `NEWS_PRUNING_ENABLED` controls whether the server is allowed to **physically delete** old rows.
- `NEWS_RETENTION_DAYS` controls how old rows may get before they are deleted.
- `NEWS_PRUNE_INTERVAL_HOURS` controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:

- on startup
- after refresh cycles (prune-if-due)

If `NEWS_PRUNING_ENABLED=false`, no pruning occurs and old rows are retained indefinitely.

## Live extraction smoke test

Run a standardized, fabricated extraction test against the currently configured provider/model:

```bash
./live_tests.sh
```

The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
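The TTL-versus-pruning split described earlier amounts to two different timestamp filters over the same table. A minimal sketch, assuming a simplified `clusters(id, created_at)` schema (illustrative only, not the server's actual tables):

```python
import sqlite3
import time

LOOKBACK_HOURS = 144   # stand-in for NEWS_DEFAULT_LOOKBACK_HOURS: read freshness only
RETENTION_DAYS = 30    # stand-in for NEWS_RETENTION_DAYS: physical delete threshold

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clusters (id INTEGER PRIMARY KEY, created_at REAL)")

now = time.time()
db.executemany("INSERT INTO clusters VALUES (?, ?)", [
    (1, now - 1 * 3600),     # 1 hour old   -> readable, retained
    (2, now - 200 * 3600),   # ~8 days old  -> hidden from reads, but retained
    (3, now - 40 * 86400),   # 40 days old  -> hidden, and prunable
])

# Reads only see rows inside the lookback window; older rows stay on disk.
fresh = db.execute(
    "SELECT id FROM clusters WHERE created_at >= ?",
    (now - LOOKBACK_HOURS * 3600,),
).fetchall()

# Pruning (when enabled) physically deletes rows past the retention threshold.
db.execute("DELETE FROM clusters WHERE created_at < ?", (now - RETENTION_DAYS * 86400,))
remaining = [r[0] for r in db.execute("SELECT id FROM clusters ORDER BY id")]

print(fresh)      # [(1,)]
print(remaining)  # [1, 2]
```

Lookback only narrows what queries return; retention is the only setting that actually deletes rows.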
## mcporter examples (all news-mcp calls)

Use your existing config path:

```bash
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
```

Inspect server + tools:

```bash
mcporter --config "$CONFIG" list news --schema
```

### 1) Latest events

```bash
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
```

### 2) Events for an entity

```bash
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
```

### 3) Event summary (by cluster_id)

```bash
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=
```

### 4) Emerging topics

```bash
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
```

### 5) Sentiment for an entity

```bash
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
```

### 6) Related entities (recent neighborhood + trends blending)

```bash
# Iran: blend local co-occurrence with Google Trends related topics
mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
# Another seed phrase
mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
```

### 7) Capabilities / composition guidance

```bash
mcporter --config "$CONFIG" call news.get_capabilities
```

Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g. `cluster_id`), and how to present sources/timestamps consistently.

## Blacklist enforcement (optional back-clean)

If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still contain entities/keywords that would now be filtered at extraction time. For one-off cleanup, run:

```bash
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
```

This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching entries from `payload.entities` and `payload.keywords` and (if needed) setting `payload.topic = "other"`.

## Embeddings backfill (optional)

If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for older rows before restarting the server:

```bash
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
```

This stores a cluster-level `embedding` and `embedding_model` inside the SQLite payload so the Ollama-first clustering path has data ready to use.

## Embedding merge analysis (optional)

To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:

```bash
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
```

This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.

## Embedding merge pass (optional, destructive)

After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with a dry run:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
```

If the groupings look right, run wet:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
```

This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
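The merge pass above ultimately reduces to a cosine-similarity threshold test between stored cluster embeddings. A minimal sketch of that decision, assuming plain float lists rather than the server's stored payload format (`merge_candidates` is a hypothetical helper, not one of the scripts):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_candidates(embeddings: dict[str, list[float]], threshold: float = 0.90):
    """Yield cluster-id pairs whose embedding similarity clears the threshold."""
    ids = sorted(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                yield (a, b)

vecs = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.99, 0.05, 0.0],   # near-duplicate of c1
    "c3": [0.0, 1.0, 0.0],     # unrelated
}
print(list(merge_candidates(vecs, threshold=0.90)))   # [('c1', 'c2')]
```

Raising the threshold (e.g. from 0.85 to 0.90) trades recall for precision, which is why the analysis script prints candidates at several thresholds before any destructive pass.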
## Article dedup cleanup (optional)

Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:

```bash
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
```

The live clustering path also deduplicates article entries when new data comes in. As of the latest hardening, the server/storage write path also self-heals `payload.articles` by deduplicating before persisting, so historical rows can be fixed via the cleanup script and future writes won't reintroduce duplicates.

## Dashboard

A browser-based monitoring dashboard is available at:

```
http://127.0.0.1:8506/dashboard/
```

**Views:**

- **Health**: cluster/entity counts, freshness indicator, topic distribution (doughnut chart), sentiment overview, feed activity
- **Clusters**: filterable/sortable table with topic, sentiment, importance, entity chips; search by keyword
- **Sentiment**: time-series chart (avg sentiment per configurable time bucket) with cluster count overlay
- **Entities**: top entities by mention frequency, horizontal bar chart, click for detail
- **Detail**: click any cluster row or search by cluster ID for full drill-down (summary, key facts, articles, keywords, entities)

**Tech:** Pure static HTML/JS with Chart.js for visualizations, served from the same FastAPI process at `/dashboard`. No extra dependencies.

**Configuration defaults:** The dashboard's default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default 144h).

## REST API

The following read-only endpoints are available for programmatic access (in addition to the MCP SSE tools):

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: cluster/entity counts, freshness, feed state, pruning config |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail with summary, facts, articles |

## Startup behavior

The server uses a lifespan-based startup (FastAPI ≥ 0.111). Background feed refresh and pruning run as fire-and-forget coroutines, so the HTTP API and dashboard are available immediately, with no blocking on the first fetch cycle.

Important env vars controlling background behavior:

- `NEWS_BACKGROUND_REFRESH_ENABLED=true`: enable/disable the background loop
- `NEWS_BACKGROUND_REFRESH_ON_START=true`: fetch immediately on startup (previously `false`; changed to `true` for faster first data)
- `NEWS_REFRESH_INTERVAL_SECONDS=900`: polling interval between refresh cycles
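The `bucket_hours` parameter of `/api/v1/sentiment-series` can be pictured as fixed-width time bucketing with a per-bucket average. A sketch under that assumption (field names and shapes are illustrative, not the endpoint's actual JSON):

```python
from collections import defaultdict

def sentiment_series(rows: list[tuple[int, float]], bucket_hours: int) -> dict[int, float]:
    """Group (unix_ts, sentiment) pairs into fixed-width buckets and average each."""
    width = bucket_hours * 3600
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, sentiment in rows:
        # Bucket key = start timestamp of the bucket containing ts.
        buckets[int(ts // width) * width].append(sentiment)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Two rows in hours 0-2, one row in hours 2-4.
rows = [(0, 0.2), (1800, 0.4), (7200, -0.5)]
print(sentiment_series(rows, bucket_hours=2))
```

The dashboard's Sentiment view consumes exactly this kind of shape: one averaged value per bucket, plus a cluster count overlay.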
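The fire-and-forget startup described above can be sketched with plain asyncio, independent of FastAPI's `lifespan` machinery (names and intervals here are illustrative):

```python
import asyncio

REFRESH_INTERVAL_SECONDS = 0.01   # stand-in for NEWS_REFRESH_INTERVAL_SECONDS

refreshes: list[str] = []

async def refresh_loop() -> None:
    """Background feed refresh: loops forever and is never awaited by startup."""
    while True:
        refreshes.append("refreshed")   # real code would fetch + cluster feeds here
        await asyncio.sleep(REFRESH_INTERVAL_SECONDS)

async def main() -> None:
    # Fire-and-forget: schedule the loop, then continue startup immediately.
    task = asyncio.create_task(refresh_loop())
    # ...the HTTP server would begin serving here, before any refresh finishes...
    await asyncio.sleep(0.05)           # simulate the server running briefly
    task.cancel()                       # shutdown: stop the background loop

asyncio.run(main())
print(len(refreshes))   # several iterations ran while the "server" was up
```

Because the loop is scheduled rather than awaited, a slow feed or LLM call delays only the first data, never the availability of the HTTP endpoints.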