# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into **deduplicated, enriched clusters**.

## Quick start

Local:

```bash
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Docker Compose:

```bash
docker compose up --build
```

Default SSE mount (FastMCP):

- `http://127.0.0.1:8506/mcp/sse`

Health:

- `http://127.0.0.1:8506/health`

## What this server provides

- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
- Deduplicates articles into clusters (v1 fuzzy title similarity)
- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
- Applies a case-insensitive entity blacklist after extraction
- Caches clusters + LLM fields in SQLite
- Resolves entities in-process via Google Trends suggestions (no `trends-mcp` hop required for entity resolution)

## Tools (MCP)

1) `get_latest_events(topic, limit, include_articles=false)`
   - `topic` is a coarse category: `crypto | macro | regulation | ai | other`
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
2) `get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)`
   - substring, case-insensitive match over extracted `entities`
   - uses the requested timeframe as the scan window; `limit` is the cap within that window
   - when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
3) `get_event_summary(event_id, include_articles=false)`
   - LLM-written compressed narrative for a given `cluster_id`
   - when `include_articles=true`, includes the underlying `articles` list (with `url`) from the stored cluster
4) `detect_emerging_topics(limit)`
   - derives “emerging” signals from recent cached clusters
5) `get_news_sentiment(entity, timeframe)`
   - aggregates sentiment around an entity from cached enriched clusters
6) `get_related_recent_entities(subject, timeframe, limit, include_trends=true)`
   - merges recent co-occurrence data
from cached clusters with Google Trends suggestions and returns related entities (with `mid` when available) plus source/score metadata
7) `get_capabilities()`
   - describes the server’s tool surface, composition recipes, and output conventions for agents

### Entity aliasing

The server keeps a conservative alias map in `config/entity_aliases.json` for obvious shorthands like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant to reduce false misses, not to rewrite every possible name variant.

## Configuration

See `news-mcp/.env`. Key variables:

- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
- `GROQ_API_KEY`, `OPENAI_API_KEY`
- `ENTITY_BLACKLIST` (comma-separated, case-insensitive exact entity match)
- `NEWS_PROMPTS_DIR` (override prompt directory)
- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
- `NEWS_FEED_URL` (single feed fallback)
- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
- `NEWS_DEFAULT_LOOKBACK_HOURS` (freshness window for reads; older rows are ignored by queries)
- `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
- `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
- `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if
Ollama is unavailable.

## Persistence and migration

The default database path is project-relative:

- `NEWS_MCP_DATA_DIR=./data`
- `NEWS_MCP_DB_PATH=./data/news.sqlite`

That keeps persistence inside the repository tree in both local and Docker runs.

Recommended workflow:

1. Keep **code** in Git.
2. Keep the **data directory** outside Git but inside the project tree.
3. Use `rsync` for the initial data transfer to a remote server.
4. After that, move code with Git and move data only when you actually need a fresh copy.

Example initial transfer:

```bash
rsync -a ./data/ user@remote:/srv/news-mcp/data/
```

If you change the location later, override the defaults with:

- `NEWS_MCP_DATA_DIR`
- `NEWS_MCP_DB_PATH`

## TTL vs pruning

These are intentionally different:

- `NEWS_DEFAULT_LOOKBACK_HOURS` controls **read freshness** only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- `NEWS_PRUNING_ENABLED` controls whether the server is allowed to **physically delete** old rows.
- `NEWS_RETENTION_DAYS` controls how old rows may get before they are deleted.
- `NEWS_PRUNE_INTERVAL_HOURS` controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:

- on startup
- after refresh cycles (prune-if-due)

If `NEWS_PRUNING_ENABLED=false`, no pruning occurs and old rows are retained indefinitely.

## Live extraction smoke test

Run a standardized, fabricated extraction test against the currently configured provider/model:

```bash
./live_tests.sh
```

The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
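The TTL-vs-pruning distinction above can be sketched against a minimal in-memory SQLite table. This is an illustration only: the table name, column names, and timestamp representation are assumptions, not the server's actual schema.

```python
import sqlite3
import time

LOOKBACK_HOURS = 24   # plays the role of NEWS_DEFAULT_LOOKBACK_HOURS (read freshness)
RETENTION_DAYS = 14   # plays the role of NEWS_RETENTION_DAYS (physical delete threshold)

def latest_clusters(conn: sqlite3.Connection) -> list:
    """Reads ignore rows older than the lookback window; the rows stay stored."""
    cutoff = time.time() - LOOKBACK_HOURS * 3600
    return conn.execute(
        "SELECT id FROM clusters WHERE updated_at >= ? ORDER BY updated_at DESC",
        (cutoff,),
    ).fetchall()

def prune(conn: sqlite3.Connection, enabled: bool = True) -> int:
    """Physically deletes rows past the retention threshold, only if pruning is enabled."""
    if not enabled:
        return 0
    cutoff = time.time() - RETENTION_DAYS * 86400
    cur = conn.execute("DELETE FROM clusters WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

A row that is too old for `latest_clusters` but younger than the retention threshold is invisible to reads yet survives pruning, which is exactly the intended split between the two settings.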
## mcporter examples (all news-mcp calls)

Use your existing config path:

```bash
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
```

Inspect server + tools:

```bash
mcporter --config "$CONFIG" list news --schema
```

### 1) Latest events

```bash
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
```

### 2) Events for an entity

```bash
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
```

### 3) Event summary (by cluster_id)

```bash
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=
```

### 4) Emerging topics

```bash
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
```

### 5) Sentiment for an entity

```bash
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
```

### 6) Related entities (recent neighborhood + trends blending)

```bash
# Iran: blend local co-occurrence with Google Trends related topics
mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
# Another seed phrase
mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
```

### 7) Capabilities / composition guidance

```bash
mcporter --config "$CONFIG" call news.get_capabilities
```

Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g.
`cluster_id`), and how to present sources/timestamps consistently.

## Blacklist enforcement (optional back-clean)

If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still contain entities/keywords that would now be filtered at extraction time. For one-off cleanup, run:

```bash
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
```

This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching entries from `payload.entities` and `payload.keywords` and (if needed) setting `payload.topic = "other"`.

## Embeddings backfill (optional)

If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for older rows before restarting the server:

```bash
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
```

This stores a cluster-level `embedding` and `embedding_model` inside the SQLite payload so the Ollama-first clustering path has data ready to use.

## Embedding merge analysis (optional)

To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:

```bash
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
```

This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.

## Embedding merge pass (optional, destructive)

After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with a dry run:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
```

If the groupings look right, run wet:

```bash
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
```

This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
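To make the merge semantics above concrete, here is a minimal sketch of same-topic, cosine-above-threshold grouping. It is not the server's implementation: the greedy first-match strategy and the cluster dict shape (`topic`, `embedding`) are assumptions for illustration.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_groups(clusters: list, threshold: float = 0.90) -> list:
    """Greedy single pass: a cluster joins the first group whose
    representative shares its topic and clears the similarity threshold."""
    groups = []  # each group is a list of cluster dicts; groups[i][0] is the representative
    for c in clusters:
        for g in groups:
            if c["topic"] == g[0]["topic"] and cosine(c["embedding"], g[0]["embedding"]) >= threshold:
                g.append(c)  # absorbed: would be deleted by the wet merge pass
                break
        else:
            groups.append([c])
    return groups
```

Raising the threshold shrinks group sizes (fewer merges); lowering it merges more aggressively, which is why the analysis script sweeps several thresholds before any destructive run.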
## Article dedup cleanup (optional)

Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:

```bash
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
```

The live clustering path also deduplicates article entries when new data comes in. As of the latest hardening, the server/storage write path also self-heals `payload.articles` by deduplicating before persisting (so historical rows can be fixed via the cleanup script, and future writes won’t reintroduce duplicates).
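The dedup rule described above (same article id or same URL path) can be sketched as follows. The key construction, in particular stripping the query string and trailing slash, is an assumption about what "same URL path" means here, not a copy of the script's logic.

```python
def dedup_articles(articles: list) -> list:
    """Drop repeated article entries, keeping the first occurrence.
    An entry is a duplicate if its id or its normalized URL
    (query string and trailing slash removed) was already seen."""
    seen = set()
    out = []
    for a in articles:
        url = (a.get("url") or "").split("?")[0].rstrip("/")
        keys = {k for k in (a.get("id"), url) if k}
        if keys & seen:
            continue  # duplicate by id or URL path
        seen |= keys
        out.append(a)
    return out
```

Because the write path applies the same idea before persisting, running the cleanup script once over historical rows should leave the database duplicate-free going forward.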