# 📰 News MCP Server

FastMCP-based MCP server that turns news feeds into deduplicated, enriched clusters.

## Quick start

```shell
cd news-mcp
python3 -m venv .venv            # first run only, if the venv does not exist yet
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
```

Default SSE mount (FastMCP):

- `http://127.0.0.1:8506/mcp/sse`

Health:

- `http://127.0.0.1:8506/health`

## What this server provides

- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
- Deduplicates articles into clusters (v1: fuzzy title similarity)
- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
- Applies a case-insensitive entity blacklist after extraction
- Caches clusters and LLM fields in SQLite
- Resolves entities in-process via Google Trends suggestions (no trends-mcp hop required for entity resolution)
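The v1 dedup step can be pictured as a greedy fuzzy-title pass; a minimal sketch using `difflib` (the 0.8 threshold and the greedy first-match strategy are illustrative assumptions, not the server's exact implementation):

```python
from difflib import SequenceMatcher


def similar_titles(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy title match: case-insensitive similarity ratio over trimmed titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() >= threshold


def cluster_by_title(articles: list[dict]) -> list[list[dict]]:
    """Greedy clustering: each article joins the first cluster whose
    representative (first) title is similar enough, else starts a new cluster."""
    clusters: list[list[dict]] = []
    for art in articles:
        for cluster in clusters:
            if similar_titles(art["title"], cluster[0]["title"]):
                cluster.append(art)
                break
        else:
            clusters.append([art])
    return clusters
```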

## Tools (MCP)

### 1) `get_latest_events(topic, limit, include_articles=false)`

- `topic` is a coarse category: `crypto | macro | regulation | ai | other`
- when `include_articles=true`, includes `articles[].url` plus minimal fields per returned cluster

### 2) `get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)`

- substring, case-insensitive match over extracted entities
- uses the requested timeframe as the scan window; `limit` caps results within that window
- when `include_articles=true`, includes `articles[].url` plus minimal fields per returned cluster

### 3) `get_event_summary(event_id, include_articles=false)`

- LLM-written compressed narrative for a given `cluster_id`
- when `include_articles=true`, includes the underlying articles list (with `url`) from the stored cluster

### 4) `detect_emerging_topics(limit)`

- derives “emerging” signals from recent cached clusters

### 5) `get_news_sentiment(entity, timeframe)`

- aggregates sentiment around an entity from cached enriched clusters

### 6) `get_related_entities(subject, timeframe, limit)`

- entity-only co-occurrence neighborhood: for a given subject entity, returns related entities with aggregated `count`, `avg_importance`, and sentiment

## Entity aliasing

The server keeps a conservative alias map in `config/entity_aliases.json` for obvious shorthands like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant to reduce false misses, not to rewrite every possible name variant.
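The alias file is a plain JSON object; a plausible shape matching the examples above (the exact schema is whatever the server's loader expects — treat this as illustrative):

```json
{
  "btc": "Bitcoin",
  "eth": "Ethereum",
  "ether": "Ethereum"
}
```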

## Configuration

See `news-mcp/.env`. Key variables:

- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
- `GROQ_API_KEY`, `OPENAI_API_KEY`
- `ENTITY_BLACKLIST` (comma-separated, case-insensitive exact entity match)
- `NEWS_PROMPTS_DIR` (override prompt directory)
- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
- `NEWS_FEED_URL` (single feed fallback)
- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
- `NEWS_DEFAULT_LOOKBACK_HOURS` (freshness window for reads; older rows are ignored by queries)
- `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
- `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
- `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default 0.885; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
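The Ollama-first decision reduces to a cosine check with a heuristic fallback; a sketch under that description (helper names are hypothetical, and the 0.885 default mirrors `NEWS_EMBEDDING_SIMILARITY_THRESHOLD`):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def same_cluster(emb_a, emb_b, fuzzy_match: bool, threshold: float = 0.885) -> bool:
    """Embeddings first; fall back to the heuristic (fuzzy-title) verdict
    when either embedding is missing, e.g. because Ollama was unreachable."""
    if emb_a is not None and emb_b is not None:
        return cosine(emb_a, emb_b) >= threshold
    return fuzzy_match
```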

## TTL vs pruning

These are intentionally different:

- `NEWS_DEFAULT_LOOKBACK_HOURS` controls read freshness only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- `NEWS_PRUNING_ENABLED` controls whether the server is allowed to physically delete old rows.
- `NEWS_RETENTION_DAYS` controls how old rows may get before they are deleted.
- `NEWS_PRUNE_INTERVAL_HOURS` controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:

- on startup
- after refresh cycles (prune-if-due)

If `NEWS_PRUNING_ENABLED=false`, no pruning occurs and old rows are retained indefinitely.

## Live extraction smoke test

Run a standardized, fabricated extraction test against the currently configured provider/model:

```shell
./live_tests.sh
```

The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.

## mcporter examples (all news-mcp calls)

Use your existing config path:

```shell
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
```

Inspect the server and its tools:

```shell
mcporter --config "$CONFIG" list news --schema
```

### 1) Latest events

```shell
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
```

### 2) Events for an entity

```shell
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
```

### 3) Event summary (by cluster_id)

```shell
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1

# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=<cluster_id>
```

### 4) Emerging topics

```shell
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
```

### 5) Sentiment for an entity

```shell
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
```

### 6) Related entities (co-occurrence neighborhood)

```shell
mcporter --config "$CONFIG" call news.get_related_entities subject=iran timeframe=24h limit=8
mcporter --config "$CONFIG" call news.get_related_entities subject="iran war" timeframe=3d limit=8
```

## Blacklist enforcement (optional back-clean)

If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still contain entities/keywords that would now be filtered at extraction time.

For one-off cleanup, run:

```shell
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
```

This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching entries from `payload.entities` and `payload.keywords` and, if needed, setting `payload.topic = "other"`.
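The per-cluster transform can be sketched as follows; the payload keys match the description above, but the entity record shape (`{"name": ...}`) and the "if needed" condition for resetting the topic are assumptions:

```python
def enforce_blacklist(payload: dict, blacklist: set[str]) -> dict:
    """Remove blacklisted entries (case-insensitive exact match) from
    payload.entities and payload.keywords. The fallback to topic="other"
    when no entities survive is a guess at the script's "if needed" rule."""
    bl = {b.lower() for b in blacklist}
    payload["entities"] = [e for e in payload.get("entities", [])
                           if e.get("name", "").lower() not in bl]
    payload["keywords"] = [k for k in payload.get("keywords", [])
                           if k.lower() not in bl]
    if not payload["entities"]:
        payload["topic"] = "other"
    return payload
```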

## Embeddings backfill (optional)

If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for older rows before restarting the server:

```shell
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
```

This stores a cluster-level `embedding` and `embedding_model` inside the SQLite payload so the Ollama-first clustering path has data ready to use.

## Embedding merge analysis (optional)

To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:

```shell
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
```

This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.

## Embedding merge pass (optional, destructive)

After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with a dry run:

```shell
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
```

If the groupings look right, run it for real:

```shell
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
```

This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
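Grouping clusters above a cosine threshold is essentially connected components over the pairwise-similarity graph; a minimal union-find sketch (per-topic partitioning and the actual DB write-back are omitted, and the names are illustrative):

```python
def merge_groups(ids: list[str],
                 pairs: list[tuple[str, str, float]],
                 threshold: float) -> list[set[str]]:
    """Union-find over pairs whose cosine similarity meets the threshold;
    returns only groups that would actually merge (size > 1)."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return [g for g in groups.values() if len(g) > 1]
```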

## Article dedup cleanup (optional)

Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:

```shell
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
```

The live clustering path also deduplicates article entries when new data comes in.

As of the latest hardening, the server/storage write path also self-heals `payload.articles` by deduplicating before persisting, so historical rows can be fixed via the cleanup script and future writes won't reintroduce duplicates.
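Keying on the article id with a URL-path fallback, as described, yields an idempotent filter; a sketch (the normalization details, e.g. ignoring scheme/host/query, are assumptions):

```python
from urllib.parse import urlsplit


def dedup_articles(articles: list[dict]) -> list[dict]:
    """Keep the first occurrence per article id, falling back to the URL
    path so scheme/host/query variants of the same link collapse together."""
    seen: set[str] = set()
    out: list[dict] = []
    for art in articles:
        key = str(art.get("id") or urlsplit(art.get("url", "")).path)
        if key not in seen:
            seen.add(key)
            out.append(art)
    return out
```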