FastMCP-based MCP server that turns news feeds into deduplicated, enriched clusters.
Local:
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
Docker Compose:
docker compose up --build
Default SSE mount (FastMCP):
http://127.0.0.1:8506/mcp/sse
Health:
http://127.0.0.1:8506/health
- … NEWS_FEED_URL / NEWS_FEED_URLS)
- … trends-mcp hop required for entity resolution)

1) get_latest_events(topic, limit, include_articles=false)
- topic is a coarse category: crypto | macro | regulation | ai | other
- include_articles=true, includes articles[].url + minimal fields per returned cluster

2) get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)
- … entities
- … limit is the cap within that window
- include_articles=true, includes articles[].url + minimal fields per returned cluster

3) get_event_summary(event_id, include_articles=false)
- … cluster_id
- include_articles=true, includes the underlying articles list (with url) from the stored cluster

4) detect_emerging_topics(limit)

5) get_news_sentiment(entity, timeframe)

6) get_related_recent_entities(subject, timeframe, limit, include_trends=true)
- … mid when available) plus source/score metadata

7) get_capabilities()
The server keeps a conservative alias map in config/entity_aliases.json for obvious shorthands
like btc -> Bitcoin, eth -> Ethereum, and ether -> Ethereum. Keep this map tight; it is meant
to reduce false misses, not to rewrite every possible name variant.
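To illustrate how such a lookup might behave, here is a minimal sketch. It assumes the alias file is a flat JSON object mapping a lowercase shorthand to a canonical name; the actual schema of config/entity_aliases.json is not shown in this document, and `load_aliases` and `resolve_entity` are hypothetical helper names.

```python
import json

# Assumed shape of the alias file: {"btc": "Bitcoin", ...}. The real schema may differ.
SAMPLE_ALIASES_JSON = '{"btc": "Bitcoin", "eth": "Ethereum", "ether": "Ethereum"}'


def load_aliases(raw_json: str) -> dict:
    """Parse the alias JSON and normalize keys to lowercase."""
    return {key.strip().lower(): value for key, value in json.loads(raw_json).items()}


def resolve_entity(name: str, aliases: dict) -> str:
    """Return the canonical name for a known shorthand, else the input unchanged."""
    cleaned = name.strip()
    return aliases.get(cleaned.lower(), cleaned)


aliases = load_aliases(SAMPLE_ALIASES_JSON)
print(resolve_entity("BTC", aliases))     # Bitcoin
print(resolve_entity("Solana", aliases))  # Solana (no alias, passed through)
```

Unknown names pass through untouched, which matches the stated goal of reducing false misses without rewriting every variant.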
See news-mcp/.env.
Key variables:
- NEWS_EXTRACT_PROVIDER, NEWS_EXTRACT_MODEL
- NEWS_SUMMARY_PROVIDER, NEWS_SUMMARY_MODEL
- GROQ_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY
- ENTITY_BLACKLIST (comma-separated, case-insensitive patterns; wildcards are supported)
- NEWS_PROMPTS_DIR (override prompt directory)
- NEWS_ENTITY_ALIASES_FILE (override entity alias JSON file)
- NEWS_FEED_URL (single feed fallback)
- NEWS_FEED_URLS (comma-separated feed URLs; overrides NEWS_FEED_URL)
- NEWS_FEED_ITEMS_PER_POLL (per-feed fetch cap per poll; default 50)
- NEWS_REFRESH_INTERVAL_SECONDS (default 900)
- NEWS_BACKGROUND_REFRESH_ON_START (default true)
- NEWS_BACKGROUND_REFRESH_ENABLED (default true)
- NEWS_DEFAULT_LOOKBACK_HOURS (freshness window for reads; older rows are ignored by queries)
- NEWS_PRUNING_ENABLED (default true; if false, no rows are physically deleted)
- NEWS_RETENTION_DAYS (physical delete threshold for stored clusters)
- NEWS_PRUNE_INTERVAL_HOURS (how often in-server pruning may run)
- ENRICH_OTHER_TOPICS_ONLY (default false; set true to only LLM-enrich "other" topic clusters)
- ENRICHMENT_MAX_PER_REFRESH (default 0 = no limit; max clusters to LLM-enrich per refresh cycle)
- NEWS_LLM_DEBUG (default false; enable debug logging for LLM calls)
- NEWS_LLM_CONCURRENCY_<PROVIDER> (e.g. NEWS_LLM_CONCURRENCY_GROQ; max concurrent outbound LLM calls per provider; overrides the built-in defaults: groq=8, openai=5, openrouter=2)
- NEWS_LLM_RATE_LIMIT_<PROVIDER> (e.g. NEWS_LLM_RATE_LIMIT_GROQ; max LLM calls per second per provider. Set to 0 to disable rate limiting. Built-in defaults: groq=1.0, openai=5.0, openrouter=2.0)
- NEWS_EMBEDDINGS_ENABLED (default false; enables Ollama embeddings for clustering)
- OLLAMA_BASE_URL / OLLAMA_URL (default http://127.0.0.1:11434)
- OLLAMA_EMBEDDING_MODEL (default nomic-embed-text)
- NEWS_EMBEDDING_SIMILARITY_THRESHOLD (default 0.885)
- NEWS_CLUSTER_MAX_AGE_HOURS (default 4; cross-cycle merge window. Set 0 to disable)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
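The per-provider override variables (NEWS_LLM_CONCURRENCY_<PROVIDER> and NEWS_LLM_RATE_LIMIT_<PROVIDER>) could be resolved roughly as sketched here. The defaults are the ones listed above; the function names and the choice to return `None` for a disabled rate limit are assumptions made for this example, not the server's actual code.

```python
import os
from typing import Mapping, Optional

# Built-in defaults as documented in the variable list.
DEFAULT_CONCURRENCY = {"groq": 8, "openai": 5, "openrouter": 2}
DEFAULT_RATE_LIMIT = {"groq": 1.0, "openai": 5.0, "openrouter": 2.0}


def provider_concurrency(provider: str, env: Mapping[str, str] = os.environ) -> int:
    """Max concurrent outbound LLM calls for a provider (hypothetical helper)."""
    raw = env.get(f"NEWS_LLM_CONCURRENCY_{provider.upper()}", "").strip()
    return int(raw) if raw else DEFAULT_CONCURRENCY[provider.lower()]


def provider_rate_limit(provider: str, env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Max LLM calls per second, or None when the limit is disabled with 0."""
    raw = env.get(f"NEWS_LLM_RATE_LIMIT_{provider.upper()}", "").strip()
    value = float(raw) if raw else DEFAULT_RATE_LIMIT[provider.lower()]
    return None if value == 0 else value


print(provider_concurrency("groq", {}))                                  # 8
print(provider_rate_limit("groq", {"NEWS_LLM_RATE_LIMIT_GROQ": "0"}))    # None
```

Passing the environment as a parameter keeps the helpers easy to test; a provider without a built-in default raises `KeyError` in this sketch.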
The clustering pipeline has two modes:
In-cycle dedup (every poll): new articles are compared against each other and against recently loaded existing clusters. A match merges into the existing cluster; no match creates a new cluster.
Cross-cycle merge (controlled by NEWS_CLUSTER_MAX_AGE_HOURS): before clustering, the poller loads recent clusters from the DB and seeds them as merge targets. This means an article that arrives in poll N+1 can merge into a cluster created in poll N, even if the article's title is different enough that it wouldn't match against the cluster's original seed article. Set to 0 to disable.
Stable cluster IDs: cluster IDs are derived from the topic and the lexicographically smallest article key in the cluster, not from the first article's title. This means the same set of articles always resolves to the same cluster_id regardless of processing order or polling cycle.
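A minimal sketch of an order-independent ID along these lines is shown here. The hash function (SHA-256), the `topic:key` input format, and the 16-character truncation are assumptions for illustration; the document only states that the ID is derived from the topic and the smallest article key.

```python
import hashlib


def stable_cluster_id(topic: str, article_keys: list) -> str:
    """Hash the topic plus the lexicographically smallest article key."""
    if not article_keys:
        raise ValueError("a cluster needs at least one article key")
    anchor = min(article_keys)  # independent of processing order
    digest = hashlib.sha256(f"{topic}:{anchor}".encode("utf-8")).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice for this sketch


first = stable_cluster_id("crypto", ["k3", "k1", "k2"])
second = stable_cluster_id("crypto", ["k2", "k3", "k1"])
print(first == second)  # True
```

In this sketch the ID changes only if an article with a smaller key joins the cluster or the topic changes.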
Orphan merge: a post-clustering pass detects clusters that share article keys (via Union-Find) and merges them. This catches cases where two articles about the same event didn't match during the main loop (e.g. embeddings were temporarily unavailable).
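The Union-Find idea behind such a pass can be sketched as follows. The input shape (a mapping from cluster ID to a set of article keys) and the function name `merge_orphans` are assumptions for this example, not the project's data model.

```python
def merge_orphans(clusters: dict) -> list:
    """Merge clusters that share at least one article key.

    `clusters` maps a cluster id to a set of article keys (assumed shape).
    Returns the merged article-key sets.
    """
    parent = {cid: cid for cid in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Deterministic root choice: the smaller id absorbs the larger one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    owner = {}  # article key -> first cluster id seen holding it
    for cid, keys in clusters.items():
        for key in keys:
            if key in owner:
                union(cid, owner[key])
            else:
                owner[key] = cid

    merged = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return list(merged.values())


result = merge_orphans({"c1": {"a", "b"}, "c2": {"b", "c"}, "c3": {"d"}})
print(sorted(sorted(keys) for keys in result))  # [['a', 'b', 'c'], ['d']]
```

Because unions are transitive, a chain of clusters that overlap pairwise collapses into one group even when the first and last share no key directly.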
Signal cascade: each new article is compared against all articles in a candidate cluster (not just the seed). The matching cascade is: cosine similarity → title similarity → token Jaccard → consensus (cosine + title/jaccard). The first signal that clears its threshold wins.
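The cascade order can be sketched as below. Only the cosine threshold (0.885, the documented default) comes from this document; every other threshold, the use of `difflib` for title similarity, whitespace tokenization, and the exact consensus rule are placeholders chosen for the example.

```python
import math
from difflib import SequenceMatcher

COSINE_T = 0.885                # documented default
TITLE_T, JACCARD_T = 0.80, 0.60                    # placeholder values
CONSENSUS_COSINE_T, CONSENSUS_TEXT_T = 0.80, 0.50  # placeholder values


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def title_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match_signal(new, existing):
    """Return the name of the first signal that clears its threshold, or None."""
    has_vectors = new.get("embedding") and existing.get("embedding")
    cos = cosine(new["embedding"], existing["embedding"]) if has_vectors else None
    title = title_similarity(new["title"], existing["title"])
    jac = jaccard(new["title"], existing["title"])
    if cos is not None and cos >= COSINE_T:
        return "cosine"
    if title >= TITLE_T:
        return "title"
    if jac >= JACCARD_T:
        return "jaccard"
    # Assumed consensus rule: two weaker signals that agree.
    if cos is not None and cos >= CONSENSUS_COSINE_T and max(title, jac) >= CONSENSUS_TEXT_T:
        return "consensus"
    return None


def matches_cluster(new, cluster_articles):
    """Compare against every article in the cluster, not just the seed."""
    for article in cluster_articles:
        signal = match_signal(new, article)
        if signal:
            return signal
    return None
```

When embeddings are missing on either side, the sketch skips the cosine and consensus steps and relies on the two text signals, which mirrors the heuristic fallback described earlier.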
The default database path is project-relative:
- NEWS_MCP_DATA_DIR=./data
- NEWS_MCP_DB_PATH=./data/news.sqlite

That keeps persistence inside the repository tree in both local and Docker runs.
Recommended workflow:
- rsync for the initial data transfer to a remote server.

Example initial transfer:
rsync -a ./data/ user@remote:/srv/news-mcp/data/
If you change the location later, override the defaults with:
- NEWS_MCP_DATA_DIR
- NEWS_MCP_DB_PATH

These are intentionally different:
- NEWS_DEFAULT_LOOKBACK_HOURS controls read freshness only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- NEWS_PRUNING_ENABLED controls whether the server is allowed to physically delete old rows.
- NEWS_RETENTION_DAYS controls how old rows may get before they are deleted.
- NEWS_PRUNE_INTERVAL_HOURS controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:
If NEWS_PRUNING_ENABLED=false, no pruning occurs and old rows are retained indefinitely.
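The difference between the read-side freshness window and physical retention can be illustrated with a small sketch. Rows are modeled as dictionaries with an `updated_at` timestamp, which is a hypothetical field name; the real SQLite schema is not shown in this document.

```python
from datetime import datetime, timedelta, timezone


def visible_rows(rows, now, lookback_hours):
    """Read path: hide rows older than the lookback window, delete nothing."""
    cutoff = now - timedelta(hours=lookback_hours)
    return [row for row in rows if row["updated_at"] >= cutoff]


def prune_rows(rows, now, retention_days, pruning_enabled=True):
    """Delete path: drop rows past retention, unless pruning is switched off."""
    if not pruning_enabled:
        return list(rows)
    cutoff = now - timedelta(days=retention_days)
    return [row for row in rows if row["updated_at"] >= cutoff]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
rows = [
    {"id": "fresh", "updated_at": now - timedelta(hours=2)},
    {"id": "stale", "updated_at": now - timedelta(days=10)},
    {"id": "expired", "updated_at": now - timedelta(days=60)},
]
print([r["id"] for r in visible_rows(rows, now, lookback_hours=144)])  # ['fresh']
print([r["id"] for r in prune_rows(rows, now, retention_days=30)])     # ['fresh', 'stale']
```

The "stale" row shows the intended gap: it no longer appears in reads but still exists on disk until it crosses the retention threshold.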
Run a standardized, fabricated extraction test against the currently configured provider/model:
./live_tests.sh
The script reads ./.env, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
Use your existing config path:
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
Inspect server + tools:
mcporter --config "$CONFIG" list news --schema
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=<cluster_id>
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
# Iran: blend local co-occurrence with Google Trends related topics
mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
# Another seed phrase
mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
mcporter --config "$CONFIG" call news.get_capabilities
Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g. cluster_id), and how to present sources/timestamps consistently.
If you change ENTITY_BLACKLIST, existing clusters in news.sqlite may still
contain entities/keywords that would now be filtered at extraction time.
For one-off cleanup, run:
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
This enforces ENTITY_BLACKLIST inside stored clusters by removing matching
entries from payload.entities and payload.keywords and (if needed) setting
payload.topic = "other".
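The filtering step can be sketched as follows, assuming shell-style wildcards (`*`, `?`) via `fnmatch` and assuming that entities and keywords are stored as plain strings. The helper names are hypothetical, and the topic reset is left out because the condition that triggers it is not described here.

```python
from fnmatch import fnmatchcase


def parse_blacklist(raw: str) -> list:
    """Split a comma-separated ENTITY_BLACKLIST value into lowercase patterns."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def is_blacklisted(name: str, patterns: list) -> bool:
    """Case-insensitive match: both sides are lowercased before comparing."""
    lowered = name.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in patterns)


def scrub_payload(payload: dict, patterns: list) -> dict:
    """Return a copy of the payload without blacklisted entities/keywords."""
    cleaned = dict(payload)
    for field in ("entities", "keywords"):
        cleaned[field] = [
            value for value in payload.get(field, [])
            if not is_blacklisted(value, patterns)
        ]
    return cleaned


patterns = parse_blacklist("Reuters, *news*")
payload = {
    "topic": "crypto",
    "entities": ["Bitcoin", "reuters", "Fox News"],
    "keywords": ["etf", "newsletter"],
}
print(scrub_payload(payload, patterns))
```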
If NEWS_EMBEDDINGS_ENABLED=true, you can precompute cluster embeddings for
older rows before restarting the server:
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
This stores a cluster-level embedding and embedding_model inside the SQLite
payload so the Ollama-first clustering path has data ready to use.
To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.
After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with dry-run:
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
If the groupings look right, run it again without --dry-run:
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
The live clustering path also deduplicates article entries when new data comes in.
As of the latest hardening, the server/storage write path also self-heals payload.articles by deduplicating before persisting (so historical rows can be fixed via the cleanup script, and future writes won’t reintroduce duplicates).
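A deduplication pass of this kind might look like the sketch below. It assumes each article is a dictionary with an optional `id` and a `url`, and it treats host plus path (ignoring query string and fragment) as the identity of a URL; both the field names and that identity rule are assumptions for illustration.

```python
from urllib.parse import urlsplit


def article_key(article: dict) -> str:
    """Prefer an explicit article id; otherwise fall back to host + URL path."""
    if article.get("id"):
        return f"id:{article['id']}"
    parts = urlsplit(article.get("url", ""))
    return f"url:{parts.netloc.lower()}{parts.path.rstrip('/')}"


def dedup_articles(articles: list) -> list:
    """Keep the first occurrence of each article, preserving order."""
    seen, unique = set(), []
    for article in articles:
        key = article_key(article)
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


articles = [
    {"url": "https://example.com/news/btc-rally?utm_source=rss"},
    {"url": "https://example.com/news/btc-rally/"},
    {"url": "https://example.com/news/eth-upgrade"},
]
print(len(dedup_articles(articles)))  # 2
```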
## Dashboard (new)
A browser-based monitoring dashboard is available at:
http://127.0.0.1:8506/dashboard/
**Views:**
- **Health** — cluster/entity counts, freshness indicator, topic distribution (doughnut chart), sentiment overview, feed activity
- **Clusters** — filterable/sortable table with topic, sentiment, importance, entity chips; search by keyword
- **Sentiment** — time-series chart (avg sentiment per configurable time bucket) with cluster count overlay
- **Entities** — top entities by mention frequency, horizontal bar chart, click for detail
- **Detail** — click any cluster row or search by cluster ID for full drill-down (summary, key facts, articles, keywords, entities)
**Tech:** Pure static HTML/JS with Chart.js for visualizations. Served from the same FastAPI process at `/dashboard`.
**Configuration defaults:** The dashboard's default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default 144h).
## REST API
The following read-only endpoints are available for programmatic access (in addition to the MCP SSE tools):
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: cluster/entity counts, freshness, feed state, pruning config |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail with summary, facts, articles |
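As a rough illustration of calling these endpoints from a script, the sketch below builds a `/api/v1/clusters` URL from the documented parameters and fetches it with the standard library. The base URL is the local default used elsewhere in this README; the helper names are hypothetical, and the response shape is not documented here, so the code only decodes the JSON without assuming any fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8506"  # local default from this README


def clusters_url(topic=None, hours=None, limit=None, offset=None, base_url=BASE_URL):
    """Build the clusters endpoint URL, skipping parameters that are not set."""
    candidates = {"topic": topic, "hours": hours, "limit": limit, "offset": offset}
    params = {name: value for name, value in candidates.items() if value is not None}
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url}/api/v1/clusters{query}"


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the JSON body (requires a running server)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(clusters_url(topic="crypto", hours=24, limit=5))
# With the server running: data = fetch_json(clusters_url(topic="crypto", limit=5))
```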
## Startup behavior
The server uses a lifespan-based startup (FastAPI ≥0.111). Background feed refresh and pruning run as fire-and-forget coroutines, so the HTTP API and dashboard are available immediately — no blocking on the first fetch cycle.
Important env vars controlling background behavior:
- `NEWS_BACKGROUND_REFRESH_ENABLED=true` — enable/disable background loop
- `NEWS_BACKGROUND_REFRESH_ON_START=true` — fetch immediately on startup (previously `false`; changed to `true` for faster first data)
- `NEWS_REFRESH_INTERVAL_SECONDS=900` — polling interval between refresh cycles
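The fire-and-forget pattern described above can be sketched with plain `asyncio`. This is not the server's code: `make_lifespan`, `refresh_loop`, and the `refresh_once` callback are hypothetical names, and the sketch only shows the shape of a lifespan context manager that starts a background task and returns control immediately.

```python
import asyncio
from contextlib import asynccontextmanager


async def refresh_loop(refresh_once, interval_seconds, on_start=True):
    """Optionally refresh right away, then refresh once per interval."""
    if on_start:
        await refresh_once()
    while True:
        await asyncio.sleep(interval_seconds)
        await refresh_once()


def make_lifespan(refresh_once, interval_seconds=900, on_start=True):
    """Build a lifespan-style context manager that takes the app as its argument."""

    @asynccontextmanager
    async def lifespan(app):
        task = asyncio.create_task(refresh_loop(refresh_once, interval_seconds, on_start))
        try:
            yield  # the app begins serving here; the first refresh runs in the background
        finally:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass

    return lifespan
```

With FastAPI, a function of this shape would typically be passed as `FastAPI(lifespan=...)`; the startup path never awaits the refresh itself, which is why the HTTP API does not block on the first fetch cycle.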
## Dashboard (updated)
A browser-based monitoring dashboard is available at:
http://127.0.0.1:8506/dashboard/

Views:

| View | What it shows |
|---|---|
| Health | Cluster/entity counts, freshness badge, topic doughnut, sentiment overview, feed activity |
| Clusters | Filterable table: topic, sentiment, importance, entity chips, full-text search |
| Sentiment | Time-series line chart (avg sentiment per configurable bucket) + cluster count overlay |
| Entities | Top entities by mention frequency, bar chart, click-through to matching clusters |
| Detail | Click any cluster row or paste a cluster_id for a full drill-down with summary, key facts, articles, keywords, entities |
Tech stack: Vanilla HTML/JS + Chart.js, served as static files from the same FastAPI process at /dashboard. No extra dependencies.
The dashboard default lookback window follows NEWS_DEFAULT_LOOKBACK_HOURS (configured via .env, default: 144h).
Five read-only JSON endpoints for programmatic access:
| Method | Path | Params | Returns |
|---|---|---|---|
| GET | /api/v1/health | none | stats, freshness, feed state, pruning config |
| GET | /api/v1/clusters | topic, hours, limit, offset | paginated cluster list |
| GET | /api/v1/sentiment-series | topic, hours, bucket_hours | time-series data for Chart.js |
| GET | /api/v1/entities | hours, limit | top entities by mention count |
| GET | /api/v1/cluster/{id} | none (path param) | full cluster detail |
Uses FastAPI lifespan — HTTP API available in <0.3s regardless of feed/LLM latency. Background refresh + pruning are fire-and-forget coroutines.
Key .env vars:
- NEWS_BACKGROUND_REFRESH_ON_START=true: fetch immediately on boot
- NEWS_BACKGROUND_REFRESH_ENABLED=true: enable/disable the background loop
- NEWS_REFRESH_INTERVAL_SECONDS=900: polling interval