FastMCP-based MCP server that turns news feeds into deduplicated, enriched clusters.
Local:
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
Docker Compose:
docker compose up --build
Default SSE mount (FastMCP):
http://127.0.0.1:8506/mcp/sse
Health:
http://127.0.0.1:8506/health
Feeds are configured via NEWS_FEED_URL / NEWS_FEED_URLS. A trends-mcp hop is required for entity resolution.

Tools:
1) get_latest_events(topic, limit, include_articles=false)
   - topic is a coarse category: crypto | macro | regulation | ai | other
   - With include_articles=true, includes articles[].url + minimal fields per returned cluster
2) get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)
   - entity matches stored cluster entities; timeframe bounds the window and limit is the cap within that window
   - With include_articles=true, includes articles[].url + minimal fields per returned cluster
3) get_event_summary(event_id, include_articles=false)
   - event_id is the cluster_id; with include_articles=true, includes the underlying articles list (with url) from the stored cluster
4) detect_emerging_topics(limit)
5) get_news_sentiment(entity, timeframe)
6) get_related_recent_entities(subject, timeframe, limit, include_trends=true)
   - Returns related entities (with mid when available) plus source/score metadata
7) get_capabilities()
The server keeps a conservative alias map in config/entity_aliases.json for obvious shorthands
like btc -> Bitcoin, eth -> Ethereum, and ether -> Ethereum. Keep this map tight; it is meant
to reduce false misses, not to rewrite every possible name variant.
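The lookup itself is trivial; here is a minimal sketch of how such an alias map can be applied. The `ALIASES` dict is an inline stand-in for `config/entity_aliases.json`, and `resolve_entity` is a hypothetical helper for illustration, not the server's actual function:

```python
import json

# Inline stand-in for config/entity_aliases.json.
ALIASES = {"btc": "Bitcoin", "eth": "Ethereum", "ether": "Ethereum"}

def resolve_entity(name: str, aliases: dict = ALIASES) -> str:
    """Map a shorthand to its canonical entity name; unknown names pass through."""
    return aliases.get(name.strip().lower(), name)

# Loading the real file would look like:
# with open("config/entity_aliases.json") as fh:
#     aliases = json.load(fh)
```

With this shape, `resolve_entity("BTC")` yields "Bitcoin", while an unmapped name such as "Solana" passes through unchanged, which is exactly the "reduce false misses" behavior described above.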
See news-mcp/.env.
Key variables:
- NEWS_EXTRACT_PROVIDER, NEWS_EXTRACT_MODEL
- NEWS_SUMMARY_PROVIDER, NEWS_SUMMARY_MODEL
- GROQ_API_KEY, OPENAI_API_KEY
- ENTITY_BLACKLIST (comma-separated, case-insensitive patterns; wildcards are supported)
- NEWS_PROMPTS_DIR (override prompt directory)
- NEWS_ENTITY_ALIASES_FILE (override entity alias JSON file)
- NEWS_FEED_URL (single feed fallback)
- NEWS_FEED_URLS (comma-separated feed URLs; overrides NEWS_FEED_URL)
- NEWS_REFRESH_INTERVAL_SECONDS (default 900)
- NEWS_BACKGROUND_REFRESH_ON_START (default true)
- NEWS_BACKGROUND_REFRESH_ENABLED (default true)
- NEWS_DEFAULT_LOOKBACK_HOURS (freshness window for reads; older rows are ignored by queries)
- NEWS_PRUNING_ENABLED (default true; if false, no rows are physically deleted)
- NEWS_RETENTION_DAYS (physical delete threshold for stored clusters)
- NEWS_PRUNE_INTERVAL_HOURS (how often in-server pruning may run)
- GROQ_ENRICH_OTHER_ONLY (default false; set true for cost control)
- NEWS_EMBEDDINGS_ENABLED (default false; enables Ollama embeddings for clustering when wired in)
- OLLAMA_BASE_URL / OLLAMA_URL (default http://127.0.0.1:11434)
- OLLAMA_EMBEDDING_MODEL (default nomic-embed-text)
- NEWS_EMBEDDING_SIMILARITY_THRESHOLD (default 0.885; used when embeddings are enabled)

When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
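The Ollama-first fallback can be sketched as a simple gate on `NEWS_EMBEDDINGS_ENABLED` plus a liveness check. `choose_clustering_backend` is a hypothetical helper for illustration, not the server's code:

```python
import os

def choose_clustering_backend(ollama_available: bool) -> str:
    """Prefer embedding-based clustering only when the feature is enabled AND
    Ollama is reachable; otherwise fall back to the heuristic path."""
    enabled = os.getenv("NEWS_EMBEDDINGS_ENABLED", "false").strip().lower() == "true"
    if enabled and ollama_available:
        return "embeddings"
    return "heuristic"
```

Note that with the default `NEWS_EMBEDDINGS_ENABLED=false`, the heuristic path is chosen even when Ollama is up.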
The default database path is project-relative:
NEWS_MCP_DATA_DIR=./data
NEWS_MCP_DB_PATH=./data/news.sqlite

That keeps persistence inside the repository tree in both local and Docker runs.
Recommended workflow:
Use rsync for the initial data transfer to a remote server. Example initial transfer:
rsync -a ./data/ user@remote:/srv/news-mcp/data/
If you change the location later, override the defaults with:
NEWS_MCP_DATA_DIR
NEWS_MCP_DB_PATH

These are intentionally different:
- NEWS_DEFAULT_LOOKBACK_HOURS controls read freshness only. Older rows remain in SQLite but do not appear in normal "latest" queries.
- NEWS_PRUNING_ENABLED controls whether the server is allowed to physically delete old rows.
- NEWS_RETENTION_DAYS controls how old rows may get before they are deleted.
- NEWS_PRUNE_INTERVAL_HOURS controls how often the server checks whether deletion is due.

Pruning is self-contained inside the server:
If NEWS_PRUNING_ENABLED=false, no pruning occurs and old rows are retained indefinitely.
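The lookback/retention split can be sketched as two independent predicates. Constants and function names here are illustrative (the retention value is a made-up example), not the server's internals:

```python
from datetime import datetime, timedelta, timezone

LOOKBACK_HOURS = 144   # NEWS_DEFAULT_LOOKBACK_HOURS: read-side freshness window
RETENTION_DAYS = 30    # NEWS_RETENTION_DAYS: illustrative value for physical deletes

def is_visible(created_at: datetime, now: datetime) -> bool:
    """Row appears in 'latest' queries only inside the lookback window."""
    return created_at >= now - timedelta(hours=LOOKBACK_HOURS)

def is_prunable(created_at: datetime, now: datetime, pruning_enabled: bool = True) -> bool:
    """Row may be physically deleted only when pruning is enabled AND past retention."""
    return pruning_enabled and created_at < now - timedelta(days=RETENTION_DAYS)
```

With these example values, a 10-day-old row is already invisible to reads (past the 144h lookback) yet still retained on disk (inside the 30-day retention).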
Run a standardized, fabricated extraction test against the currently configured provider/model:
./live_tests.sh
The script reads ./.env, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
Use your existing config path:
CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
Inspect server + tools:
mcporter --config "$CONFIG" list news --schema
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
# First fetch an event id
mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
# Then summarize it
mcporter --config "$CONFIG" call news.get_event_summary event_id=<cluster_id>
mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
# Iran: blend local co-occurrence with Google Trends related topics
mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
# Another seed phrase
mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
mcporter --config "$CONFIG" call news.get_capabilities
Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g. cluster_id), and how to present sources/timestamps consistently.
If you change ENTITY_BLACKLIST, existing clusters in news.sqlite may still
contain entities/keywords that would now be filtered at extraction time.
For one-off cleanup, run:
./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
This enforces ENTITY_BLACKLIST inside stored clusters by removing matching
entries from payload.entities and payload.keywords and (if needed) setting
payload.topic = "other".
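Since ENTITY_BLACKLIST is comma-separated, case-insensitive, and wildcard-aware, the matching the cleanup script performs can be approximated with `fnmatch`. This is an illustrative sketch, not the script's actual code:

```python
from fnmatch import fnmatch

def filter_blacklisted(entities: list[str], blacklist_csv: str) -> list[str]:
    """Drop entities matching any comma-separated, case-insensitive wildcard pattern."""
    patterns = [p.strip().lower() for p in blacklist_csv.split(",") if p.strip()]
    return [e for e in entities
            if not any(fnmatch(e.lower(), pat) for pat in patterns)]
```

For example, with `ENTITY_BLACKLIST="reuters*, *newsletter"`, the list `["Reuters Staff", "Bitcoin", "Daily Newsletter"]` filters down to `["Bitcoin"]`.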
If NEWS_EMBEDDINGS_ENABLED=true, you can precompute cluster embeddings for
older rows before restarting the server:
./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
This stores a cluster-level embedding and embedding_model inside the SQLite
payload so the Ollama-first clustering path has data ready to use.
To inspect likely cluster merges at different cosine thresholds without writing anything back to the DB:
./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
This prints candidate pairs per threshold so you can decide whether a merge script is worth adding next.
After inspecting the analysis output, you can merge clusters above a chosen threshold. Start with dry-run:
./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
If the groupings look right, run wet:
./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
This merges embedding-similar clusters within the same topic and removes the absorbed duplicates from SQLite.
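Conceptually, the merge pass pairs same-topic clusters whose embedding cosine similarity meets the threshold. A minimal sketch (the data layout and function names are illustrative, not the script's implementation):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_candidates(clusters, threshold: float = 0.90):
    """clusters: list of (cluster_id, topic, embedding) tuples.
    Returns same-topic pairs whose similarity meets the threshold."""
    pairs = []
    for i, (id_a, topic_a, emb_a) in enumerate(clusters):
        for id_b, topic_b, emb_b in clusters[i + 1:]:
            if topic_a == topic_b and cosine(emb_a, emb_b) >= threshold:
                pairs.append((id_a, id_b))
    return pairs
```

Note the same-topic guard: two near-identical embeddings under different topics are never merged, matching the behavior described above.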
Some stored clusters may contain repeated article entries for the same underlying article id / URL path. To clean existing rows:
./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
./.venv/bin/python scripts/dedup_articles_in_clusters.py
The live clustering path also deduplicates article entries when new data comes in.
As of the latest hardening, the server/storage write path also self-heals payload.articles by deduplicating before persisting (so historical rows can be fixed via the cleanup script, and future writes won’t reintroduce duplicates).
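The dedup rule is first-occurrence-wins on a stable article key. Roughly (the `id`/`url` field names are assumptions based on the fields described above, not a copy of the write path):

```python
def dedup_articles(articles: list[dict]) -> list[dict]:
    """Keep the first entry per article id (falling back to url), preserving order."""
    seen = set()
    out = []
    for art in articles:
        key = art.get("id") or art.get("url")
        if key is not None and key in seen:
            continue  # duplicate of an earlier entry
        if key is not None:
            seen.add(key)
        out.append(art)
    return out
```

Entries with neither key are kept as-is, since there is nothing stable to deduplicate on.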
## Dashboard (new)
A browser-based monitoring dashboard is available at:
http://127.0.0.1:8506/dashboard/
**Views:**
- **Health** — cluster/entity counts, freshness indicator, topic distribution (doughnut chart), sentiment overview, feed activity
- **Clusters** — filterable/sortable table with topic, sentiment, importance, entity chips; search by keyword
- **Sentiment** — time-series chart (avg sentiment per configurable time bucket) with cluster count overlay
- **Entities** — top entities by mention frequency, horizontal bar chart, click for detail
- **Detail** — click any cluster row or search by cluster ID for full drill-down (summary, key facts, articles, keywords, entities)
**Tech:** Pure static HTML/JS with Chart.js for visualizations. Served from the same FastAPI process at `/dashboard`.
**Configuration defaults:** The dashboard's default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default 144h).
## REST API
The following read-only endpoints are available for programmatic access (in addition to the MCP SSE tools):
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/health` | Extended health: cluster/entity counts, freshness, feed state, pruning config |
| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail with summary, facts, articles |
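The `bucket_hours` parameter of `/api/v1/sentiment-series` can be understood as a group-by on age buckets. This sketch mirrors the parameter semantics, not the endpoint's actual implementation:

```python
from collections import defaultdict

def bucket_sentiment(rows, bucket_hours: int = 6):
    """rows: (hours_ago, sentiment) pairs. Returns {bucket_index: avg_sentiment},
    where bucket 0 covers the most recent bucket_hours-wide window."""
    buckets = defaultdict(list)
    for hours_ago, sentiment in rows:
        buckets[int(hours_ago // bucket_hours)].append(sentiment)
    return {b: sum(vals) / len(vals) for b, vals in sorted(buckets.items())}
```

For instance, with 6-hour buckets, readings at 1h and 2h ago average into bucket 0 while a reading at 7h ago lands in bucket 1.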
## Startup behavior
The server uses a lifespan-based startup (FastAPI ≥0.111). Background feed refresh and pruning run as fire-and-forget coroutines, so the HTTP API and dashboard are available immediately — no blocking on the first fetch cycle.
Important env vars controlling background behavior:
- `NEWS_BACKGROUND_REFRESH_ENABLED=true` — enable/disable background loop
- `NEWS_BACKGROUND_REFRESH_ON_START=true` — fetch immediately on startup (previously `false`; changed to `true` for faster first data)
- `NEWS_REFRESH_INTERVAL_SECONDS=900` — polling interval between refresh cycles
## Dashboard (updated)
A browser-based monitoring dashboard is available at:
http://127.0.0.1:8506/dashboard/

Views:

| View | What it shows |
|---|---|
| Health | Cluster/entity counts, freshness badge, topic doughnut, sentiment overview, feed activity |
| Clusters | Filterable table with topic, sentiment, importance, entity chips, full-text search |
| Sentiment | Time-series line chart (avg sentiment per configurable bucket) + cluster count overlay |
| Entities | Top entities by mention frequency, bar chart, click-through to matching clusters |
| Detail | Click any cluster row or paste a cluster_id for a full drill-down with summary, key facts, articles, keywords, entities |
Tech stack: Vanilla HTML/JS + Chart.js, served as static files from the same FastAPI process at /dashboard. No extra dependencies.
The dashboard default lookback window follows NEWS_DEFAULT_LOOKBACK_HOURS (configured via .env, default: 144h).
Five read-only JSON endpoints for programmatic access:
| Method | Path | Params | Returns |
|---|---|---|---|
| GET | /api/v1/health | (none) | stats, freshness, feed state, pruning config |
| GET | /api/v1/clusters | topic, hours, limit, offset | paginated cluster list |
| GET | /api/v1/sentiment-series | topic, hours, bucket_hours | time-series data for Chart.js |
| GET | /api/v1/entities | hours, limit | top entities by mention count |
| GET | /api/v1/cluster/{id} | (none; path param) | full cluster detail |
Uses FastAPI lifespan — HTTP API available in <0.3s regardless of feed/LLM latency. Background refresh + pruning are fire-and-forget coroutines.
Key .env vars:
- NEWS_BACKGROUND_REFRESH_ON_START=true — fetch immediately on boot
- NEWS_BACKGROUND_REFRESH_ENABLED=true — enable/disable the background loop
- NEWS_REFRESH_INTERVAL_SECONDS=900 — polling interval