FastMCP-based MCP server that turns news feeds into deduplicated, enriched clusters.
Local:
cd news-mcp
source .venv/bin/activate
pip install -r requirements.txt
./run.sh
Docker Compose:
docker compose up --build
Endpoints:

- http://127.0.0.1:8506/mcp/sse
- http://127.0.0.1:8506/health
- http://127.0.0.1:8506/dashboard/

NEWS_FEED_URLS)

| Tool | Description |
|---|---|
| get_latest_events(topic, limit, include_articles) | Latest clusters by topic |
| get_events_for_entity(entity, limit, timeframe, include_articles) | Clusters matching entity |
| get_event_summary(event_id, include_articles) | LLM narrative for cluster |
| detect_emerging_topics(limit) | Emerging signals from recent clusters |
| get_news_sentiment(entity, timeframe) | Aggregated sentiment |
| get_related_recent_entities(subject, timeframe, limit) | Co-occurrence + Trends blend |
| get_feeds() / toggle_feed(url, enabled) | Feed management |
| debug_dedup(url, title?) | Inspect dedup decisions & similarity signals |
| get_capabilities() | Tool surface documentation |
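
MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests. As a rough sketch, this is what a call to `get_latest_events` might look like on the wire. The helper `tool_call` and the argument values are made up for illustration; in practice an MCP client library builds and sends this message over the SSE transport.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any value unique per request works


def tool_call(name: str, **arguments) -> dict:
    """Build an MCP `tools/call` request for the given tool name and arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Illustrative argument values; valid topics and limits depend on the server.
request = tool_call("get_latest_events", topic="crypto", limit=5, include_articles=False)
print(json.dumps(request, indent=2))
```

Calling `get_capabilities()` first is probably the safest way to confirm the actual parameter names and accepted values.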
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/health | Stats, freshness, feed state, pruning |
| GET | /api/v1/clusters | Paginated clusters |
| GET | /api/v1/sentiment-series | Sentiment time-series |
| GET | /api/v1/entities | Top entities by frequency |
| GET | /api/v1/keywords | Top keywords by frequency |
| GET | /api/v1/clusters/by-entity | Entity search (SQL) |
| GET | /api/v1/clusters/by-keyword | Keyword search (SQL) |
| GET | /api/v1/cluster/{id} | Full cluster detail |
| GET | /api/v1/feeds | Feed state list |
| POST | /api/v1/feeds/toggle | Enable/disable feed |
| GET | /api/v1/config | All config parameters |
| POST | /api/v1/config/update | Update a parameter |
| POST | /api/v1/config/reset | Reset to defaults |
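
A minimal sketch of querying these routes from Python using only the standard library. It assumes the API is served on the same host and port as the endpoints listed earlier and that responses are JSON; neither is confirmed here. The `entity` query parameter name is hypothetical, as are the helpers `api_url` and `get_json`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8506"  # assumed: same host/port as the MCP endpoints


def api_url(path: str, **params) -> str:
    """Join the base URL, an API path, and optional query parameters."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def get_json(path: str, **params):
    """GET an endpoint and decode the body as JSON (requires a running server)."""
    with urlopen(api_url(path, **params), timeout=10) as resp:
        return json.load(resp)


# Building URLs needs no server; `entity` is a hypothetical parameter name.
print(api_url("/api/v1/health"))
print(api_url("/api/v1/clusters/by-entity", entity="Bitcoin"))
```

With the server running, `get_json("/api/v1/health")` would fetch the health payload.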
All parameters are stored in the site_config DB table and editable via the dashboard Config page.
On first startup, the table is seeded from .env or built-in defaults.
Key .env vars (seeded into site_config):
| Variable | Default | Purpose |
|---|---|---|
| NEWS_FEED_URLS | (none) | Comma-separated feed URLs |
| NEWS_REFRESH_INTERVAL_SECONDS | 300 | Polling interval |
| NEWS_DEFAULT_LOOKBACK_HOURS | 24 | Read freshness window |
| NEWS_RETENTION_DAYS | 10 | Prune threshold |
| NEWS_PRUNE_INTERVAL_HOURS | 12 | Prune check interval |
| NEWS_CLUSTER_MAX_AGE_HOURS | 6 | Cross-cycle merge window |
| NEWS_EMBEDDINGS_ENABLED | true | Enable Ollama embeddings |
| NEWS_EMBEDDING_SIMILARITY_THRESHOLD | 0.885 | Cosine threshold |
| OLLAMA_BASE_URL | http://192.168.0.200:11434 | Ollama API URL |
| NEWS_EXTRACT_PROVIDER / NEWS_SUMMARY_PROVIDER | groq | LLM provider |
| NEWS_EXTRACT_MODEL / NEWS_SUMMARY_MODEL | llama4-16e | LLM model |
| GROQ_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY | (none) | API keys |
| ENTITY_BLACKLIST | (none) | Comma-separated entity patterns |
| ENRICHMENT_MAX_PER_REFRESH | 0 (unlimited) | LLM enrichments per cycle |
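
To illustrate how a few of these variables and their defaults could be read, here is a small sketch. It is not the project's actual seeding code; `NewsSettings`, `load_settings`, and the accepted boolean spellings are assumptions made for the example.

```python
import os
from dataclasses import dataclass


def _split_csv(raw: str) -> list[str]:
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def _as_bool(raw: str) -> bool:
    """Treat common truthy spellings as True (the accepted set is an assumption)."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class NewsSettings:
    feed_urls: list[str]
    refresh_interval_seconds: int
    retention_days: int
    embeddings_enabled: bool
    similarity_threshold: float


def load_settings(env=os.environ) -> NewsSettings:
    """Read a subset of the variables above, falling back to the listed defaults."""
    return NewsSettings(
        feed_urls=_split_csv(env.get("NEWS_FEED_URLS", "")),
        refresh_interval_seconds=int(env.get("NEWS_REFRESH_INTERVAL_SECONDS", "300")),
        retention_days=int(env.get("NEWS_RETENTION_DAYS", "10")),
        embeddings_enabled=_as_bool(env.get("NEWS_EMBEDDINGS_ENABLED", "true")),
        similarity_threshold=float(env.get("NEWS_EMBEDDING_SIMILARITY_THRESHOLD", "0.885")),
    )


# Example with a hand-built mapping instead of the real environment.
settings = load_settings({"NEWS_FEED_URLS": "https://example.com/a.xml, https://example.com/b.xml"})
print(settings.feed_urls)
print(settings.refresh_interval_seconds)
```

Passing a plain dict keeps the example independent of whatever is set in the local shell.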
Clustering thresholds (also in site_config):

- title_threshold: 0.75
- jaccard_threshold: 0.55
- dual_title_floor: 0.55
- dual_jaccard_floor: 0.25

NEWS_RETENTION_DAYS

filter_already_seen() splits into:

- new → never seen, full processing
- unchanged → same URL, same content hash → skip
- changed → same URL, different content hash → re-cluster + re-enrich

When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):

- content_hash = SHA1(title|summary) is computed
- seen_articles.content_hash
- enriched_at is cleared → next cycle re-enriches with updated content

./data/news.sqlite (local) or /app/data/news.sqlite (Docker)

Backfill script for seeding seen_articles from existing clusters:
docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
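
The new / unchanged / changed split described above can be sketched as follows. This is an illustration of the idea, not the actual filter_already_seen() code: the article dict shape, the in-memory `seen` mapping (standing in for the seen_articles table), and the UTF-8 encoding are all assumptions.

```python
import hashlib


def content_hash(title: str, summary: str) -> str:
    """SHA1 over "title|summary" (UTF-8 encoding is assumed)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify(articles, seen):
    """Split articles into new / unchanged / changed buckets.

    `articles`: list of dicts with url, title, summary (hypothetical shape).
    `seen`: maps url -> previously stored content hash.
    """
    buckets = {"new": [], "unchanged": [], "changed": []}
    for article in articles:
        digest = content_hash(article["title"], article["summary"])
        previous = seen.get(article["url"])
        if previous is None:
            buckets["new"].append(article)        # never seen: full processing
        elif previous == digest:
            buckets["unchanged"].append(article)  # same URL, same hash: skip
        else:
            buckets["changed"].append(article)    # same URL, new hash: redo
    return buckets


seen = {"https://example.com/story": content_hash("Breaking", "More to come...")}
incoming = [
    {"url": "https://example.com/story", "title": "Breaking", "summary": "Full report."},
    {"url": "https://example.com/other", "title": "Other", "summary": "Text."},
]
result = classify(incoming, seen)
print({name: [a["url"] for a in items] for name, items in result.items()})
```

Here the first article lands in `changed` because its summary differs from the stored hash, and the second lands in `new`.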
http://<host>:8506/dashboard/
See ./version-hash.sh for the current content hash.
The extraction prompt (prompts/extract_entities.prompt) is tested against a curated set of annotated samples to ensure entity/keyword separation quality, especially for smaller models like llama-3.1-8b-instant.
# Run against default prompt with 30 annotated samples (all 5 topics)
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq
# Run with specific prompt file
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --prompt-file prompts/extract_entities.prompt
# Run against larger model for comparison
python scripts/eval_extraction.py --model deepseek/deepseek-v4-flash --provider openrouter
# Verbose per-sample output
python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --verbose
# Collect new samples from live DB for manual annotation
python scripts/eval_extraction.py --collect 30 --output new_samples.json
| Metric | Target | Description |
|---|---|---|
| Entity F1 | ≥ 0.65 | Precision/recall of named entities (proper nouns) |
| Keyword F1 | ≥ 0.40 | Precision/recall of thematic keywords (1-2 word tags) |
| Leakage | 0.0 | Entities appearing in keywords (should never happen) |
| Topic Accuracy | ≥ 0.80 | Correct topic classification (crypto/macro/regulation/ai/other) |
The 30 golden samples in data/annotated_samples.json cover all 5 topics (crypto, macro, regulation, ai, other).
Entity F1: 0.665 (P=0.814 R=0.601)
Keyword F1: 0.468 (P=0.572 R=0.400)
Leakage (avg): 0.000
Topic Acc: 0.867
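
The reported F1 values are not the harmonic mean of the reported P and R (for entities that would be roughly 0.69, not 0.665), which would be consistent with scores being computed per sample and then averaged. The sketch below shows that style of scoring under that assumption; it is not taken from eval_extraction.py, and the leakage definition (fraction of keywords that also appear as entities) is likewise a guess.

```python
def prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall and F1 for one sample (F1 is 0.0 when nothing overlaps)."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def leakage(entities: set[str], keywords: set[str]) -> float:
    """Fraction of keywords that also appear as entities (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = {e.lower() for e in entities}
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def macro_average(scores):
    """Average each of (P, R, F1) across samples."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Two made-up samples: (predicted entities, gold entities).
samples = [
    ({"SEC", "Binance"}, {"SEC", "Binance", "Coinbase"}),
    ({"Fed", "ECB"}, {"Fed"}),
]
avg_p, avg_r, avg_f1 = macro_average([prf(pred, gold) for pred, gold in samples])
print(f"P={avg_p:.3f} R={avg_r:.3f} F1={avg_f1:.3f}")
print(leakage({"SEC", "Binance"}, {"regulation", "sec"}))
```

In this toy example the macro-averaged F1 (about 0.733) is lower than the harmonic mean of the averaged P and R (about 0.789), the same kind of gap seen in the reported numbers.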
The prompt uses 6 few-shot examples with explicit entity/keyword decision rules and topic classification boundaries (especially the regulation vs other distinction for sanctions enforcement).