Lukas Goldschmidt 8e87822bad fix: force re-enrichment when enriched_at is missing from cluster dict 6 days ago
config 57bb07fdd6 Improve entity lookup fallback and docs 2 months ago
dashboard 9a39782637 fix: restore buildDetailHTML for cluster modal drill-down 6 days ago
news_mcp 8e87822bad fix: force re-enrichment when enriched_at is missing from cluster dict 6 days ago
prompts 0e2119d549 prompt 1 week ago
scripts b22882c580 feat: article_identity module, site_config DB table, debug_dedup tool 6 days ago
.dockerignore 74e3b0f555 news-mcp: add docker compose persistence setup 2 months ago
.env.example cd7b6ade99 docs: update env example,readme,project,release notes for v0.3.1 1 week ago
.gitignore 13f8f1d5ab Initialize news-mcp scaffold 2 months ago
AGENTS.md f8677e48b5 fix: collapse two-store design flaw — SQL-level entity/keyword search in all MCP tools 1 week ago
Dockerfile 1fb3e2416a docker improvements 2 weeks ago
OUTLOOK.md afa3876ad1 docs: update README, PROJECT.md, OUTLOOK.md for v0.5.0 6 days ago
POLLER_UPGRADE_PLAN.md e3d27d9fd1 docs: cleanup obsolete content, document design flaw 1 week ago
PROJECT.md afa3876ad1 docs: update README, PROJECT.md, OUTLOOK.md for v0.5.0 6 days ago
README.md afa3876ad1 docs: update README, PROJECT.md, OUTLOOK.md for v0.5.0 6 days ago
RELEASE_NOTES.md d855ede033 fix: topic-independent cluster IDs + cross-cycle merge bucket fix 1 week ago
docker-compose.yml 1fb3e2416a docker improvements 2 weeks ago
killserver.sh 861c3e851d more fixes, maybe stable 4 weeks ago
live_tests.sh cdd52b9f1e Refactor news LLM extraction pipeline 2 months ago
provider_test.sh 14ab064bec added provider test 2 weeks ago
requirements.txt 600fcdbd55 Polish news-mcp docs + add emerging topics and tests 2 months ago
restart.sh 13f8f1d5ab Initialize news-mcp scaffold 2 months ago
run.sh 861c3e851d more fixes, maybe stable 4 weeks ago
test_embedding_support.py c984d1f589 tests: add embedding support guards for clustering 2 months ago
test_news_mcp.py e8cef4a441 feat: detect in-place article content updates via content hash 6 days ago
tests.sh 600fcdbd55 Polish news-mcp docs + add emerging topics and tests 2 months ago
version-hash.sh 5190469475 version hash 1 week ago
wipe.sh 2c049a1c7e fix wipe.sh: proper .env sourcing, inline Python, respect NEWS_MCP_DB_PATH 1 week ago

README.md

📰 News MCP Server

FastMCP-based MCP server that turns news feeds into deduplicated, enriched clusters.

Quick start

Local:

    cd news-mcp
    source .venv/bin/activate
    pip install -r requirements.txt
    ./run.sh

Docker Compose:

    docker compose up --build

Endpoints:

  • MCP: http://127.0.0.1:8506/mcp/sse
  • Health: http://127.0.0.1:8506/health
  • Dashboard: http://127.0.0.1:8506/dashboard/
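
Once the server is up, a readiness check can poll the health endpoint before anything else connects. The following is a minimal standard-library sketch. It assumes the endpoint answers with HTTP 200 when the server is healthy, which this README does not state; `http_status`, `wait_until_healthy`, and the injectable `fetch` parameter are hypothetical names, not part of news-mcp.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:8506/health"  # health endpoint listed above


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(url=HEALTH_URL, attempts=10, delay=1.0, fetch=http_status) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy) or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing a fake `fetch` makes the retry loop testable without a running server.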

What it does

  • Fetches from configured news feeds (NEWS_FEED_URLS)
  • Three-layer dedup: feed hash → article URL → content hash (detects in-place updates)
  • Clusters articles via title similarity (≥0.75), Jaccard (≥0.55), dual-signal, or embeddings
  • Enriches clusters with LLM: topic, entities, sentiment, keywords, summary
  • Resolves entities via Google Trends suggestions
  • Dashboard with runtime Config page
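
The first dedup layer (feed hash) can be pictured as fingerprinting the raw feed body and skipping the feed when the fingerprint has not changed. This is a sketch under assumptions: the README does not say which hash function or storage the project uses, so SHA-256 and the in-memory `last_seen` dict are illustrative choices, and `feed_fingerprint` and `feed_changed` are hypothetical names.

```python
import hashlib


def feed_fingerprint(raw_feed: bytes) -> str:
    """Hex digest of the raw feed body (SHA-256 is an assumption)."""
    return hashlib.sha256(raw_feed).hexdigest()


def feed_changed(url: str, raw_feed: bytes, last_seen: dict[str, str]) -> bool:
    """Return True and record the new fingerprint if the feed body changed."""
    digest = feed_fingerprint(raw_feed)
    if last_seen.get(url) == digest:
        return False  # unchanged feed: skip parsing and everything downstream
    last_seen[url] = digest
    return True
```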

Tools (MCP)

| Tool | Description |
|---|---|
| get_latest_events(topic, limit, include_articles) | Latest clusters by topic |
| get_events_for_entity(entity, limit, timeframe, include_articles) | Clusters matching entity |
| get_event_summary(event_id, include_articles) | LLM narrative for cluster |
| detect_emerging_topics(limit) | Emerging signals from recent clusters |
| get_news_sentiment(entity, timeframe) | Aggregated sentiment |
| get_related_recent_entities(subject, timeframe, limit) | Co-occurrence + Trends blend |
| get_feeds() / toggle_feed(url, enabled) | Feed management |
| debug_dedup(url, title?) | Inspect dedup decisions & similarity signals |
| get_capabilities() | Tool surface documentation |

REST API

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/health | Stats, freshness, feed state, pruning |
| GET | /api/v1/clusters | Paginated clusters |
| GET | /api/v1/sentiment-series | Sentiment time-series |
| GET | /api/v1/entities | Top entities by frequency |
| GET | /api/v1/keywords | Top keywords by frequency |
| GET | /api/v1/clusters/by-entity | Entity search (SQL) |
| GET | /api/v1/clusters/by-keyword | Keyword search (SQL) |
| GET | /api/v1/cluster/{id} | Full cluster detail |
| GET | /api/v1/feeds | Feed state list |
| POST | /api/v1/feeds/toggle | Enable/disable feed |
| GET | /api/v1/config | All config parameters |
| POST | /api/v1/config/update | Update a parameter |
| POST | /api/v1/config/reset | Reset to defaults |
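
The GET routes can be queried with nothing more than the standard library. The sketch below is illustrative only: the paths come from the table above, but the query parameter names (`entity`, `limit`) and the shape of the JSON response are assumptions that the README does not document, and `api_url` and `get_json` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8506"  # local server address from the quick start


def api_url(path: str, **params) -> str:
    """Build a full API URL, URL-encoding any query parameters."""
    query = urllib.parse.urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def get_json(path: str, **params):
    """GET a path and decode the body as JSON (response shape not documented here)."""
    with urllib.request.urlopen(api_url(path, **params), timeout=10) as resp:
        return json.load(resp)


# Example calls (require a running server; parameter names are assumed):
#   get_json("/api/v1/health")
#   get_json("/api/v1/clusters/by-entity", entity="OpenAI", limit=5)
```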

Configuration

All parameters are stored in the site_config DB table and are editable via the dashboard Config page. On first startup, the table is seeded from .env, falling back to built-in defaults for any variable that is not set.

Key .env vars (seeded into site_config):

| Variable | Default | Purpose |
|---|---|---|
| NEWS_FEED_URLS | | Comma-separated feed URLs |
| NEWS_REFRESH_INTERVAL_SECONDS | 300 | Polling interval |
| NEWS_DEFAULT_LOOKBACK_HOURS | 24 | Read freshness window |
| NEWS_RETENTION_DAYS | 10 | Prune threshold |
| NEWS_PRUNE_INTERVAL_HOURS | 12 | Prune check interval |
| NEWS_CLUSTER_MAX_AGE_HOURS | 6 | Cross-cycle merge window |
| NEWS_EMBEDDINGS_ENABLED | true | Enable Ollama embeddings |
| NEWS_EMBEDDING_SIMILARITY_THRESHOLD | 0.885 | Cosine threshold |
| OLLAMA_BASE_URL | http://192.168.0.200:11434 | Ollama API URL |
| NEWS_EXTRACT_PROVIDER / NEWS_SUMMARY_PROVIDER | groq | LLM provider |
| NEWS_EXTRACT_MODEL / NEWS_SUMMARY_MODEL | llama4-16e | LLM model |
| GROQ_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY | | API keys |
| ENTITY_BLACKLIST | | Comma-separated entity patterns |
| ENRICHMENT_MAX_PER_REFRESH | 0 (unlimited) | LLM enrichments per cycle |
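
To make the "environment value, else built-in default" rule concrete, here is a small sketch of how such an overlay could work for a few of the variables above. It is not the project's loader and does not touch the site_config table; `DEFAULTS`, `coerce`, and `load_config` are hypothetical names, and the set of strings treated as true is an assumption.

```python
import os

# Defaults copied from the table above; only a subset of variables is shown.
DEFAULTS = {
    "NEWS_REFRESH_INTERVAL_SECONDS": 300,
    "NEWS_DEFAULT_LOOKBACK_HOURS": 24,
    "NEWS_RETENTION_DAYS": 10,
    "NEWS_PRUNE_INTERVAL_HOURS": 12,
    "NEWS_CLUSTER_MAX_AGE_HOURS": 6,
    "NEWS_EMBEDDINGS_ENABLED": True,
    "NEWS_EMBEDDING_SIMILARITY_THRESHOLD": 0.885,
}


def coerce(raw: str, default):
    """Convert an environment string to the type of its default value."""
    if isinstance(default, bool):  # check bool before int: bool is an int subclass
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(raw)


def load_config(env=None) -> dict:
    """Overlay environment values on the built-in defaults."""
    env = os.environ if env is None else env
    config = {
        name: coerce(env[name], default) if name in env else default
        for name, default in DEFAULTS.items()
    }
    # NEWS_FEED_URLS has no default; split the comma-separated list if present.
    raw_urls = env.get("NEWS_FEED_URLS", "")
    config["NEWS_FEED_URLS"] = [u.strip() for u in raw_urls.split(",") if u.strip()]
    return config
```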

Clustering thresholds (also in site_config):

  • title_threshold: 0.75
  • jaccard_threshold: 0.55
  • dual_title_floor: 0.55
  • dual_jaccard_floor: 0.25

Clustering pipeline

  1. Fetch all feeds concurrently
  2. Feed hash — skip unchanged feeds entirely
  3. Retention filter — drop articles older than NEWS_RETENTION_DAYS
  4. Seen articles — filter_already_seen() splits into:
    • new → never seen, full processing
    • unchanged → same URL, same content hash → skip
    • changed → same URL, different content hash → re-cluster + re-enrich
  5. Cluster — title similarity, Jaccard, embeddings, dual-signal merge
  6. Enrich — LLM extraction, summarization, sentiment
  7. Prune — delete clusters older than retention window

Content-change detection

When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):

  1. content_hash = SHA1(title|summary) is computed
  2. Compared against seen_articles.content_hash
  3. If different → article is re-clustered into its existing cluster
  4. enriched_at is cleared → next cycle re-enriches with updated content
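
The four steps above might look roughly like this against SQLite. Only seen_articles.content_hash and enriched_at are named in this README; the rest of the schema (a `url` key, a `cluster_id` column, a `clusters` table keyed by `id`) is hypothetical, as are the UTF-8 encoding of the hashed string and the function names.

```python
import hashlib
import sqlite3


def content_hash(title: str, summary: str) -> str:
    """SHA-1 over 'title|summary' (UTF-8 encoding is an assumption)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def apply_update(conn: sqlite3.Connection, url: str, title: str, summary: str) -> bool:
    """Return True if the stored hash differed and enrichment was reset."""
    new_hash = content_hash(title, summary)
    row = conn.execute(
        "SELECT content_hash, cluster_id FROM seen_articles WHERE url = ?", (url,)
    ).fetchone()
    if row is None or row[0] == new_hash:
        return False  # unseen article or unchanged content: nothing to reset
    conn.execute(
        "UPDATE seen_articles SET content_hash = ? WHERE url = ?", (new_hash, url)
    )
    # Clearing enriched_at marks the cluster for re-enrichment on the next cycle.
    conn.execute("UPDATE clusters SET enriched_at = NULL WHERE id = ?", (row[1],))
    conn.commit()
    return True
```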

Persistence

  • SQLite at ./data/news.sqlite (local) or /app/data/news.sqlite (Docker)
  • Schema auto-migrates on startup (ALTER TABLE for new columns)
  • Backfill script for seeding seen_articles from existing clusters:

    docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
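
An "ALTER TABLE for new columns" migration is commonly written as an idempotent check against PRAGMA table_info. The sketch below shows that general pattern with the standard sqlite3 module; it is not taken from the news-mcp source, and `ensure_column` is a hypothetical helper name.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if missing; return True when a column was added.

    Identifiers are interpolated directly, so pass only trusted, hard-coded names.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True
```

Because the function is a no-op when the column already exists, it can run unconditionally on every startup.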
    

Dashboard

http://<host>:8506/dashboard/

  • Health — stats, charts, feed status
  • Feeds — toggle on/off
  • Clusters — filterable table, click for drill-down modal
  • Sentiment — time-series chart
  • Entities — top entities, frequency chart
  • Keywords — top keywords, frequency chart
  • Config — runtime parameter tuning (new in v0.5.0)

Version

See ./version-hash.sh for the current content hash.