news-mcp release notes

v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals

Highlights

Multi-article signal comparison: _signals() now compares a new article against all articles in a candidate cluster, not just the seed. Matching uses the best title score and the best Jaccard score found across all cluster members.
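As a rough illustration of the "best score across all members" idea, the sketch below takes the maximum Jaccard overlap between a new title and every title in a cluster. The whitespace tokenizer and the function names (tokens, jaccard, best_jaccard) are assumptions for this example, not the actual _signals() internals.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens (an assumed, deliberately simple tokenizer)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def best_jaccard(new_title: str, member_titles: list[str]) -> float:
    """Best Jaccard score of the new title against every cluster member."""
    new = tokens(new_title)
    return max((jaccard(new, tokens(t)) for t in member_titles), default=0.0)
```

Comparing against every member rather than only the seed means a late article can still match when the seed headline is worded differently from the rest of the cluster.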
Stable cluster IDs: cluster_id = sha1(topic | min_article_key) instead of sha1(topic | seed_title). The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
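A minimal sketch of the ID scheme described above, assuming a literal "|" separator, UTF-8 encoding, and a hex digest; the function name stable_cluster_id is hypothetical.

```python
import hashlib


def stable_cluster_id(topic: str, article_keys: list[str]) -> str:
    """Derive a cluster ID from the topic and the smallest article key."""
    if not article_keys:
        raise ValueError("a cluster needs at least one article key")
    # min() does not depend on the order in which articles were processed,
    # so the same set of keys always yields the same ID.
    payload = f"{topic}|{min(article_keys)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

Because only the minimum key enters the hash, shuffling the article list does not change the result, which is what removes the order-dependent duplicates.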
Cross-cycle merge: the poller loads recent clusters from the DB (controlled by NEWS_CLUSTER_MAX_AGE_HOURS, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
Orphan merge: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
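The Union-Find pass can be pictured roughly as follows. This is a sketch under assumptions: clusters are represented as a mapping from cluster ID to a set of article keys, and merge_orphans is a hypothetical name, not the function in the codebase.

```python
def merge_orphans(clusters: dict[str, set[str]]) -> list[set[str]]:
    """Merge clusters that share at least one article key; return merged key sets."""
    parent = {cid: cid for cid in clusters}

    def find(x: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # The first cluster seen for a key "owns" it; later clusters with the
    # same key are unioned with the owner.
    owner: dict[str, str] = {}
    for cid, keys in clusters.items():
        for key in keys:
            if key in owner:
                union(owner[key], cid)
            else:
                owner[key] = cid

    merged: dict[str, set[str]] = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return list(merged.values())
```

Union-Find handles transitive overlaps: if cluster A shares a key with B and B shares a different key with C, all three end up in one group.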
Cascade match via _is_match(): unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable title_threshold parameter.
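To make the short-circuit order concrete, here is a hedged sketch of a cascade. The Signals container, every threshold value, and in particular the consensus rule (mean of the available signals) are assumptions made for illustration; the notes above do not specify how _is_match() defines consensus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    cosine: Optional[float]  # None when no embedding is available
    title: float
    jaccard: float


def is_match(
    s: Signals,
    cosine_threshold: float = 0.8,      # assumed value
    title_threshold: float = 0.85,      # assumed value
    jaccard_threshold: float = 0.5,     # assumed value
    consensus_threshold: float = 0.6,   # assumed value
) -> Optional[str]:
    """Return the name of the first passing signal, or None if nothing passes."""
    if s.cosine is not None and s.cosine >= cosine_threshold:
        return "cosine"
    if s.title >= title_threshold:
        return "title"
    if s.jaccard >= jaccard_threshold:
        return "jaccard"
    # Hypothetical consensus rule: the mean of the available signals is high
    # enough even though no single signal passed on its own.
    available = [v for v in (s.cosine, s.title, s.jaccard) if v is not None]
    if sum(available) / len(available) >= consensus_threshold:
        return "consensus"
    return None
```

Returning the name of the signal that fired, rather than a bare boolean, makes it easier to log why two articles were merged.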
Cluster embedding updated on merge: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
NEWS_CLUSTER_MAX_AGE_HOURS env var (default 4): controls the cross-cycle merge window. Set to 0 to disable cross-cycle merge.
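One way the merge window might be turned into a timestamp cutoff is sketched below. Only the variable name, the default of 4, and the "0 disables" behaviour come from the notes above; the helper name cross_cycle_cutoff and the choice to accept fractional hours are assumptions.

```python
import os
from collections.abc import Mapping
from datetime import datetime, timedelta, timezone
from typing import Optional


def cross_cycle_cutoff(
    now: datetime, env: Mapping[str, str] = os.environ
) -> Optional[datetime]:
    """Oldest cluster timestamp still eligible as a merge target, or None if disabled."""
    hours = float(env.get("NEWS_CLUSTER_MAX_AGE_HOURS", "4"))
    if hours <= 0:
        return None  # 0 disables cross-cycle merge
    return now - timedelta(hours=hours)
```

A caller would then load only clusters updated at or after the returned cutoff, for example with cross_cycle_cutoff(datetime.now(timezone.utc)).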

Migration notes

No database schema changes.
Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via ON CONFLICT(cluster_id) once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.

v0.3.0 — concurrent polling, enrichment retry, all-topics default

Highlights

Async concurrent RSS fetching — all feeds fetched in parallel with asyncio.gather + httpx, bounded by semaphore (default 10 concurrent). Previously sequential: ~40 feeds × 2-5s each = minutes. Now ~10 at a time.
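The bounded-gather pattern looks roughly like this. To keep the sketch dependency-free, the HTTP call is passed in as a coroutine function (fetch); in the poller this role would be played by an httpx request. The name fetch_all and the use of return_exceptions=True are assumptions.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 10,
) -> list[Any]:
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> Any:
        async with sem:
            return await fetch(url)

    # return_exceptions=True keeps one failing feed from cancelling the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Results are returned in the same order as urls, so each result can be paired with its feed by index.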
Concurrent Ollama embeddings — embedding vectors for all articles pre-computed in parallel before the clustering loop (bounded by semaphore, default 4). Previously one-by-one during clustering.
Concurrent LLM enrichment — entity extraction / topic classification / sentiment calls run concurrently across all clusters, bounded by per-provider semaphore:
- openrouter: 2 (free tier)
- openai: 5
- groq: 8
- Override via NEWS_LLM_CONCURRENCY_<PROVIDER> env var
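The defaults and the override listed above could be resolved along these lines. The helper name llm_concurrency and the assumption that <PROVIDER> is the upper-cased provider name are illustrative; config.py is the authority on the real lookup.

```python
import os
from collections.abc import Mapping

# Defaults as listed in these notes.
DEFAULT_LLM_CONCURRENCY = {"openrouter": 2, "openai": 5, "groq": 8}


def llm_concurrency(provider: str, env: Mapping[str, str] = os.environ) -> int:
    """Concurrency limit for a provider, honouring the env override if set."""
    # Raises KeyError for providers that have no listed default.
    default = DEFAULT_LLM_CONCURRENCY[provider.lower()]
    # Assumed naming: NEWS_LLM_CONCURRENCY_GROQ, NEWS_LLM_CONCURRENCY_OPENAI, ...
    raw = env.get(f"NEWS_LLM_CONCURRENCY_{provider.upper()}")
    return int(raw) if raw else default
```

The returned value would typically size an asyncio.Semaphore that wraps each call to that provider.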
Per-cluster retry with backoff — failed LLM calls retry up to 3 times (2s, 4s, 8s backoff) before marking the cluster as failed. Failed clusters are automatically retried on the next polling cycle.
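A minimal sketch of the retry schedule described above (one initial attempt, then up to three retries after 2s, 4s, and 8s). The function name call_with_retry, the injectable sleep parameter, and catching every Exception are assumptions for the example.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def call_with_retry(
    call: Callable[[], Awaitable[Any]],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Run call(); on failure retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller mark the cluster as failed
            # Delays are base_delay * 2**attempt: 2s, 4s, 8s with the defaults.
            await sleep(base_delay * 2 ** attempt)
```

Passing sleep in as a parameter lets tests substitute a recorder instead of waiting 14 seconds.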
Cross-cycle failure recovery — get_failed_enrichment_clusters() queries the DB for clusters with enrichment_failed_at set but below the retry threshold, so transient failures self-heal.
LLM provider retries — _call_groq and _call_openai now have the same retry logic as _call_openrouter (2 retries, exponential backoff on 429/500/502/503, empty response handling).
get_latest_events() default changed — omitting topic now returns clusters from all topics instead of defaulting to "crypto". Pass topic="crypto" (or macro/regulation/ai/other) to filter.
Configuration — all concurrency limits configurable via env vars; see config.py for NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>.

Migration notes

No database schema changes.
If you relied on get_latest_events() without a topic argument returning only crypto clusters, pass topic="crypto" explicitly.
Concurrency defaults are conservative so that they stay within free-tier rate limits. Raise them via env vars if you are on a paid plan.

v0.2.0 — embedding-aware clustering and richer agent tools

Highlights

Optional Ollama embedding path for clustering (NEWS_EMBEDDINGS_ENABLED=true)
Configurable Ollama base URL and embedding model
Tunable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
New agent tool: get_related_entities(subject, timeframe, limit)
Optional article payloads for get_latest_events, get_events_for_entity, and get_event_summary
Improved emerging-topic scoring with co-occurrence and importance weighting
Blacklist enforcement back-clean script for stored clusters
Embedding backfill script for older clusters
Embedding similarity analysis script for threshold tuning
Embedding-based merge script with dry-run and wet modes
Article dedup cleanup for repeated article variants inside clusters

Notes

Ollama embeddings are tried first when enabled; heuristic clustering remains the fallback.
The merge script is intentionally destructive and should be preceded by a dry run.
The article dedup cleanup script is safe to run after ingestion or on the historical dataset.
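The "embeddings first, heuristics as fallback" note can be sketched as a three-way result: match, no match, or no answer. The function names below are hypothetical, and the threshold is passed in explicitly because these notes do not state the default of NEWS_EMBEDDING_SIMILARITY_THRESHOLD.

```python
import math
from collections.abc import Sequence
from typing import Optional


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_match(
    a: Optional[Sequence[float]],
    b: Optional[Sequence[float]],
    threshold: float,
) -> Optional[bool]:
    """True/False from the embedding check, or None when a vector is missing."""
    if a is None or b is None:
        return None  # caller falls back to heuristic clustering
    return cosine_similarity(a, b) >= threshold
```

Returning None instead of False when a vector is missing keeps "embeddings disagree" distinct from "embeddings unavailable", so the heuristic path runs only in the second case.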
