news-mcp release notes

v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters

Highlights

  • Topic-independent cluster IDs: _stable_cluster_id no longer includes the topic in the hash. The ID is now sha1(min_article_key) instead of sha1(topic|min_article_key). This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass ON CONFLICT DO UPDATE, and silently create a duplicate row in the DB.

  • Cross-cycle merge bucket fix: when seeding existing_clusters from the DB, the cluster's topic is now re-derived via the same heuristic (normalize_topic_from_title) that new articles use. Previously, existing clusters were bucketed by their enriched topic (from the DB), so a new article with a different heuristic topic would land in a different by_topic bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
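
The ID change in the first highlight can be illustrated with a minimal sketch. The function names, the string form of the article keys, and the use of a full hex digest are assumptions made for illustration; only the hash inputs (`topic|min_article_key` before, `min_article_key` now) come from the notes above.

```python
import hashlib


def legacy_cluster_id(topic, article_keys):
    """Pre-0.3.2 scheme (sketch): the topic is part of the hashed string."""
    min_key = min(article_keys)
    return hashlib.sha1(f"{topic}|{min_key}".encode("utf-8")).hexdigest()


def stable_cluster_id(article_keys):
    """0.3.2 scheme (sketch): hash only the smallest article key."""
    min_key = min(article_keys)
    return hashlib.sha1(min_key.encode("utf-8")).hexdigest()


# Hypothetical article keys; the real key format is not shown in these notes.
keys = ["url:example.com/b", "url:example.com/a"]

# A topic reclassification changes the legacy ID, so an upsert keyed on it misses.
print(legacy_cluster_id("macro", keys) == legacy_cluster_id("crypto", keys))  # False

# The new ID depends neither on the topic nor on the order of the keys.
print(stable_cluster_id(keys) == stable_cluster_id(keys[::-1]))  # True
```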

Root cause

Two cooperating bugs allowed the same article to accumulate duplicate DB rows:

  1. cluster_id included topic → same article with different topic → different PK → ON CONFLICT never fires.
  2. existing clusters bucketed by enriched topic → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.

Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.
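
The bucketing half of the fix can be sketched as follows. `heuristic_topic` is a hypothetical stand-in for normalize_topic_from_title (its real rules are not described here), and the dict shapes for clusters and articles are assumptions; the point is only that existing clusters and new articles pass through the same function before matching.

```python
from collections import defaultdict


def heuristic_topic(title):
    """Hypothetical stand-in for normalize_topic_from_title."""
    words = set(title.lower().split())
    if words & {"bitcoin", "crypto"}:
        return "crypto"
    if words & {"fed", "inflation"}:
        return "macro"
    return "other"


def bucket_by_heuristic_topic(existing_clusters, new_articles):
    """Group DB clusters and new articles under the same heuristic topic."""
    by_topic = defaultdict(lambda: {"clusters": [], "articles": []})
    for cluster in existing_clusters:
        # Ignore the enriched topic stored in the DB; re-derive it from the title.
        by_topic[heuristic_topic(cluster["title"])]["clusters"].append(cluster)
    for article in new_articles:
        by_topic[heuristic_topic(article["title"])]["articles"].append(article)
    return by_topic


# The LLM relabelled this cluster as "crypto", but its title still reads as macro.
existing = [{"title": "Fed signals rate cut", "topic": "crypto"}]
incoming = [{"title": "Fed rate cut expected in June"}]
buckets = bucket_by_heuristic_topic(existing, incoming)

# Both now sit in the "macro" bucket, so the matching loop can compare them.
print(sorted(buckets))  # ['macro']
```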

Migration notes

  • Existing cluster IDs will change on the next polling cycle because the hash input changed. Rows written under the previous IDs become stale: the new code upserts under the new ID, so ON CONFLICT never touches the old rows. They will age out via pruning. To clean them up immediately, run a one-time dedup pass, or wipe and let the next refresh rebuild.
  • No database schema changes.

v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals

Highlights

  • Emerging topics rewrite (detect_emerging_topics): complete rewrite with 6 new capabilities:
    • Timeframe parameter ("4h", "24h", "3d", etc.) — controls lookback window instead of always using DEFAULT_LOOKBACK_HOURS
    • Velocity scoring — splits the window into recent vs prior half, computes velocity = (recent + 0.5) / (prior + 0.5). Entities accelerating now vs before score much higher than steady-state ones
    • Composed trend score — replaces the flat 0.25 + 0.40*imp + 0.08*count formula with a weighted combination of: velocity (35%), recency concentration (25%), source diversity (15%), sustained presence across time buckets (10%), importance (15%)
    • Topic scoping — optional topic parameter filters to a specific category before scoring
    • Entity neighborhood scoping — optional around parameter only returns entities co-occurring with the specified entity (e.g. around="Bitcoin" finds what's emerging in Bitcoin's neighborhood)
    • Richer output — each result now includes velocity, recent_count, prior_count, source_count alongside trend_score and related_entities
  • Multi-article signal comparison: _signals() now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and jaccard scores across all cluster members are used for matching.
  • Stable cluster IDs: cluster_id = sha1(topic | min_article_key) instead of sha1(topic | seed_title). The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
  • Cross-cycle merge: the poller loads recent clusters from the DB (controlled by NEWS_CLUSTER_MAX_AGE_HOURS, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
  • Orphan merge: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
  • Cascade match via _is_match(): unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable title_threshold parameter.
  • Cluster embedding updated on merge: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
  • NEWS_CLUSTER_MAX_AGE_HOURS env var (default 4): controls the cross-cycle merge window. Set to 0 to disable cross-cycle merge.
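
The orphan merge described above is a Union-Find pass over clusters that share article keys. The following is a minimal sketch of that technique, not the project's code: the function name and the `cluster_id -> set of article keys` input shape are assumptions.

```python
def merge_clusters_sharing_keys(clusters):
    """Merge clusters that share at least one article key.

    clusters: dict mapping cluster_id -> set of article keys.
    Returns a dict mapping each surviving root cluster_id -> merged key set.
    """
    parent = {cid: cid for cid in clusters}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as root so the result is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_owner = {}  # article key -> first cluster seen holding it
    for cid, keys in clusters.items():
        for key in keys:
            if key in first_owner:
                union(cid, first_owner[key])
            else:
                first_owner[key] = cid

    merged = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return merged


merged = merge_clusters_sharing_keys(
    {"c1": {"a", "b"}, "c2": {"b", "c"}, "c3": {"d"}}
)
# c1 and c2 share key "b" and collapse into c1; c3 is left alone.
print(merged == {"c1": {"a", "b", "c"}, "c3": {"d"}})  # True
```

Merging is transitive: if c1 shares a key with c2 and c2 shares a different key with c3, all three end up under one root.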

Migration notes

  • No database schema changes.
  • Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via ON CONFLICT(cluster_id) once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
  • Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.
  • detect_emerging_topics output shape changed: count replaced by recent_count + prior_count, new fields velocity and source_count. Clients using the old count field need to switch to recent_count.
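
To make the new fields concrete, here is a small sketch of how recent_count, prior_count, and velocity relate, using the formula from the notes above, velocity = (recent + 0.5) / (prior + 0.5). The function name, the timestamp-list input, and the choice to count a mention exactly at the midpoint as recent are assumptions.

```python
from datetime import datetime, timedelta, timezone


def velocity_counts(mention_times, now, window):
    """Split the lookback window in half and compare mention counts."""
    start = now - window
    midpoint = now - window / 2
    recent = sum(1 for t in mention_times if midpoint <= t <= now)
    prior = sum(1 for t in mention_times if start <= t < midpoint)
    # The +0.5 smoothing avoids division by zero when the prior half is empty.
    velocity = (recent + 0.5) / (prior + 0.5)
    return {"recent_count": recent, "prior_count": prior, "velocity": velocity}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
mentions = [
    now - timedelta(minutes=30),
    now - timedelta(hours=1),
    now - timedelta(minutes=90),
    now - timedelta(hours=3),
]
stats = velocity_counts(mentions, now, window=timedelta(hours=4))
# Three mentions in the recent half, one in the prior half: velocity = 3.5 / 1.5.
print(stats["recent_count"], stats["prior_count"])  # 3 1
```

A client that previously read `count` would read `stats["recent_count"]` under this shape.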

v0.3.0 — concurrent polling, enrichment retry, all-topics default

Highlights

  • Async concurrent RSS fetching — all feeds fetched in parallel with asyncio.gather + httpx, bounded by semaphore (default 10 concurrent). Previously sequential: ~40 feeds × 2-5s each = minutes. Now ~10 at a time.
  • Concurrent Ollama embeddings — embedding vectors for all articles pre-computed in parallel before the clustering loop (bounded by semaphore, default 4). Previously one-by-one during clustering.
  • Concurrent LLM enrichment — entity extraction / topic classification / sentiment calls run concurrently across all clusters, bounded by per-provider semaphore:
    • openrouter: 2 (free tier)
    • openai: 5
    • groq: 8
    • Override via NEWS_LLM_CONCURRENCY_<PROVIDER> env var
  • Per-cluster retry with backoff — failed LLM calls retry up to 3 times (2s, 4s, 8s backoff) before marking the cluster as failed. Failed clusters are automatically retried on the next polling cycle.
  • Cross-cycle failure recovery: get_failed_enrichment_clusters() queries the DB for clusters with enrichment_failed_at set but below the retry threshold, so transient failures self-heal.
  • LLM provider retries: _call_groq and _call_openai now have the same retry logic as _call_openrouter (2 retries, exponential backoff on 429/500/502/503, empty response handling).
  • get_latest_events() default changed — omitting topic now returns clusters from all topics instead of defaulting to "crypto". Pass topic="crypto" (or macro/regulation/ai/other) to filter.
  • Configuration — all concurrency limits configurable via env vars; see config.py for NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>.
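
The per-cluster retry schedule (up to 3 retries with 2s, 4s, 8s backoff) can be sketched like this. `call_with_retry` and its parameters are hypothetical names, and catching every `Exception` is a simplification; real code would likely retry only on transient errors. The `sleep` parameter is injectable so the demo runs without waiting.

```python
import asyncio


async def call_with_retry(call, retries=3, base_delay=2.0, sleep=asyncio.sleep):
    """Await call(); on failure retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: let the caller mark the cluster failed
            await sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s


async def demo():
    delays = []
    attempts = 0

    async def fake_sleep(seconds):
        delays.append(seconds)  # record the delay instead of waiting

    async def flaky_enrichment():
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise RuntimeError("transient failure")
        return "ok"

    result = await call_with_retry(flaky_enrichment, sleep=fake_sleep)
    return result, delays


result, delays = asyncio.run(demo())
print(result, delays)  # ok [2.0, 4.0]
```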

Migration notes

  • No database schema changes.
  • If you relied on get_latest_events() without a topic argument returning only crypto clusters, pass topic="crypto" explicitly.
  • Concurrency defaults are conservative to stay within providers' free-tier rate limits. Raise them via env vars if you are on a paid plan.
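
As a rough illustration of how such a limit might be read and applied, the sketch below resolves a per-provider limit from the environment and uses it to bound `asyncio.gather` with a semaphore. The helper names are hypothetical, and upper-casing the provider name in NEWS_LLM_CONCURRENCY_<PROVIDER> is an assumption; the default values are the ones listed in the highlights.

```python
import asyncio
import os

# Defaults as listed in the v0.3.0 highlights.
DEFAULT_LLM_CONCURRENCY = {"openrouter": 2, "openai": 5, "groq": 8}


def llm_concurrency(provider, env=os.environ):
    """Return the env override if set, otherwise the built-in default."""
    raw = env.get(f"NEWS_LLM_CONCURRENCY_{provider.upper()}")
    return int(raw) if raw else DEFAULT_LLM_CONCURRENCY[provider]


async def run_bounded(coros, limit):
    """Run all coroutines, but let at most `limit` of them run at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def demo(limit):
    active = peak = 0

    async def fake_llm_call(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # track how many calls overlap
        await asyncio.sleep(0.01)
        active -= 1
        return i

    results = await run_bounded([fake_llm_call(i) for i in range(10)], limit)
    return results, peak


limit = llm_concurrency("openrouter", env={})  # no override set, so 2
results, peak = asyncio.run(demo(limit))
print(results == list(range(10)), peak <= limit)  # True True
```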

v0.2.0 — embedding-aware clustering and richer agent tools

Highlights

  • Optional Ollama embedding path for clustering (NEWS_EMBEDDINGS_ENABLED=true)
  • Configurable Ollama base URL and embedding model
  • Tunable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
  • New agent tool: get_related_entities(subject, timeframe, limit)
  • Optional article payloads for get_latest_events, get_events_for_entity, and get_event_summary
  • Improved emerging-topic scoring with co-occurrence and importance weighting
  • Blacklist enforcement back-clean script for stored clusters
  • Embedding backfill script for older clusters
  • Embedding similarity analysis script for threshold tuning
  • Embedding-based merge script with dry-run and wet modes
  • Article dedup cleanup for repeated article variants inside clusters

Notes

  • Ollama embeddings are tried first when enabled; heuristic clustering remains the fallback.
  • The merge script is intentionally destructive and should be preceded by a dry run.
  • The article dedup cleanup script is safe to run after ingestion or on the historical dataset.
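
The "embeddings first, heuristic fallback" behaviour in the first note can be sketched as below. This is an illustration under stated assumptions, not the project's matcher: the function names, both threshold values, and the use of token-level Jaccard similarity as the heuristic are hypothetical (the real embedding threshold comes from NEWS_EMBEDDING_SIMILARITY_THRESHOLD).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(title_a, title_b):
    tokens_a = set(title_a.lower().split())
    tokens_b = set(title_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def is_same_event(title_a, title_b, emb_a=None, emb_b=None,
                  embedding_threshold=0.85, jaccard_threshold=0.5):
    """Try the embedding signal first; fall back to the title heuristic."""
    if emb_a is not None and emb_b is not None:
        if cosine_similarity(emb_a, emb_b) >= embedding_threshold:
            return True
    # Embeddings missing or below threshold: use the heuristic instead.
    return jaccard(title_a, title_b) >= jaccard_threshold


# Different wording, near-identical vectors: the embedding path matches.
print(is_same_event("Fed cuts rates", "Central bank eases policy",
                    emb_a=[1.0, 0.0], emb_b=[1.0, 0.0]))  # True

# No embeddings available: Jaccard on titles is 3/4, above the 0.5 threshold.
print(is_same_event("Fed cuts rates", "Fed cuts rates again"))  # True
```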