# news-mcp release notes

## v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters

### Highlights

- **Topic-independent cluster IDs**: `_stable_cluster_id` no longer includes the `topic` in the hash. The ID is now `sha1(min_article_key)` instead of `sha1(topic|min_article_key)`. This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass `ON CONFLICT DO UPDATE`, and silently create a duplicate row in the DB.
- **Cross-cycle merge bucket fix**: when seeding `existing_clusters` from the DB, the cluster's topic is now re-derived via the same heuristic (`normalize_topic_from_title`) that new articles use. Previously, existing clusters were bucketed by their *enriched* topic (from the DB), so a new article with a different heuristic topic would land in a different `by_topic` bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.

### Root cause

Two cooperating bugs allowed the same article to accumulate duplicate DB rows:

1. **cluster_id included topic** → same article with different topic → different PK → `ON CONFLICT` never fires.
2. **existing clusters bucketed by enriched topic** → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.

Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.

### Migration notes

- Existing cluster IDs will change format on the next polling cycle. Old rows with the previous ID format become stale (the new code writes with the new ID via `ON CONFLICT`). They will age out via pruning. To clean them immediately, run a one-time dedup pass or wipe and let the next refresh rebuild.
- No database schema changes.

## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals

- **Emerging topics rewrite** (`detect_emerging_topics`): complete rewrite with the following new capabilities:
  - **Timeframe parameter** (`"4h"`, `"24h"`, `"3d"`, etc.) — controls the lookback window instead of always using `DEFAULT_LOOKBACK_HOURS`
  - **Velocity scoring** — splits the window into a recent and a prior half and computes `velocity = (recent + 0.5) / (prior + 0.5)`. Entities that are accelerating score much higher than steady-state ones
  - **Composed trend score** — replaces the flat `0.25 + 0.40*imp + 0.08*count` formula with a weighted combination of: velocity (35%), recency concentration (25%), source diversity (15%), sustained presence across time buckets (10%), importance (15%)
  - **Topic scoping** — optional `topic` parameter filters to a specific category before scoring
  - **Entity neighborhood scoping** — optional `around` parameter only returns entities co-occurring with the specified entity (e.g. `around="Bitcoin"` finds what's emerging in Bitcoin's neighborhood)
  - **Richer output** — each result now includes `velocity`, `recent_count`, `prior_count`, `source_count` alongside `trend_score` and `related_entities`
- **Multi-article signal comparison**: `_signals()` now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and Jaccard scores across all cluster members are used for matching.
- **Stable cluster IDs**: `cluster_id = sha1(topic | min_article_key)` instead of `sha1(topic | seed_title)`. The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
- **Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
- **Orphan merge**: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
- **Cascade match via `_is_match()`**: unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable `title_threshold` parameter.
- **Cluster embedding updated on merge**: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
- **`NEWS_CLUSTER_MAX_AGE_HOURS`** env var (default `4`): controls the cross-cycle merge window. Set to `0` to disable cross-cycle merge.

### Migration notes

- No database schema changes.
- Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via `ON CONFLICT(cluster_id)` once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
- Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.
- `detect_emerging_topics` output shape changed: `count` replaced by `recent_count` + `prior_count`, new fields `velocity` and `source_count`. Clients using the old `count` field need to switch to `recent_count`.

## v0.3.0 — concurrent polling, enrichment retry, all-topics default

### Highlights

- **Async concurrent RSS fetching** — all feeds fetched in parallel with `asyncio.gather` + `httpx`, bounded by semaphore (default 10 concurrent). Previously sequential: ~40 feeds × 2-5s each = minutes. Now ~10 at a time.
- **Concurrent Ollama embeddings** — embedding vectors for all articles pre-computed in parallel before the clustering loop (bounded by semaphore, default 4). Previously one-by-one during clustering.
- **Concurrent LLM enrichment** — entity extraction / topic classification / sentiment calls run concurrently across all clusters, bounded by per-provider semaphore:
  - `openrouter`: 2 (free tier)
  - `openai`: 5
  - `groq`: 8
  - Override via `NEWS_LLM_CONCURRENCY_` env var
- **Per-cluster retry with backoff** — failed LLM calls retry up to 3 times (2s, 4s, 8s backoff) before marking the cluster as failed. Failed clusters are automatically retried on the next polling cycle.
- **Cross-cycle failure recovery** — `get_failed_enrichment_clusters()` queries the DB for clusters with `enrichment_failed_at` set but below the retry threshold, so transient failures self-heal.
- **LLM provider retries** — `_call_groq` and `_call_openai` now have the same retry logic as `_call_openrouter` (2 retries, exponential backoff on 429/500/502/503, empty response handling).
- **`get_latest_events()` default changed** — omitting `topic` now returns clusters from **all** topics instead of defaulting to `"crypto"`. Pass `topic="crypto"` (or macro/regulation/ai/other) to filter.
- **Configuration** — all concurrency limits configurable via env vars; see `config.py` for `NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_`.

### Migration notes

- No database schema changes.
- If you relied on `get_latest_events()` without a topic argument returning only crypto clusters, pass `topic="crypto"` explicitly.
- Concurrency defaults are conservative to suit free-tier rate limits. Raise them via env vars if you have paid plans.
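
The bounded concurrency and per-cluster retry described in the v0.3.0 highlights can be sketched roughly as follows. This is an illustration only, not the project's code: `call_with_retry`, `enrich_all`, `enrich_one`, and the injectable `sleep` parameter are hypothetical names, and the sketch assumes "retry up to 3 times (2s, 4s, 8s backoff)" means one initial attempt followed by three retries.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

SleepFn = Callable[[float], Awaitable[None]]


async def call_with_retry(
    fn: Callable[[], Awaitable[str]],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: SleepFn = asyncio.sleep,
) -> Optional[str]:
    """Run fn once, then retry up to `retries` times with doubling delays.

    With the defaults the waits are 2s, 4s, 8s. Returns None when every
    attempt fails, so the caller can mark the cluster as failed.
    """
    for attempt in range(retries + 1):
        try:
            return await fn()
        except Exception:
            if attempt == retries:
                return None
            await sleep(base_delay * (2 ** attempt))
    return None


async def enrich_all(
    cluster_ids: Iterable[str],
    enrich_one: Callable[[str], Awaitable[str]],
    limit: int = 2,
    sleep: SleepFn = asyncio.sleep,
) -> Dict[str, Optional[str]]:
    """Enrich all clusters concurrently, at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(cid: str):
        # The semaphore caps concurrency; the retry loop runs inside it.
        async with sem:
            result = await call_with_retry(lambda: enrich_one(cid), sleep=sleep)
            return cid, result

    pairs = await asyncio.gather(*(guarded(cid) for cid in cluster_ids))
    return dict(pairs)
```

Holding the semaphore across the backoff sleeps, as this sketch does, keeps the total request rate low at the cost of throughput. Releasing it between attempts is an equally plausible design, and the release notes do not say which one the project uses.
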
## v0.2.0 — embedding-aware clustering and richer agent tools

### Highlights

- Optional Ollama embedding path for clustering (`NEWS_EMBEDDINGS_ENABLED=true`)
- Configurable Ollama base URL and embedding model
- Tunable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
- New agent tool: `get_related_entities(subject, timeframe, limit)`
- Optional article payloads for `get_latest_events`, `get_events_for_entity`, and `get_event_summary`
- Improved emerging-topic scoring with co-occurrence and importance weighting
- Blacklist enforcement back-clean script for stored clusters
- Embedding backfill script for older clusters
- Embedding similarity analysis script for threshold tuning
- Embedding-based merge script with dry-run and wet modes
- Article dedup cleanup for repeated article variants inside clusters

### Notes

- Ollama embeddings are tried first when enabled; heuristic clustering remains the fallback.
- The merge script is intentionally destructive and should be preceded by a dry run.
- The article dedup cleanup script is safe to run after ingestion or on the historical dataset.
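
As a rough illustration of the "embeddings first, heuristic fallback" note above, the sketch below compares two articles by cosine similarity when both embeddings are available and otherwise falls back to a token-level Jaccard score on the titles. Everything here is an assumption made for illustration: the function names, the `0.8` and `0.5` thresholds, and the choice of Jaccard as the heuristic are hypothetical and not taken from the project's code.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(title_a: str, title_b: str) -> float:
    """Share of lowercase whitespace-separated tokens the titles have in common."""
    tokens_a = set(title_a.lower().split())
    tokens_b = set(title_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_same_event(
    title_a: str,
    title_b: str,
    emb_a: Optional[Sequence[float]] = None,
    emb_b: Optional[Sequence[float]] = None,
    embedding_threshold: float = 0.8,  # hypothetical value
    jaccard_threshold: float = 0.5,    # hypothetical value
) -> bool:
    """Use embeddings when both are present, else the title heuristic."""
    if emb_a is not None and emb_b is not None:
        return cosine_similarity(emb_a, emb_b) >= embedding_threshold
    return jaccard(title_a, title_b) >= jaccard_threshold
```

In this sketch the embedding verdict is final whenever vectors exist. Whether the real clustering path also consults the heuristic after an embedding miss is not stated in these notes.
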