# news-mcp release notes

## v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters

### Highlights
- **Topic-independent cluster IDs**: `_stable_cluster_id` no longer includes the `topic` in the hash. The ID is now `sha1(min_article_key)` instead of `sha1(topic|min_article_key)`, so the same article always maps to the same `cluster_id` regardless of which topic the heuristic classifier or the LLM assigns. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle got a different `cluster_id`, bypassed `ON CONFLICT DO UPDATE`, and silently created a duplicate row in the DB.

- **Cross-cycle merge bucket fix**: when seeding `existing_clusters` from the DB, the cluster's topic is now re-derived via the same heuristic (`normalize_topic_from_title`) that new articles use. Previously, existing clusters were bucketed by their *enriched* topic (from the DB), so a new article with a different heuristic topic would land in a different `by_topic` bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
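
The bucket fix can be illustrated with a minimal sketch. `heuristic_topic` below is a hypothetical stand-in for `normalize_topic_from_title` (its keyword rules are invented for illustration), and the cluster dict fields `title` and `topic` are assumed names, not the project's actual schema. The point is only that existing clusters and new articles are bucketed by the same function, so the enriched topic stored in the DB no longer decides the bucket.

```python
from collections import defaultdict


def heuristic_topic(title: str) -> str:
    """Hypothetical stand-in for normalize_topic_from_title (rules are invented)."""
    lowered = title.lower()
    if "bitcoin" in lowered or "crypto" in lowered:
        return "crypto"
    if "fed" in lowered or "inflation" in lowered:
        return "macro"
    return "other"


def seed_buckets(existing_clusters: list[dict]) -> dict[str, list[dict]]:
    """Bucket stored clusters by the heuristic topic, ignoring the enriched one."""
    by_topic = defaultdict(list)
    for cluster in existing_clusters:
        # cluster["topic"] holds the enriched (LLM) topic; it is deliberately unused here.
        by_topic[heuristic_topic(cluster["title"])].append(cluster)
    return dict(by_topic)


# A stored cluster whose enriched topic drifted from "macro" to "crypto".
stored = [{"cluster_id": "abc", "title": "Fed signals rate cut", "topic": "crypto"}]
buckets = seed_buckets(stored)

# A follow-up article in the next cycle is bucketed with the same heuristic,
# so it lands in the bucket that contains the existing cluster.
new_article_topic = heuristic_topic("Fed signals rate cut, markets rally")
candidates = buckets.get(new_article_topic, [])
```

With the pre-fix behaviour (bucketing by `cluster["topic"]`), `candidates` would have been empty for this article, and the cross-cycle matching loop would never have compared it against the stored cluster.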

### Root cause
Two cooperating bugs allowed the same article to accumulate duplicate DB rows:

1. **cluster_id included topic** → same article with different topic → different PK → `ON CONFLICT` never fires.
2. **existing clusters bucketed by enriched topic** → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.

Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.
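
A minimal sketch of the ID change, assuming article keys are strings, UTF-8 encoding, and a hex digest as the stored ID (none of these details are stated in these notes). The function names `old_cluster_id` and `new_cluster_id` are illustrative and do not mirror the real `_stable_cluster_id` signature.

```python
import hashlib


def old_cluster_id(topic: str, article_keys: list[str]) -> str:
    """Pre-0.3.2 scheme: sha1(topic|min_article_key)."""
    payload = f"{topic}|{min(article_keys)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def new_cluster_id(article_keys: list[str]) -> str:
    """0.3.2 scheme: sha1(min_article_key), independent of topic."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


keys = ["feed-b:1002", "feed-a:1001"]

# Topic drift changes the old ID, so the upsert targets a different primary key.
drifted = old_cluster_id("macro", keys) != old_cluster_id("crypto", keys)

# The new ID depends only on the smallest article key, not on order or topic.
stable = new_cluster_id(keys) == new_cluster_id(list(reversed(keys)))
```

Because the new ID is derived from the article keys alone, a reclassified cluster hits the same primary key on the next cycle and `ON CONFLICT DO UPDATE` can fire.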

### Migration notes
- Existing cluster IDs change format on the next polling cycle. Rows stored under the previous ID format become stale (the new code writes rows under the new ID, so `ON CONFLICT` no longer matches them). They will age out via pruning. To clean them up immediately, run a one-time dedup pass, or wipe and let the next refresh rebuild.
- No database schema changes.

## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals

### Highlights
- **Emerging topics rewrite** (`detect_emerging_topics`): complete rewrite with the following changes:
  - **Timeframe parameter** (`"4h"`, `"24h"`, `"3d"`, etc.) — controls lookback window instead of always using `DEFAULT_LOOKBACK_HOURS`
  - **Velocity scoring** — splits the window into recent vs prior half, computes `velocity = (recent + 0.5) / (prior + 0.5)`. Entities accelerating now vs before score much higher than steady-state ones
  - **Composed trend score** — replaces the flat `0.25 + 0.40*imp + 0.08*count` formula with a weighted combination of: velocity (35%), recency concentration (25%), source diversity (15%), sustained presence across time buckets (10%), importance (15%)
  - **Topic scoping** — optional `topic` parameter filters to a specific category before scoring
  - **Entity neighborhood scoping** — optional `around` parameter only returns entities co-occurring with the specified entity (e.g. `around="Bitcoin"` finds what's emerging in Bitcoin's neighborhood)
  - **Richer output** — each result now includes `velocity`, `recent_count`, `prior_count`, `source_count` alongside `trend_score` and `related_entities`
- **Multi-article signal comparison**: `_signals()` now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and jaccard scores across all cluster members are used for matching.
- **Stable cluster IDs**: `cluster_id = sha1(topic | min_article_key)` instead of `sha1(topic | seed_title)`. The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
- **Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
- **Orphan merge**: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
- **Cascade match via `_is_match()`**: unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable `title_threshold` parameter.
- **Cluster embedding updated on merge**: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
- **`NEWS_CLUSTER_MAX_AGE_HOURS`** env var (default `4`): controls the cross-cycle merge window. Set to `0` to disable cross-cycle merge.
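
The orphan merge pass can be sketched with a small Union-Find. This is an illustrative version only: the input shape (a mapping of cluster ID to a set of article keys) and the function name `merge_orphans` are assumptions, and the real pass may carry more state per cluster.

```python
def merge_orphans(clusters: dict[str, set[str]]) -> list[set[str]]:
    """Merge clusters that share at least one article key (Union-Find sketch)."""
    parent = {cluster_id: cluster_id for cluster_id in clusters}

    def find(node: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first cluster seen for each article key; any later cluster
    # containing the same key is unioned with it.
    first_owner: dict[str, str] = {}
    for cluster_id, article_keys in clusters.items():
        for key in article_keys:
            if key in first_owner:
                union(first_owner[key], cluster_id)
            else:
                first_owner[key] = cluster_id

    # Collect the article keys of every cluster under its root.
    merged: dict[str, set[str]] = {}
    for cluster_id, article_keys in clusters.items():
        merged.setdefault(find(cluster_id), set()).update(article_keys)
    return list(merged.values())


groups = merge_orphans({"c1": {"a", "b"}, "c2": {"b", "c"}, "c3": {"d"}})
```

Here `c1` and `c2` share the key `b` and collapse into one group, while `c3` stays on its own.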

### Migration notes
- No database schema changes.
- Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via `ON CONFLICT(cluster_id)` once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
- Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.
- `detect_emerging_topics` output shape changed: `count` replaced by `recent_count` + `prior_count`, new fields `velocity` and `source_count`. Clients using the old `count` field need to switch to `recent_count`.
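
For clients adapting to the new fields, the sketch below reproduces the velocity formula from the highlights and shows one way the weighted trend score could be combined. The velocity formula and the weights come from these notes; everything else is an assumption: the component names, the requirement that each component is already normalized to the range 0 to 1, and the `v / (1 + v)` squashing in `normalized_velocity` are hypothetical and may differ from the actual implementation.

```python
def velocity(recent_count: int, prior_count: int) -> float:
    """velocity = (recent + 0.5) / (prior + 0.5), as stated in the release notes."""
    return (recent_count + 0.5) / (prior_count + 0.5)


def normalized_velocity(recent_count: int, prior_count: int) -> float:
    """Hypothetical squashing of velocity into (0, 1); not confirmed by the source."""
    v = velocity(recent_count, prior_count)
    return v / (1.0 + v)


# Weights from the release notes; the dictionary keys are illustrative names.
WEIGHTS = {
    "velocity": 0.35,
    "recency_concentration": 0.25,
    "source_diversity": 0.15,
    "sustained_presence": 0.10,
    "importance": 0.15,
}


def trend_score(components: dict[str, float]) -> float:
    """Weighted sum of components, each assumed to be pre-normalized to [0, 1]."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


score = trend_score({
    "velocity": normalized_velocity(recent_count=4, prior_count=0),
    "recency_concentration": 0.8,
    "source_diversity": 0.5,
    "sustained_presence": 0.25,
    "importance": 0.6,
})
```

The `+ 0.5` smoothing keeps the ratio finite when `prior_count` is zero: an entity with 4 recent and 0 prior mentions gets a velocity of 9.0, while one with 2 and 2 gets 1.0.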

## v0.3.0 — concurrent polling, enrichment retry, all-topics default

### Highlights
- **Async concurrent RSS fetching** — all feeds fetched in parallel with `asyncio.gather` + `httpx`, bounded by semaphore (default 10 concurrent). Previously sequential: ~40 feeds × 2-5s each = minutes. Now ~10 at a time.
- **Concurrent Ollama embeddings** — embedding vectors for all articles pre-computed in parallel before the clustering loop (bounded by semaphore, default 4). Previously one-by-one during clustering.
- **Concurrent LLM enrichment** — entity extraction / topic classification / sentiment calls run concurrently across all clusters, bounded by per-provider semaphore:
  - `openrouter`: 2 (free tier)
  - `openai`: 5
  - `groq`: 8
  - Override via `NEWS_LLM_CONCURRENCY_<PROVIDER>` env var
- **Per-cluster retry with backoff** — failed LLM calls retry up to 3 times (2s, 4s, 8s backoff) before marking the cluster as failed. Failed clusters are automatically retried on the next polling cycle.
- **Cross-cycle failure recovery** — `get_failed_enrichment_clusters()` queries the DB for clusters with `enrichment_failed_at` set but below the retry threshold, so transient failures self-heal.
- **LLM provider retries** — `_call_groq` and `_call_openai` now have the same retry logic as `_call_openrouter` (2 retries, exponential backoff on 429/500/502/503, empty response handling).
- **`get_latest_events()` default changed** — omitting `topic` now returns clusters from **all** topics instead of defaulting to `"crypto"`. Pass `topic="crypto"` (or macro/regulation/ai/other) to filter.
- **Configuration** — all concurrency limits configurable via env vars; see `config.py` for `NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_<PROVIDER>`.
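
The per-cluster retry behaviour can be sketched as follows. This is not the project's code: `call_with_retry` and its parameters are hypothetical names, and the sketch assumes that "retry up to 3 times" means one initial attempt followed by up to three retries with waits of 2s, 4s, and 8s. The `sleep` parameter is injectable so the schedule can be exercised without real delays.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_with_retry(
    call: Callable[[], Awaitable[Any]],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Await `call`, retrying on any exception with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                # Out of retries: let the caller mark the cluster as failed.
                raise
            # Waits 2s, 4s, 8s for attempts 0, 1, 2 with the default base_delay.
            await sleep(base_delay * 2 ** attempt)
```

A caller would catch the final exception and record the failure (for example by setting `enrichment_failed_at`), leaving the cluster eligible for the next polling cycle.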

### Migration notes
- No database schema changes.
- If you relied on `get_latest_events()` without a topic argument returning only crypto clusters, pass `topic="crypto"` explicitly.
- Concurrency defaults are conservative, sized for free-tier rate limits. Raise them via env vars if you are on a paid plan.
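
As a rough sketch of how such a limit could be resolved and applied: the defaults below are the ones listed in the highlights, while the helper names `llm_concurrency` and `run_bounded`, the upper-casing of the provider name in the env var, and the fallback of 1 for unknown providers are assumptions made for illustration.

```python
import asyncio
import os
from typing import Any, Awaitable, Callable, Iterable, Mapping

# Per-provider defaults as listed in the v0.3.0 highlights.
DEFAULT_LLM_CONCURRENCY = {"openrouter": 2, "openai": 5, "groq": 8}


def llm_concurrency(provider: str, env: Mapping[str, str] = os.environ) -> int:
    """Resolve the limit, preferring NEWS_LLM_CONCURRENCY_<PROVIDER> when set."""
    override = env.get(f"NEWS_LLM_CONCURRENCY_{provider.upper()}")
    if override is not None:
        return max(1, int(override))
    # Falling back to 1 for unknown providers is an assumption of this sketch.
    return DEFAULT_LLM_CONCURRENCY.get(provider, 1)


async def run_bounded(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int,
) -> list[Any]:
    """Run `worker` over all items concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: Any) -> Any:
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

For example, `await run_bounded(clusters, enrich, llm_concurrency("groq"))` would process all clusters with at most 8 enrichment calls in flight, where `clusters` and `enrich` are placeholders for the caller's own data and coroutine.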

## v0.2.0 — embedding-aware clustering and richer agent tools

### Highlights
- Optional Ollama embedding path for clustering (`NEWS_EMBEDDINGS_ENABLED=true`)
- Configurable Ollama base URL and embedding model
- Tunable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
- New agent tool: `get_related_entities(subject, timeframe, limit)`
- Optional article payloads for `get_latest_events`, `get_events_for_entity`, and `get_event_summary`
- Improved emerging-topic scoring with co-occurrence and importance weighting
- Blacklist enforcement back-clean script for stored clusters
- Embedding backfill script for older clusters
- Embedding similarity analysis script for threshold tuning
- Embedding-based merge script with dry-run and wet modes
- Article dedup cleanup for repeated article variants inside clusters

### Notes
- Ollama embeddings are tried first when enabled; heuristic clustering remains the fallback.
- The merge script is intentionally destructive and should be preceded by a dry run.
- The article dedup cleanup script is safe to run after ingestion or on the historical dataset.
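
The "embeddings first, heuristic fallback" behaviour can be sketched as below. This is a simplified illustration, not the project's matching code: the function names, the token-level Jaccard heuristic, and both threshold values are assumptions, and the sketch falls back to the heuristic only when a vector is missing.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def title_jaccard(title_a: str, title_b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens (illustrative heuristic)."""
    tokens_a, tokens_b = set(title_a.lower().split()), set(title_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def is_same_event(
    title_a: str,
    title_b: str,
    embedding_a: Optional[Sequence[float]] = None,
    embedding_b: Optional[Sequence[float]] = None,
    embedding_threshold: float = 0.8,  # hypothetical value for NEWS_EMBEDDING_SIMILARITY_THRESHOLD
    jaccard_threshold: float = 0.5,    # hypothetical heuristic threshold
) -> bool:
    """Use embeddings when both vectors exist; otherwise fall back to the heuristic."""
    if embedding_a is not None and embedding_b is not None:
        return cosine_similarity(embedding_a, embedding_b) >= embedding_threshold
    return title_jaccard(title_a, title_b) >= jaccard_threshold
```

Passing `None` for either embedding models the case where Ollama is disabled or temporarily unavailable, in which case only the title heuristic decides the match.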