Multi-article signal comparison: _signals() now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and jaccard scores across all cluster members are used for matching.
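A rough sketch of the "best score across all members" idea follows. It is not the actual _signals() code: the whitespace tokenizer, the difflib-based title similarity, and the helper names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_signals(new_title: str, cluster_titles: list[str]) -> dict[str, float]:
    """Compare a new title against every cluster member and keep the best scores."""
    new_lower = new_title.lower()
    new_tokens = set(new_lower.split())
    best_title = 0.0
    best_jaccard = 0.0
    for title in cluster_titles:
        member = title.lower()
        # Assumed title signal: character-level similarity ratio from difflib.
        best_title = max(best_title, SequenceMatcher(None, new_lower, member).ratio())
        # Assumed Jaccard signal: overlap of whitespace-separated tokens.
        best_jaccard = max(best_jaccard, jaccard(new_tokens, set(member.split())))
    return {"title": best_title, "jaccard": best_jaccard}
```

Comparing against every member means an article that resembles a later addition to the cluster can still match, even if it looks nothing like the seed.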
Stable cluster IDs: cluster_id = sha1(topic | min_article_key) instead of sha1(topic | seed_title). The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
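The ID derivation can be sketched as below. The literal "|" separator, UTF-8 encoding, and plain string ordering of article keys are assumptions; the function name is hypothetical.

```python
import hashlib


def stable_cluster_id(topic: str, article_keys: list[str]) -> str:
    """Derive an order-independent ID from the topic and the smallest article key."""
    if not article_keys:
        raise ValueError("a cluster needs at least one article key")
    # Assumption: keys compare as plain strings and "|" is a literal separator.
    payload = f"{topic}|{min(article_keys)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

Because min() ignores ordering, the ID depends only on which articles are in the cluster, not on which one happened to be processed first.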
Cross-cycle merge: the poller loads recent clusters from the DB (controlled by NEWS_CLUSTER_MAX_AGE_HOURS, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
Orphan merge: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
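A minimal Union-Find sketch of such a pass is shown below. The input shape (cluster ID mapped to a set of article keys) and the function name are assumptions, not the poller's actual data model.

```python
def merge_orphans(clusters: dict[str, set[str]]) -> list[set[str]]:
    """Group cluster IDs that share at least one article key."""
    parent = {cid: cid for cid in clusters}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # deterministic root choice

    owner: dict[str, str] = {}  # article key -> first cluster seen holding it
    for cid, keys in clusters.items():
        for key in keys:
            if key in owner:
                union(cid, owner[key])
            else:
                owner[key] = cid

    groups: dict[str, set[str]] = {}
    for cid in clusters:
        groups.setdefault(find(cid), set()).add(cid)
    return list(groups.values())
```

Each returned group with more than one ID is a set of clusters that would be merged into one. Transitive overlaps (A shares a key with B, B with C) end up in the same group.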
Cascade match via _is_match(): unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable title_threshold parameter.
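The cascade order can be illustrated with the sketch below. Only the signal order, the short-circuit behavior, and the title_threshold parameter come from the notes above; every default value, the other parameter names, and in particular the consensus rule are assumptions.

```python
from typing import Optional


def is_match(
    cosine: Optional[float],
    title: float,
    jaccard: float,
    title_threshold: float = 0.85,    # assumed default
    cosine_threshold: float = 0.80,   # assumed
    jaccard_threshold: float = 0.50,  # assumed
    consensus_margin: float = 0.10,   # assumed
) -> Optional[str]:
    """Return the name of the first passing signal, or None if nothing matches."""
    # cosine is None when embeddings are unavailable for either side.
    if cosine is not None and cosine >= cosine_threshold:
        return "cosine"
    if title >= title_threshold:
        return "title"
    if jaccard >= jaccard_threshold:
        return "jaccard"
    # Hypothetical consensus rule: no single signal passes, but title and
    # Jaccard are both within consensus_margin of their thresholds.
    if (title >= title_threshold - consensus_margin
            and jaccard >= jaccard_threshold - consensus_margin):
        return "consensus"
    return None
```

Returning the signal name rather than a bare boolean is a design choice for this sketch; it makes it easy to log which signal caused a merge.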
Cluster embedding updated on merge: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
NEWS_CLUSTER_MAX_AGE_HOURS env var (default 4): controls the cross-cycle merge window. Set to 0 to disable cross-cycle merge.
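One way the merge window could be derived from the env var is sketched here. The function name, float parsing, and treating negative values as disabled are assumptions; only the variable name, the default of 4, and the meaning of 0 come from the notes above.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def cross_cycle_cutoff(now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the oldest cluster timestamp eligible for merging, or None if disabled."""
    hours = float(os.environ.get("NEWS_CLUSTER_MAX_AGE_HOURS", "4"))
    if hours <= 0:
        return None  # 0 disables cross-cycle merge (negatives: assumed same)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours)
```

A poller would then load only clusters updated at or after the cutoff and skip the DB lookup entirely when the result is None.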
Migration notes
No database schema changes.
Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via ON CONFLICT(cluster_id) once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.
Async concurrent RSS fetching: all feeds are fetched in parallel with asyncio.gather + httpx, bounded by a semaphore (default 10 concurrent). Previously feeds were fetched sequentially, so ~40 feeds × 2-5s each took roughly 80-200s per poll. Now up to 10 requests are in flight at a time.
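The semaphore-bounded gather pattern looks roughly like this. To keep the sketch runnable without network access, the HTTP call is passed in as a hypothetical fetch_one coroutine; in the poller that role is played by an httpx request. Swallowing per-feed errors is also an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_all(
    urls: list[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> dict[str, Optional[str]]:
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> Optional[str]:
        async with semaphore:
            try:
                return await fetch_one(url)
            except Exception:
                return None  # assumed: one failing feed must not abort the rest

    # gather preserves input order, so results line up with urls.
    bodies = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(zip(urls, bodies))
```

The same pattern applies to any bounded fan-out: all tasks are created up front, and the semaphore decides how many are actually doing work at once.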
Concurrent Ollama embeddings — embedding vectors for all articles pre-computed in parallel before the clustering loop (bounded by semaphore, default 4). Previously one-by-one during clustering.
Concurrent LLM enrichment — entity extraction / topic classification / sentiment calls run concurrently across all clusters, bounded by per-provider semaphore:
openrouter: 2 (free tier)
openai: 5
groq: 8
Override via NEWS_LLM_CONCURRENCY_<PROVIDER> env var
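Resolving a provider's limit might look like the sketch below. The defaults are the ones listed above; the function name, the clamp to a minimum of 1, and the fallback for unknown providers are assumptions. See config.py for the real logic.

```python
import os

# Defaults taken from the provider list above.
DEFAULT_LLM_CONCURRENCY = {"openrouter": 2, "openai": 5, "groq": 8}


def llm_concurrency(provider: str) -> int:
    """Resolve the concurrency limit for a provider, honoring an env override."""
    name = provider.lower()
    override = os.environ.get(f"NEWS_LLM_CONCURRENCY_{name.upper()}")
    if override is not None:
        return max(1, int(override))  # assumption: never drop below 1
    return DEFAULT_LLM_CONCURRENCY.get(name, 1)  # assumption: unknown providers get 1
```

For example, exporting NEWS_LLM_CONCURRENCY_GROQ=16 before starting the poller would raise the groq limit from 8 to 16 under this sketch.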
Per-cluster retry with backoff — failed LLM calls retry up to 3 times (2s, 4s, 8s backoff) before marking the cluster as failed. Failed clusters are automatically retried on the next polling cycle.
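The retry schedule can be sketched as follows. This reads "up to 3 retries" as one initial attempt plus three retries with 2s, 4s, and 8s waits; that reading, the function name, and the injectable sleep parameter (used to make the sketch testable) are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[T]:
    """Run call; on failure retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == retries:
                break  # retries exhausted
            await sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s with the defaults
    return None  # caller marks the cluster as failed for the next cycle
```

Returning None instead of raising lets the caller record the failure and move on to the next cluster without wrapping every call in its own try block.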
Cross-cycle failure recovery — get_failed_enrichment_clusters() queries the DB for clusters with enrichment_failed_at set but below the retry threshold, so transient failures self-heal.
LLM provider retries — _call_groq and _call_openai now have the same retry logic as _call_openrouter (2 retries, exponential backoff on 429/500/502/503, empty response handling).
get_latest_events() default changed — omitting topic now returns clusters from all topics instead of defaulting to "crypto". Pass topic="crypto" (or macro/regulation/ai/other) to filter.
Configuration — all concurrency limits configurable via env vars; see config.py for NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>.
Migration notes
No database schema changes.
If you relied on get_latest_events() without a topic argument returning only crypto clusters, pass topic="crypto" explicitly.
Concurrency defaults are conservative, sized for free-tier rate limits. Raise them via env vars if you are on a paid plan.
v0.2.0 — embedding-aware clustering and richer agent tools
Highlights
Optional Ollama embedding path for clustering (NEWS_EMBEDDINGS_ENABLED=true)