Topic-independent cluster IDs: _stable_cluster_id no longer includes the topic in the hash. The ID is now sha1(min_article_key) instead of sha1(topic|min_article_key). This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass ON CONFLICT DO UPDATE, and silently create a duplicate row in the DB.
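As a minimal sketch of the change described above, the snippet below contrasts the two hashing schemes. The function names, the key format, and the literal "|" separator are assumptions for illustration; the changelog does not show the real signature of _stable_cluster_id.

```python
import hashlib


def stable_cluster_id(article_keys):
    """Return a cluster ID derived only from the smallest article key.

    Hypothetical stand-in for _stable_cluster_id; the real signature
    and key format are not shown in this changelog.
    """
    if not article_keys:
        raise ValueError("a cluster needs at least one article key")
    min_key = min(article_keys)
    return hashlib.sha1(min_key.encode("utf-8")).hexdigest()


def legacy_cluster_id(topic, article_keys):
    """Previous scheme (assumed layout): the topic is part of the hash input."""
    min_key = min(article_keys)
    return hashlib.sha1(f"{topic}|{min_key}".encode("utf-8")).hexdigest()


keys = ["reuters:abc123", "coindesk:xyz789"]  # made-up article keys

# New scheme: independent of topic and of the order in which keys arrive.
assert stable_cluster_id(keys) == stable_cluster_id(list(reversed(keys)))

# Old scheme: a topic reclassification changes the ID for the same articles.
assert legacy_cluster_id("macro", keys) != legacy_cluster_id("crypto", keys)
```

Because the topic no longer enters the hash, a reclassification from "macro" to "crypto" leaves the ID unchanged, which is what lets the upsert find the existing row.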
Cross-cycle merge bucket fix: when seeding existing_clusters from the DB, the cluster's topic is now re-derived via the same heuristic (normalize_topic_from_title) that new articles use. Previously, existing clusters were bucketed by their enriched topic (from the DB), so a new article with a different heuristic topic would land in a different by_topic bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
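The bucketing fix can be illustrated roughly as follows. This is a sketch under stated assumptions: the keyword heuristic is a toy stand-in for normalize_topic_from_title, and the dict shapes for clusters and articles (including a "title" field on clusters) are hypothetical.

```python
from collections import defaultdict


def normalize_topic_from_title(title):
    """Toy keyword heuristic; a hypothetical stand-in for the real function."""
    lowered = title.lower()
    if "bitcoin" in lowered or "crypto" in lowered:
        return "crypto"
    if "fed" in lowered or "inflation" in lowered:
        return "macro"
    return "other"


def bucket_by_heuristic_topic(existing_clusters, new_articles):
    """Group DB clusters and new articles under the same heuristic topic."""
    by_topic = defaultdict(lambda: {"clusters": [], "articles": []})
    for cluster in existing_clusters:
        # Ignore the enriched topic stored in the DB; re-derive it from the title.
        topic = normalize_topic_from_title(cluster["title"])
        by_topic[topic]["clusters"].append(cluster)
    for article in new_articles:
        topic = normalize_topic_from_title(article["title"])
        by_topic[topic]["articles"].append(article)
    return by_topic


existing = [{"cluster_id": "c1", "title": "Fed holds rates steady", "topic": "crypto"}]
incoming = [{"key": "a2", "title": "Fed holds rates, markets rally"}]

buckets = bucket_by_heuristic_topic(existing, incoming)

# Both land in "macro" even though the stored (enriched) topic is "crypto",
# so the new article is compared against the existing cluster.
assert buckets["macro"]["clusters"][0]["cluster_id"] == "c1"
assert buckets["macro"]["articles"][0]["key"] == "a2"
```

Had the existing cluster been bucketed by its stored topic, it would sit under "crypto" and the incoming article under "macro", and the two would never be compared.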
Two cooperating bugs allowed the same article to accumulate duplicate DB rows:
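- Topic-dependent cluster IDs: when the LLM reclassifies a cluster's topic, the same article hashes to a different cluster_id on the next polling cycle, so ON CONFLICT never fires.
- Topic-bucketed merge candidates: existing clusters were bucketed by their enriched topic, so a new article with a different heuristic topic was never compared against them.
Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.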
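To see why a drifting ID produces duplicate rows, here is a small self-contained sketch using Python's built-in sqlite3 module. The table layout and column names are hypothetical, and the project's actual database engine is not stated in this changelog; SQLite is used only because its upsert syntax (available since SQLite 3.24) matches the ON CONFLICT ... DO UPDATE wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, topic TEXT, headline TEXT)"
)

UPSERT = """
INSERT INTO clusters (cluster_id, topic, headline)
VALUES (?, ?, ?)
ON CONFLICT(cluster_id) DO UPDATE SET
    topic = excluded.topic,
    headline = excluded.headline
"""

# Same cluster_id twice: the second write updates the existing row.
conn.execute(UPSERT, ("id-stable", "macro", "Fed holds rates"))
conn.execute(UPSERT, ("id-stable", "crypto", "Fed holds rates"))

# Different cluster_id for the same story: ON CONFLICT never fires,
# so a second row is silently created.
conn.execute(UPSERT, ("id-drifted", "crypto", "Fed holds rates"))

rows = conn.execute(
    "SELECT cluster_id, topic FROM clusters ORDER BY cluster_id"
).fetchall()
print(rows)
```

The first two writes collapse into one row whose topic is updated in place; the third write, carrying a different ID for the same story, is stored as a separate row.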
Duplicate rows already in the DB are not cleaned up automatically (they carry different cluster_ids, so they never collide via ON CONFLICT). They will age out via pruning. To clean them up immediately, run a one-time dedup pass or wipe and let the next refresh rebuild.

Emerging topic detection (detect_emerging_topics): complete rewrite with 5 new capabilities:
- Configurable lookback window ("4h", "24h", "3d", etc.) instead of always using DEFAULT_LOOKBACK_HOURS.
- Velocity scoring: velocity = (recent + 0.5) / (prior + 0.5). Entities that are accelerating relative to the prior window score much higher than steady-state ones.
- Composite trend score: replaces the old 0.25 + 0.40*imp + 0.08*count formula with a weighted combination of velocity (35%), recency concentration (25%), source diversity (15%), sustained presence across time buckets (10%), and importance (15%).
- Topic filter: the topic parameter filters to a specific category before scoring.
- Neighborhood filter: the around parameter returns only entities co-occurring with the specified entity (e.g. around="Bitcoin" finds what's emerging in Bitcoin's neighborhood).
Results now include velocity, recent_count, prior_count, and source_count alongside trend_score and related_entities.

- _signals() now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and jaccard scores across all cluster members are used for matching.
- Stable cluster IDs: cluster_id = sha1(topic | min_article_key) instead of sha1(topic | seed_title). The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
- Cross-cycle merge: recent clusters are loaded from the DB (NEWS_CLUSTER_MAX_AGE_HOURS, default 4h) and seeded as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
- _is_match(): unified signal evaluation in the order cosine → title → jaccard → consensus. Short-circuits on the first passing signal. Configurable title_threshold parameter.
- NEWS_CLUSTER_MAX_AGE_HOURS env var (default 4): controls the cross-cycle merge window. Set to 0 to disable cross-cycle merge.
- ... ON CONFLICT(cluster_id) once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
- detect_emerging_topics output shape changed: count is replaced by recent_count + prior_count, with new fields velocity and source_count. Clients using the old count field need to switch to recent_count.

- Concurrent RSS fetching: asyncio.gather + httpx, bounded by a semaphore (default 10 concurrent). Previously sequential: ~40 feeds × 2-5s each = minutes. Now ~10 at a time.
- Per-provider LLM concurrency limits:
  - openrouter: 2 (free tier)
  - openai: 5
  - groq: 8
  - Configured via the NEWS_LLM_CONCURRENCY_<PROVIDER> env var
- get_failed_enrichment_clusters() queries the DB for clusters with enrichment_failed_at set but below the retry threshold, so transient failures self-heal.
- _call_groq and _call_openai now have the same retry logic as _call_openrouter (2 retries, exponential backoff on 429/500/502/503, empty response handling).
- get_latest_events() default changed: omitting topic now returns clusters from all topics instead of defaulting to "crypto". Pass topic="crypto" (or macro/regulation/ai/other) to filter.
- config.py: NEWS_RSS_MAX_CONCURRENCY, NEWS_OLLAMA_MAX_CONCURRENCY, NEWS_LLM_CONCURRENCY_<PROVIDER>.
- If you relied on get_latest_events() without a topic argument returning only crypto clusters, pass topic="crypto" explicitly.

- Embeddings toggle (NEWS_EMBEDDINGS_ENABLED=true)
- Embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
- get_related_entities(subject, timeframe, limit)
- get_latest_events, get_events_for_entity, and get_event_summary
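The velocity formula and the weighting from the detect_emerging_topics rewrite can be worked through with a short sketch. The formula and the percentages come from the notes above; the function names, the component keys, and the assumption that every component is already normalized to [0, 1] before weighting are illustrative only, since the changelog does not say how the real implementation normalizes them.

```python
def velocity(recent_count, prior_count):
    """Smoothed ratio of recent to prior mentions, as given in the changelog."""
    return (recent_count + 0.5) / (prior_count + 0.5)


# Weights as listed in the changelog; they sum to 1.0.
WEIGHTS = {
    "velocity": 0.35,
    "recency": 0.25,
    "source_diversity": 0.15,
    "sustained": 0.10,
    "importance": 0.15,
}


def trend_score(components):
    """Weighted sum of components, each assumed pre-normalized to [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# A brand-new entity (3 recent, 0 prior) versus a steady one (5 recent, 5 prior).
print(velocity(3, 0))  # 7.0
print(velocity(5, 5))  # 1.0
```

The +0.5 smoothing term keeps the ratio defined when an entity has no prior mentions, and it is why a newly appearing entity scores far above one with a flat mention rate.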