1 week ago · d855ede033
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -3,13 +3,13 @@
 
				 ## Goal
			
 
				 Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.
			
 
				 
			
 
				-## Current architecture (v0.3.1)
			
 
				+## Current architecture (v0.3.2)
			
 
				 - FastMCP SSE server mounted at `/mcp`
			
 
				 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
			
 
				 - Concurrent RSS fetch (async `asyncio.gather` + `httpx`, bounded semaphore)
			
 
				 - **Multi-signal clustering**: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
			
 
				-- **Stable cluster IDs**: `sha1(topic | min_article_key)` — order-independent, consistent across polling cycles
			
 
				-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h)
			
 
				+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
			
 
				+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
			
 
				 - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
			
 
				 - Concurrent Ollama embeddings (pre-computed before clustering loop)
			
 
				 - Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
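The topic-independent ID scheme described in the hunk above can be sketched as follows. This is a minimal illustration and not the project's code: `article_key` is a hypothetical stand-in for the real `_article_key` helper, whose normalization rules are not shown in this diff, and the empty-key fallback is an assumption.

```python
import hashlib
from typing import Any, Dict, List


def article_key(article: Dict[str, Any]) -> str:
    """Hypothetical stand-in for the project's _article_key helper."""
    return (article.get("url") or article.get("title") or "").strip().lower()


def stable_cluster_id(articles: List[Dict[str, Any]]) -> str:
    """Hash only the lexicographically smallest article key (no topic)."""
    keys = sorted(k for k in (article_key(a) for a in articles) if k)
    if not keys:
        # Assumption: hash the empty string when no article has a usable key.
        return hashlib.sha1(b"").hexdigest()
    return hashlib.sha1(keys[0].encode("utf-8")).hexdigest()


first = {"url": "https://example.com/a", "title": "Story A"}
second = {"url": "https://example.com/b", "title": "Story B"}

# Arrival order does not matter: both orderings hash the same minimum key.
id_forward = stable_cluster_id([first, second])
id_reversed = stable_cluster_id([second, first])
```

One property of any minimum-key scheme is worth keeping in mind: if a later article whose key sorts below the current minimum joins the cluster, the ID changes.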
			
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -1,8 +1,25 @@
 
				 # news-mcp release notes
			
 
				 
			
 
				-## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
			
 
				+## v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters
			
 
				 
			
 
				 ### Highlights
			
 
				+- **Topic-independent cluster IDs**: `_stable_cluster_id` no longer includes the `topic` in the hash. The ID is now `sha1(min_article_key)` instead of `sha1(topic|min_article_key)`. This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass `ON CONFLICT DO UPDATE`, and silently create a duplicate row in the DB.
			
 
				+
			
 
				+- **Cross-cycle merge bucket fix**: when seeding `existing_clusters` from the DB, the cluster's topic is now re-derived via the same heuristic (`normalize_topic_from_title`) that new articles use. Previously, existing clusters were bucketed by their *enriched* topic (from the DB), so a new article with a different heuristic topic would land in a different `by_topic` bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
			
 
				+
			
 
				+### Root cause
			
 
				+Two cooperating bugs allowed the same article to accumulate duplicate DB rows:
			
 
				+
			
 
				+1. **cluster_id included topic** → same article with different topic → different PK → `ON CONFLICT` never fires.
			
 
				+2. **existing clusters bucketed by enriched topic** → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.
			
 
				+
			
 
				+Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.
			
 
				+
			
 
				+### Migration notes
			
 
				+- Existing cluster IDs will change value on the next polling cycle: they remain SHA-1 hex digests, but the hash input no longer includes the topic. Rows stored under the old IDs become stale, because the new code upserts under the new IDs and `ON CONFLICT` never matches the old ones. Stale rows age out via pruning. To remove them immediately, run a one-time dedup pass, or wipe the DB and let the next refresh rebuild it.
			
 
				+- No database schema changes.
			
 
				+
			
 
				+## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
			
 
				 - **Emerging topics rewrite** (`detect_emerging_topics`): complete rewrite with 5 new capabilities:
			
 
				   - **Timeframe parameter** (`"4h"`, `"24h"`, `"3d"`, etc.) — controls lookback window instead of always using `DEFAULT_LOOKBACK_HOURS`
			
 
				   - **Velocity scoring** — splits the window into recent vs prior half, computes `velocity = (recent + 0.5) / (prior + 0.5)`. Entities accelerating now vs before score much higher than steady-state ones
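The velocity formula quoted above can be worked through with a small sketch. The formula itself comes from the release note. The helper names (`split_counts`, `velocity`) and the representation of mentions as a list of timestamps are assumptions made for illustration, not the signature of `detect_emerging_topics`.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def split_counts(
    mentions: List[datetime], now: datetime, window: timedelta
) -> Tuple[int, int]:
    """Count mentions in the recent half and the prior half of the window."""
    start = now - window
    midpoint = now - window / 2
    recent = sum(1 for t in mentions if midpoint <= t <= now)
    prior = sum(1 for t in mentions if start <= t < midpoint)
    return recent, prior


def velocity(recent: int, prior: int, smoothing: float = 0.5) -> float:
    """velocity = (recent + 0.5) / (prior + 0.5), as in the release note."""
    return (recent + smoothing) / (prior + smoothing)


now = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)
window = timedelta(hours=4)
mentions = [
    now - timedelta(minutes=30),
    now - timedelta(hours=1),
    now - timedelta(minutes=90),
    now - timedelta(hours=3),
]

recent, prior = split_counts(mentions, now, window)  # 3 recent, 1 prior
score = velocity(recent, prior)                      # 3.5 / 1.5
```

The additive 0.5 keeps the ratio finite when the prior half is empty: an entity with 3 recent mentions and none before scores 3.5 / 0.5 = 7.0, while a steady entity with equal counts in both halves scores 1.0.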
			
--- a/news_mcp/dedup/cluster.py
+++ b/news_mcp/dedup/cluster.py
@@ -168,16 +168,18 @@ def _is_match(
 
				 # ---------------------------------------------------------------------------
			
 
				 
			
 
				 def _stable_cluster_id(topic: str, articles: List[Dict[str, Any]]) -> str:
			
 
				-    """Deterministic cluster ID derived from the topic and the sorted set of
			
 
				-    article keys.  Using the minimum key (lexicographic) as the seed ensures
			
 
				-    that no matter which article arrives first, the same set of articles always
			
 
				-    maps to the same cluster_id."""
			
 
				+    """Deterministic cluster ID derived from the sorted set of article keys.
			
 
				+
			
 
				+    The topic is intentionally excluded from the hash: the same article may be
			
 
				+    classified under different topics across cycles (heuristic vs LLM-enriched),
			
 
				+    but it must always map to the same cluster_id so that ON CONFLICT DO UPDATE
			
 
				+    in upsert_clusters correctly merges them instead of creating duplicates."""
			
 
				     keys = sorted(_article_key(a) for a in articles if _article_key(a))
			
 
				     if not keys:
			
 
				         # Degenerate fallback — single article with empty url and title
			
 
				         return hashlib.sha1(topic.encode("utf-8")).hexdigest()
			
 
				     seed = keys[0]
			
 
				-    return hashlib.sha1(f"{topic}|{seed}".encode("utf-8")).hexdigest()
			
 
				+    return hashlib.sha1(seed.encode("utf-8")).hexdigest()
			
 
				 
			
 
				 
			
 
				 # ---------------------------------------------------------------------------
			
@@ -401,12 +403,17 @@ def dedup_and_cluster_articles(
 
				 
			
 
				     by_topic: Dict[str, List[Dict[str, Any]]] = {}
			
 
				 
			
 
				-    # Seed with existing clusters (filtered by age window)
			
 
				+    # Seed with existing clusters (filtered by age window).
			
 
				+    # Re-derive the topic via the same heuristic (normalize_topic_from_title)
			
 
				+    # that new articles use, so that existing and new clusters with the same
			
 
				+    # headline land in the same by_topic bucket regardless of what LLM
			
 
				+    # enrichment previously stored on the cluster.
			
 
				     if existing_clusters:
			
 
				         for c in existing_clusters:
			
 
				             if not _cluster_is_within_age_window(c, max_age_hours=max_age_hours):
			
 
				                 continue
			
 
				-            topic = c.get("topic", "other") or "other"
			
 
				+            seed_title = c.get("headline") or ""
			
 
				+            topic = normalize_topic_from_title(seed_title) if seed_title else (c.get("topic", "other") or "other")
			
 
				             by_topic.setdefault(topic, []).append(dict(c))
			
 
				 
			
 
				     for a in articles:
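The effect of the re-bucketing change in this hunk can be sketched in isolation. `heuristic_topic` and its keyword table are hypothetical stand-ins for `normalize_topic_from_title`, whose real rules are not shown here. Only the bucketing logic (headline first, stored topic as fallback) mirrors the diff.

```python
from typing import Any, Dict, List

# Hypothetical keyword table; the project's real heuristic may differ.
KEYWORDS = {
    "crypto": ("bitcoin", "ethereum"),
    "macro": ("inflation", "interest rate"),
}


def heuristic_topic(title: str) -> str:
    """Stand-in for normalize_topic_from_title: first keyword match wins."""
    lowered = title.lower()
    for topic, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return topic
    return "other"


def bucket_key(cluster: Dict[str, Any]) -> str:
    """Bucket an existing cluster by its headline, not its stored topic."""
    headline = cluster.get("headline") or ""
    if headline:
        return heuristic_topic(headline)
    return cluster.get("topic") or "other"


def bucket(
    existing: List[Dict[str, Any]], articles: List[Dict[str, Any]]
) -> Dict[str, List[Dict[str, Any]]]:
    by_topic: Dict[str, List[Dict[str, Any]]] = {}
    for c in existing:
        by_topic.setdefault(bucket_key(c), []).append(dict(c))
    for a in articles:
        topic = heuristic_topic(a.get("title") or "")
        by_topic.setdefault(topic, []).append(dict(a))
    return by_topic


title = "Hegseth says US will keep pressure on Iran"
existing = [{"topic": "crypto", "headline": title}]  # enriched topic drifted
buckets = bucket(existing, [{"title": title}])
```

Because both sides pass through the same heuristic, the existing cluster and the new article end up in one bucket (`"other"` under this toy keyword table) even though the stored topic is `"crypto"`.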
			
--- a/test_news_mcp.py
+++ b/test_news_mcp.py
@@ -909,3 +909,83 @@ def test_preseed_merge_into_existing_cluster():
 
				     # Should have exactly 1 cluster (the existing one, now with 2 articles)
			
 
				     assert len(all_clusters) == 1, f"Expected 1 cluster, got {len(all_clusters)}: {[c['headline'] for c in all_clusters]}"
			
 
				     assert len(all_clusters[0]["articles"]) == 2
			
 
				+
			
 
				+
			
 
				+def test_cross_cycle_merge_topic_mismatch():
			
 
				+    """Regression: same article arriving in two cycles must merge even when
			
 
				+    the existing cluster's enriched topic differs from the new article's
			
 
				+    heuristic topic.  Previously the cluster_id included the topic in the
			
 
				+    hash AND existing clusters were bucketed by enriched topic, so a
			
 
				+    topic mismatch silently produced two rows in the DB."""
			
 
				+    from news_mcp.dedup import cluster as dc
			
 
				+
			
 
				+    url = (
			
 
				+        "https://breakingthenews.net/Article/"
			
 
				+        "Hegseth-says-US-will-keep-pressure-on-Iran/66401647"
			
 
				+    )
			
 
				+
			
 
				+    existing = [{
			
 
				+        "cluster_id": "old-id",
			
 
				+        # Enriched topic from a prior LLM pass — *different* from what
			
 
				+        # normalize_topic_from_title would return for the headline.
			
 
				+        "topic": "crypto",
			
 
				+        "headline": "Hegseth says US will keep pressure on Iran",
			
 
				+        "summary": "",
			
 
				+        "sources": ["Breaking The News"],
			
 
				+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
			
 
				+        "last_updated": datetime.now(timezone.utc).isoformat(),
			
 
				+        "first_seen": "Sat, 30 May 2026 13:00:00 GMT",
			
 
				+        "articles": [{
			
 
				+            "title": "Hegseth says US will keep pressure on Iran",
			
 
				+            "url": url,
			
 
				+            "source": "Breaking The News",
			
 
				+            "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
			
 
				+            "summary": "",
			
 
				+        }],
			
 
				+        "entities": ["Pete Hegseth", "Iran"],
			
 
				+        "sentiment": "negative",
			
 
				+        "sentimentScore": -0.5,
			
 
				+        "importance": 0.1,
			
 
				+    }]
			
 
				+
			
 
				+    # The same article arrives again in the next polling cycle.
			
 
				+    # Its heuristic topic (normalize_topic_from_title) is "other" (no
			
 
				+    # keyword match), which differs from the stored "crypto" topic.
			
 
				+    new_article = {
			
 
				+        "title": "Hegseth says US will keep pressure on Iran",
			
 
				+        "url": url,
			
 
				+        "source": "Breaking The News",
			
 
				+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
			
 
				+        "summary": "",
			
 
				+        # feed_url is used for per-feed hash tracking
			
 
				+        "feed_url": "https://breakingthenews.net/news-feed.xml",
			
 
				+        "importance": 0.11,
			
 
				+    }
			
 
				+
			
 
				+    clustered = dc.dedup_and_cluster_articles(
			
 
				+        [new_article],
			
 
				+        existing_clusters=existing,
			
 
				+        max_age_hours=4,
			
 
				+    )
			
 
				+
			
 
				+    all_clusters = [c for clusters in clustered.values() for c in clusters]
			
 
				+    # Must produce exactly 1 cluster — the new article merges into the
			
 
				+    # existing one.  Before the fix this yielded 2 clusters with different
			
 
				+    # cluster_ids because the topic mismatch prevented matching.
			
 
				+    assert len(all_clusters) == 1, (
			
 
				+        f"Expected 1 cluster, got {len(all_clusters)}: "
			
 
				+        f"{[c['headline'] for c in all_clusters]}"
			
 
				+    )
			
 
				+
			
 
				+    # The surviving cluster must carry the *same* cluster_id regardless of
			
 
				+    # which topic wins, i.e. cluster_id is now purely article-key based.
			
 
				+    from news_mcp.dedup.cluster import _stable_cluster_id
			
 
				+    expected_cid = _stable_cluster_id(
			
 
				+        "other",
			
 
				+        [{"title": "Hegseth says US will keep pressure on Iran", "url": url}],
			
 
				+    )
			
 
				+    assert all_clusters[0]["cluster_id"] == expected_cid
			
 
				+
			
 
				+    # The existing article must still be in the merged cluster.
			
 
				+    article_urls = [a["url"] for a in all_clusters[0]["articles"]]
			
 
				+    assert url in article_urls
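The duplicate-row failure that this regression test guards against can also be reproduced at the storage layer. The sketch below uses an in-memory SQLite table with a hypothetical three-column schema (the project's real `upsert_clusters` schema is not shown in this diff) and compares the old and new ID formulas. It assumes a SQLite build with upsert support (3.24 or later).

```python
import hashlib
import sqlite3
from typing import Callable


def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def old_id(topic: str, key: str) -> str:
    """Pre-fix formula: the topic is part of the hash input."""
    return sha1_hex(f"{topic}|{key}")


def new_id(topic: str, key: str) -> str:
    """Post-fix formula: the topic is ignored."""
    return sha1_hex(key)


def rows_after_two_cycles(id_fn: Callable[[str, str], str]) -> int:
    """Upsert the same article under two different topics; count rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema for illustration only.
    conn.execute(
        "CREATE TABLE clusters ("
        "cluster_id TEXT PRIMARY KEY, topic TEXT, headline TEXT)"
    )
    key = "https://example.com/article"
    for topic in ("other", "crypto"):  # heuristic topic, then enriched topic
        conn.execute(
            "INSERT INTO clusters (cluster_id, topic, headline) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(cluster_id) DO UPDATE SET "
            "topic = excluded.topic, headline = excluded.headline",
            (id_fn(topic, key), topic, "Example headline"),
        )
    count = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
    conn.close()
    return count


rows_old = rows_after_two_cycles(old_id)  # topic in hash: conflict never fires
rows_new = rows_after_two_cycles(new_id)  # topic excluded: second write updates
```

With the old formula the two writes get different primary keys and leave two rows. With the new formula the second write hits `ON CONFLICT` and updates the single existing row.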