Explorar o código

fix: topic-independent cluster IDs + cross-cycle merge bucket fix

Two cooperating bugs caused 419+ duplicate cluster rows in production:

1. _stable_cluster_id included topic in hash (sha1(topic|seed)).
   Same article with different heuristic vs LLM-enriched topic → different
   cluster_id → ON CONFLICT DO UPDATE never fires → duplicate DB row.

2. Cross-cycle merge: existing clusters from DB were bucketed by their
   enriched topic, new articles by heuristic topic. Topic mismatch meant
   _signals() was never called between them → orphan merge only runs
   within a single topic bucket → no merge.

Fixes:
- _stable_cluster_id: removed topic from hash → sha1(min_article_key) only
- dedup_and_cluster_articles: re-derive heuristic topic via
  normalize_topic_from_title when seeding existing_clusters, so both
  sides land in the same by_topic bucket

Added regression test: test_cross_cycle_merge_topic_mismatch.
All 39 tests pass.
Lukas Goldschmidt hai 1 semana
pai
achega
d855ede033
Modificáronse 4 ficheiros con 115 adicións e 11 borrados
  1. 3 3
      PROJECT.md
  2. 18 1
      RELEASE_NOTES.md
  3. 14 7
      news_mcp/dedup/cluster.py
  4. 80 0
      test_news_mcp.py

+ 3 - 3
PROJECT.md

@@ -3,13 +3,13 @@
 ## Goal
 Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.
 
-## Current architecture (v0.3.1)
+## Current architecture (v0.3.2)
 - FastMCP SSE server mounted at `/mcp`
 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
 - Concurrent RSS fetch (async `asyncio.gather` + `httpx`, bounded semaphore)
 - **Multi-signal clustering**: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
-- **Stable cluster IDs**: `sha1(topic | min_article_key)` — order-independent, consistent across polling cycles
-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h)
+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
 - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
 - Concurrent Ollama embeddings (pre-computed before clustering loop)
 - Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore

+ 18 - 1
RELEASE_NOTES.md

@@ -1,8 +1,25 @@
 # news-mcp release notes
 
-## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
+## v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters
 
 ### Highlights
+- **Topic-independent cluster IDs**: `_stable_cluster_id` no longer includes the `topic` in the hash. The ID is now `sha1(min_article_key)` instead of `sha1(topic|min_article_key)`. This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass `ON CONFLICT DO UPDATE`, and silently create a duplicate row in the DB.
+
+- **Cross-cycle merge bucket fix**: when seeding `existing_clusters` from the DB, the cluster's topic is now re-derived via the same heuristic (`normalize_topic_from_title`) that new articles use. Previously, existing clusters were bucketed by their *enriched* topic (from the DB), so a new article with a different heuristic topic would land in a different `by_topic` bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
+
+### Root cause
+Two cooperating bugs allowed the same article to accumulate duplicate DB rows:
+
+1. **cluster_id included topic** → same article with different topic → different PK → `ON CONFLICT` never fires.
+2. **existing clusters bucketed by enriched topic** → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.
+
+Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.
+
+### Migration notes
+- Existing cluster IDs will change format on the next polling cycle. Old rows with the previous ID format become stale (the new code writes with the new ID via `ON CONFLICT`). They will age out via pruning. To clean them immediately, run a one-time dedup pass or wipe and let the next refresh rebuild.
+- No database schema changes.
+
+## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
 - **Emerging topics rewrite** (`detect_emerging_topics`): complete rewrite with 5 new capabilities:
   - **Timeframe parameter** (`"4h"`, `"24h"`, `"3d"`, etc.) — controls lookback window instead of always using `DEFAULT_LOOKBACK_HOURS`
   - **Velocity scoring** — splits the window into recent vs prior half, computes `velocity = (recent + 0.5) / (prior + 0.5)`. Entities accelerating now vs before score much higher than steady-state ones

+ 14 - 7
news_mcp/dedup/cluster.py

@@ -168,16 +168,18 @@ def _is_match(
 # ---------------------------------------------------------------------------
 
 def _stable_cluster_id(topic: str, articles: List[Dict[str, Any]]) -> str:
-    """Deterministic cluster ID derived from the topic and the sorted set of
-    article keys.  Using the minimum key (lexicographic) as the seed ensures
-    that no matter which article arrives first, the same set of articles always
-    maps to the same cluster_id."""
+    """Deterministic cluster ID derived from the sorted set of article keys.
+
+    The topic is intentionally excluded from the hash: the same article may be
+    classified under different topics across cycles (heuristic vs LLM-enriched),
+    but it must always map to the same cluster_id so that ON CONFLICT DO UPDATE
+    in upsert_clusters correctly merges them instead of creating duplicates."""
     keys = sorted(_article_key(a) for a in articles if _article_key(a))
     if not keys:
         # Degenerate fallback — single article with empty url and title
         return hashlib.sha1(topic.encode("utf-8")).hexdigest()
     seed = keys[0]
-    return hashlib.sha1(f"{topic}|{seed}".encode("utf-8")).hexdigest()
+    return hashlib.sha1(seed.encode("utf-8")).hexdigest()
 
 
 # ---------------------------------------------------------------------------
@@ -401,12 +403,17 @@ def dedup_and_cluster_articles(
 
     by_topic: Dict[str, List[Dict[str, Any]]] = {}
 
-    # Seed with existing clusters (filtered by age window)
+    # Seed with existing clusters (filtered by age window).
+    # Re-derive the topic via the same heuristic (normalize_topic_from_title)
+    # that new articles use, so that existing and new clusters with the same
+    # headline land in the same by_topic bucket regardless of what LLM
+    # enrichment previously stored on the cluster.
     if existing_clusters:
         for c in existing_clusters:
             if not _cluster_is_within_age_window(c, max_age_hours=max_age_hours):
                 continue
-            topic = c.get("topic", "other") or "other"
+            seed_title = c.get("headline") or ""
+            topic = normalize_topic_from_title(seed_title) if seed_title else (c.get("topic", "other") or "other")
             by_topic.setdefault(topic, []).append(dict(c))
 
     for a in articles:

+ 80 - 0
test_news_mcp.py

@@ -909,3 +909,83 @@ def test_preseed_merge_into_existing_cluster():
     # Should have exactly 1 cluster (the existing one, now with 2 articles)
     assert len(all_clusters) == 1, f"Expected 1 cluster, got {len(all_clusters)}: {[c['headline'] for c in all_clusters]}"
     assert len(all_clusters[0]["articles"]) == 2
+
+
+def test_cross_cycle_merge_topic_mismatch():
+    """Regression: same article arriving in two cycles must merge even when
+    the existing cluster's enriched topic differs from the new article's
+    heuristic topic.  Previously the cluster_id included the topic in the
+    hash AND existing clusters were bucketed by enriched topic, so a
+    topic mismatch silently produced two rows in the DB."""
+    from news_mcp.dedup import cluster as dc
+
+    url = (
+        "https://breakingthenews.net/Article/"
+        "Hegseth-says-US-will-keep-pressure-on-Iran/66401647"
+    )
+
+    existing = [{
+        "cluster_id": "old-id",
+        # Enriched topic from a prior LLM pass — *different* from what
+        # normalize_topic_from_title would return for the headline.
+        "topic": "crypto",
+        "headline": "Hegseth says US will keep pressure on Iran",
+        "summary": "",
+        "sources": ["Breaking The News"],
+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
+        "last_updated": datetime.now(timezone.utc).isoformat(),
+        "first_seen": "Sat, 30 May 2026 13:00:00 GMT",
+        "articles": [{
+            "title": "Hegseth says US will keep pressure on Iran",
+            "url": url,
+            "source": "Breaking The News",
+            "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
+            "summary": "",
+        }],
+        "entities": ["Pete Hegseth", "Iran"],
+        "sentiment": "negative",
+        "sentimentScore": -0.5,
+        "importance": 0.1,
+    }]
+
+    # The same article arrives again in the next polling cycle.
+    # Its heuristic topic (normalize_topic_from_title) is "other" (no
+    # keyword match), which differs from the stored "crypto" topic.
+    new_article = {
+        "title": "Hegseth says US will keep pressure on Iran",
+        "url": url,
+        "source": "Breaking The News",
+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
+        "summary": "",
+        # feed_url is used for per-feed hash tracking
+        "feed_url": "https://breakingthenews.net/news-feed.xml",
+        "importance": 0.11,
+    }
+
+    clustered = dc.dedup_and_cluster_articles(
+        [new_article],
+        existing_clusters=existing,
+        max_age_hours=4,
+    )
+
+    all_clusters = [c for clusters in clustered.values() for c in clusters]
+    # Must produce exactly 1 cluster — the new article merges into the
+    # existing one.  Before the fix this yielded 2 clusters with different
+    # cluster_ids because the topic mismatch prevented matching.
+    assert len(all_clusters) == 1, (
+        f"Expected 1 cluster, got {len(all_clusters)}: "
+        f"{[c['headline'] for c in all_clusters]}"
+    )
+
+    # The surviving cluster must carry the *same* cluster_id regardless of
+    # which topic wins, i.e. cluster_id is now purely article-key based.
+    from news_mcp.dedup.cluster import _stable_cluster_id
+    expected_cid = _stable_cluster_id(
+        "other",
+        [{"title": "Hegseth says US will keep pressure on Iran", "url": url}],
+    )
+    assert all_clusters[0]["cluster_id"] == expected_cid
+
+    # The existing article must still be in the merged cluster.
+    article_urls = [a["url"] for a in all_clusters[0]["articles"]]
+    assert url in article_urls