1 неделя назад · d855ede033
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -3,13 +3,13 @@
 
															 ## Goal
														
 
															 Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.
														
 
															-## Current architecture (v0.3.1)
														
 
															+## Current architecture (v0.3.2)
														
 
															 - FastMCP SSE server mounted at `/mcp`
														
 
															 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
														
 
															 - Concurrent RSS fetch (async `asyncio.gather` + `httpx`, bounded semaphore)
														
 
															 - **Multi-signal clustering**: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
														
 
															-- **Stable cluster IDs**: `sha1(topic | min_article_key)` — order-independent, consistent across polling cycles
														
 
															-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h)
														
 
															+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
														
 
															+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
														
 
															 - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
														
 
															 - Concurrent Ollama embeddings (pre-computed before clustering loop)
														
 
															 - Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
														
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -1,8 +1,25 @@
 
															 # news-mcp release notes
														
 
															-## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
														
 
															+## v0.3.2 — topic-independent cluster IDs, fix for cross-cycle duplicate clusters
														
 
															 ### Highlights
														
 
															+- **Topic-independent cluster IDs**: `_stable_cluster_id` no longer includes the `topic` in the hash. The ID is now `sha1(min_article_key)` instead of `sha1(topic|min_article_key)`. This ensures the same article always maps to the same cluster_id regardless of whether the heuristic classifier or the LLM assigns a different topic. Previously, when the LLM reclassified a cluster's topic (e.g. "macro" → "crypto"), the same article arriving in the next polling cycle would get a different cluster_id, bypass `ON CONFLICT DO UPDATE`, and silently create a duplicate row in the DB.
														
 
															+
														
 
															+- **Cross-cycle merge bucket fix**: when seeding `existing_clusters` from the DB, the cluster's topic is now re-derived via the same heuristic (`normalize_topic_from_title`) that new articles use. Previously, existing clusters were bucketed by their *enriched* topic (from the DB), so a new article with a different heuristic topic would land in a different `by_topic` bucket and never be matched against the existing cluster. This was the primary mechanism producing the 419+ duplicate clusters observed in production.
														
 
															+
														
 
															+### Root cause
														
 
															+Two cooperating bugs allowed the same article to accumulate duplicate DB rows:
														
 
															+
														
 
															+1. **cluster_id included topic** → same article with different topic → different PK → `ON CONFLICT` never fires.
														
 
															+2. **existing clusters bucketed by enriched topic** → new article bucketed by heuristic topic → cross-cycle matching loop never compares them → orphan merge only runs within a single topic bucket → no merge.
														
 
															+
														
 
															+Both fixes together ensure that (a) the cluster_id is deterministic from article keys alone, and (b) cross-cycle matching works regardless of topic drift between heuristic and enriched classifications.
														
 
															+
														
 
															+### Migration notes
														
 
															+- Existing cluster IDs will change format on the next polling cycle. Old rows with the previous ID format become stale (the new code writes with the new ID via `ON CONFLICT`). They will age out via pruning. To clean them immediately, run a one-time dedup pass or wipe and let the next refresh rebuild.
														
 
															+- No database schema changes.
														
 
															+
														
 
															+## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
														
 
															 - **Emerging topics rewrite** (`detect_emerging_topics`): complete rewrite with 5 new capabilities:
														
 
															   - **Timeframe parameter** (`"4h"`, `"24h"`, `"3d"`, etc.) — controls lookback window instead of always using `DEFAULT_LOOKBACK_HOURS`
														
 
															   - **Velocity scoring** — splits the window into recent vs prior half, computes `velocity = (recent + 0.5) / (prior + 0.5)`. Entities accelerating now vs before score much higher than steady-state ones
														
--- a/news_mcp/dedup/cluster.py
+++ b/news_mcp/dedup/cluster.py
@@ -168,16 +168,18 @@ def _is_match(
 
															 # ---------------------------------------------------------------------------
														
 
															 def _stable_cluster_id(topic: str, articles: List[Dict[str, Any]]) -> str:
														
 
															-    """Deterministic cluster ID derived from the topic and the sorted set of
														
 
															-    article keys.  Using the minimum key (lexicographic) as the seed ensures
														
 
															-    that no matter which article arrives first, the same set of articles always
														
 
															-    maps to the same cluster_id."""
														
 
															+    """Deterministic cluster ID derived from the sorted set of article keys.
														
 
															+
														
 
															+    The topic is intentionally excluded from the hash: the same article may be
														
 
															+    classified under different topics across cycles (heuristic vs LLM-enriched),
														
 
															+    but it must always map to the same cluster_id so that ON CONFLICT DO UPDATE
														
 
															+    in upsert_clusters correctly merges them instead of creating duplicates."""
														
 
															     keys = sorted(_article_key(a) for a in articles if _article_key(a))
														
 
															     if not keys:
														
 
															         # Degenerate fallback — single article with empty url and title
														
 
															         return hashlib.sha1(topic.encode("utf-8")).hexdigest()
														
 
															     seed = keys[0]
														
 
															-    return hashlib.sha1(f"{topic}|{seed}".encode("utf-8")).hexdigest()
														
 
															+    return hashlib.sha1(seed.encode("utf-8")).hexdigest()
														
 
															 # ---------------------------------------------------------------------------
														
@@ -401,12 +403,17 @@ def dedup_and_cluster_articles(
 
															     by_topic: Dict[str, List[Dict[str, Any]]] = {}
														
 
															-    # Seed with existing clusters (filtered by age window)
														
 
															+    # Seed with existing clusters (filtered by age window).
														
 
															+    # Re-derive the topic via the same heuristic (normalize_topic_from_title)
														
 
															+    # that new articles use, so that existing and new clusters with the same
														
 
															+    # headline land in the same by_topic bucket regardless of what LLM
														
 
															+    # enrichment previously stored on the cluster.
														
 
															     if existing_clusters:
														
 
															         for c in existing_clusters:
														
 
															             if not _cluster_is_within_age_window(c, max_age_hours=max_age_hours):
														
 
															                 continue
														
 
															-            topic = c.get("topic", "other") or "other"
														
 
															+            seed_title = c.get("headline") or ""
														
 
															+            topic = normalize_topic_from_title(seed_title) if seed_title else (c.get("topic", "other") or "other")
														
 
															             by_topic.setdefault(topic, []).append(dict(c))
														
 
															     for a in articles:
														
--- a/test_news_mcp.py
+++ b/test_news_mcp.py
@@ -909,3 +909,83 @@ def test_preseed_merge_into_existing_cluster():
 
															     # Should have exactly 1 cluster (the existing one, now with 2 articles)
														
 
															     assert len(all_clusters) == 1, f"Expected 1 cluster, got {len(all_clusters)}: {[c['headline'] for c in all_clusters]}"
														
 
															     assert len(all_clusters[0]["articles"]) == 2
														
 
															+
														
 
															+
														
 
															+def test_cross_cycle_merge_topic_mismatch():
														
 
															+    """Regression: same article arriving in two cycles must merge even when
														
 
															+    the existing cluster's enriched topic differs from the new article's
														
 
															+    heuristic topic.  Previously the cluster_id included the topic in the
														
 
															+    hash AND existing clusters were bucketed by enriched topic, so a
														
 
															+    topic mismatch silently produced two rows in the DB."""
														
 
															+    from news_mcp.dedup import cluster as dc
														
 
															+
														
 
															+    url = (
														
 
															+        "https://breakingthenews.net/Article/"
														
 
															+        "Hegseth-says-US-will-keep-pressure-on-Iran/66401647"
														
 
															+    )
														
 
															+
														
 
															+    existing = [{
														
 
															+        "cluster_id": "old-id",
														
 
															+        # Enriched topic from a prior LLM pass — *different* from what
														
 
															+        # normalize_topic_from_title would return for the headline.
														
 
															+        "topic": "crypto",
														
 
															+        "headline": "Hegseth says US will keep pressure on Iran",
														
 
															+        "summary": "",
														
 
															+        "sources": ["Breaking The News"],
														
 
															+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
														
 
															+        "last_updated": datetime.now(timezone.utc).isoformat(),
														
 
															+        "first_seen": "Sat, 30 May 2026 13:00:00 GMT",
														
 
															+        "articles": [{
														
 
															+            "title": "Hegseth says US will keep pressure on Iran",
														
 
															+            "url": url,
														
 
															+            "source": "Breaking The News",
														
 
															+            "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
														
 
															+            "summary": "",
														
 
															+        }],
														
 
															+        "entities": ["Pete Hegseth", "Iran"],
														
 
															+        "sentiment": "negative",
														
 
															+        "sentimentScore": -0.5,
														
 
															+        "importance": 0.1,
														
 
															+    }]
														
 
															+
														
 
															+    # The same article arrives again in the next polling cycle.
														
 
															+    # Its heuristic topic (normalize_topic_from_title) is "other" (no
														
 
															+    # keyword match), which differs from the stored "crypto" topic.
														
 
															+    new_article = {
														
 
															+        "title": "Hegseth says US will keep pressure on Iran",
														
 
															+        "url": url,
														
 
															+        "source": "Breaking The News",
														
 
															+        "timestamp": "Sat, 30 May 2026 13:00:00 GMT",
														
 
															+        "summary": "",
														
 
															+        # feed_url is used for per-feed hash tracking
														
 
															+        "feed_url": "https://breakingthenews.net/news-feed.xml",
														
 
															+        "importance": 0.11,
														
 
															+    }
														
 
															+
														
 
															+    clustered = dc.dedup_and_cluster_articles(
														
 
															+        [new_article],
														
 
															+        existing_clusters=existing,
														
 
															+        max_age_hours=4,
														
 
															+    )
														
 
															+
														
 
															+    all_clusters = [c for clusters in clustered.values() for c in clusters]
														
 
															+    # Must produce exactly 1 cluster — the new article merges into the
														
 
															+    # existing one.  Before the fix this yielded 2 clusters with different
														
 
															+    # cluster_ids because the topic mismatch prevented matching.
														
 
															+    assert len(all_clusters) == 1, (
														
 
															+        f"Expected 1 cluster, got {len(all_clusters)}: "
														
 
															+        f"{[c['headline'] for c in all_clusters]}"
														
 
															+    )
														
 
															+
														
 
															+    # The surviving cluster must carry the *same* cluster_id regardless of
														
 
															+    # which topic wins, i.e. cluster_id is now purely article-key based.
														
 
															+    from news_mcp.dedup.cluster import _stable_cluster_id
														
 
															+    expected_cid = _stable_cluster_id(
														
 
															+        "other",
														
 
															+        [{"title": "Hegseth says US will keep pressure on Iran", "url": url}],
														
 
															+    )
														
 
															+    assert all_clusters[0]["cluster_id"] == expected_cid
														
 
															+
														
 
															+    # The existing article must still be in the merged cluster.
														
 
															+    article_urls = [a["url"] for a in all_clusters[0]["articles"]]
														
 
															+    assert url in article_urls