
docs: update env example, readme, project, release notes for v0.3.1

- .env.example: add NEWS_CLUSTER_MAX_AGE_HOURS, add OPENROUTER_API_KEY,
  replace stale GROQ_* vars with current ENRICH_*/LLM_* names,
  add concurrency control examples, add clustering section header
- README.md: fix stale env var names, add OPENROUTER_API_KEY,
  new Clustering section explaining stable IDs, cross-cycle merge,
  orphan dedup, signal cascade
- PROJECT.md: bump version label to v0.3.1, describe new clustering
  features in current architecture, rename previous section to v0.2.x
- RELEASE_NOTES.md: add v0.3.1 release entry
- OUTLOOK.md: bump version label to v0.3.1, rewrite dedup strategy
  section with current implementation, add v0.3.0/v0.3.1 to completed
Lukas Goldschmidt, 1 week ago
Parent
Current commit
cd7b6ade99
5 files changed, 72 insertions and 19 deletions
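  1. .env.example: 14 insertions, 3 deletions
  2. OUTLOOK.md: 14 insertions, 8 deletions
  3. PROJECT.md: 7 insertions, 4 deletions
  4. README.md: 21 insertions, 4 deletions
  5. RELEASE_NOTES.md: 16 insertions, 0 deletions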

+ 14 - 3
.env.example

@@ -9,12 +9,13 @@ NEWS_SUMMARY_MODEL=llama4-16e
 # API keys
 GROQ_API_KEY=
 OPENAI_API_KEY=
+OPENROUTER_API_KEY=
 
 # Extraction behavior
 ENTITY_BLACKLIST=bloomberg
-GROQ_DEBUG=false
-GROQ_ENRICH_OTHER_ONLY=false
-GROQ_MAX_CLUSTERS_PER_REFRESH=20
+LLM_DEBUG=false
+ENRICH_OTHER_TOPICS_ONLY=false
+ENRICHMENT_MAX_PER_REFRESH=0
 
 # Embeddings (optional, Ollama-first when enabled)
 NEWS_EMBEDDINGS_ENABLED=false
@@ -22,6 +23,9 @@ OLLAMA_BASE_URL=http://127.0.0.1:11434
 OLLAMA_EMBEDDING_MODEL=nomic-embed-text
 NEWS_EMBEDDING_SIMILARITY_THRESHOLD=0.885
 
+# Clustering
+NEWS_CLUSTER_MAX_AGE_HOURS=4
+
 # Feeds
 NEWS_FEED_URL=https://breakingthenews.net/news-feed.xml
 NEWS_FEED_URLS=
@@ -39,3 +43,10 @@ NEWS_BACKGROUND_REFRESH_ENABLED=true
 NEWS_BACKGROUND_REFRESH_ON_START=true
 NEWS_PROMPTS_DIR=
 NEWS_ENTITY_ALIASES_FILE=
+
+# Concurrency controls (optional overrides)
+# NEWS_RSS_MAX_CONCURRENCY=10
+# NEWS_OLLAMA_MAX_CONCURRENCY=4
+# NEWS_LLM_CONCURRENCY_OPENROUTER=2
+# NEWS_LLM_CONCURRENCY_OPENAI=5
+# NEWS_LLM_CONCURRENCY_GROQ=8
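
The concurrency overrides above could plausibly be consumed as sizes for bounded semaphores. The following is a minimal sketch under that assumption; `concurrency_limit` and `bounded_gather` are hypothetical helper names, and treating the commented example values as fallbacks is a guess rather than the project's confirmed behavior.

```python
import asyncio
import os


def concurrency_limit(name: str, default: int) -> int:
    """Read an optional integer override from the environment."""
    raw = os.environ.get(name, "").strip()
    limit = int(raw) if raw else default
    if limit < 1:
        raise ValueError(f"{name} must be >= 1, got {limit}")
    return limit


async def bounded_gather(sem: asyncio.Semaphore, coros):
    """Run coroutines concurrently, with at most the semaphore's size in flight."""

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves the input order of results
    return await asyncio.gather(*(run(c) for c in coros))
```

A caller would build one semaphore per pipeline stage, for example `asyncio.Semaphore(concurrency_limit("NEWS_RSS_MAX_CONCURRENCY", 10))`, and pass it to `bounded_gather` together with the fetch coroutines.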

+ 14 - 8
OUTLOOK.md

@@ -1,7 +1,7 @@
 
 # 📰 News MCP Server — Requirements Spec
 
-> **Current version: v0.3.0** — see [RELEASE_NOTES.md](RELEASE_NOTES.md) for changelog.
+> **Current version: v0.3.1** — see [RELEASE_NOTES.md](RELEASE_NOTES.md) for changelog.
 
 ## 🎯 Goal
 
@@ -362,19 +362,23 @@ get_raw_articles()
 
 # 🧠 5. Deduplication Strategy (critical)
 
-Start simple:
+Clustering is the unit of truth, not individual articles.
 
-### v1:
+**Signal cascade** (cheapest first, short-circuit on match):
+1. Cosine similarity (if embeddings enabled) against cluster centroid
+2. Fuzzy title similarity (SequenceMatcher, configurable threshold, default 0.87)
+3. Token Jaccard over headline+summary (default threshold 0.55)
+4. Consensus: cosine ≥ 0.80 AND (jaccard ≥ 0.30 OR title ≥ 0.55)
 
-* normalize titles (lowercase, strip punctuation)
-* fuzzy match (threshold ~0.8)
+Each new article is compared against **all** articles in a candidate cluster; the best signal across all members is used.
 
-### v2:
+**Stable cluster IDs**: `sha1(topic | min_article_key)` — the same set of articles always maps to the same ID regardless of which article arrived first or which polling cycle created the cluster.
 
-* embeddings / semantic similarity
+**Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles can merge into clusters from previous polling cycles.
 
-Planned runtime order:
+**Orphan merge**: a post-clustering Union-Find pass merges clusters that share article keys, catching cases where articles about the same event didn't match during the main loop.
 
+Planned runtime order:
 * when `NEWS_EMBEDDINGS_ENABLED=true`, try Ollama embeddings first
 * if Ollama fails, fall back to the existing heuristic cluster path
 * keep candidate pre-filtering cheap before any vector compare
@@ -461,6 +465,8 @@ But only if you:
 * blacklist enforcement maintenance script added
 * related-entities tool added for co-occurrence neighborhoods
 * emerging-topic scoring improved with importance-weighting and co-occurrence
+* concurrent RSS/OLLAMA/LLM pipelines added (v0.3.0)
+* stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signal comparison added (v0.3.1)
 
 ---
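
The signal cascade added to the dedup strategy section can be sketched as a single short-circuiting function. This is an illustration only: the function name and signature are hypothetical, and using 0.885 (the `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` default) as the cosine cutoff in step 1 is an assumption. The other thresholds are the ones stated in the OUTLOOK.md text.

```python
from typing import Optional


def is_match(
    cosine: Optional[float],
    title: float,
    jaccard: float,
    *,
    cosine_threshold: float = 0.885,  # assumed: reuses the embedding threshold default
    title_threshold: float = 0.87,
    jaccard_threshold: float = 0.55,
) -> Optional[str]:
    """Return the name of the first signal that clears its threshold, else None."""
    # cosine is None when embeddings are disabled or unavailable
    if cosine is not None and cosine >= cosine_threshold:
        return "cosine"
    if title >= title_threshold:
        return "title"
    if jaccard >= jaccard_threshold:
        return "jaccard"
    # Consensus: a moderately high cosine backed by one weaker lexical signal
    if cosine is not None and cosine >= 0.80 and (jaccard >= 0.30 or title >= 0.55):
        return "consensus"
    return None
```

Returning the signal name rather than a bare boolean is a design choice of this sketch; it makes it easy to log which step produced a merge.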
 

+ 7 - 4
PROJECT.md

@@ -3,11 +3,14 @@
 ## Goal
 Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.
 
-## Current architecture (v2)
+## Current architecture (v0.3.1)
 - FastMCP SSE server mounted at `/mcp`
 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
 - Concurrent RSS fetch (async `asyncio.gather` + `httpx`, bounded semaphore)
-- Composite dedup via fuzzy title + token Jaccard + Ollama embedding cosine
+- **Multi-signal clustering**: cosine embedding + fuzzy title + token Jaccard + consensus cascade; compares against ALL cluster articles (not just seed)
+- **Stable cluster IDs**: `sha1(topic | min_article_key)` — order-independent, consistent across polling cycles
+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h)
+- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
 - Concurrent Ollama embeddings (pre-computed before clustering loop)
 - Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
 - Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
@@ -15,11 +18,11 @@ Provide a signal-extraction MCP server that converts RSS into **deduplicated, en
 - Dashboard REST API (`/api/v1/*`) for clusters, sentiment series, entity frequencies
 - `get_latest_events()` defaults to all topics (omit `topic` for unfiltered)
 
-## Previous: v1 architecture
+## Previous: v0.2.x architecture
 - FastMCP SSE server mounted at `/mcp`
 - SQLite cache for clusters + Groq summary caches
 - RSS fetch (breakingthenews.net)
-- v1 dedup via fuzzy title similarity
+- v1 dedup via fuzzy title similarity only, seed-article-only comparison
 - optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
 - configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
 - optional embeddings backfill script for precomputing cluster vectors in SQLite
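
The `sha1(topic | min_article_key)` scheme listed under the current architecture can be illustrated in a few lines. This is a sketch, not the project's code: the function name, the literal `|` separator, and UTF-8 encoding are assumptions.

```python
import hashlib
from typing import Iterable


def stable_cluster_id(topic: str, article_keys: Iterable[str]) -> str:
    """Hash the topic together with the lexicographically smallest article key."""
    smallest = min(article_keys)  # raises ValueError for an empty cluster
    return hashlib.sha1(f"{topic}|{smallest}".encode("utf-8")).hexdigest()
```

Because only the minimum key enters the hash, the arrival order of articles does not matter. Under this scheme the ID would change only if an article with a smaller key later joins the cluster.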

+ 21 - 4
README.md

@@ -71,7 +71,7 @@ See `news-mcp/.env`.
 Key variables:
 - `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
 - `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
-- `GROQ_API_KEY`, `OPENAI_API_KEY`
+- `GROQ_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`
 - `ENTITY_BLACKLIST` (comma-separated, case-insensitive patterns; wildcards are supported)
 - `NEWS_PROMPTS_DIR` (override prompt directory)
 - `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
@@ -85,14 +85,31 @@ Key variables:
 - `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
 - `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
 - `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
-- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
-- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
+- `ENRICH_OTHER_TOPICS_ONLY` (default false; set true to only LLM-enrich "other" topic clusters)
+- `ENRICHMENT_MAX_PER_REFRESH` (default 0 = no limit; max clusters to LLM-enrich per refresh cycle)
+- `NEWS_LLM_DEBUG` (default false; enable debug logging for LLM calls)
+- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering)
 - `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
 - `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
-- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)
+- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`)
+- `NEWS_CLUSTER_MAX_AGE_HOURS` (default `4`; cross-cycle merge window. Set `0` to disable)
 
 When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
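
The `NEWS_CLUSTER_MAX_AGE_HOURS` entry above describes a time window with `0` meaning disabled. One way such a setting might be turned into a cutoff timestamp is sketched below; `cross_cycle_cutoff` is a hypothetical helper, and how the poller actually queries recent clusters is not shown in this commit.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def cross_cycle_cutoff(
    now: datetime, env: Mapping[str, str] = os.environ
) -> Optional[datetime]:
    """Return the oldest allowed cluster timestamp, or None if merging is disabled."""
    raw = env.get("NEWS_CLUSTER_MAX_AGE_HOURS", "").strip()
    hours = float(raw) if raw else 4.0  # documented default is 4
    if hours <= 0:
        return None  # 0 disables cross-cycle merge
    return now - timedelta(hours=hours)


# Example: clusters updated before `cutoff` would not be seeded as merge targets.
cutoff = cross_cycle_cutoff(datetime.now(timezone.utc))
```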
 
+## Clustering
+
+The clustering pipeline has two modes:
+
+**In-cycle dedup** (every poll): new articles are compared against each other and against recently loaded existing clusters. A match merges into the existing cluster; no match creates a new cluster.
+
+**Cross-cycle merge** (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`): before clustering, the poller loads recent clusters from the DB and seeds them as merge targets. This means an article that arrives in poll N+1 can merge into a cluster created in poll N, even if the article's title is different enough that it wouldn't match against the cluster's original seed article. Set to `0` to disable.
+
+**Stable cluster IDs**: cluster IDs are derived from the topic and the lexicographically smallest article key in the cluster, not from the first article's title. This means the same set of articles always resolves to the same `cluster_id` regardless of processing order or polling cycle.
+
+**Orphan merge**: a post-clustering pass detects clusters that share article keys (via Union-Find) and merges them. This catches cases where two articles about the same event didn't match during the main loop (e.g. embeddings were temporarily unavailable).
+
+**Signal cascade**: each new article is compared against all articles in a candidate cluster (not just the seed). The matching cascade is: cosine similarity → title similarity → token Jaccard → consensus (cosine + title/jaccard). The first signal that clears its threshold wins.
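
The orphan merge described in the Clustering section relies on Union-Find over shared article keys. A minimal sketch of that idea follows, assuming clusters are available as a mapping from cluster ID to a set of article keys; the function name and data shape are illustrative, not taken from the codebase.

```python
from typing import Dict, List, Set


def merge_clusters_sharing_keys(clusters: Dict[str, Set[str]]) -> List[Set[str]]:
    """Merge clusters (cluster_id -> article keys) that share any article key."""
    parent = {cid: cid for cid in clusters}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every cluster to the first cluster seen holding the same key
    first_owner: Dict[str, str] = {}
    for cid, keys in clusters.items():
        for key in keys:
            if key in first_owner:
                union(first_owner[key], cid)
            else:
                first_owner[key] = cid

    # Collect the article keys of each connected component
    merged: Dict[str, Set[str]] = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return list(merged.values())
```

Merging is transitive here: if cluster A shares a key with B and B shares a different key with C, all three end up in one group.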
+
 ## Persistence and migration
 
 The default database path is project-relative:

+ 16 - 0
RELEASE_NOTES.md

@@ -1,5 +1,21 @@
 # news-mcp release notes
 
+## v0.3.1 — stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signals
+
+### Highlights
+- **Multi-article signal comparison**: `_signals()` now compares a new article against ALL articles in a candidate cluster (not just the seed). The best title and jaccard scores across all cluster members are used for matching.
+- **Stable cluster IDs**: `cluster_id = sha1(topic | min_article_key)` instead of `sha1(topic | seed_title)`. The same set of articles always maps to the same ID regardless of processing order. This eliminates duplicate clusters for the same event.
+- **Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles in poll N+1 can merge into clusters created in poll N.
+- **Orphan merge**: post-clustering Union-Find pass detects and merges clusters that share article keys. Catches cases where articles about the same event didn't match during the main loop (e.g. embeddings temporarily unavailable).
+- **Cascade match via `_is_match()`**: unified signal evaluation — cosine → title → jaccard → consensus. Short-circuits on first passing signal. Configurable `title_threshold` parameter.
+- **Cluster embedding updated on merge**: when a new article merges into an existing cluster, the cluster's embedding is updated to the new article's vector, improving subsequent embedding-based matching.
+- **`NEWS_CLUSTER_MAX_AGE_HOURS`** env var (default `4`): controls the cross-cycle merge window. Set to `0` to disable cross-cycle merge.
+
+### Migration notes
+- No database schema changes.
+- Existing cluster IDs will change format on the next polling cycle (old rows are updated in-place via `ON CONFLICT(cluster_id)` once the new ID is computed). Transient enrichment cache misses may occur for one cycle.
+- Old duplicate clusters (same event, different IDs) will age out via pruning. To clean them immediately, run the article dedup cleanup script.
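
The multi-article comparison in the v0.3.1 highlights takes the best title and Jaccard scores across all cluster members. The sketch below illustrates that idea with `difflib.SequenceMatcher` and a simple token set. It is not the project's `_signals()` implementation: the helper names, the tokenizer, and comparing titles only (rather than headline plus summary) are simplifying assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a deliberately simple tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_signals(new_title: str, member_titles: Iterable[str]) -> Tuple[float, float]:
    """Best (title, jaccard) scores of a new article against every cluster member."""
    best_title = best_jaccard = 0.0
    for member in member_titles:
        best_title = max(best_title, title_similarity(new_title, member))
        best_jaccard = max(best_jaccard, jaccard(new_title, member))
    return best_title, best_jaccard
```

Comparing against every member means a cluster whose seed headline has drifted away from later coverage can still attract follow-up articles, since any member may supply the best score.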
+
 ## v0.3.0 — concurrent polling, enrichment retry, all-topics default
 
 ### Highlights