@@ -71,7 +71,7 @@ See `news-mcp/.env`.
 Key variables:
 - `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
 - `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
-- `GROQ_API_KEY`, `OPENAI_API_KEY`
+- `GROQ_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`
 - `ENTITY_BLACKLIST` (comma-separated, case-insensitive patterns; wildcards are supported)
 - `NEWS_PROMPTS_DIR` (override prompt directory)
 - `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
@@ -85,14 +85,31 @@ Key variables:
 - `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
 - `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
 - `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
-- `GROQ_ENRICH_OTHER_ONLY` (default false; set true for cost control)
-- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering when wired in)
+- `ENRICH_OTHER_TOPICS_ONLY` (default false; set true to only LLM-enrich "other" topic clusters)
+- `ENRICHMENT_MAX_PER_REFRESH` (default 0 = no limit; max clusters to LLM-enrich per refresh cycle)
+- `NEWS_LLM_DEBUG` (default false; enable debug logging for LLM calls)
+- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering)
 - `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
 - `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
-- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`; used when embeddings are enabled)
+- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`)
+- `NEWS_CLUSTER_MAX_AGE_HOURS` (default `4`; cross-cycle merge window. Set `0` to disable)
 
 When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
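As an illustration of how the clustering-related variables above might be read, here is a minimal sketch. It is not taken from news-mcp: the `ClusteringSettings` class, the `load_settings` and `env_bool` helpers, the set of accepted "true" spellings, and the precedence of `OLLAMA_BASE_URL` over `OLLAMA_URL` are all assumptions. Only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: which spellings count as "true" is not stated in the README.
TRUTHY = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean flag, falling back to the default when unset or empty."""
    raw = env.get(name, "").strip().lower()
    return default if raw == "" else raw in TRUTHY


@dataclass(frozen=True)
class ClusteringSettings:  # hypothetical container, for illustration only
    embeddings_enabled: bool
    ollama_base_url: str
    embedding_model: str
    similarity_threshold: float
    cluster_max_age_hours: float

    @property
    def cross_cycle_merge_enabled(self) -> bool:
        # The README says a value of 0 disables the cross-cycle merge window.
        return self.cluster_max_age_hours > 0


def load_settings(env: Mapping[str, str] = os.environ) -> ClusteringSettings:
    return ClusteringSettings(
        embeddings_enabled=env_bool(env, "NEWS_EMBEDDINGS_ENABLED", False),
        # Assumption: OLLAMA_BASE_URL wins when both names are set.
        ollama_base_url=env.get("OLLAMA_BASE_URL")
        or env.get("OLLAMA_URL")
        or "http://127.0.0.1:11434",
        embedding_model=env.get("OLLAMA_EMBEDDING_MODEL") or "nomic-embed-text",
        similarity_threshold=float(
            env.get("NEWS_EMBEDDING_SIMILARITY_THRESHOLD") or 0.885
        ),
        cluster_max_age_hours=float(env.get("NEWS_CLUSTER_MAX_AGE_HOURS") or 4),
    )
```

Passing the environment in as a mapping keeps the sketch easy to exercise without touching the real process environment, for example `load_settings({"NEWS_CLUSTER_MAX_AGE_HOURS": "0"})`.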
 
+## Clustering
+
+The clustering pipeline has two modes:
+
+**In-cycle dedup** (every poll): new articles are compared against each other and against recently loaded existing clusters. A match merges into the existing cluster; no match creates a new cluster.
+
+**Cross-cycle merge** (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`): before clustering, the poller loads recent clusters from the DB and seeds them as merge targets. This means an article that arrives in poll N+1 can merge into a cluster created in poll N, even if the article's title is different enough that it wouldn't match against the cluster's original seed article. Set to `0` to disable.
+
+**Stable cluster IDs**: cluster IDs are derived from the topic and the lexicographically smallest article key in the cluster, not from the first article's title. This means the same set of articles always resolves to the same `cluster_id` regardless of processing order or polling cycle.
+
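The stable-ID rule can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name `stable_cluster_id`, the use of SHA-256, the separator byte, and the 16-character truncation are all hypothetical. The only part taken from the description is that the ID depends on the topic and the lexicographically smallest article key.

```python
import hashlib
from typing import Iterable


def stable_cluster_id(topic: str, article_keys: Iterable[str]) -> str:
    """Derive an ID from the topic and the smallest article key.

    Because min() ignores ordering, the same set of keys yields the
    same ID no matter which article was processed first.
    """
    keys = list(article_keys)
    if not keys:
        raise ValueError("a cluster needs at least one article key")
    anchor = min(keys)  # lexicographically smallest key
    # Assumption: hash function, separator and length are illustrative choices.
    payload = f"{topic}\x1f{anchor}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

For example, `stable_cluster_id("economy", ["b-2", "a-1"])` and `stable_cluster_id("economy", ["a-1", "b-2"])` return the same value, while changing the topic changes the ID.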
+**Orphan merge**: a post-clustering pass detects clusters that share article keys (via Union-Find) and merges them. This catches cases where two articles about the same event didn't match during the main loop (e.g. embeddings were temporarily unavailable).
+
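A minimal Union-Find sketch of such an orphan-merge pass is shown below. The function name `merge_orphans`, the input shape (cluster ID mapped to a set of article keys), and the choice of the smallest cluster ID as the surviving root are assumptions made for illustration.

```python
from typing import Dict, Set


def merge_orphans(clusters: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Merge clusters that share at least one article key.

    Returns a mapping from the surviving cluster ID to the union of
    article keys of every cluster merged into it.
    """
    parent = {cid: cid for cid in clusters}

    def find(x: str) -> str:
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: keep the smaller ID as root so results are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    owner: Dict[str, str] = {}  # article key -> first cluster seen holding it
    for cid in sorted(clusters):
        for key in clusters[cid]:
            if key in owner:
                union(owner[key], cid)
            else:
                owner[key] = cid

    merged: Dict[str, Set[str]] = {}
    for cid, keys in clusters.items():
        merged.setdefault(find(cid), set()).update(keys)
    return merged
```

Merging is transitive: if cluster `a` shares a key with `b`, and `b` shares a different key with `d`, all three end up in one cluster even though `a` and `d` have no key in common.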
+**Signal cascade**: each new article is compared against all articles in a candidate cluster (not just the seed). The matching cascade is: cosine similarity → title similarity → token Jaccard → consensus (cosine + title/jaccard). The first signal that clears its threshold wins.
+
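The cascade order can be sketched as below. Only the ordering of the signals and the `0.885` cosine default come from the text; everything else is an assumption for illustration, including the `Article` class, the use of `difflib.SequenceMatcher` for title similarity, the remaining threshold values, and the exact consensus rule (a lower cosine bar combined with a lower title or Jaccard bar).

```python
import math
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional, Sequence

COSINE_T = 0.885           # default of NEWS_EMBEDDING_SIMILARITY_THRESHOLD
TITLE_T = 0.80             # assumption
JACCARD_T = 0.60           # assumption
CONSENSUS_COSINE_T = 0.80  # assumption
CONSENSUS_TEXT_T = 0.50    # assumption


@dataclass
class Article:  # hypothetical shape, for illustration only
    title: str
    embedding: Optional[Sequence[float]] = None


def cosine(a: Optional[Sequence[float]], b: Optional[Sequence[float]]) -> float:
    # Missing or mismatched embeddings simply contribute no cosine signal.
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_signal(new: Article, member: Article) -> Optional[str]:
    """Return the name of the first signal that clears its threshold."""
    cos = cosine(new.embedding, member.embedding)
    title = title_similarity(new.title, member.title)
    jac = jaccard(new.title, member.title)
    if cos >= COSINE_T:
        return "cosine"
    if title >= TITLE_T:
        return "title"
    if jac >= JACCARD_T:
        return "jaccard"
    if cos >= CONSENSUS_COSINE_T and max(title, jac) >= CONSENSUS_TEXT_T:
        return "consensus"
    return None


def match_cluster(new: Article, members: Iterable[Article]) -> Optional[str]:
    """Compare against every member of the cluster, not only the seed."""
    for member in members:
        signal = match_signal(new, member)
        if signal is not None:
            return signal
    return None
```

Because `match_cluster` walks every member, an article can join a cluster through a later member even when it does not resemble the seed article.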
 ## Persistence and migration
 
 The default database path is project-relative: