docs: update README, PROJECT.md, OUTLOOK.md for v0.5.0

- README: compressed from 389→120 lines, removed duplicate sections,
  added content-change detection, three-layer dedup, clustering pipeline
- PROJECT.md: updated architecture to v0.5.0, added schema for
  seen_articles + site_config, clustering thresholds table, content-change
  detection docs
- OUTLOOK.md: updated to v0.5.0, removed fixed design issues,
  added what's new summary
Lukas Goldschmidt 6 days ago
parent
commit
afa3876ad1
3 changed files with 247 additions and 447 deletions
  1. OUTLOOK.md (+40, -36)
  2. PROJECT.md (+98, -53)
  3. README.md (+109, -358)

+ 40 - 36
OUTLOOK.md

@@ -1,61 +1,65 @@
 # News MCP Server — Project Vision & Status
 
-> **Current version: v0.4.0** — see PROJECT.md for architecture details.
+> **Current version: v0.5.0**
 
 ## Core Design Principle
 
 Raw news is useless to agents. **Processed news is powerful.**
 
-- ✅ Clusters are the unit of truth, not raw articles
-- ✅ 100 articles → 5–10 clusters, with entities, sentiment, importance
-- ✅ SQL-level filtering by time, entity, keyword — no full-table JSON parsing
+- Clusters are the unit of truth, not raw articles
+- 100 articles → 5–10 clusters, with entities, sentiment, importance
+- SQL-level filtering by time, entity, keyword — no full-table JSON parsing
+- Three-layer dedup: feed hash → article URL → content hash
 
-## Architecture (v0.4.0)
+## What's new in v0.5.0
 
-See PROJECT.md for full schema and architecture. Key points:
-- `payload_ts` generated column for indexed time-range queries
-- `cluster_entities` and `cluster_keywords` junction tables for O(log n) entity/keyword search
-- MCP tools and Dashboard REST API both query the same SQLite DB
-- Docker deployment on thinkcenter-2 (192.168.0.200:8506)
+### Content-change detection
+Articles that update in-place at the same URL (e.g. FT's "More to come..." → real content) are now detected via `content_hash` comparison in `seen_articles`. Changed articles are re-clustered and re-enriched automatically.
+
+### Three-layer dedup
+1. **Feed hash** — skip entire unchanged feeds (O(1))
+2. **seen_articles** — skip already-processed URLs
+3. **Content hash** — detect in-place updates, re-process changed articles
+
+### Clustering improvements
+- Title threshold lowered: 0.87 → 0.75
+- Dual-signal tier: title ≥ 0.55 + jaccard ≥ 0.25 → merge
+- All thresholds configurable via dashboard Config page
+
+### Dashboard Config page
+- All tunable parameters in one place, grouped by category
+- Inline editing, source tracking (env/api/default)
+- Reset to defaults button
+- REST API: GET `/api/v1/config`, POST `/api/v1/config/update`, POST `/api/v1/config/reset`
+
+### Debug tool
+- `debug_dedup(url, title?)` — MCP tool to inspect dedup decisions, similarity signals
+
+## Architecture
+
+See PROJECT.md for full schema and architecture details.
 
 ## Tool Surface
 
 | Tool | Status | Notes |
 |---|---|---|
 | `get_latest_events` | ✅ | Time-filtered via `payload_ts` SQL index |
-| `get_events_for_entity` | ⚠️ | MCP tool still uses Python-side entity matching (top-N limit). Dashboard uses SQL junction table. Known design flaw. |
+| `get_events_for_entity` | ✅ | SQL junction-table search |
 | `get_event_summary` | ✅ | LLM-written narrative |
-| `detect_emerging_topics` | ✅ | entity/keyword/phrase signal types, velocity scoring |
-| `get_news_sentiment` | ⚠️ | Same Python-side entity matching limitation as `get_events_for_entity` |
+| `detect_emerging_topics` | ✅ | entity/keyword/phrase signal types |
+| `get_news_sentiment` | ✅ | SQL junction-table search |
 | `get_related_recent_entities` | ✅ | Co-occurrence + Google Trends blend |
 | `get_feeds` / `toggle_feed` | ✅ | Feed management |
-| `detect_emerging_topics(around=...)` | ✅ | Scope to entity neighborhood |
+| `debug_dedup` | ✅ | Inspect dedup decisions (new in v0.5.0) |
 
-## Known Design Issues
+## Deployment
 
-### Two Stores (FIXED, May 2026)
-`DashboardStore` was eliminated. All methods moved to `SQLiteClusterStore`. MCP tools now use SQL-level junction-table entity/keyword search via `get_clusters_by_entity_or_keyword()` — no row-limit blind spot.
-
-### MCP Tool Entity Search (FIXED, May 2026)
-`get_events_for_entity` and `get_news_sentiment` now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` with proper SQL-level filtering across the full time window via the `cluster_entities` and `cluster_keywords` junction tables.
-
-## Backfill Scripts
-
-After deploying junction table schema changes:
+Docker on thinkcenter-2 (192.168.0.200:8506):
 ```bash
-docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
+cd ~/news-mcp && git pull && docker-compose up -d news-mcp
 ```
 
-For timestamp normalization (already run on live server):
+After schema changes, run backfill:
 ```bash
-docker exec -it news-mcp python3 scripts/normalize_cluster_timestamps.py
+docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
 ```
-
-## Future Directions (v0.5.0+)
-
-### "Emerging entity graph over time"
-- Collapse `detect_emerging_topics()` results into canonical entity nodes
-- Build weighted edges from co-occurrence in recent clusters
-- Infer communities (story neighborhoods)
-- Track graph evolution across refresh windows (node momentum, edge strength changes)
-- Agent tool: `get_emerging_entity_graph(timeframe, limit)`
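
Review note on the OUTLOOK.md changes: layers 2 and 3 of the three-layer dedup (lookup by URL, then `content_hash` comparison) can be sketched as below. This is a minimal illustration, not the project's code. The `|` separator follows the `SHA1(title|summary)` wording in PROJECT.md, while the UTF-8 encoding, the function names, and the dict standing in for the `seen_articles` table are assumptions.

```python
import hashlib
from typing import Dict


def content_hash(title: str, summary: str) -> str:
    """SHA-1 over 'title|summary'. The UTF-8 encoding is an assumption."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def classify_article(seen: Dict[str, str], article_key: str,
                     title: str, summary: str) -> str:
    """Return 'new', 'unchanged', or 'changed' for one article.

    `seen` maps article_key -> stored content_hash; it is a hypothetical
    in-memory stand-in for the seen_articles table.
    """
    if article_key not in seen:
        return "new"  # layer 2: URL not processed before
    stored = seen[article_key]
    # Pre-migration rows carry an empty hash and are treated as unchanged.
    if stored == "" or stored == content_hash(title, summary):
        return "unchanged"
    return "changed"  # layer 3: same URL, different content


seen = {
    "https://example.com/a": content_hash("Breaking", "More to come..."),
    "https://example.com/legacy": "",
}
print(classify_article(seen, "https://example.com/a", "Breaking", "Full story text"))
```

An article classified as `changed` would then be re-clustered and have its `enriched_at` cleared, as the diff describes; that step is omitted here.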

+ 98 - 53
PROJECT.md

@@ -3,18 +3,23 @@
 ## Goal
 Provide a signal-extraction MCP server that converts RSS into **deduplicated, enriched news clusters** that are easy for agents to use.
 
-## Current architecture (v0.4.0)
+## Current architecture (v0.5.0)
 - FastMCP SSE server mounted at `/mcp`
-- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
-- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
-- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
-- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
+- SQLite cache for clusters + entity metadata + feed state + seen_articles
+- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write.
+- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. SQL-level entity search.
+- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. SQL-level keyword search.
+- **seen_articles** table — `(article_key, cluster_id, content_hash)`, keyed by `article_key`. Per-article dedup with content-change detection.
 - All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
 - **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
-- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
-- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
-- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables
+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 6h).
+- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys.
+- **Three-layer dedup**: feed-level hash (coarse) → seen_articles by URL (fine) → content hash (detects in-place updates).
+- **Dual-signal clustering**: title ≥ 0.75, jaccard ≥ 0.55, or title ≥ 0.55 + jaccard ≥ 0.25 (dual). Embedding cosine ≥ 0.885 when enabled.
+- **Content-change detection**: `seen_articles.content_hash` (SHA-1 of title+summary) detects in-place article updates (e.g. "More to come..." → real content). Changed articles are re-clustered and their `enriched_at` is cleared for re-enrichment.
+- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore + rate limiter.
+- Dashboard with Config page for runtime parameter tuning via `site_config` DB table.
+- `article_identity.py` — single source of truth for `article_key()` and `article_content_hash()`.
 
 ## MCP tools
 - `get_latest_events(topic, limit, include_articles)`
@@ -24,24 +29,33 @@ Provide a signal-extraction MCP server that converts RSS into **deduplicated, en
 - `get_news_sentiment(entity, timeframe)`
 - `get_related_recent_entities(subject, timeframe, limit, include_trends)`
 - `get_feeds()` / `toggle_feed(feed_url, enabled)`
+- `debug_dedup(url, title?)` — inspect dedup status, similarity signals, match decisions
 - `get_capabilities()`
 
 ## REST API
-- `GET /` — server info, tools list
-- `GET /health` — uptime, version hash
-- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
-- `GET /api/v1/entities` — top entities via junction table GROUP BY
-- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
-- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
-- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
-- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
-- `GET /api/v1/cluster/{cluster_id}` — full detail
-- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management
+| Method | Path | Description |
+|--------|------|-------------|
+| GET | `/api/v1/health` | Extended health: stats, freshness, feed state, pruning, seen_article_count |
+| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
+| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
+| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
+| GET | `/api/v1/keywords` | Top keywords by frequency. Params: `hours`, `limit` |
+| GET | `/api/v1/clusters/by-entity` | SQL entity search via junction table |
+| GET | `/api/v1/clusters/by-keyword` | SQL keyword search via junction table |
+| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail |
+| GET | `/api/v1/feeds` | Feed state list |
+| POST | `/api/v1/feeds/toggle` | Enable/disable a feed |
+| GET | `/api/v1/config` | All site config parameters |
+| POST | `/api/v1/config/update` | Update a config parameter at runtime |
+| POST | `/api/v1/config/reset` | Reset all config to .env/defaults |
 
 ## Refresh & caching
 - Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
-- Feed-hash skipping to avoid redundant RSS+LLM work
-- Summary caching for `get_event_summary`
+- **Three-layer dedup**:
+  1. Feed-level content hash — skip entire unchanged feeds (coarse, O(1))
+  2. `seen_articles` by `article_key` (URL) — skip already-processed articles (fine)
+  3. `content_hash` comparison — detect in-place content updates, re-cluster + re-enrich changed articles
+- Enrichment caching via `enriched_at` timestamp in cluster payload
 - Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`
 
 ## Schema (clusters table)
@@ -50,56 +64,87 @@ CREATE TABLE clusters (
     cluster_id TEXT PRIMARY KEY,
     topic TEXT NOT NULL,
     payload TEXT NOT NULL,
-    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
+    updated_at TEXT NOT NULL,
     summary_payload TEXT,
     summary_updated_at TEXT,
-    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
+    payload_ts GENERATED ALWAYS AS
         (json_extract(payload, '$.timestamp')) VIRTUAL
 );
 CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
 
 CREATE TABLE cluster_entities (
     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
-    entity     TEXT NOT NULL,          -- lowercased
+    entity     TEXT NOT NULL,
     PRIMARY KEY (cluster_id, entity)
 );
 CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
 
 CREATE TABLE cluster_keywords (
     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
-    keyword    TEXT NOT NULL,          -- lowercased
+    keyword    TEXT NOT NULL,
     PRIMARY KEY (cluster_id, keyword)
 );
 CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
-```
-
-## Keyword Utilization (done, May 2026)
-Keywords extracted by the LLM are now first-class search signals:
-- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
-- Cluster output includes `keywords[]` field
-- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
-- `_collect_local_related()` counts keyword co-occurrence
-- Dashboard Keywords panel with SQL frequency counts via junction table
-- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
-
-## Two-Store Collapse (done, May 2026)
-
-`DashboardStore` has been eliminated. All of its methods were moved into `SQLiteClusterStore` (the single data access layer), and the REST API routes now use the shared `SQLiteClusterStore` instance directly.
-
-All MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` which searches via junction-table SQL joins — no row-limit blind spot. The `cluster_entities` and `cluster_keywords` junction tables are indexed for O(log n) lookup across any time window.
 
-## Timestamp Pipeline (May 2026)
-1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`.
-2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
-3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
-4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
+CREATE TABLE seen_articles (
+    article_key  TEXT PRIMARY KEY,
+    cluster_id   TEXT NOT NULL,
+    first_seen   TEXT NOT NULL,
+    url          TEXT NOT NULL DEFAULT '',
+    content_hash TEXT NOT NULL DEFAULT ''
+);
 
-## Design Flaw: Two Stores (FIXED, May 2026)
+CREATE TABLE site_config (
+    key         TEXT PRIMARY KEY,
+    value       TEXT NOT NULL,
+    type        TEXT NOT NULL DEFAULT 'str',
+    category    TEXT NOT NULL DEFAULT 'general',
+    description TEXT NOT NULL DEFAULT '',
+    source      TEXT NOT NULL DEFAULT 'default'
+);
+```
 
-**What happened:** `DashboardStore` was a thin read-only query layer that wrapped `SQLiteClusterStore`. The MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) did Python-side entity matching by fetching top-N clusters via `payload_ts` then filtering in Python. Entities in clusters beyond the limit were silently missed.
+## Clustering thresholds (v0.5.0)
+All configurable via `site_config` DB table (dashboard Config page or REST API):
+
+| Parameter | Default | Signal |
+|-----------|---------|--------|
+| `title_threshold` | 0.75 | Min title similarity (SequenceMatcher) |
+| `jaccard_threshold` | 0.55 | Min Jaccard token overlap |
+| `dual_title_floor` | 0.55 | Dual-signal: min title |
+| `dual_jaccard_floor` | 0.25 | Dual-signal: min jaccard |
+| `embedding_similarity_threshold` | 0.885 | Cosine threshold (embeddings enabled) |
+| `cluster_max_age_hours` | 6 | Cross-cycle merge window |
+
+## Content-change detection (v0.5.0)
+1. On each poll, `filter_already_seen()` computes `content_hash = SHA1(title|summary)` for each article
+2. If `article_key` seen but `content_hash` differs → article is "changed"
+3. Changed articles are re-clustered into their existing cluster (same `article_key` → same cluster)
+4. `enriched_at` is cleared in the cluster payload → next enrichment cycle re-processes it
+5. Empty stored hashes (pre-migration rows) are treated as unchanged — hash is populated on next upsert
+
+## Site config (v0.5.0)
+- `site_config` DB table seeded from `.env` on first startup
+- Dashboard Config page: grouped by category (Clustering, Enrichment, Retention)
+- Runtime updates via REST API (`POST /api/v1/config/update`)
+- Reset to defaults via `POST /api/v1/config/reset`
+- Source tracking: `env` (from .env), `api` (runtime update), `default` (built-in)
+
+## Dashboard (v0.5.0)
+- **Health** — stats, freshness, topic distribution, sentiment overview, feed activity
+- **Feeds** — toggle feeds on/off
+- **Clusters** — filterable table, click for drill-down modal
+- **Sentiment** — time-series chart
+- **Entities** — top entities, bar chart, click for matching clusters
+- **Keywords** — top keywords, bar chart, click for matching clusters
+- **Config** — runtime parameter tuning (new in v0.5.0)
+
+## Backfill scripts
+After deploying schema changes:
+```bash
+docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
+```
 
-**Fix applied:** 
-- `DashboardStore` was deleted. All its methods are now in `SQLiteClusterStore`.
-- All MCP tools use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` — SQL-level junction-table search with no row-limit blind spot.
-- The combined method uses `LEFT JOIN` on `cluster_entities` and `cluster_keywords` with `WHERE (ce.entity IN (...) OR ck.keyword IN (...))`, which matches both named entities and thematic keywords across any time window.
-- Exact matching (via `IN`) replaced substring matching — more correct, no false positives from partial name matches.
+## Version history
+- **v0.5.0** (2026-06-03): seen_articles table, content-change detection, dual-signal clustering, site_config DB + dashboard Config page, debug_dedup tool, article_identity module
+- **v0.4.0** (2026-05): junction tables, stable cluster IDs, cross-cycle merge, orphan merge, dashboard

+ 109 - 358
README.md

@@ -17,373 +17,124 @@ Docker Compose:
 docker compose up --build
 ```
 
-Default SSE mount (FastMCP):
-- `http://127.0.0.1:8506/mcp/sse`
+Endpoints:
+- MCP: `http://127.0.0.1:8506/mcp/sse`
+- Health: `http://127.0.0.1:8506/health`
+- Dashboard: `http://127.0.0.1:8506/dashboard/`
 
-Health:
-- `http://127.0.0.1:8506/health`
+## What it does
 
-## What this server provides
-- Fetches from one or more configured news feeds (`NEWS_FEED_URL` / `NEWS_FEED_URLS`)
-- Deduplicates articles into clusters (v1 fuzzy title similarity)
-- Enriches clusters with configurable LLM providers/models (topic/entities/sentiment/keywords)
-- Applies a case-insensitive entity blacklist after extraction
-- Caches clusters + LLM fields in SQLite
-- Resolves entities in-process via Google Trends suggestions (no `trends-mcp` hop required for entity resolution)
+- Fetches from configured news feeds (`NEWS_FEED_URLS`)
+- **Three-layer dedup**: feed hash → article URL → content hash (detects in-place updates)
+- Clusters articles via title similarity (≥0.75), Jaccard (≥0.55), dual-signal, or embeddings
+- Enriches clusters with LLM: topic, entities, sentiment, keywords, summary
+- Resolves entities via Google Trends suggestions
+- Dashboard with runtime Config page
 
 ## Tools (MCP)
 
-1) `get_latest_events(topic, limit, include_articles=false)`
-- `topic` is a coarse category: `crypto | macro | regulation | ai | other`
-- when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
-
-2) `get_events_for_entity(entity, limit, timeframe="24h", include_articles=false)`
-- substring, case-insensitive match over extracted `entities`
-- uses the requested timeframe as the scan window; `limit` is the cap within that window
-- when `include_articles=true`, includes `articles[].url` + minimal fields per returned cluster
-
-3) `get_event_summary(event_id, include_articles=false)`
-- LLM-written compressed narrative for a given `cluster_id`
-- when `include_articles=true`, includes the underlying `articles` list (with `url`) from the stored cluster
-
-4) `detect_emerging_topics(limit)`
-- derives “emerging” signals from recent cached clusters
-
-5) `get_news_sentiment(entity, timeframe)`
-- aggregates sentiment around an entity from cached enriched clusters
-
-6) `get_related_recent_entities(subject, timeframe, limit, include_trends=true)`
-- merges recent co-occurrence data from cached clusters with Google Trends suggestions and returns
-  related entities (with `mid` when available) plus source/score metadata
-
-7) `get_capabilities()`
-- describes the server’s tool surface, composition recipes, and output conventions for agents
-
-### Entity aliasing
-
-The server keeps a conservative alias map in `config/entity_aliases.json` for obvious shorthands
-like `btc -> Bitcoin`, `eth -> Ethereum`, and `ether -> Ethereum`. Keep this map tight; it is meant
-to reduce false misses, not to rewrite every possible name variant.
-
-## Configuration
-
-See `news-mcp/.env`.
-Key variables:
-- `NEWS_EXTRACT_PROVIDER`, `NEWS_EXTRACT_MODEL`
-- `NEWS_SUMMARY_PROVIDER`, `NEWS_SUMMARY_MODEL`
-- `GROQ_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`
-- `ENTITY_BLACKLIST` (comma-separated, case-insensitive patterns; wildcards are supported)
-- `NEWS_PROMPTS_DIR` (override prompt directory)
-- `NEWS_ENTITY_ALIASES_FILE` (override entity alias JSON file)
-- `NEWS_FEED_URL` (single feed fallback)
-- `NEWS_FEED_URLS` (comma-separated feed URLs; overrides `NEWS_FEED_URL`)
-- `NEWS_FEED_ITEMS_PER_POLL` (per-feed fetch cap per poll; default 50)
-- `NEWS_REFRESH_INTERVAL_SECONDS` (default 900)
-- `NEWS_BACKGROUND_REFRESH_ON_START` (default true)
-- `NEWS_BACKGROUND_REFRESH_ENABLED` (default true)
-- `NEWS_DEFAULT_LOOKBACK_HOURS` (freshness window for reads; older rows are ignored by queries)
-- `NEWS_PRUNING_ENABLED` (default true; if false, no rows are physically deleted)
-- `NEWS_RETENTION_DAYS` (physical delete threshold for stored clusters)
-- `NEWS_PRUNE_INTERVAL_HOURS` (how often in-server pruning may run)
-- `ENRICH_OTHER_TOPICS_ONLY` (default false; set true to only LLM-enrich "other" topic clusters)
-- `ENRICHMENT_MAX_PER_REFRESH` (default 0 = no limit; max clusters to LLM-enrich per refresh cycle)
-- `NEWS_LLM_DEBUG` (default false; enable debug logging for LLM calls)
-- `NEWS_LLM_CONCURRENCY_<PROVIDER>` (e.g. `NEWS_LLM_CONCURRENCY_GROQ`; max concurrent outbound LLM calls per provider; overrides the built-in defaults: groq=8, openai=5, openrouter=2)
-- `NEWS_LLM_RATE_LIMIT_<PROVIDER>` (e.g. `NEWS_LLM_RATE_LIMIT_GROQ`; max LLM calls per second per provider. Set to `0` to disable rate limiting. Built-in defaults: groq=1.0, openai=5.0, openrouter=2.0)
-- `NEWS_EMBEDDINGS_ENABLED` (default false; enables Ollama embeddings for clustering)
-- `OLLAMA_BASE_URL` / `OLLAMA_URL` (default `http://127.0.0.1:11434`)
-- `OLLAMA_EMBEDDING_MODEL` (default `nomic-embed-text`)
-- `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` (default `0.885`)
-- `NEWS_CLUSTER_MAX_AGE_HOURS` (default `4`; cross-cycle merge window. Set `0` to disable)
-
-When embeddings are enabled, news-mcp tries Ollama first and falls back to the existing heuristic clustering path if Ollama is unavailable.
-
-## Clustering
-
-The clustering pipeline has two modes:
-
-**In-cycle dedup** (every poll): new articles are compared against each other and against recently loaded existing clusters. A match merges into the existing cluster; no match creates a new cluster.
-
-**Cross-cycle merge** (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`): before clustering, the poller loads recent clusters from the DB and seeds them as merge targets. This means an article that arrives in poll N+1 can merge into a cluster created in poll N, even if the article's title is different enough that it wouldn't match against the cluster's original seed article. Set to `0` to disable.
-
-**Stable cluster IDs**: cluster IDs are derived from the topic and the lexicographically smallest article key in the cluster, not from the first article's title. This means the same set of articles always resolves to the same `cluster_id` regardless of processing order or polling cycle.
-
-**Orphan merge**: a post-clustering pass detects clusters that share article keys (via Union-Find) and merges them. This catches cases where two articles about the same event didn't match during the main loop (e.g. embeddings were temporarily unavailable).
-
-**Signal cascade**: each new article is compared against all articles in a candidate cluster (not just the seed). The matching cascade is: cosine similarity → title similarity → token Jaccard → consensus (cosine + title/jaccard). The first signal that clears its threshold wins.
-
-## Persistence and migration
-
-The default database path is project-relative:
-
-- `NEWS_MCP_DATA_DIR=./data`
-- `NEWS_MCP_DB_PATH=./data/news.sqlite`
-
-That keeps persistence inside the repository tree in both local and Docker runs.
-
-Recommended workflow:
-
-1. Keep **code** in Git.
-2. Keep the **data directory** outside Git but inside the project tree.
-3. Use `rsync` for the initial data transfer to a remote server.
-4. After that, move code with Git and move data only when you actually need a fresh copy.
-
-Example initial transfer:
-
-```bash
-rsync -a ./data/ user@remote:/srv/news-mcp/data/
-```
-
-If you change the location later, override the defaults with:
-
-- `NEWS_MCP_DATA_DIR`
-- `NEWS_MCP_DB_PATH`
-
-## TTL vs pruning
-
-These are intentionally different:
-
-- `NEWS_DEFAULT_LOOKBACK_HOURS` controls **read freshness** only. Older rows remain in SQLite but do not appear in normal "latest" queries.
-- `NEWS_PRUNING_ENABLED` controls whether the server is allowed to **physically delete** old rows.
-- `NEWS_RETENTION_DAYS` controls how old rows may get before they are deleted.
-- `NEWS_PRUNE_INTERVAL_HOURS` controls how often the server checks whether deletion is due.
-
-Pruning is self-contained inside the server:
-- on startup
-- after refresh cycles (prune-if-due)
-
-If `NEWS_PRUNING_ENABLED=false`, no pruning occurs and old rows are retained indefinitely.
-
-## Live extraction smoke test
-
-Run a standardized, fabricated extraction test against the currently configured provider/model:
-
-```bash
-./live_tests.sh
-```
-
-The script reads `./.env`, selects OpenAI or Groq based on the configured keys, and checks that the core expected entities are extracted.
-
-## mcporter examples (all news-mcp calls)
-
-Use your existing config path:
-
-```bash
-CONFIG=/home/lucky/.openclaw/workspace/config/mcporter.json
-```
-
-Inspect server + tools:
-
-```bash
-mcporter --config "$CONFIG" list news --schema
-```
-
-### 1) Latest events
-
-```bash
-mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=10
-mcporter --config "$CONFIG" call news.get_latest_events topic=macro limit=5
-```
-
-### 2) Events for an entity
-
-```bash
-mcporter --config "$CONFIG" call news.get_events_for_entity entity=Bitcoin timeframe=24h limit=10
-mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETH timeframe=3d limit=10
-mcporter --config "$CONFIG" call news.get_events_for_entity entity=ETF timeframe=7d limit=10
-```
-
-### 3) Event summary (by cluster_id)
-
-```bash
-# First fetch an event id
-mcporter --config "$CONFIG" call news.get_latest_events topic=crypto limit=1
-
-# Then summarize it
-mcporter --config "$CONFIG" call news.get_event_summary event_id=<cluster_id>
-```
-
-### 4) Emerging topics
-
-```bash
-mcporter --config "$CONFIG" call news.detect_emerging_topics limit=10
-```
-
-### 5) Sentiment for an entity
-
-```bash
-mcporter --config "$CONFIG" call news.get_news_sentiment entity=Bitcoin timeframe=24h
-mcporter --config "$CONFIG" call news.get_news_sentiment entity=Ethereum timeframe=72h
-```
-
-### 6) Related entities (recent neighborhood + trends blending)
-
-```bash
-# Iran: blend local co-occurrence with Google Trends related topics
-mcporter --config "$CONFIG" call news.get_related_recent_entities subject=Iran timeframe=72h limit=12 include_trends=true
-
-# Another seed phrase
-mcporter --config "$CONFIG" call news.get_related_recent_entities subject="iran war" timeframe=72h limit=12 include_trends=true
-```
-
-### 7) Capabilities / composition guidance
-
-```bash
-mcporter --config "$CONFIG" call news.get_capabilities
-```
-
-Use this when you want the server to explain how to chain the tools together, which fields to keep hidden (e.g. `cluster_id`), and how to present sources/timestamps consistently.
-
-## Blacklist enforcement (optional back-clean)
-
-If you change `ENTITY_BLACKLIST`, existing clusters in `news.sqlite` may still
-contain entities/keywords that would now be filtered at extraction time.
-
-For one-off cleanup, run:
-
-```bash
-./.venv/bin/python scripts/enforce_news_blacklist.py --dry-run --limit 200
-./.venv/bin/python scripts/enforce_news_blacklist.py --limit 1000
-```
-
-This enforces `ENTITY_BLACKLIST` inside stored clusters by removing matching
-entries from `payload.entities` and `payload.keywords` and (if needed) setting
-`payload.topic = "other"`.
-
-## Embeddings backfill (optional)
-
-If `NEWS_EMBEDDINGS_ENABLED=true`, you can precompute cluster embeddings for
-older rows before restarting the server:
-
-```bash
-./.venv/bin/python scripts/backfill_news_embeddings.py --dry-run --limit 200
-./.venv/bin/python scripts/backfill_news_embeddings.py --limit 1000
-```
-
-This stores a cluster-level `embedding` and `embedding_model` inside the SQLite
-payload so the Ollama-first clustering path has data ready to use.
-
-## Embedding merge analysis (optional)
-
-To inspect likely cluster merges at different cosine thresholds without writing
-anything back to the DB:
-
-```bash
-./.venv/bin/python scripts/analyze_cluster_embedding_merges.py --thresholds 0.82 0.85 0.88 --limit 200
-```
-
-This prints candidate pairs per threshold so you can decide whether a merge
-script is worth adding next.
-
-## Embedding merge pass (optional, destructive)
-
-After inspecting the analysis output, you can merge clusters above a chosen
-threshold. Start with dry-run:
-
-```bash
-./.venv/bin/python scripts/merge_cluster_embeddings.py --dry-run --threshold 0.90
-```
-
-If the groupings look right, run wet:
-
-```bash
-./.venv/bin/python scripts/merge_cluster_embeddings.py --threshold 0.90
-```
-
-This merges embedding-similar clusters within the same topic and removes the
-absorbed duplicates from SQLite.
-
-## Article dedup cleanup (optional)
-
-Some stored clusters may contain repeated article entries for the same
-underlying article id / URL path. To clean existing rows:
-
-```bash
-./.venv/bin/python scripts/dedup_articles_in_clusters.py --dry-run
-./.venv/bin/python scripts/dedup_articles_in_clusters.py
-```
-
-The live clustering path also deduplicates article entries when new data comes in.
-
-As of the latest hardening, the server/storage write path also self-heals `payload.articles` by deduplicating before persisting (so historical rows can be fixed via the cleanup script, and future writes won’t reintroduce duplicates). 
-```
-
-## Dashboard (new)
-
-A browser-based monitoring dashboard is available at:
-```
-http://127.0.0.1:8506/dashboard/
-```
-
-**Views:**
-- **Health** — cluster/entity counts, freshness indicator, topic distribution (doughnut chart), sentiment overview, feed activity
-- **Clusters** — filterable/sortable table with topic, sentiment, importance, entity chips; search by keyword
-- **Sentiment** — time-series chart (avg sentiment per configurable time bucket) with cluster count overlay
-- **Entities** — top entities by mention frequency, horizontal bar chart, click for detail
-- **Detail** — click any cluster row or search by cluster ID for full drill-down (summary, key facts, articles, keywords, entities)
-
-**Tech:** Pure static HTML/JS with Chart.js for visualizations. Served from the same FastAPI process at `/dashboard`.
-
-**Configuration defaults:** The dashboard's default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default 144h).
+| Tool | Description |
+|------|-------------|
+| `get_latest_events(topic, limit, include_articles)` | Latest clusters by topic |
+| `get_events_for_entity(entity, limit, timeframe, include_articles)` | Clusters matching entity |
+| `get_event_summary(event_id, include_articles)` | LLM narrative for cluster |
+| `detect_emerging_topics(limit)` | Emerging signals from recent clusters |
+| `get_news_sentiment(entity, timeframe)` | Aggregated sentiment |
+| `get_related_recent_entities(subject, timeframe, limit)` | Co-occurrence + Trends blend |
+| `get_feeds()` / `toggle_feed(url, enabled)` | Feed management |
+| `debug_dedup(url, title?)` | Inspect dedup decisions & similarity signals |
+| `get_capabilities()` | Tool surface documentation |
 
 ## REST API
 
-The following read-only endpoints are available for programmatic access (in addition to the MCP SSE tools):
-
 | Method | Path | Description |
 |--------|------|-------------|
-| GET | `/api/v1/health` | Extended health: cluster/entity counts, freshness, feed state, pruning config |
-| GET | `/api/v1/clusters` | Paginated clusters. Params: `topic`, `hours`, `limit`, `offset` |
-| GET | `/api/v1/sentiment-series` | Sentiment time-series. Params: `topic`, `hours`, `bucket_hours` |
-| GET | `/api/v1/entities` | Top entities by frequency. Params: `hours`, `limit` |
-| GET | `/api/v1/cluster/{cluster_id}` | Full cluster detail with summary, facts, articles |
-
-## Startup behavior
-
-The server uses a lifespan-based startup (FastAPI ≥0.111). Background feed refresh and pruning run as fire-and-forget coroutines, so the HTTP API and dashboard are available immediately — no blocking on the first fetch cycle.
-
-Important env vars controlling background behavior:
-- `NEWS_BACKGROUND_REFRESH_ENABLED=true` — enable/disable background loop
-- `NEWS_BACKGROUND_REFRESH_ON_START=true` — fetch immediately on startup (previously `false`; changed to `true` for faster first data)
-- `NEWS_REFRESH_INTERVAL_SECONDS=900` — polling interval between refresh cycles
-
-## Dashboard (updated)
-
-A browser-based monitoring dashboard is available at:
-```
-http://<your-host>:8506/dashboard/
-```
-
-**Views:**
-| View | What it shows |
-|---|---|
-| Health | Cluster/entity counts, freshness badge, topic doughnut, sentiment overview, feed activity |
-| Clusters | Filterable table — topic, sentiment, importance, entity chips, full-text search |
-| Sentiment | Time-series line chart (avg sentiment per configurable bucket) + cluster count overlay |
-| Entities | Top entities by mention frequency, bar chart, click-through to matching clusters |
-| Detail | Click any cluster row or paste a cluster_id — full drill-down with summary, key facts, articles, keywords, entities |
+| GET | `/api/v1/health` | Stats, freshness, feed state, pruning |
+| GET | `/api/v1/clusters` | Paginated clusters |
+| GET | `/api/v1/sentiment-series` | Sentiment time-series |
+| GET | `/api/v1/entities` | Top entities by frequency |
+| GET | `/api/v1/keywords` | Top keywords by frequency |
+| GET | `/api/v1/clusters/by-entity` | Entity search (SQL) |
+| GET | `/api/v1/clusters/by-keyword` | Keyword search (SQL) |
+| GET | `/api/v1/cluster/{id}` | Full cluster detail |
+| GET | `/api/v1/feeds` | Feed state list |
+| POST | `/api/v1/feeds/toggle` | Enable/disable feed |
+| GET | `/api/v1/config` | All config parameters |
+| POST | `/api/v1/config/update` | Update a parameter |
+| POST | `/api/v1/config/reset` | Reset to defaults |
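A minimal sketch of calling the read-only endpoints from Python, using only the standard library. The query parameters `topic`, `hours`, `limit` and `offset` are taken from the earlier README table and are assumed to still apply in v0.5.0; `clusters_url` and `fetch_json` are hypothetical helper names, not part of the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def clusters_url(base: str, **params) -> str:
    """Build a URL for GET /api/v1/clusters, skipping parameters set to None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base.rstrip('/')}/api/v1/clusters"
    return f"{url}?{query}" if query else url


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body (the response shape is not assumed here)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumes a server listening on port 8506, as in the dashboard URL.
    print(fetch_json(clusters_url("http://127.0.0.1:8506", topic="tech", hours=24)))
```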
 
-**Tech stack:** Vanilla HTML/JS + Chart.js, served as static files from the same FastAPI process at `/dashboard`. No extra dependencies.
-
-The dashboard default lookback window follows `NEWS_DEFAULT_LOOKBACK_HOURS` (configured via `.env`, default: 144h).
-
-## REST API
-
-Five read-only JSON endpoints for programmatic access:
-
-| Method | Path | Params | Returns |
-|--------|------|--------|---------|
-| GET | `/api/v1/health` | — | stats, freshness, feed state, pruning config |
-| GET | `/api/v1/clusters` | `topic`, `hours`, `limit`, `offset` | paginated cluster list |
-| GET | `/api/v1/sentiment-series` | `topic`, `hours`, `bucket_hours` | time-series data for Chart.js |
-| GET | `/api/v1/entities` | `hours`, `limit` | top entities by mention count |
-| GET | `/api/v1/cluster/{id}` | — (path param) | full cluster detail |
-
-## Startup
-
-Uses FastAPI `lifespan` — **HTTP API available in <0.3s** regardless of feed/LLM latency. Background refresh + pruning are fire-and-forget coroutines.
+## Configuration
 
-Key `.env` vars:
-- `NEWS_BACKGROUND_REFRESH_ON_START=true` — fetch immediately on boot
-- `NEWS_BACKGROUND_REFRESH_ENABLED=true` — enable/disable the background loop
-- `NEWS_REFRESH_INTERVAL_SECONDS=900` — polling interval
+All parameters are stored in the `site_config` DB table and are editable via the dashboard Config page.
+On first startup, the table is seeded from `.env`, falling back to built-in defaults.
+
+Key `.env` vars (seeded into site_config):
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `NEWS_FEED_URLS` | — | Comma-separated feed URLs |
+| `NEWS_REFRESH_INTERVAL_SECONDS` | 300 | Polling interval |
+| `NEWS_DEFAULT_LOOKBACK_HOURS` | 24 | Default lookback window for reads |
+| `NEWS_RETENTION_DAYS` | 10 | Prune threshold |
+| `NEWS_PRUNE_INTERVAL_HOURS` | 12 | Prune check interval |
+| `NEWS_CLUSTER_MAX_AGE_HOURS` | 6 | Cross-cycle merge window |
+| `NEWS_EMBEDDINGS_ENABLED` | true | Enable Ollama embeddings |
+| `NEWS_EMBEDDING_SIMILARITY_THRESHOLD` | 0.885 | Cosine threshold |
+| `OLLAMA_BASE_URL` | `http://192.168.0.200:11434` | Ollama API URL |
+| `NEWS_EXTRACT_PROVIDER` / `NEWS_SUMMARY_PROVIDER` | groq | LLM provider |
+| `NEWS_EXTRACT_MODEL` / `NEWS_SUMMARY_MODEL` | llama4-16e | LLM model |
+| `GROQ_API_KEY` / `OPENAI_API_KEY` / `OPENROUTER_API_KEY` | — | API keys |
+| `ENTITY_BLACKLIST` | — | Comma-separated entity patterns |
+| `ENRICHMENT_MAX_PER_REFRESH` | 0 (unlimited) | LLM enrichments per cycle |
+
+Clustering thresholds (also in site_config):
+- `title_threshold`: 0.75
+- `jaccard_threshold`: 0.55
+- `dual_title_floor`: 0.55
+- `dual_jaccard_floor`: 0.25
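The README does not spell out how these four thresholds combine, so the following is only one plausible reading: either signal alone can trigger a merge at its full threshold, and the two lower "floors" apply when both signals agree. The similarity functions (difflib ratio, whitespace-token Jaccard) and the names `Thresholds`, `title_similarity`, `jaccard` and `should_merge` are assumptions for illustration, and the embedding signal is left out.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the list above.
    title_threshold: float = 0.75
    jaccard_threshold: float = 0.55
    dual_title_floor: float = 0.55
    dual_jaccard_floor: float = 0.25


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def should_merge(title_sim: float, jac: float, t: Thresholds = Thresholds()) -> bool:
    """Assumed rule: one strong signal, or two moderate signals together."""
    if title_sim >= t.title_threshold or jac >= t.jaccard_threshold:
        return True
    return title_sim >= t.dual_title_floor and jac >= t.dual_jaccard_floor
```

Under this reading, a pair with title similarity 0.60 and Jaccard 0.30 merges through the dual-signal path even though neither value reaches its standalone threshold.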
+
+## Clustering pipeline
+
+1. **Fetch** all feeds concurrently
+2. **Feed hash** — skip unchanged feeds entirely
+3. **Retention filter** — drop articles older than `NEWS_RETENTION_DAYS`
+4. **Seen articles** — `filter_already_seen()` splits into:
+   - `new` → never seen, full processing
+   - `unchanged` → same URL, same content hash → skip
+   - `changed` → same URL, different content hash → re-cluster + re-enrich
+5. **Cluster** — title similarity, Jaccard, embeddings, dual-signal merge
+6. **Enrich** — LLM extraction, summarization, sentiment
+7. **Prune** — delete clusters older than retention window
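Step 4 can be sketched as a three-way split. The real signature of `filter_already_seen()` is not shown here, so `split_by_seen`, `Article` and `SeenSplit` are hypothetical names, and a plain dict of URL to stored content hash stands in for the `seen_articles` table.

```python
from typing import NamedTuple


class Article(NamedTuple):
    url: str
    content_hash: str


class SeenSplit(NamedTuple):
    new: list[Article]
    unchanged: list[Article]
    changed: list[Article]


def split_by_seen(articles: list[Article], seen: dict[str, str]) -> SeenSplit:
    """Classify fetched articles against previously stored hashes."""
    split = SeenSplit(new=[], unchanged=[], changed=[])
    for article in articles:
        stored = seen.get(article.url)
        if stored is None:
            split.new.append(article)        # never seen: full processing
        elif stored == article.content_hash:
            split.unchanged.append(article)  # same URL, same hash: skip
        else:
            split.changed.append(article)    # same URL, new hash: redo
    return split
```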
+
+## Content-change detection
+
+When an article is updated in-place at the same URL (e.g. FT's "More to come..." → real content):
+1. `content_hash = SHA1(title|summary)` is computed
+2. Compared against `seen_articles.content_hash`
+3. If different → article is re-clustered into its existing cluster
+4. `enriched_at` is cleared → next cycle re-enriches with updated content
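Steps 1 and 2 can be sketched as follows. This assumes the `|` in `SHA1(title|summary)` is a literal separator, that the text is UTF-8 encoded, and that the hash is stored as a hex digest; none of these details are confirmed here, and `content_hash` and `has_changed` are hypothetical function names.

```python
import hashlib


def content_hash(title: str, summary: str) -> str:
    """SHA-1 over 'title|summary' (separator and encoding are assumptions)."""
    return hashlib.sha1(f"{title}|{summary}".encode("utf-8")).hexdigest()


def has_changed(stored_hash: str, title: str, summary: str) -> bool:
    """True when the current content no longer matches the stored hash."""
    return content_hash(title, summary) != stored_hash


# A placeholder summary later replaced by real content yields a different hash.
before = content_hash("Markets fall", "More to come...")
print(has_changed(before, "Markets fall", "Stocks slid after the rate decision."))
```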
+
+## Persistence
+
+- SQLite at `./data/news.sqlite` (local) or `/app/data/news.sqlite` (Docker)
+- Schema auto-migrates on startup (ALTER TABLE for new columns)
+- Backfill script for seeding `seen_articles` from existing clusters:
+  ```bash
+  docker exec -it news-mcp python3 scripts/backfill_seen_articles.py
+  ```
+
+## Dashboard
+
+`http://<host>:8506/dashboard/`
+
+- **Health** — stats, charts, feed status
+- **Feeds** — toggle on/off
+- **Clusters** — filterable table, click for drill-down modal
+- **Sentiment** — time-series chart
+- **Entities** — top entities, frequency chart
+- **Keywords** — top keywords, frequency chart
+- **Config** — runtime parameter tuning (new in v0.5.0)
+
+## Version
+
+Run `./version-hash.sh` to get the current content hash.