
docs: cleanup obsolete content, document design flaw

- PROJECT.md: complete rewrite. Removed duplicate Dashboard sections,
  stale v0.2.x architecture block, outdated timestamp contract details.
  Added current schema, query architecture section, and design flaw note
  about the two-store duplication problem.
- AGENTS.md: updated Repo Map with junction table context, replaced
  timestamp contract with payload_ts SQL rule, added Design Flaw note.
- OUTLOOK.md: replaced 544-line requirements wish-list with current
  v0.4.0 status, known issues, and future directions.
- POLLER_UPGRADE_PLAN.md: condensed, marked as not yet implemented.
Lukas Goldschmidt 1 week ago
parent
commit
e3d27d9fd1
4 changed files with 146 additions and 789 deletions
  1. AGENTS.md (+28 -10)
  2. OUTLOOK.md (+43 -526)
  3. POLLER_UPGRADE_PLAN.md (+11 -14)
  4. PROJECT.md (+64 -239)

+ 28 - 10
AGENTS.md

@@ -29,14 +29,33 @@ This project spans two machines. **Always check which machine you're operating o
 - The local dev copy has its own separate DB — treat it as empty/stale unless explicitly working with it.
 
 ## Repo Map
-- `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, and HTTP health endpoints.
+- `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, HTTP health endpoints, REST API.
 - `news_mcp/jobs/poller.py`: feed refresh loop, clustering, enrichment, and cache writes.
-- `news_mcp/storage/sqlite_store.py`: SQLite schema, cluster/entity metadata, feed hashes, and prune state.
-- `news_mcp/dedup/cluster.py`: topic bucketing and the current fuzzy/embedding clustering path.
+- `news_mcp/storage/sqlite_store.py`: SQLite schema (payload_ts, junction tables), upsert with junction population, SQL-level read methods. **Single data access layer for MCP tools.**
+- `news_mcp/dashboard/dashboard_store.py`: Read-only query layer for dashboard REST API. Wraps `SQLiteClusterStore`. Added junction-table entity/keyword search. **NOTE: this store duplicates methods from sqlite_store — see Design Flaw in PROJECT.md.**
+- `news_mcp/dedup/cluster.py`: topic bucketing and fuzzy/embedding clustering.
 - `news_mcp/enrichment/llm_enrich.py`: LLM extraction/summarization and blacklist filtering.
-- `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: local Google Trends-based entity resolution and neighborhood lookup.
+- `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: entity resolution and neighborhood lookup.
 - `news_mcp/config.py`: env-driven defaults and file paths.
 
+## Query Architecture (READ THIS BEFORE ADDING NEW QUERIES)
+
+**Time filtering:** Always use `payload_ts >= ?` SQL filter. Never parse JSON timestamps in Python for time ranges.
+
+**Entity/keyword search:** Use junction tables:
+- `cluster_entities` for entity search: `JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id WHERE ce.entity = ?`
+- `cluster_keywords` for keyword search: `JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id WHERE ck.keyword = ?`
+- Do NOT fetch all clusters and filter entities in Python.
+
+**Backfill:** After schema changes, run `scripts/backfill_junction_tables.py` in the Docker container:
+```
+docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
+```
+
+## Design Flaw: Two Stores
+
+`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools (`get_events_for_entity`, `get_news_sentiment`) still use `SQLiteClusterStore` Python-side entity matching with a row limit (top 200), missing entities in older clusters. See PROJECT.md for full analysis and proposed fix.
+
 ## Docker / Live Server Details
 - `docker-compose.yml` mounts `./:/app` with `working_dir: /app`
 - Data dir and DB path both hardcoded in docker-compose env: `NEWS_MCP_DB_PATH: ./data/news.sqlite`
@@ -52,12 +71,11 @@ This project spans two machines. **Always check which machine you're operating o
 - `include_articles=true` should keep responses compact and only return minimal article fields.
 - Timestamps in cluster payloads are normalized to ISO 8601 UTC (`YYYY-MM-DDTHH:MM:SS+00:00`) at write time in `sanitize_cluster_payload()`.
 
-## Timestamp Contract (READ THIS BEFORE TOUCHING ANY TIMESTAMP CODE)
-- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **guaranteed** `YYYY-MM-DDTHH:MM:SS+00:00` for every row written after the normalization migration (backfill script was run on the live server).
-- **Read paths**: use `_read_ts()` from `news_mcp.storage.sqlite_store`, or `datetime.fromisoformat()` directly. That is all that is needed.
-- **Never** add `parsedate_to_datetime` / RFC 2822 fallbacks to a read path. If `_read_ts` returns None on a stored timestamp, the bug is in the write path — fix `sanitize_cluster_payload()`, don't paper over it.
-- `parsedate_to_datetime` is intentionally retained **only** in `sqlite_store._normalize_ts()` (write path) and `dedup/cluster.py` (raw ingest before normalization). Nowhere else.
-- **Never query the dev DB** (`news_mcp/data/news.sqlite` on latitude) to check live data. It is empty/stale. The live DB is on thinkcenter-2 in Docker at `/app/data/news.sqlite`.
+## Timestamp Contract
+- `payload_ts` SQL column (VIRTUAL GENERATED) is the ONLY way to filter by event time. Use `WHERE payload_ts >= ?` in SQL. Never parse JSON timestamps in Python for time ranges.
+- `payload.timestamp` in JSON is guaranteed `YYYY-MM-DDTHH:MM:SS+00:00` at write time (enforced by `sanitize_cluster_payload()`).
+- `updated_at` in the DB = row modification time, NOT event time. Never use for time-range queries.
+- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data.
 
 ## Editing Rules
 - Keep changes aligned with the docs in `README.md`, `PROJECT.md`, and `OUTLOOK.md`.
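
Illustration for the Query Architecture and Design Flaw notes in the AGENTS.md hunk above (not part of the commit). This is a minimal sketch, assuming a simplified schema: the table layout, the sample rows, and the helper names `by_entity_python_side` and `by_entity_sql` are hypothetical, and `payload_ts` is a plain column here rather than the generated column the real schema uses. It shows why a top-N fetch followed by Python-side matching can miss an entity that a junction-table join still finds.

```python
import json
import sqlite3

# Hypothetical minimal schema; the real tables in sqlite_store.py may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clusters (
        cluster_id TEXT PRIMARY KEY,
        payload    TEXT NOT NULL,
        payload_ts TEXT  -- plain column in this sketch, generated in the real schema
    );
    CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
    CREATE TABLE cluster_entities (
        cluster_id TEXT NOT NULL,
        entity     TEXT NOT NULL,
        PRIMARY KEY (cluster_id, entity)
    );
    CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
    """
)

# Invented sample data: only the oldest cluster mentions "BTC".
rows = [
    ("c1", "2026-05-01T08:00:00+00:00", ["BTC", "SEC"]),
    ("c2", "2026-05-02T08:00:00+00:00", ["ETH"]),
    ("c3", "2026-05-03T08:00:00+00:00", ["Iran"]),
]
for cluster_id, ts, entities in rows:
    payload = json.dumps({"timestamp": ts, "entities": entities})
    conn.execute("INSERT INTO clusters VALUES (?, ?, ?)", (cluster_id, payload, ts))
    conn.executemany(
        "INSERT INTO cluster_entities VALUES (?, ?)",
        [(cluster_id, entity) for entity in entities],
    )


def by_entity_python_side(entity, since, top_n):
    """Anti-pattern: fetch the newest top_n rows, then match entities in Python."""
    fetched = conn.execute(
        "SELECT cluster_id, payload FROM clusters WHERE payload_ts >= ? "
        "ORDER BY payload_ts DESC LIMIT ?",
        (since, top_n),
    ).fetchall()
    return [cid for cid, payload in fetched if entity in json.loads(payload)["entities"]]


def by_entity_sql(entity, since):
    """Preferred: filter through the junction table and payload_ts in SQL."""
    fetched = conn.execute(
        "SELECT c.cluster_id FROM clusters c "
        "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
        "WHERE ce.entity = ? AND c.payload_ts >= ? "
        "ORDER BY c.payload_ts DESC",
        (entity, since),
    ).fetchall()
    return [cid for (cid,) in fetched]


since = "2026-05-01T00:00:00+00:00"
python_side = by_entity_python_side("BTC", since, top_n=2)  # c1 falls outside the top 2
sql_side = by_entity_sql("BTC", since)                      # c1 is found via the join
```

The string comparison on `payload_ts` is only safe because every timestamp shares the same `+00:00` ISO 8601 shape, which is the guarantee the Timestamp Contract describes.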

+ 43 - 526
OUTLOOK.md

@@ -1,544 +1,61 @@
+# News MCP Server — Project Vision & Status
 
-# 📰 News MCP Server — Requirements Spec
+> **Current version: v0.4.0** — see PROJECT.md for architecture details.
 
-> **Current version: v0.3.1** — see [RELEASE_NOTES.md](RELEASE_NOTES.md) for changelog.
+## Core Design Principle
 
-## 🎯 Goal
+Raw news is useless to agents. **Processed news is powerful.**
 
-Provide **structured, deduplicated, topic-aware news signals**
-that an agent can use for reasoning about:
+- ✅ Clusters are the unit of truth, not raw articles
+- ✅ 100 articles → 5–10 clusters, with entities, sentiment, importance
+- ✅ SQL-level filtering by time, entity, keyword — no full-table JSON parsing
 
-* events
-* narratives
-* sentiment shifts
+## Architecture (v0.4.0)
 
-👉 Not a feed reader
-👉 Not a headline dump
-👉 A **signal extraction layer**
+See PROJECT.md for full schema and architecture. Key points:
+- `payload_ts` generated column for indexed time-range queries
+- `cluster_entities` and `cluster_keywords` junction tables for O(log n) entity/keyword search
+- MCP tools and Dashboard REST API both query the same SQLite DB
+- Docker deployment on thinkcenter-2 (192.168.0.200:8506)
 
----
+## Tool Surface
 
-# 🧠 Core Design Principle
+| Tool | Status | Notes |
+|---|---|---|
+| `get_latest_events` | ✅ | Time-filtered via `payload_ts` SQL index |
+| `get_events_for_entity` | ⚠️ | MCP tool still uses Python-side entity matching (top-N limit). Dashboard uses SQL junction table. Known design flaw. |
+| `get_event_summary` | ✅ | LLM-written narrative |
+| `detect_emerging_topics` | ✅ | entity/keyword/phrase signal types, velocity scoring |
+| `get_news_sentiment` | ⚠️ | Same Python-side entity matching limitation as `get_events_for_entity` |
+| `get_related_recent_entities` | ✅ | Co-occurrence + Google Trends blend |
+| `get_feeds` / `toggle_feed` | ✅ | Feed management |
+| `detect_emerging_topics(around=...)` | ✅ | Scope to entity neighborhood |
 
-> Raw news is useless to agents.
-> **Processed news is powerful.**
+## Known Design Issues
 
----
+### Two Stores (see PROJECT.md § "Design Flaw")
+`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools still use Python-side entity matching with a row limit. Proposed fix: collapse into single data access layer.
 
-# 🏗️ 1. Internal Architecture
+### MCP Tool Entity Search
+`get_events_for_entity` and `get_news_sentiment` fetch top-N clusters by time then filter entities in Python. Entities in clusters beyond the limit are missed. Fix: use junction table `get_clusters_by_entity()`.
 
-## 🧩 Data Sources Layer (`sources/`)
+## Backfill Scripts
 
-Mix of:
-
-* RSS feeds (primary)
-* optional APIs later
-
-Examples:
-
-* Reuters
-* Bloomberg
-* CoinDesk
-
-### Responsibilities:
-
-* fetch articles
-* normalize format
-
----
-
-## 🔄 Ingestion Pipeline
-
-Runs periodically (e.g. every few minutes)
-
-Steps:
-
-1. fetch articles
-2. normalize fields:
-
-   * title
-   * url
-   * source
-   * timestamp
-   * summary (if available)
-
----
-
-## 🧹 Deduplication Layer
-
-### Problem:
-
-Same story appears across many sources.
-
-### Solution:
-
-Cluster articles by similarity:
-
-Methods:
-
-* title similarity (fuzzy match / embeddings)
-* URL canonicalization
-* content similarity (optional later)
-
-### Output:
-
-```json id="cluster"
-{
-  "cluster_id": "...",
-  "headline": "Canonical headline",
-  "articles": [...],
-  "sources": ["Reuters", "Bloomberg"],
-  "first_seen": "...",
-  "last_updated": "..."
-}
-```
-
-👉 This is your **core unit of truth**, not individual articles
-
----
-
-## 🧠 Enrichment Layer
-
-Adds meaning to clusters.
-
-### 1. Entity extraction
-
-* assets (BTC, ETH)
-* companies
-* macro topics (inflation, rates)
-
----
-
-### 2. Topic classification
-
-Examples:
-
-* crypto
-* macro
-* regulation
-* AI
-
----
-
-### 3. Sentiment (lightweight)
-
-* positive / negative / neutral
-* or simple score
-
-👉 Keep this simple in v1 (don’t over-engineer NLP)
-
----
-
-### 4. Importance scoring (VERY useful)
-
-Heuristic:
-
-* number of sources covering it
-* recency
-* source credibility
-* keyword weighting
-
----
-
-## 🗃️ Storage Layer
-
-You need short-term memory:
-
-* clusters (not raw articles)
-* TTL: e.g. 24–72h
-
-Optional:
-
-* in-memory store (start)
-* later: DB
-
-we have a choice of storage possibilites including qdrant, postgresql, couchdb
-
----
-
-# 🧰 2. Agent-Facing Tools (IMPORTANT)
-
-Keep tools **high-level and semantic**
-
----
-
-## 1. `get_latest_events`
-
-> “What is happening right now?”
-
-Input:
-
-```json id="n1"
-{
-  "topic": "crypto",
-  "limit": 5,
-  "include_articles": false
-}
-```
-
-Output:
-
-```json id="n2"
-[
-  {
-    "headline": "...",
-    "summary": "...",
-    "entities": ["BTC"],
-    "sentiment": "positive",
-    "importance": 0.82,
-    "sources": ["Reuters", "CoinDesk"],
-    "timestamp": "...",
-    "articles": [
-      {
-        "title": "...",
-        "url": "...",
-        "source": "Reuters",
-        "timestamp": "..."
-      }
-    ]
-  }
-]
-```
-
----
-
-## 2. `get_events_for_entity`
-
-> “What’s happening with X?”
-
-```json id="n3"
-{
-  "entity": "BTC",
-  "include_articles": false
-}
-```
-
-👉 filters clusters by entity
-
-Optional:
-
-* `include_articles` to include article title/url/source/timestamp in the payload
-
----
-
-## 3. `get_event_summary`
-
-> “Explain this event clearly”
-
-```json id="n4"
-{
-  "event_id": "cluster_id",
-  "include_articles": false
-}
-```
-
-Output:
-
-* merged summary
-* key facts
-* sources
-* optional articles (title/url/source/timestamp)
-
-👉 This is where you compress multiple articles into one clean narrative
-
----
-
-## 4. `get_news_sentiment`
-
-> “What’s the tone around X?”
-
-```json id="n5"
-{
-  "entity": "BTC",
-  "timeframe": "24h"
-}
+After deploying junction table schema changes:
+```bash
+docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
 ```
 
-Output:
-
-```json id="n6"
-{
-  "sentiment": "positive",
-  "score": 0.64,
-  "article_count": 42
-}
+For timestamp normalization (already run on live server):
+```bash
+docker exec -it news-mcp python3 scripts/normalize_cluster_timestamps.py
 ```
 
----
-
-## 5. `detect_emerging_topics` (very valuable)
-
-> “What is gaining attention?”
-
-Output:
-
-```json id="n7"
-[
-  {
-    "topic": "Ethereum ETF",
-    "trend_score": 0.91,
-    "related_entities": ["ETH", "BlackRock", "SEC"],
-    "count": 8,
-    "avg_importance": 0.17
-  }
-]
-```
-
----
-
-## 6. `get_related_entities`
-
-> “What entities tend to appear with X?”
-
-```json id="n8"
-{
-  "subject": "Iran",
-  "timeframe": "24h",
-  "limit": 10
-}
-```
-
-Output:
-
-```json id="n9"
-[
-  {
-    "entity": "United States",
-    "count": 5,
-    "avg_importance": 0.11,
-    "sentiment": "negative",
-    "score": -0.2
-  }
-]
-```
-
-👉 entity-only co-occurrence neighborhood for real-time sense-making
-
----
-
-# ⚠️ 3. What NOT to expose
-
-Avoid:
-
-* raw RSS feeds
-* individual article endpoints
-* unprocessed headlines
-
-❌ Bad:
-
-```id="bad-news"
-get_raw_articles()
-```
-
-👉 This destroys signal quality for agents
-
----
-
-# 🔁 4. Caching & Freshness Strategy
-
-## Key difference from crypto:
-
-* News is **append-only + evolving**
-* Not real-time tick data
-
----
-
-## Strategy:
-
-### Fetch layer:
-
-* poll every few minutes
-
-### Cluster layer:
-
-* update clusters incrementally
-
-### Tool responses:
-
-* no heavy recomputation
-* serve from processed store
-
----
-
-# 🧠 5. Deduplication Strategy (critical)
-
-Clustering is the unit of truth, not individual articles.
-
-**Signal cascade** (cheapest first, short-circuit on match):
-1. Cosine similarity (if embeddings enabled) against cluster centroid
-2. Fuzzy title similarity (SequenceMatcher, configurable threshold, default 0.87)
-3. Token Jaccard over headline+summary (default threshold 0.55)
-4. Consensus: cosine ≥ 0.80 AND (jaccard ≥ 0.30 OR title ≥ 0.55)
-
-Each new article is compared against **all** articles in a candidate cluster; the best signal across all members is used.
-
-**Stable cluster IDs**: `sha1(topic | min_article_key)` — the same set of articles always maps to the same ID regardless of which article arrived first or which polling cycle created the cluster.
-
-**Cross-cycle merge**: the poller loads recent clusters from the DB (controlled by `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h) and seeds them as merge targets before clustering. New articles can merge into clusters from previous polling cycles.
-
-**Orphan merge**: a post-clustering Union-Find pass merges clusters that share article keys, catching cases where articles about the same event didn't match during the main loop.
-
-Planned runtime order:
-* when `NEWS_EMBEDDINGS_ENABLED=true`, try Ollama embeddings first
-* if Ollama fails, fall back to the existing heuristic cluster path
-* keep candidate pre-filtering cheap before any vector compare
-
----
-
-# ⚡ 6. Signal Quality Rules
-
-Your MCP should:
-
-### ✅ Do:
-
-* reduce 100 articles → 5–10 clusters
-* highlight consensus
-* surface importance
-
-### ❌ Don’t:
-
-* overwhelm agent with volume
-* pass conflicting duplicates
-* expose noise
-
----
-
-# 🧩 7. Relationship to Other MCPs
-
-This MCP becomes powerful when combined with:
-
-* crypto MCP → price
-* trends MCP → attention
-
-👉 News MCP provides:
-
-> **causal narratives**
-
----
-
-# 🧭 8. Design Philosophy
-
-Each tool should answer:
-
-> “What is happening, and why should I care?”
-
----
-
-# 🚀 9. Suggested Build Order
-
-1. RSS ingestion
-2. normalization
-3. basic deduplication
-4. clustering
-5. simple summarization
-6. entity tagging
-
-👉 Only then expose tools
-
----
-
-# 🧠 Final takeaway
-
-> Crypto MCP gives you **facts**
-> News MCP gives you **meaning**
-
-But only if you:
-
-* aggressively deduplicate
-* cluster events
-* compress information
-
----
-
-# ✅ Completed since this outlook was written
-
-* v0.1.0 released and tagged
-* provider-agnostic LLM extraction/summarization layer added
-* prompts moved into separate files for easier updates
-* entity blacklist implemented and made case-insensitive
-* wildcard blacklist support added for entities/topics/keywords
-* live extraction smoke test added
-* JSON-backed alias map added for query normalization
-* query normalization added so shorthand like `btc` and `trump` still works
-* docs updated with the new env vars and workflow
-* optional article payloads added to event tools
-* blacklist enforcement maintenance script added
-* related-entities tool added for co-occurrence neighborhoods
-* emerging-topic scoring improved with importance-weighting and co-occurrence
-* concurrent RSS/OLLAMA/LLM pipelines added (v0.3.0)
-* stable cluster IDs, cross-cycle merge, orphan dedup, multi-article signal comparison added (v0.3.1)
-
----
-
-# 🔭 Next high-level steps
-
-## What is left of v0.1.0
-
-The first version is now effectively a usable baseline. The remaining work for v0.1.x is mostly polish:
-
-* stabilize extraction quality across a few more real-world samples
-* expand the alias map only where usage demands it
-* tune emerging-topic noise so repeated source names do not dominate
-* keep sentiment labels aligned with scores as the model improves
-
-## Where v0.2.0 should lead
-
-### Future plan (worth building slowly): “Emerging entity graph over time”
-Right now `detect_emerging_topics()` returns a flat list of emerging *topics/entities*.
-Next-level idea: turn it into an **entity graph** that an agent can reason over.
-
-**Core concept**
-- Collapse/group results into canonical entity nodes (e.g. `iran`, `israel`, `donald_trump`, `strait_of_hormuz`, etc.)
-- Build weighted edges from co-occurrence in recent clusters:
-  - edge weight ~ frequency/co-occurrence strength
-  - node weight ~ trend_score + count (+ optional avg_importance)
-- Infer communities (graph grouping) so related nodes form stable “story neighborhoods”
-
-**Over time (the important part)**
-- Each refresh window produces a snapshot of the graph
-- Store snapshots / deltas to observe:
-  - rising/falling node weights (“momentum”)
-  - strengthening/weaker relations
-  - emerging communities and topic shifts
-
-**Suggested output for an eventual agent tool**
-- `get_emerging_entity_graph(timeframe, limit)` returning:
-  - grouped communities
-  - top nodes + weights
-  - top relations + direction (optional)
-  - summary of “what changed since last snapshot”
-
-This needs extra time to become a real usable MCP tool, so it’s intentionally captured here for later execution.
-
-
-1. **Normalization layer**
-
-   * canonicalize acronyms and entity variants before storage / querying
-   * keep the blacklist as a separate post-processing rule
-
-2. **Wildcard blacklist support**
-
-   * allow patterns for entities / topics / keywords
-   * keep matching case-insensitive
-
-3. **Emerging signal quality**
-
-   * tune what counts as an emerging topic/entity
-   * reduce noise from repeated source names and generic terms
-
-4. **Entity/time tracking and replay (future capability)**
-
-   * track how important entities evolve over time
-   * allow replay of when entities first appeared, how topics shifted, and how sentiment changed
-   * useful later for narrative reconstruction and trend timelines
-
-## Longer-term direction
-
-The endgame is not just “news search”, but a light narrative memory system:
-
-* entity histories over time
-* topic shifts and turning points
-* sentiment arcs
-* replayable timelines for a person, company, or event
+## Future Directions (v0.5.0+)
 
-That should stay in mind while keeping the current implementation simple.
+### "Emerging entity graph over time"
+- Collapse `detect_emerging_topics()` results into canonical entity nodes
+- Build weighted edges from co-occurrence in recent clusters
+- Infer communities (story neighborhoods)
+- Track graph evolution across refresh windows (node momentum, edge strength changes)
+- Agent tool: `get_emerging_entity_graph(timeframe, limit)`
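
Illustration for the "Emerging entity graph over time" bullets in the OUTLOOK.md hunk above (not part of the commit). The commit only names the idea, so this is a rough sketch under stated assumptions: each cluster is reduced to a list of entity strings, edge weight is a raw co-occurrence count, and the function names `cooccurrence_edges` and `edge_deltas` are hypothetical. It shows one way weighted edges and snapshot-to-snapshot changes could be computed.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(clusters):
    """Count how often each unordered entity pair appears in the same cluster."""
    edges = Counter()
    for entities in clusters:
        # sorted(set(...)) gives every pair one canonical order and ignores repeats
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def edge_deltas(previous, current):
    """Edge-weight change between two snapshots (positive means strengthening)."""
    keys = set(previous) | set(current)
    return {key: current.get(key, 0) - previous.get(key, 0) for key in keys}


# Invented sample data for two consecutive refresh windows.
earlier = [["Iran", "United States"], ["ETH", "SEC"]]
recent = [
    ["Iran", "Israel", "United States"],
    ["Iran", "United States"],
    ["ETH", "SEC"],
]

edges = cooccurrence_edges(recent)
deltas = edge_deltas(cooccurrence_edges(earlier), edges)
```

Community inference and node weights (trend score, count) are left out; they would build on the same edge table.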

+ 11 - 14
POLLER_UPGRADE_PLAN.md

@@ -1,25 +1,22 @@
 # Poller Upgrade Plan
 
+**Status:** Not yet implemented.
+
 ## Goal
-Remove the poller's direct dependency on `SQLiteClusterStore._conn()` and replace it with a public store method so tests and future store implementations do not need to model a private connection helper.
+Remove the poller's direct dependency on `SQLiteClusterStore._conn()` and replace it with a public store method.
 
 ## Current Problem
-- `news_mcp/jobs/poller.py` clears legacy `feed_state` rows with `with store._conn() as conn:`.
+- `news_mcp/jobs/poller.py:173` clears legacy `feed_state` rows with `with store._conn() as conn:`.
 - That couples the refresh loop to a private SQLite implementation detail.
-- Test doubles now need to expose `_conn()`, which is a sign the contract is too low-level.
+- Test doubles need to expose `_conn()`.
 
 ## Proposed Refactor
-1. Add a public method to `SQLiteClusterStore` for the legacy cleanup step.
+1. Add a public `clear_legacy_feed_state()` method to `SQLiteClusterStore`.
 2. Move the `DELETE FROM feed_state WHERE feed_key LIKE 'newsfeeds:%'` logic into that method.
-3. Update `news_mcp/jobs/poller.py` to call the public method instead of `_conn()`.
-4. Adjust tests to mock the public method, not a private connection handle.
-
-## Verification
-- Re-run the poller-focused tests.
-- Run the repo test script if the change stays small enough to keep coverage cheap.
-- Confirm no other code paths still depend on `store._conn()` outside the store implementation itself.
+3. Update `poller.py` to call the public method.
+4. Adjust tests accordingly.
 
-## Notes
+## Rules
 - Keep the change narrow.
-- Do not alter the feed-hash or clustering behavior in the same patch.
-- Preserve the current legacy-row cleanup behavior exactly; only the access path should change.
+- Do not alter feed-hash or clustering behavior in the same patch.
+- Preserve current legacy-row cleanup behavior exactly.
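
Illustration for the Proposed Refactor in the POLLER_UPGRADE_PLAN.md hunk above (not part of the commit, and the plan itself is marked as not yet implemented). This is a sketch only: `SketchStore` is a hypothetical stand-in for `SQLiteClusterStore`, the `feed_state` columns and the `remember_feed` helper are assumptions, and only the `DELETE` statement and the method name `clear_legacy_feed_state()` come from the plan.

```python
import sqlite3
from contextlib import contextmanager


class SketchStore:
    """Hypothetical stand-in for SQLiteClusterStore; only what the plan touches."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        # Assumed table shape; the real feed_state schema may differ.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS feed_state ("
            "feed_key TEXT PRIMARY KEY, feed_hash TEXT)"
        )

    @contextmanager
    def _conn(self):
        # Private helper: commit on success, roll back on error.
        try:
            yield self._db
            self._db.commit()
        except Exception:
            self._db.rollback()
            raise

    def remember_feed(self, feed_key, feed_hash):
        """Hypothetical writer used here only to seed sample rows."""
        with self._conn() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO feed_state VALUES (?, ?)",
                (feed_key, feed_hash),
            )

    def clear_legacy_feed_state(self):
        """Public replacement for the poller's direct _conn() use.

        Deletes legacy 'newsfeeds:%' rows and returns how many were removed.
        """
        with self._conn() as conn:
            cur = conn.execute(
                "DELETE FROM feed_state WHERE feed_key LIKE 'newsfeeds:%'"
            )
            return cur.rowcount

    def feed_keys(self):
        """Return all remaining feed keys in sorted order."""
        with self._conn() as conn:
            rows = conn.execute("SELECT feed_key FROM feed_state ORDER BY feed_key")
            return [key for (key,) in rows]


store = SketchStore()
store.remember_feed("newsfeeds:a", "h1")
store.remember_feed("newsfeeds:b", "h2")
store.remember_feed("rss:example", "h3")
removed = store.clear_legacy_feed_state()
remaining = store.feed_keys()
```

With a method like this, a test double for the poller would only need to provide `clear_legacy_feed_state()` and could drop `_conn()` entirely.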

+ 64 - 239
PROJECT.md

@@ -6,271 +6,96 @@ Provide a signal-extraction MCP server that converts RSS into **deduplicated, en
 ## Current architecture (v0.4.0)
 - FastMCP SSE server mounted at `/mcp`
 - SQLite cache for clusters + entity metadata + feed state + LLM summary caches
-- **payload_ts** — indexed generated column for SQL-level event-time filtering (no JSON parsing at read time)
-- **cluster_entities** and **cluster_keywords** junction tables with indexes for O(log n) entity/keyword search
-- All read paths use SQL-level filtering (no full-table JSON parsing)
-- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
+- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
+- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
+- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
+- All time-range filters and entity/keyword searches use SQL indexes. No full-table JSON parsing at query time.
+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
 - **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
-- Concurrent Ollama embeddings (pre-computed before clustering loop)
-- Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
-- Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
-- All concurrency limits configurable via env vars (`NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_<PROVIDER>`)
-- Dashboard REST API (`/api/v1/*`) for clusters, sentiment series, entity frequencies
-- `get_latest_events()` defaults to all topics (omit `topic` for unfiltered)
+- Concurrent RSS fetch, Ollama embeddings, LLM enrichment with per-provider semaphore
+- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables
 
-## Previous: v0.2.x architecture
-- FastMCP SSE server mounted at `/mcp`
-- SQLite cache for clusters + Groq summary caches
-- RSS fetch (breakingthenews.net)
-- v1 dedup via fuzzy title similarity only, seed-article-only comparison
-- optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
-- configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
-- optional embeddings backfill script for precomputing cluster vectors in SQLite
-- optional merge-analysis script for threshold experiments before any DB rewrite
-- optional merge pass for destructive consolidation after threshold review
-- optional article-dedup cleanup for repeated article variants inside a cluster
-- Groq enrichment (topic/entities/sentiment/keywords)
-- Tools expose semantic queries over cached clusters
-
-## MCP tools (current)
+## MCP tools
 - `get_latest_events(topic, limit, include_articles)`
 - `get_events_for_entity(entity, limit, timeframe, include_articles)`
 - `get_event_summary(event_id, include_articles)`
-- `detect_emerging_topics(limit)`
+- `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase)
 - `get_news_sentiment(entity, timeframe)`
 - `get_related_recent_entities(subject, timeframe, limit, include_trends)`
+- `get_feeds()` / `toggle_feed(feed_url, enabled)`
 - `get_capabilities()`
 
-## Refresh & caching
-
-## Future work (planned): entity graph over time
-Instead of treating `detect_emerging_topics()` as a flat list, we want a higher-level representation:
-
-- Convert emerging topic/entity co-occurrence signals into a **weighted entity graph**
-- Group the graph into **communities** (story neighborhoods)
-- Track **time evolution** across refresh windows:
-  - node “momentum” (trend_score/count changes)
-  - edge strength changes (relation tightening/weakening)
-  - community emergence/disappearance
-
-Eventual agent tool shape (later): `get_emerging_entity_graph(timeframe, limit)`.
+## REST API
+- `GET /` — server info, tools list
+- `GET /health` — uptime, version hash
+- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
+- `GET /api/v1/entities` — top entities via junction table GROUP BY
+- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
+- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
+- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
+- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
+- `GET /api/v1/cluster/{cluster_id}` — full detail
+- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management
 
-- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 900s)
-- Feed-hash skipping to avoid redundant RSS+Groq work
-- Cluster TTL (`NEWS_CLUSTERS_TTL_HOURS` via `CLUSTERS_TTL_HOURS`)
+## Refresh & caching
+- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
+- Feed-hash skipping to avoid redundant RSS+LLM work
 - Summary caching for `get_event_summary`
+- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`
 
-## Definition of “committable”
-- Tests pass offline (dedup/storage unit tests)
-- Server exposes tool surface with valid schemas
-- Caching prevents repeated Groq calls for unchanged clusters
-- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
-- Embeddings backfill script exists for older cluster rows before the server restart
-- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
-- Merge pass exists for destructive consolidation once thresholds look sane
-- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
-- Entity lookup now respects timeframe as the scan window, with limit acting as a cap
-
-## Dashboard & REST API (added May 2026)
-
-### What was added
-- **5 REST endpoints** (`/api/v1/*`) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
-- **Dashboard SPA** at `/dashboard` — HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
-- **Non-blocking startup** — moved from synchronous `@app.on_event("startup")` pruning to `lifespan`-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency
-
-### Architecture
-```
-news-mcp/
-├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + dashboard mount
-├── news_mcp/dashboard/
-│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
-│   ├── index.html                   ← SPA shell with 5 views
-│   ├── style.css                    ← Dark theme, responsive
-│   └── dashboard.js                 ← Client-side rendering + Chart.js
-```
-
-### Key design decisions
-- Dashboard store wraps `SQLiteClusterStore` with thin read-only methods — no enrichment, no writes
-- Single shared store instance (`_shared_store`) avoids repeated DB connections
-- Static SPA files are served by FastAPI's `StaticFiles` mount — no Jinja2/templating dependency
-- Client-side `fetch()` + Chart.js avoids HTMX raw-JSON-in-DOM issues
-- Default lookback matches `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not a hardcoded 24h
-
-### Known gaps
-- No auth (LAN-only, no login)
-- Entity detail view in dashboard is minimal (click-to-expand from entity list is stub)
-- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)
-
-## Dashboard & REST API (added May 2026)
-
-### What was added
-- **5 REST endpoints** (`/api/v1/*`) — JSON-only, for programmatic access and the dashboard
-- **Dashboard SPA** at `/dashboard` — 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
-- **Non-blocking startup** — replaced synchronous `@app.on_event("startup")` with `lifespan`-based fire-and-forget background loop; server responds in <0.3s
-- **Async ingestion lock** — `asyncio.Lock` prevents overlapping refresh cycles
-- **Hardened LLM calls** — OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation
-
-### Architecture additions
-```
-news-mcp/
-├── news_mcp/mcp_server_fastmcp.py   ← MCP + REST API + /dashboard static mount
-├── news_mcp/dashboard/
-│   ├── __init__.py
-│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
-│   ├── index.html                   ← SPA shell, 5 views
-│   ├── style.css                    ← Dark theme, responsive grid
-│   └── dashboard.js                 ← Client render, Chart.js, null-safe DOM access
-```
-
-### Design decisions
-- **Dashboard store** wraps `SQLiteClusterStore` with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
-- **Single shared store** (`_shared_store`) — one DB connection pool for the entire process.
-- **Static SPA** served via FastAPI `StaticFiles` — no Jinja2/templating dependency.
-- **Client-side rendering** with `fetch()` + Chart.js — avoids HTMX raw-JSON-in-DOM issues.
-- **Default lookback** follows `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not hardcoded.
-- **Cluster ordering** — always date-descending (SQL `ORDER BY updated_at DESC` + client-side sort as safety net).
-
-### Known gaps (for future work)
-- No auth (LAN-only assumption)
-- Entity detail view is functional but minimal
-- No alerting/threshold notifications (Phase 2)
-- No server-sent events for real-time dashboard updates
-
-## Keyword Utilization Upgrade (May 2026)
-
-### Problem
-Keywords are extracted by the LLM (`extract_entities.prompt` — "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view — but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.
-
-### Plan
-
-#### Phase 1 — Search & Retrieval (done)
-- **1a**: Add keywords to `_cluster_entity_haystack()` in `mcp_server_fastmcp.py` so `get_events_for_entity()` and `get_news_sentiment()` match clusters by thematic keywords, not just named entities.
-- **1b**: Add `keywords` field to cluster output dicts in `get_latest_events()` and `get_events_for_entity()` so downstream LLM agents see the full semantic picture.
-
-#### Phase 2 — Emerging Topics (pending)
-- **2a**: Count keywords in `detect_emerging_topics()` with parallel `keyword_counts_recent` / `keyword_counts_prior` accumulators, scored with the same velocity/recency/source-diversity formula as entities.
-- **2b**: Optionally promote high-velocity keywords to "suggested entities" on the dashboard.
-
-#### Phase 3 — Relatedness & Dashboard (pending)
-- **3a**: Add keyword co-occurrence counting in `_collect_local_related()` in `related_entities.py`.
-- **3b**: Add `get_keyword_frequencies()` to `DashboardStore` and a "Keywords" panel on the dashboard.
-
-#### Phase 4 — Prompt Refinement (optional)
-- Split keyword extraction into "theme keywords" (subject matter) and "signal keywords" (what's new/notable) for differential weighting downstream.
-
-## Timestamp Normalization (May 2026)
-
-### Problem
-Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like `"Sat, 30 May 2026 02:00:12 +00:00"`). Every read path needed fragile format-guessing, and SQL time-range queries on `updated_at` (row modification time, not event time) returned wrong data.
-
-### Fix
-- `_normalize_ts()` helper in `sqlite_store.py`: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform `YYYY-MM-DDTHH:MM:SS+00:00`
-- `sanitize_cluster_payload()` now normalizes `timestamp`, `first_seen`, `last_updated`, and all `article[].timestamp` before writing to DB
-- `merge_cluster_embeddings.py`: same normalization on merged payloads
-- `scripts/normalize_cluster_timestamps.py`: backfill script for existing rows (run on live server with correct `--db` path)
-- `get_sentiment_series()` and `get_entity_frequencies()`: filter by `payload.timestamp` in Python, not `updated_at` in SQL
-
-### Key invariant
-`updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.
-
-## Timestamp Read-Path Cleanup (May 2026)
-
-### Problem
-After normalization, all read paths still contained defensive RFC 2822 / `parsedate_to_datetime` fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.
-
-### Fix
-- Added `_read_ts(ts) -> float | None` to `sqlite_store.py` (module-level, exported). Uses only `datetime.fromisoformat()`. No RFC 2822 fallback. If it fails, the normalization pipeline has a bug — fix that instead.
-- All read-path timestamp parsing in `sqlite_store.py`, `dashboard_store.py`, and `mcp_server_fastmcp.py` now uses `_read_ts` or plain `fromisoformat`.
-- `parsedate_to_datetime` removed from `dashboard_store.py` and `mcp_server_fastmcp.py` imports entirely.
-- `parsedate_to_datetime` is **only** retained in `sqlite_store._normalize_ts()` (the write path) and `dedup/cluster.py` (raw ingest before normalization).
-- Test fixtures updated to use ISO 8601 UTC timestamps.
-
-### Contract (ENFORCE THIS)
-- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **always** `YYYY-MM-DDTHH:MM:SS+00:00` for any row written after the normalization migration.
-- Read paths: use `_read_ts()` from `sqlite_store` or `datetime.fromisoformat()` directly. **Never** add `parsedate_to_datetime` to a read path.
-- Write paths: `sanitize_cluster_payload()` in `sqlite_store.py` is the single normalization point. All writes go through `upsert_clusters()` which calls it.
-- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data — the dev DB is stale/empty.
-
-## Junction Tables + Indexed Timestamp (May 2026)
-
-### Problem
-All read paths deserialize every JSON payload to filter by entity/keyword/time. With 6000+ clusters, `get_clusters_page` returns only the 100 newest — clicking an entity that appears 34x shows only 2 clusters because the other 32 are outside the LIMIT. `get_entity_frequencies` counts correctly but the detail view can't find them. Every query does a full table scan with JSON parsing.
-
-### Solution: junction tables + generated timestamp column
-
-**Schema (migrated in `_init_db`, incremental-safe):**
-
+## Schema (clusters table)
 ```sql
--- Indexed event timestamp (SQLite generated column — zero write-path cost)
-ALTER TABLE clusters ADD COLUMN payload_ts
-    GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) STORED;
-CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts);
+CREATE TABLE clusters (
+    cluster_id TEXT PRIMARY KEY,
+    topic TEXT NOT NULL,
+    payload TEXT NOT NULL,
+    updated_at TEXT NOT NULL,          -- row modification time (set on every upsert)
+    summary_payload TEXT,
+    summary_updated_at TEXT,
+    payload_ts GENERATED ALWAYS AS     -- indexed event time (auto-maintained)
+        (json_extract(payload, '$.timestamp')) VIRTUAL
+);
+CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
 
--- Entity junction table for SQL-level entity search
-CREATE TABLE IF NOT EXISTS cluster_entities (
+CREATE TABLE cluster_entities (
     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
-    entity     TEXT NOT NULL,
+    entity     TEXT NOT NULL,          -- lowercased
     PRIMARY KEY (cluster_id, entity)
 );
-CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity ON cluster_entities(entity);
+CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
 
--- Keyword junction table for SQL-level keyword search
-CREATE TABLE IF NOT EXISTS cluster_keywords (
+CREATE TABLE cluster_keywords (
     cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
-    keyword    TEXT NOT NULL,
+    keyword    TEXT NOT NULL,          -- lowercased
     PRIMARY KEY (cluster_id, keyword)
 );
-CREATE INDEX IF NOT EXISTS idx_cluster_keywords_keyword ON cluster_keywords(keyword);
+CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
 ```
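The schema above can be exercised end to end with Python's built-in `sqlite3` module. The following is a minimal sketch, not the project's actual `upsert_clusters()`: the helper names `open_demo_db` and `upsert_cluster_sketch` are hypothetical, the `entities` payload key is an assumption, and the summary columns and keyword table are omitted for brevity. It assumes a SQLite build with generated-column and JSON support (3.31 or newer with JSON functions enabled).

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts TEXT GENERATED ALWAYS AS
        (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
    entity     TEXT NOT NULL,
    PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
"""


def open_demo_db():
    """Hypothetical helper: in-memory DB with a reduced copy of the schema."""
    conn = sqlite3.connect(":memory:")
    # ON DELETE CASCADE only fires when foreign keys are enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def upsert_cluster_sketch(conn, cluster_id, topic, payload, now_iso):
    """Hypothetical helper: write one cluster and refresh its entity rows."""
    with conn:  # single transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO clusters (cluster_id, topic, payload, updated_at) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(cluster_id) DO UPDATE SET "
            "topic = excluded.topic, payload = excluded.payload, "
            "updated_at = excluded.updated_at",
            (cluster_id, topic, json.dumps(payload), now_iso),
        )
        # Drop stale junction rows so re-enrichment cannot leave old entities.
        conn.execute(
            "DELETE FROM cluster_entities WHERE cluster_id = ?", (cluster_id,)
        )
        # Entities are stored lowercased; INSERT OR IGNORE absorbs duplicates.
        conn.executemany(
            "INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)",
            [(cluster_id, e.strip().lower()) for e in payload.get("entities", [])],
        )
```

Nothing in the helper writes `payload_ts`: SQLite derives it from the stored JSON, so the only code-managed state is the junction rows.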
 
-**Write path (`upsert_clusters`):** Within the existing transaction, after sanitizing the payload and before INSERT/UPDATE:
-1. `DELETE FROM cluster_entities WHERE cluster_id = ?`  (handles re-enrichment)
-2. `DELETE FROM cluster_keywords WHERE cluster_id = ?`
-3. `INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)` for each entity
-4. `INSERT OR IGNORE INTO cluster_keywords VALUES (?, ?)` for each keyword
-5. `payload_ts` is auto-maintained by SQLite's generated column — no code needed
-
-**Read paths — all SQL-level, no JSON parsing at query time:**
-
-- `get_clusters_page`: `WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ? OFFSET ?`
-- `get_entity_frequencies`: `JOIN cluster_entities ... WHERE payload_ts >= ? GROUP BY entity ORDER BY cnt DESC`
-- `get_keyword_frequencies`: `JOIN cluster_keywords ... WHERE payload_ts >= ? GROUP BY keyword ORDER BY cnt DESC`
-- New `get_clusters_by_entity`: `JOIN cluster_entities WHERE payload_ts >= ? AND entity = ?`
-- New `get_clusters_by_keyword`: `JOIN cluster_keywords WHERE payload_ts >= ? AND keyword = ?`
+## Keyword Utilization (done, May 2026)
+Keywords extracted by the LLM are now first-class search signals:
+- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
+- Cluster output includes `keywords[]` field
+- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
+- `_collect_local_related()` counts keyword co-occurrence
+- Dashboard Keywords panel with SQL frequency counts via junction table
+- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
 
-**Backfill script (`scripts/backfill_junction_tables.py`):**
-- Same pattern as `normalize_cluster_timestamps.py`
-- Accepts `--db` arg, defaults to config DB_PATH
-- Reads all cluster payloads, populates `cluster_entities` and `cluster_keywords`
-- `payload_ts` is auto-populated by SQLite's generated column
-- Idempotent (`INSERT OR IGNORE` + transaction)
-- Reports entity/keyword counts after completion
-- Run once on live server: `docker exec -it <container> python3 scripts/backfill_junction_tables.py`
+## Timestamp Pipeline (May 2026)
+1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three are missing, it falls back to `datetime.now()`.
+2. **Generated column**: `payload_ts` is derived from the JSON payload via `json_extract(payload, '$.timestamp')`; SQLite keeps its index up to date on every write.
+3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
+4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
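The write-side normalization in step 1 might look roughly like the sketch below. It is an illustration only, not the project's `_normalize_ts()`: the function name `normalize_ts_sketch` is hypothetical, and treating timezone-naive inputs as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_ts_sketch(value):
    """Return value as 'YYYY-MM-DDTHH:MM:SS+00:00', or None if unparseable."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        # Epoch seconds.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = str(value).strip()
        if text.endswith("Z"):
            # Older Python versions reject the 'Z' suffix in fromisoformat().
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            try:
                # RFC 2822 / HTTP-date, as found in raw RSS feeds.
                dt = parsedate_to_datetime(text)
            except (TypeError, ValueError):
                return None
    if dt.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Because every stored value ends up in the same fixed-width UTC format, plain string comparison orders timestamps chronologically, which is what makes a filter like `payload_ts >= ?` safe to run in SQL.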
 
-**REST API changes:**
-- `GET /api/v1/clusters` — now uses SQL `payload_ts` filter, consistent total
-- `GET /api/v1/entities` — SQL `COUNT(*) ... GROUP BY` via junction table
-- `GET /api/v1/keywords` — SQL `COUNT(*) ... GROUP BY` via junction table
-- **New `GET /api/v1/clusters/by-entity?entity=X&hours=Y&limit=Z`** — SQL entity search
-- **New `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y&limit=Z`** — SQL keyword search
+## Design Flaw: Two Stores (KNOWN, fix planned)
 
-**Dashboard JS changes:**
-- `showEntityDetail(label)` — calls `/api/v1/clusters/by-entity` instead of fetching all clusters
-- `showKeywordDetail(label)` — calls `/api/v1/clusters/by-keyword` instead of fetching all clusters
+**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still does full-table JSON parsing for entity/keyword search.
 
-**Files changed:**
-| File | Change |
-|---|---|
-| `news_mcp/storage/sqlite_store.py` | Schema migration (generated column + junction tables), write-path junction population, new SQL-level read methods |
-| `news_mcp/mcp_server_fastmcp.py` | New REST endpoints for entity/keyword cluster search |
-| `news_mcp/dashboard/dashboard_store.py` | `get_entity_frequencies`, `get_keyword_frequencies` use SQL junction table counts |
-| `dashboard/dashboard.js` | `showEntityDetail`, `showKeywordDetail` call new endpoints |
-| `scripts/backfill_junction_tables.py` | New backfill script (same pattern as normalize_cluster_timestamps.py) |
+**Current state:**
+- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
+- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python
 
-**Migration safety:**
-- All DDL uses `IF NOT EXISTS` / `ADD COLUMN IF NOT EXISTS` — safe to re-run
-- Backfill script is idempotent (`INSERT OR IGNORE` in transactions)
-- Generated column requires no write-path code changes
-- Old query methods can coexist during transition (removed after verification)
+**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
 
+**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.
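The difference between the two read paths can be reproduced on synthetic data. The sketch below is illustrative only: `get_clusters_by_entity_sketch`, `top_n_then_filter`, and `build_demo_db` are hypothetical stand-ins, the schema is reduced to the columns involved, and the `entities` payload key is an assumption.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def build_demo_db(n_clusters=300, every=10):
    """Hypothetical fixture: every `every`-th cluster mentions the entity."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clusters (
            cluster_id TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            payload_ts TEXT GENERATED ALWAYS AS
                (json_extract(payload, '$.timestamp')) VIRTUAL
        );
        CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
        CREATE TABLE cluster_entities (
            cluster_id TEXT NOT NULL,
            entity TEXT NOT NULL,
            PRIMARY KEY (cluster_id, entity)
        );
        CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
    """)
    base = datetime(2026, 5, 28, tzinfo=timezone.utc)
    with conn:
        for i in range(n_clusters):
            entities = ["pete hegseth"] if i % every == 0 else ["other"]
            payload = {
                "timestamp": (base + timedelta(minutes=i)).isoformat(),
                "entities": entities,
            }
            cid = f"c{i}"
            conn.execute(
                "INSERT INTO clusters (cluster_id, payload) VALUES (?, ?)",
                (cid, json.dumps(payload)),
            )
            conn.executemany(
                "INSERT INTO cluster_entities VALUES (?, ?)",
                [(cid, e) for e in entities],
            )
    return conn


def top_n_then_filter(conn, entity, since_iso, top_n=200):
    """Flawed path: take the newest top_n clusters, then filter in Python."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters "
        "WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ?",
        (since_iso, top_n),
    ).fetchall()
    wanted = entity.strip().lower()
    return [cid for cid, p in rows if wanted in json.loads(p).get("entities", [])]


def get_clusters_by_entity_sketch(conn, entity, since_iso, limit=200):
    """Junction-table path: the entity filter runs in SQL before LIMIT."""
    rows = conn.execute(
        "SELECT c.cluster_id FROM cluster_entities AS ce "
        "JOIN clusters AS c ON c.cluster_id = ce.cluster_id "
        "WHERE ce.entity = ? AND c.payload_ts >= ? "
        "ORDER BY c.payload_ts DESC LIMIT ?",
        (entity.strip().lower(), since_iso, limit),
    ).fetchall()
    return [cid for (cid,) in rows]
```

With the default fixture the entity occurs in 30 of 300 clusters. The top-N path sees only the 20 that fall inside the newest 200, while the junction-table path returns all 30, because its `LIMIT` applies after the entity match rather than before it.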