fix: collapse two-store design flaw — SQL-level entity/keyword search in all MCP tools

The 'Two Stores' design flaw (PROJECT.md § Design Flaw) was that
DashboardStore had proper SQL-level entity/keyword search via junction
tables while MCP tools used Python-side substring matching on a top-N
row fetch, silently missing entities beyond the limit.
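
The blind spot described above can be reproduced with plain Python lists. This is a minimal sketch, not the project's code: the 400-row data set and the helper `top_n_then_filter` are hypothetical, and the counts (34 matching clusters, 15 of them inside the newest 200) are taken from the example in PROJECT.md.

```python
# Hypothetical data: 400 clusters, index 0 is the newest.
clusters = [{"cluster_id": f"c{i:03d}", "entities": []} for i in range(400)]

# Tag 34 clusters with the entity: 15 among the newest 200, 19 in older rows.
recent_hits = range(0, 150, 10)    # 15 indices below 200
older_hits = range(200, 390, 10)   # 19 indices at or above 200
for i in list(recent_hits) + list(older_hits):
    clusters[i]["entities"].append("pete hegseth")


def top_n_then_filter(rows, terms, fetch_limit=200):
    """Old pattern: fetch the newest rows first, then substring-match in Python."""
    fetched = rows[:fetch_limit]
    return [
        c for c in fetched
        if any(term in ent for ent in c["entities"] for term in terms)
    ]


found = top_n_then_filter(clusters, ["pete hegseth"])
total = sum("pete hegseth" in c["entities"] for c in clusters)
print(len(found), total - len(found))  # prints: 15 19
```

The 19 older clusters never reach the filter, so nothing signals that results are missing.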

Changes:
- sqlite_store.py: Added get_clusters_page(), get_sentiment_series(),
  get_entity_frequencies(), get_keyword_frequencies(),
  get_clusters_by_entity(), get_clusters_by_keyword(), and
  get_clusters_by_entity_or_keyword() — a combined SQL search across
  both cluster_entities and cluster_keywords junction tables with
  WHERE (ce.entity IN (...) OR ck.keyword IN (...)). Extended
  get_dashboard_stats() and get_cluster_detail() to match the richer
  DashboardStore output.
- mcp_server_fastmcp.py: All three MCP tools that did Python-side
  entity matching (get_events_for_entity, get_news_sentiment,
  get_latest_events entity mode) now call
  get_clusters_by_entity_or_keyword() — SQL-level, no row-limit.
  Exact matching via IN replaces substring matching, which avoids
  false positives from partial name matches.
  DashboardStore import removed; REST routes use _shared_store directly.
- dashboard/dashboard_store.py: Deleted. All methods moved to
  SQLiteClusterStore.
- Docs: PROJECT.md, OUTLOOK.md, AGENTS.md updated to mark the fix.

Tests: 32/32 pass. Server starts cleanly. All REST endpoints verified.
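
The combined search can be sketched against an in-memory SQLite database. This is an illustration under stated assumptions, not the body of `get_clusters_by_entity_or_keyword()`: the table columns are reduced to the minimum, the index names and the function `search` are hypothetical, and terms are assumed to be stored lowercased in the junction tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, payload_ts TEXT);
    CREATE TABLE cluster_entities (cluster_id TEXT, entity TEXT);
    CREATE TABLE cluster_keywords (cluster_id TEXT, keyword TEXT);
    CREATE INDEX idx_ce_entity ON cluster_entities(entity);
    CREATE INDEX idx_ck_keyword ON cluster_keywords(keyword);
""")
conn.executemany("INSERT INTO clusters VALUES (?, ?)", [
    ("a", "2026-05-01T10:00:00+00:00"),
    ("b", "2026-05-02T10:00:00+00:00"),
    ("c", "2026-05-03T10:00:00+00:00"),
    ("d", "2026-04-01T10:00:00+00:00"),  # older than the cutoff used below
])
conn.executemany("INSERT INTO cluster_entities VALUES (?, ?)", [
    ("a", "pete hegseth"), ("c", "federal reserve"), ("d", "pete hegseth"),
])
conn.executemany("INSERT INTO cluster_keywords VALUES (?, ?)", [
    ("a", "pentagon"), ("b", "pentagon"),
])


def search(terms, cutoff, limit=50):
    """Return cluster ids whose entities OR keywords exactly match any term."""
    terms = [t.strip().lower() for t in terms if t.strip()]
    if not terms:
        return []
    marks = ",".join("?" * len(terms))  # one placeholder per term
    sql = f"""
        SELECT DISTINCT c.cluster_id, c.payload_ts
        FROM clusters c
        LEFT JOIN cluster_entities ce ON ce.cluster_id = c.cluster_id
        LEFT JOIN cluster_keywords ck ON ck.cluster_id = c.cluster_id
        WHERE c.payload_ts >= ?
          AND (ce.entity IN ({marks}) OR ck.keyword IN ({marks}))
        ORDER BY c.payload_ts DESC
        LIMIT ?
    """
    params = [cutoff, *terms, *terms, limit]
    return [row[0] for row in conn.execute(sql, params)]


result = search(["Pete Hegseth", "pentagon"], "2026-04-15T00:00:00+00:00")
print(result)  # prints: ['b', 'a']
```

Because the filter runs inside SQL, `LIMIT` is applied after matching rather than before it, so older matching clusters within the time window are not cut off by a pre-fetch cap.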
Lukas Goldschmidt, 1 week ago
Parent
Commit
f8677e48b5
7 changed files with 328 additions and 412 deletions
  1. AGENTS.md (+2 -2)
  2. OUTLOOK.md (+4 -4)
  3. PROJECT.md (+13 -9)
  4. news_mcp/dashboard/__init__.py (+0 -0)
  5. news_mcp/dashboard/dashboard_store.py (+0 -357)
  6. news_mcp/mcp_server_fastmcp.py (+15 -37)
  7. news_mcp/storage/sqlite_store.py (+294 -3)

+ 2 - 2
AGENTS.md

@@ -53,9 +53,9 @@ This project spans two machines. **Always check which machine you're operating o
 docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
 ```
 
-## Design Flaw: Two Stores
+## Design Flaw: Two Stores (FIXED May 2026)
 
-`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools (`get_events_for_entity`, `get_news_sentiment`) still use `SQLiteClusterStore` Python-side entity matching with a row limit (top 200), missing entities in older clusters. See PROJECT.md for full analysis and proposed fix.
+`DashboardStore` was eliminated. `SQLiteClusterStore` is the single data access layer with junction-table entity/keyword search. All MCP tools use the proper SQL methods.
 
 ## Docker / Live Server Details
 - `docker-compose.yml` mounts `./:/app` with `working_dir: /app`

+ 4 - 4
OUTLOOK.md

@@ -33,11 +33,11 @@ See PROJECT.md for full schema and architecture. Key points:
 
 ## Known Design Issues
 
-### Two Stores (see PROJECT.md § "Design Flaw")
-`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools still use Python-side entity matching with a row limit. Proposed fix: collapse into single data access layer.
+### Two Stores (FIXED, May 2026)
+`DashboardStore` was eliminated. All methods moved to `SQLiteClusterStore`. MCP tools now use SQL-level junction-table entity/keyword search via `get_clusters_by_entity_or_keyword()` — no row-limit blind spot.
 
-### MCP Tool Entity Search
-`get_events_for_entity` and `get_news_sentiment` fetch top-N clusters by time then filter entities in Python. Entities in clusters beyond the limit are missed. Fix: use junction table `get_clusters_by_entity()`.
+### MCP Tool Entity Search (FIXED, May 2026)
+`get_events_for_entity` and `get_news_sentiment` now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` with proper SQL-level filtering across the full time window via the `cluster_entities` and `cluster_keywords` junction tables.
 
 ## Backfill Scripts
 

+ 13 - 9
PROJECT.md

@@ -82,20 +82,24 @@ Keywords extracted by the LLM are now first-class search signals:
 - Dashboard Keywords panel with SQL frequency counts via junction table
 - Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
 
+## Two-Store Collapse (done, May 2026)
+
+`DashboardStore` has been eliminated. All of its methods were moved into `SQLiteClusterStore` (the single data access layer), and the REST API routes now use the shared `SQLiteClusterStore` instance directly.
+
+All MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` which searches via junction-table SQL joins — no row-limit blind spot. The `cluster_entities` and `cluster_keywords` junction tables are indexed for O(log n) lookup across any time window.
+
 ## Timestamp Pipeline (May 2026)
 1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`.
 2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
 3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
 4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
 
-## Design Flaw: Two Stores (KNOWN, fix planned)
-
-**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still does full-table JSON parsing for entity/keyword search.
-
-**Current state:**
-- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
-- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python
+## Design Flaw: Two Stores (FIXED, May 2026)
 
-**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
+**What happened:** `DashboardStore` was a thin read-only query layer that wrapped `SQLiteClusterStore`. The MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) did Python-side entity matching by fetching top-N clusters via `payload_ts` then filtering in Python. Entities in clusters beyond the limit were silently missed.
 
-**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.
+**Fix applied:** 
+- `DashboardStore` was deleted. All its methods are now in `SQLiteClusterStore`.
+- All MCP tools use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` — SQL-level junction-table search with no row-limit blind spot.
+- The combined method uses `LEFT JOIN` on `cluster_entities` and `cluster_keywords` with `WHERE (ce.entity IN (...) OR ck.keyword IN (...))`, which matches both named entities and thematic keywords across any time window.
+- Exact matching (via `IN`) replaced substring matching — more correct, no false positives from partial name matches.
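
The difference between the two matching modes named in the last bullet can be shown in a few lines. This is a sketch only: `substring_match` mirrors the removed `term in item` loop, `exact_match` stands in for the SQL `IN` comparison, and the stored labels are hypothetical examples assumed to be lowercased.

```python
def substring_match(terms, entities):
    """Old behaviour: a term matches if it occurs anywhere inside an entity string."""
    return any(term in ent for term in terms for ent in entities)


def exact_match(terms, entities):
    """New behaviour: a term matches only if it equals a stored entity string."""
    wanted = {t.strip().lower() for t in terms}
    return any(ent in wanted for ent in entities)


stored = ["european central bank", "secretary pete hegseth"]

false_positive = substring_match(["eu"], stored)          # "eu" is inside "european"
exact_eu = exact_match(["eu"], stored)                    # no stored label equals "eu"
partial_old = substring_match(["pete hegseth"], stored)   # partial name still hits
partial_new = exact_match(["pete hegseth"], stored)       # partial name no longer hits

print(false_positive, exact_eu, partial_old, partial_new)  # prints: True False True False
```

The last case shows the trade-off: exact matching removes accidental hits, but it relies on the junction tables holding the same normalized label that callers pass in.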

+ 0 - 0
news_mcp/dashboard/__init__.py


+ 0 - 357
news_mcp/dashboard/dashboard_store.py

@@ -1,357 +0,0 @@
-from __future__ import annotations
-
-import json
-from datetime import datetime, timedelta, timezone
-from typing import Any
-
-from news_mcp.config import (
-    NEWS_PRUNE_INTERVAL_HOURS,
-    NEWS_PRUNING_ENABLED,
-    NEWS_REFRESH_INTERVAL_SECONDS,
-    NEWS_RETENTION_DAYS,
-    DEFAULT_TOPICS,
-)
-from news_mcp.storage.sqlite_store import SQLiteClusterStore
-
-
-class DashboardStore:
-    """Read-only query layer for the dashboard."""
-
-    def __init__(self, store=None):
-        if store is not None:
-            self._store = store
-        else:
-            from news_mcp.config import DB_PATH
-            self._store = SQLiteClusterStore(DB_PATH)
-
-    # ── Health & Stats ──────────────────────────────────────────────
-
-    def get_dashboard_stats(self) -> dict[str, Any]:
-        with self._store._conn() as conn:
-            total_clusters = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
-            total_entities = conn.execute("SELECT COUNT(*) FROM entity_metadata").fetchone()[0]
-            cluster_entities = conn.execute(
-                "SELECT COUNT(DISTINCT e.value) "
-                "FROM clusters, json_each(clusters.payload, '$.entities') AS e"
-            ).fetchone()[0]
-            topic_counts = dict(conn.execute(
-                "SELECT topic, COUNT(*) FROM clusters GROUP BY topic"
-            ).fetchall())
-
-        last_refresh = self._store.get_meta("last_refresh_at")
-        last_prune = self._store.get_meta("last_prune_at")
-        
-        # Freshness: did a refresh happen recently? (within 2x the configured interval)
-        fresh = False
-        if last_refresh:
-            try:
-                dt = datetime.fromisoformat(last_refresh.replace("Z", "+00:00"))
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                age_hours = (datetime.now(timezone.utc) - dt).total_seconds() / 3600
-                fresh = age_hours < max(1.0, NEWS_REFRESH_INTERVAL_SECONDS / 3600) * 2
-            except Exception:
-                pass
-
-        feeds = {}
-        with self._store._conn() as conn:
-            for row in conn.execute("SELECT feed_key, last_hash, last_item_count, enabled, updated_at FROM feed_state ORDER BY updated_at DESC"):
-                feeds[row[0]] = {"last_hash": row[1], "last_item_count": row[2], "enabled": bool(row[3]), "updated_at": row[4]}
-
-        return {
-            "total_clusters": total_clusters,
-            "total_entities": total_entities,
-            "cluster_entities": cluster_entities,
-            "clusters_by_topic": topic_counts,
-            "last_refresh_at": last_refresh,
-            "last_prune_at": last_prune,
-            "data_fresh": fresh,
-            "feeds": feeds,
-            "feed_count": len(feeds),
-            "pruning": {
-                "enabled": NEWS_PRUNING_ENABLED,
-                "retention_days": NEWS_RETENTION_DAYS,
-                "interval_hours": NEWS_PRUNE_INTERVAL_HOURS,
-                "last_prune_at": last_prune,
-            },
-        }
-
-    # ── Clusters ────────────────────────────────────────────────────
-
-    def get_clusters_page(
-        self,
-        topic: str | None = None,
-        hours: float = 24,
-        limit: int = 20,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Paginated cluster listing filtered by SQL payload_ts index.
-
-        Returns {"clusters": [...], "total": int}.
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-
-        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
-        params: list = [cutoff]
-        if topic and topic != "all":
-            query += " AND topic = ?"
-            params.append(topic)
-        # Get total count before pagination
-        total = self._store._conn().execute(
-            f"SELECT COUNT(*) FROM ({query})", params
-        ).fetchone()[0]
-        query += " ORDER BY payload_ts DESC LIMIT ? OFFSET ?"
-        params.extend([limit, offset])
-
-        with self._store._conn() as conn:
-            rows = conn.execute(query, params).fetchall()
-
-        return {
-            "clusters": [
-                {
-                    "cluster_id": c.get("cluster_id", ""),
-                    "headline": c.get("headline", ""),
-                    "topic": c.get("topic", ""),
-                    "sentiment": c.get("sentiment", "neutral"),
-                    "sentimentScore": c.get("sentimentScore"),
-                    "importance": c.get("importance", 0),
-                    "entities": c.get("entities", []),
-                    "sources": c.get("sources", []),
-                    "timestamp": c.get("timestamp", ""),
-                    "keywords": c.get("keywords", []),
-                    "article_count": len(c.get("articles", [])),
-                }
-                for c in [json.loads(r[0]) for r in rows]
-            ],
-            "total": total,
-        }
-
-    def get_cluster_detail(self, cluster_id: str) -> dict[str, Any] | None:
-        with self._store._conn() as conn:
-            cur = conn.execute(
-                "SELECT payload FROM clusters WHERE cluster_id = ?", (cluster_id,)
-            )
-            row = cur.fetchone()
-            if not row:
-                return None
-            c = json.loads(row[0])
-            summary = None
-            if c.get("summary_payload"):
-                try:
-                    summary = json.loads(c["summary_payload"])
-                except Exception:
-                    pass
-            return {
-                "cluster_id": c.get("cluster_id"),
-                "headline": c.get("headline", ""),
-                "summary": c.get("summary", ""),
-                "topic": c.get("topic", ""),
-                "sentiment": c.get("sentiment", "neutral"),
-                "sentimentScore": c.get("sentimentScore"),
-                "importance": c.get("importance", 0),
-                "entities": c.get("entities", []),
-                "entityResolutions": c.get("entityResolutions", []),
-                "keywords": c.get("keywords", []),
-                "sources": c.get("sources", []),
-                "timestamp": c.get("timestamp", ""),
-                "first_seen": c.get("first_seen", ""),
-                "last_updated": c.get("last_updated", ""),
-                "article_count": len(c.get("articles", [])),
-                "articles": c.get("articles", []),
-                "summary_text": summary.get("mergedSummary", "") if summary else "",
-                "key_facts": summary.get("keyFacts", []) if summary else [],
-            }
-
-    # ── Sentiment Series ────────────────────────────────────────────
-
-    def get_sentiment_series(
-            self,
-            topic: str | None = None,
-            hours: float = 24,
-            bucket_hours: float = 1,
-        ) -> list[dict[str, Any]]:
-        """Sentiment score averaged per time bucket.
-
-        Filters by payload_ts SQL index.
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
-        params: list = [cutoff]
-        if topic and topic != "all":
-            query += " AND topic = ?"
-            params.append(topic)
-        query += " ORDER BY payload_ts ASC"
-
-        with self._store._conn() as conn:
-            rows = conn.execute(query, params).fetchall()
-
-        buckets: dict[datetime, list[float]] = {}
-        for (payload_text,) in rows:
-            c = json.loads(payload_text)
-            ts_str = c.get("timestamp")
-            score = c.get("sentimentScore")
-            if not ts_str or score is None:
-                continue
-            dt = datetime.fromisoformat(str(ts_str).strip())
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            dt = dt.astimezone(timezone.utc)
-            bucket_key = dt.replace(minute=0, second=0, microsecond=0)
-            if bucket_hours > 1:
-                bucket_key = bucket_key.replace(
-                    hour=(bucket_key.hour // int(bucket_hours)) * int(bucket_hours)
-                )
-            buckets.setdefault(bucket_key, []).append(float(score))
-
-        return [
-            {
-                "time": bucket_key.isoformat(),
-                "avg_sentiment": round(sum(scores) / len(scores), 3),
-                "count": len(scores),
-                "min": round(min(scores), 3),
-                "max": round(max(scores), 3),
-            }
-            for bucket_key, scores in sorted(buckets.items())
-        ]
-
-    # ── Entity Frequencies ──────────────────────────────────────────
-
-    def get_entity_frequencies(
-        self,
-        hours: float = 24,
-        limit: int = 30,
-    ) -> list[dict[str, Any]]:
-        """Top entities by mention count, using SQL junction table + payload_ts index."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-
-        with self._store._conn() as conn:
-            rows = conn.execute(
-                """
-                SELECT ce.entity, COUNT(*) as cnt
-                FROM cluster_entities ce
-                JOIN clusters c ON c.cluster_id = ce.cluster_id
-                WHERE c.payload_ts >= ?
-                GROUP BY ce.entity
-                ORDER BY cnt DESC
-                LIMIT ?
-                """,
-                (cutoff, limit),
-            ).fetchall()
-
-        result: list[dict[str, Any]] = []
-        for label, count in rows:
-            meta = self._store.get_entity_metadata(label)
-            result.append({
-                "label": label,
-                "count": count,
-                "canonical_label": meta["canonical_label"] if meta else label,
-                "mid": meta["mid"] if meta else None,
-            })
-        return result
-
-    # ── Keyword Frequencies ─────────────────────────────────────────
-
-    def get_keyword_frequencies(
-        self,
-        hours: float = 24,
-        limit: int = 30,
-    ) -> list[dict[str, Any]]:
-        """Top keywords by mention count, using SQL junction table + payload_ts index.
-
-        Excludes DEFAULT_TOPICS labels (crypto, macro, regulation, ai, other).
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        _topic_labels = {t.lower() for t in DEFAULT_TOPICS}
-
-        with self._store._conn() as conn:
-            rows = conn.execute(
-                """
-                SELECT ck.keyword, COUNT(*) as cnt
-                FROM cluster_keywords ck
-                JOIN clusters c ON c.cluster_id = ck.cluster_id
-                WHERE c.payload_ts >= ?
-                GROUP BY ck.keyword
-                ORDER BY cnt DESC
-                LIMIT ?
-                """,
-                (cutoff, limit),
-            ).fetchall()
-
-        return [
-            {"label": label, "count": count}
-            for label, count in rows
-            if label.lower() not in _topic_labels
-        ]
-
-    # ── Entity/Keyword Cluster Search ────────────────────────────────
-
-    def get_clusters_by_entity(
-        self,
-        entity: str,
-        hours: float = 168,
-        limit: int = 50,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Return clusters matching an entity, SQL-level filter via junction table."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        entity_norm = entity.strip().lower()
-
-        with self._store._conn() as conn:
-            # Total count
-            total = conn.execute(
-                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
-                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
-                "WHERE c.payload_ts >= ? AND ce.entity = ?",
-                (cutoff, entity_norm),
-            ).fetchone()[0]
-
-            # Paginated results
-            rows = conn.execute(
-                "SELECT DISTINCT c.payload FROM clusters c "
-                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
-                "WHERE c.payload_ts >= ? AND ce.entity = ? "
-                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
-                (cutoff, entity_norm, limit, offset),
-            ).fetchall()
-
-        return {
-            "entity": entity_norm,
-            "clusters": [json.loads(r[0]) for r in rows],
-            "total": total,
-            "hours": hours,
-        }
-
-    def get_clusters_by_keyword(
-        self,
-        keyword: str,
-        hours: float = 168,
-        limit: int = 50,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Return clusters matching a keyword, SQL-level filter via junction table."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        kw_norm = keyword.strip().lower()
-
-        with self._store._conn() as conn:
-            total = conn.execute(
-                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
-                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
-                "WHERE c.payload_ts >= ? AND ck.keyword = ?",
-                (cutoff, kw_norm),
-            ).fetchone()[0]
-
-            rows = conn.execute(
-                "SELECT DISTINCT c.payload FROM clusters c "
-                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
-                "WHERE c.payload_ts >= ? AND ck.keyword = ? "
-                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
-                (cutoff, kw_norm, limit, offset),
-            ).fetchall()
-
-        return {
-            "keyword": kw_norm,
-            "clusters": [json.loads(r[0]) for r in rows],
-            "total": total,
-            "hours": hours,
-        }
-

+ 15 - 37
news_mcp/mcp_server_fastmcp.py

@@ -28,7 +28,6 @@ from news_mcp.config import (
 )
 from news_mcp.jobs.poller import refresh_clusters
 from news_mcp.storage.sqlite_store import SQLiteClusterStore
-from news_mcp.dashboard.dashboard_store import DashboardStore
 from news_mcp.enrichment.llm_enrich import summarize_cluster_llm
 from news_mcp.trends_resolution import resolve_entity_via_trends
 from news_mcp.llm import active_llm_config
@@ -369,16 +368,10 @@ async def get_latest_events(topic: str | None = None, limit: int = 5, include_ar
             clusters = store.get_latest_clusters(topic=topic_norm, ttl_hours=DEFAULT_LOOKBACK_HOURS, limit=limit)
     else:
         # Entity-aware mode: search recent clusters across all topics and match by
-        # raw entity, canonical label, or MID.
-        clusters = store.get_latest_clusters_all_topics(ttl_hours=DEFAULT_LOOKBACK_HOURS, limit=limit * 8)
-        filtered = []
-        for c in clusters:
-            haystack = _cluster_entity_haystack(c)
-            if any(any(term in item for item in haystack) for term in query_terms):
-                filtered.append(c)
-            if len(filtered) >= limit:
-                break
-        clusters = filtered
+        # raw entity, canonical label, or MID using SQL-level junction table search.
+        clusters = store.get_clusters_by_entity_or_keyword(
+            query_terms=query_terms, hours=DEFAULT_LOOKBACK_HOURS, limit=limit
+        )
 
     out = []
     for c in _sort_clusters_by_recency(clusters):
@@ -429,19 +422,8 @@ async def get_events_for_entity(entity: str, limit: int = 10, timeframe: str = "
 
     store = SQLiteClusterStore(DB_PATH)
 
-    def _match_clusters(clusters: list[dict]) -> list[dict]:
-        hits: list[dict] = []
-        for c in _sort_clusters_by_recency(clusters):
-            haystack = _cluster_entity_haystack(c)
-            if any(any(term in item for item in haystack) for term in query_terms):
-                hits.append(c)
-            if len(hits) >= limit:
-                break
-        return hits
-
     hours = _parse_timeframe_to_hours(timeframe)
-    clusters = store.get_latest_clusters_all_topics(ttl_hours=hours, limit=max(200, limit * 10))
-    hits = _match_clusters(clusters)
+    hits = store.get_clusters_by_entity_or_keyword(query_terms=query_terms, hours=hours, limit=limit)
 
     out = []
     for c in hits:
@@ -903,12 +885,8 @@ async def get_news_sentiment(entity: str, timeframe: str = "24h"):
         hours = 24
     hours = max(1, min(int(hours), 168))
 
-    clusters = store.get_latest_clusters_all_topics(ttl_hours=hours, limit=500)
-    matched = []
-    for c in clusters:
-        haystack = _cluster_entity_haystack(c)
-        if any(any(term in item for item in haystack) for term in query_terms):
-            matched.append(c)
+    clusters = store.get_clusters_by_entity_or_keyword(query_terms=query_terms, hours=hours, limit=500)
+    matched = clusters
 
     if not matched:
         return {
@@ -1101,7 +1079,7 @@ def _api_err(exc: Exception, ctx: str) -> JSONResponse:
 def api_health():
     """Extended health + dashboard stats."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         stats = store.get_dashboard_stats()
         stats["version"] = _VERSION_HASH
         return stats
@@ -1117,7 +1095,7 @@ def api_clusters(
 ):
     """Paginated cluster listing."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         result = store.get_clusters_page(topic=topic, hours=hours, limit=limit, offset=offset)
         return {"clusters": result["clusters"], "total": result["total"], "topic": topic or "all", "hours": hours}
     except Exception as e:
@@ -1131,7 +1109,7 @@ def api_sentiment_series(
 ):
     """Sentiment time-series for Chart.js."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         series = store.get_sentiment_series(topic=topic, hours=hours, bucket_hours=bucket_hours)
         return {"series": series, "topic": topic or "all"}
     except Exception as e:
@@ -1144,7 +1122,7 @@ def api_entities(
 ):
     """Top entity frequencies."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         entities = store.get_entity_frequencies(hours=hours, limit=limit)
         return {"entities": entities, "hours": hours}
     except Exception as e:
@@ -1157,7 +1135,7 @@ def api_keywords(
 ):
     """Top keyword frequencies (thematic descriptors, excluding terms already counted as entities)."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         keywords = store.get_keyword_frequencies(hours=hours, limit=limit)
         return {"keywords": keywords, "hours": hours}
     except Exception as e:
@@ -1172,7 +1150,7 @@ def api_clusters_by_entity(
 ):
     """Return clusters matching an entity, filtered by event time via SQL junction table."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         return store.get_clusters_by_entity(
             entity=entity.strip().lower(),
             hours=hours,
@@ -1191,7 +1169,7 @@ def api_clusters_by_keyword(
 ):
     """Return clusters matching a keyword, filtered by event time via SQL junction table."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         return store.get_clusters_by_keyword(
             keyword=keyword.strip().lower(),
             hours=hours,
@@ -1205,7 +1183,7 @@ def api_clusters_by_keyword(
 def api_cluster_detail(cluster_id: str):
     """Full cluster detail for drill-down."""
     try:
-        store = DashboardStore(_shared_store)
+        store = _shared_store
         detail = store.get_cluster_detail(cluster_id)
         if not detail:
             return JSONResponse(status_code=404, content={"error": "Cluster not found", "id": cluster_id})

+ 294 - 3
news_mcp/storage/sqlite_store.py

@@ -12,6 +12,7 @@ from email.utils import parsedate_to_datetime
 from news_mcp.config import (
     NEWS_PRUNE_INTERVAL_HOURS,
     NEWS_PRUNING_ENABLED,
+    NEWS_REFRESH_INTERVAL_SECONDS,
     NEWS_RETENTION_DAYS,
 )
 from news_mcp.entity_normalize import normalize_entities
@@ -671,13 +672,34 @@ class SQLiteClusterStore:
         with self._conn() as conn:
             total_clusters = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
             total_entities = conn.execute("SELECT COUNT(*) FROM entity_metadata").fetchone()[0]
+            cluster_entities = conn.execute(
+                "SELECT COUNT(DISTINCT e.value) "
+                "FROM clusters, json_each(clusters.payload, '$.entities') AS e"
+            ).fetchone()[0]
             topic_counts = dict(conn.execute(
                 "SELECT topic, COUNT(*) FROM clusters GROUP BY topic"
             ).fetchall())
             last_refresh = self.get_meta("last_refresh_at")
             feeds = {}
-            for row in conn.execute("SELECT feed_key, last_hash, last_item_count, updated_at FROM feed_state"):
-                feeds[row[0]] = {"last_hash": row[1], "last_item_count": row[2], "updated_at": row[3]}
+            for row in conn.execute(
+                "SELECT feed_key, last_hash, last_item_count, enabled, updated_at FROM feed_state ORDER BY updated_at DESC"
+            ):
+                feeds[row[0]] = {
+                    "last_hash": row[1], "last_item_count": row[2],
+                    "enabled": bool(row[3]), "updated_at": row[4],
+                }
+            # Freshness: did a refresh happen recently? (within 2x the configured interval)
+            fresh = False
+            if last_refresh:
+                try:
+                    dt = datetime.fromisoformat(last_refresh.replace("Z", "+00:00"))
+                    if dt.tzinfo is None:
+                        dt = dt.replace(tzinfo=timezone.utc)
+                    age_hours = (datetime.now(timezone.utc) - dt).total_seconds() / 3600
+                    fresh = age_hours < max(1.0, NEWS_REFRESH_INTERVAL_SECONDS / 3600) * 2
+                except Exception:
+                    pass
+
             last_prune = self.get_meta(META_LAST_PRUNE_AT)
             prune_state = self.get_prune_state(
                 pruning_enabled=NEWS_PRUNING_ENABLED,
@@ -687,11 +709,14 @@ class SQLiteClusterStore:
             return {
                 "total_clusters": total_clusters,
                 "total_entities": total_entities,
+                "cluster_entities": cluster_entities,
                 "clusters_by_topic": topic_counts,
                 "last_refresh_at": last_refresh,
                 "last_prune_at": last_prune,
-                "prune_state": prune_state,
+                "data_fresh": fresh,
                 "feeds": feeds,
+                "feed_count": len(feeds),
+                "prune_state": prune_state,
             }
 
     def get_cluster_detail(self, cluster_id: str) -> dict[str, Any] | None:
@@ -730,3 +755,269 @@ class SQLiteClusterStore:
                 "summary_text": summary.get("mergedSummary", "") if summary else "",
                 "key_facts": summary.get("keyFacts", []) if summary else [],
             }
+
+    # ── Paginated Clusters ────────────────────────────────────────────
+
+    def get_clusters_page(
+        self,
+        topic: str | None = None,
+        hours: float = 24,
+        limit: int = 20,
+        offset: int = 0,
+    ) -> dict[str, Any]:
+        """Paginated cluster listing filtered by SQL payload_ts index.
+
+        Returns {"clusters": [...], "total": int}.
+        """
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
+        params: list = [cutoff]
+        if topic and topic != "all":
+            query += " AND topic = ?"
+            params.append(topic)
+        total = self._conn().execute(
+            f"SELECT COUNT(*) FROM ({query})", params
+        ).fetchone()[0]
+        query += " ORDER BY payload_ts DESC LIMIT ? OFFSET ?"
+        params.extend([limit, offset])
+
+        with self._conn() as conn:
+            rows = conn.execute(query, params).fetchall()
+
+        return {
+            "clusters": [
+                {
+                    "cluster_id": c.get("cluster_id", ""),
+                    "headline": c.get("headline", ""),
+                    "topic": c.get("topic", ""),
+                    "sentiment": c.get("sentiment", "neutral"),
+                    "sentimentScore": c.get("sentimentScore"),
+                    "importance": c.get("importance", 0),
+                    "entities": c.get("entities", []),
+                    "sources": c.get("sources", []),
+                    "timestamp": c.get("timestamp", ""),
+                    "keywords": c.get("keywords", []),
+                    "article_count": len(c.get("articles", [])),
+                }
+                for c in [json.loads(r[0]) for r in rows]
+            ],
+            "total": total,
+        }
+
+    # ── Sentiment Series ──────────────────────────────────────────────
+
+    def get_sentiment_series(
+        self,
+        topic: str | None = None,
+        hours: float = 24,
+        bucket_hours: float = 1,
+    ) -> list[dict[str, Any]]:
+        """Sentiment score averaged per time bucket.  Filters by payload_ts SQL index."""
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
+        params: list = [cutoff]
+        if topic and topic != "all":
+            query += " AND topic = ?"
+            params.append(topic)
+        query += " ORDER BY payload_ts ASC"
+
+        with self._conn() as conn:
+            rows = conn.execute(query, params).fetchall()
+
+        buckets: dict[datetime, list[float]] = {}
+        for (payload_text,) in rows:
+            c = json.loads(payload_text)
+            ts_str = c.get("timestamp")
+            score = c.get("sentimentScore")
+            if not ts_str or score is None:
+                continue
+            dt = datetime.fromisoformat(str(ts_str).strip().replace("Z", "+00:00"))  # tolerate a "Z" suffix on Python < 3.11
+            if dt.tzinfo is None:
+                dt = dt.replace(tzinfo=timezone.utc)
+            dt = dt.astimezone(timezone.utc)
+            bucket_key = dt.replace(minute=0, second=0, microsecond=0)
+            if bucket_hours > 1:
+                bucket_key = bucket_key.replace(
+                    hour=(bucket_key.hour // int(bucket_hours)) * int(bucket_hours)
+                )
+            buckets.setdefault(bucket_key, []).append(float(score))
+
+        return [
+            {
+                "time": bucket_key.isoformat(),
+                "avg_sentiment": round(sum(scores) / len(scores), 3),
+                "count": len(scores),
+                "min": round(min(scores), 3),
+                "max": round(max(scores), 3),
+            }
+            for bucket_key, scores in sorted(buckets.items())
+        ]
+
+    # ── Entity / Keyword Frequencies ──────────────────────────────────
+
+    def get_entity_frequencies(
+        self,
+        hours: float = 24,
+        limit: int = 30,
+    ) -> list[dict[str, Any]]:
+        """Top entities by mention count, using SQL junction table + payload_ts index."""
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        with self._conn() as conn:
+            rows = conn.execute(
+                """\
+                SELECT ce.entity, COUNT(*) as cnt
+                FROM cluster_entities ce
+                JOIN clusters c ON c.cluster_id = ce.cluster_id
+                WHERE c.payload_ts >= ?
+                GROUP BY ce.entity
+                ORDER BY cnt DESC
+                LIMIT ?
+                """,
+                (cutoff, limit),
+            ).fetchall()
+
+        result: list[dict[str, Any]] = []
+        for label, count in rows:
+            meta = self.get_entity_metadata(label)
+            result.append({
+                "label": label,
+                "count": count,
+                "canonical_label": meta["canonical_label"] if meta else label,
+                "mid": meta["mid"] if meta else None,
+            })
+        return result
+
+    def get_keyword_frequencies(
+        self,
+        hours: float = 24,
+        limit: int = 30,
+    ) -> list[dict[str, Any]]:
+        """Top keywords by mention count, using SQL junction table + payload_ts index.
+
+        Excludes DEFAULT_TOPICS labels (crypto, macro, regulation, ai, other).
+        """
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        _topic_labels = ("crypto", "macro", "regulation", "ai", "other")
+        excluded = ",".join("?" for _ in _topic_labels)
+        with self._conn() as conn:
+            rows = conn.execute(
+                f"""\
+                SELECT ck.keyword, COUNT(*) as cnt
+                FROM cluster_keywords ck
+                JOIN clusters c ON c.cluster_id = ck.cluster_id
+                WHERE c.payload_ts >= ? AND LOWER(ck.keyword) NOT IN ({excluded})
+                GROUP BY ck.keyword
+                ORDER BY cnt DESC
+                LIMIT ?
+                """,
+                (cutoff, *_topic_labels, limit),
+            ).fetchall()
+
+        # Topic labels are excluded in SQL so LIMIT applies to the filtered result.
+        return [
+            {"label": label, "count": count}
+            for label, count in rows
+        ]
+
+    # ── Junction-Table Entity / Keyword Cluster Search ────────────────
+
+    def get_clusters_by_entity(
+        self,
+        entity: str,
+        hours: float = 168,
+        limit: int = 50,
+        offset: int = 0,
+    ) -> dict[str, Any]:
+        """Return clusters matching an entity, SQL-level filter via junction table.
+
+        Returns {"entity": ..., "clusters": [...], "total": int, "hours": float}.
+        """
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        entity_norm = entity.strip().lower()
+
+        with self._conn() as conn:
+            total = conn.execute(
+                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
+                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
+                "WHERE c.payload_ts >= ? AND ce.entity = ?",
+                (cutoff, entity_norm),
+            ).fetchone()[0]
+
+            rows = conn.execute(
+                "SELECT DISTINCT c.payload FROM clusters c "
+                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
+                "WHERE c.payload_ts >= ? AND ce.entity = ? "
+                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
+                (cutoff, entity_norm, limit, offset),
+            ).fetchall()
+
+        return {
+            "entity": entity_norm,
+            "clusters": [json.loads(r[0]) for r in rows],
+            "total": total,
+            "hours": hours,
+        }
+
+    def get_clusters_by_keyword(
+        self,
+        keyword: str,
+        hours: float = 168,
+        limit: int = 50,
+        offset: int = 0,
+    ) -> dict[str, Any]:
+        """Return clusters matching a keyword, SQL-level filter via junction table."""
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        kw_norm = keyword.strip().lower()
+
+        with self._conn() as conn:
+            total = conn.execute(
+                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
+                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
+                "WHERE c.payload_ts >= ? AND ck.keyword = ?",
+                (cutoff, kw_norm),
+            ).fetchone()[0]
+
+            rows = conn.execute(
+                "SELECT DISTINCT c.payload FROM clusters c "
+                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
+                "WHERE c.payload_ts >= ? AND ck.keyword = ? "
+                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
+                (cutoff, kw_norm, limit, offset),
+            ).fetchall()
+
+        return {
+            "keyword": kw_norm,
+            "clusters": [json.loads(r[0]) for r in rows],
+            "total": total,
+            "hours": hours,
+        }
+
+    def get_clusters_by_entity_or_keyword(
+        self,
+        query_terms: set[str],
+        hours: float,
+        limit: int,
+    ) -> list[dict]:
+        """Search clusters by matching ANY query term against entities OR keywords.
+
+        Uses SQL-level junction-table filtering — no row-limit blind spot.
+        Returns clusters sorted by recency.
+        """
+        terms = [q.strip().lower() for q in query_terms if q.strip()]
+        if not terms:
+            return []
+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
+        placeholders = ",".join("?" for _ in terms)
+
+        with self._conn() as conn:
+            rows = conn.execute(
+                f"SELECT DISTINCT c.payload FROM clusters c "
+                f"LEFT JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
+                f"LEFT JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
+                f"WHERE c.payload_ts >= ? "
+                f"  AND (ce.entity IN ({placeholders}) OR ck.keyword IN ({placeholders})) "
+                f"ORDER BY c.payload_ts DESC LIMIT ?",
+                (cutoff, *terms, *terms, limit),
+            ).fetchall()
+
+        return [json.loads(r[0]) for r in rows]
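
The row-limit blind spot described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the project's code: the three-table schema, the helper names (build_demo_db, search_top_n_then_filter, search_sql) and the sample data are all assumptions made for illustration. Only the shape of the final query is taken from get_clusters_by_entity_or_keyword() above.

```python
import json
import sqlite3

# Hypothetical minimal schema; the real tables may carry more columns.
SCHEMA = """
CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, payload TEXT, payload_ts TEXT);
CREATE TABLE cluster_entities (cluster_id TEXT, entity TEXT);
CREATE TABLE cluster_keywords (cluster_id TEXT, keyword TEXT);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create ten clusters; only the oldest one (c0) mentions 'bitcoin'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for i in range(10):
        cid = f"c{i}"
        ts = f"2024-01-01T{i:02d}:00:00+00:00"
        entities = ["bitcoin"] if i == 0 else ["fed"]
        payload = json.dumps({"cluster_id": cid, "entities": entities})
        conn.execute("INSERT INTO clusters VALUES (?, ?, ?)", (cid, payload, ts))
        for entity in entities:
            conn.execute("INSERT INTO cluster_entities VALUES (?, ?)", (cid, entity))
        conn.execute("INSERT INTO cluster_keywords VALUES (?, ?)", (cid, "markets"))
    return conn


def search_top_n_then_filter(conn, term, cutoff, fetch_limit):
    """Old approach: fetch the newest N rows, then substring-match in Python."""
    rows = conn.execute(
        "SELECT payload FROM clusters WHERE payload_ts >= ? "
        "ORDER BY payload_ts DESC LIMIT ?",
        (cutoff, fetch_limit),
    ).fetchall()
    clusters = [json.loads(r[0]) for r in rows]
    return [c for c in clusters if any(term in e for e in c["entities"])]


def search_sql(conn, terms, cutoff, limit):
    """New approach: filter through the junction tables before applying LIMIT."""
    terms = [t.strip().lower() for t in terms if t.strip()]
    if not terms:
        return []
    ph = ",".join("?" for _ in terms)
    rows = conn.execute(
        "SELECT DISTINCT c.payload FROM clusters c "
        "LEFT JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
        "LEFT JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
        "WHERE c.payload_ts >= ? "
        f"  AND (ce.entity IN ({ph}) OR ck.keyword IN ({ph})) "
        "ORDER BY c.payload_ts DESC LIMIT ?",
        (cutoff, *terms, *terms, limit),
    ).fetchall()
    return [json.loads(r[0]) for r in rows]


conn = build_demo_db()
cutoff = "2024-01-01T00:00:00+00:00"
old_hits = search_top_n_then_filter(conn, "bitcoin", cutoff, fetch_limit=5)
new_hits = search_sql(conn, {"Bitcoin"}, cutoff, limit=5)
print(len(old_hits), [c["cluster_id"] for c in new_hits])  # prints: 0 ['c0']
```

With a fetch limit of 5, the old approach only ever sees c9 through c5 and reports no match, while the junction-table query finds c0 because the filter runs before LIMIT. One trade-off visible in the sketch: the IN match is exact, so a term such as "bit" no longer matches "bitcoin" the way the substring check did.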