1 week ago · f8677e48b5
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -53,9 +53,9 @@ This project spans two machines. **Always check which machine you're operating o
 docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
 ```
-## Design Flaw: Two Stores
+## Design Flaw: Two Stores (FIXED May 2026)
-`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools (`get_events_for_entity`, `get_news_sentiment`) still use `SQLiteClusterStore` Python-side entity matching with a row limit (top 200), missing entities in older clusters. See PROJECT.md for full analysis and proposed fix.
+`DashboardStore` was eliminated. `SQLiteClusterStore` is the single data access layer with junction-table entity/keyword search. All MCP tools use the proper SQL methods.
 ## Docker / Live Server Details
 - `docker-compose.yml` mounts `./:/app` with `working_dir: /app`
--- a/OUTLOOK.md
+++ b/OUTLOOK.md
@@ -33,11 +33,11 @@ See PROJECT.md for full schema and architecture. Key points:
 ## Known Design Issues
-### Two Stores (see PROJECT.md § "Design Flaw")
-`SQLiteClusterStore` and `DashboardStore` are parallel copies. Only `DashboardStore` was updated with junction-table entity search. MCP tools still use Python-side entity matching with a row limit. Proposed fix: collapse into single data access layer.
+### Two Stores (FIXED, May 2026)
+`DashboardStore` was eliminated. All methods moved to `SQLiteClusterStore`. MCP tools now use SQL-level junction-table entity/keyword search via `get_clusters_by_entity_or_keyword()` — no row-limit blind spot.
-### MCP Tool Entity Search
-`get_events_for_entity` and `get_news_sentiment` fetch top-N clusters by time then filter entities in Python. Entities in clusters beyond the limit are missed. Fix: use junction table `get_clusters_by_entity()`.
+### MCP Tool Entity Search (FIXED, May 2026)
+`get_events_for_entity` and `get_news_sentiment` now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` with proper SQL-level filtering across the full time window via the `cluster_entities` and `cluster_keywords` junction tables.
 ## Backfill Scripts
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -82,20 +82,24 @@ Keywords extracted by the LLM are now first-class search signals:
 - Dashboard Keywords panel with SQL frequency counts via junction table
 - Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
+## Two-Store Collapse (done, May 2026)
+
+`DashboardStore` has been eliminated. All of its methods were moved into `SQLiteClusterStore` (the single data access layer), and the REST API routes now use the shared `SQLiteClusterStore` instance directly.
+
+All MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) now use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` which searches via junction-table SQL joins — no row-limit blind spot. The `cluster_entities` and `cluster_keywords` junction tables are indexed for O(log n) lookup across any time window.
+
 ## Timestamp Pipeline (May 2026)
 1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three missing, falls back to `datetime.now()`.
 2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
 3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
 4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
-## Design Flaw: Two Stores (KNOWN, fix planned)
-
-**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when DashboardStore was added, with the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still does full-table JSON parsing for entity/keyword search.
-
-**Current state:**
-- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
-- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python
+## Design Flaw: Two Stores (FIXED, May 2026)
-**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
+**What happened:** `DashboardStore` was a thin read-only query layer that wrapped `SQLiteClusterStore`. The MCP tools (`get_events_for_entity`, `get_news_sentiment`, `get_latest_events` entity mode) did Python-side entity matching by fetching top-N clusters via `payload_ts` then filtering in Python. Entities in clusters beyond the limit were silently missed.
-**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.
+**Fix applied:**
+- `DashboardStore` was deleted. All its methods are now in `SQLiteClusterStore`.
+- All MCP tools use `SQLiteClusterStore.get_clusters_by_entity_or_keyword()` — SQL-level junction-table search with no row-limit blind spot.
+- The combined method uses `LEFT JOIN` on `cluster_entities` and `cluster_keywords` with `WHERE (ce.entity IN (...) OR ck.keyword IN (...))`, which matches both named entities and thematic keywords across any time window.
+- Exact matching (via `IN`) replaced substring matching — more correct, no false positives from partial name matches.
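
As a reading aid for the PROJECT.md change above, here is a minimal sketch of the row-limit blind spot and of a junction-table query that avoids it. It is not the project's code. The schema is a simplified guess (here `payload_ts` is a plain column rather than a generated one), the function names `search_top_n_then_filter` and `search_by_entity_or_keyword` and the demo data are hypothetical, and it assumes junction rows are stored lowercased, as the deleted `get_clusters_by_entity` normalized its input.

```python
import json
import sqlite3

# Hypothetical, simplified schema; the real tables may have more columns.
SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    payload_ts TEXT NOT NULL
);
CREATE TABLE cluster_entities (cluster_id TEXT, entity TEXT);
CREATE TABLE cluster_keywords (cluster_id TEXT, keyword TEXT);
CREATE INDEX idx_clusters_ts ON clusters(payload_ts);
CREATE INDEX idx_ce_entity ON cluster_entities(entity);
CREATE INDEX idx_ck_keyword ON cluster_keywords(keyword);
"""


def build_demo_db():
    """Ten clusters, one per day; the entity appears in c1 (old) and c8 (new)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for i in range(10):
        cid = f"c{i}"
        entities = ["Pete Hegseth"] if i in (1, 8) else []
        keywords = ["defense"] if i == 4 else []
        ts = f"2026-05-{i + 1:02d}T00:00:00+00:00"
        conn.execute(
            "INSERT INTO clusters VALUES (?, ?, ?)",
            (cid, json.dumps({"entities": entities}), ts),
        )
        conn.executemany(
            "INSERT INTO cluster_entities VALUES (?, ?)",
            [(cid, e.lower()) for e in entities],
        )
        conn.executemany(
            "INSERT INTO cluster_keywords VALUES (?, ?)",
            [(cid, k.lower()) for k in keywords],
        )
    return conn


def search_top_n_then_filter(conn, term, cutoff, top_n):
    """Old approach: apply LIMIT first, then match entities in Python."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters "
        "WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ?",
        (cutoff, top_n),
    ).fetchall()
    term = term.strip().lower()
    return [
        cid
        for cid, payload in rows
        if term in [e.lower() for e in json.loads(payload).get("entities", [])]
    ]


def search_by_entity_or_keyword(conn, terms, cutoff, limit=50):
    """Junction-table approach: filter in SQL, apply LIMIT to the matches only."""
    terms = [t.strip().lower() for t in terms]
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    sql = f"""
        SELECT DISTINCT c.cluster_id, c.payload_ts
        FROM clusters c
        LEFT JOIN cluster_entities ce ON ce.cluster_id = c.cluster_id
        LEFT JOIN cluster_keywords ck ON ck.cluster_id = c.cluster_id
        WHERE c.payload_ts >= ?
          AND (ce.entity IN ({marks}) OR ck.keyword IN ({marks}))
        ORDER BY c.payload_ts DESC
        LIMIT ?
    """
    rows = conn.execute(sql, [cutoff, *terms, *terms, limit]).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
cutoff = "2026-05-01T00:00:00+00:00"
print(search_top_n_then_filter(conn, "Pete Hegseth", cutoff, top_n=3))  # ['c8']
print(search_by_entity_or_keyword(conn, ["Pete Hegseth"], cutoff))      # ['c8', 'c1']
```

With a limit of 3, the first function only ever inspects the three newest clusters, so the older match `c1` is silently dropped. The second function filters before limiting, so the limit caps the number of matches rather than the number of candidates.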
														
--- a/news_mcp/dashboard/__init__.py
+++ b/news_mcp/dashboard/__init__.py
--- a/news_mcp/dashboard/dashboard_store.py
+++ b/news_mcp/dashboard/dashboard_store.py
@@ -1,357 +0,0 @@
-from __future__ import annotations
-
-import json
-from datetime import datetime, timedelta, timezone
-from typing import Any
-
-from news_mcp.config import (
-    NEWS_PRUNE_INTERVAL_HOURS,
-    NEWS_PRUNING_ENABLED,
-    NEWS_REFRESH_INTERVAL_SECONDS,
-    NEWS_RETENTION_DAYS,
-    DEFAULT_TOPICS,
-)
-from news_mcp.storage.sqlite_store import SQLiteClusterStore
-
-
-class DashboardStore:
-    """Read-only query layer for the dashboard."""
-
-    def __init__(self, store=None):
-        if store is not None:
-            self._store = store
-        else:
-            from news_mcp.config import DB_PATH
-            self._store = SQLiteClusterStore(DB_PATH)
-
-    # ── Health & Stats ──────────────────────────────────────────────
-
-    def get_dashboard_stats(self) -> dict[str, Any]:
-        with self._store._conn() as conn:
-            total_clusters = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
-            total_entities = conn.execute("SELECT COUNT(*) FROM entity_metadata").fetchone()[0]
-            cluster_entities = conn.execute(
-                "SELECT COUNT(DISTINCT e.value) "
-                "FROM clusters, json_each(clusters.payload, '$.entities') AS e"
-            ).fetchone()[0]
-            topic_counts = dict(conn.execute(
-                "SELECT topic, COUNT(*) FROM clusters GROUP BY topic"
-            ).fetchall())
-
-        last_refresh = self._store.get_meta("last_refresh_at")
-        last_prune = self._store.get_meta("last_prune_at")
-
-        # Freshness: did a refresh happen recently? (within 2x the configured interval)
-        fresh = False
-        if last_refresh:
-            try:
-                dt = datetime.fromisoformat(last_refresh.replace("Z", "+00:00"))
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                age_hours = (datetime.now(timezone.utc) - dt).total_seconds() / 3600
-                fresh = age_hours < max(1.0, NEWS_REFRESH_INTERVAL_SECONDS / 3600) * 2
-            except Exception:
-                pass
-
-        feeds = {}
-        with self._store._conn() as conn:
-            for row in conn.execute("SELECT feed_key, last_hash, last_item_count, enabled, updated_at FROM feed_state ORDER BY updated_at DESC"):
-                feeds[row[0]] = {"last_hash": row[1], "last_item_count": row[2], "enabled": bool(row[3]), "updated_at": row[4]}
-
-        return {
-            "total_clusters": total_clusters,
-            "total_entities": total_entities,
-            "cluster_entities": cluster_entities,
-            "clusters_by_topic": topic_counts,
-            "last_refresh_at": last_refresh,
-            "last_prune_at": last_prune,
-            "data_fresh": fresh,
-            "feeds": feeds,
-            "feed_count": len(feeds),
-            "pruning": {
-                "enabled": NEWS_PRUNING_ENABLED,
-                "retention_days": NEWS_RETENTION_DAYS,
-                "interval_hours": NEWS_PRUNE_INTERVAL_HOURS,
-                "last_prune_at": last_prune,
-            },
-        }
-
-    # ── Clusters ────────────────────────────────────────────────────
-
-    def get_clusters_page(
-        self,
-        topic: str | None = None,
-        hours: float = 24,
-        limit: int = 20,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Paginated cluster listing filtered by SQL payload_ts index.
-
-        Returns {"clusters": [...], "total": int}.
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-
-        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
-        params: list = [cutoff]
-        if topic and topic != "all":
-            query += " AND topic = ?"
-            params.append(topic)
-        # Get total count before pagination
-        total = self._store._conn().execute(
-            f"SELECT COUNT(*) FROM ({query})", params
-        ).fetchone()[0]
-        query += " ORDER BY payload_ts DESC LIMIT ? OFFSET ?"
-        params.extend([limit, offset])
-
-        with self._store._conn() as conn:
-            rows = conn.execute(query, params).fetchall()
-
-        return {
-            "clusters": [
-                {
-                    "cluster_id": c.get("cluster_id", ""),
-                    "headline": c.get("headline", ""),
-                    "topic": c.get("topic", ""),
-                    "sentiment": c.get("sentiment", "neutral"),
-                    "sentimentScore": c.get("sentimentScore"),
-                    "importance": c.get("importance", 0),
-                    "entities": c.get("entities", []),
-                    "sources": c.get("sources", []),
-                    "timestamp": c.get("timestamp", ""),
-                    "keywords": c.get("keywords", []),
-                    "article_count": len(c.get("articles", [])),
-                }
-                for c in [json.loads(r[0]) for r in rows]
-            ],
-            "total": total,
-        }
-
-    def get_cluster_detail(self, cluster_id: str) -> dict[str, Any] | None:
-        with self._store._conn() as conn:
-            cur = conn.execute(
-                "SELECT payload FROM clusters WHERE cluster_id = ?", (cluster_id,)
-            )
-            row = cur.fetchone()
-            if not row:
-                return None
-            c = json.loads(row[0])
-            summary = None
-            if c.get("summary_payload"):
-                try:
-                    summary = json.loads(c["summary_payload"])
-                except Exception:
-                    pass
-            return {
-                "cluster_id": c.get("cluster_id"),
-                "headline": c.get("headline", ""),
-                "summary": c.get("summary", ""),
-                "topic": c.get("topic", ""),
-                "sentiment": c.get("sentiment", "neutral"),
-                "sentimentScore": c.get("sentimentScore"),
-                "importance": c.get("importance", 0),
-                "entities": c.get("entities", []),
-                "entityResolutions": c.get("entityResolutions", []),
-                "keywords": c.get("keywords", []),
-                "sources": c.get("sources", []),
-                "timestamp": c.get("timestamp", ""),
-                "first_seen": c.get("first_seen", ""),
-                "last_updated": c.get("last_updated", ""),
-                "article_count": len(c.get("articles", [])),
-                "articles": c.get("articles", []),
-                "summary_text": summary.get("mergedSummary", "") if summary else "",
-                "key_facts": summary.get("keyFacts", []) if summary else [],
-            }
-
-    # ── Sentiment Series ────────────────────────────────────────────
-
-    def get_sentiment_series(
-            self,
-            topic: str | None = None,
-            hours: float = 24,
-            bucket_hours: float = 1,
-        ) -> list[dict[str, Any]]:
-        """Sentiment score averaged per time bucket.
-
-        Filters by payload_ts SQL index.
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
-        params: list = [cutoff]
-        if topic and topic != "all":
-            query += " AND topic = ?"
-            params.append(topic)
-        query += " ORDER BY payload_ts ASC"
-
-        with self._store._conn() as conn:
-            rows = conn.execute(query, params).fetchall()
-
-        buckets: dict[datetime, list[float]] = {}
-        for (payload_text,) in rows:
-            c = json.loads(payload_text)
-            ts_str = c.get("timestamp")
-            score = c.get("sentimentScore")
-            if not ts_str or score is None:
-                continue
-            dt = datetime.fromisoformat(str(ts_str).strip())
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            dt = dt.astimezone(timezone.utc)
-            bucket_key = dt.replace(minute=0, second=0, microsecond=0)
-            if bucket_hours > 1:
-                bucket_key = bucket_key.replace(
-                    hour=(bucket_key.hour // int(bucket_hours)) * int(bucket_hours)
-                )
-            buckets.setdefault(bucket_key, []).append(float(score))
-
-        return [
-            {
-                "time": bucket_key.isoformat(),
-                "avg_sentiment": round(sum(scores) / len(scores), 3),
-                "count": len(scores),
-                "min": round(min(scores), 3),
-                "max": round(max(scores), 3),
-            }
-            for bucket_key, scores in sorted(buckets.items())
-        ]
-
-    # ── Entity Frequencies ──────────────────────────────────────────
-
-    def get_entity_frequencies(
-        self,
-        hours: float = 24,
-        limit: int = 30,
-    ) -> list[dict[str, Any]]:
-        """Top entities by mention count, using SQL junction table + payload_ts index."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-
-        with self._store._conn() as conn:
-            rows = conn.execute(
-                """
-                SELECT ce.entity, COUNT(*) as cnt
-                FROM cluster_entities ce
-                JOIN clusters c ON c.cluster_id = ce.cluster_id
-                WHERE c.payload_ts >= ?
-                GROUP BY ce.entity
-                ORDER BY cnt DESC
-                LIMIT ?
-                """,
-                (cutoff, limit),
-            ).fetchall()
-
-        result: list[dict[str, Any]] = []
-        for label, count in rows:
-            meta = self._store.get_entity_metadata(label)
-            result.append({
-                "label": label,
-                "count": count,
-                "canonical_label": meta["canonical_label"] if meta else label,
-                "mid": meta["mid"] if meta else None,
-            })
-        return result
-
-    # ── Keyword Frequencies ─────────────────────────────────────────
-
-    def get_keyword_frequencies(
-        self,
-        hours: float = 24,
-        limit: int = 30,
-    ) -> list[dict[str, Any]]:
-        """Top keywords by mention count, using SQL junction table + payload_ts index.
-
-        Excludes DEFAULT_TOPICS labels (crypto, macro, regulation, ai, other).
-        """
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        _topic_labels = {t.lower() for t in DEFAULT_TOPICS}
-
-        with self._store._conn() as conn:
-            rows = conn.execute(
-                """
-                SELECT ck.keyword, COUNT(*) as cnt
-                FROM cluster_keywords ck
-                JOIN clusters c ON c.cluster_id = ck.cluster_id
-                WHERE c.payload_ts >= ?
-                GROUP BY ck.keyword
-                ORDER BY cnt DESC
-                LIMIT ?
-                """,
-                (cutoff, limit),
-            ).fetchall()
-
-        return [
-            {"label": label, "count": count}
-            for label, count in rows
-            if label.lower() not in _topic_labels
-        ]
-
-    # ── Entity/Keyword Cluster Search ────────────────────────────────
-
-    def get_clusters_by_entity(
-        self,
-        entity: str,
-        hours: float = 168,
-        limit: int = 50,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Return clusters matching an entity, SQL-level filter via junction table."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        entity_norm = entity.strip().lower()
-
-        with self._store._conn() as conn:
-            # Total count
-            total = conn.execute(
-                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
-                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
-                "WHERE c.payload_ts >= ? AND ce.entity = ?",
-                (cutoff, entity_norm),
-            ).fetchone()[0]
-
-            # Paginated results
-            rows = conn.execute(
-                "SELECT DISTINCT c.payload FROM clusters c "
-                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
-                "WHERE c.payload_ts >= ? AND ce.entity = ? "
-                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
-                (cutoff, entity_norm, limit, offset),
-            ).fetchall()
-
-        return {
-            "entity": entity_norm,
-            "clusters": [json.loads(r[0]) for r in rows],
-            "total": total,
-            "hours": hours,
-        }
-
-    def get_clusters_by_keyword(
-        self,
-        keyword: str,
-        hours: float = 168,
-        limit: int = 50,
-        offset: int = 0,
-    ) -> dict[str, Any]:
-        """Return clusters matching a keyword, SQL-level filter via junction table."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        kw_norm = keyword.strip().lower()
-
-        with self._store._conn() as conn:
-            total = conn.execute(
-                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
-                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
-                "WHERE c.payload_ts >= ? AND ck.keyword = ?",
-                (cutoff, kw_norm),
-            ).fetchone()[0]
-
-            rows = conn.execute(
-                "SELECT DISTINCT c.payload FROM clusters c "
-                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
-                "WHERE c.payload_ts >= ? AND ck.keyword = ? "
-                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
-                (cutoff, kw_norm, limit, offset),
-            ).fetchall()
-
-        return {
-            "keyword": kw_norm,
-            "clusters": [json.loads(r[0]) for r in rows],
-            "total": total,
-            "hours": hours,
-        }
-
--- a/news_mcp/mcp_server_fastmcp.py
+++ b/news_mcp/mcp_server_fastmcp.py
@@ -28,7 +28,6 @@ from news_mcp.config import (
 
															 )
														
 
															 from news_mcp.jobs.poller import refresh_clusters
														
 
															 from news_mcp.storage.sqlite_store import SQLiteClusterStore
														
 
															-from news_mcp.dashboard.dashboard_store import DashboardStore
														
 
															 from news_mcp.enrichment.llm_enrich import summarize_cluster_llm
														
 
															 from news_mcp.trends_resolution import resolve_entity_via_trends
														
 
															 from news_mcp.llm import active_llm_config
														
@@ -369,16 +368,10 @@ async def get_latest_events(topic: str | None = None, limit: int = 5, include_ar
 
															             clusters = store.get_latest_clusters(topic=topic_norm, ttl_hours=DEFAULT_LOOKBACK_HOURS, limit=limit)
														
 
															     else:
														
 
															         # Entity-aware mode: search recent clusters across all topics and match by
														
 
															-        # raw entity, canonical label, or MID.
														
 
															-        clusters = store.get_latest_clusters_all_topics(ttl_hours=DEFAULT_LOOKBACK_HOURS, limit=limit * 8)
														
 
															-        filtered = []
														
 
															-        for c in clusters:
														
 
															-            haystack = _cluster_entity_haystack(c)
														
 
															-            if any(any(term in item for item in haystack) for term in query_terms):
														
 
															-                filtered.append(c)
														
 
															-            if len(filtered) >= limit:
														
 
															-                break
														
 
															-        clusters = filtered
														
 
															+        # raw entity, canonical label, or MID using SQL-level junction table search.
														
 
															+        clusters = store.get_clusters_by_entity_or_keyword(
														
 
															+            query_terms=query_terms, hours=DEFAULT_LOOKBACK_HOURS, limit=limit
														
 
															+        )
														
 
															     out = []
														
 
															     for c in _sort_clusters_by_recency(clusters):
														
@@ -429,19 +422,8 @@ async def get_events_for_entity(entity: str, limit: int = 10, timeframe: str = "
 
															     store = SQLiteClusterStore(DB_PATH)
														
 
															-    def _match_clusters(clusters: list[dict]) -> list[dict]:
														
 
															-        hits: list[dict] = []
														
 
															-        for c in _sort_clusters_by_recency(clusters):
														
 
															-            haystack = _cluster_entity_haystack(c)
														
 
															-            if any(any(term in item for item in haystack) for term in query_terms):
														
 
															-                hits.append(c)
														
 
															-            if len(hits) >= limit:
														
 
															-                break
														
 
															-        return hits
														
 
															-
														
 
															     hours = _parse_timeframe_to_hours(timeframe)
														
 
															-    clusters = store.get_latest_clusters_all_topics(ttl_hours=hours, limit=max(200, limit * 10))
														
 
															-    hits = _match_clusters(clusters)
														
 
															+    hits = store.get_clusters_by_entity_or_keyword(query_terms=query_terms, hours=hours, limit=limit)
														
 
															     out = []
														
 
															     for c in hits:
														
@@ -903,12 +885,8 @@ async def get_news_sentiment(entity: str, timeframe: str = "24h"):
 
															         hours = 24
														
 
															     hours = max(1, min(int(hours), 168))
														
 
															-    clusters = store.get_latest_clusters_all_topics(ttl_hours=hours, limit=500)
														
 
															-    matched = []
														
 
															-    for c in clusters:
														
 
															-        haystack = _cluster_entity_haystack(c)
														
 
															-        if any(any(term in item for item in haystack) for term in query_terms):
														
 
															-            matched.append(c)
														
 
															+    clusters = store.get_clusters_by_entity_or_keyword(query_terms=query_terms, hours=hours, limit=500)
														
 
															+    matched = clusters
														
 
															     if not matched:
														
 
															         return {
														
@@ -1101,7 +1079,7 @@ def _api_err(exc: Exception, ctx: str) -> JSONResponse:
 
															 def api_health():
														
 
															     """Extended health + dashboard stats."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         stats = store.get_dashboard_stats()
														
 
															         stats["version"] = _VERSION_HASH
														
 
															         return stats
														
@@ -1117,7 +1095,7 @@ def api_clusters(
 
															 ):
														
 
															     """Paginated cluster listing."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         result = store.get_clusters_page(topic=topic, hours=hours, limit=limit, offset=offset)
														
 
															         return {"clusters": result["clusters"], "total": result["total"], "topic": topic or "all", "hours": hours}
														
 
															     except Exception as e:
														
@@ -1131,7 +1109,7 @@ def api_sentiment_series(
 
															 ):
														
 
															     """Sentiment time-series for Chart.js."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         series = store.get_sentiment_series(topic=topic, hours=hours, bucket_hours=bucket_hours)
														
 
															         return {"series": series, "topic": topic or "all"}
														
 
															     except Exception as e:
														
@@ -1144,7 +1122,7 @@ def api_entities(
 
															 ):
														
 
															     """Top entity frequencies."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         entities = store.get_entity_frequencies(hours=hours, limit=limit)
														
 
															         return {"entities": entities, "hours": hours}
														
 
															     except Exception as e:
														
@@ -1157,7 +1135,7 @@ def api_keywords(
 
															 ):
														
 
															     """Top keyword frequencies (thematic descriptors, excluding terms already counted as entities)."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         keywords = store.get_keyword_frequencies(hours=hours, limit=limit)
														
 
															         return {"keywords": keywords, "hours": hours}
														
 
															     except Exception as e:
														
@@ -1172,7 +1150,7 @@ def api_clusters_by_entity(
 
															 ):
														
 
															     """Return clusters matching an entity, filtered by event time via SQL junction table."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         return store.get_clusters_by_entity(
														
 
															             entity=entity.strip().lower(),
														
 
															             hours=hours,
														
@@ -1191,7 +1169,7 @@ def api_clusters_by_keyword(
 
															 ):
														
 
															     """Return clusters matching a keyword, filtered by event time via SQL junction table."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         return store.get_clusters_by_keyword(
														
 
															             keyword=keyword.strip().lower(),
														
 
															             hours=hours,
														
@@ -1205,7 +1183,7 @@ def api_clusters_by_keyword(
 
															 def api_cluster_detail(cluster_id: str):
														
 
															     """Full cluster detail for drill-down."""
														
 
															     try:
														
 
															-        store = DashboardStore(_shared_store)
														
 
															+        store = _shared_store
														
 
															         detail = store.get_cluster_detail(cluster_id)
														
 
															         if not detail:
														
 
															             return JSONResponse(status_code=404, content={"error": "Cluster not found", "id": cluster_id})
														
--- a/news_mcp/storage/sqlite_store.py
+++ b/news_mcp/storage/sqlite_store.py
@@ -12,6 +12,7 @@ from email.utils import parsedate_to_datetime
 
															 from news_mcp.config import (
														
 
															     NEWS_PRUNE_INTERVAL_HOURS,
														
 
															     NEWS_PRUNING_ENABLED,
														
 
															+    NEWS_REFRESH_INTERVAL_SECONDS,
														
 
															     NEWS_RETENTION_DAYS,
														
 
															 )
														
 
															 from news_mcp.entity_normalize import normalize_entities
														
@@ -671,13 +672,34 @@ class SQLiteClusterStore:
 
															         with self._conn() as conn:
														
 
															             total_clusters = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
														
 
															             total_entities = conn.execute("SELECT COUNT(*) FROM entity_metadata").fetchone()[0]
														
 
															+            cluster_entities = conn.execute(
														
 
															+                "SELECT COUNT(DISTINCT e.value) "
														
 
															+                "FROM clusters, json_each(clusters.payload, '$.entities') AS e"
														
 
															+            ).fetchone()[0]
														
 
															             topic_counts = dict(conn.execute(
														
 
															                 "SELECT topic, COUNT(*) FROM clusters GROUP BY topic"
														
 
															             ).fetchall())
														
 
															             last_refresh = self.get_meta("last_refresh_at")
														
 
															             feeds = {}
														
 
															-            for row in conn.execute("SELECT feed_key, last_hash, last_item_count, updated_at FROM feed_state"):
														
 
															-                feeds[row[0]] = {"last_hash": row[1], "last_item_count": row[2], "updated_at": row[3]}
														
 
															+            for row in conn.execute(
														
 
															+                "SELECT feed_key, last_hash, last_item_count, enabled, updated_at FROM feed_state ORDER BY updated_at DESC"
														
 
															+            ):
														
 
															+                feeds[row[0]] = {
														
 
															+                    "last_hash": row[1], "last_item_count": row[2],
														
 
															+                    "enabled": bool(row[3]), "updated_at": row[4],
														
 
															+                }
														
 
															+            # Freshness: did a refresh happen recently? (within 2x the configured interval, with the interval floored at 1 hour)
														
 
															+            fresh = False
														
 
															+            if last_refresh:
														
 
															+                try:
														
 
															+                    dt = datetime.fromisoformat(last_refresh.replace("Z", "+00:00"))
														
 
															+                    if dt.tzinfo is None:
														
 
															+                        dt = dt.replace(tzinfo=timezone.utc)
														
 
															+                    age_hours = (datetime.now(timezone.utc) - dt).total_seconds() / 3600
														
 
															+                    fresh = age_hours < max(1.0, NEWS_REFRESH_INTERVAL_SECONDS / 3600) * 2
														
 
															+                except Exception:
														
 
															+                    pass
														
 
															+
														
 
															             last_prune = self.get_meta(META_LAST_PRUNE_AT)
														
 
															             prune_state = self.get_prune_state(
														
 
															                 pruning_enabled=NEWS_PRUNING_ENABLED,
														
@@ -687,11 +709,14 @@ class SQLiteClusterStore:
 
															             return {
														
 
															                 "total_clusters": total_clusters,
														
 
															                 "total_entities": total_entities,
														
 
															+                "cluster_entities": cluster_entities,
														
 
															                 "clusters_by_topic": topic_counts,
														
 
															                 "last_refresh_at": last_refresh,
														
 
															                 "last_prune_at": last_prune,
														
 
															-                "prune_state": prune_state,
														
 
															+                "data_fresh": fresh,
														
 
															                 "feeds": feeds,
														
 
															+                "feed_count": len(feeds),
														
 
															+                "prune_state": prune_state,
														
 
															             }
														
 
															     def get_cluster_detail(self, cluster_id: str) -> dict[str, Any] | None:
														
@@ -730,3 +755,269 @@ class SQLiteClusterStore:
 
															                 "summary_text": summary.get("mergedSummary", "") if summary else "",
														
 
															                 "key_facts": summary.get("keyFacts", []) if summary else [],
														
 
															             }
														
 
															+
														
 
															+    # ── Paginated Clusters ────────────────────────────────────────────
														
 
															+
														
 
															+    def get_clusters_page(
														
 
															+        self,
														
 
															+        topic: str | None = None,
														
 
															+        hours: float = 24,
														
 
															+        limit: int = 20,
														
 
															+        offset: int = 0,
														
 
															+    ) -> dict[str, Any]:
														
 
															+        """Paginated cluster listing filtered by SQL payload_ts index.
														
 
															+
														
 
															+        Returns {"clusters": [...], "total": int}.
														
 
															+        """
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
														
 
															+        params: list = [cutoff]
														
 
															+        if topic and topic != "all":
														
 
															+            query += " AND topic = ?"
														
 
															+            params.append(topic)
														
 
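															+        # Snapshot the count query and its params before LIMIT/OFFSET are appended.
														
 
															+        count_query = f"SELECT COUNT(*) FROM ({query})"
														
 
															+        count_params = list(params)
														
 
															+        query += " ORDER BY payload_ts DESC LIMIT ? OFFSET ?"
														
 
															+        params.extend([limit, offset])
														
 
															+
														
 
															+        # Run both queries on one managed connection, like the other store methods.
														
 
															+        with self._conn() as conn:
														
 
															+            total = conn.execute(count_query, count_params).fetchone()[0]
														
 
															+            rows = conn.execute(query, params).fetchall()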
														
 
															+
														
 
															+        return {
														
 
															+            "clusters": [
														
 
															+                {
														
 
															+                    "cluster_id": c.get("cluster_id", ""),
														
 
															+                    "headline": c.get("headline", ""),
														
 
															+                    "topic": c.get("topic", ""),
														
 
															+                    "sentiment": c.get("sentiment", "neutral"),
														
 
															+                    "sentimentScore": c.get("sentimentScore"),
														
 
															+                    "importance": c.get("importance", 0),
														
 
															+                    "entities": c.get("entities", []),
														
 
															+                    "sources": c.get("sources", []),
														
 
															+                    "timestamp": c.get("timestamp", ""),
														
 
															+                    "keywords": c.get("keywords", []),
														
 
															+                    "article_count": len(c.get("articles", [])),
														
 
															+                }
														
 
															+                for c in [json.loads(r[0]) for r in rows]
														
 
															+            ],
														
 
															+            "total": total,
														
 
															+        }
														
 
															+
														
 
															+    # ── Sentiment Series ──────────────────────────────────────────────
														
 
															+
														
 
															+    def get_sentiment_series(
														
 
															+        self,
														
 
															+        topic: str | None = None,
														
 
															+        hours: float = 24,
														
 
															+        bucket_hours: float = 1,
														
 
															+    ) -> list[dict[str, Any]]:
														
 
															+        """Sentiment score averaged per time bucket.  Filters by payload_ts SQL index."""
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        query = "SELECT payload FROM clusters WHERE payload_ts >= ?"
														
 
															+        params: list = [cutoff]
														
 
															+        if topic and topic != "all":
														
 
															+            query += " AND topic = ?"
														
 
															+            params.append(topic)
														
 
															+        query += " ORDER BY payload_ts ASC"
														
 
															+
														
 
															+        with self._conn() as conn:
														
 
															+            rows = conn.execute(query, params).fetchall()
														
 
															+
														
 
															+        buckets: dict[datetime, list[float]] = {}
														
 
															+        for (payload_text,) in rows:
														
 
															+            c = json.loads(payload_text)
														
 
															+            ts_str = c.get("timestamp")
														
 
															+            score = c.get("sentimentScore")
														
 
															+            if not ts_str or score is None:
														
 
															+                continue
														
 
															+            dt = datetime.fromisoformat(str(ts_str).strip().replace("Z", "+00:00"))
														
 
															+            if dt.tzinfo is None:
														
 
															+                dt = dt.replace(tzinfo=timezone.utc)
														
 
															+            dt = dt.astimezone(timezone.utc)
														
 
															+            bucket_key = dt.replace(minute=0, second=0, microsecond=0)
														
 
															+            if bucket_hours > 1:
														
 
															+                bucket_key = bucket_key.replace(
														
 
															+                    hour=(bucket_key.hour // int(bucket_hours)) * int(bucket_hours)
														
 
															+                )
														
 
															+            buckets.setdefault(bucket_key, []).append(float(score))
														
 
															+
														
 
															+        return [
														
 
															+            {
														
 
															+                "time": bucket_key.isoformat(),
														
 
															+                "avg_sentiment": round(sum(scores) / len(scores), 3),
														
 
															+                "count": len(scores),
														
 
															+                "min": round(min(scores), 3),
														
 
															+                "max": round(max(scores), 3),
														
 
															+            }
														
 
															+            for bucket_key, scores in sorted(buckets.items())
														
 
															+        ]
														
 
															+
														
 
															+    # ── Entity / Keyword Frequencies ──────────────────────────────────
														
 
															+
														
 
															+    def get_entity_frequencies(
														
 
															+        self,
														
 
															+        hours: float = 24,
														
 
															+        limit: int = 30,
														
 
															+    ) -> list[dict[str, Any]]:
														
 
															+        """Top entities by mention count, using SQL junction table + payload_ts index."""
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        with self._conn() as conn:
														
 
															+            rows = conn.execute(
														
 
															+                """\
														
 
															+                SELECT ce.entity, COUNT(*) as cnt
														
 
															+                FROM cluster_entities ce
														
 
															+                JOIN clusters c ON c.cluster_id = ce.cluster_id
														
 
															+                WHERE c.payload_ts >= ?
														
 
															+                GROUP BY ce.entity
														
 
															+                ORDER BY cnt DESC
														
 
															+                LIMIT ?
														
 
															+                """,
														
 
															+                (cutoff, limit),
														
 
															+            ).fetchall()
														
 
															+
														
 
															+        result: list[dict[str, Any]] = []
														
 
															+        for label, count in rows:
														
 
															+            meta = self.get_entity_metadata(label)
														
 
															+            result.append({
														
 
															+                "label": label,
														
 
															+                "count": count,
														
 
															+                "canonical_label": meta["canonical_label"] if meta else label,
														
 
															+                "mid": meta["mid"] if meta else None,
														
 
															+            })
														
 
															+        return result
														
 
															+
														
 
															+    def get_keyword_frequencies(
														
 
															+        self,
														
 
															+        hours: float = 24,
														
 
															+        limit: int = 30,
														
 
															+    ) -> list[dict[str, Any]]:
														
 
															+        """Top keywords by mention count, using SQL junction table + payload_ts index.
														
 
															+
														
 
															+        Excludes DEFAULT_TOPICS labels (crypto, macro, regulation, ai, other).
														
 
															+        Filtering happens after the SQL LIMIT, so fewer than `limit` rows may be returned.
														
 
															+        """
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        _topic_labels = {"crypto", "macro", "regulation", "ai", "other"}
														
 
															+
														
 
															+        with self._conn() as conn:
														
 
															+            rows = conn.execute(
														
 
															+                """\
														
 
															+                SELECT ck.keyword, COUNT(*) as cnt
														
 
															+                FROM cluster_keywords ck
														
 
															+                JOIN clusters c ON c.cluster_id = ck.cluster_id
														
 
															+                WHERE c.payload_ts >= ?
														
 
															+                GROUP BY ck.keyword
														
 
															+                ORDER BY cnt DESC
														
 
															+                LIMIT ?
														
 
															+                """,
														
 
															+                (cutoff, limit),
														
 
															+            ).fetchall()
														
 
															+
														
 
															+        return [
														
 
															+            {"label": label, "count": count}
														
 
															+            for label, count in rows
														
 
															+            if label.lower() not in _topic_labels
														
 
															+        ]
														
 
															+
														
 
															+    # ── Junction-Table Entity / Keyword Cluster Search ────────────────
														
 
															+
														
 
															+    def get_clusters_by_entity(
														
 
															+        self,
														
 
															+        entity: str,
														
 
															+        hours: float = 168,
														
 
															+        limit: int = 50,
														
 
															+        offset: int = 0,
														
 
															+    ) -> dict[str, Any]:
														
 
															+        """Return clusters matching an entity, SQL-level filter via junction table.
														
 
															+
														
 
															+        Returns {"entity": ..., "clusters": [...], "total": int, "hours": float}.
														
 
															+        """
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        entity_norm = entity.strip().lower()
														
 
															+
														
 
															+        with self._conn() as conn:
														
 
															+            total = conn.execute(
														
 
															+                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
														
 
															+                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
														
 
															+                "WHERE c.payload_ts >= ? AND ce.entity = ?",
														
 
															+                (cutoff, entity_norm),
														
 
															+            ).fetchone()[0]
														
 
															+
														
 
															+            rows = conn.execute(
														
 
															+                "SELECT DISTINCT c.payload FROM clusters c "
														
 
															+                "JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
														
 
															+                "WHERE c.payload_ts >= ? AND ce.entity = ? "
														
 
															+                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
														
 
															+                (cutoff, entity_norm, limit, offset),
														
 
															+            ).fetchall()
														
 
															+
														
 
															+        return {
														
 
															+            "entity": entity_norm,
														
 
															+            "clusters": [json.loads(r[0]) for r in rows],
														
 
															+            "total": total,
														
 
															+            "hours": hours,
														
 
															+        }
														
 
															+
														
 
															+    def get_clusters_by_keyword(
														
 
															+        self,
														
 
															+        keyword: str,
														
 
															+        hours: float = 168,
														
 
															+        limit: int = 50,
														
 
															+        offset: int = 0,
														
 
															+    ) -> dict[str, Any]:
														
 
															+        """Return clusters matching a keyword, SQL-level filter via junction table."""
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        kw_norm = keyword.strip().lower()
														
 
															+
														
 
															+        with self._conn() as conn:
														
 
															+            total = conn.execute(
														
 
															+                "SELECT COUNT(DISTINCT c.cluster_id) FROM clusters c "
														
 
															+                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
														
 
															+                "WHERE c.payload_ts >= ? AND ck.keyword = ?",
														
 
															+                (cutoff, kw_norm),
														
 
															+            ).fetchone()[0]
														
 
															+
														
 
															+            rows = conn.execute(
														
 
															+                "SELECT DISTINCT c.payload FROM clusters c "
														
 
															+                "JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
														
 
															+                "WHERE c.payload_ts >= ? AND ck.keyword = ? "
														
 
															+                "ORDER BY c.payload_ts DESC LIMIT ? OFFSET ?",
														
 
															+                (cutoff, kw_norm, limit, offset),
														
 
															+            ).fetchall()
														
 
															+
														
 
															+        return {
														
 
															+            "keyword": kw_norm,
														
 
															+            "clusters": [json.loads(r[0]) for r in rows],
														
 
															+            "total": total,
														
 
															+            "hours": hours,
														
 
															+        }
														
 
															+
														
 
															+    def get_clusters_by_entity_or_keyword(
														
 
															+        self,
														
 
															+        query_terms: set[str],
														
 
															+        hours: float,
														
 
															+        limit: int,
														
 
															+    ) -> list[dict]:
														
 
															+        """Search clusters by matching ANY query term against entities OR keywords.
														
 
															+
														
 
															+        Filters in SQL via the junction tables, so matches in older clusters are not cut off by a pre-filter row limit.
														
 
															+        Returns clusters sorted by recency.
														
 
															+        """
														
 
															+        terms = [q.strip().lower() for q in query_terms if q.strip()]
														
 
															+        if not terms:
														
 
															+            return []
														
 
															+        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
														
 
															+        placeholders = ",".join("?" for _ in terms)
														
 
															+
														
 
															+        with self._conn() as conn:
														
 
															+            rows = conn.execute(
														
 
															+                f"SELECT DISTINCT c.payload FROM clusters c "
														
 
															+                f"LEFT JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id "
														
 
															+                f"LEFT JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id "
														
 
															+                f"WHERE c.payload_ts >= ? "
														
 
															+                f"  AND (ce.entity IN ({placeholders}) OR ck.keyword IN ({placeholders})) "
														
 
															+                f"ORDER BY c.payload_ts DESC LIMIT ?",
														
 
															+                (cutoff, *terms, *terms, limit),
														
 
															+            ).fetchall()
														
 
															+
														
 
															+        return [json.loads(r[0]) for r in rows]