
refactor: clean read-path timestamp parsing, enforce ISO 8601 contract

All payload.timestamp/first_seen/last_updated values are guaranteed
YYYY-MM-DDTHH:MM:SS+00:00 at write time via sanitize_cluster_payload().
Read paths no longer need RFC 2822 / parsedate_to_datetime fallbacks.

Changes:
- Add _read_ts() to sqlite_store.py: fromisoformat only, no fallback,
  exported for use by dashboard_store and any future read paths
- sqlite_store: get_latest_clusters, get_latest_clusters_all_topics,
  get_clusters_page, get_sentiment_series, get_entity_frequencies all
  use _read_ts — removed 4 local _parse_payload_ts/_parse_ts copies
- dashboard_store: same cleanup across get_clusters_page,
  get_sentiment_series, get_entity_frequencies, get_keyword_frequencies
  — removed 4 local _parse_ts copies; removed parsedate_to_datetime import
- mcp_server_fastmcp: _parse_cluster_timestamp and _cluster_age_hours
  use fromisoformat only; removed parsedate_to_datetime import
- parsedate_to_datetime retained ONLY in sqlite_store._normalize_ts()
  (write path) and dedup/cluster.py (raw ingest)
- Test fixture updated to use ISO 8601 UTC timestamps
- AGENTS.md: Timestamp Contract section added (enforces this rule)
- PROJECT.md: Timestamp Read-Path Cleanup section documents the change
  and the dev-vs-live server distinction
Lukas Goldschmidt, 1 week ago
Parent
Current commit
b6567f729d
6 files changed, with 163 insertions and 292 deletions
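  1. AGENTS.md: 7 insertions, 0 deletions
  2. PROJECT.md: 19 insertions, 0 deletions
  3. news_mcp/dashboard/dashboard_store.py: 48 insertions, 120 deletions
  4. news_mcp/mcp_server_fastmcp.py: 10 insertions, 20 deletions
  5. news_mcp/storage/sqlite_store.py: 77 insertions, 150 deletions
  6. test_news_mcp.py: 2 insertions, 2 deletions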

+ 7 - 0
AGENTS.md

@@ -52,6 +52,13 @@ This project spans two machines. **Always check which machine you're operating o
 - `include_articles=true` should keep responses compact and only return minimal article fields.
 - Timestamps in cluster payloads are normalized to ISO 8601 UTC (`YYYY-MM-DDTHH:MM:SS+00:00`) at write time in `sanitize_cluster_payload()`.
 
+## Timestamp Contract (READ THIS BEFORE TOUCHING ANY TIMESTAMP CODE)
+- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **guaranteed** `YYYY-MM-DDTHH:MM:SS+00:00` for every row written after the normalization migration (backfill script was run on the live server).
+- **Read paths**: use `_read_ts()` from `news_mcp.storage.sqlite_store`, or `datetime.fromisoformat()` directly. That is all that is needed.
+- **Never** add `parsedate_to_datetime` / RFC 2822 fallbacks to a read path. If `_read_ts` returns None on a stored timestamp, the bug is in the write path — fix `sanitize_cluster_payload()`, don't paper over it.
+- `parsedate_to_datetime` is intentionally retained **only** in `sqlite_store._normalize_ts()` (write path) and `dedup/cluster.py` (raw ingest before normalization). Nowhere else.
+- **Never query the dev DB** (`news_mcp/data/news.sqlite` on latitude) to check live data. It is empty/stale. The live DB is on thinkcenter-2 in Docker at `/app/data/news.sqlite`.
+
 ## Editing Rules
 - Keep changes aligned with the docs in `README.md`, `PROJECT.md`, and `OUTLOOK.md`.
 - Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
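
The contract added to AGENTS.md says a failed read-path parse points to a write-path bug. One way to surface such bugs early might be a strict format check on the three contract fields. The sketch below is illustrative only: `is_contract_ts` and `contract_violations` are hypothetical names that do not appear in this commit, and skipping absent fields is an assumption.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# Hypothetical helper, not part of the commit: the exact shape the contract
# promises, YYYY-MM-DDTHH:MM:SS+00:00 (no "Z", no fractional seconds).
_CONTRACT_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+00:00")


def is_contract_ts(value: object) -> bool:
    """Return True only for strings in the exact contract format."""
    if not isinstance(value, str) or not _CONTRACT_RE.fullmatch(value):
        return False
    try:
        # The regex checks the shape; fromisoformat rejects impossible
        # dates such as month 13.
        dt = datetime.fromisoformat(value)
    except ValueError:
        return False
    return dt.utcoffset() == timedelta(0)


def contract_violations(payload: dict) -> list[str]:
    """List the contract fields that are present but malformed.

    Assumption: absent fields are skipped rather than reported.
    """
    fields = ("timestamp", "first_seen", "last_updated")
    return [f for f in fields if f in payload and not is_contract_ts(payload[f])]
```

A check like this would fit in a test or a one-off audit script rather than in a read path, which the contract keeps free of defensive parsing.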

+ 19 - 0
PROJECT.md

@@ -172,3 +172,22 @@ Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like `
 
 ### Key invariant
 `updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.
+
+## Timestamp Read-Path Cleanup (May 2026)
+
+### Problem
+After normalization, all read paths still contained defensive RFC 2822 / `parsedate_to_datetime` fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.
+
+### Fix
+- Added `_read_ts(ts) -> float | None` to `sqlite_store.py` (module-level, exported). Uses only `datetime.fromisoformat()`. No RFC 2822 fallback. If it fails, the normalization pipeline has a bug — fix that instead.
+- All read-path timestamp parsing in `sqlite_store.py`, `dashboard_store.py`, and `mcp_server_fastmcp.py` now uses `_read_ts` or plain `fromisoformat`.
+- `parsedate_to_datetime` removed from `dashboard_store.py` and `mcp_server_fastmcp.py` imports entirely.
+- `parsedate_to_datetime` is **only** retained in `sqlite_store._normalize_ts()` (the write path) and `dedup/cluster.py` (raw ingest before normalization).
+- Test fixtures updated to use ISO 8601 UTC timestamps.
+
+### Contract (ENFORCE THIS)
+- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **always** `YYYY-MM-DDTHH:MM:SS+00:00` for any row written after the normalization migration.
+- Read paths: use `_read_ts()` from `sqlite_store` or `datetime.fromisoformat()` directly. **Never** add `parsedate_to_datetime` to a read path.
+- Write paths: `sanitize_cluster_payload()` in `sqlite_store.py` is the single normalization point. All writes go through `upsert_clusters()` which calls it.
+- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data — the dev DB is stale/empty.
+
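
The write/read split described in the PROJECT.md section can be sketched as a round trip. This is a minimal illustration, not the project's code: the body of `_normalize_ts()` is not shown in this diff, so `normalize_ts_sketch` is a hypothetical stand-in, and treating naive input as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_ts_sketch(raw: str) -> str:
    """Write path: accept ISO 8601 or RFC 2822 input, emit ISO 8601 UTC.

    Hypothetical stand-in for _normalize_ts(); the real body is not in the diff.
    """
    text = raw.strip()
    try:
        dt = datetime.fromisoformat(text)
    except ValueError:
        # RFC 2822 parsing is confined to the write path.
        dt = parsedate_to_datetime(text)
    if dt.tzinfo is None:
        # Assumption: naive values are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()


def read_ts_sketch(stored: str) -> float:
    """Read path: the stored value is already normalized, so one call suffices."""
    return datetime.fromisoformat(stored).timestamp()


stored = normalize_ts_sketch("Wed, 01 Apr 2026 10:00:00 GMT")
print(stored)                  # 2026-04-01T10:00:00+00:00
print(read_ts_sketch(stored))  # 1775037600.0
```

Because every format decision is made once at write time, the read side needs no knowledge of what the RSS feeds originally sent.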

+ 48 - 120
news_mcp/dashboard/dashboard_store.py

@@ -3,7 +3,6 @@ from __future__ import annotations
 import json
 from datetime import datetime, timedelta, timezone
 from typing import Any
-from email.utils import parsedate_to_datetime
 
 from news_mcp.config import (
     NEWS_PRUNE_INTERVAL_HOURS,
@@ -12,7 +11,7 @@ from news_mcp.config import (
     NEWS_RETENTION_DAYS,
     DEFAULT_TOPICS,
 )
-from news_mcp.storage.sqlite_store import SQLiteClusterStore
+from news_mcp.storage.sqlite_store import SQLiteClusterStore, _read_ts
 
 
 class DashboardStore:
@@ -86,24 +85,29 @@ class DashboardStore:
         limit: int = 20,
         offset: int = 0,
     ) -> list[dict[str, Any]]:
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        now = datetime.now(timezone.utc).isoformat()
-        query = "SELECT payload FROM clusters WHERE updated_at >= ? AND updated_at <= ?"
-        params: list = [cutoff, now]
+        """Paginated cluster listing filtered by payload.timestamp (event time).
+
+        payload.timestamp is guaranteed ISO 8601 UTC — uses _read_ts from
+        sqlite_store. Do NOT filter by updated_at (row mod time).
+        """
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
+
+        query = "SELECT payload FROM clusters"
+        params: list = []
         if topic and topic != "all":
-            query += " AND topic = ?"
+            query += " WHERE topic = ?"
             params.append(topic)
-        query += " ORDER BY updated_at DESC LIMIT ? OFFSET ?"
-        params.extend([limit, offset])
 
         with self._store._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
+            rows = conn.execute(query, params).fetchall()
 
-        clusters: list[dict[str, Any]] = []
-        for (payload_text,) in rows:
-            c = json.loads(payload_text)
-            clusters.append({
+        filtered = [json.loads(r[0]) for r in rows]
+        filtered = [c for c in filtered if (_read_ts(c.get("timestamp")) or 0.0) >= cutoff_ts]
+        filtered.sort(key=lambda c: _read_ts(c.get("timestamp")) or 0.0, reverse=True)
+        page = filtered[offset:offset + limit]
+
+        return [
+            {
                 "cluster_id": c.get("cluster_id", ""),
                 "headline": c.get("headline", ""),
                 "topic": c.get("topic", ""),
@@ -115,8 +119,9 @@ class DashboardStore:
                 "timestamp": c.get("timestamp", ""),
                 "keywords": c.get("keywords", []),
                 "article_count": len(c.get("articles", [])),
-            })
-        return clusters
+            }
+            for c in page
+        ]
 
     def get_cluster_detail(self, cluster_id: str) -> dict[str, Any] | None:
         with self._store._conn() as conn:
@@ -164,46 +169,26 @@ class DashboardStore:
         ) -> list[dict[str, Any]]:
         """Sentiment score averaged per time bucket.
 
-        Filters by the cluster's own event timestamp (payload.timestamp),
-        not by updated_at which tracks row modification time.
+        Filters by payload.timestamp (event time, ISO 8601 UTC guaranteed).
         """
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
-
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
         query = "SELECT payload FROM clusters"
         params: list = []
         if topic and topic != "all":
             query += " WHERE topic = ?"
             params.append(topic)
-        query += " ORDER BY updated_at ASC"
 
         with self._store._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
-
-        def _parse_ts(ts: Any) -> datetime | None:
-            if not ts:
-                return None
-            s = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
-            except Exception:
-                try:
-                    dt = parsedate_to_datetime(s)
-                except Exception:
-                    return None
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            return dt.astimezone(timezone.utc)
+            rows = conn.execute(query, params).fetchall()
 
         buckets: dict[datetime, list[float]] = {}
         for (payload_text,) in rows:
             c = json.loads(payload_text)
-            dt = _parse_ts(c.get("timestamp"))
+            ts = _read_ts(c.get("timestamp"))
             score = c.get("sentimentScore")
-            if dt is None or score is None:
-                continue
-            if dt < cutoff:
+            if ts is None or score is None or ts < cutoff_ts:
                 continue
+            dt = datetime.fromtimestamp(ts, tz=timezone.utc)
             bucket_key = dt.replace(minute=0, second=0, microsecond=0)
             if bucket_hours > 1:
                 bucket_key = bucket_key.replace(
@@ -211,17 +196,16 @@ class DashboardStore:
                 )
             buckets.setdefault(bucket_key, []).append(float(score))
 
-        series: list[dict[str, Any]] = []
-        for bucket_key in sorted(buckets):
-            scores = buckets[bucket_key]
-            series.append({
+        return [
+            {
                 "time": bucket_key.isoformat(),
                 "avg_sentiment": round(sum(scores) / len(scores), 3),
                 "count": len(scores),
                 "min": round(min(scores), 3),
                 "max": round(max(scores), 3),
-            })
-        return series
+            }
+            for bucket_key, scores in sorted(buckets.items())
+        ]
 
     # ── Entity Frequencies ──────────────────────────────────────────
 
@@ -230,49 +214,22 @@ class DashboardStore:
         hours: float = 24,
         limit: int = 30,
     ) -> list[dict[str, Any]]:
-        """Top entities by mention count in recent clusters.
+        """Top entities by mention count filtered by payload.timestamp (ISO 8601 UTC guaranteed)."""
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
 
-        Filters by the cluster's own event timestamp (payload.timestamp),
-        not by updated_at which tracks row modification time.
-        """
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
-    
-        query = "SELECT payload FROM clusters"
-        params: list = []
         with self._store._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
-    
-        def _parse_ts(ts):
-            if not ts:
-                return None
-            s = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
-            except Exception:
-                try:
-                    from email.utils import parsedate_to_datetime
-                    dt = parsedate_to_datetime(s)
-                except Exception:
-                    return None
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            return dt.astimezone(timezone.utc)
-    
+            rows = conn.execute("SELECT payload FROM clusters").fetchall()
+
         counter: dict[str, int] = {}
         for (payload_text,) in rows:
             c = json.loads(payload_text)
-            dt = _parse_ts(c.get("timestamp"))
-            if dt is None:
-                continue
-            if dt < cutoff:
+            if (_read_ts(c.get("timestamp")) or 0.0) < cutoff_ts:
                 continue
             for ent in c.get("entities", []):
                 counter[ent] = counter.get(ent, 0) + 1
-    
-        sorted_entities = sorted(counter.items(), key=lambda x: -x[1])[:limit]
+
         result: list[dict[str, Any]] = []
-        for label, count in sorted_entities:
+        for label, count in sorted(counter.items(), key=lambda x: -x[1])[:limit]:
             meta = self._store.get_entity_metadata(label)
             result.append({
                 "label": label,
@@ -289,64 +246,35 @@ class DashboardStore:
         hours: float = 24,
         limit: int = 30,
     ) -> list[dict[str, Any]]:
-        """Top keywords by occurrence count in recent clusters.
+        """Top keywords by occurrence count filtered by payload.timestamp (ISO 8601 UTC guaranteed).
 
-        Mirrors get_entity_frequencies but for LLM-curated thematic keywords.
-        Filters by the cluster's own event timestamp (payload.timestamp).
-        Only includes keywords that are NOT already extracted as entities
-        in the same cluster — the entity signal is higher quality and is
-        already shown in the entity frequencies view.
+        Excludes keywords that are already entities in the same cluster,
+        and excludes DEFAULT_TOPICS labels (crypto, macro, regulation, ai, other).
         """
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
+        _topic_labels = {t.lower() for t in DEFAULT_TOPICS}
 
-        query = "SELECT payload FROM clusters"
-        params: list = []
         with self._store._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
-
-        def _parse_ts(ts):
-            if not ts:
-                return None
-            s = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
-            except Exception:
-                try:
-                    dt = parsedate_to_datetime(s)
-                except Exception:
-                    return None
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            return dt.astimezone(timezone.utc)
+            rows = conn.execute("SELECT payload FROM clusters").fetchall()
 
         counter: dict[str, int] = {}
-        _topic_labels = {t.lower() for t in DEFAULT_TOPICS}
         for (payload_text,) in rows:
             c = json.loads(payload_text)
-            dt = _parse_ts(c.get("timestamp"))
-            if dt is None:
-                continue
-            if dt < cutoff:
+            if (_read_ts(c.get("timestamp")) or 0.0) < cutoff_ts:
                 continue
-            # Get entities in this cluster to dedup against keywords
             ents_in_cluster = {str(e).strip().lower() for e in (c.get("entities", []) or []) if str(e).strip()}
             for kw in c.get("keywords", []):
                 kw_str = str(kw).strip()
                 if not kw_str:
                     continue
-                # Skip topic labels (crypto, macro, regulation, ai, other)
-                # that the LLM sometimes returns as keywords.
                 if kw_str.lower() in _topic_labels:
                     continue
-                # Skip keywords that are already entities in this cluster
                 if kw_str.lower() in ents_in_cluster:
                     continue
                 counter[kw_str] = counter.get(kw_str, 0) + 1
 
-        sorted_kws = sorted(counter.items(), key=lambda x: -x[1])[:limit]
         return [
             {"label": label, "count": count}
-            for label, count in sorted_kws
+            for label, count in sorted(counter.items(), key=lambda x: -x[1])[:limit]
         ]
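
The new `get_clusters_page` filters, sorts, and paginates in Python on event time instead of using SQL `LIMIT`/`OFFSET` on `updated_at`. A self-contained sketch of that pattern follows. The names `event_ts` and `page_by_event_time` and the injectable `now` parameter are assumptions added for illustration and testability; they are not in the commit.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def event_ts(cluster: dict) -> float:
    """Unix time of payload.timestamp, or 0.0 if missing or unparseable."""
    ts = cluster.get("timestamp")
    if not ts:
        return 0.0
    try:
        dt = datetime.fromisoformat(str(ts).strip())
    except ValueError:
        return 0.0
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def page_by_event_time(
    clusters: list[dict],
    hours: float,
    limit: int,
    offset: int,
    now: datetime | None = None,  # hypothetical parameter, eases testing
) -> list[dict]:
    """Keep clusters newer than the cutoff, newest first, then slice a page."""
    now = now or datetime.now(timezone.utc)
    cutoff_ts = (now - timedelta(hours=hours)).timestamp()
    recent = [c for c in clusters if event_ts(c) >= cutoff_ts]
    recent.sort(key=event_ts, reverse=True)
    return recent[offset:offset + limit]
```

One consequence of this design is that every matching row is loaded and decoded on each call before the page is cut. That is likely fine for a small table, but it may be worth revisiting if the `clusters` table grows large.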
 

+ 10 - 20
news_mcp/mcp_server_fastmcp.py

@@ -9,7 +9,6 @@ import re
 import time
 from collections import Counter
 from datetime import datetime, timezone
-from email.utils import parsedate_to_datetime
 from pathlib import Path
 
 from fastapi import FastAPI, Form
@@ -85,20 +84,18 @@ def _cluster_entity_haystack(cluster: dict) -> list[str]:
 
 
 def _parse_cluster_timestamp(value) -> datetime:
+    """Parse a stored cluster timestamp.
+
+    payload.timestamp is guaranteed ISO 8601 UTC (YYYY-MM-DDTHH:MM:SS+00:00)
+    at write time. Only datetime.fromisoformat is needed — no RFC 2822 fallback.
+    """
     if not value:
         return datetime.min.replace(tzinfo=timezone.utc)
     text = str(value).strip()
     if not text:
         return datetime.min.replace(tzinfo=timezone.utc)
     try:
-        dt = datetime.fromisoformat(text.replace("Z", "+00:00"))
-        if dt.tzinfo is None:
-            dt = dt.replace(tzinfo=timezone.utc)
-        return dt.astimezone(timezone.utc)
-    except Exception:
-        pass
-    try:
-        dt = parsedate_to_datetime(text)
+        dt = datetime.fromisoformat(text)
         if dt.tzinfo is None:
             dt = dt.replace(tzinfo=timezone.utc)
         return dt.astimezone(timezone.utc)
@@ -600,24 +597,17 @@ async def detect_emerging_topics(limit: int = 10, timeframe: str = "24h", topic:
     now = datetime.now(timezone.utc)
 
     def _cluster_age_hours(c: dict) -> float:
-        """Return the cluster's age in hours (approximate, from now)."""
+        """Return the cluster's age in hours. payload.timestamp is ISO 8601 UTC guaranteed."""
         ts = c.get("timestamp") or c.get("last_updated")
         if not ts:
-            return 0.0  # treat un-dated as fresh
+            return 0.0
         try:
-            s = str(ts).replace("Z", "+00:00")
-            dt = datetime.fromisoformat(s)
+            dt = datetime.fromisoformat(str(ts).strip())
             if dt.tzinfo is None:
                 dt = dt.replace(tzinfo=timezone.utc)
             return max(0.0, (now - dt.astimezone(timezone.utc)).total_seconds() / 3600.0)
         except Exception:
-            try:
-                dt = parsedate_to_datetime(str(ts))
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return max(0.0, (now - dt.astimezone(timezone.utc)).total_seconds() / 3600.0)
-            except Exception:
-                return 0.0
+            return 0.0
 
     # Generic entity filter
     _generic_tokens = {"news", "latest", "breaking", "update", "updates", "report", "reports"}

+ 77 - 150
news_mcp/storage/sqlite_store.py

@@ -55,6 +55,27 @@ def _normalize_ts(ts: Any) -> str:
     return text
 
 
+def _read_ts(ts: Any) -> float | None:
+    """Parse a stored, already-normalized ISO 8601 UTC timestamp to a unix float.
+
+    All payload.timestamp / payload.first_seen / payload.last_updated values
+    are guaranteed YYYY-MM-DDTHH:MM:SS+00:00 at write time (enforced by
+    sanitize_cluster_payload → _normalize_ts).  Only datetime.fromisoformat is
+    needed here.  Do NOT add RFC 2822 / parsedate_to_datetime fallbacks — if
+    this function can't parse a stored timestamp it means the normalization
+    pipeline has a bug that should be fixed there, not papered over here.
+    """
+    if not ts:
+        return None
+    try:
+        dt = datetime.fromisoformat(str(ts).strip())
+        if dt.tzinfo is None:
+            dt = dt.replace(tzinfo=timezone.utc)
+        return dt.astimezone(timezone.utc).timestamp()
+    except Exception:
+        return None
+
+
 @dataclass
 class ClusterRow:
     cluster_id: str
@@ -278,35 +299,12 @@ class SQLiteClusterStore:
             return json.loads(row[0])
 
     def get_latest_clusters(self, topic: str, ttl_hours: float, limit: int) -> list[dict]:
-        """Return newest clusters by *their own* timestamp.
+        """Return newest clusters by their own event timestamp (payload.timestamp).
 
-        Filtering/sorting by the DB row's `updated_at` can drift away from the
-        actual event time in `payload.timestamp`.
+        payload.timestamp is guaranteed ISO 8601 UTC — use _read_ts, not raw
+        JSON parsing with RFC 2822 fallbacks.
         """
-
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=float(ttl_hours))
-        cutoff_ts = cutoff.timestamp()
-
-        def _parse_payload_ts(ts: Any) -> float | None:
-            if not ts:
-                return None
-            if isinstance(ts, (int, float)):
-                return float(ts)
-            text = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(text.replace('Z', '+00:00'))
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return dt.astimezone(timezone.utc).timestamp()
-            except Exception:
-                pass
-            try:
-                dt = parsedate_to_datetime(text)
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return dt.astimezone(timezone.utc).timestamp()
-            except Exception:
-                return None
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=float(ttl_hours))).timestamp()
 
         with self._conn() as conn:
             cur = conn.execute(
@@ -315,57 +313,24 @@ class SQLiteClusterStore:
             )
             candidates = [json.loads(r[0]) for r in cur.fetchall()]
 
-        filtered: list[dict] = []
-        for c in candidates:
-            ts = _parse_payload_ts(c.get("timestamp"))
-            if ts is None:
-                continue
-            if ts >= cutoff_ts:
-                filtered.append(c)
-
-        filtered.sort(key=lambda c: _parse_payload_ts(c.get("timestamp")) or 0.0, reverse=True)
+        filtered = [c for c in candidates if (_read_ts(c.get("timestamp")) or 0.0) >= cutoff_ts]
+        filtered.sort(key=lambda c: _read_ts(c.get("timestamp")) or 0.0, reverse=True)
         return filtered[: int(limit)]
 
     def get_latest_clusters_all_topics(self, ttl_hours: float, limit: int) -> list[dict]:
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=float(ttl_hours))
-        cutoff_ts = cutoff.timestamp()
+        """Return newest clusters across all topics by event timestamp.
 
-        def _parse_payload_ts(ts: Any) -> float | None:
-            if not ts:
-                return None
-            if isinstance(ts, (int, float)):
-                return float(ts)
-            text = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(text.replace('Z', '+00:00'))
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return dt.astimezone(timezone.utc).timestamp()
-            except Exception:
-                pass
-            try:
-                dt = parsedate_to_datetime(text)
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return dt.astimezone(timezone.utc).timestamp()
-            except Exception:
-                return None
+        payload.timestamp is guaranteed ISO 8601 UTC — use _read_ts, not raw
+        JSON parsing with RFC 2822 fallbacks.
+        """
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=float(ttl_hours))).timestamp()
 
         with self._conn() as conn:
-            cur = conn.execute(
-                "SELECT payload FROM clusters ORDER BY updated_at DESC",
-            )
+            cur = conn.execute("SELECT payload FROM clusters ORDER BY updated_at DESC")
             candidates = [json.loads(r[0]) for r in cur.fetchall()]
 
-        filtered: list[dict] = []
-        for c in candidates:
-            ts = _parse_payload_ts(c.get("timestamp"))
-            if ts is None:
-                continue
-            if ts >= cutoff_ts:
-                filtered.append(c)
-
-        filtered.sort(key=lambda c: _parse_payload_ts(c.get("timestamp")) or 0.0, reverse=True)
+        filtered = [c for c in candidates if (_read_ts(c.get("timestamp")) or 0.0) >= cutoff_ts]
+        filtered.sort(key=lambda c: _read_ts(c.get("timestamp")) or 0.0, reverse=True)
         return filtered[: int(limit)]
 
     def get_cluster_by_id(self, cluster_id: str) -> dict | None:
@@ -687,23 +652,29 @@ class SQLiteClusterStore:
         limit: int = 20,
         offset: int = 0,
     ) -> list[dict[str, Any]]:
-        """Paginated cluster listing for the dashboard."""
-        cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-        now = datetime.now(timezone.utc).isoformat()
-        query = "SELECT payload FROM clusters WHERE updated_at >= ? AND updated_at <= ?"
-        params: list = [cutoff, now]
+        """Paginated cluster listing filtered by payload.timestamp (event time).
+
+        payload.timestamp is guaranteed ISO 8601 UTC — filtered and sorted
+        using _read_ts, not updated_at (row modification time).
+        """
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
+
+        query = "SELECT payload FROM clusters"
+        params: list = []
         if topic and topic != "all":
-            query += " AND topic = ?"
+            query += " WHERE topic = ?"
             params.append(topic)
-        query += " ORDER BY updated_at DESC LIMIT ? OFFSET ?"
-        params.extend([limit, offset])
+
         with self._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
-        clusters: list[dict[str, Any]] = []
-        for (payload_text,) in rows:
-            c = json.loads(payload_text)
-            clusters.append({
+            rows = conn.execute(query, params).fetchall()
+
+        filtered = [json.loads(r[0]) for r in rows]
+        filtered = [c for c in filtered if (_read_ts(c.get("timestamp")) or 0.0) >= cutoff_ts]
+        filtered.sort(key=lambda c: _read_ts(c.get("timestamp")) or 0.0, reverse=True)
+        page = filtered[offset:offset + limit]
+
+        return [
+            {
                 "cluster_id": c.get("cluster_id", ""),
                 "headline": c.get("headline", ""),
                 "topic": c.get("topic", ""),
@@ -715,8 +686,9 @@ class SQLiteClusterStore:
                 "timestamp": c.get("timestamp", ""),
                 "keywords": c.get("keywords", []),
                 "article_count": len(c.get("articles", [])),
-            })
-        return clusters
+            }
+            for c in page
+        ]
 
     def get_sentiment_series(
             self,
@@ -726,45 +698,28 @@ class SQLiteClusterStore:
         ) -> list[dict[str, Any]]:
             """Sentiment score averaged per time bucket.
 
-            Filters by the cluster's own event timestamp (payload.timestamp),
-            not by updated_at which tracks row modification time.
+            Filters by payload.timestamp (event time, ISO 8601 UTC guaranteed).
             """
-            cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
+            cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
             query = "SELECT payload FROM clusters"
             params: list = []
             if topic and topic != "all":
                 query += " WHERE topic = ?"
                 params.append(topic)
-            query += " ORDER BY updated_at ASC"
 
             with self._conn() as conn:
-                cur = conn.execute(query, params)
-                rows = cur.fetchall()
-
-            def _parse_ts(ts: Any) -> datetime | None:
-                if not ts:
-                    return None
-                s = str(ts).strip()
-                try:
-                    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
-                except Exception:
-                    try:
-                        dt = parsedate_to_datetime(s)
-                    except Exception:
-                        return None
-                if dt.tzinfo is None:
-                    dt = dt.replace(tzinfo=timezone.utc)
-                return dt.astimezone(timezone.utc)
+                rows = conn.execute(query, params).fetchall()
 
             buckets: dict[datetime, list[float]] = {}
             for (payload_text,) in rows:
                 c = json.loads(payload_text)
-                dt = _parse_ts(c.get("timestamp"))
+                ts = _read_ts(c.get("timestamp"))
                 score = c.get("sentimentScore")
-                if dt is None or score is None:
+                if ts is None or score is None:
                     continue
-                if dt < cutoff.replace(tzinfo=timezone.utc):
+                if ts < cutoff_ts:
                     continue
+                dt = datetime.fromtimestamp(ts, tz=timezone.utc)
                 bucket_key = dt.replace(minute=0, second=0, microsecond=0)
                 if bucket_hours > 1:
                     bucket_key = bucket_key.replace(
@@ -772,66 +727,38 @@ class SQLiteClusterStore:
                     )
                 buckets.setdefault(bucket_key, []).append(float(score))
 
-            series: list[dict[str, Any]] = []
-            for bucket_key in sorted(buckets):
-                scores = buckets[bucket_key]
-                series.append({
+            return [
+                {
                     "time": bucket_key.isoformat(),
                     "avg_sentiment": round(sum(scores) / len(scores), 3),
                     "count": len(scores),
                     "min": round(min(scores), 3),
                     "max": round(max(scores), 3),
-                })
-            return series
+                }
+                for bucket_key, scores in sorted(buckets.items())
+            ]
 
     def get_entity_frequencies(
         self,
         hours: float = 24,
         limit: int = 30,
     ) -> list[dict[str, Any]]:
-        """Top entities by mention count in recent clusters.
+        """Top entities by mention count filtered by payload.timestamp (ISO 8601 UTC guaranteed)."""
+        cutoff_ts = (datetime.now(timezone.utc) - timedelta(hours=hours)).timestamp()
 
-        Filters by the cluster's own event timestamp (payload.timestamp),
-        not by updated_at which tracks row modification time.
-        """
-        cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
-    
-        query = "SELECT payload FROM clusters"
-        params: list = []
         with self._conn() as conn:
-            cur = conn.execute(query, params)
-            rows = cur.fetchall()
-    
-        def _parse_ts(ts):
-            if not ts:
-                return None
-            s = str(ts).strip()
-            try:
-                dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
-            except Exception:
-                try:
-                    from email.utils import parsedate_to_datetime
-                    dt = parsedate_to_datetime(s)
-                except Exception:
-                    return None
-            if dt.tzinfo is None:
-                dt = dt.replace(tzinfo=timezone.utc)
-            return dt.astimezone(timezone.utc)
-    
+            rows = conn.execute("SELECT payload FROM clusters").fetchall()
+
         counter: dict[str, int] = {}
         for (payload_text,) in rows:
             c = json.loads(payload_text)
-            dt = _parse_ts(c.get("timestamp"))
-            if dt is None:
-                continue
-            if dt < cutoff:
+            if (_read_ts(c.get("timestamp")) or 0.0) < cutoff_ts:
                 continue
             for ent in c.get("entities", []):
                 counter[ent] = counter.get(ent, 0) + 1
-    
-        sorted_entities = sorted(counter.items(), key=lambda x: -x[1])[:limit]
+
         result: list[dict[str, Any]] = []
-        for label, count in sorted_entities:
+        for label, count in sorted(counter.items(), key=lambda x: -x[1])[:limit]:
             meta = self.get_entity_metadata(label)
             result.append({
                 "label": label,

+ 2 - 2
test_news_mcp.py

@@ -262,8 +262,8 @@ def test_resolve_entity_falls_back_cleanly_when_provider_unavailable(monkeypatch
 
 def test_sort_clusters_by_recency_prefers_newer_timestamp_over_importance():
     clusters = [
-        {"headline": "older", "timestamp": "Wed, 01 Apr 2026 10:00:00 GMT", "importance": 0.9},
-        {"headline": "newer", "timestamp": "Wed, 01 Apr 2026 11:00:00 GMT", "importance": 0.1},
+        {"headline": "older", "timestamp": "2026-04-01T10:00:00+00:00", "importance": 0.9},
+        {"headline": "newer", "timestamp": "2026-04-01T11:00:00+00:00", "importance": 0.1},
     ]
 
     sorted_clusters = _sort_clusters_by_recency(clusters)