|
|
@@ -6,271 +6,96 @@ Provide a signal-extraction MCP server that converts RSS into **deduplicated, en
|
|
|
## Current architecture (v0.4.0)
|
|
|
- FastMCP SSE server mounted at `/mcp`
|
|
|
- SQLite cache for clusters + entity metadata + feed state + LLM summary caches
|
|
|
-- **payload_ts** — indexed generated column for SQL-level event-time filtering (no JSON parsing at read time)
|
|
|
-- **cluster_entities** and **cluster_keywords** junction tables with indexes for O(log n) entity/keyword search
|
|
|
-- All read paths use SQL-level filtering (no full-table JSON parsing)
|
|
|
-- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles. The topic is excluded from the hash so that the same article always maps to the same cluster_id regardless of heuristic vs LLM-enriched topic classification.
|
|
|
-- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h). Existing clusters are re-bucketed by the same heuristic topic function (`normalize_topic_from_title`) that new articles use, ensuring matching works even when the enriched topic drifted.
|
|
|
+- **payload_ts** — indexed VIRTUAL GENERATED column: `json_extract(payload, '$.timestamp')`. Auto-maintained by SQLite on write. Indexed for O(log n) time-range queries. No write-path code needed.
|
|
|
+- **cluster_entities** junction table — `(cluster_id, entity)` with index on `entity`. Populated in `upsert_clusters()`. SQL-level entity search.
|
|
|
+- **cluster_keywords** junction table — `(cluster_id, keyword)` with index on `keyword`. Same pattern.
|
|
|
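+- All time-range filters use the `payload_ts` index, and dashboard/REST entity and keyword searches use the junction-table indexes, so those paths do no full-table JSON parsing at query time. MCP tool entity search still filters in Python (see "Design Flaw: Two Stores").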
|
|
|
+- **Stable cluster IDs**: `sha1(min_article_key)` — topic-independent, order-independent, consistent across polling cycles.
|
|
|
+- **Cross-cycle merge**: poller seeds clustering with recent DB clusters (configurable `NEWS_CLUSTER_MAX_AGE_HOURS`, default 4h).
|
|
|
- **Orphan merge**: post-clustering Union-Find pass merges clusters sharing article keys
|
|
|
-- Concurrent Ollama embeddings (pre-computed before clustering loop)
|
|
|
-- Concurrent LLM enrichment (entity extraction, topic classification, sentiment) with per-provider semaphore
|
|
|
-- Per-cluster retry with exponential backoff (3 retries, 2s/4s/8s) + cross-cycle failure recovery
|
|
|
-- All concurrency limits configurable via env vars (`NEWS_RSS_MAX_CONCURRENCY`, `NEWS_OLLAMA_MAX_CONCURRENCY`, `NEWS_LLM_CONCURRENCY_<PROVIDER>`)
|
|
|
-- Dashboard REST API (`/api/v1/*`) for clusters, sentiment series, entity frequencies
|
|
|
-- `get_latest_events()` defaults to all topics (omit `topic` for unfiltered)
|
|
|
+- Concurrent RSS fetch, Ollama embeddings, and LLM enrichment (the latter with a per-provider semaphore)
|
|
|
+- Dashboard REST API (`/api/v1/*`) + Keywords panel + entity/keyword drill-down via junction tables
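
The stable-ID and orphan-merge bullets above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes article keys are plain strings and that a cluster can be represented as a set of those keys. The names `stable_cluster_id` and `merge_orphans` are hypothetical.

```python
import hashlib


def stable_cluster_id(article_keys):
    """Hash the smallest article key, so the ID ignores topic and ordering."""
    return hashlib.sha1(min(article_keys).encode("utf-8")).hexdigest()


def merge_orphans(clusters):
    """Union-Find pass: merge clusters that share at least one article key.

    `clusters` is a list of sets of article keys (assumed shape).
    """
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    owner = {}  # article key -> index of the first cluster seen with that key
    for idx, keys in enumerate(clusters):
        for key in keys:
            if key in owner:
                parent[find(idx)] = find(owner[key])  # union the two clusters
            else:
                owner[key] = idx

    merged = {}
    for idx, keys in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(keys)
    return list(merged.values())
```

Because the ID depends only on the minimum key, re-running the poller with the same articles in a different order, or with a different topic label, yields the same `cluster_id`. After an orphan merge, the ID of the merged cluster would be recomputed from the combined key set.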
|
|
|
|
|
|
-## Previous: v0.2.x architecture
|
|
|
-- FastMCP SSE server mounted at `/mcp`
|
|
|
-- SQLite cache for clusters + Groq summary caches
|
|
|
-- RSS fetch (breakingthenews.net)
|
|
|
-- v1 dedup via fuzzy title similarity only, seed-article-only comparison
|
|
|
-- optional Ollama embeddings path for clustering (when `NEWS_EMBEDDINGS_ENABLED=true`)
|
|
|
-- configurable embedding similarity threshold (`NEWS_EMBEDDING_SIMILARITY_THRESHOLD`)
|
|
|
-- optional embeddings backfill script for precomputing cluster vectors in SQLite
|
|
|
-- optional merge-analysis script for threshold experiments before any DB rewrite
|
|
|
-- optional merge pass for destructive consolidation after threshold review
|
|
|
-- optional article-dedup cleanup for repeated article variants inside a cluster
|
|
|
-- Groq enrichment (topic/entities/sentiment/keywords)
|
|
|
-- Tools expose semantic queries over cached clusters
|
|
|
-
|
|
|
-## MCP tools (current)
|
|
|
+## MCP tools
|
|
|
- `get_latest_events(topic, limit, include_articles)`
|
|
|
- `get_events_for_entity(entity, limit, timeframe, include_articles)`
|
|
|
- `get_event_summary(event_id, include_articles)`
|
|
|
-- `detect_emerging_topics(limit)`
|
|
|
+- `detect_emerging_topics(limit, timeframe, topic, around)` — returns signal_type (entity/keyword/phrase)
|
|
|
- `get_news_sentiment(entity, timeframe)`
|
|
|
- `get_related_recent_entities(subject, timeframe, limit, include_trends)`
|
|
|
+- `get_feeds()` / `toggle_feed(feed_url, enabled)`
|
|
|
- `get_capabilities()`
|
|
|
|
|
|
-## Refresh & caching
|
|
|
-
|
|
|
-## Future work (planned): entity graph over time
|
|
|
-Instead of treating `detect_emerging_topics()` as a flat list, we want a higher-level representation:
|
|
|
-
|
|
|
-- Convert emerging topic/entity co-occurrence signals into a **weighted entity graph**
|
|
|
-- Group the graph into **communities** (story neighborhoods)
|
|
|
-- Track **time evolution** across refresh windows:
|
|
|
- - node “momentum” (trend_score/count changes)
|
|
|
- - edge strength changes (relation tightening/weakening)
|
|
|
- - community emergence/disappearance
|
|
|
-
|
|
|
-Eventual agent tool shape (later): `get_emerging_entity_graph(timeframe, limit)`.
|
|
|
+## REST API
|
|
|
+- `GET /` — server info, tools list
|
|
|
+- `GET /health` — uptime, version hash
|
|
|
+- `GET /api/v1/clusters` — paginated, filtered by `payload_ts` SQL index
|
|
|
+- `GET /api/v1/entities` — top entities via junction table GROUP BY
|
|
|
+- `GET /api/v1/keywords` — top keywords via junction table GROUP BY
|
|
|
+- `GET /api/v1/clusters/by-entity?entity=X&hours=Y` — SQL entity search (NEW)
|
|
|
+- `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y` — SQL keyword search (NEW)
|
|
|
+- `GET /api/v1/sentiment-series` — filtered by `payload_ts` SQL index
|
|
|
+- `GET /api/v1/cluster/{cluster_id}` — full detail
|
|
|
+- `GET /api/v1/feeds` / `POST /api/v1/feeds/toggle` — feed management
|
|
|
|
|
|
-- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 900s)
|
|
|
-- Feed-hash skipping to avoid redundant RSS+Groq work
|
|
|
-- Cluster TTL (`NEWS_CLUSTERS_TTL_HOURS` via `CLUSTERS_TTL_HOURS`)
|
|
|
+## Refresh & caching
|
|
|
+- Background refresh every `NEWS_REFRESH_INTERVAL_SECONDS` (default 300s)
|
|
|
+- Feed-hash skipping to avoid redundant RSS+LLM work
|
|
|
- Summary caching for `get_event_summary`
|
|
|
+- Pruning via `NEWS_RETENTION_DAYS`, `NEWS_PRUNE_INTERVAL_HOURS`
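
Feed-hash skipping can be pictured with a small sketch. The hash function (SHA-256 here), the in-memory dictionary, and the name `feed_changed` are assumptions for illustration; the document only states that feed state lives in the SQLite cache and that unchanged feeds skip RSS and LLM work.

```python
import hashlib

_last_feed_hash = {}  # feed URL -> digest of the last body processed (assumed store)


def feed_changed(feed_url, body):
    """Return True when the feed body differs from the last one processed."""
    digest = hashlib.sha256(body).hexdigest()
    if _last_feed_hash.get(feed_url) == digest:
        return False  # unchanged: skip parsing, clustering and LLM calls
    _last_feed_hash[feed_url] = digest
    return True
```

A refresh cycle would call `feed_changed(url, raw_bytes)` right after the fetch and return early when it is `False`.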
|
|
|
|
|
|
-## Definition of “committable”
|
|
|
-- Tests pass offline (dedup/storage unit tests)
|
|
|
-- Server exposes tool surface with valid schemas
|
|
|
-- Caching prevents repeated Groq calls for unchanged clusters
|
|
|
-- Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
|
|
|
-- Embeddings backfill script exists for older cluster rows before the server restart
|
|
|
-- Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
|
|
|
-- Merge pass exists for destructive consolidation once thresholds look sane
|
|
|
-- Article-dedup cleanup exists for fixing duplicated article records already in SQLite
|
|
|
-- Entity lookup now respects timeframe as the scan window, with limit acting as a cap
|
|
|
-
|
|
|
-## Dashboard & REST API (added May 2026)
|
|
|
-
|
|
|
-### What was added
|
|
|
-- **5 REST endpoints** (`/api/v1/*`) for programmatic access to cluster data, sentiment series, entity frequencies, and health stats
|
|
|
-- **Dashboard SPA** at `/dashboard` — HTMX-based shell with Chart.js visualizations (5 views: health, clusters, sentiment, entities, detail)
|
|
|
-- **Non-blocking startup** — moved from synchronous `@app.on_event("startup")` pruning to `lifespan`-based fire-and-forget background loop; server responds within ~0.3s regardless of feed/LLM latency
|
|
|
-
|
|
|
-### Architecture
|
|
|
-```
|
|
|
-news-mcp/
|
|
|
-├── news_mcp/mcp_server_fastmcp.py ← MCP tools + REST API + dashboard mount
|
|
|
-├── news_mcp/dashboard/
|
|
|
-│ ├── dashboard_store.py ← Read-only query layer (no side effects)
|
|
|
-│ ├── index.html ← SPA shell with 5 views
|
|
|
-│ ├── style.css ← Dark theme, responsive
|
|
|
-│ └── dashboard.js ← Client-side rendering + Chart.js
|
|
|
-```
|
|
|
-
|
|
|
-### Key design decisions
|
|
|
-- Dashboard store wraps `SQLiteClusterStore` with thin read-only methods — no enrichment, no writes
|
|
|
-- Single shared store instance (`_shared_store`) avoids repeated DB connections
|
|
|
-- Static SPA files are served by FastAPI's `StaticFiles` mount — no Jinja2/templating dependency
|
|
|
-- Client-side `fetch()` + Chart.js avoids HTMX raw-JSON-in-DOM issues
|
|
|
-- Default lookback matches `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not a hardcoded 24h
|
|
|
-
|
|
|
-### Known gaps
|
|
|
-- No auth (LAN-only, no login)
|
|
|
-- Entity detail view in dashboard is minimal (click-to-expand from entity list is stub)
|
|
|
-- No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)
|
|
|
-
|
|
|
-## Dashboard & REST API (added May 2026)
|
|
|
-
|
|
|
-### What was added
|
|
|
-- **5 REST endpoints** (`/api/v1/*`) — JSON-only, for programmatic access and the dashboard
|
|
|
-- **Dashboard SPA** at `/dashboard` — 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, instant client-side rendering
|
|
|
-- **Non-blocking startup** — replaced synchronous `@app.on_event("startup")` with `lifespan`-based fire-and-forget background loop; server responds in <0.3s
|
|
|
-- **Async ingestion lock** — `asyncio.Lock` prevents overlapping refresh cycles
|
|
|
-- **Hardened LLM calls** — OpenRouter retry logic with exponential backoff on 429/5xx, response shape validation
|
|
|
-
|
|
|
-### Architecture additions
|
|
|
-```
|
|
|
-news-mcp/
|
|
|
-├── news_mcp/mcp_server_fastmcp.py ← MCP + REST API + /dashboard static mount
|
|
|
-├── news_mcp/dashboard/
|
|
|
-│ ├── __init__.py
|
|
|
-│ ├── dashboard_store.py ← Read-only query layer (no side effects)
|
|
|
-│ ├── index.html ← SPA shell, 5 views
|
|
|
-│ ├── style.css ← Dark theme, responsive grid
|
|
|
-│ └── dashboard.js ← Client render, Chart.js, null-safe DOM access
|
|
|
-```
|
|
|
-
|
|
|
-### Design decisions
|
|
|
-- **Dashboard store** wraps `SQLiteClusterStore` with read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
|
|
|
-- **Single shared store** (`_shared_store`) — one DB connection pool for the entire process.
|
|
|
-- **Static SPA** served via FastAPI `StaticFiles` — no Jinja2/templating dependency.
|
|
|
-- **Client-side rendering** with `fetch()` + Chart.js — avoids HTMX raw-JSON-in-DOM issues.
|
|
|
-- **Default lookback** follows `NEWS_DEFAULT_LOOKBACK_HOURS` (144h), not hardcoded.
|
|
|
-- **Cluster ordering** — always date-descending (SQL `ORDER BY updated_at DESC` + client-side sort as safety net).
|
|
|
-
|
|
|
-### Known gaps (for future work)
|
|
|
-- No auth (LAN-only assumption)
|
|
|
-- Entity detail view is functional but minimal
|
|
|
-- No alerting/threshold notifications (Phase 2)
|
|
|
-- No server-sent events for real-time dashboard updates
|
|
|
-
|
|
|
-## Keyword Utilization Upgrade (May 2026)
|
|
|
-
|
|
|
-### Problem
|
|
|
-Keywords are extracted by the LLM (`extract_entities.prompt` — "provide short keywords that justify the classification"), stored in the cluster payload, and displayed in the dashboard detail view — but they are not used by any search, scoring, or retrieval path. Thematic signals like "ETF", "rate-cut", "contagion" are invisible to entity search, emerging-topics detection, and related-entity expansion.
|
|
|
-
|
|
|
-### Plan
|
|
|
-
|
|
|
-#### Phase 1 — Search & Retrieval (done)
|
|
|
-- **1a**: Add keywords to `_cluster_entity_haystack()` in `mcp_server_fastmcp.py` so `get_events_for_entity()` and `get_news_sentiment()` match clusters by thematic keywords, not just named entities.
|
|
|
-- **1b**: Add `keywords` field to cluster output dicts in `get_latest_events()` and `get_events_for_entity()` so downstream LLM agents see the full semantic picture.
|
|
|
-
|
|
|
-#### Phase 2 — Emerging Topics (pending)
|
|
|
-- **2a**: Count keywords in `detect_emerging_topics()` with parallel `keyword_counts_recent` / `keyword_counts_prior` accumulators, scored with the same velocity/recency/source-diversity formula as entities.
|
|
|
-- **2b**: Optionally promote high-velocity keywords to "suggested entities" on the dashboard.
|
|
|
-
|
|
|
-#### Phase 3 — Relatedness & Dashboard (pending)
|
|
|
-- **3a**: Add keyword co-occurrence counting in `_collect_local_related()` in `related_entities.py`.
|
|
|
-- **3b**: Add `get_keyword_frequencies()` to `DashboardStore` and a "Keywords" panel on the dashboard.
|
|
|
-
|
|
|
-#### Phase 4 — Prompt Refinement (optional)
|
|
|
-- Split keyword extraction into "theme keywords" (subject matter) and "signal keywords" (what's new/notable) for differential weighting downstream.
|
|
|
-
|
|
|
-## Timestamp Normalization (May 2026)
|
|
|
-
|
|
|
-### Problem
|
|
|
-Cluster payloads stored timestamps as raw RSS strings (RFC 2822 HTTP-date like `"Sat, 30 May 2026 02:00:12 +00:00"`). Every read path needed fragile format-guessing, and SQL time-range queries on `updated_at` (row modification time, not event time) returned wrong data.
|
|
|
-
|
|
|
-### Fix
|
|
|
-- `_normalize_ts()` helper in `sqlite_store.py`: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform `YYYY-MM-DDTHH:MM:SS+00:00`
|
|
|
-- `sanitize_cluster_payload()` now normalizes `timestamp`, `first_seen`, `last_updated`, and all `article[].timestamp` before writing to DB
|
|
|
-- `merge_cluster_embeddings.py`: same normalization on merged payloads
|
|
|
-- `scripts/normalize_cluster_timestamps.py`: backfill script for existing rows (run on live server with correct `--db` path)
|
|
|
-- `get_sentiment_series()` and `get_entity_frequencies()`: filter by `payload.timestamp` in Python, not `updated_at` in SQL
|
|
|
-
|
|
|
-### Key invariant
|
|
|
-`updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.
|
|
|
-
|
|
|
-## Timestamp Read-Path Cleanup (May 2026)
|
|
|
-
|
|
|
-### Problem
|
|
|
-After normalization, all read paths still contained defensive RFC 2822 / `parsedate_to_datetime` fallback parsers. This was dead code on the live server (all stored timestamps are ISO 8601 UTC) and risked being re-introduced by future contributors who misread the defensive pattern as necessary.
|
|
|
-
|
|
|
-### Fix
|
|
|
-- Added `_read_ts(ts) -> float | None` to `sqlite_store.py` (module-level, exported). Uses only `datetime.fromisoformat()`. No RFC 2822 fallback. If it fails, the normalization pipeline has a bug — fix that instead.
|
|
|
-- All read-path timestamp parsing in `sqlite_store.py`, `dashboard_store.py`, and `mcp_server_fastmcp.py` now uses `_read_ts` or plain `fromisoformat`.
|
|
|
-- `parsedate_to_datetime` removed from `dashboard_store.py` and `mcp_server_fastmcp.py` imports entirely.
|
|
|
-- `parsedate_to_datetime` is **only** retained in `sqlite_store._normalize_ts()` (the write path) and `dedup/cluster.py` (raw ingest before normalization).
|
|
|
-- Test fixtures updated to use ISO 8601 UTC timestamps.
|
|
|
-
|
|
|
-### Contract (ENFORCE THIS)
|
|
|
-- `payload.timestamp`, `payload.first_seen`, `payload.last_updated` are **always** `YYYY-MM-DDTHH:MM:SS+00:00` for any row written after the normalization migration.
|
|
|
-- Read paths: use `_read_ts()` from `sqlite_store` or `datetime.fromisoformat()` directly. **Never** add `parsedate_to_datetime` to a read path.
|
|
|
-- Write paths: `sanitize_cluster_payload()` in `sqlite_store.py` is the single normalization point. All writes go through `upsert_clusters()` which calls it.
|
|
|
-- This repo is a **dev machine** copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data — the dev DB is stale/empty.
|
|
|
-
|
|
|
-## Junction Tables + Indexed Timestamp (May 2026)
|
|
|
-
|
|
|
-### Problem
|
|
|
-All read paths deserialize every JSON payload to filter by entity/keyword/time. With 6000+ clusters, `get_clusters_page` returns only the 100 newest — clicking an entity that appears 34x shows only 2 clusters because the other 32 are outside the LIMIT. `get_entity_frequencies` counts correctly but the detail view can't find them. Every query does a full table scan with JSON parsing.
|
|
|
-
|
|
|
-### Solution: junction tables + generated timestamp column
|
|
|
-
|
|
|
-**Schema (migrated in `_init_db`, incremental-safe):**
|
|
|
-
|
|
|
+## Schema (clusters table)
|
|
|
```sql
|
|
|
--- Indexed event timestamp (SQLite generated column — zero write-path cost)
|
|
|
-ALTER TABLE clusters ADD COLUMN payload_ts
|
|
|
- GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) STORED;
|
|
|
-CREATE INDEX IF NOT EXISTS idx_clusters_payload_ts ON clusters(payload_ts);
|
|
|
+CREATE TABLE clusters (
|
|
|
+ cluster_id TEXT PRIMARY KEY,
|
|
|
+ topic TEXT NOT NULL,
|
|
|
+ payload TEXT NOT NULL,
|
|
|
+ updated_at TEXT NOT NULL, -- row modification time (set on every upsert)
|
|
|
+ summary_payload TEXT,
|
|
|
+ summary_updated_at TEXT,
|
|
|
+ payload_ts GENERATED ALWAYS AS -- indexed event time (auto-maintained)
|
|
|
+ (json_extract(payload, '$.timestamp')) VIRTUAL
|
|
|
+);
|
|
|
+CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
|
|
|
|
|
|
--- Entity junction table for SQL-level entity search
|
|
|
-CREATE TABLE IF NOT EXISTS cluster_entities (
|
|
|
+CREATE TABLE cluster_entities (
|
|
|
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
|
|
|
- entity TEXT NOT NULL,
|
|
|
+ entity TEXT NOT NULL, -- lowercased
|
|
|
PRIMARY KEY (cluster_id, entity)
|
|
|
);
|
|
|
-CREATE INDEX IF NOT EXISTS idx_cluster_entities_entity ON cluster_entities(entity);
|
|
|
+CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
|
|
|
|
|
|
--- Keyword junction table for SQL-level keyword search
|
|
|
-CREATE TABLE IF NOT EXISTS cluster_keywords (
|
|
|
+CREATE TABLE cluster_keywords (
|
|
|
cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
|
|
|
- keyword TEXT NOT NULL,
|
|
|
+ keyword TEXT NOT NULL, -- lowercased
|
|
|
PRIMARY KEY (cluster_id, keyword)
|
|
|
);
|
|
|
-CREATE INDEX IF NOT EXISTS idx_cluster_keywords_keyword ON cluster_keywords(keyword);
|
|
|
+CREATE INDEX idx_cluster_keywords_keyword ON cluster_keywords(keyword);
|
|
|
```
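
To see how the generated column and a junction table work together, here is a hedged, self-contained sketch using Python's `sqlite3` with an in-memory database. It trims the `clusters` table to the columns needed for the demo, and the helper names `upsert` and `entity_frequencies` are hypothetical. It assumes SQLite 3.31 or newer (generated columns) with the JSON functions available.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clusters (
  cluster_id TEXT PRIMARY KEY,
  payload    TEXT NOT NULL,
  payload_ts GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
  cluster_id TEXT NOT NULL REFERENCES clusters(cluster_id) ON DELETE CASCADE,
  entity     TEXT NOT NULL,
  PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
""")


def upsert(cluster_id, timestamp, entities):
    """Write the payload and rebuild the junction rows in one transaction."""
    payload = json.dumps({"timestamp": timestamp, "entities": entities})
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO clusters (cluster_id, payload) VALUES (?, ?)",
            (cluster_id, payload),
        )
        # Drop stale rows first so re-enrichment cannot leave old entities behind.
        conn.execute("DELETE FROM cluster_entities WHERE cluster_id = ?", (cluster_id,))
        conn.executemany(
            "INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)",
            [(cluster_id, e.lower()) for e in entities],
        )


def entity_frequencies(since):
    """Count clusters per entity for events at or after `since` (ISO 8601 UTC)."""
    return conn.execute(
        """
        SELECT ce.entity, COUNT(*) AS cnt
        FROM cluster_entities AS ce
        JOIN clusters AS c ON c.cluster_id = ce.cluster_id
        WHERE c.payload_ts >= ?
        GROUP BY ce.entity
        ORDER BY cnt DESC, ce.entity
        """,
        (since,),
    ).fetchall()
```

The write path never sets `payload_ts`; SQLite derives it from the JSON. The string comparison in `WHERE c.payload_ts >= ?` is only correct because every stored timestamp shares the same `+00:00` format.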
|
|
|
|
|
|
-**Write path (`upsert_clusters`):** Within the existing transaction, after sanitizing the payload and before INSERT/UPDATE:
|
|
|
-1. `DELETE FROM cluster_entities WHERE cluster_id = ?` (handles re-enrichment)
|
|
|
-2. `DELETE FROM cluster_keywords WHERE cluster_id = ?`
|
|
|
-3. `INSERT OR IGNORE INTO cluster_entities VALUES (?, ?)` for each entity
|
|
|
-4. `INSERT OR IGNORE INTO cluster_keywords VALUES (?, ?)` for each keyword
|
|
|
-5. `payload_ts` is auto-maintained by SQLite's generated column — no code needed
|
|
|
-
|
|
|
-**Read paths — all SQL-level, no JSON parsing at query time:**
|
|
|
-
|
|
|
-- `get_clusters_page`: `WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ? OFFSET ?`
|
|
|
-- `get_entity_frequencies`: `JOIN cluster_entities ... WHERE payload_ts >= ? GROUP BY entity ORDER BY cnt DESC`
|
|
|
-- `get_keyword_frequencies`: `JOIN cluster_keywords ... WHERE payload_ts >= ? GROUP BY keyword ORDER BY cnt DESC`
|
|
|
-- New `get_clusters_by_entity`: `JOIN cluster_entities WHERE payload_ts >= ? AND entity = ?`
|
|
|
-- New `get_clusters_by_keyword`: `JOIN cluster_keywords WHERE payload_ts >= ? AND keyword = ?`
|
|
|
+## Keyword Utilization (done, May 2026)
|
|
|
+Keywords extracted by the LLM are now first-class search signals:
|
|
|
+- `_cluster_entity_haystack()` includes keywords → `get_events_for_entity()` matches themes
|
|
|
+- Cluster output includes `keywords[]` field
|
|
|
+- `detect_emerging_topics()` scores keywords with velocity/recency/source-diversity formula (`signal_type: "keyword"`)
|
|
|
+- `_collect_local_related()` counts keyword co-occurrence
|
|
|
+- Dashboard Keywords panel with SQL frequency counts via junction table
|
|
|
+- Topic labels (crypto/macro/regulation/ai/other) filtered from keywords at extraction time
|
|
|
|
|
|
-**Backfill script (`scripts/backfill_junction_tables.py`):**
|
|
|
-- Same pattern as `normalize_cluster_timestamps.py`
|
|
|
-- Accepts `--db` arg, defaults to config DB_PATH
|
|
|
-- Reads all cluster payloads, populates `cluster_entities` and `cluster_keywords`
|
|
|
-- `payload_ts` is auto-populated by SQLite's generated column
|
|
|
-- Idempotent (`INSERT OR IGNORE` + transaction)
|
|
|
-- Reports entity/keyword counts after completion
|
|
|
-- Run once on live server: `docker exec -it <container> python3 scripts/backfill_junction_tables.py`
|
|
|
+## Timestamp Pipeline (May 2026)
|
|
|
+1. **Write**: `sanitize_cluster_payload()` normalizes `timestamp`/`first_seen`/`last_updated` to `YYYY-MM-DDTHH:MM:SS+00:00`. If all three are missing, it falls back to `datetime.now()`.
|
|
|
+2. **Generated column**: `payload_ts` auto-extracts from JSON on write. Indexed.
|
|
|
+3. **Read**: All queries filter by `payload_ts >= ?` in SQL. No JSON parsing for time filtering.
|
|
|
+4. **Backfill**: One-time `scripts/backfill_junction_tables.py` populated junction tables from existing payloads. `payload_ts` was auto-populated by SQLite.
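
The write-step normalization can be approximated as below. This is a sketch under stated assumptions, not the project's `sanitize_cluster_payload()`: the name `normalize_ts_sketch`, the parser order (epoch seconds, then ISO 8601, then RFC 2822), and treating naive datetimes as UTC are all illustrative choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _parse(text):
    """Try each accepted input format in turn; return a datetime or None."""
    parsers = (
        # Epoch seconds given as a numeric string.
        lambda s: datetime.fromtimestamp(float(s), tz=timezone.utc),
        # ISO 8601; older Pythons reject a trailing "Z", so spell the offset out.
        lambda s: datetime.fromisoformat(s[:-1] + "+00:00" if s.endswith("Z") else s),
        # RFC 2822 / HTTP-date, as found in raw RSS.
        parsedate_to_datetime,
    )
    for parse in parsers:
        try:
            return parse(text)
        except (TypeError, ValueError, OverflowError, OSError):
            continue
    return None


def normalize_ts_sketch(value):
    """Return `YYYY-MM-DDTHH:MM:SS+00:00`, or None if the value is unparseable."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    elif isinstance(value, str):
        dt = _parse(value.strip())
    else:
        dt = None
    if dt is None:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")
```

Keeping every stored value in this one format is what lets the read step compare `payload_ts` as plain text in SQL.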
|
|
|
|
|
|
-**REST API changes:**
|
|
|
-- `GET /api/v1/clusters` — now uses SQL `payload_ts` filter, consistent total
|
|
|
-- `GET /api/v1/entities` — SQL `COUNT(*) ... GROUP BY` via junction table
|
|
|
-- `GET /api/v1/keywords` — SQL `COUNT(*) ... GROUP BY` via junction table
|
|
|
-- **New `GET /api/v1/clusters/by-entity?entity=X&hours=Y&limit=Z`** — SQL entity search
|
|
|
-- **New `GET /api/v1/clusters/by-keyword?keyword=X&hours=Y&limit=Z`** — SQL keyword search
|
|
|
+## Design Flaw: Two Stores (KNOWN, fix planned)
|
|
|
|
|
|
-**Dashboard JS changes:**
|
|
|
-- `showEntityDetail(label)` — calls `/api/v1/clusters/by-entity` instead of fetching all clusters
|
|
|
-- `showKeywordDetail(label)` — calls `/api/v1/clusters/by-keyword` instead of fetching all clusters
|
|
|
+**Problem:** `SQLiteClusterStore` and `DashboardStore` are parallel copies of the same data access layer. Methods were duplicated when `DashboardStore` was added, using the same JSON-parsing approach. When junction tables were implemented, only `DashboardStore` was updated. `SQLiteClusterStore` (used by MCP tools) still runs entity/keyword search by parsing JSON payloads in Python instead of querying the junction tables.
|
|
|
|
|
|
-**Files changed:**
|
|
|
-| File | Change |
|
|
|
-|---|---|
|
|
|
-| `news_mcp/storage/sqlite_store.py` | Schema migration (generated column + junction tables), write-path junction population, new SQL-level read methods |
|
|
|
-| `news_mcp/mcp_server_fastmcp.py` | New REST endpoints for entity/keyword cluster search |
|
|
|
-| `news_mcp/dashboard/dashboard_store.py` | `get_entity_frequencies`, `get_keyword_frequencies` use SQL junction table counts |
|
|
|
-| `dashboard/dashboard.js` | `showEntityDetail`, `showKeywordDetail` call new endpoints |
|
|
|
-| `scripts/backfill_junction_tables.py` | New backfill script (same pattern as normalize_cluster_timestamps.py) |
|
|
|
+**Current state:**
|
|
|
+- `DashboardStore` — uses SQL `payload_ts` filter + junction tables ✓
|
|
|
+- `SQLiteClusterStore` — uses SQL `payload_ts` filter for time ✓, but MCP tool entity search (`get_events_for_entity`, `get_news_sentiment`) still fetches top-N clusters by time then filters entities in Python
|
|
|
|
|
|
-**Migration safety:**
|
|
|
-- All DDL uses `IF NOT EXISTS` / `ADD COLUMN IF NOT EXISTS` — safe to re-run
|
|
|
-- Backfill script is idempotent (`INSERT OR IGNORE` in transactions)
|
|
|
-- Generated column requires no write-path code changes
|
|
|
-- Old query methods can coexist during transition (removed after verification)
|
|
|
+**Consequence:** `get_events_for_entity("Pete Hegseth", timeframe="72h")` fetches the 200 most recent clusters (via `payload_ts`), then loops in Python checking entities. If the entity appears in 34 clusters but only 15 are in the top 200, 19 are missed.
|
|
|
|
|
|
+**Proposed fix:** Collapse both stores into one. `SQLiteClusterStore` should be the single data access layer with proper junction-table methods for entity/keyword search. `DashboardStore` should be a thin wrapper or removed entirely. MCP tools should call `SQLiteClusterStore.get_clusters_by_entity()` using junction tables instead of Python-side filtering.
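
The miss described under "Consequence" and the junction-table alternative can be reproduced at small scale. The sketch below is illustrative only: the trimmed schema, the seeded data, and the function names `by_entity_python` and `by_entity_sql` are assumptions, and the real `get_clusters_by_entity()` may differ.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clusters (
  cluster_id TEXT PRIMARY KEY,
  payload    TEXT NOT NULL,
  payload_ts GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE INDEX idx_clusters_payload_ts ON clusters(payload_ts);
CREATE TABLE cluster_entities (
  cluster_id TEXT NOT NULL,
  entity     TEXT NOT NULL,
  PRIMARY KEY (cluster_id, entity)
);
CREATE INDEX idx_cluster_entities_entity ON cluster_entities(entity);
""")

# Seed ten clusters; only the two oldest mention the entity of interest.
for i in range(10):
    entities = ["pete hegseth"] if i < 2 else ["other"]
    payload = json.dumps({"timestamp": f"2026-05-30T{i:02d}:00:00+00:00", "entities": entities})
    conn.execute("INSERT INTO clusters (cluster_id, payload) VALUES (?, ?)", (f"c{i}", payload))
    conn.executemany("INSERT INTO cluster_entities VALUES (?, ?)", [(f"c{i}", e) for e in entities])
conn.commit()

SINCE = "2026-05-30T00:00:00+00:00"


def by_entity_python(entity, since, scan_limit):
    """Top-N by time first, then filter in Python: older matches fall outside N."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters "
        "WHERE payload_ts >= ? ORDER BY payload_ts DESC LIMIT ?",
        (since, scan_limit),
    ).fetchall()
    return [cid for cid, payload in rows if entity in json.loads(payload)["entities"]]


def by_entity_sql(entity, since, limit):
    """Filter by entity in SQL, so LIMIT applies only to matching clusters."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM clusters AS c
        JOIN cluster_entities AS ce ON ce.cluster_id = c.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        LIMIT ?
        """,
        (entity, since, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

With a scan limit of 5, `by_entity_python("pete hegseth", SINCE, 5)` returns nothing because both matches are older than the five newest clusters, while `by_entity_sql("pete hegseth", SINCE, 5)` returns both. The difference is where the `LIMIT` is applied relative to the entity filter.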
|