Project: news-mcp

Goal

Provide a signal-extraction MCP server that converts RSS into deduplicated, enriched news clusters that are easy for agents to use.

Current architecture (v1)

  • FastMCP SSE server mounted at /mcp
  • SQLite cache for clusters + Groq summary caches
  • RSS fetch (breakingthenews.net)
  • v1 dedup via fuzzy title similarity
  • optional Ollama embeddings path for clustering (when NEWS_EMBEDDINGS_ENABLED=true)
  • configurable embedding similarity threshold (NEWS_EMBEDDING_SIMILARITY_THRESHOLD)
  • optional embeddings backfill script for precomputing cluster vectors in SQLite
  • optional merge-analysis script for threshold experiments before any DB rewrite
  • optional merge pass for destructive consolidation after threshold review
  • optional article-dedup cleanup for repeated article variants inside a cluster
  • Groq enrichment (topic/entities/sentiment/keywords)
  • Tools expose semantic queries over cached clusters
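
As a rough illustration of the v1 fuzzy-title dedup listed above, the sketch below groups titles greedily by string similarity. It is an assumption-laden example, not the project's actual code: the names title_similarity and cluster_titles, the 0.8 threshold, and the use of difflib.SequenceMatcher as the similarity measure are all hypothetical.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two titles.

    Titles are lowercased and whitespace-collapsed first so that casing and
    spacing differences do not count as edits.
    """
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def cluster_titles(titles: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy single-pass grouping (hypothetical helper).

    Each title joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters: list[list[str]] = []
    for title in titles:
        for cluster in clusters:
            if title_similarity(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([title])
    return clusters
```

A greedy pass like this is order-dependent and compares only against each cluster's first title, which is one reason an embeddings path with a tunable threshold can be a useful alternative.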

MCP tools (current)

  • get_latest_events(topic, limit, include_articles)
  • get_events_for_entity(entity, limit, timeframe, include_articles)
  • get_event_summary(event_id, include_articles)
  • detect_emerging_topics(limit)
  • get_news_sentiment(entity, timeframe)
  • get_related_recent_entities(subject, timeframe, limit, include_trends)
  • get_capabilities()

Future work (planned): entity graph over time

Instead of treating detect_emerging_topics() as a flat list, we want a higher-level representation:

  • Convert emerging topic/entity co-occurrence signals into a weighted entity graph
  • Group the graph into communities (story neighborhoods)
  • Track time evolution across refresh windows:
    • node “momentum” (trend_score/count changes)
    • edge strength changes (relation tightening/weakening)
    • community emergence/disappearance

Eventual agent tool shape (later): get_emerging_entity_graph(timeframe, limit).

Refresh & caching

  • Background refresh every NEWS_REFRESH_INTERVAL_SECONDS (default 900s)
  • Feed-hash skipping to avoid redundant RSS+Groq work
  • Cluster TTL (NEWS_CLUSTERS_TTL_HOURS via CLUSTERS_TTL_HOURS)
  • Summary caching for get_event_summary
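
The feed-hash skipping idea can be sketched as follows. This is a minimal example under stated assumptions: FeedHashGate and feed_fingerprint are hypothetical names, SHA-256 over the raw feed bytes is an assumed fingerprint, and the last hash is held in memory rather than in SQLite.

```python
import hashlib
from typing import Optional


def feed_fingerprint(feed_bytes: bytes) -> str:
    """Hash the raw RSS payload so identical fetches produce identical digests."""
    return hashlib.sha256(feed_bytes).hexdigest()


class FeedHashGate:
    """Remembers the last processed feed digest (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_hash: Optional[str] = None

    def should_process(self, feed_bytes: bytes) -> bool:
        """Return True only when the feed differs from the last one processed."""
        digest = feed_fingerprint(feed_bytes)
        if digest == self._last_hash:
            # Unchanged feed: skip parsing, clustering, and enrichment calls.
            return False
        self._last_hash = digest
        return True
```

One caveat with hashing raw bytes: a feed that embeds a volatile field (for example a build timestamp) changes its digest on every fetch, so hashing only the item GUIDs or titles may be a better fit in that case.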

Definition of “committable”

  • Tests pass offline (dedup/storage unit tests)
  • Server exposes tool surface with valid schemas
  • Caching prevents repeated Groq calls for unchanged clusters
  • Embeddings remain optional: Ollama is tried first when enabled, otherwise the heuristic path stays active
  • Embeddings backfill script exists to precompute vectors for older cluster rows before a server restart
  • Merge-analysis script exists to inspect candidate cluster pairs at multiple thresholds
  • Merge pass exists for destructive consolidation once thresholds look sane
  • Article-dedup cleanup exists for fixing duplicated article records already in SQLite
  • Entity lookup now respects timeframe as the scan window, with limit acting as a cap
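
The last point (timeframe as the scan window, limit as a cap) can be illustrated with a small in-memory sketch. The function name events_for_entity, the dict-shaped clusters, the "entities" field, and an hour-based timeframe are assumptions made for illustration; the real lookup runs against SQLite.

```python
from datetime import datetime, timedelta, timezone


def events_for_entity(clusters, entity, timeframe_hours, limit, now=None):
    """Scan every cluster inside the time window, then cap the result.

    The window decides which clusters are considered at all; `limit` is only
    applied afterwards, so it never shrinks the scan itself.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=timeframe_hours)
    wanted = entity.lower()
    matches = [
        c for c in clusters
        if c["updated_at"] >= cutoff
        and wanted in (e.lower() for e in c["entities"])
    ]
    # Newest first, then truncate to the requested cap.
    matches.sort(key=lambda c: c["updated_at"], reverse=True)
    return matches[:limit]
```

Applying the cap before the window filter would return the newest N clusters overall and could miss entity matches that are older but still inside the timeframe.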

Dashboard & REST API (added May 2026)

What was added

  • 5 REST endpoints (/api/v1/*): JSON-only, for programmatic access to cluster data, sentiment series, entity frequencies, and health stats, and for the dashboard
  • Dashboard SPA at /dashboard: 5 views (health, clusters, sentiment, entities, detail), Chart.js visualizations, client-side rendering
  • Non-blocking startup: replaced the synchronous @app.on_event("startup") pruning with a lifespan-based fire-and-forget background loop; the server responds within about 0.3s regardless of feed/LLM latency
  • Async ingestion lock: asyncio.Lock prevents overlapping refresh cycles
  • Hardened LLM calls: OpenRouter retry logic with exponential backoff on 429/5xx, plus response shape validation

Architecture

news-mcp/
├── news_mcp/mcp_server_fastmcp.py   ← MCP tools + REST API + /dashboard static mount
├── news_mcp/dashboard/
│   ├── __init__.py
│   ├── dashboard_store.py           ← Read-only query layer (no side effects)
│   ├── index.html                   ← SPA shell with 5 views
│   ├── style.css                    ← Dark theme, responsive grid
│   └── dashboard.js                 ← Client-side rendering, Chart.js, null-safe DOM access

Key design decisions

  • Dashboard store wraps SQLiteClusterStore with thin read-only methods (stats, pagination, series, frequencies, detail). No writes, no enrichment.
  • Single shared store instance (_shared_store): one DB connection pool for the entire process, avoiding repeated DB connections.
  • Static SPA files are served by FastAPI's StaticFiles mount, so there is no Jinja2/templating dependency.
  • Client-side rendering with fetch() + Chart.js avoids HTMX raw-JSON-in-DOM issues.
  • Default lookback follows NEWS_DEFAULT_LOOKBACK_HOURS (144h), not a hardcoded 24h.
  • Cluster ordering is always date-descending (SQL ORDER BY updated_at DESC, plus a client-side sort as a safety net).

Known gaps (for future work)

  • No auth (LAN-only assumption, no login)
  • Entity detail view is functional but minimal (click-to-expand from the entity list is a stub)
  • No alerting/threshold notifications yet (Phase 2: velocity spikes, sentiment divergence)
  • No server-sent events for real-time dashboard updates
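
The non-blocking startup and the ingestion lock described in this section could be combined roughly as below. This is a sketch, not the server's actual code: Refresher, make_lifespan, and the refresh_once callable are hypothetical names, and the example uses only asyncio so it runs without FastAPI. The resulting lifespan(app) context manager has the shape that FastAPI's lifespan parameter accepts.

```python
import asyncio
from contextlib import asynccontextmanager, suppress


class Refresher:
    """Runs refresh cycles in the background, never two at once (hypothetical)."""

    def __init__(self, refresh_once, interval_seconds: float) -> None:
        self._refresh_once = refresh_once  # async callable doing fetch + enrich
        self._interval = interval_seconds
        self._lock = None  # created lazily inside the running event loop

    async def run_cycle(self) -> bool:
        """Run one cycle; return False if another cycle is already in flight."""
        if self._lock is None:
            self._lock = asyncio.Lock()
        if self._lock.locked():
            return False  # skip instead of queueing behind the running cycle
        async with self._lock:
            await self._refresh_once()
        return True

    async def loop_forever(self) -> None:
        while True:
            try:
                await self.run_cycle()
            except Exception as exc:  # keep the loop alive on feed/LLM errors
                print(f"refresh failed: {exc!r}")
            await asyncio.sleep(self._interval)


def make_lifespan(refresher: Refresher):
    """Build a lifespan context manager that starts the loop without awaiting it."""

    @asynccontextmanager
    async def lifespan(app):
        task = asyncio.create_task(refresher.loop_forever())
        try:
            yield  # the server starts serving immediately
        finally:
            task.cancel()
            with suppress(asyncio.CancelledError):
                await task

    return lifespan
```

Because the loop is started with create_task rather than awaited, startup time does not depend on how long the first refresh takes, and a manual refresh that arrives mid-cycle is skipped rather than stacked.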