
news-mcp

Two-Machine Workflow

This project spans two machines. Always check which machine you're operating on.

| | Latitude (dev) | ThinkCenter-2 (live) |
|---|---|---|
| Hostname | latitude | thinkcenter-2 |
| IP | 192.168.0.249 | 192.168.0.200 |
| Projects dir | /home/lucky/.openclaw/workspace/ | /home/lucky/ |
| This repo | /home/lucky/.openclaw/workspace/news-mcp/ | /home/lucky/news-mcp/ |
| DB path | data/news.sqlite (host-side: repo-root data/) | /app/data/news.sqlite inside Docker container (host bind-mount: data/news.sqlite) |
| Server URL | localhost:8506 | http://192.168.0.200:8506 |

  • The terminal prompt always shows the machine name (e.g. lucky@thinkcenter-2, lucky@latitude).
  • When commands are pasted they include the prompt — read it to know which machine.
  • When the user says "the live server", "thinkcenter-2", or "remote", they mean 192.168.0.200.
  • The live server runs in Docker (docker-compose up -d news-mcp).
  • ssh into live: ssh lucky@192.168.0.200
  • The live DB is at /app/data/news.sqlite in the container, which bind-mounts from data/news.sqlite on the host FS (relative to repo root).
  • Local and Docker now use the same default path, ./data/news.sqlite, so run.sh and docker-compose up share the same database. The AGENTS.md section below ("Docker/DB path oddity") no longer applies on this machine.
  • Do NOT run maintenance/backfill scripts against the dev DB — it's empty/stale. Either point explicitly to the live DB or tell the user to run it.
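
A script can make the "which machine am I on" check explicit before touching a database. The sketch below is only an illustration: the hostnames come from the table above, while machine_role, require_role, and the "dev"/"live" labels are hypothetical names that do not exist in the repo.

```python
import socket

# Hostnames as listed in this document; the role labels are this sketch's own.
KNOWN_HOSTS = {"latitude": "dev", "thinkcenter-2": "live"}


def machine_role(hostname=None):
    """Map a hostname to "dev", "live", or "unknown"."""
    # Drop any domain suffix and normalize case before the lookup.
    name = (hostname or socket.gethostname()).split(".")[0].lower()
    return KNOWN_HOSTS.get(name, "unknown")


def require_role(expected, hostname=None):
    """Raise if the current machine is not the expected one."""
    role = machine_role(hostname)
    if role != expected:
        raise RuntimeError(f"expected the {expected} machine, but this is {role}")
    return role
```

Calling require_role("live") at the top of a maintenance script would stop it from running against the dev copy by accident.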

Local Environment

  • Source the repo-local .venv first when it exists.
  • Prefer ./tests.sh for offline verification and ./live_tests.sh only for provider-backed smoke checks.
  • Use ./run.sh to start the server locally; it resolves the repo root and prefers the local Uvicorn binary.
  • The local data/ directory contains the same DB the Docker container uses — run.sh and docker-compose up converge on ./data/news.sqlite.

Repo Map

  • news_mcp/mcp_server_fastmcp.py: MCP tool surface, startup refresh, pruning, HTTP health endpoints, REST API.
  • news_mcp/jobs/poller.py: feed refresh loop, clustering, enrichment, and cache writes.
  • news_mcp/storage/sqlite_store.py: SQLite schema (payload_ts, junction tables), upsert with junction population, SQL-level read methods. Single data access layer for MCP tools.
  • news_mcp/dashboard/dashboard_store.py: Read-only query layer for the dashboard REST API that wrapped SQLiteClusterStore and duplicated some of its methods. NOTE: per "Design Flaw: Two Stores" below, DashboardStore was eliminated in May 2026; use SQLiteClusterStore for all queries.
  • news_mcp/dedup/cluster.py: topic bucketing and fuzzy/embedding clustering.
  • news_mcp/enrichment/llm_enrich.py: LLM extraction/summarization and blacklist filtering.
  • news_mcp/trends_resolution.py and news_mcp/related_entities.py: entity resolution and neighborhood lookup.
  • news_mcp/config.py: env-driven defaults and file paths.

Query Architecture (READ THIS BEFORE ADDING NEW QUERIES)

Time filtering: Always use payload_ts >= ? SQL filter. Never parse JSON timestamps in Python for time ranges.

Entity/keyword search: Use junction tables:

  • cluster_entities for entity search: JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id WHERE ce.entity = ?
  • cluster_keywords for keyword search: JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id WHERE ck.keyword = ?
  • Do NOT fetch all clusters and filter entities in Python.
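
The rules above can be sketched with the standard-library sqlite3 module. The table layout here is a simplified assumption for illustration (the real schema lives in news_mcp/storage/sqlite_store.py), and clusters_for_entity is a hypothetical helper; only the join shape and the payload_ts filter come from this document.

```python
import sqlite3

# Hypothetical minimal schema; the real one is defined in sqlite_store.py.
SCHEMA = """
CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, payload_ts TEXT, payload TEXT);
CREATE TABLE cluster_entities (cluster_id TEXT, entity TEXT);
"""


def clusters_for_entity(conn, entity, since_iso):
    """Return ids of clusters that mention `entity` with payload_ts >= since_iso."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM clusters c
        JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        """,
        (entity, since_iso),
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO clusters VALUES (?, ?, '{}')",
        [("a", "2026-05-01T10:00:00+00:00"), ("b", "2026-05-03T10:00:00+00:00")],
    )
    conn.executemany(
        "INSERT INTO cluster_entities VALUES (?, ?)",
        [("a", "OpenAI"), ("b", "OpenAI"), ("b", "NASA")],
    )
    # Both clusters mention the entity; only "b" is newer than the cutoff.
    return clusters_for_entity(conn, "OpenAI", "2026-05-02T00:00:00+00:00")
```

Both the entity match and the time window are resolved inside SQLite, so no cluster payload is loaded or parsed in Python.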

Backfill: After schema changes, run scripts/backfill_junction_tables.py in the Docker container:

docker exec -it news-mcp python3 scripts/backfill_junction_tables.py

Design Flaw: Two Stores (FIXED May 2026)

DashboardStore was eliminated. SQLiteClusterStore is the single data access layer with junction-table entity/keyword search. All MCP tools use the proper SQL methods.

Docker / Live Server Details

  • docker-compose.yml mounts ./:/app with working_dir: /app
  • Data dir and DB path both hardcoded in docker-compose env: NEWS_MCP_DB_PATH: ./data/news.sqlite
  • Target DB on live server: /app/data/news.sqlite in container → data/news.sqlite on host
  • Backfill script: scripts/normalize_cluster_timestamps.py — always run with explicit --db or set NEWS_MCP_DB_PATH (default now matches ./data/news.sqlite)

Current Contracts

  • Clusters are the unit of truth, not raw articles.
  • NEWS_DEFAULT_LOOKBACK_HOURS controls read freshness only.
  • NEWS_PRUNING_ENABLED, NEWS_RETENTION_DAYS, and NEWS_PRUNE_INTERVAL_HOURS control physical deletion.
  • Entity aliasing is intentionally conservative; keep config/entity_aliases.json tight.
  • include_articles=true should keep responses compact and only return minimal article fields.
  • Timestamps in cluster payloads are normalized to ISO 8601 UTC (YYYY-MM-DDTHH:MM:SS+00:00) at write time in sanitize_cluster_payload().
  • Single data directory: default DATA_DIR is repo-root ./data/ — used by both run.sh and docker-compose up. No env override needed.
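
As an illustration of the timestamp contract above, the following sketch coerces an ISO 8601 string into the YYYY-MM-DDTHH:MM:SS+00:00 form. normalize_timestamp is a hypothetical helper, not the code inside sanitize_cluster_payload(), and treating naive timestamps as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone


def normalize_timestamp(value):
    """Hypothetical helper: return `value` as YYYY-MM-DDTHH:MM:SS+00:00."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Convert to UTC and drop sub-second precision to match the contract.
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

For example, normalize_timestamp("2026-05-01T14:30:00+02:00") returns "2026-05-01T12:30:00+00:00". Because every stored value shares one fixed-width UTC format, plain string comparison orders the timestamps chronologically.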

Version Hash

Every running server exposes a deterministic content hash via its health endpoints. Use it to prove to an agent that the live container was restarted with new code.

How it works: At import time the server walks all .py files under news_mcp/, sorts them by path, and computes SHA-256 of their concatenated contents, taking the first 9 hex characters. No git dependency — works identically in Docker and native runs.
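
A minimal sketch of the procedure just described, for orientation only. Details such as hashing raw bytes, sorting by POSIX-style path string, and leaving file names out of the digest are assumptions, so this code is not guaranteed to reproduce the value the server or version-hash.sh reports. content_hash is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(package_dir, length=9):
    """SHA-256 over the concatenated bytes of all .py files, sorted by path."""
    digest = hashlib.sha256()
    # Sorting makes the result independent of filesystem listing order.
    for path in sorted(Path(package_dir).rglob("*.py"), key=lambda p: p.as_posix()):
        digest.update(path.read_bytes())
    return digest.hexdigest()[:length]


def demo():
    """Show that editing any .py file changes the hash."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "b.py").write_bytes(b"y = 2\n")
        Path(tmp, "a.py").write_bytes(b"x = 1\n")
        Path(tmp, "notes.txt").write_bytes(b"ignored\n")  # not a .py file
        before = content_hash(tmp)
        Path(tmp, "a.py").write_bytes(b"x = 3\n")
        after = content_hash(tmp)
    return before, after
```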

Where to find it:

  • GET /health → {"status":"ok","uptime":...,"version":"624993d5f"}
  • GET /api/v1/health → {...,"version":"624993d5f",...}

Workflow to verify a restart:

When you tell an agent "I restarted the server" and they're unsure, have them curl the live server and compare the hash against ./version-hash.sh:

# Agent: check what the live server currently reports
curl -s http://192.168.0.200:8506/health

# Agent: check what this codebase would produce
bash version-hash.sh

If the two hashes match, the container is running this exact code. If they differ, the container is on a different version. Report both hashes to the user — they can then confirm or investigate.
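
The comparison can also be scripted. This sketch assumes only the "version" field shown in the health responses above; extract_version and restart_report are hypothetical helpers, not part of the repo.

```python
import json


def extract_version(health_body):
    """Pull the "version" field out of a /health JSON response body."""
    return json.loads(health_body).get("version")


def restart_report(health_body, local_hash):
    """Summarize both hashes so they can be reported to the user."""
    live_hash = extract_version(health_body)
    status = "match" if live_hash == local_hash else "differ"
    return f"live={live_hash} local={local_hash} ({status})"
```

The health body could be fetched with urllib.request.urlopen("http://192.168.0.200:8506/health").read(), and the local hash taken from the output of version-hash.sh.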

Shell script: ./version-hash.sh in the repo root computes the same hash the server would use for the current codebase. Run it locally to predict what hash a freshly-built container would report:

```
$ bash version-hash.sh
624993d5f
```
If the script and the live server return different hashes, the live container is running different code than this checkout.

Design rationale: A content hash beats a git commit hash for this purpose because:

  • .git/ is excluded from Docker images — git-based hashes always return "unknown" in containers
  • A content hash changes on any file edit without requiring a git commit first
  • It is perfectly reproducible — same files always produce the same hash, on any machine
  • No build step, no version file to maintain, no CI pipeline dependency

Timestamp Contract

  • payload_ts SQL column (VIRTUAL GENERATED) is the ONLY way to filter by event time. Use WHERE payload_ts >= ? in SQL. Never parse JSON timestamps in Python for time ranges.
  • payload.timestamp in JSON is guaranteed YYYY-MM-DDTHH:MM:SS+00:00 at write time (enforced by sanitize_cluster_payload()).
  • updated_at in the DB = row modification time, NOT event time. Never use for time-range queries.
  • This repo is a dev machine copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data.
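
The difference between payload_ts and updated_at can be seen in a small in-memory example. The column definition below is an assumption about how a virtual generated column over payload.timestamp might be declared (it needs SQLite 3.31 or newer with the JSON functions available); the actual DDL is in news_mcp/storage/sqlite_store.py.

```python
import json
import sqlite3


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed definition: payload_ts is derived from the JSON payload, not stored.
    conn.execute(
        """
        CREATE TABLE clusters (
            cluster_id TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            updated_at TEXT NOT NULL,
            payload_ts TEXT GENERATED ALWAYS AS
                (json_extract(payload, '$.timestamp')) VIRTUAL
        )
        """
    )
    # Both rows were upserted on the same day, but the events are weeks apart.
    rows = [
        ("old", {"timestamp": "2026-04-01T08:00:00+00:00"}, "2026-05-10T00:00:00+00:00"),
        ("new", {"timestamp": "2026-05-09T08:00:00+00:00"}, "2026-05-10T00:00:00+00:00"),
    ]
    conn.executemany(
        "INSERT INTO clusters (cluster_id, payload, updated_at) VALUES (?, ?, ?)",
        [(cid, json.dumps(payload), updated) for cid, payload, updated in rows],
    )
    cutoff = "2026-05-01T00:00:00+00:00"
    by_event = conn.execute(
        "SELECT cluster_id FROM clusters WHERE payload_ts >= ? ORDER BY cluster_id",
        (cutoff,),
    ).fetchall()
    by_row = conn.execute(
        "SELECT cluster_id FROM clusters WHERE updated_at >= ? ORDER BY cluster_id",
        (cutoff,),
    ).fetchall()
    return [r[0] for r in by_event], [r[0] for r in by_row]
```

Filtering on payload_ts returns only the recent event, while filtering on updated_at wrongly returns both rows because each was rewritten by the same upsert.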

Editing Rules

  • Keep changes aligned with the docs in README.md, PROJECT.md, and OUTLOOK.md.
  • Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
  • Do not run destructive maintenance scripts without a dry run first.
  • If a change touches storage or pruning, verify it against a temp DB or isolated test fixture rather than the live database.
  • When writing infrastructure/MCP code that will run on the live server, think about the Docker context (paths, env vars, mount points).

Known Pitfalls

  • cwd=__file__ in subprocess calls fails — it's a file path, not a directory. Use Path(__file__).resolve().parent or str(Path(__file__).parent).
  • The dev DB at data/news.sqlite was scp-copied from the live server — treat it as a snapshot, not the live DB. Never run destructive scripts against it without explicit direction.
  • Indentation in patched Python files is fragile — always verify with py_compile.compile(path, doraise=True) after editing.
  • updated_at in the DB is row modification time (set to datetime.now() on every upsert), NOT event time. Event time lives in payload.timestamp and is exposed to SQL as the payload_ts column. For time-range queries, always filter with WHERE payload_ts >= ? in SQL; never filter by updated_at, and never parse payload.timestamp in Python.
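
The first and third pitfalls above can be guarded against with two small helpers. This is a sketch using only the standard library; compiles_cleanly and run_next_to are hypothetical names, not functions in the repo.

```python
import py_compile
import subprocess
import sys
import tempfile
from pathlib import Path


def compiles_cleanly(path):
    """Return True if the file byte-compiles, False on a syntax or indentation error."""
    try:
        py_compile.compile(str(path), doraise=True)
        return True
    except py_compile.PyCompileError:
        return False


def run_next_to(script_path, args):
    """Run a command with cwd set to the directory that contains script_path."""
    script_dir = Path(script_path).resolve().parent  # a directory, unlike __file__
    return subprocess.run(args, cwd=script_dir, capture_output=True, text=True, check=False)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp, "good.py")
        good.write_text("x = 1\n")
        bad = Path(tmp, "bad.py")
        bad.write_text("def f():\nreturn 1\n")  # body is not indented
        result = run_next_to(good, [sys.executable, "-c", "import os; print(os.getcwd())"])
        same_dir = Path(result.stdout.strip()).resolve() == Path(tmp).resolve()
        return compiles_cleanly(good), compiles_cleanly(bad), same_dir
```

Running compiles_cleanly on each patched file right after an edit catches indentation damage before the server is restarted.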