# news-mcp

## Two-Machine Workflow

This project spans two machines. **Always check which machine you're operating on.**

| | Latitude (dev) | ThinkCenter-2 (live) |
|---|---|---|
| **Hostname** | latitude | thinkcenter-2 |
| **IP** | 192.168.0.249 | 192.168.0.200 |
| **Projects dir** | `/home/lucky/.openclaw/workspace/` | `/home/lucky/` |
| **This repo** | `/home/lucky/.openclaw/workspace/news-mcp/` | `/home/lucky/news-mcp/` |
| **DB path** | `data/news.sqlite` (host-side: repo-root `data/`) | `/app/data/news.sqlite` inside Docker container (host bind-mount: `data/news.sqlite`) |
| **Server URL** | `localhost:8506` | `http://192.168.0.200:8506` |

- The terminal prompt **always shows the machine name** (e.g. `lucky@thinkcenter-2`, `lucky@latitude`).
- When commands are pasted they include the prompt; **read it** to know which machine.
- When the user says "the live server", "thinkcenter-2", or "remote", they mean 192.168.0.200.
- The live server runs in Docker (`docker-compose up -d news-mcp`).
- ssh into live: `ssh lucky@192.168.0.200`
- The live DB is at `/app/data/news.sqlite` in the container, which bind-mounts from `data/news.sqlite` on the host FS (relative to repo root).
- **Local and Docker now use the same default path** (`./data/news.sqlite`), so `run.sh` and `docker-compose up` share the same database. The `AGENTS.md` section below ("Docker/DB path oddity") no longer applies on this machine.
- **Do NOT run maintenance/backfill scripts against the dev DB**: it's empty/stale. Either point explicitly to the live DB or tell the user to run it.

## Local Environment

- Source the repo-local `.venv` first when it exists.
- Prefer `./tests.sh` for offline verification and `./live_tests.sh` only for provider-backed smoke checks.
- Use `./run.sh` to start the server locally; it resolves the repo root and prefers the local Uvicorn binary.
- The local `data/` directory contains the same DB the Docker container uses: `run.sh` and `docker-compose up` converge on `./data/news.sqlite`.

## Repo Map

- `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, HTTP health endpoints, REST API.
- `news_mcp/jobs/poller.py`: feed refresh loop, clustering, enrichment, and cache writes.
- `news_mcp/storage/sqlite_store.py`: SQLite schema (payload_ts, junction tables), upsert with junction population, SQL-level read methods. **Single data access layer for MCP tools.**
- `news_mcp/dashboard/dashboard_store.py`: Read-only query layer for dashboard REST API. Wraps `SQLiteClusterStore`. Added junction-table entity/keyword search. **NOTE: this store duplicates methods from sqlite_store; see Design Flaw in PROJECT.md.**
- `news_mcp/dedup/cluster.py`: topic bucketing and fuzzy/embedding clustering.
- `news_mcp/enrichment/llm_enrich.py`: LLM extraction/summarization and blacklist filtering.
- `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: entity resolution and neighborhood lookup.
- `news_mcp/config.py`: env-driven defaults and file paths.

## Query Architecture (READ THIS BEFORE ADDING NEW QUERIES)

**Time filtering:** Always use `payload_ts >= ?` SQL filter. Never parse JSON timestamps in Python for time ranges.

**Entity/keyword search:** Use junction tables:

- `cluster_entities` for entity search: `JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id WHERE ce.entity = ?`
- `cluster_keywords` for keyword search: `JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id WHERE ck.keyword = ?`
- Do NOT fetch all clusters and filter entities in Python.

**Backfill:** After schema changes, run `scripts/backfill_junction_tables.py` in the Docker container:

```
docker exec -it news-mcp python3 scripts/backfill_junction_tables.py
```

## Design Flaw: Two Stores

`SQLiteClusterStore` and `DashboardStore` are parallel copies.
Only `DashboardStore` was updated with junction-table entity search. MCP tools (`get_events_for_entity`, `get_news_sentiment`) still use `SQLiteClusterStore` Python-side entity matching with a row limit (top 200), so they miss entities in older clusters. See PROJECT.md for full analysis and proposed fix.

## Docker / Live Server Details

- `docker-compose.yml` mounts `./:/app` with `working_dir: /app`
- Data dir and DB path are both hardcoded in the docker-compose env: `NEWS_MCP_DB_PATH: ./data/news.sqlite`
- Target DB on live server: `/app/data/news.sqlite` in container → `data/news.sqlite` on host
- Backfill script: `scripts/normalize_cluster_timestamps.py`; always run with explicit `--db` or set `NEWS_MCP_DB_PATH` (default now matches `./data/news.sqlite`)

## Current Contracts

- Clusters are the unit of truth, not raw articles.
- `NEWS_DEFAULT_LOOKBACK_HOURS` controls read freshness only.
- `NEWS_PRUNING_ENABLED`, `NEWS_RETENTION_DAYS`, and `NEWS_PRUNE_INTERVAL_HOURS` control physical deletion.
- Entity aliasing is intentionally conservative; keep `config/entity_aliases.json` tight.
- `include_articles=true` should keep responses compact and only return minimal article fields.
- Timestamps in cluster payloads are normalized to ISO 8601 UTC (`YYYY-MM-DDTHH:MM:SS+00:00`) at write time in `sanitize_cluster_payload()`.
- **Single data directory**: default `DATA_DIR` is repo-root `./data/`, used by both `run.sh` and `docker-compose up`. No env override needed.

## Version Hash

Every running server exposes a deterministic content hash via health endpoints. Use it to prove to an agent that the live container was restarted with new code.

**How it works:** At import time the server walks all `.py` files under `news_mcp/`, sorts them by path, and computes SHA-256 of their concatenated contents, taking the first 9 hex characters. No git dependency, so it works identically in Docker and native runs.

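
The hashing steps described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the function name `compute_version_hash` is hypothetical, and sorting by relative POSIX path string and hashing raw bytes are assumptions, so the output may not match what `/health` or `version-hash.sh` reports.

```python
import hashlib
from pathlib import Path


def compute_version_hash(package_dir: Path, length: int = 9) -> str:
    """Hash the concatenated contents of all .py files under package_dir.

    Hypothetical helper: the sort key and byte-level reading are assumptions.
    """
    # Sort by relative POSIX path so the order is stable across machines.
    files = sorted(
        package_dir.rglob("*.py"),
        key=lambda p: p.relative_to(package_dir).as_posix(),
    )
    digest = hashlib.sha256()
    for path in files:
        # Feeding chunks to update() is equivalent to hashing the concatenation.
        digest.update(path.read_bytes())
    # Keep only the first `length` hex characters of the SHA-256 digest.
    return digest.hexdigest()[:length]


if __name__ == "__main__":
    # Assumes the script is run from the repo root, next to news_mcp/.
    print(compute_version_hash(Path("news_mcp")))
```

Because only file contents and their order feed the digest, editing any `.py` file changes the result without needing a commit.
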
**Where to find it:**

- `GET /health` → `{"status":"ok","uptime":...,"version":"624993d5f"}`
- `GET /api/v1/health` → `{...,"version":"624993d5f",...}`

**Workflow to verify a restart:**

When you tell an agent "I restarted the server" and they're unsure, have them curl the live server and compare the hash against `./version-hash.sh`:

```
# Agent: check what the live server currently reports
curl -s http://192.168.0.200:8506/health

# Agent: check what this codebase would produce
bash version-hash.sh
```

If the two hashes match, the container is running this exact code. If they differ, the container is on a different version. Report both hashes to the user; they can then confirm or investigate.

**Shell script:** `./version-hash.sh` in the repo root computes the same hash the server would use for the current codebase. Run it locally to predict what hash a freshly-built container would report:

```
$ bash version-hash.sh
624993d5f
```

If the script and the live server return different hashes, the live container is running different code than this checkout.

**Design rationale:** A content hash beats a git commit hash for this purpose because:

- `.git/` is excluded from Docker images, so git-based hashes always return `"unknown"` in containers
- A content hash changes on any file edit without requiring a git commit first
- It is perfectly reproducible: the same files always produce the same hash, on any machine
- No build step, no version file to maintain, no CI pipeline dependency

## Timestamp Contract

- `payload_ts` SQL column (VIRTUAL GENERATED) is the ONLY way to filter by event time. Use `WHERE payload_ts >= ?` in SQL. Never parse JSON timestamps in Python for time ranges.
- `payload.timestamp` in JSON is guaranteed `YYYY-MM-DDTHH:MM:SS+00:00` at write time (enforced by `sanitize_cluster_payload()`).
- `updated_at` in the DB = row modification time, NOT event time. Never use for time-range queries.
- This repo is a **dev machine** copy.
The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data.

## Editing Rules

- Keep changes aligned with the docs in `README.md`, `PROJECT.md`, and `OUTLOOK.md`.
- Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
- Do not run destructive maintenance scripts without a dry run first.
- If a change touches storage or pruning, verify it against a temp DB or isolated test fixture rather than the live database.
- When writing infrastructure/MCP code that will run on the live server, think about the Docker context (paths, env vars, mount points).

## Known Pitfalls

- `cwd=__file__` in subprocess calls fails because it's a file path, not a directory. Use `Path(__file__).resolve().parent` or `str(Path(__file__).parent)`.
- The dev DB at `data/news.sqlite` was scp-copied from the live server; treat it as a snapshot, not the live DB. Never run destructive scripts against it without explicit direction.
- Indentation in patched Python files is fragile; always verify with `py_compile.compile(path, doraise=True)` after editing.
- `updated_at` in the DB is row modification time (set to `datetime.now()` on every upsert), NOT event time. Event time lives in `payload.timestamp` and is exposed to SQL as the `payload_ts` column. For time-range queries, always filter by `payload_ts` in SQL; never filter by `updated_at`, and never parse `payload.timestamp` in Python.
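
To make the Timestamp Contract and the junction-table rule concrete, here is a small self-contained sketch against an in-memory SQLite database. The table name `clusters`, the column types, and the helper names `build_demo_db` and `clusters_for_entity` are assumptions for illustration; the real schema lives in `news_mcp/storage/sqlite_store.py` and may differ. It also assumes a SQLite build with generated columns (3.31+) and the JSON functions available.

```python
import sqlite3

# Hypothetical minimal schema; only payload_ts, cluster_entities, cluster_id,
# entity, and updated_at are names taken from this document.
SCHEMA = """
CREATE TABLE clusters (
    cluster_id TEXT PRIMARY KEY,
    payload    TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    payload_ts TEXT GENERATED ALWAYS AS (json_extract(payload, '$.timestamp')) VIRTUAL
);
CREATE TABLE cluster_entities (
    cluster_id TEXT NOT NULL,
    entity     TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory DB with three made-up clusters."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO clusters (cluster_id, payload, updated_at) VALUES (?, ?, ?)",
        [
            # Old event whose row was touched recently: updated_at is misleading.
            ("old", '{"timestamp": "2024-01-01T00:00:00+00:00"}', "2024-06-01T00:00:00+00:00"),
            ("new", '{"timestamp": "2024-05-01T12:00:00+00:00"}', "2024-05-01T12:00:00+00:00"),
            ("other", '{"timestamp": "2024-05-02T00:00:00+00:00"}', "2024-05-02T00:00:00+00:00"),
        ],
    )
    conn.executemany(
        "INSERT INTO cluster_entities (cluster_id, entity) VALUES (?, ?)",
        [("old", "ACME"), ("new", "ACME"), ("other", "Globex")],
    )
    return conn


def clusters_for_entity(conn: sqlite3.Connection, entity: str, since_iso: str) -> list:
    """Return cluster ids for an entity, filtered by event time in SQL."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM clusters c
        JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        """,
        (entity, since_iso),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(clusters_for_entity(conn, "ACME", "2024-04-01T00:00:00+00:00"))
```

In this toy data the `old` cluster has a recent `updated_at` but an old event time, so filtering on `payload_ts` excludes it while an `updated_at` filter would wrongly include it. The string comparison is only safe because every timestamp is normalized to the same `+00:00` format at write time.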