
news-mcp

Two-Machine Workflow

This project spans two machines. Always check which machine you're operating on.

| | Latitude (dev) | ThinkCenter-2 (live) |
|---|---|---|
| Hostname | latitude | thinkcenter-2 |
| IP | 192.168.0.249 | 192.168.0.200 |
| Projects dir | /home/lucky/.openclaw/workspace/ | /home/lucky/ |
| This repo | /home/lucky/.openclaw/workspace/news-mcp/ | /home/lucky/news-mcp/ |
| DB path | news_mcp/data/news.sqlite (usually empty/stale dev copy) | /app/data/news.sqlite inside Docker container (host bind-mount: news_mcp/data/news.sqlite) |
| Server URL | localhost:8506 | http://192.168.0.200:8506 |

  • The terminal prompt always shows the machine name (e.g. lucky@thinkcenter-2, lucky@latitude).
  • When commands are pasted they include the prompt — read it to know which machine.
  • When the user says "the live server", "thinkcenter-2", or "remote", they mean 192.168.0.200.
  • The live server runs in Docker (docker-compose up -d news-mcp).
  • To SSH into the live server: ssh lucky@192.168.0.200
  • The live DB is at /app/data/news.sqlite in the container, which bind-mounts from news_mcp/data/news.sqlite on the host FS (relative to repo root).
  • Do NOT run maintenance/backfill scripts against the dev DB — it's empty/stale. Either point explicitly to the live DB or tell the user to run it.
  • Docker/DB path oddity: docker-compose sets NEWS_MCP_DB_PATH=./data/news.sqlite (relative to working_dir=/app), so the container's DB is at /app/data/news.sqlite. The docker-compose env vars apply only inside the container, and the host .env does NOT set this variable. Running the same script on the host therefore resolves DB_PATH to the config default (news_mcp/data/news.sqlite), which is a different, usually empty file.
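
The path oddity above can be reproduced in isolation. The following is a minimal sketch, not the project's actual config code: resolve_db_path and DEFAULT_DB_PATH are hypothetical names, and the assumption is that a relative path is resolved against the process working directory.

```python
from pathlib import PurePosixPath

# Assumption: mirrors the config default named in this document. The real
# lookup lives in news_mcp/config.py and may differ in detail.
DEFAULT_DB_PATH = "news_mcp/data/news.sqlite"


def resolve_db_path(env: dict, cwd: str) -> PurePosixPath:
    """Resolve the DB path for a given environment and working directory."""
    raw = env.get("NEWS_MCP_DB_PATH", DEFAULT_DB_PATH)
    path = PurePosixPath(raw)
    # A relative path is anchored at the process working directory.
    return path if path.is_absolute() else PurePosixPath(cwd) / path


# Inside the container: env var set by docker-compose, working_dir=/app.
in_container = resolve_db_path({"NEWS_MCP_DB_PATH": "./data/news.sqlite"}, "/app")
# On the host: variable unset, script run from the repo root.
on_host = resolve_db_path({}, "/home/lucky/news-mcp")

print(in_container)  # /app/data/news.sqlite
print(on_host)       # /home/lucky/news-mcp/news_mcp/data/news.sqlite
```

The two results point at different files, which is why a script that works in the container can silently hit the empty copy when run on the host.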

Local Environment

  • Source the repo-local .venv first when it exists.
  • Prefer ./tests.sh for offline verification and ./live_tests.sh only for provider-backed smoke checks.
  • Use ./run.sh to start the server locally; it resolves the repo root and prefers the local Uvicorn binary.
  • The local dev copy has its own separate DB — treat it as empty/stale unless explicitly working with it.

Repo Map

  • news_mcp/mcp_server_fastmcp.py: MCP tool surface, startup refresh, pruning, and HTTP health endpoints.
  • news_mcp/jobs/poller.py: feed refresh loop, clustering, enrichment, and cache writes.
  • news_mcp/storage/sqlite_store.py: SQLite schema, cluster/entity metadata, feed hashes, and prune state.
  • news_mcp/dedup/cluster.py: topic bucketing and the current fuzzy/embedding clustering path.
  • news_mcp/enrichment/llm_enrich.py: LLM extraction/summarization and blacklist filtering.
  • news_mcp/trends_resolution.py and news_mcp/related_entities.py: local Google Trends-based entity resolution and neighborhood lookup.
  • news_mcp/config.py: env-driven defaults and file paths.

Docker / Live Server Details

  • docker-compose.yml mounts ./:/app with working_dir: /app
  • The data dir and DB path are both hardcoded in the docker-compose env; the DB path is set as NEWS_MCP_DB_PATH: ./data/news.sqlite
  • Target DB on live server: /app/data/news.sqlite in container → news_mcp/data/news.sqlite on host
  • Backfill script: scripts/normalize_cluster_timestamps.py. Always run it with an explicit --db, or set NEWS_MCP_DB_PATH first.
  • The dev DB at news_mcp/data/news.sqlite is a separate empty file — never confuse it with the live DB
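
One way to enforce the explicit-path rule in a maintenance script is to refuse to run without a DB path and to check that the target file actually has tables. This is a sketch under assumptions: parse_db_arg and has_tables are hypothetical helpers, not functions that exist in this repo.

```python
import argparse
import os
import sqlite3
from pathlib import Path
from typing import Optional, Sequence


def parse_db_arg(argv: Optional[Sequence[str]] = None) -> str:
    """Return the DB path: --db wins, then NEWS_MCP_DB_PATH, otherwise exit."""
    parser = argparse.ArgumentParser(description="hypothetical maintenance wrapper")
    parser.add_argument("--db", default=os.environ.get("NEWS_MCP_DB_PATH"))
    args = parser.parse_args(argv)
    if not args.db:
        parser.error("pass --db or set NEWS_MCP_DB_PATH; refusing to guess")
    return args.db


def has_tables(db_path: str) -> bool:
    """Return True if the SQLite file contains at least one table."""
    # mode=ro avoids creating a fresh empty file when the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
        ).fetchone()
        return row[0] > 0
    finally:
        conn.close()
```

A script could call has_tables(parse_db_arg()) at startup and abort when it returns False, which would catch the empty dev copy before any backfill runs.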

Current Contracts

  • Clusters are the unit of truth, not raw articles.
  • NEWS_DEFAULT_LOOKBACK_HOURS controls read freshness only.
  • NEWS_PRUNING_ENABLED, NEWS_RETENTION_DAYS, and NEWS_PRUNE_INTERVAL_HOURS control physical deletion.
  • Entity aliasing is intentionally conservative; keep config/entity_aliases.json tight.
  • include_articles=true should keep responses compact and only return minimal article fields.
  • Timestamps in cluster payloads are normalized to ISO 8601 UTC (YYYY-MM-DDTHH:MM:SS+00:00) at write time in sanitize_cluster_payload().

Timestamp Contract (READ THIS BEFORE TOUCHING ANY TIMESTAMP CODE)

  • payload.timestamp, payload.first_seen, and payload.last_updated are guaranteed to be in YYYY-MM-DDTHH:MM:SS+00:00 format for every row written after the normalization migration (the backfill script has been run on the live server).
  • Read paths: use _read_ts() from news_mcp.storage.sqlite_store, or datetime.fromisoformat() directly. That is all that is needed.
  • Never add parsedate_to_datetime / RFC 2822 fallbacks to a read path. If _read_ts returns None on a stored timestamp, the bug is in the write path — fix sanitize_cluster_payload(), don't paper over it.
  • parsedate_to_datetime is intentionally retained only in sqlite_store._normalize_ts() (write path) and dedup/cluster.py (raw ingest before normalization). Nowhere else.
  • Never query the dev DB (news_mcp/data/news.sqlite on latitude) to check live data. It is empty/stale. The live DB is on thinkcenter-2 in Docker at /app/data/news.sqlite.
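
The split between write-path normalization and a strict read path can be illustrated as follows. This is a sketch, not the code in sqlite_store: normalize_ts and read_ts are hypothetical stand-ins for _normalize_ts() and _read_ts(), and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def normalize_ts(raw: str) -> Optional[str]:
    """Write path: coerce ISO 8601 or RFC 2822 input to YYYY-MM-DDTHH:MM:SS+00:00."""
    try:
        dt = datetime.fromisoformat(raw)
    except ValueError:
        try:
            dt = parsedate_to_datetime(raw)  # RFC 2822, e.g. an RSS pubDate
        except (TypeError, ValueError):
            return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()


def read_ts(stored: str) -> Optional[datetime]:
    """Read path: ISO 8601 only, deliberately no RFC 2822 fallback."""
    try:
        return datetime.fromisoformat(stored)
    except (TypeError, ValueError):
        return None


stored = normalize_ts("Tue, 01 Jul 2025 14:30:00 +0200")
print(stored)           # 2025-07-01T12:30:00+00:00
print(read_ts(stored))  # 2025-07-01 12:30:00+00:00
print(read_ts("Tue, 01 Jul 2025 14:30:00 +0200"))  # None
```

The last call returns None on purpose: an unnormalized value reaching the read path signals a write-path bug rather than something the reader should absorb.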

Editing Rules

  • Keep changes aligned with the docs in README.md, PROJECT.md, and OUTLOOK.md.
  • Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
  • Do not run destructive maintenance scripts without a dry run first.
  • If a change touches storage or pruning, verify it against a temp DB or isolated test fixture rather than the live database.
  • When writing infrastructure/MCP code that will run on the live server, think about the Docker context (paths, env vars, mount points).
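
For the temp-DB rule above, one possible approach is to snapshot a database into a throwaway file and run the change there. This is a sketch under assumptions: run_on_scratch_copy is a hypothetical helper, and it relies on the standard sqlite3 backup API rather than anything specific to this repo.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def run_on_scratch_copy(source_db: str, action):
    """Copy source_db into a temp file, run action(conn) there, return its result."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch_path = Path(tmp) / "scratch.sqlite"
        # Open the source read-only so the helper cannot modify it.
        source_uri = Path(source_db).resolve().as_uri() + "?mode=ro"
        with closing(sqlite3.connect(source_uri, uri=True)) as src, \
                closing(sqlite3.connect(scratch_path)) as dst:
            src.backup(dst)        # consistent snapshot of the source
            result = action(dst)   # destructive work happens on the copy only
            dst.commit()
            return result
```

The scratch file is deleted when the function returns, so a pruning or storage change can be exercised end to end without touching the original file.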

Known Pitfalls

  • cwd=__file__ in subprocess calls fails — it's a file path, not a directory. Use Path(__file__).resolve().parent or str(Path(__file__).parent).
  • DB file at news_mcp/data/news.sqlite in the dev repo is empty (4096 bytes, no tables). The real data lives on the live server in Docker.
  • Indentation in patched Python files is fragile — always verify with py_compile.compile(path, doraise=True) after editing.
  • updated_at in the DB is row modification time (set to datetime.now() on every upsert), NOT event time. Event time lives in payload.timestamp. Always filter by payload.timestamp in Python, never by updated_at in SQL, for time-range queries.
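
The updated_at pitfall is easiest to see with a small example. The sketch below uses an in-memory database with a made-up clusters table (id, payload, updated_at); the real schema in sqlite_store.py may differ, and clusters_since is a hypothetical helper.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def clusters_since(conn: sqlite3.Connection, cutoff: datetime) -> list:
    """Return payloads whose event time (payload.timestamp) is at or after cutoff."""
    recent = []
    for (raw,) in conn.execute("SELECT payload FROM clusters"):
        payload = json.loads(raw)
        if datetime.fromisoformat(payload["timestamp"]) >= cutoff:
            recent.append(payload)
    return recent


# Illustrative data: both rows were upserted "now", but one event is a month old.
now = datetime(2025, 7, 1, 12, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (id TEXT PRIMARY KEY, payload TEXT, updated_at TEXT)")
for cid, event_time in [
    ("old-event", "2025-06-01T08:00:00+00:00"),
    ("new-event", "2025-07-01T09:00:00+00:00"),
]:
    payload = json.dumps({"id": cid, "timestamp": event_time})
    conn.execute("INSERT INTO clusters VALUES (?, ?, ?)", (cid, payload, now.isoformat()))

cutoff = now - timedelta(hours=24)
by_event_time = [p["id"] for p in clusters_since(conn, cutoff)]
by_updated_at = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM clusters WHERE updated_at >= ? ORDER BY id", (cutoff.isoformat(),)
    )
]
print(by_event_time)  # ['new-event']
print(by_updated_at)  # ['new-event', 'old-event']
```

Filtering on updated_at lets the month-old event through because the row was touched recently, while filtering on payload.timestamp returns only the event that actually falls in the window.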