news-mcp

Two-Machine Workflow

This project spans two machines. Always check which machine you're operating on.

| | Latitude (dev) | ThinkCenter-2 (live) |
| --- | --- | --- |
| Hostname | latitude | thinkcenter-2 |
| IP | 192.168.0.249 | 192.168.0.200 |
| Projects dir | `/home/lucky/.openclaw/workspace/` | `/home/lucky/` |
| This repo | `/home/lucky/.openclaw/workspace/news-mcp/` | `/home/lucky/news-mcp/` |
| DB path | `data/news.sqlite` (host-side: repo-root `data/`) | `/app/data/news.sqlite` inside Docker container (host bind-mount: `data/news.sqlite`) |
| Server URL | `http://localhost:8506` | `http://192.168.0.200:8506` |

The terminal prompt always shows the machine name (e.g. lucky@thinkcenter-2, lucky@latitude).
When commands are pasted they include the prompt — read it to know which machine.
When the user says "the live server", "thinkcenter-2", or "remote", they mean 192.168.0.200.
The live server runs in Docker (docker-compose up -d news-mcp).
To SSH into the live server: ssh lucky@192.168.0.200
The live DB is at /app/data/news.sqlite in the container, which bind-mounts from data/news.sqlite on the host FS (relative to repo root).
Local and Docker now use the same default path — ./data/news.sqlite — so run.sh and docker-compose up share the same database. The AGENTS.md section below ("Docker/DB path oddity") no longer applies on this machine.
Do NOT run maintenance/backfill scripts against the dev DB — it's empty/stale. Either point explicitly to the live DB or tell the user to run it.
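
The machine check above can also be enforced in code before a script touches a database. A minimal sketch, assuming the two hostnames from the table; `machine_role` and `require_live` are hypothetical helpers, not part of this repo:

```python
import socket

# Hostnames come from the table above; the role labels are illustrative.
KNOWN_HOSTS = {"latitude": "dev", "thinkcenter-2": "live"}


def machine_role(hostname=None):
    """Return 'dev', 'live', or 'unknown' for the given (or current) hostname."""
    name = (hostname or socket.gethostname()).strip().lower()
    return KNOWN_HOSTS.get(name, "unknown")


def require_live(hostname=None):
    """Raise instead of silently running a maintenance script on the wrong machine."""
    role = machine_role(hostname)
    if role != "live":
        raise RuntimeError(f"refusing to run: this machine is {role!r}, not 'live'")
```

Inside a container the hostname is typically the container ID, so a check like this belongs in host-side scripts.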

Local Environment

Source the repo-local .venv first when it exists.
Prefer ./tests.sh for offline verification and ./live_tests.sh only for provider-backed smoke checks.
Use ./run.sh to start the server locally; it resolves the repo root and prefers the local Uvicorn binary.
The local data/ directory contains the same DB the Docker container uses — run.sh and docker-compose up converge on ./data/news.sqlite.

Repo Map

news_mcp/mcp_server_fastmcp.py: MCP tool surface, startup refresh, pruning, HTTP health endpoints, REST API.
news_mcp/jobs/poller.py: feed refresh loop, clustering, enrichment, and cache writes.
news_mcp/storage/sqlite_store.py: SQLite schema (payload_ts, junction tables), upsert with junction population, SQL-level read methods. Single data access layer for MCP tools.
news_mcp/dashboard/dashboard_store.py: Read-only query layer for the dashboard REST API. Wraps SQLiteClusterStore and adds junction-table entity/keyword search. NOTE: this store duplicates methods from sqlite_store; see Design Flaw in PROJECT.md.
news_mcp/dedup/cluster.py: topic bucketing and fuzzy/embedding clustering.
news_mcp/enrichment/llm_enrich.py: LLM extraction/summarization and blacklist filtering.
news_mcp/trends_resolution.py and news_mcp/related_entities.py: entity resolution and neighborhood lookup.
news_mcp/config.py: env-driven defaults and file paths.

Query Architecture (READ THIS BEFORE ADDING NEW QUERIES)

Time filtering: Always use payload_ts >= ? SQL filter. Never parse JSON timestamps in Python for time ranges.

Entity/keyword search: Use junction tables:

cluster_entities for entity search: JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id WHERE ce.entity = ?
cluster_keywords for keyword search: JOIN cluster_keywords ck ON c.cluster_id = ck.cluster_id WHERE ck.keyword = ?
Do NOT fetch all clusters and filter entities in Python.
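
The two rules above combine into a single query. A minimal sketch against an in-memory database; the schema is simplified and hypothetical (the real one lives in sqlite_store.py), and `clusters_for_entity` is an illustrative name:

```python
import sqlite3

# Simplified, hypothetical schema. The real schema defines payload_ts as a
# generated column; a plain TEXT column is used here to keep the sketch small.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clusters (cluster_id TEXT PRIMARY KEY, payload_ts TEXT);
    CREATE TABLE cluster_entities (cluster_id TEXT, entity TEXT);
    """
)
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?)",
    [
        ("c1", "2024-05-01T08:00:00+00:00"),
        ("c2", "2024-05-03T08:00:00+00:00"),
        ("c3", "2024-05-04T08:00:00+00:00"),
    ],
)
conn.executemany(
    "INSERT INTO cluster_entities VALUES (?, ?)",
    [("c1", "acme"), ("c2", "acme"), ("c3", "globex")],
)


def clusters_for_entity(conn, entity, since_ts):
    """Entity match and time filter both happen in SQL, not in Python."""
    rows = conn.execute(
        """
        SELECT c.cluster_id
        FROM clusters c
        JOIN cluster_entities ce ON c.cluster_id = ce.cluster_id
        WHERE ce.entity = ? AND c.payload_ts >= ?
        ORDER BY c.payload_ts DESC
        """,
        (entity, since_ts),
    ).fetchall()
    return [r[0] for r in rows]


result = clusters_for_entity(conn, "acme", "2024-05-02T00:00:00+00:00")
# result == ["c2"]: c1 is too old, c3 mentions a different entity
```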

Backfill: After schema changes, run scripts/backfill_junction_tables.py in the Docker container:

docker exec -it news-mcp python3 scripts/backfill_junction_tables.py

Design Flaw: Two Stores

SQLiteClusterStore and DashboardStore are parallel copies. Only DashboardStore was updated with junction-table entity search. The MCP tools (get_events_for_entity, get_news_sentiment) still go through SQLiteClusterStore, which matches entities in Python over a row-limited result set (top 200) and therefore misses entities that appear only in older clusters. See PROJECT.md for the full analysis and proposed fix.
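
Why the row limit loses data can be reproduced in a few lines. This is a sketch under assumptions: the table layout and the `entities` payload key are hypothetical, and both functions are illustrative stand-ins rather than the stores' actual methods:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY, payload_ts TEXT, payload TEXT);
    CREATE TABLE cluster_entities (cluster_id INTEGER, entity TEXT);
    """
)

# 250 clusters, one per minute; only the oldest (id 0) mentions "acme".
for i in range(250):
    entities = ["acme"] if i == 0 else ["other"]
    ts = f"2024-05-01T{i // 60:02d}:{i % 60:02d}:00+00:00"
    conn.execute(
        "INSERT INTO clusters VALUES (?, ?, ?)",
        (i, ts, json.dumps({"entities": entities})),
    )
    conn.executemany(
        "INSERT INTO cluster_entities VALUES (?, ?)", [(i, e) for e in entities]
    )


def python_side_match(conn, entity, limit=200):
    """Fetch the newest `limit` rows, then filter in Python (the pattern to avoid)."""
    rows = conn.execute(
        "SELECT cluster_id, payload FROM clusters ORDER BY payload_ts DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [cid for cid, payload in rows if entity in json.loads(payload)["entities"]]


def junction_match(conn, entity):
    """Look the entity up through the junction table; no row limit is involved."""
    rows = conn.execute(
        "SELECT cluster_id FROM cluster_entities WHERE entity = ?", (entity,)
    ).fetchall()
    return [r[0] for r in rows]
```

With this data, `python_side_match(conn, "acme")` returns an empty list because cluster 0 falls outside the newest 200 rows, while `junction_match(conn, "acme")` finds it.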

Docker / Live Server Details

docker-compose.yml mounts ./:/app with working_dir: /app
Data dir and DB path both hardcoded in docker-compose env: NEWS_MCP_DB_PATH: ./data/news.sqlite
Target DB on live server: /app/data/news.sqlite in container → data/news.sqlite on host
Backfill script: scripts/normalize_cluster_timestamps.py — always run with explicit --db or set NEWS_MCP_DB_PATH (default now matches ./data/news.sqlite)
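
How one relative setting maps to two absolute locations can be sketched as follows. `resolve_db_path` is a hypothetical helper for illustration; the actual resolution logic in news_mcp/config.py may differ:

```python
import os
from pathlib import PurePosixPath

DEFAULT_DB_PATH = "./data/news.sqlite"  # default named in this document


def resolve_db_path(repo_root, env=None):
    """Resolve NEWS_MCP_DB_PATH against the repo root (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("NEWS_MCP_DB_PATH", DEFAULT_DB_PATH)
    # Joining onto an absolute override simply returns the override.
    return PurePosixPath(repo_root) / raw


# Same relative default, different roots: working_dir /app in the container,
# the repo checkout on the host.
in_container = resolve_db_path("/app", env={})
on_host = resolve_db_path("/home/lucky/news-mcp", env={})
```

Here `in_container` is `/app/data/news.sqlite` and `on_host` is `/home/lucky/news-mcp/data/news.sqlite`, which are the same file because of the `./:/app` mount.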

Current Contracts

Clusters are the unit of truth, not raw articles.
NEWS_DEFAULT_LOOKBACK_HOURS controls read freshness only.
NEWS_PRUNING_ENABLED, NEWS_RETENTION_DAYS, and NEWS_PRUNE_INTERVAL_HOURS control physical deletion.
Entity aliasing is intentionally conservative; keep config/entity_aliases.json tight.
include_articles=true should keep responses compact and only return minimal article fields.
Timestamps in cluster payloads are normalized to ISO 8601 UTC (YYYY-MM-DDTHH:MM:SS+00:00) at write time in sanitize_cluster_payload().
Single data directory: default DATA_DIR is repo-root ./data/ — used by both run.sh and docker-compose up. No env override needed.
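
The difference between read freshness and physical deletion comes down to two separate cutoffs. A minimal sketch; the function names are hypothetical, and the values 24 hours and 30 days are illustrative, not the project's defaults:

```python
from datetime import datetime, timedelta, timezone


def read_cutoff(now, lookback_hours):
    """Oldest event time a default read still returns; nothing is deleted."""
    return (now - timedelta(hours=lookback_hours)).isoformat()


def prune_cutoff(now, retention_days):
    """Rows with an event time older than this are candidates for deletion."""
    return (now - timedelta(days=retention_days)).isoformat()


now = datetime(2024, 5, 10, 12, 0, 0, tzinfo=timezone.utc)
# Illustrative values standing in for NEWS_DEFAULT_LOOKBACK_HOURS and
# NEWS_RETENTION_DAYS.
reads_from = read_cutoff(now, lookback_hours=24)
deletes_before = prune_cutoff(now, retention_days=30)
```

A cluster dated between the two cutoffs is hidden from default reads but still present in the database.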

Timestamp Contract

payload_ts SQL column (VIRTUAL GENERATED) is the ONLY way to filter by event time. Use WHERE payload_ts >= ? in SQL. Never parse JSON timestamps in Python for time ranges.
payload.timestamp in JSON is guaranteed YYYY-MM-DDTHH:MM:SS+00:00 at write time (enforced by sanitize_cluster_payload()).
updated_at in the DB = row modification time, NOT event time. Never use for time-range queries.
This repo is a dev machine copy. The live server is on thinkcenter-2 (192.168.0.200). Never query the dev DB to verify live data.
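
The write-time format guarantee can be illustrated with a small normalizer. This is a hypothetical stand-in for the timestamp step of sanitize_cluster_payload(), not its actual code; treating naive timestamps as UTC is an assumption:

```python
from datetime import datetime, timezone


def normalize_ts(value):
    """Return value as YYYY-MM-DDTHH:MM:SS+00:00 (illustrative helper)."""
    # Older Python versions reject a trailing "Z", so rewrite it first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Because every stored value shares one fixed-width format and one offset, plain string comparison in `payload_ts >= ?` agrees with chronological order.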

Editing Rules

Keep changes aligned with the docs in README.md, PROJECT.md, and OUTLOOK.md.
Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
Do not run destructive maintenance scripts without a dry run first.
If a change touches storage or pruning, verify it against a temp DB or isolated test fixture rather than the live database.
When writing infrastructure/MCP code that will run on the live server, think about the Docker context (paths, env vars, mount points).

Known Pitfalls

cwd=__file__ in subprocess calls fails — it's a file path, not a directory. Use Path(__file__).resolve().parent or str(Path(__file__).parent).
The dev DB at data/news.sqlite was scp-copied from the live server — treat it as a snapshot, not the live DB. Never run destructive scripts against it without explicit direction.
Indentation in patched Python files is fragile — always verify with py_compile.compile(path, doraise=True) after editing.
updated_at in the DB is row modification time (set to datetime.now() on every upsert), NOT event time. Event time lives in payload.timestamp. For time-range queries, filter on the payload_ts column in SQL (WHERE payload_ts >= ?), never on updated_at, and do not parse payload.timestamp in Python.
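
The cwd and py_compile pitfalls above can be handled with a few small helpers. A minimal sketch; the helper names are hypothetical:

```python
import py_compile
import subprocess
from pathlib import Path


def script_dir(file_path):
    """Directory containing file_path; pass __file__ from the calling module."""
    return Path(file_path).resolve().parent


def compiles_cleanly(path):
    """True if the file byte-compiles, False on syntax or indentation errors."""
    try:
        py_compile.compile(str(path), doraise=True)
        return True
    except py_compile.PyCompileError:
        return False


def run_in_script_dir(file_path, args):
    """Run a command with cwd set to the script's directory, not the file itself."""
    return subprocess.run(
        args, cwd=script_dir(file_path), capture_output=True, text=True
    )
```

Call `compiles_cleanly(path)` right after patching a file, and use `run_in_script_dir(__file__, [...])` wherever a subprocess needs the module's directory as its working directory.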
