Update AGENTS.md and PROJECT.md with two-machine workflow, Docker details, and timestamp normalization notes

Lukas Goldschmidt 1 week ago
parent
commit
e5268b71f8
2 changed files with 52 additions and 0 deletions
  1. AGENTS.md (+37, -0)
  2. PROJECT.md (+15, -0)

+ 37 - 0
AGENTS.md

@@ -1,9 +1,31 @@
 # news-mcp
 
+## Two-Machine Workflow
+
+This project spans two machines. **Always check which machine you're operating on.**
+
+| | Latitude (dev) | ThinkCenter-2 (live) |
+|---|---|---|
+| **Hostname** | latitude | thinkcenter-2 |
+| **IP** | 192.168.0.249 | 192.168.0.200 |
+| **Projects dir** | `/home/lucky/.openclaw/workspace/` | `/home/lucky/` |
+| **This repo** | `/home/lucky/.openclaw/workspace/news-mcp/` | `/home/lucky/news-mcp/` |
+| **DB path** | `news_mcp/data/news.sqlite` (usually empty/stale dev copy) | `/app/data/news.sqlite` inside Docker container (host bind-mount: `news_mcp/data/news.sqlite`) |
+| **Server URL** | `localhost:8506` | `http://192.168.0.200:8506` |
+
+- The terminal prompt **always shows the machine name** (e.g. `lucky@thinkcenter-2`, `lucky@latitude`).
+- Pasted commands include the prompt. **Read it** to identify which machine they were run on.
+- When the user says "the live server", "thinkcenter-2", or "remote", they mean 192.168.0.200.
+- The live server runs in Docker (`docker-compose up -d news-mcp`).
+- ssh into live: `ssh lucky@192.168.0.200`
+- The live DB is at `/app/data/news.sqlite` in the container, which bind-mounts from `news_mcp/data/news.sqlite` on the host FS (relative to repo root).
+- **Do NOT run maintenance/backfill scripts against the dev DB** — it's empty/stale. Either point explicitly to the live DB or tell the user to run it.
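The machine check described above could also be enforced in code before a maintenance script touches a database. The following is a minimal sketch, not part of the commit: the hostnames come from the table above, while the helper names `current_machine` and `require_live` are hypothetical.

```python
import socket

# Hostnames taken from the Two-Machine Workflow table.
LIVE_HOST = "thinkcenter-2"
DEV_HOST = "latitude"


def current_machine(hostname=None):
    """Return 'live', 'dev', or 'unknown' for the given (or local) hostname."""
    # Drop any domain suffix and compare case-insensitively.
    name = (hostname or socket.gethostname()).split(".")[0].lower()
    if name == LIVE_HOST:
        return "live"
    if name == DEV_HOST:
        return "dev"
    return "unknown"


def require_live(hostname=None):
    """Raise instead of silently operating on the empty/stale dev DB."""
    machine = current_machine(hostname)
    if machine != "live":
        raise RuntimeError(
            f"refusing to run: this machine is {machine!r}, not the live server"
        )
```

Inside a Docker container the hostname is typically the container ID, so this check would report `unknown` there; it is meant for scripts started from the host shell.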
+
 ## Local Environment
 - Source the repo-local `.venv` first when it exists.
 - Prefer `./tests.sh` for offline verification and `./live_tests.sh` only for provider-backed smoke checks.
 - Use `./run.sh` to start the server locally; it resolves the repo root and prefers the local Uvicorn binary.
+- The local dev copy has its own separate DB — treat it as empty/stale unless explicitly working with it.
 
 ## Repo Map
 - `news_mcp/mcp_server_fastmcp.py`: MCP tool surface, startup refresh, pruning, and HTTP health endpoints.
@@ -14,15 +36,30 @@
 - `news_mcp/trends_resolution.py` and `news_mcp/related_entities.py`: local Google Trends-based entity resolution and neighborhood lookup.
 - `news_mcp/config.py`: env-driven defaults and file paths.
 
+## Docker / Live Server Details
+- `docker-compose.yml` mounts `./:/app` with `working_dir: /app`
+- The data dir and DB path are both hardcoded in the `docker-compose.yml` environment: `NEWS_MCP_DB_PATH: ./data/news.sqlite`
+- Target DB on live server: `/app/data/news.sqlite` in container → `news_mcp/data/news.sqlite` on host
+- Backfill script: `scripts/normalize_cluster_timestamps.py` — always run with explicit `--db` or set `NEWS_MCP_DB_PATH`
+- The dev DB at `news_mcp/data/news.sqlite` is a separate empty file — never confuse it with the live DB
+
 ## Current Contracts
 - Clusters are the unit of truth, not raw articles.
 - `NEWS_DEFAULT_LOOKBACK_HOURS` controls read freshness only.
 - `NEWS_PRUNING_ENABLED`, `NEWS_RETENTION_DAYS`, and `NEWS_PRUNE_INTERVAL_HOURS` control physical deletion.
 - Entity aliasing is intentionally conservative; keep `config/entity_aliases.json` tight.
 - `include_articles=true` should keep responses compact and only return minimal article fields.
+- Timestamps in cluster payloads are normalized to ISO 8601 UTC (`YYYY-MM-DDTHH:MM:SS+00:00`) at write time in `sanitize_cluster_payload()`.
 
 ## Editing Rules
 - Keep changes aligned with the docs in `README.md`, `PROJECT.md`, and `OUTLOOK.md`.
 - Prefer narrow fixes over contract changes unless the user explicitly asks to expand behavior.
 - Do not run destructive maintenance scripts without a dry run first.
 - If a change touches storage or pruning, verify it against a temp DB or isolated test fixture rather than the live database.
+- When writing infrastructure/MCP code that will run on the live server, think about the Docker context (paths, env vars, mount points).
+
+## Known Pitfalls
+- Passing `cwd=__file__` to subprocess calls fails because `__file__` is a file path, not a directory. Use `Path(__file__).resolve().parent` or `str(Path(__file__).parent)`.
+- DB file at `news_mcp/data/news.sqlite` in the dev repo is empty (4096 bytes, no tables). The real data lives on the live server in Docker.
+- Indentation in patched Python files is fragile — always verify with `py_compile.compile(path, doraise=True)` after editing.
+- `updated_at` in the DB is row modification time (set to `datetime.now()` on every upsert), NOT event time. Event time lives in `payload.timestamp`. Always filter by `payload.timestamp` in Python, never by `updated_at` in SQL, for time-range queries.
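Two of the pitfalls above (the `cwd` argument and fragile indentation) can be illustrated with a short sketch. The helper names `script_dir`, `run_in_script_dir`, and `compiles` are hypothetical and do not appear in the repo; only `subprocess.run`, `pathlib.Path`, and `py_compile.compile` are standard-library calls.

```python
import py_compile
import subprocess
import sys
from pathlib import Path


def script_dir(file_path):
    """Directory containing file_path; a valid cwd, unlike the file path itself."""
    return Path(file_path).resolve().parent


def run_in_script_dir(file_path, args):
    """Run a command with cwd set to the directory of file_path."""
    return subprocess.run(
        args,
        cwd=script_dir(file_path),
        capture_output=True,
        text=True,
        check=True,
    )


def compiles(path):
    """Return True if the Python file at path byte-compiles without errors."""
    try:
        py_compile.compile(str(path), doraise=True)
    except py_compile.PyCompileError:
        # Raised for syntax errors, including broken indentation.
        return False
    return True
```

From inside a module this would be called as `run_in_script_dir(__file__, [sys.executable, "-V"])`, and `compiles(path)` can be run on each patched file right after editing.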

+ 15 - 0
PROJECT.md

@@ -135,3 +135,18 @@ news-mcp/
 - Entity detail view is functional but minimal
 - No alerting/threshold notifications (Phase 2)
 - No server-sent events for real-time dashboard updates
+
+## Timestamp Normalization (May 2026)
+
+### Problem
+Cluster payloads stored timestamps as raw RSS strings (RFC 2822-style dates such as `"Sat, 30 May 2026 02:00:12 +00:00"`). Every read path needed fragile format guessing, and SQL time-range queries on `updated_at` (row modification time, not event time) returned wrong data.
+
+### Fix
+- `_normalize_ts()` helper in `sqlite_store.py`: parses ISO 8601, RFC 2822/HTTP-date, epoch seconds → uniform `YYYY-MM-DDTHH:MM:SS+00:00`
+- `sanitize_cluster_payload()` now normalizes `timestamp`, `first_seen`, `last_updated`, and all `article[].timestamp` before writing to DB
+- `merge_cluster_embeddings.py`: same normalization on merged payloads
+- `scripts/normalize_cluster_timestamps.py`: backfill script for existing rows (run on live server with correct `--db` path)
+- `get_sentiment_series()` and `get_entity_frequencies()`: filter by `payload.timestamp` in Python, not `updated_at` in SQL
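The actual `_normalize_ts()` implementation is not shown in this diff. As a rough illustration of the behavior described above, a helper along these lines would accept the three listed input formats and emit the uniform UTC form. This is a sketch under assumptions: the name `normalize_ts`, the choice to treat naive datetimes as UTC, and the `None` return for unparseable input are illustrative, not taken from `sqlite_store.py`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _from_epoch(value):
    """Parse epoch seconds given as a number or numeric string."""
    try:
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return None


def _from_iso(text):
    """Parse ISO 8601; older Python versions reject a trailing 'Z'."""
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def _from_rfc2822(text):
    """Parse RFC 2822 style dates as commonly found in RSS feeds."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None


def normalize_ts(value):
    """Return 'YYYY-MM-DDTHH:MM:SS+00:00', or None if value is unparseable."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        dt = _from_epoch(value)
    else:
        text = str(value).strip()
        dt = _from_epoch(text) or _from_iso(text) or _from_rfc2822(text)
    if dt is None:
        return None
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).replace(microsecond=0).isoformat()
```

Trying epoch parsing first means a purely numeric string is always read as seconds since 1970, which may not suit feeds that emit compact dates such as `20260530`.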
+
+### Key invariant
+`updated_at` in the DB = row modification time (set to `datetime.now()` on every upsert). For time-range queries, always use `payload.timestamp` parsed from the JSON.
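To make the invariant concrete, the sketch below filters rows by the event time stored in the JSON payload rather than by `updated_at`. The table layout (`clusters` with a `payload` text column) and the function name `clusters_in_range` are assumptions for illustration; the real schema and read paths are not shown in this diff.

```python
import json
import sqlite3
from datetime import datetime, timezone


def clusters_in_range(conn, start, end):
    """Return payloads whose payload.timestamp falls in [start, end).

    start and end must be timezone-aware datetimes. Rows without a
    parseable ISO 8601 timestamp are skipped.
    """
    selected = []
    # Hypothetical schema: one JSON payload per row in a 'clusters' table.
    for (raw,) in conn.execute("SELECT payload FROM clusters"):
        payload = json.loads(raw)
        ts = payload.get("timestamp")
        if not ts:
            continue
        try:
            event_time = datetime.fromisoformat(ts)
        except ValueError:
            continue
        if event_time.tzinfo is None:
            # Assumption: un-normalized naive values are UTC.
            event_time = event_time.replace(tzinfo=timezone.utc)
        if start <= event_time < end:
            selected.append(payload)
    return selected
```

A row that was upserted today but describes an event from weeks ago is excluded here, whereas a SQL filter on `updated_at` would include it.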