prompt enhanced, evaluation

Lukas Goldschmidt 4 days ago
parent
commit
fbb452194b

+ 52 - 0
README.md

@@ -138,3 +138,55 @@ When an article is updated in-place at the same URL (e.g. FT's "More to come..."
 ## Version
 
 See `./version-hash.sh` for the current content hash.
+
+## Prompt Evaluation (extraction quality)
+
+The extraction prompt (`prompts/extract_entities.prompt`) is tested against a curated set of annotated samples to ensure entity/keyword separation quality, especially for smaller models like `llama-3.1-8b-instant`.
+
+### Running the evaluation
+
+```bash
+# Run against default prompt with 30 annotated samples (all 5 topics)
+python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq
+
+# Run with specific prompt file
+python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --prompt-file prompts/extract_entities.prompt
+
+# Run against larger model for comparison
+python scripts/eval_extraction.py --model deepseek/deepseek-v4-flash --provider openrouter
+
+# Verbose per-sample output
+python scripts/eval_extraction.py --model llama-3.1-8b-instant --provider groq --verbose
+
+# Collect new samples from live DB for manual annotation
+python scripts/eval_extraction.py --collect 30 --output new_samples.json
+```
+
+### What it measures
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| Entity F1 | ≥ 0.65 | Precision/recall of named entities (proper nouns) |
+| Keyword F1 | ≥ 0.40 | Precision/recall of thematic keywords (1-2 word tags) |
+| Leakage | 0.0 | Entities appearing in keywords (should never happen) |
+| Topic Accuracy | ≥ 0.80 | Correct topic classification (crypto/macro/regulation/ai/other) |
+
+### Annotated samples
+
+The 30 golden samples in `data/annotated_samples.json` cover all 5 topics:
+- **regulation** (6): SEC lawsuits, OFAC sanctions, House crypto bills, WAMCO settlement, Cuba sanctions, Iran frozen funds
+- **macro** (7): Fed/ECB decisions, China stimulus, OPEC+ cuts, India forex/trade, jobs report
+- **crypto** (6): Bitcoin ETF flows, memecoins, seller exhaustion, XRP liquidation, Visa stablecoin, Kalshi
+- **ai** (5): Nvidia earnings, Anthropic pause, Microsoft AI, AI bubble debate, Morgan Stanley AI funding
+- **other** (6): Israel/Iran strikes, Trump intel firings, Boeing 737 Max, Putin/Trump, Paris bridge, Ukraine drones
+
+### Current results (llama-3.1-8b-instant via Groq)
+
+```
+Entity F1:     0.665  (P=0.814 R=0.601)
+Keyword F1:    0.468  (P=0.572 R=0.400)
+Leakage (avg): 0.000
+Topic Acc:     0.867
+```
+
+The prompt uses 6 few-shot examples with explicit entity/keyword decision rules and topic classification boundaries (especially the regulation vs other distinction for sanctions enforcement).
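
The metrics in the README table can be illustrated with a small, self-contained sketch. It is not taken from `scripts/eval_extraction.py`: the helper names (`normalize`, `prf`, `leakage`) and the case-insensitive set matching are assumptions made for illustration. It also shows one possible reason a reported F1 can sit below the harmonic mean of the reported P and R: if scores are averaged per sample (an assumption about the script), the mean of per-sample F1 values generally differs from the F1 computed from mean precision and mean recall.

```python
from statistics import mean


def normalize(terms: list[str]) -> set[str]:
    """Lowercase and strip terms so comparison ignores case and stray spaces."""
    return {t.strip().lower() for t in terms}


def prf(pred: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Set-based precision, recall and F1 for a single sample."""
    p_set, g_set = normalize(pred), normalize(gold)
    tp = len(p_set & g_set)
    precision = tp / len(p_set) if p_set else 0.0
    recall = tp / len(g_set) if g_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def leakage(entities: list[str], keywords: list[str]) -> float:
    """Fraction of keywords that also appear as entities (hypothetical definition)."""
    ents, kws = normalize(entities), normalize(keywords)
    return len(kws & ents) / len(kws) if kws else 0.0


# Two made-up predictions scored against gold entity lists.
scores = [
    prf(["SEC"], ["SEC", "Binance"]),   # P=1.0, R=0.5
    prf(["OPEC+", "oil"], ["OPEC+"]),   # P=0.5, R=1.0
]
mean_p = mean(s[0] for s in scores)
mean_r = mean(s[1] for s in scores)
macro_f1 = mean(s[2] for s in scores)
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)

print(f"P={mean_p:.3f} R={mean_r:.3f} macro-F1={macro_f1:.3f} F1(P,R)={f1_of_means:.3f}")
print(leakage(["Boeing", "737 Max"], ["production restart", "737 Max"]))
```

In this toy case both samples score the same F1, yet the averaged F1 is lower than the F1 derived from the averaged precision and recall, so the two numbers should not be expected to match in a results block.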

+ 82 - 4
prompts/extract_entities.prompt

@@ -34,9 +34,13 @@ For each candidate term in the text:
 === TOPIC CLASSIFICATION ===
 - crypto: Bitcoin, Ethereum, crypto exchanges, DeFi, tokens, mining, ETFs
 - macro: central banks (Fed, ECB, BoE, BoJ), interest rates, inflation, GDP, employment, fiscal/monetary policy, oil, commodities, China economy
-- regulation: SEC, CFTC, lawsuits, enforcement, legislation, compliance, legal rulings, EU AI Act, financial regulation
-- ai: AI models, chips (Nvidia, AMD), LLMs, generative AI, AI companies, AI regulation (but prefer 'regulation' if legal focus)
-- other: geopolitics, war, politics, elections, corporate earnings (non-AI), general business
+- regulation: GOVERNMENT REGULATORY/ENFORCEMENT ACTIONS — SEC, CFTC, OFAC, Treasury, DOJ, FTC, lawsuits, enforcement, legislation, compliance, legal rulings, EU AI Act, financial regulation, SANCTIONS IMPOSITION/ENFORCEMENT
+- ai: AI models, chips (Nvidia, AMD), LLMs, generative AI, AI companies, AI regulation (but prefer 'regulation' if legal/enforcement focus)
+- other: geopolitics, war, politics, elections, corporate earnings (non-AI), general business, diplomacy, UNLESS it's a regulatory/enforcement action by a government body
+
+KEY DISTINCTION: Sanctions by OFAC/Treasury = REGULATION (enforcement action). Diplomatic talks about sanctions = OTHER (diplomacy).
+SEC lawsuit = REGULATION. Crypto exchange hack = OTHER.
+EU AI Act compliance = REGULATION. AI company product launch = AI.
 
 === SENTIMENT RULES ===
 - positive: clearly encouraging, improving, or supportive tone
@@ -44,6 +48,80 @@ For each candidate term in the text:
 - neutral: factual, balanced, or mixed
 - sentimentScore must be a number from -1.0 to 1.0 and should reflect the sentiment label.
 
+=== FEW-SHOT EXAMPLES ===
+
+Example 1 (regulation):
+Headline: "SEC sues Binance over unregistered securities"
+Summary: "The Securities and Exchange Commission filed a lawsuit against Binance, the world's largest crypto exchange, alleging it operated as an unregistered securities exchange and commingled customer funds."
+Output:
+{
+  "topic": "regulation",
+  "entities": ["SEC", "Binance"],
+  "sentiment": "negative",
+  "sentimentScore": -0.7,
+  "keywords": ["securities law", "crypto exchange", "enforcement action"]
+}
+
+Example 2 (macro):
+Headline: "Fed holds rates steady as inflation cools"
+Summary: "The Federal Reserve kept interest rates unchanged at 5.25-5.50%, citing progress on inflation but signaling caution on future cuts."
+Output:
+{
+  "topic": "macro",
+  "entities": ["Federal Reserve"],
+  "sentiment": "neutral",
+  "sentimentScore": 0.0,
+  "keywords": ["interest rates", "inflation", "monetary policy"]
+}
+
+Example 3 (other - geopolitics):
+Headline: "Israel strikes Iranian missile sites in Syria"
+Summary: "Israeli warplanes targeted Iranian missile depots near Damascus overnight, escalating regional tensions."
+Output:
+{
+  "topic": "other",
+  "entities": ["Israel", "Iran", "Syria", "Damascus"],
+  "sentiment": "negative",
+  "sentimentScore": -0.8,
+  "keywords": ["airstrikes", "missile sites", "regional escalation"]
+}
+
+Example 4 (crypto):
+Headline: "Bitcoin ETFs see record inflows as BTC tops $70k"
+Summary: "US spot Bitcoin ETFs attracted $2.3 billion in net inflows this week as Bitcoin surged past $70,000, driven by institutional demand."
+Output:
+{
+  "topic": "crypto",
+  "entities": ["Bitcoin", "BTC"],
+  "sentiment": "positive",
+  "sentimentScore": 0.7,
+  "keywords": ["ETF inflows", "institutional demand", "price surge"]
+}
+
+Example 5 (ai):
+Headline: "Nvidia beats earnings on AI chip demand"
+Summary: "Nvidia reported quarterly revenue of $26 billion, up 262% year-over-year, driven by insatiable demand for its H100 and Blackwell AI chips."
+Output:
+{
+  "topic": "ai",
+  "entities": ["Nvidia", "H100", "Blackwell"],
+  "sentiment": "positive",
+  "sentimentScore": 0.85,
+  "keywords": ["AI chips", "earnings beat", "revenue growth", "chip demand"]
+}
+
+Example 6 (regulation - enforcement/sanctions):
+Headline: "US Treasury imposes new sanctions on Iranian oil network"
+Summary: "The Office of Foreign Assets Control sanctioned a network of companies and vessels transporting Iranian petroleum in violation of US sanctions."
+Output:
+{
+  "topic": "regulation",
+  "entities": ["US Treasury", "OFAC", "Iran"],
+  "sentiment": "negative",
+  "sentimentScore": -0.6,
+  "keywords": ["sanctions enforcement", "oil network", "sanctions evasion"]
+}
+
 Return STRICT JSON with EXACT keys only:
 { topic, entities, sentiment, sentimentScore, keywords }
-where topic is one of [crypto, macro, regulation, ai, other].
+where topic is one of [crypto, macro, regulation, ai, other].
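
Because the prompt demands strict JSON with an exact key set, a caller may want to check responses before scoring them. The following is a minimal sketch of such a check, not code from this repository: `validate_signal` and its error messages are hypothetical, the allowed values are copied from the prompt text, and the 1-2 word keyword limit follows the README's description of keywords.

```python
import json

ALLOWED_TOPICS = {"crypto", "macro", "regulation", "ai", "other"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
REQUIRED_KEYS = {"topic", "entities", "sentiment", "sentimentScore", "keywords"}


def validate_signal(raw: str) -> list[str]:
    """Return a list of problems found in a raw model response (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if set(data) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(data)}")

    topic = data.get("topic")
    if not isinstance(topic, str) or topic not in ALLOWED_TOPICS:
        problems.append(f"invalid topic: {topic!r}")

    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"invalid sentiment: {sentiment!r}")

    score = data.get("sentimentScore")
    # bool is a subclass of int, so reject it explicitly.
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not -1.0 <= score <= 1.0:
        problems.append(f"sentimentScore out of range: {score!r}")

    for field in ("entities", "keywords"):
        values = data.get(field)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            problems.append(f"{field} must be a list of strings")
        elif field == "keywords":
            for kw in values:
                if not 1 <= len(kw.split()) <= 2:
                    problems.append(f"keyword is not 1-2 words: {kw!r}")
    return problems


response = '{"topic": "macro", "entities": ["Federal Reserve"], "sentiment": "neutral", "sentimentScore": 0.0, "keywords": ["interest rates", "inflation", "monetary policy"]}'
print(validate_signal(response))
```

Returning a list of problems rather than raising on the first one makes it easier to see, per sample, which rule a smaller model tends to break.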

+ 127 - 0
prompts/extract_entities_fewshot.prompt

@@ -0,0 +1,127 @@
+Input cluster JSON:
+{cluster_json}
+
+You MUST extract a news signal from the headline AND summary. Return STRICT JSON only.
+
+Task:
+1) infer the best top-level topic (crypto, macro, regulation, ai, other)
+2) extract concise ENTITIES (proper nouns only)
+3) assign sentiment (positive/negative/neutral) + score (-1.0 to 1.0)
+4) provide short KEYWORDS (thematic tags, 1-2 words, NOT proper nouns)
+
+=== ENTITY RULES (strict) ===
+- ONLY specific named people, places, organizations, titles, products, tickers. 1-5 words.
+- Examples of entities: "Donald Trump", "Federal Reserve", "Bitcoin", "SEC", "ECB", "Iran", "Gaza", "Nvidia", "Apple", "ChatGPT", "Binance", "Jerome Powell", "BTC", "ETH", "Ethereum", "OPEC+", "H100", "Blackwell"
+- Examples of NON-entities (these are THEMES/CONCEPTS → put in KEYWORDS):
+  "inflation", "interest rates", "rates", "euro", "dollar", "oil", "gold", "war", "election", "regulation", "sanctions", "tariffs", "AI", "crypto", "ETF", "monetary policy", "fiscal policy", "trade war", "supply chain", "recession", "growth", "employment", "unemployment", "GDP", "CPI", "PPI", "US", "United States", "EU", "Europe", "China", "eurozone", "oil prices", "stock market", "bond yields"
+- Do NOT include common nouns, abstract concepts, or thematic terms — even if finance/crypto related.
+- Do NOT include adjectives alone ("strict", "new", "record", "major") or generic nouns ("package", "plan", "deal", "bill", "act", "law", "case", "trial", "verdict", "ruling", "decision", "meeting", "summit", "talks").
+
+=== KEYWORD RULES (strict) ===
+- Each keyword MUST be 1-2 words. PREFER 2-word phrases. Avoid single words unless they are established standalone concepts (e.g. "inflation" is ok alone, "sanctions" is ok alone).
+- Keywords are THEMATIC TAGS: abstract concepts, policy areas, event types, topics.
+- Good keywords (mostly 2-word phrases): "interest rates", "monetary policy", "securities law", "airstrikes", "missile sites", "regional escalation", "trade war", "supply chain", "recession risk", "inflation data", "ETF inflows", "institutional demand", "price surge", "AI chips", "earnings beat", "revenue growth", "chip demand", "rate cut", "eurozone inflation", "deposit rate", "monetary easing", "production cuts", "oil prices", "global supply", "demand concerns", "high-risk systems", "compliance requirements", "criminal conviction", "hush money", "falsifying records", "historic verdict", "guilty verdict", "stimulus package", "infrastructure spending", "property sector"
+- Bad keywords: proper nouns (these go in entities), SINGLE generic words ("unregistered", "securities", "ETFs", "inflows", "strict", "rules", "package", "economy", "oil", "prices", "cuts", "demand", "growth", "beat", "report", "data", "concerns"), verb phrases ("warns Iran", "hikes rates", "cuts rates", "sues Binance"), full headline fragments, anything over 2 words.
+- Return 2-4 keywords. Fewer is better than bad ones.
+
+=== DECISION PROCEDURE ===
+For each candidate term in the text:
+1. Is it a specific named person/place/org/product/ticker? → ENTITY
+2. Is it a theme, topic, policy area, or event type? → KEYWORD
+3. Can you form a meaningful 2-word phrase? → KEYWORD (use the phrase)
+4. Unclear? Default to KEYWORD (safer to miss an entity than pollute entities with themes)
+
+=== TOPIC CLASSIFICATION ===
+- crypto: Bitcoin, Ethereum, crypto exchanges, DeFi, tokens, mining, ETFs
+- macro: central banks (Fed, ECB, BoE, BoJ), interest rates, inflation, GDP, employment, fiscal/monetary policy, oil, commodities, China economy
+- regulation: GOVERNMENT REGULATORY/ENFORCEMENT ACTIONS — SEC, CFTC, OFAC, Treasury, DOJ, FTC, lawsuits, enforcement, legislation, compliance, legal rulings, EU AI Act, financial regulation, SANCTIONS IMPOSITION/ENFORCEMENT
+- ai: AI models, chips (Nvidia, AMD), LLMs, generative AI, AI companies, AI regulation (but prefer 'regulation' if legal/enforcement focus)
+- other: geopolitics, war, politics, elections, corporate earnings (non-AI), general business, diplomacy, UNLESS it's a regulatory/enforcement action by a government body
+
+KEY DISTINCTION: Sanctions by OFAC/Treasury = REGULATION (enforcement action). Diplomatic talks about sanctions = OTHER (diplomacy).
+SEC lawsuit = REGULATION. Crypto exchange hack = OTHER.
+EU AI Act compliance = REGULATION. AI company product launch = AI.
+
+=== SENTIMENT RULES ===
+- positive: clearly encouraging, improving, or supportive tone
+- negative: clearly alarming, worsening, severe, conflict, loss, risk, warning tone
+- neutral: factual, balanced, or mixed
+- sentimentScore must be a number from -1.0 to 1.0 and should reflect the sentiment label.
+
+=== FEW-SHOT EXAMPLES ===
+
+Example 1 (regulation):
+Headline: "SEC sues Binance over unregistered securities"
+Summary: "The Securities and Exchange Commission filed a lawsuit against Binance, the world's largest crypto exchange, alleging it operated as an unregistered securities exchange and commingled customer funds."
+Output:
+{
+  "topic": "regulation",
+  "entities": ["SEC", "Binance"],
+  "sentiment": "negative",
+  "sentimentScore": -0.7,
+  "keywords": ["securities law", "crypto exchange", "enforcement action"]
+}
+
+Example 2 (macro):
+Headline: "Fed holds rates steady as inflation cools"
+Summary: "The Federal Reserve kept interest rates unchanged at 5.25-5.50%, citing progress on inflation but signaling caution on future cuts."
+Output:
+{
+  "topic": "macro",
+  "entities": ["Federal Reserve"],
+  "sentiment": "neutral",
+  "sentimentScore": 0.0,
+  "keywords": ["interest rates", "inflation", "monetary policy"]
+}
+
+Example 3 (other - geopolitics):
+Headline: "Israel strikes Iranian missile sites in Syria"
+Summary: "Israeli warplanes targeted Iranian missile depots near Damascus overnight, escalating regional tensions."
+Output:
+{
+  "topic": "other",
+  "entities": ["Israel", "Iran", "Syria", "Damascus"],
+  "sentiment": "negative",
+  "sentimentScore": -0.8,
+  "keywords": ["airstrikes", "missile sites", "regional escalation"]
+}
+
+Example 4 (crypto):
+Headline: "Bitcoin ETFs see record inflows as BTC tops $70k"
+Summary: "US spot Bitcoin ETFs attracted $2.3 billion in net inflows this week as Bitcoin surged past $70,000, driven by institutional demand."
+Output:
+{
+  "topic": "crypto",
+  "entities": ["Bitcoin", "BTC"],
+  "sentiment": "positive",
+  "sentimentScore": 0.7,
+  "keywords": ["ETF inflows", "institutional demand", "price surge"]
+}
+
+Example 5 (ai):
+Headline: "Nvidia beats earnings on AI chip demand"
+Summary: "Nvidia reported quarterly revenue of $26 billion, up 262% year-over-year, driven by insatiable demand for its H100 and Blackwell AI chips."
+Output:
+{
+  "topic": "ai",
+  "entities": ["Nvidia", "H100", "Blackwell"],
+  "sentiment": "positive",
+  "sentimentScore": 0.85,
+  "keywords": ["AI chips", "earnings beat", "revenue growth", "chip demand"]
+}
+
+Example 6 (regulation - enforcement/sanctions):
+Headline: "US Treasury imposes new sanctions on Iranian oil network"
+Summary: "The Office of Foreign Assets Control sanctioned a network of companies and vessels transporting Iranian petroleum in violation of US sanctions."
+Output:
+{
+  "topic": "regulation",
+  "entities": ["US Treasury", "OFAC", "Iran"],
+  "sentiment": "negative",
+  "sentimentScore": -0.6,
+  "keywords": ["sanctions enforcement", "oil network", "sanctions evasion"]
+}
+
+Return STRICT JSON with EXACT keys only:
+{ topic, entities, sentiment, sentimentScore, keywords }
+where topic is one of [crypto, macro, regulation, ai, other].
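
The template above mixes a `{cluster_json}` placeholder with literal JSON braces in the few-shot examples. How `load_prompt` fills the placeholder is not shown in this diff, so the sketch below is only an assumption: `render_prompt` is a hypothetical helper that uses plain string replacement, and the shortened `template` stands in for the real file. It also demonstrates why `str.format` is a poor fit for a template of this shape.

```python
import json


def render_prompt(template: str, cluster: dict) -> str:
    """Substitute the cluster JSON for the {cluster_json} placeholder only."""
    payload = json.dumps(cluster, ensure_ascii=False, indent=2)
    return template.replace("{cluster_json}", payload)


# Shortened stand-in for the prompt file: one placeholder plus literal JSON braces.
template = (
    "Input cluster JSON:\n{cluster_json}\n\n"
    'Output:\n{\n  "topic": "regulation"\n}\n'
)
cluster = {"headline": "SEC sues Binance over unregistered securities"}
prompt = render_prompt(template, cluster)

# str.format treats every brace pair as a replacement field, so the literal
# JSON braces in the few-shot examples make it raise instead of rendering.
try:
    template.format(cluster_json="{}")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(prompt)
print("str.format failed:", format_failed)
```

Plain replacement leaves the example outputs untouched, which avoids having to double every brace in the prompt file.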

+ 387 - 0
scripts/build_annotated_set.py

@@ -0,0 +1,387 @@
+import json
+from pathlib import Path
+
+# 30 diverse clusters selected from live data, manually annotated
+ANNOTATED_SAMPLES = [
+    # === REGULATION (6) ===
+    {
+        "name": "sec_binance_lawsuit",
+        "cluster": {
+            "headline": "SEC sues Binance over unregistered securities",
+            "summary": "The Securities and Exchange Commission filed a lawsuit against Binance, the world's largest crypto exchange, alleging it operated as an unregistered securities exchange and commingled customer funds."
+        },
+        "expected": {
+            "entities": ["SEC", "Binance"],
+            "keywords": ["securities law", "crypto exchange", "enforcement action"],
+            "topic": "regulation"
+        }
+    },
+    {
+        "name": "iran_frozen_funds_talks",
+        "cluster": {
+            "headline": "Khamenei aide says $24B in frozen funds blocking talks",
+            "summary": "A senior aide to Iran's Supreme Leader said $24 billion in frozen Iranian funds are blocking progress in indirect talks with the United States."
+        },
+        "expected": {
+            "entities": ["Iran", "United States", "Khamenei", "Mohsen Rezaei"],
+            "keywords": ["frozen funds", "peace talks", "Iran sanctions"],
+            "topic": "regulation"
+        }
+    },
+    {
+        "name": "us_iran_sanctions",
+        "cluster": {
+            "headline": "US issues new Iran-linked sanctions",
+            "summary": "The US Treasury Department imposed new sanctions targeting Iranian oil and petrochemical networks."
+        },
+        "expected": {
+            "entities": ["United States", "Iran", "Department of the Treasury", "OFAC"],
+            "keywords": ["sanctions", "oil network", "petrochemical"],
+            "topic": "regulation"
+        }
+    },
+    {
+        "name": "house_crypto_bills",
+        "cluster": {
+            "headline": "U.S. House tax committee weighs crypto bills, including relief for small transactions",
+            "summary": "The House Ways and Means Committee discussed legislation providing tax relief for small crypto transactions and clarifying digital asset rules."
+        },
+        "expected": {
+            "entities": ["U.S. House tax committee", "Bitcoin", "Ethereum", "SEC"],
+            "keywords": ["crypto bills", "tax relief", "small transactions"],
+            "topic": "regulation"
+        }
+    },
+    {
+        "name": "wamco_sec_settlement",
+        "cluster": {
+            "headline": "Wamco to Pay $100 Million in SEC Settlement Over Leech Trades",
+            "summary": "Western Asset Management agreed to pay $100 million to settle SEC charges over improper trading by former portfolio manager Ken Leech."
+        },
+        "expected": {
+            "entities": ["Western Asset Management Co.", "Securities and Exchange Commission", "Ken Leech"],
+            "keywords": ["SEC settlement", "trading practices", "portfolio manager"],
+            "topic": "regulation"
+        }
+    },
+    {
+        "name": "us_cuba_sanctions",
+        "cluster": {
+            "headline": "US imposes sanctions on Cuban president, Castro family members",
+            "summary": "The United States sanctioned Cuban President Miguel Diaz-Canel and members of the Castro family over human rights abuses."
+        },
+        "expected": {
+            "entities": ["Cuba", "Cuban president", "Castro family", "Raul Castro", "United States"],
+            "keywords": ["US sanctions", "Cuba tensions", "human rights"],
+            "topic": "regulation"
+        }
+    },
+
+    # === MACRO (7) ===
+    {
+        "name": "fed_rates_inflation",
+        "cluster": {
+            "headline": "Fed holds rates steady as inflation cools",
+            "summary": "The Federal Reserve kept interest rates unchanged at 5.25-5.50%, citing progress on inflation but signaling caution on future cuts."
+        },
+        "expected": {
+            "entities": ["Federal Reserve"],
+            "keywords": ["interest rates", "inflation", "monetary policy"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "ecb_rate_cut",
+        "cluster": {
+            "headline": "ECB cuts rates as eurozone inflation falls to 2.4%",
+            "summary": "The European Central Bank lowered its deposit rate by 25 basis points to 3.75%, marking its first cut since 2019 as inflation approaches target."
+        },
+        "expected": {
+            "entities": ["European Central Bank", "ECB"],
+            "keywords": ["rate cut", "eurozone inflation", "deposit rate", "monetary easing"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "china_stimulus",
+        "cluster": {
+            "headline": "China unveils stimulus package to boost slowing economy",
+            "summary": "Beijing announced a comprehensive stimulus package including infrastructure spending, tax cuts, and monetary easing to counter slowing growth and property sector weakness."
+        },
+        "expected": {
+            "entities": ["China", "Beijing"],
+            "keywords": ["stimulus package", "infrastructure spending", "monetary easing", "property sector"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "oil_opep_cuts",
+        "cluster": {
+            "headline": "Oil jumps after OPEC+ extends production cuts",
+            "summary": "Crude oil prices rose 3% after OPEC+ agreed to extend production cuts through year-end, tightening global supply amid demand concerns."
+        },
+        "expected": {
+            "entities": ["OPEC+"],
+            "keywords": ["production cuts", "oil prices", "global supply", "demand concerns"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "india_forex_reserves",
+        "cluster": {
+            "headline": "India's Forex Reserves Hit $682.32 Billion as RBI Tightens Its Economic Grip",
+            "summary": "India's foreign exchange reserves reached a record high as the Reserve Bank of India maintains tight monetary policy."
+        },
+        "expected": {
+            "entities": ["India", "RBI"],
+            "keywords": ["forex reserves", "monetary policy", "economic grip"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "india_us_trade_pact",
+        "cluster": {
+            "headline": "India, US May Execute Interim Trade Pact by July, Minister Says",
+            "summary": "Commerce Minister Piyush Goyal said India and the US could finalize an interim trade agreement by July, addressing tariffs and market access."
+        },
+        "expected": {
+            "entities": ["India", "US", "Piyush Goyal"],
+            "keywords": ["trade deal", "tariffs", "market access"],
+            "topic": "macro"
+        }
+    },
+    {
+        "name": "jobs_report_fed_bets",
+        "cluster": {
+            "headline": "Investors boost bets for Fed rate rise after bumper US jobs report",
+            "summary": "Strong US payrolls data led traders to increase wagers on Federal Reserve interest rate hikes."
+        },
+        "expected": {
+            "entities": ["US", "Federal Reserve"],
+            "keywords": ["jobs report", "rate hike", "rate rise"],
+            "topic": "macro"
+        }
+    },
+
+    # === CRYPTO (6) ===
+    {
+        "name": "bitcoin_etf_flows",
+        "cluster": {
+            "headline": "Bitcoin ETFs see record inflows as BTC tops $70k",
+            "summary": "US spot Bitcoin ETFs attracted $2.3 billion in net inflows this week as Bitcoin surged past $70,000, driven by institutional demand."
+        },
+        "expected": {
+            "entities": ["Bitcoin", "BTC"],
+            "keywords": ["ETF inflows", "institutional demand", "price surge"],
+            "topic": "crypto"
+        }
+    },
+    {
+        "name": "memecoins_dive",
+        "cluster": {
+            "headline": "Memecoins dogecoin, shiba inu dive 9% as bitcoin nears $60,000",
+            "summary": "Dogecoin and Shiba Inu led memecoin losses as Bitcoin approached the $60,000 level."
+        },
+        "expected": {
+            "entities": ["dogecoin", "shiba inu", "Bitcoin", "memecoins"],
+            "keywords": ["crypto crash", "memecoins dive", "price drop"],
+            "topic": "crypto"
+        }
+    },
+    {
+        "name": "bitcoin_seller_exhaustion",
+        "cluster": {
+            "headline": "Bitcoin teases 'seller exhaustion' as BTC price downside reaches $60.3K",
+            "summary": "Technical analysts note signs of seller exhaustion in Bitcoin as the cryptocurrency tests support near $60,300."
+        },
+        "expected": {
+            "entities": ["Bitcoin", "BTC"],
+            "keywords": ["seller exhaustion", "price support", "technical analysis"],
+            "topic": "crypto"
+        }
+    },
+    {
+        "name": "xrp_liquidation_selloff",
+        "cluster": {
+            "headline": "XRP falls toward $1.10 as liquidation-driven selloff pushes token to multi-month lows",
+            "summary": "XRP dropped sharply as leveraged positions were liquidated, pushing the token to its lowest level in months."
+        },
+        "expected": {
+            "entities": ["XRP", "Bitcoin", "Ethereum"],
+            "keywords": ["liquidation", "selloff", "price crash"],
+            "topic": "crypto"
+        }
+    },
+    {
+        "name": "visa_stablecoin_test",
+        "cluster": {
+            "headline": "Visa tests private stablecoin settlement with Brale, Canton",
+            "summary": "Visa is piloting private stablecoin settlement using Brale and Canton networks for institutional payments."
+        },
+        "expected": {
+            "entities": ["Visa", "Brale", "Canton"],
+            "keywords": ["stablecoin settlement", "private network", "institutional payments"],
+            "topic": "crypto"
+        }
+    },
+    {
+        "name": "prediction_markets_kalshi",
+        "cluster": {
+            "headline": "Prediction Markets Hit $29.4 Billion in May as Kalshi Leads and Brokers Pile In",
+            "summary": "Prediction market volume surged to record levels with Kalshi leading the growth as traditional brokers enter the space."
+        },
+        "expected": {
+            "entities": ["Kalshi", "Prediction Markets", "Bitcoin", "Ethereum", "SEC"],
+            "keywords": ["prediction markets", "trading volume", "broker adoption"],
+            "topic": "crypto"
+        }
+    },
+
+    # === AI (5) ===
+    {
+        "name": "nvidia_earnings_ai",
+        "cluster": {
+            "headline": "Nvidia beats earnings on AI chip demand",
+            "summary": "Nvidia reported quarterly revenue of $26 billion, up 262% year-over-year, driven by insatiable demand for its H100 and Blackwell AI chips."
+        },
+        "expected": {
+            "entities": ["Nvidia", "H100", "Blackwell"],
+            "keywords": ["AI chips", "earnings beat", "revenue growth", "chip demand"],
+            "topic": "ai"
+        }
+    },
+    {
+        "name": "anthropic_ai_pause",
+        "cluster": {
+            "headline": "Anthropic calls for pause of global AI development",
+            "summary": "AI safety company Anthropic urged a coordinated pause on advanced AI development to establish safety standards."
+        },
+        "expected": {
+            "entities": ["Anthropic", "Claude"],
+            "keywords": ["AI development", "global pause", "AI safety"],
+            "topic": "ai"
+        }
+    },
+    {
+        "name": "microsoft_ai_products",
+        "cluster": {
+            "headline": "Has Microsoft Lost Its Mojo (Again)?",
+            "summary": "Analysts question whether Microsoft's AI product strategy is falling behind competitors despite massive investment."
+        },
+        "expected": {
+            "entities": ["Microsoft", "Scott Hanselman", "Github"],
+            "keywords": ["AI products", "catch-up mode", "competition"],
+            "topic": "ai"
+        }
+    },
+    {
+        "name": "ai_bubble_debate",
+        "cluster": {
+            "headline": "`There Is No AI Bubble,' Says BI's Rob Schiffman",
+            "summary": "Business Insider's Rob Schiffman argues current AI investment levels are justified by real revenue growth, not speculation."
+        },
+        "expected": {
+            "entities": ["Robert Schiffman", "New York", "Business Insider"],
+            "keywords": ["AI bubble", "investment thesis", "revenue growth"],
+            "topic": "ai"
+        }
+    },
+    {
+        "name": "morgan_stanley_ai_funding",
+        "cluster": {
+            "headline": "Morgan Stanley Sees AI-Related Funding Expanding to 15% of All Credit Deals",
+            "summary": "Morgan Stanley reports AI-related financing now represents 15% of credit deals, up from near zero two years ago."
+        },
+        "expected": {
+            "entities": ["Morgan Stanley", "Diameter Capital Partners", "Scott Goodwin"],
+            "keywords": ["AI funding", "credit deals", "financing growth"],
+            "topic": "ai"
+        }
+    },
+
+    # === OTHER (6) ===
+    {
+        "name": "israel_iran_strikes",
+        "cluster": {
+            "headline": "Israel strikes Iranian missile sites in Syria",
+            "summary": "Israeli warplanes targeted Iranian missile depots near Damascus overnight, escalating regional tensions."
+        },
+        "expected": {
+            "entities": ["Israel", "Iran", "Syria", "Damascus"],
+            "keywords": ["airstrikes", "missile sites", "regional escalation"],
+            "topic": "other"
+        }
+    },
+    {
+        "name": "trump_intel_firings",
+        "cluster": {
+            "headline": "Trump orders Pulte to start mass firings at intel agencies",
+            "summary": "President Trump directed Bill Pulte to begin mass firings across US intelligence agencies."
+        },
+        "expected": {
+            "entities": ["Donald Trump", "Bill Pulte", "United States"],
+            "keywords": ["mass firings", "intelligence agencies", "government restructuring"],
+            "topic": "other"
+        }
+    },
+    {
+        "name": "boeing_737_max",
+        "cluster": {
+            "headline": "Boeing to launch 737 Max production in July",
+            "summary": "Boeing plans to restart 737 Max production in July under new CEO Kelly Ortberg."
+        },
+        "expected": {
+            "entities": ["Boeing", "Kelly Ortberg", "Seattle", "Everett", "737 Max"],
+            "keywords": ["aircraft manufacturing", "production restart", "737 Max"],
+            "topic": "other"
+        }
+    },
+    {
+        "name": "putin_trump_peer",
+        "cluster": {
+            "headline": "Putin says he treats Trump as 'peer, with respect'",
+            "summary": "Vladimir Putin described his relationship with Donald Trump as one of mutual respect between peers."
+        },
+        "expected": {
+            "entities": ["Vladimir Putin", "Donald Trump", "Ukraine", "St. Petersburg"],
+            "keywords": ["Ukraine war", "diplomatic relations", "peer respect"],
+            "topic": "other"
+        }
+    },
+    {
+        "name": "paris_bridge_history",
+        "cluster": {
+            "headline": "Why is Paris's oldest bridge called the 'New Bridge'?",
+            "summary": "The history of Paris's Pont Neuf, which despite its name is the city's oldest standing bridge."
+        },
+        "expected": {
+            "entities": ["Paris", "Pont Neuf", "Louis Vuitton", "Tanishk Saha"],
+            "keywords": ["Paris bridge", "bridge history", "artistic installations"],
+            "topic": "other"
+        }
+    },
+    {
+        "name": "ukraine_drone_attack",
+        "cluster": {
+            "headline": "Ukraine under heavy drone attack as Zelensky seeks direct meeting with Putin",
+            "summary": "Russia launched a massive drone barrage on Ukraine as President Zelensky pushes for direct talks with Putin."
+        },
+        "expected": {
+            "entities": ["Ukraine", "Russia", "Volodymyr Zelensky", "Vladimir Putin", "Moscow", "Kyiv"],
+            "keywords": ["drone strikes", "conflict escalation", "peace talks"],
+            "topic": "other"
+        }
+    },
+]
+
+# Write to JSON file
+output_path = Path(__file__).parent.parent / "data" / "annotated_samples.json"
+output_path.write_text(json.dumps(ANNOTATED_SAMPLES, indent=2, ensure_ascii=False), encoding="utf-8")
+print(f"Wrote {len(ANNOTATED_SAMPLES)} annotated samples to {output_path}")
+
+# Print distribution
+from collections import Counter
+topics = Counter(s["expected"]["topic"] for s in ANNOTATED_SAMPLES)
+print("\nTopic distribution:")
+for t, c in sorted(topics.items()):
+    print(f"  {t}: {c}")
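
Since the samples file is hand-annotated, a quick structural check before running the evaluation may catch annotation slips early. The following is a minimal sketch, not part of this commit: `validate_sample` and `ALLOWED_TOPICS` are hypothetical names, and the topic set is assumed from the five topics listed in the README (the project's `DEFAULT_TOPICS` is presumably the real source of truth).

```python
from __future__ import annotations

# Assumption: taken from the README's topic list, not from DEFAULT_TOPICS.
ALLOWED_TOPICS = {"crypto", "macro", "regulation", "ai", "other"}


def validate_sample(sample: dict) -> list[str]:
    """Return the problems found in one annotated sample (empty list if none)."""
    name = sample.get("name", "<unnamed>")
    cluster = sample.get("cluster") or {}
    expected = sample.get("expected") or {}
    problems: list[str] = []

    # The cluster needs both text fields that get serialized into the prompt.
    for key in ("headline", "summary"):
        if not cluster.get(key):
            problems.append(f"{name}: cluster.{key} is missing or empty")

    # Both label fields must be lists.
    labels: dict[str, list[str]] = {}
    for key in ("entities", "keywords"):
        value = expected.get(key)
        if isinstance(value, list):
            labels[key] = [str(v) for v in value]
        else:
            labels[key] = []
            problems.append(f"{name}: expected.{key} must be a list")

    if expected.get("topic") not in ALLOWED_TOPICS:
        problems.append(f"{name}: unknown topic {expected.get('topic')!r}")

    # Gold-label leakage: the same string annotated as both entity and keyword.
    entity_set = {e.casefold() for e in labels["entities"]}
    for kw in labels["keywords"]:
        if kw.casefold() in entity_set:
            problems.append(f"{name}: keyword {kw!r} duplicates an entity")

    return problems


sample = {
    "name": "ukraine_drone_attack",
    "cluster": {
        "headline": "Ukraine under heavy drone attack as Zelensky seeks direct meeting with Putin",
        "summary": "Russia launched a massive drone barrage on Ukraine as President Zelensky pushes for direct talks with Putin.",
    },
    "expected": {
        "entities": ["Ukraine", "Russia", "Volodymyr Zelensky", "Vladimir Putin", "Moscow", "Kyiv"],
        "keywords": ["drone strikes", "conflict escalation", "peace talks"],
        "topic": "other",
    },
}
print(validate_sample(sample))  # prints []
```

Running such a check over every entry of `ANNOTATED_SAMPLES` before writing the JSON file would keep gold-label leakage from skewing the leakage metric.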

+ 43 - 14
scripts/eval_extraction.py

@@ -7,6 +7,7 @@ Usage:
   python scripts/eval_extraction.py --verbose          # Show per-sample details
   python scripts/eval_extraction.py --model llama-3.1-8b-instant  # Test specific model
   python scripts/eval_extraction.py --collect N        # Collect N new samples from live DB
+  python scripts/eval_extraction.py --prompt-file prompts/extract_entities_fewshot.prompt  # Test alternate prompt
 """
 
 from __future__ import annotations
@@ -20,19 +21,30 @@ from typing import Any
 # Add project root to path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 
-from news_mcp.llm import build_extraction_prompt, call_llm, load_prompt
+from news_mcp.llm import call_llm, load_prompt
 from news_mcp.config import (
     NEWS_EXTRACT_PROVIDER,
     NEWS_EXTRACT_MODEL,
     NEWS_ENTITY_BLACKLIST,
+    DEFAULT_TOPICS,
 )
 from news_mcp.entity_normalize import normalize_entities
 from news_mcp.enrichment.llm_enrich import _filter_entities
 
 
 # ---------------------------------------------------------------------------
-# Golden samples (curated from real clusters)
+# Load golden samples from JSON file
 # ---------------------------------------------------------------------------
+def load_golden_samples(filepath: str = "data/annotated_samples.json") -> list[dict]:
+    """Load annotated samples from JSON file."""
+    path = Path(__file__).resolve().parent.parent / filepath
+    if not path.exists():
+        # Fall back to the built-in samples when the JSON file is missing
+        return GOLDEN_SAMPLES
+    return json.loads(path.read_text(encoding="utf-8"))
+
+
+# Fallback built-in samples (original 10)
 GOLDEN_SAMPLES = [
     {
         "name": "sec_binance_lawsuit",
@@ -239,14 +251,10 @@ def print_sample_result(name: str, pred: dict, gold: dict, scores: dict, verbose
         print(f"  Sentiment: {pred.get('sentiment')} ({pred.get('sentimentScore')})")
 
 
-async def run_extraction(cluster: dict[str, Any], provider: str, model: str) -> dict[str, Any]:
+async def run_extraction(cluster: dict[str, Any], provider: str, model: str, prompt_text: str) -> dict[str, Any]:
     """Run extraction on a single cluster."""
-    prompt = load_prompt("extract_entities.prompt")
-    # Build the full user prompt
     import json as json_lib
-    user_prompt = prompt.replace("{cluster_json}", json_lib.dumps(cluster, ensure_ascii=False))
-
-    from news_mcp.config import GROQ_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY
+    user_prompt = prompt_text.replace("{cluster_json}", json_lib.dumps(cluster, ensure_ascii=False))
 
     system_prompt = "You are a news signal extraction engine. Return STRICT JSON only."
 
@@ -254,7 +262,7 @@ async def run_extraction(cluster: dict[str, Any], provider: str, model: str) ->
     return json.loads(content)
 
 
-async def evaluate_samples(samples: list[dict], provider: str, model: str, verbose: bool) -> dict:
+async def evaluate_samples(samples: list[dict], provider: str, model: str, prompt_text: str, verbose: bool) -> dict:
     """Run evaluation on all samples."""
     print(f"\nEvaluating {len(samples)} samples with {provider}/{model}...")
     print("-" * 60)
@@ -270,12 +278,9 @@ async def evaluate_samples(samples: list[dict], provider: str, model: str, verbo
         print(f"[{i}/{len(samples)}] {name}...", end=" ", flush=True)
 
         try:
-            pred = await run_extraction(cluster, provider, model)
+            pred = await run_extraction(cluster, provider, model, prompt_text)
 
             # Apply same post-processing as production pipeline
-            from news_mcp.enrichment.llm_enrich import _filter_entities, normalize_entities
-            from news_mcp.config import DEFAULT_TOPICS, NEWS_ENTITY_BLACKLIST
-
             entities = _filter_entities(normalize_entities(pred.get("entities", [])), blacklist=NEWS_ENTITY_BLACKLIST)
             keywords = _filter_entities(normalize_entities(pred.get("keywords", [])), blacklist=NEWS_ENTITY_BLACKLIST)
 
@@ -321,6 +326,19 @@ async def evaluate_samples(samples: list[dict], provider: str, model: str, verbo
     print(f"  Leakage (avg): {agg.get('leakage', 0):.3f}  (entities in keywords)")
     print(f"  Topic Acc:     {agg.get('topic_acc', 0):.3f}")
 
+    # Per-topic breakdown
+    print("\n  Per-topic breakdown:")
+    by_topic = {}
+    for r in results:
+        if "scores" in r:
+            t = r["gold"]["topic"]
+            if t not in by_topic:
+                by_topic[t] = []
+            by_topic[t].append(r["scores"])
+    for topic, scores_list in sorted(by_topic.items()):
+        topic_agg = aggregate_scores(scores_list)
+        print(f"    {topic:12s}: Ent_F1={topic_agg.get('ent_f1',0):.2f} Kw_F1={topic_agg.get('kw_f1',0):.2f} Topic={topic_agg.get('topic_acc',0):.2f} (n={len(scores_list)})")
+
     return {"aggregate": agg, "per_sample": results}
 
 
@@ -355,14 +373,25 @@ def main():
     parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
     parser.add_argument("--collect", type=int, metavar="N", help="Collect N samples from live DB")
     parser.add_argument("--output", default="new_samples.json", help="Output file for collected samples")
+    parser.add_argument("--prompt-file", default="prompts/extract_entities.prompt", help="Prompt file to test")
+    parser.add_argument("--samples-file", default="data/annotated_samples.json", help="Annotated samples JSON")
     args = parser.parse_args()
 
     if args.collect:
         asyncio.run(collect_samples_from_db(args.collect, args.output))
         return
 
+    # Load prompt (only the file name is passed to load_prompt, so any directory in --prompt-file is dropped)
+    prompt_name = Path(args.prompt_file).name
+    prompt_text = load_prompt(prompt_name)
+    print(f"Using prompt: {args.prompt_file}")
+
+    # Load samples
+    samples = load_golden_samples(args.samples_file)
+    print(f"Loaded {len(samples)} samples from {args.samples_file}")
+
     # Run evaluation
-    result = asyncio.run(evaluate_samples(GOLDEN_SAMPLES, args.provider, args.model, args.verbose))
+    result = asyncio.run(evaluate_samples(samples, args.provider, args.model, prompt_text, args.verbose))
 
     # Exit code based on quality threshold
     agg = result["aggregate"]