
# Free Testing: Data Sources & Sample Data Strategies

Your question: "How can we test for free?"

Great question! Here are several strategies for testing the full pipeline without paid API keys:


## Strategy 1: Mock/Fixture Data (Current Approach)

What we already have:

- `tests/conftest.py` creates an in-memory SQLite DB with sample officials, securities, and trades
- Unit tests use mocked yfinance responses (see `test_price_loader.py`)
- Cost: $0
- Coverage: models, DB logic, ETL transforms, analytics calculations

**Pros:** fast, deterministic, no network, exercises edge cases.
**Cons:** doesn't validate real API behavior or data quality.
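
For instance, a minimal sketch of the mocked-price-client pattern (the function and client names here are illustrative, not POTE's actual API):

```python
# Illustrative sketch of unit-testing a price fetch with a mocked client.
# `fetch_latest_close` and the injected client are hypothetical names.
from unittest.mock import MagicMock


def fetch_latest_close(ticker: str, client) -> float:
    """Return the most recent closing price via an injected price client."""
    history = client.history(ticker, period="1d")
    return history[-1]["close"]


def test_fetch_latest_close_with_mock():
    fake = MagicMock()
    fake.history.return_value = [{"close": 189.25}]
    assert fetch_latest_close("AAPL", fake) == 189.25
    fake.history.assert_called_once_with("AAPL", period="1d")
```

Because the client is injected, the test never touches the network and runs in milliseconds.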


## Strategy 2: Free Public Congressional Trade Data

### Option A: House Stock Watcher (Community Project)

- URL: https://housestockwatcher.com/
- Format: web scraping (no official API, but an RSS feed is available)
- Data: real-time House trades (its sister project, Senate Stock Watcher, covers the Senate)
- License: public domain (scraped from official disclosures)
- Cost: $0
- How to use:
  1. Scrape the RSS feed or JSON data from their GitHub repo
  2. Parse it into our trades schema
  3. Use it as an integration test fixture

Example:

```python
# Unofficial but free JSON endpoint (may change without notice)
import httpx

resp = httpx.get("https://housestockwatcher.com/api/all_transactions")
trades = resp.json()
```
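
Since the endpoint is unofficial and may change, one option is to snapshot a response once and replay it offline; a sketch (the helper name and fixture path are illustrative, not existing POTE code):

```python
# Sketch: persist a one-time API response as an offline test fixture.
# `save_trade_snapshot` is a hypothetical helper, not existing POTE code.
import json
from pathlib import Path


def save_trade_snapshot(trades: list[dict], path: Path, limit: int = 100) -> int:
    """Write the first `limit` trades to a JSON fixture; return the count saved."""
    subset = trades[:limit]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(subset, indent=2))
    return len(subset)
```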

### Option B: Senate Stock Watcher API

### Option C: Official Senate eFD (Electronic Financial Disclosures)

### Option D: Quiver Quantitative Free Tier

Integration test example:

```python
# Set QUIVERQUANT_API_KEY in .env for integration tests
import os

import pytest

@pytest.mark.integration
@pytest.mark.skipif(not os.getenv("QUIVERQUANT_API_KEY"), reason="No API key")
def test_quiver_live_fetch():
    client = QuiverClient(api_key=os.getenv("QUIVERQUANT_API_KEY"))
    trades = client.fetch_recent_trades(limit=10)
    assert len(trades) > 0
```

## Strategy 3: Use Sample/Historical Datasets

### Option A: Pre-downloaded CSV Snapshots

1. Manually download 1-2 weeks of data from House/Senate Stock Watcher
2. Store it in `tests/fixtures/sample_trades.csv`
3. Load it in integration tests

Example:

```python
import pandas as pd
from pathlib import Path

def test_etl_with_real_data():
    csv_path = Path(__file__).parent / "fixtures" / "sample_trades.csv"
    df = pd.read_csv(csv_path)
    # Run the ETL pipeline (TradeLoader and session come from the project)
    loader = TradeLoader(session)
    loader.ingest_trades(df)
    # Assert trades were stored correctly
```

### Option B: Kaggle Datasets

- Search for "congressional stock trades" on Kaggle
- Example: https://www.kaggle.com/datasets (check for recent uploads)
- Download the CSV and store it in `tests/fixtures/`

Combine all strategies:

1. Unit tests (fast, always run):
   - Use mocked data for models, ETL, analytics
   - `pytest tests/` (current setup)

2. Integration tests (optional, gated by env var):

   ```python
   import os

   import pytest

   @pytest.mark.integration
   @pytest.mark.skipif(not os.getenv("ENABLE_LIVE_TESTS"), reason="Skipping live tests")
   def test_live_quiver_api():
       # Hits the real Quiver API (free tier)
       pass
   ```

3. Fixture-based tests (real data shape, no network):
   - Store 100 real trades in `tests/fixtures/sample_trades.json`
   - Test ETL, analytics, edge cases

4. Manual smoke tests (dev only):
   - `python scripts/fetch_sample_prices.py` (uses yfinance, free)
   - `python scripts/ingest_house_watcher.py` (once we build it)
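
One way to wire up that env-var gating is a `conftest.py` hook that auto-skips `live`-marked tests unless `ENABLE_LIVE_TESTS` is set. A sketch (hypothetical, not existing POTE code):

```python
# Hypothetical tests/conftest.py sketch: skip `live` tests unless ENABLE_LIVE_TESTS is set.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    """Auto-skip tests marked `live` when live testing is not enabled."""
    if os.getenv("ENABLE_LIVE_TESTS"):
        return  # env var present: run everything as collected
    skip_live = pytest.mark.skip(reason="set ENABLE_LIVE_TESTS=1 to run live tests")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With this in place, a plain `pytest tests/` run stays free and offline; exporting `ENABLE_LIVE_TESTS=1` opts into the network-hitting tests.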

For PR2 (Congress Trade Ingestion):

1. Build a House Stock Watcher scraper (free, no API key needed):
   - Module: `src/pote/ingestion/house_watcher.py`
   - Scrape their RSS or JSON endpoint
   - Parse into the `Trade` model
   - Store 100 sample trades in `tests/fixtures/`

2. Add integration test markers:

   ```toml
   # pyproject.toml
   [tool.pytest.ini_options]
   markers = [
       "integration: marks tests as integration tests (require DB/network)",
       "slow: marks tests as slow",
       "live: requires external API/network (use --live flag)",
   ]
   ```

3. Make PR2 testable without paid APIs:

   ```bash
   # Unit tests (always pass, use mocks)
   pytest tests/ -m "not integration"

   # Integration tests (optional, use fixtures or free APIs)
   pytest tests/ -m integration

   # Live tests (only if you have API keys)
   QUIVERQUANT_API_KEY=xxx pytest tests/ -m live
   ```
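The fixture from step 1 can then back a fast regression test. A sketch of loading and sanity-checking it (the function name and required-field set are assumptions, not POTE's actual code):

```python
# Sketch: load the stored JSON fixture and keep only well-formed rows.
# `load_trade_fixture` and REQUIRED_FIELDS are illustrative, not POTE's actual code.
import json
from pathlib import Path

REQUIRED_FIELDS = {"official_name", "ticker", "transaction_date", "side"}


def load_trade_fixture(path: Path) -> list[dict]:
    """Return fixture rows that contain every required field."""
    rows = json.loads(path.read_text())
    return [row for row in rows if REQUIRED_FIELDS <= row.keys()]
```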

## Cost Comparison

| Source | Free Tier | Paid Tier | Best For |
|---|---|---|---|
| yfinance | Unlimited | N/A | Prices (already working) |
| House Stock Watcher | Unlimited scraping | N/A | Free trades (best option) |
| Quiver Free | 500 calls/mo | $30/mo (5k calls) | Testing, not production |
| FMP Free | 250 calls/day | $15/mo | Alternative for trades |
| Mock data | N/A | N/A | Unit tests |

## Bottom Line

You can build and test the entire system for $0 by:

1. Using House/Senate Stock Watcher for real trade data (free, unlimited)
2. Using yfinance for prices (already working)
3. Storing fixture snapshots for regression tests
4. Optionally using the Quiver free tier (500 calls/mo) for validation

No paid API is required until you want:

- Production-grade rate limits
- Historical data beyond 1-2 years
- Official support/SLAs

## Example: Building a Free Trade Scraper (PR2)

```python
# src/pote/ingestion/house_watcher.py
from datetime import date, datetime, timedelta

import httpx


class HouseWatcherClient:
    """Free congressional trade scraper."""

    BASE_URL = "https://housestockwatcher.com"

    def fetch_recent_trades(self, days: int = 7) -> list[dict]:
        """Scrape recent trades (free, no API key)."""
        resp = httpx.get(f"{self.BASE_URL}/api/all_transactions")
        resp.raise_for_status()

        # Filter to the last N days; dates arrive as ISO-formatted strings
        cutoff = date.today() - timedelta(days=days)
        trades = []
        for raw in resp.json():
            try:
                tx_date = datetime.strptime(raw["transaction_date"], "%Y-%m-%d").date()
            except (KeyError, ValueError):
                continue  # skip rows with missing or malformed dates
            if tx_date >= cutoff:
                trades.append(self._normalize(raw))
        return trades

    def _normalize(self, raw: dict) -> dict:
        """Convert HouseWatcher format to our Trade schema."""
        return {
            "official_name": raw["representative"],
            "ticker": raw["ticker"],
            "transaction_date": raw["transaction_date"],
            "filing_date": raw["disclosure_date"],
            # Compare case-insensitively to catch "purchase" / "Purchase" variants
            "side": "buy" if "purchase" in raw["type"].lower() else "sell",
            "value_min": raw.get("amount_min"),
            "value_max": raw.get("amount_max"),
            "source": "house_watcher",
        }
```
Let me know if you want me to implement this scraper now for PR2! 🚀