# Free Testing: Data Sources & Sample Data Strategies

## Your Question: "How can we test for free?"

Great question! Here are several strategies for testing the full pipeline **without paid API keys**:

---

## Strategy 1: Mock/Fixture Data (Current Approach ✅)

**What we already have:**

- `tests/conftest.py` creates an in-memory SQLite DB with sample officials, securities, and trades
- Unit tests use mocked `yfinance` responses (see `test_price_loader.py`)
- **Cost**: $0
- **Coverage**: Models, DB logic, ETL transforms, analytics calculations

**Pros**: Fast, deterministic, no network, tests edge cases
**Cons**: Doesn't validate real API behavior or data quality

---

## Strategy 2: Free Public Congressional Trade Data

### Option A: **House Stock Watcher** (Community Project)

- **URL**: https://housestockwatcher.com/
- **Format**: Web scraping (no official API, but an RSS feed is available)
- **Data**: Real-time congressional trades (House & Senate)
- **License**: Public domain (scraped from official disclosures)
- **Cost**: $0
- **How to use**:
  1. Scrape the RSS feed or JSON data from their GitHub repo
  2. Parse into our `trades` schema
  3.
     Use as an integration test fixture

**Example**:

```python
# They have a JSON API endpoint (unofficial but free)
import httpx

resp = httpx.get("https://housestockwatcher.com/api/all_transactions")
trades = resp.json()
```

### Option B: **Senate Stock Watcher** API

- **URL**: https://senatestockwatcher.com/
- Similar to House Stock Watcher, community-maintained
- Free JSON endpoints

### Option C: **Official Senate eFD** (Electronic Financial Disclosures)

- **URL**: https://efdsearch.senate.gov/search/
- **Format**: Web forms (no API; requires scraping)
- **Cost**: $0, but you have to build the scraper
- **Data**: Official Senate disclosures (PTRs)

### Option D: **Quiver Quantitative Free Tier**

- **URL**: https://www.quiverquant.com/
- **Free tier**: 500 API calls/month (limited but usable for testing)
- **Signup**: Email + API key (free)
- **Data**: Congress, Senate, House trades + insider trades
- **Docs**: https://api.quiverquant.com/docs

**Integration test example**:

```python
# Set QUIVERQUANT_API_KEY in .env for integration tests
import os

import pytest


@pytest.mark.integration
@pytest.mark.skipif(not os.getenv("QUIVERQUANT_API_KEY"), reason="No API key")
def test_quiver_live_fetch():
    # QuiverClient is our (to-be-built) wrapper around the Quiver API
    client = QuiverClient(api_key=os.getenv("QUIVERQUANT_API_KEY"))
    trades = client.fetch_recent_trades(limit=10)
    assert len(trades) > 0
```

---

## Strategy 3: Use Sample/Historical Datasets

### Option A: **Pre-downloaded CSV Snapshots**

1. Manually download 1-2 weeks of data from House/Senate Stock Watcher
2. Store in `tests/fixtures/sample_trades.csv`
3.
   Load in integration tests

**Example**:

```python
import pandas as pd
from pathlib import Path


def test_etl_with_real_data(session):  # `session` fixture from conftest.py
    csv_path = Path(__file__).parent / "fixtures" / "sample_trades.csv"
    df = pd.read_csv(csv_path)

    # Run the ETL pipeline
    loader = TradeLoader(session)
    loader.ingest_trades(df)

    # Assert trades were stored correctly
    assert session.query(Trade).count() == len(df)
```

### Option B: **Kaggle Datasets**

- Search for "congressional stock trades" on Kaggle
- Example: https://www.kaggle.com/datasets (check for recent uploads)
- Download the CSV and store it in `tests/fixtures/`

---

## Strategy 4: Hybrid Testing (Recommended 🌟)

**Combine all strategies**:

1. **Unit tests** (fast, always run):
   - Use mocked data for models, ETL, analytics
   - `pytest tests/` (current setup)

2. **Integration tests** (optional, gated by an env var):

   ```python
   @pytest.mark.integration
   @pytest.mark.skipif(not os.getenv("ENABLE_LIVE_TESTS"), reason="Skipping live tests")
   def test_live_quiver_api():
       # Hits the real Quiver API (free tier)
       pass
   ```

3. **Fixture-based tests** (real data shape, no network):
   - Store 100 real trades in `tests/fixtures/sample_trades.json`
   - Test ETL, analytics, edge cases

4. **Manual smoke tests** (dev only):
   - `python scripts/fetch_sample_prices.py` (uses yfinance, free)
   - `python scripts/ingest_house_watcher.py` (once we build it)

---

## Recommended Next Steps

### For PR2 (Congress Trade Ingestion):

1. **Build a House Stock Watcher scraper** (free, no API key needed)
   - Module: `src/pote/ingestion/house_watcher.py`
   - Scrape their RSS or JSON endpoint
   - Parse into the `Trade` model
   - Store 100 sample trades in `tests/fixtures/`

2. **Add integration test markers**:

   ```toml
   # pyproject.toml
   [tool.pytest.ini_options]
   markers = [
       "integration: marks tests as integration tests (require DB/network)",
       "slow: marks tests as slow",
       "live: requires external API/network (use --live flag)",
   ]
   ```

3.
   **Make PR2 testable without paid APIs**:

   ```bash
   # Unit tests (always pass, use mocks)
   pytest tests/ -m "not integration"

   # Integration tests (optional, use fixtures or free APIs)
   pytest tests/ -m integration

   # Live tests (only if you have API keys)
   QUIVERQUANT_API_KEY=xxx pytest tests/ -m live
   ```

---

## Cost Comparison

| Source | Free Tier | Paid Tier | Best For |
|--------|-----------|-----------|----------|
| **yfinance** | Unlimited | N/A | Prices (already working ✅) |
| **House Stock Watcher** | Unlimited scraping | N/A | Free trades (best option) |
| **Quiver Free** | 500 calls/mo | $30/mo (5k calls) | Testing, not production |
| **FMP Free** | 250 calls/day | $15/mo | Alternative for trades |
| **Mock data** | ∞ | N/A | Unit tests |

---

## Bottom Line

**You can build and test the entire system for $0** by:

1. Using **House/Senate Stock Watcher** for real trade data (free, unlimited)
2. Using **yfinance** for prices (already working)
3. Storing **fixture snapshots** for regression tests
4.
   Optionally using the **Quiver free tier** (500 calls/mo) for validation

**No paid API is required until you want:**

- Production-grade rate limits
- Historical data beyond 1-2 years
- Official support/SLAs

---

## Example: Building a Free Trade Scraper (PR2)

```python
# src/pote/ingestion/house_watcher.py
import httpx


class HouseWatcherClient:
    """Free congressional trade scraper."""

    BASE_URL = "https://housestockwatcher.com"

    def fetch_recent_trades(self, days: int = 7) -> list[dict]:
        """Scrape recent trades (free, no API key)."""
        resp = httpx.get(f"{self.BASE_URL}/api/all_transactions")
        resp.raise_for_status()
        trades = resp.json()
        # Normalize the newest entries to our schema
        # (TODO: filter to the last `days` days once the date format is confirmed)
        return [self._normalize(t) for t in trades[:100]]

    def _normalize(self, raw: dict) -> dict:
        """Convert the House Stock Watcher format to our Trade schema."""
        return {
            "official_name": raw["representative"],
            "ticker": raw["ticker"],
            "transaction_date": raw["transaction_date"],
            "filing_date": raw["disclosure_date"],
            "side": "buy" if "Purchase" in raw["type"] else "sell",
            "value_min": raw.get("amount_min"),
            "value_max": raw.get("amount_max"),
            "source": "house_watcher",
        }
```

Let me know if you want me to implement this scraper now for PR2! 🚀
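And for reference, the `_normalize` mapping sketched for the scraper can be sanity-checked completely offline, with no network and no API key. A minimal sketch, using a hand-written payload whose field names mirror the example above (they are assumptions, not a confirmed House Stock Watcher contract):

```python
# Offline sanity check for the _normalize mapping (no network, no API key).
# The payload keys are assumptions taken from the scraper example, not a
# confirmed House Stock Watcher contract.

def normalize(raw: dict) -> dict:
    """Standalone copy of HouseWatcherClient._normalize, for illustration."""
    return {
        "official_name": raw["representative"],
        "ticker": raw["ticker"],
        "transaction_date": raw["transaction_date"],
        "filing_date": raw["disclosure_date"],
        "side": "buy" if "Purchase" in raw["type"] else "sell",
        "value_min": raw.get("amount_min"),
        "value_max": raw.get("amount_max"),
        "source": "house_watcher",
    }


sample = {
    "representative": "Jane Doe",
    "ticker": "MSFT",
    "transaction_date": "2024-01-15",
    "disclosure_date": "2024-02-01",
    "type": "Purchase",
    # amount_min / amount_max omitted on purpose: .get() should return None
}

row = normalize(sample)
assert row["side"] == "buy"
assert row["value_min"] is None
assert row["source"] == "house_watcher"
```

The same shape drops straight into a pytest test once the client exists, and a handful of these canned payloads (buy, sell, missing amounts) cover the edge cases without ever touching the network.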