# Data sources (public) + limitations
POTE only uses **lawfully available public data**. This project is for **private research** and produces **descriptive analytics** (not investment advice).
## Candidate sources (Phase 1)
### U.S. Congress trading disclosures
- **QuiverQuant (API)**: provides congressional trading data (availability depends on plan/keys).
- **Financial Modeling Prep (FMP)**: provides endpoints related to congressional trading and other market metadata (availability depends on plan/keys).
- **Official disclosure sources** (future): House/Senate disclosure filings where accessible and lawful to process.
POTE will treat source data as “best effort” and store:
- `source` (where it came from)
- `source_trade_id` (if provided)
- `raw` payload snapshot (optional, for traceability)
- `quality_flags` describing parse/coverage issues
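
As a rough illustration of the fields listed above, the sketch below models them as a Python dataclass. The class name `TradeProvenance`, the field types, and the `missing_ticker` flag are assumptions made for this example, not the project's actual models.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TradeProvenance:
    """Provenance fields kept alongside each ingested trade (illustrative only)."""

    source: str                            # where the record came from
    source_trade_id: Optional[str] = None  # provider's own ID, if it supplies one
    raw: Optional[dict[str, Any]] = None   # optional payload snapshot for traceability
    quality_flags: list[str] = field(default_factory=list)  # parse/coverage issues


# Hypothetical payload: the provider gave a company name but no ticker.
record = TradeProvenance(
    source="example_provider",
    raw={"ticker": None, "asset": "Example Corp"},
)
if not record.raw.get("ticker"):
    record.quality_flags.append("missing_ticker")  # hypothetical flag name
```

Keeping `raw` optional lets a deployment trade storage cost against the ability to re-parse a record later.
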
### Daily price data
- **yfinance** (Yahoo Finance wrapper) for daily OHLCV (research use; subject to availability and terms).
- Alternative provider adapters can be added later (e.g., Stooq, Alpha Vantage, or Polygon, as configured by the user).
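
One way to keep price providers swappable is a small adapter interface. The sketch below is an assumption about how that could look: `PriceProvider`, `FixturePriceProvider`, `load_prices`, and the row layout are hypothetical names, and the in-memory provider stands in for a real adapter (such as one wrapping yfinance) so the example runs offline.

```python
from datetime import date
from typing import Protocol


class PriceProvider(Protocol):
    """Hypothetical interface that each price adapter would implement."""

    def daily_bars(self, ticker: str, start: date, end: date) -> list[dict]:
        """Return daily OHLCV rows with start <= row["date"] <= end."""
        ...


class FixturePriceProvider:
    """In-memory provider backed by fixture rows, usable without network access."""

    def __init__(self, rows: dict[str, list[dict]]) -> None:
        self._rows = rows

    def daily_bars(self, ticker: str, start: date, end: date) -> list[dict]:
        # Unknown tickers yield an empty list rather than an error.
        return [r for r in self._rows.get(ticker, []) if start <= r["date"] <= end]


def load_prices(provider: PriceProvider, ticker: str, start: date, end: date) -> list[dict]:
    # Callers depend only on the interface, so providers can be swapped freely.
    return provider.daily_bars(ticker, start, end)


fixture = FixturePriceProvider({
    "XYZ": [
        {"date": date(2024, 1, 2), "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5, "volume": 1000},
        {"date": date(2024, 1, 3), "open": 10.5, "high": 10.8, "low": 10.1, "close": 10.2, "volume": 800},
    ]
})
bars = load_prices(fixture, "XYZ", date(2024, 1, 3), date(2024, 1, 31))
```
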
## Known limitations / pitfalls
### Disclosure quality and ambiguity
- **Tickers may be missing or wrong**; some disclosures list company names only or broad funds.
- Transactions may be **value ranges** rather than exact amounts.
- Some entries may reflect **family accounts** or managed accounts depending on disclosure details.
- Duplicate records can occur across sources; deduplication is probabilistic when no unique ID exists.
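
Two of the issues above lend themselves to a short sketch: turning a disclosed value range into numeric bounds, and building a fallback key for deduplication when no source ID exists. Both helpers are hypothetical. The range format (`"$1,001 - $15,000"`), the treatment of a single amount as a lower bound, and the choice of fingerprint fields are assumptions, and a fingerprint match is only a heuristic that can both miss true duplicates and merge distinct trades.

```python
import hashlib
import re
from typing import Optional


def parse_amount_range(text: str) -> tuple[Optional[int], Optional[int]]:
    """Parse a value range such as "$1,001 - $15,000" into (low, high).

    Assumption: a single amount (e.g. "Over $50,000,000") is a lower bound,
    giving (low, None). Text with no dollar amounts gives (None, None).
    """
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$(\d[\d,]*)", text)]
    if not amounts:
        return (None, None)
    if len(amounts) == 1:
        return (amounts[0], None)
    return (min(amounts), max(amounts))


def trade_fingerprint(member: str, ticker: str, trade_date: str, side: str, amount_text: str) -> str:
    """Build a fallback dedup key from case- and whitespace-normalized fields."""
    parts = [member, ticker, trade_date, side, amount_text]
    normalized = "|".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


low, high = parse_amount_range("$1,001 - $15,000")
key_a = trade_fingerprint("Jane Doe", "XYZ", "2024-01-10", "purchase", "$1,001 - $15,000")
key_b = trade_fingerprint("jane  doe", "xyz", "2024-01-10", "Purchase", "$1,001 - $15,000")
```

Here `key_a` and `key_b` are equal because the inputs differ only in case and spacing; records that differ in spelling or date format would still need fuzzier matching.
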
### Timing and “lag”
- Trades are often disclosed **after** the transaction date. Any analysis must account for:
  - transaction date
  - filing date
  - **disclosure lag** (filing date minus transaction date, in days)
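
The lag itself is simple date arithmetic. A minimal sketch, assuming both dates are already parsed into `datetime.date` values (the function name is hypothetical):

```python
from datetime import date


def disclosure_lag_days(transaction_date: date, filing_date: date) -> int:
    """Return filing date minus transaction date, in whole days.

    A negative result means the filing precedes the trade, which most likely
    signals a data error worth recording as a quality flag.
    """
    return (filing_date - transaction_date).days


lag = disclosure_lag_days(date(2024, 1, 10), date(2024, 2, 20))  # 41 days
```
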
### Survivorship / coverage
- Some data providers may have incomplete histories or change coverage over time.
- Price history may be missing for delisted tickers or distorted by corporate actions (e.g., splits, mergers, ticker changes).
### Interpretation risks
- Correlation is not causation; return outcomes do not imply intent or information access.
- High abnormal returns can occur by chance; small samples are especially noisy.
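
To make the small-sample point concrete, the sketch below computes the mean of a handful of returns together with its standard error (sample standard deviation divided by the square root of the sample size). The return values are invented for illustration, and the helper name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(returns: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a sample of returns."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two observations to estimate dispersion")
    mean = statistics.fmean(returns)
    standard_error = statistics.stdev(returns) / math.sqrt(n)
    return mean, standard_error


# Four made-up abnormal returns: the mean looks positive (0.0275), but the
# standard error (about 0.034) is larger than the mean itself.
sample = [0.10, -0.04, 0.07, -0.02]
mean, se = mean_and_standard_error(sample)
```

With only four observations the mean is less than one standard error from zero, so the apparent outperformance is indistinguishable from noise.
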
## Source governance in this repo
- No scraping that violates terms or access controls.
- No bypassing paywalls, authentication, or restrictions.
- When adding a new source, document:
  - endpoint/coverage
  - required API keys / limits
  - normalization mapping to the internal schema
  - known quirks
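
A normalization mapping can be documented as plain data. The sketch below is purely illustrative: the provider field names on the left and the internal field names on the right are assumptions, not the schema of any real API or of this repo.

```python
# Hypothetical provider field -> hypothetical internal field.
FIELD_MAP = {
    "Representative": "member_name",
    "Ticker": "ticker",
    "TransactionDate": "transaction_date",
    "ReportDate": "filing_date",
    "Range": "amount_text",
}


def normalize(payload: dict, source: str) -> dict:
    """Map a provider payload onto internal field names, flagging missing values."""
    record = {"source": source, "raw": payload, "quality_flags": []}
    for provider_key, internal_key in FIELD_MAP.items():
        value = payload.get(provider_key)
        if value in (None, ""):
            # Record the gap instead of failing, in line with "best effort" ingestion.
            record["quality_flags"].append(f"missing_{internal_key}")
            value = None
        record[internal_key] = value
    return record


normalized = normalize(
    {"Representative": "Jane Doe", "Ticker": "", "TransactionDate": "2024-01-10",
     "ReportDate": "2024-02-20", "Range": "$1,001 - $15,000"},
    source="example_provider",
)
```
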