# Data model (normalized schema sketch)
This is the Phase 1 target schema. Exact fields may vary slightly depending on what each data source provides; the goal is to keep raw ingestion **traceable** and analytics **reproducible**.
## Core tables
### `officials`
Represents an individual official (starting with U.S. Congress).
Suggested fields:
- `id` (PK)
- `name` (string)
- `chamber` (enum-like string: House/Senate/Unknown)
- `party` (string, nullable)
- `state` (string, nullable)
- `identifiers` (JSON) — e.g., bioguide ID, source-specific IDs
- `created_at`, `updated_at`
### `securities`
Represents a traded instrument.
Suggested fields:
- `id` (PK)
- `ticker` (string, indexed, nullable) — some disclosures may be missing ticker
- `name` (string, nullable)
- `exchange` (string, nullable)
- `sector` (string, nullable)
- `identifiers` (JSON) — ISIN, CUSIP, etc. (when available)
- `created_at`, `updated_at`
### `trades`
One disclosed transaction record.
Suggested fields:
- `id` (PK)
- `official_id` (FK → `officials.id`)
- `security_id` (FK → `securities.id`)
- `source` (string) — e.g., `quiver`, `fmp`, `house_disclosure`
- `source_trade_id` (string, nullable) — unique within a `source` when provided
- `transaction_date` (date, nullable if unknown)
- `filing_date` (date, nullable)
- `side` (enum-like string: BUY/SELL/EXCHANGE/UNKNOWN)
- `value_range_low` (numeric, nullable)
- `value_range_high` (numeric, nullable)
- `amount` (numeric, nullable) — shares/contracts if available
- `currency` (string, default USD)
- `quality_flags` (JSON) — parse warnings, missing fields, etc.
- `raw` (JSON) — optional: raw payload snapshot for traceability
- `created_at`, `updated_at`
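
As a rough illustration of how `value_range_low`, `value_range_high`, and `quality_flags` might be filled during ingestion, here is a minimal sketch. It assumes the source reports the amount as a text range such as `$1,001 - $15,000`; that input format, the function name `parse_value_range`, and the flag names are all hypothetical and not part of the schema.

```python
import re
from typing import List, Optional, Tuple

# Matches a dollar amount such as "$1,001" or "15000.50"; must start with a digit.
_AMOUNT = re.compile(r"\$?\s*(\d[\d,]*(?:\.\d+)?)")


def parse_value_range(text: Optional[str]) -> Tuple[Optional[float], Optional[float], List[str]]:
    """Return (value_range_low, value_range_high, quality_flags).

    Flag names are illustrative only.
    """
    if not text:
        return None, None, ["missing_value_range"]

    amounts = [float(m.replace(",", "")) for m in _AMOUNT.findall(text)]

    if len(amounts) == 2:
        low, high = sorted(amounts)
        return low, high, []
    if len(amounts) == 1:
        # e.g. an open-ended bucket: only one bound is known
        return amounts[0], None, ["open_ended_value_range"]
    return None, None, ["unparsed_value_range"]


print(parse_value_range("$1,001 - $15,000"))
```

Keeping the flags alongside the parsed numbers means a row with a missing or odd range is still stored, and later analytics can decide whether to include it.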
Uniqueness strategy (typical):
- unique constraint on (`source`, `source_trade_id`) when `source_trade_id` exists
- otherwise, a best-effort dedupe key over (`official_id`, `security_id`, `transaction_date`, `side`, `value_range_high`, `filing_date`)
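
The two-tier uniqueness strategy above could be sketched as a single key function. This is only one possible shape: the function name `dedupe_key`, the `fallback:` prefix, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
from datetime import date
from typing import Optional


def dedupe_key(
    source: str,
    source_trade_id: Optional[str],
    official_id: int,
    security_id: int,
    transaction_date: Optional[date],
    side: str,
    value_range_high: Optional[float],
    filing_date: Optional[date],
) -> str:
    """Build a key that identifies a trade for deduplication."""
    # Preferred: the source's own ID, scoped by source.
    if source_trade_id:
        return f"{source}:{source_trade_id}"

    # Fallback: hash the best-effort tuple. None becomes an empty string.
    # Callers should pass numerics in one consistent type (15000 vs 15000.0
    # would otherwise hash differently).
    parts = [official_id, security_id, transaction_date, side, value_range_high, filing_date]
    canonical = "|".join("" if p is None else str(p) for p in parts)
    return "fallback:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key = dedupe_key("house_disclosure", None, 1, 2, date(2024, 1, 2), "BUY", 15000.0, date(2024, 2, 1))
print(key[:24])
```

The fallback key is best-effort by design: two genuinely distinct trades with identical fields would collide, which is why the source-provided ID is preferred whenever it exists.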
### `prices`
Daily OHLCV for a ticker.
Suggested fields:
- `id` (PK), or a composite key on (`ticker`, `date`, `source`)
- `ticker` (string, indexed)
- `date` (date, indexed)
- `open`, `high`, `low`, `close` (numeric)
- `adj_close` (numeric, nullable)
- `volume` (bigint, nullable)
- `source` (string) — e.g., `yfinance`
- `created_at`, `updated_at`
Unique constraint:
- (`ticker`, `date`, `source`)
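
To show what the unique constraint buys in practice, the following sketch uses Python's built-in `sqlite3` so that reloading the same day from the same source updates the row instead of duplicating it. SQLite is used here only because it needs no setup; the document does not name a database engine. The column types, the helper `upsert_price`, and the sample numbers are made up for the example, and the timestamp columns are omitted for brevity.

```python
import sqlite3

PRICES_DDL = """
CREATE TABLE prices (
    id        INTEGER PRIMARY KEY,
    ticker    TEXT NOT NULL,
    date      TEXT NOT NULL,      -- ISO date string, e.g. 2024-01-02
    open      REAL,
    high      REAL,
    low       REAL,
    close     REAL,
    adj_close REAL,
    volume    INTEGER,
    source    TEXT NOT NULL,
    UNIQUE (ticker, date, source)
)
"""

# Requires SQLite 3.24+ for ON CONFLICT ... DO UPDATE.
UPSERT_SQL = """
INSERT INTO prices (ticker, date, open, high, low, close, adj_close, volume, source)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (ticker, date, source) DO UPDATE SET
    open = excluded.open,
    high = excluded.high,
    low = excluded.low,
    close = excluded.close,
    adj_close = excluded.adj_close,
    volume = excluded.volume
"""


def upsert_price(conn: sqlite3.Connection, row: tuple) -> None:
    """Insert one daily bar, or overwrite it if (ticker, date, source) exists."""
    conn.execute(UPSERT_SQL, row)


conn = sqlite3.connect(":memory:")
conn.execute(PRICES_DDL)
# Made-up values for a placeholder ticker; loading the same day twice keeps one row.
upsert_price(conn, ("XYZ", "2024-01-02", 100.0, 101.0, 99.0, 100.5, 100.5, 1000, "yfinance"))
upsert_price(conn, ("XYZ", "2024-01-02", 100.0, 101.0, 99.0, 100.75, 100.75, 1200, "yfinance"))
print(conn.execute("SELECT COUNT(*), MAX(close) FROM prices").fetchone())
```

Including `source` in the constraint lets bars for the same ticker and date from different providers coexist, which helps when comparing sources.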
## Derived tables
### `metrics_trade`
Per-trade derived analytics (computed after prices are loaded).
Suggested fields:
- `id` (PK)
- `trade_id` (FK → `trades.id`, unique)
- forward returns: `ret_1m`, `ret_3m`, `ret_6m`
- benchmark returns: `bm_ret_1m`, `bm_ret_3m`, `bm_ret_6m`
- abnormal returns: `abret_1m`, `abret_3m`, `abret_6m`
- `calc_version` (string) — allows recomputation while tracking methodology
- `created_at`, `updated_at`
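
The schema does not define how these columns are computed. One common convention, assumed here, is a simple forward return over a fixed number of trading days, with the abnormal return taken as the trade's return minus the benchmark's return over the same window. The trading-day counts (21/63/126) and the function names are assumptions for this sketch.

```python
from typing import Dict, Optional, Sequence

# Assumed mapping from window label to number of trading days.
HORIZONS = {"1m": 21, "3m": 63, "6m": 126}


def forward_return(closes: Sequence[float], start: int, horizon: int) -> Optional[float]:
    """Simple return from closes[start] to closes[start + horizon], or None if unavailable."""
    end = start + horizon
    if start < 0 or end >= len(closes) or closes[start] == 0:
        return None
    return closes[end] / closes[start] - 1.0


def trade_metrics(
    closes: Sequence[float], bm_closes: Sequence[float], start: int
) -> Dict[str, Optional[float]]:
    """Compute ret_*, bm_ret_* and abret_* for each window.

    Both series are assumed to be aligned on the same trading days.
    """
    out: Dict[str, Optional[float]] = {}
    for label, horizon in HORIZONS.items():
        ret = forward_return(closes, start, horizon)
        bm_ret = forward_return(bm_closes, start, horizon)
        out[f"ret_{label}"] = ret
        out[f"bm_ret_{label}"] = bm_ret
        # Abnormal return: trade return minus benchmark return (assumed definition).
        out[f"abret_{label}"] = None if ret is None or bm_ret is None else ret - bm_ret
    return out
```

Windows that run past the end of the available price history come back as `None` rather than a partial figure, so they can be stored as NULL and recomputed once more prices are loaded. Whatever definition is actually used should be recorded in `calc_version`.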
### `metrics_official`
Aggregate metrics per official.
Suggested fields:
- `id` (PK)
- `official_id` (FK → `officials.id`, unique)
- `n_trades`, `n_buys`, `n_sells`
- mean and median abnormal returns for buys, per window (1m/3m/6m), with sample sizes
- `cluster_label` (nullable)
- `flags` (JSON) — descriptive risk/ethics flags + supporting metrics
- `calc_version`
- `created_at`, `updated_at`
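
A per-official aggregate along these lines could be computed as in the sketch below. It assumes each trade is available as a dict carrying `side` and the `abret_*` values from `metrics_trade`; the function name `summarize_buys` and the output keys are hypothetical.

```python
from statistics import mean, median
from typing import Dict, Iterable, Mapping, Optional


def summarize_buys(trades: Iterable[Mapping], window: str = "abret_1m") -> Dict[str, Optional[float]]:
    """Mean/median abnormal return over BUY trades for one window, plus sample size."""
    values = [
        t[window]
        for t in trades
        if t.get("side") == "BUY" and t.get(window) is not None
    ]
    if not values:
        return {"n": 0, "mean": None, "median": None}
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Made-up numbers purely for illustration.
sample = [
    {"side": "BUY", "abret_1m": 0.02},
    {"side": "BUY", "abret_1m": -0.01},
    {"side": "BUY", "abret_1m": 0.05},
    {"side": "SELL", "abret_1m": 0.10},   # ignored: not a buy
    {"side": "BUY", "abret_1m": None},    # ignored: window not yet computable
]
print(summarize_buys(sample))
```

Storing the sample size next to each average matters because an official with two trades and one with two hundred should not be read the same way.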
## Notes on time and lags
- Disclosures often have a filing delay; keep **both** `transaction_date` and `filing_date`.
- When computing event windows, anchor them to `transaction_date`, but also record the **disclosure lag** (`filing_date` minus `transaction_date`) as a descriptive attribute.
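
The disclosure lag mentioned above can be derived directly from the two dates. A minimal sketch, assuming the lag is measured in calendar days as `filing_date` minus `transaction_date` (the function name is hypothetical):

```python
from datetime import date
from typing import Optional


def disclosure_lag_days(transaction_date: Optional[date], filing_date: Optional[date]) -> Optional[int]:
    """Calendar days between the trade and its filing; None if either date is missing."""
    if transaction_date is None or filing_date is None:
        return None
    return (filing_date - transaction_date).days


print(disclosure_lag_days(date(2024, 1, 2), date(2024, 2, 1)))  # 30
```

A negative result means the filing is dated before the transaction, which most likely points to a data problem and could be noted in `quality_flags`.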