# Data model (normalized schema sketch)

This is the Phase 1 target schema. Exact fields may vary slightly by available source data; the goal is to keep raw ingestion **traceable** and analytics **reproducible**.

## Core tables

### `officials`

Represents an individual official (starting with U.S. Congress).

Suggested fields:

- `id` (PK)
- `name` (string)
- `chamber` (enum-like string: House/Senate/Unknown)
- `party` (string, nullable)
- `state` (string, nullable)
- `identifiers` (JSON): e.g., bioguide ID, source-specific IDs
- `created_at`, `updated_at`

### `securities`

Represents a traded instrument.

Suggested fields:

- `id` (PK)
- `ticker` (string, indexed, nullable): some disclosures may be missing ticker
- `name` (string, nullable)
- `exchange` (string, nullable)
- `sector` (string, nullable)
- `identifiers` (JSON): ISIN, CUSIP, etc. (when available)
- `created_at`, `updated_at`

### `trades`

One disclosed transaction record.

Suggested fields:

- `id` (PK)
- `official_id` (FK → `officials.id`)
- `security_id` (FK → `securities.id`)
- `source` (string): e.g., `quiver`, `fmp`, `house_disclosure`
- `source_trade_id` (string, nullable): unique if provided
- `transaction_date` (date, nullable if unknown)
- `filing_date` (date, nullable)
- `side` (enum-like string: BUY/SELL/EXCHANGE/UNKNOWN)
- `value_range_low` (numeric, nullable)
- `value_range_high` (numeric, nullable)
- `amount` (numeric, nullable): shares/contracts if available
- `currency` (string, default USD)
- `quality_flags` (JSON): parse warnings, missing fields, etc.
- `raw` (JSON): optional raw payload snapshot for traceability
- `created_at`, `updated_at`

Uniqueness strategy (typical):

- unique constraint on (`source`, `source_trade_id`) when `source_trade_id` exists
- otherwise a best-effort dedupe key (official, security, transaction_date, side, value_range_high, filing_date)

### `prices`

Daily OHLCV for a ticker.

Suggested fields:

- `id` (PK) or composite key
- `ticker` (string, indexed)
- `date` (date, indexed)
- `open`, `high`, `low`, `close` (numeric)
- `adj_close` (numeric, nullable)
- `volume` (bigint, nullable)
- `source` (string): e.g., `yfinance`
- `created_at`, `updated_at`

Unique constraint:

- (`ticker`, `date`, `source`)

## Derived tables

### `metrics_trade`

Per-trade derived analytics (computed after prices are loaded).

Suggested fields:

- `id` (PK)
- `trade_id` (FK → `trades.id`, unique)
- forward returns: `ret_1m`, `ret_3m`, `ret_6m`
- benchmark returns: `bm_ret_1m`, `bm_ret_3m`, `bm_ret_6m`
- abnormal returns: `abret_1m`, `abret_3m`, `abret_6m`
- `calc_version` (string): allows recomputation while tracking methodology
- `created_at`, `updated_at`

### `metrics_official`

Aggregate metrics per official.

Suggested fields:

- `id` (PK)
- `official_id` (FK → `officials.id`, unique)
- `n_trades`, `n_buys`, `n_sells`
- average/median abnormal returns for buys (by window) + sample sizes
- `cluster_label` (nullable)
- `flags` (JSON): descriptive risk/ethics flags + supporting metrics
- `calc_version`
- `created_at`, `updated_at`

## Notes on time and lags

- Disclosures often have a filing delay; keep **both** `transaction_date` and `filing_date`.
- When doing “event windows”, prefer windows relative to `transaction_date`, but also compute/record **disclosure lag** as a descriptive attribute.
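
The two notes on time and lags can be illustrated with a short Python sketch. It is a minimal example, not part of the schema: the helper names `disclosure_lag_days` and `window_end`, the `WINDOW_DAYS` mapping, and the approximation of one month as 30 calendar days are all assumptions made for illustration. The actual pipeline may use trading-day or calendar-month windows instead.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption for this sketch: a "month" is approximated as 30 calendar days.
WINDOW_DAYS = {"1m": 30, "3m": 90, "6m": 180}


def disclosure_lag_days(
    transaction_date: Optional[date], filing_date: Optional[date]
) -> Optional[int]:
    """Days between the trade and its filing; None if either date is missing."""
    if transaction_date is None or filing_date is None:
        return None
    return (filing_date - transaction_date).days


def window_end(transaction_date: date, window: str) -> date:
    """End date of an event window anchored on transaction_date."""
    return transaction_date + timedelta(days=WINDOW_DAYS[window])


# Hypothetical trade row holding only the two date columns from `trades`.
trade = {"transaction_date": date(2024, 1, 10), "filing_date": date(2024, 2, 20)}

lag = disclosure_lag_days(trade["transaction_date"], trade["filing_date"])
end_3m = window_end(trade["transaction_date"], "3m")
print(lag, end_3m)  # 41 2024-04-09
```

Returning `None` rather than raising mirrors the nullable `transaction_date` and `filing_date` columns. A negative lag (filing before the trade) most likely signals a parsing problem, and one option is to note it in `quality_flags` rather than discard the row.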