# Variable Creation Scripts Documentation

This document describes the data processing scripts used to create derived variables in the EOHI3 dataset. Each script performs a specific set of transformations, and the scripts should be run in the order given under Script Execution Order.

---

## datap 04 - combined vars.r

### Goal
Combine self-perspective and other-perspective variables into single columns. For each row, values exist in either the self-perspective variables OR the other-perspective variables, never both.

### Transformations

#### Past Variables (p5 = past)
Combines `self[VAL/PERS/PREF]_p5_[string]` and `other[VAL/PERS/PREF]_p5_[string]` into `past_[val/pers/pref]_[string]`. The domain tag is lowercased in the target name, and the source suffix `_dgen` becomes `_DGEN`.

**Source Variables:**
- **Values (VAL)**: `selfVAL_p5_trad`, `otherVAL_p5_trad`, `selfVAL_p5_autonomy`, `otherVAL_p5_autonomy`, `selfVAL_p5_personal`, `otherVAL_p5_personal`, `selfVAL_p5_justice`, `otherVAL_p5_justice`, `selfVAL_p5_close`, `otherVAL_p5_close`, `selfVAL_p5_connect`, `otherVAL_p5_connect`, `selfVAL_p5_dgen`, `otherVAL_p5_dgen`
- **Personality (PERS)**: `selfPERS_p5_open`, `otherPESR_p5_open` (note: typo in source data), `selfPERS_p5_goal`, `otherPERS_p5_goal`, `selfPERS_p5_social`, `otherPERS_p5_social`, `selfPERS_p5_agree`, `otherPERS_p5_agree`, `selfPERS_p5_stress`, `otherPERS_p5_stress`, `selfPERS_p5_dgen`, `otherPERS_p5_dgen`
- **Preferences (PREF)**: `selfPREF_p5_hobbies`, `otherPREF_p5_hobbies`, `selfPREF_p5_music`, `otherPREF_p5_music`, `selfPREF_p5_dress`, `otherPREF_p5_dress`, `selfPREF_p5_exer`, `otherPREF_p5_exer`, `selfPREF_p5_food`, `otherPREF_p5_food`, `selfPREF_p5_friends`, `otherPREF_p5_friends`, `selfPREF_p5_dgen`, `otherPREF_p5_dgen`

**Target Variables:**
- `past_val_trad`, `past_val_autonomy`, `past_val_personal`, `past_val_justice`, `past_val_close`, `past_val_connect`, `past_val_DGEN`
- `past_pers_open`, `past_pers_goal`, `past_pers_social`, `past_pers_agree`, `past_pers_stress`, `past_pers_DGEN`
- `past_pref_hobbies`, `past_pref_music`, `past_pref_dress`, `past_pref_exer`, `past_pref_food`, `past_pref_friends`, `past_pref_DGEN`

#### Future Variables (f5 = future)
Combines `self[VAL/PERS/PREF]_f5_[string]` and `other[VAL/PERS/PREF]_f5_[string]` into `fut_[val/pers/pref]_[string]`. The domain tag is lowercased in the target name, and the source suffix `_dgen` becomes `_DGEN`.

**Source Variables:**
- **Values (VAL)**: `selfVAL_f5_trad`, `otherVAL_f5_trad`, `selfVAL_f5_autonomy`, `otherVAL_f5_autonomy`, `selfVAL_f5_personal`, `otherVAL_f5_personal`, `selfVAL_f5_justice`, `otherVAL_f5_justice`, `selfVAL_f5_close`, `otherVAL_f5_close`, `selfVAL_f5_connect`, `otherVAL_f5_connect`, `selfVAL_f5_dgen`, `otherVAL_f5_dgen`
- **Personality (PERS)**: `selfPERS_f5_open`, `otherPERS_f5_open`, `selfPERS_f5_goal`, `otherPERS_f5_goal`, `selfPERS_f5_social`, `otherPERS_f5_social`, `selfPERS_f5_agree`, `otherPERS_f5_agree`, `selfPERS_f5_stress`, `otherPERS_f5_stress`, `selfPERS_f5_dgen`, `otherPERS_f5_dgen`
- **Preferences (PREF)**: `selfPREF_f5_hobbies`, `otherPREF_f5_hobbies`, `selfPREF_f5_music`, `otherPREF_f5_music`, `selfPREF_f5_dress`, `otherPREF_f5_dress`, `selfPREF_f5_exer`, `otherPREF_f5_exer`, `selfPREF_f5_food`, `otherPREF_f5_food`, `selfPREF_f5_friends`, `otherPREF_f5_friends`, `selfPREF_f5_dgen`, `otherPREF_f5_dgen`

**Target Variables:**
- `fut_val_trad`, `fut_val_autonomy`, `fut_val_personal`, `fut_val_justice`, `fut_val_close`, `fut_val_connect`, `fut_val_DGEN`
- `fut_pers_open`, `fut_pers_goal`, `fut_pers_social`, `fut_pers_agree`, `fut_pers_stress`, `fut_pers_DGEN`
- `fut_pref_hobbies`, `fut_pref_music`, `fut_pref_dress`, `fut_pref_exer`, `fut_pref_food`, `fut_pref_friends`, `fut_pref_DGEN`

### Logic
- Uses self value if present (not empty/NA), otherwise uses other value
- If both are empty/NA, result is NA
- Assumes mutual exclusivity: each row has values in either self OR other, never both
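
The combination rule above can be sketched in base R as follows. This is a minimal illustration on toy data, not the script's actual code; the helper names `is_blank` and `combine_self_other` are hypothetical, and treating whitespace-only strings as empty is an assumption.

```r
# Hypothetical helper: TRUE where a value is NA or an empty/whitespace-only string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Hypothetical helper: take the self value where present, otherwise the other value.
combine_self_other <- function(self, other) {
  self  <- as.character(self)
  other <- as.character(other)
  out <- ifelse(!is_blank(self), self, other)
  out[is_blank(out)] <- NA_character_   # both sides empty/NA -> NA
  out
}

# Toy data: each row has a value in self OR other, never both.
df <- data.frame(
  selfVAL_p5_trad  = c("3", "", NA, ""),
  otherVAL_p5_trad = c("", "5", "2", ""),
  stringsAsFactors = FALSE
)

df$past_val_trad <- combine_self_other(df$selfVAL_p5_trad, df$otherVAL_p5_trad)
print(df)
```

In the toy data the last row is empty on both sides, so its combined value is NA.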

### Validation Checks
1. **Conflict Check**: Verifies no rows have values in both self and other for the same variable
2. **Coverage Check**: Verifies combined columns have expected number of non-empty values (self_count + other_count = combined_count)
3. **Sample Row Check**: Shows examples of how values were combined
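
As a rough illustration of the three checks, the sketch below runs them on a single toy variable. The helper `is_blank` and the toy values are assumptions made for the example; the real script presumably loops over every self/other pair.

```r
# Hypothetical helper: TRUE where a value is NA or an empty/whitespace-only string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Toy data with an already combined column.
df <- data.frame(
  selfVAL_p5_trad  = c("3", "", NA, ""),
  otherVAL_p5_trad = c("", "5", "2", ""),
  past_val_trad    = c("3", "5", "2", NA),
  stringsAsFactors = FALSE
)

has_self  <- !is_blank(df$selfVAL_p5_trad)
has_other <- !is_blank(df$otherVAL_p5_trad)

# 1. Conflict check: no row may have a value in both self and other.
n_conflicts <- sum(has_self & has_other)

# 2. Coverage check: self_count + other_count = combined_count.
coverage_ok <- sum(has_self) + sum(has_other) == sum(!is_blank(df$past_val_trad))

# 3. Sample row check: show a few rows for visual inspection.
print(head(df, 3))
cat("Conflicts:", n_conflicts, " Coverage OK:", coverage_ok, "\n")
```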

### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing
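
A backup-then-update pattern along these lines could look like the sketch below. It writes to a temporary directory so it can be run safely; the use of `file.copy()` with `overwrite = TRUE` and the placeholder transformation are assumptions, not details confirmed by the script.

```r
# Work in a temporary directory so the sketch does not touch real data.
work_dir    <- tempdir()
data_file   <- file.path(work_dir, "eohi3.csv")
backup_file <- file.path(work_dir, "eohi3_2.csv")

# Toy stand-in for the real dataset, with an empty target column.
write.csv(data.frame(id = 1:3, past_val_trad = c(NA, NA, NA)),
          data_file, row.names = FALSE)

# Create the backup before any processing (overwrite = TRUE is an assumption).
backed_up <- file.copy(data_file, backup_file, overwrite = TRUE)
if (!backed_up) stop("Backup failed; not modifying ", data_file)

# Update the existing target column, then write the file back.
df <- read.csv(data_file, stringsAsFactors = FALSE)
df$past_val_trad <- c(3, 5, 2)   # placeholder for the real transformation
write.csv(df, data_file, row.names = FALSE)
```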

---

## datap 05 - ehi vars.r

### Goal
Calculate EHI (End of History Illusion) variables as the difference between the combined past and future variables. Each EHI variable is the past value minus the corresponding future value.

### Transformations

**Calculation Formula:** `ehi_[pref/pers/val]_[string] = past_[pref/pers/val]_[string] - fut_[pref/pers/val]_[string]`

#### EHI Variables Created

**EHI Preferences:**
- `ehi_pref_hobbies` = `past_pref_hobbies` - `fut_pref_hobbies`
- `ehi_pref_music` = `past_pref_music` - `fut_pref_music`
- `ehi_pref_dress` = `past_pref_dress` - `fut_pref_dress`
- `ehi_pref_exer` = `past_pref_exer` - `fut_pref_exer`
- `ehi_pref_food` = `past_pref_food` - `fut_pref_food`
- `ehi_pref_friends` = `past_pref_friends` - `fut_pref_friends`
- `ehi_pref_DGEN` = `past_pref_DGEN` - `fut_pref_DGEN`

**EHI Personality:**
- `ehi_pers_open` = `past_pers_open` - `fut_pers_open`
- `ehi_pers_goal` = `past_pers_goal` - `fut_pers_goal`
- `ehi_pers_social` = `past_pers_social` - `fut_pers_social`
- `ehi_pers_agree` = `past_pers_agree` - `fut_pers_agree`
- `ehi_pers_stress` = `past_pers_stress` - `fut_pers_stress`
- `ehi_pers_DGEN` = `past_pers_DGEN` - `fut_pers_DGEN`

**EHI Values:**
- `ehi_val_trad` = `past_val_trad` - `fut_val_trad`
- `ehi_val_autonomy` = `past_val_autonomy` - `fut_val_autonomy`
- `ehi_val_personal` = `past_val_personal` - `fut_val_personal`
- `ehi_val_justice` = `past_val_justice` - `fut_val_justice`
- `ehi_val_close` = `past_val_close` - `fut_val_close`
- `ehi_val_connect` = `past_val_connect` - `fut_val_connect`
- `ehi_val_DGEN` = `past_val_DGEN` - `fut_val_DGEN`

### Logic
- Converts source variables to numeric (handling empty strings and NA)
- Calculates difference: past - future
- Result can be positive (past > future), negative (past < future), or zero (past = future)
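
The conversion and subtraction steps can be sketched as below. The helper `to_numeric` and the toy values are illustrative assumptions; only two variable stems are shown, and the real script may structure the loop differently.

```r
# Hypothetical helper: convert to numeric, treating empty strings as NA.
to_numeric <- function(x) {
  x <- as.character(x)
  x[!is.na(x) & trimws(x) == ""] <- NA
  as.numeric(x)
}

# Toy data stored as character, as it might be after reading a CSV.
df <- data.frame(
  past_pref_music = c("4", "2", "", "3"),
  fut_pref_music  = c("1", "2", "5", NA),
  past_pers_open  = c("1", "5", "2", "2"),
  fut_pers_open   = c("3", "5", "1", ""),
  stringsAsFactors = FALSE
)

# ehi_[stem] = past_[stem] - fut_[stem]
stems <- c("pref_music", "pers_open")
for (s in stems) {
  df[[paste0("ehi_", s)]] <-
    to_numeric(df[[paste0("past_", s)]]) - to_numeric(df[[paste0("fut_", s)]])
}
print(df[c("ehi_pref_music", "ehi_pers_open")])
```

If either the past or the future value is missing, the difference is NA.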

### Validation Checks
1. **Variable Existence**: Checks that all target variables exist before processing
2. **Source Variable Check**: Verifies source columns exist
3. **Random Row Validation**: Checks 5 random rows showing source values, target value, expected calculation, and match status

### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing

---

## datap 06 - mean vars.r

### Goal
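Calculate row-wise means by averaging related variables. Creates both domain-specific means (preferences, personality, values) and overall composite means. The `_DGEN` items are not part of the domain-specific means; they are averaged separately in `ehiDGEN_mean`.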

### Transformations

#### Domain-Specific Means

**Past Preferences MEAN:**
- **Source Variables**: `past_pref_hobbies`, `past_pref_music`, `past_pref_dress`, `past_pref_exer`, `past_pref_food`, `past_pref_friends` (6 variables)
- **Target Variable**: `past_pref_MEAN`

**Future Preferences MEAN:**
- **Source Variables**: `fut_pref_hobbies`, `fut_pref_music`, `fut_pref_dress`, `fut_pref_exer`, `fut_pref_food`, `fut_pref_friends` (6 variables)
- **Target Variable**: `fut_pref_MEAN`

**Past Personality MEAN:**
- **Source Variables**: `past_pers_open`, `past_pers_goal`, `past_pers_social`, `past_pers_agree`, `past_pers_stress` (5 variables)
- **Target Variable**: `past_pers_MEAN`

**Future Personality MEAN:**
- **Source Variables**: `fut_pers_open`, `fut_pers_goal`, `fut_pers_social`, `fut_pers_agree`, `fut_pers_stress` (5 variables)
- **Target Variable**: `fut_pers_MEAN`

**Past Values MEAN:**
- **Source Variables**: `past_val_trad`, `past_val_autonomy`, `past_val_personal`, `past_val_justice`, `past_val_close`, `past_val_connect` (6 variables)
- **Target Variable**: `past_val_MEAN`

**Future Values MEAN:**
- **Source Variables**: `fut_val_trad`, `fut_val_autonomy`, `fut_val_personal`, `fut_val_justice`, `fut_val_close`, `fut_val_connect` (6 variables)
- **Target Variable**: `fut_val_MEAN`

**EHI Preferences MEAN:**
- **Source Variables**: `ehi_pref_hobbies`, `ehi_pref_music`, `ehi_pref_dress`, `ehi_pref_exer`, `ehi_pref_food`, `ehi_pref_friends` (6 variables)
- **Target Variable**: `ehi_pref_MEAN`

**EHI Personality MEAN:**
- **Source Variables**: `ehi_pers_open`, `ehi_pers_goal`, `ehi_pers_social`, `ehi_pers_agree`, `ehi_pers_stress` (5 variables)
- **Target Variable**: `ehi_pers_MEAN`

**EHI Values MEAN:**
- **Source Variables**: `ehi_val_trad`, `ehi_val_autonomy`, `ehi_val_personal`, `ehi_val_justice`, `ehi_val_close`, `ehi_val_connect` (6 variables)
- **Target Variable**: `ehi_val_MEAN`

#### Composite Means

**EHI Domain-Specific Mean:**
- **Source Variables**: `ehi_pref_MEAN`, `ehi_pers_MEAN`, `ehi_val_MEAN` (3 variables)
- **Target Variable**: `ehiDS_mean`

**EHI Domain-General Mean:**
- **Source Variables**: `ehi_pref_DGEN`, `ehi_pers_DGEN`, `ehi_val_DGEN` (3 variables)
- **Target Variable**: `ehiDGEN_mean`

### Logic
- Converts source variables to numeric (handling empty strings and NA)
- Calculates row means using `rowMeans()` with `na.rm = TRUE` (ignores NA values)
- Each mean represents the average of non-missing values for that row
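
A minimal sketch of the row-mean step for one scale, on toy data. The helper `to_numeric` is hypothetical. Note that in base R, `rowMeans(..., na.rm = TRUE)` returns `NaN` for a row in which every source value is missing; whether the script converts that to NA is not stated here.

```r
# Hypothetical helper: convert to numeric, treating empty strings as NA.
to_numeric <- function(x) {
  x <- as.character(x)
  x[!is.na(x) & trimws(x) == ""] <- NA
  as.numeric(x)
}

src <- c("past_pers_open", "past_pers_goal", "past_pers_social",
         "past_pers_agree", "past_pers_stress")

# Toy data: row 2 is partly missing, row 3 is entirely missing.
df <- data.frame(
  past_pers_open   = c("1", "2", ""),
  past_pers_goal   = c("2", "",  ""),
  past_pers_social = c("3", "4", ""),
  past_pers_agree  = c("4", NA,  ""),
  past_pers_stress = c("5", "6", ""),
  stringsAsFactors = FALSE
)

# Convert each source column, then average the non-missing values per row.
num <- sapply(df[src], to_numeric)          # numeric matrix, one column per item
df$past_pers_MEAN <- rowMeans(num, na.rm = TRUE)
print(df$past_pers_MEAN)
```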

### Validation Checks
1. **Variable Existence**: Uses `setdiff()` to check source and target variables exist
2. **Random Row Validation**: Checks 5 random rows showing source variable names, source values, target value, expected mean calculation, and match status
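
The two checks might be sketched as follows for the `ehiDS_mean` composite. The toy values, the fixed seed, and the numeric tolerance are assumptions added for the example.

```r
# Toy data with the target column already present but empty.
df <- data.frame(
  ehi_pref_MEAN = c(1, 0, -1, 2, 0.5, 3),
  ehi_pers_MEAN = c(0, 0, 1, NA, 1.5, -1),
  ehi_val_MEAN  = c(2, 3, 0, 4, 1, 1),
  ehiDS_mean    = NA_real_
)
src    <- c("ehi_pref_MEAN", "ehi_pers_MEAN", "ehi_val_MEAN")
target <- "ehiDS_mean"

# 1. Variable existence: setdiff() lists any expected names missing from the data.
missing_vars <- setdiff(c(src, target), names(df))
if (length(missing_vars) > 0) {
  stop("Missing variables: ", paste(missing_vars, collapse = ", "))
}

df[[target]] <- rowMeans(df[src], na.rm = TRUE)

# 2. Random row validation: recompute the mean for 5 random rows and compare.
set.seed(123)  # assumption: seed used only to make the sketch reproducible
rows  <- sample(nrow(df), 5)
check <- data.frame(
  row      = rows,
  target   = df[[target]][rows],
  expected = unname(rowMeans(df[rows, src], na.rm = TRUE))
)
check$match <- abs(check$target - check$expected) < 1e-8
print(check)
```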

### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing

---

## datap 07 - scales and recodes.r

### Goal
Recode various variables and calculate scale scores. Includes recoding categorical variables, computing the Actively Open-Minded Thinking (AOT) total, scoring Cognitive Reflection Test (CRT) items, calculating ICAR scores, and recoding demographic variables.

### Transformations

#### 1. Recode other_length2 → other_length
**Source Variable**: `other_length2`
**Target Variable**: `other_length`

**Recoding Rules:**
- Values 5-9 → "5-9"
- Values 10-14 → "10-14"
- Values 15-19 → "15-19"
- Value "20+" → "20+" (handled as special case)
- Empty strings → preserved as empty string (not NA)
- NA → NA
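
One way to express these rules in base R is sketched below. The function name `recode_other_length` is hypothetical, and returning NA for any value outside the listed ranges is an assumption, since the rules above do not cover that case.

```r
# Hypothetical helper implementing the binning rules for other_length2.
recode_other_length <- function(x) {
  x   <- as.character(x)
  out <- rep(NA_character_, length(x))        # NA stays NA
  out[!is.na(x) & x == ""]    <- ""           # empty strings are preserved
  out[!is.na(x) & x == "20+"] <- "20+"        # special case, not numeric
  n <- suppressWarnings(as.numeric(x))        # "20+" and "" become NA here
  out[!is.na(n) & n >= 5  & n <= 9]  <- "5-9"
  out[!is.na(n) & n >= 10 & n <= 14] <- "10-14"
  out[!is.na(n) & n >= 15 & n <= 19] <- "15-19"
  out
}

other_length2 <- c("5", "12", "19", "20+", "", NA)
other_length  <- recode_other_length(other_length2)
print(other_length)
```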

#### 2. Recode other_like2 → other_like
**Source Variable**: `other_like2`
**Target Variable**: `other_like`

**Recoding Rules:**
- "Dislike a great deal" → "-2"
- "Dislike somewhat" → "-1"
- "Neither like nor dislike" → "0"
- "Like somewhat" → "1"
- "Like a great deal" → "2"
- Empty strings → preserved as empty string (not NA)
- NA → NA
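
A named lookup vector is one compact way to apply this mapping. The sketch below is illustrative; `like_map` and `recode_other_like` are hypothetical names, and mapping unlisted labels to NA is an assumption.

```r
# Lookup table from response label to recoded value (stored as character).
like_map <- c(
  "Dislike a great deal"     = "-2",
  "Dislike somewhat"         = "-1",
  "Neither like nor dislike" = "0",
  "Like somewhat"            = "1",
  "Like a great deal"        = "2"
)

# Hypothetical helper: unmatched labels and NA become NA; "" is preserved.
recode_other_like <- function(x) {
  x   <- as.character(x)
  out <- unname(like_map[x])
  out[!is.na(x) & x == ""] <- ""
  out
}

other_like2 <- c("Like somewhat", "Dislike a great deal",
                 "Neither like nor dislike", "", NA)
other_like  <- recode_other_like(other_like2)
print(other_like)
```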

#### 3. Calculate aot_total (Actively Open-Minded Thinking)
**Source Variables**: `aot01`, `aot02`, `aot03`, `aot04_r`, `aot05_r`, `aot06_r`, `aot07_r`, `aot08`
**Target Variable**: `aot_total`

**Calculation:**
1. Reverse code `aot04_r`, `aot05_r`, `aot06_r`, `aot07_r` by multiplying by -1
2. Calculate mean of all 8 variables: 4 original (`aot01`, `aot02`, `aot03`, `aot08`) + 4 reversed (`aot04_r`, `aot05_r`, `aot06_r`, `aot07_r`)
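
The two steps can be sketched as below. The toy responses assume a scale centred on zero, which is the case where multiplying by -1 mirrors an item; the actual response coding is not stated here. Missing-value handling for this scale is also not described, so the sketch uses the `rowMeans()` default (`na.rm = FALSE`).

```r
# Toy responses, assumed to be coded on a scale centred on zero.
df <- data.frame(
  aot01   = c(2, 1),  aot02   = c(1, 0),
  aot03   = c(2, -1), aot08   = c(1, 1),
  aot04_r = c(-2, 1), aot05_r = c(-1, 0),
  aot06_r = c(-2, 2), aot07_r = c(-1, -1)
)
regular  <- c("aot01", "aot02", "aot03", "aot08")
reversed <- c("aot04_r", "aot05_r", "aot06_r", "aot07_r")

# 1. Reverse code the _r items by multiplying by -1 (on a copy, not in place).
# 2. Average the 4 original and 4 reversed items for each row.
items <- cbind(as.matrix(df[regular]), -1 * as.matrix(df[reversed]))
df$aot_total <- unname(rowMeans(items))
print(df$aot_total)
```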

#### 4. Process CRT Questions → crt_correct and crt_int
**Source Variables**: `crt01`, `crt02`, `crt03`
**Target Variables**: `crt_correct`, `crt_int`

**CRT01:**
- "5 cents" → `crt_correct` = 1, `crt_int` = 0
- "10 cents" → `crt_correct` = 0, `crt_int` = 1
- Other values → `crt_correct` = 0, `crt_int` = 0

**CRT02:**
- "5 minutes" → `crt_correct` += 1, `crt_int` unchanged
- "100 minutes" → `crt_correct` unchanged, `crt_int` += 1
- Other values → both unchanged

**CRT03:**
- "47 days" → `crt_correct` += 1, `crt_int` unchanged
- "24 days" → `crt_correct` unchanged, `crt_int` += 1
- Other values → both unchanged

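**Note**: `crt_correct` and `crt_int` accumulate across all 3 questions, so each ranges from 0 to 3. Because a single response cannot be both correct and intuitive, their sum is also at most 3.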
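
The scoring rules can be sketched as a sum of per-item indicators. The helper `hit` is hypothetical, and the sketch assumes exact string matches and counts missing responses as "other" (0 for both scores), which the rules above do not spell out.

```r
# Hypothetical helper: 1 where the response equals the key, 0 otherwise (NA counts as 0).
hit <- function(x, key) as.integer(!is.na(x) & as.character(x) == key)

# Toy responses.
df <- data.frame(
  crt01 = c("5 cents",   "10 cents",    "7 cents",   NA),
  crt02 = c("5 minutes", "100 minutes", "5 minutes", ""),
  crt03 = c("47 days",   "24 days",     "24 days",   "47 days"),
  stringsAsFactors = FALSE
)

# Number of correct answers across the three items.
df$crt_correct <- hit(df$crt01, "5 cents") +
                  hit(df$crt02, "5 minutes") +
                  hit(df$crt03, "47 days")

# Number of intuitive (but incorrect) answers across the three items.
df$crt_int <- hit(df$crt01, "10 cents") +
              hit(df$crt02, "100 minutes") +
              hit(df$crt03, "24 days")

print(df[c("crt_correct", "crt_int")])
```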

#### 5. Calculate icar_verbal
**Source Variables**: `verbal01`, `verbal02`, `verbal03`, `verbal04`, `verbal05`
**Target Variable**: `icar_verbal`

**Correct Answers:**
- `verbal01` = "5"
- `verbal02` = "8"
- `verbal03` = "It's impossible to tell"
- `verbal04` = "47"
- `verbal05` = "Sunday"

**Calculation**: Proportion correct = (number of correct responses) / 5

#### 6. Calculate icar_matrix
**Source Variables**: `matrix01`, `matrix02`, `matrix03`, `matrix04`, `matrix05`
**Target Variable**: `icar_matrix`

**Correct Answers:**
- `matrix01` = "D"
- `matrix02` = "E"
- `matrix03` = "B"
- `matrix04` = "B"
- `matrix05` = "D"

**Calculation**: Proportion correct = (number of correct responses) / 5

#### 7. Calculate icar_total
**Source Variables**: `verbal01`-`verbal05`, `matrix01`-`matrix05` (10 variables total)
**Target Variable**: `icar_total`

**Calculation**: Proportion correct across all 10 items = (number of correct responses) / 10
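
The three ICAR scores share one pattern: compare each response to its key and divide the number of matches by the number of items. The sketch below uses the answer keys listed above; the helper `prop_correct` is hypothetical, and counting missing or empty responses as incorrect is an assumption.

```r
# Answer keys taken from the lists above.
verbal_key <- c(verbal01 = "5", verbal02 = "8",
                verbal03 = "It's impossible to tell",
                verbal04 = "47", verbal05 = "Sunday")
matrix_key <- c(matrix01 = "D", matrix02 = "E", matrix03 = "B",
                matrix04 = "B", matrix05 = "D")

# Hypothetical helper: proportion of items in `key` answered correctly per row.
prop_correct <- function(data, key) {
  hits <- do.call(cbind, lapply(names(key), function(v) {
    resp <- as.character(data[[v]])
    !is.na(resp) & resp == key[[v]]   # missing responses count as incorrect
  }))
  rowSums(hits) / length(key)
}

# Toy responses: row 1 is all correct, row 2 is partly wrong or missing.
df <- data.frame(
  verbal01 = c("5", "5"),
  verbal02 = c("8", "9"),
  verbal03 = c("It's impossible to tell", NA),
  verbal04 = c("47", "47"),
  verbal05 = c("Sunday", "Monday"),
  matrix01 = c("D", "D"),
  matrix02 = c("E", "E"),
  matrix03 = c("B", "A"),
  matrix04 = c("B", "B"),
  matrix05 = c("D", ""),
  stringsAsFactors = FALSE
)

df$icar_verbal <- prop_correct(df, verbal_key)
df$icar_matrix <- prop_correct(df, matrix_key)
df$icar_total  <- prop_correct(df, c(verbal_key, matrix_key))
print(df[c("icar_verbal", "icar_matrix", "icar_total")])
```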

#### 8. Recode demo_sex → sex
**Source Variable**: `demo_sex`
**Target Variable**: `sex`

**Recoding Rules:**
- "Male" (case-insensitive) → 0
- "Female" (case-insensitive) → 1
- Other values (e.g., "Prefer not to say") → 2
- Empty/NA → NA
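
A minimal sketch of this recode, using `tolower()` for the case-insensitive comparison. The function name `recode_sex` is hypothetical.

```r
# Hypothetical helper implementing the demo_sex recode.
recode_sex <- function(x) {
  x     <- tolower(as.character(x))
  out   <- rep(NA_real_, length(x))      # empty/NA stays NA
  known <- !is.na(x) & x != ""
  out[known] <- 2                        # any other non-empty value
  out[known & x == "male"]   <- 0
  out[known & x == "female"] <- 1
  out
}

demo_sex <- c("Male", "female", "Prefer not to say", "", NA)
sex      <- recode_sex(demo_sex)
print(sex)
```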

#### 9. Recode demo_edu → education
**Source Variable**: `demo_edu`
**Target Variable**: `education` (ordered factor)

**Recoding Rules:**
- "High School (or equivalent)" or "Trade School" → "HS_TS"
- "College Diploma/Certificate" or "University - Undergraduate" → "C_Ug"
- "University - Graduate (Masters)" or "University - PhD" or "Professional Degree (ex. JD/MD)" → "grad_prof"
- Empty/NA → NA

**Factor Levels**: `HS_TS` < `C_Ug` < `grad_prof` (ordered)
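
The grouping and ordering can be sketched with a lookup vector and `factor(..., ordered = TRUE)`. The names `edu_map` and `recode_education` are hypothetical, and mapping labels not listed above to NA is an assumption.

```r
# Lookup table from source label to collapsed category.
edu_map <- c(
  "High School (or equivalent)"     = "HS_TS",
  "Trade School"                    = "HS_TS",
  "College Diploma/Certificate"     = "C_Ug",
  "University - Undergraduate"      = "C_Ug",
  "University - Graduate (Masters)" = "grad_prof",
  "University - PhD"                = "grad_prof",
  "Professional Degree (ex. JD/MD)" = "grad_prof"
)

# Hypothetical helper: empty strings, NA, and unlisted labels become NA.
recode_education <- function(x) {
  factor(unname(edu_map[as.character(x)]),
         levels  = c("HS_TS", "C_Ug", "grad_prof"),
         ordered = TRUE)
}

demo_edu  <- c("Trade School", "University - PhD", "",
               NA, "College Diploma/Certificate")
education <- recode_education(demo_edu)
print(education)
```

Because the factor is ordered, comparisons such as `education >= "C_Ug"` work directly on the levels.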

### Validation Checks
Each transformation includes:
1. **Variable Existence Check**: Verifies source and target variables exist
2. **Value Check**: Verifies expected values exist in source variables (warns about unexpected values)
3. **Post-Processing Verification**: Checks 5 random rows showing source values, target values, and calculations

### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing

---

## Script Execution Order

These scripts should be run in the following order:

1. **datap 04 - combined vars.r** - Combines self/other variables into past/future variables
2. **datap 05 - ehi vars.r** - Calculates EHI variables from past/future differences
3. **datap 06 - mean vars.r** - Calculates mean variables for scales
4. **datap 07 - scales and recodes.r** - Recodes variables and calculates scale scores

Each script creates a backup (`eohi3_2.csv`) before processing and includes validation checks to ensure transformations are performed correctly.
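
If it is convenient to run the four scripts in one go, a small runner along these lines could enforce the order. This is only a sketch: the function `run_in_order` and the `script_dir` argument are hypothetical, and it assumes each script can be run with `source()` from its own directory.

```r
# Script file names in the required execution order.
scripts <- c(
  "datap 04 - combined vars.r",
  "datap 05 - ehi vars.r",
  "datap 06 - mean vars.r",
  "datap 07 - scales and recodes.r"
)

# Hypothetical runner: stops before running anything if a script is missing.
run_in_order <- function(script_dir = ".") {
  paths   <- file.path(script_dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Scripts not found: ", paste(basename(missing), collapse = ", "))
  }
  for (p in paths) {
    message("Running ", basename(p))
    source(p, local = new.env())   # each script runs in its own environment
  }
  invisible(paths)
}

# Example call (not run here): run_in_order("path/to/scripts")
```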