eohi3 var creation and recode scripts #5

Merged
ira merged 1 commit from workB into master 2026-02-02 10:26:22 -05:00
7 changed files with 2057 additions and 524 deletions

3
.gitignore vendored

@ -1 +1,2 @@
.history/
.history/
eohi3_2.csv

315
eohi3/00 - var creation.md Normal file

@ -0,0 +1,315 @@
# Variable Creation Scripts Documentation
This document describes the data processing scripts used to create derived variables in the EOHI3 dataset. Each script performs specific transformations and should be run in sequence.
---
## datap 04 - combined vars.r
### Goal
Combine self-perspective and other-perspective variables into single columns. For each row, values exist in either the self-perspective variables OR the other-perspective variables, never both.
### Transformations
#### Past Variables (p5 = past)
Combines `self[VAL/PERS/PREF]_p5_[string]` and `other[VAL/PERS/PREF]_p5_[string]` into `past_[val/pers/pref]_[string]`.
**Source Variables:**
- **Values (VAL)**: `selfVAL_p5_trad`, `otherVAL_p5_trad`, `selfVAL_p5_autonomy`, `otherVAL_p5_autonomy`, `selfVAL_p5_personal`, `otherVAL_p5_personal`, `selfVAL_p5_justice`, `otherVAL_p5_justice`, `selfVAL_p5_close`, `otherVAL_p5_close`, `selfVAL_p5_connect`, `otherVAL_p5_connect`, `selfVAL_p5_dgen`, `otherVAL_p5_dgen`
- **Personality (PERS)**: `selfPERS_p5_open`, `otherPESR_p5_open` (note: typo in source data), `selfPERS_p5_goal`, `otherPERS_p5_goal`, `selfPERS_p5_social`, `otherPERS_p5_social`, `selfPERS_p5_agree`, `otherPERS_p5_agree`, `selfPERS_p5_stress`, `otherPERS_p5_stress`, `selfPERS_p5_dgen`, `otherPERS_p5_dgen`
- **Preferences (PREF)**: `selfPREF_p5_hobbies`, `otherPREF_p5_hobbies`, `selfPREF_p5_music`, `otherPREF_p5_music`, `selfPREF_p5_dress`, `otherPREF_p5_dress`, `selfPREF_p5_exer`, `otherPREF_p5_exer`, `selfPREF_p5_food`, `otherPREF_p5_food`, `selfPREF_p5_friends`, `otherPREF_p5_friends`, `selfPREF_p5_dgen`, `otherPREF_p5_dgen`
**Target Variables:**
- `past_val_trad`, `past_val_autonomy`, `past_val_personal`, `past_val_justice`, `past_val_close`, `past_val_connect`, `past_val_DGEN`
- `past_pers_open`, `past_pers_goal`, `past_pers_social`, `past_pers_agree`, `past_pers_stress`, `past_pers_DGEN`
- `past_pref_hobbies`, `past_pref_music`, `past_pref_dress`, `past_pref_exer`, `past_pref_food`, `past_pref_friends`, `past_pref_DGEN`
#### Future Variables (f5 = future)
Combines `self[VAL/PERS/PREF]_f5_[string]` and `other[VAL/PERS/PREF]_f5_[string]` into `fut_[val/pers/pref]_[string]`.
**Source Variables:**
- **Values (VAL)**: `selfVAL_f5_trad`, `otherVAL_f5_trad`, `selfVAL_f5_autonomy`, `otherVAL_f5_autonomy`, `selfVAL_f5_personal`, `otherVAL_f5_personal`, `selfVAL_f5_justice`, `otherVAL_f5_justice`, `selfVAL_f5_close`, `otherVAL_f5_close`, `selfVAL_f5_connect`, `otherVAL_f5_connect`, `selfVAL_f5_dgen`, `otherVAL_f5_dgen`
- **Personality (PERS)**: `selfPERS_f5_open`, `otherPERS_f5_open`, `selfPERS_f5_goal`, `otherPERS_f5_goal`, `selfPERS_f5_social`, `otherPERS_f5_social`, `selfPERS_f5_agree`, `otherPERS_f5_agree`, `selfPERS_f5_stress`, `otherPERS_f5_stress`, `selfPERS_f5_dgen`, `otherPERS_f5_dgen`
- **Preferences (PREF)**: `selfPREF_f5_hobbies`, `otherPREF_f5_hobbies`, `selfPREF_f5_music`, `otherPREF_f5_music`, `selfPREF_f5_dress`, `otherPREF_f5_dress`, `selfPREF_f5_exer`, `otherPREF_f5_exer`, `selfPREF_f5_food`, `otherPREF_f5_food`, `selfPREF_f5_friends`, `otherPREF_f5_friends`, `selfPREF_f5_dgen`, `otherPREF_f5_dgen`
**Target Variables:**
- `fut_val_trad`, `fut_val_autonomy`, `fut_val_personal`, `fut_val_justice`, `fut_val_close`, `fut_val_connect`, `fut_val_DGEN`
- `fut_pers_open`, `fut_pers_goal`, `fut_pers_social`, `fut_pers_agree`, `fut_pers_stress`, `fut_pers_DGEN`
- `fut_pref_hobbies`, `fut_pref_music`, `fut_pref_dress`, `fut_pref_exer`, `fut_pref_food`, `fut_pref_friends`, `fut_pref_DGEN`
### Logic
- Uses self value if present (not empty/NA), otherwise uses other value
- If both are empty/NA, result is NA
- Assumes mutual exclusivity: each row has values in either self OR other, never both
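
The rule above can be sketched on a toy data frame. This is a minimal illustration, not the script itself: the toy values and the helper name `combine_self_other` are assumptions made for the example.

```r
# Toy data: each row is filled in either the self or the other column.
toy <- data.frame(
  selfVAL_p5_trad  = c("3", "", NA, ""),
  otherVAL_p5_trad = c("", "5", "2", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prefer the self value, fall back to the other value,
# and return NA when both are empty or NA.
combine_self_other <- function(self, other) {
  self_ok  <- !is.na(self) & self != ""
  other_ok <- !is.na(other) & other != ""
  ifelse(self_ok, self, ifelse(other_ok, other, NA))
}

toy$past_val_trad <- combine_self_other(toy$selfVAL_p5_trad, toy$otherVAL_p5_trad)
print(toy)
```

The last row has no value on either side, so its combined value is NA.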
### Validation Checks
1. **Conflict Check**: Verifies no rows have values in both self and other for the same variable
2. **Coverage Check**: Verifies combined columns have expected number of non-empty values (self_count + other_count = combined_count)
3. **Sample Row Check**: Prints the self, other, and combined values of `past_val_trad` for the first 5 rows
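
A small sketch of how the conflict and coverage checks relate, assuming toy columns `self_x` and `other_x` and a hypothetical helper `filled()`; none of these names come from the script. Row 3 deliberately violates mutual exclusivity, which the conflict check reports and which also makes the coverage counts disagree.

```r
toy <- data.frame(
  self_x  = c("3", "", "4", ""),
  other_x = c("", "5", "1", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a cell holds a real value.
filled <- function(x) !is.na(x) & x != ""

# Conflict check: rows where both self and other are filled.
conflict_rows <- which(filled(toy$self_x) & filled(toy$other_x))

# Coverage check: combined count should equal self count + other count.
combined <- ifelse(filled(toy$self_x), toy$self_x,
                   ifelse(filled(toy$other_x), toy$other_x, NA))
self_count     <- sum(filled(toy$self_x))
other_count    <- sum(filled(toy$other_x))
combined_count <- sum(filled(combined))
coverage_ok    <- combined_count == self_count + other_count

print(conflict_rows)
print(coverage_ok)
```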
### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing
---
## datap 05 - ehi vars.r
### Goal
Calculate EHI (End of History Illusion) variables as the difference between past and future variables. Each EHI variable represents the change from past to future perspective.
### Transformations
**Calculation Formula:** `ehi_[pref/pers/val]_[string] = past_[pref/pers/val]_[string] - fut_[pref/pers/val]_[string]`
#### EHI Variables Created
**EHI Preferences:**
- `ehi_pref_hobbies` = `past_pref_hobbies` - `fut_pref_hobbies`
- `ehi_pref_music` = `past_pref_music` - `fut_pref_music`
- `ehi_pref_dress` = `past_pref_dress` - `fut_pref_dress`
- `ehi_pref_exer` = `past_pref_exer` - `fut_pref_exer`
- `ehi_pref_food` = `past_pref_food` - `fut_pref_food`
- `ehi_pref_friends` = `past_pref_friends` - `fut_pref_friends`
- `ehi_pref_DGEN` = `past_pref_DGEN` - `fut_pref_DGEN`
**EHI Personality:**
- `ehi_pers_open` = `past_pers_open` - `fut_pers_open`
- `ehi_pers_goal` = `past_pers_goal` - `fut_pers_goal`
- `ehi_pers_social` = `past_pers_social` - `fut_pers_social`
- `ehi_pers_agree` = `past_pers_agree` - `fut_pers_agree`
- `ehi_pers_stress` = `past_pers_stress` - `fut_pers_stress`
- `ehi_pers_DGEN` = `past_pers_DGEN` - `fut_pers_DGEN`
**EHI Values:**
- `ehi_val_trad` = `past_val_trad` - `fut_val_trad`
- `ehi_val_autonomy` = `past_val_autonomy` - `fut_val_autonomy`
- `ehi_val_personal` = `past_val_personal` - `fut_val_personal`
- `ehi_val_justice` = `past_val_justice` - `fut_val_justice`
- `ehi_val_close` = `past_val_close` - `fut_val_close`
- `ehi_val_connect` = `past_val_connect` - `fut_val_connect`
- `ehi_val_DGEN` = `past_val_DGEN` - `fut_val_DGEN`
### Logic
- Converts source variables to numeric (handling empty strings and NA)
- Calculates difference: past - future
- Result can be positive (past > future), negative (past < future), or zero (past = future)
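
As a rough sketch of this logic on made-up values (the helper name `to_numeric` is an assumption for the example): empty strings are turned into NA before conversion, so a missing value on either side yields NA for the difference.

```r
toy <- data.frame(
  past_pref_music = c("4", "2", "", NA),
  fut_pref_music  = c("1", "2", "3", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat empty strings as NA, then convert to numeric.
to_numeric <- function(x) {
  x[!is.na(x) & x == ""] <- NA
  as.numeric(x)
}

# EHI = past - future
toy$ehi_pref_music <- to_numeric(toy$past_pref_music) - to_numeric(toy$fut_pref_music)
print(toy)
```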
### Validation Checks
1. **Variable Existence**: Checks that all target variables exist before processing
2. **Source Variable Check**: Verifies source columns exist
3. **Random Row Validation**: Checks 5 random rows showing source values, target value, expected calculation, and match status
### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing
---
## datap 06 - mean vars.r
### Goal
Calculate mean variables for various scales by averaging multiple related variables. Creates both domain-specific means and overall composite means.
### Transformations
#### Domain-Specific Means
**Past Preferences MEAN:**
- **Source Variables**: `past_pref_hobbies`, `past_pref_music`, `past_pref_dress`, `past_pref_exer`, `past_pref_food`, `past_pref_friends` (6 variables)
- **Target Variable**: `past_pref_MEAN`
**Future Preferences MEAN:**
- **Source Variables**: `fut_pref_hobbies`, `fut_pref_music`, `fut_pref_dress`, `fut_pref_exer`, `fut_pref_food`, `fut_pref_friends` (6 variables)
- **Target Variable**: `fut_pref_MEAN`
**Past Personality MEAN:**
- **Source Variables**: `past_pers_open`, `past_pers_goal`, `past_pers_social`, `past_pers_agree`, `past_pers_stress` (5 variables)
- **Target Variable**: `past_pers_MEAN`
**Future Personality MEAN:**
- **Source Variables**: `fut_pers_open`, `fut_pers_goal`, `fut_pers_social`, `fut_pers_agree`, `fut_pers_stress` (5 variables)
- **Target Variable**: `fut_pers_MEAN`
**Past Values MEAN:**
- **Source Variables**: `past_val_trad`, `past_val_autonomy`, `past_val_personal`, `past_val_justice`, `past_val_close`, `past_val_connect` (6 variables)
- **Target Variable**: `past_val_MEAN`
**Future Values MEAN:**
- **Source Variables**: `fut_val_trad`, `fut_val_autonomy`, `fut_val_personal`, `fut_val_justice`, `fut_val_close`, `fut_val_connect` (6 variables)
- **Target Variable**: `fut_val_MEAN`
**EHI Preferences MEAN:**
- **Source Variables**: `ehi_pref_hobbies`, `ehi_pref_music`, `ehi_pref_dress`, `ehi_pref_exer`, `ehi_pref_food`, `ehi_pref_friends` (6 variables)
- **Target Variable**: `ehi_pref_MEAN`
**EHI Personality MEAN:**
- **Source Variables**: `ehi_pers_open`, `ehi_pers_goal`, `ehi_pers_social`, `ehi_pers_agree`, `ehi_pers_stress` (5 variables)
- **Target Variable**: `ehi_pers_MEAN`
**EHI Values MEAN:**
- **Source Variables**: `ehi_val_trad`, `ehi_val_autonomy`, `ehi_val_personal`, `ehi_val_justice`, `ehi_val_close`, `ehi_val_connect` (6 variables)
- **Target Variable**: `ehi_val_MEAN`
#### Composite Means
**EHI Domain-Specific Mean:**
- **Source Variables**: `ehi_pref_MEAN`, `ehi_pers_MEAN`, `ehi_val_MEAN` (3 variables)
- **Target Variable**: `ehiDS_mean`
**EHI Domain-General Mean:**
- **Source Variables**: `ehi_pref_DGEN`, `ehi_pers_DGEN`, `ehi_val_DGEN` (3 variables)
- **Target Variable**: `ehiDGEN_mean`
### Logic
- Converts source variables to numeric (handling empty strings and NA)
- Calculates row means using `rowMeans()` with `na.rm = TRUE` (ignores NA values)
- Each mean represents the average of non-missing values for that row
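
A minimal sketch with invented values shows how `rowMeans()` behaves with `na.rm = TRUE`. One detail worth knowing: when every source value in a row is NA, `rowMeans()` returns `NaN` rather than `NA`.

```r
toy <- data.frame(
  past_pers_open   = c(4, NA, NA),
  past_pers_goal   = c(2, 3, NA),
  past_pers_social = c(3, 5, NA)
)

source_vars <- c("past_pers_open", "past_pers_goal", "past_pers_social")

# Average the non-missing values in each row.
toy$past_pers_MEAN <- rowMeans(toy[, source_vars], na.rm = TRUE)
print(toy)
```

Row 2 averages only its two available values, and row 3 has nothing to average.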
### Validation Checks
1. **Variable Existence**: Uses `setdiff()` to check source and target variables exist
2. **Random Row Validation**: Checks 5 random rows showing source variable names, source values, target value, expected mean calculation, and match status
### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing
---
## datap 07 - scales and recodes.r
### Goal
Recode variables and calculate scale scores: recodes categorical variables, scores the Actively Open-Minded Thinking (AOT) scale, processes Cognitive Reflection Test (CRT) items, calculates ICAR (International Cognitive Ability Resource) scores, and recodes demographic variables.
### Transformations
#### 1. Recode other_length2 → other_length
**Source Variable**: `other_length2`
**Target Variable**: `other_length`
**Recoding Rules:**
- Values 5-9 → "5-9"
- Values 10-14 → "10-14"
- Values 15-19 → "15-19"
- Value "20+" → "20+" (handled as special case)
- Empty strings → preserved as empty string (not NA)
- NA → NA
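
One possible way to express these rules, assuming `other_length2` is read as character. The function name `recode_length` is hypothetical, and returning NA for values outside the listed ranges is an assumption, since the rules above do not cover that case.

```r
# Hypothetical helper implementing the binning rules listed above.
recode_length <- function(x) {
  out <- rep(NA_character_, length(x))
  out[!is.na(x) & x == ""] <- ""        # keep empty strings as empty strings
  out[!is.na(x) & x == "20+"] <- "20+"  # special case, not numeric
  num <- suppressWarnings(as.numeric(x))  # "20+" and "" become NA here
  out[!is.na(num) & num >= 5  & num <= 9]  <- "5-9"
  out[!is.na(num) & num >= 10 & num <= 14] <- "10-14"
  out[!is.na(num) & num >= 15 & num <= 19] <- "15-19"
  out
}

other_length2 <- c("7", "12", "19", "20+", "", NA)
other_length  <- recode_length(other_length2)
print(other_length)
```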
#### 2. Recode other_like2 → other_like
**Source Variable**: `other_like2`
**Target Variable**: `other_like`
**Recoding Rules:**
- "Dislike a great deal" → "-2"
- "Dislike somewhat" → "-1"
- "Neither like nor dislike" → "0"
- "Like somewhat" → "1"
- "Like a great deal" → "2"
- Empty strings → preserved as empty string (not NA)
- NA → NA
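
These rules map naturally onto a named lookup vector. The sketch below is illustrative only; the name `like_map` and the sample responses are assumptions.

```r
# Lookup table: response label -> recoded value (kept as character).
like_map <- c(
  "Dislike a great deal"     = "-2",
  "Dislike somewhat"         = "-1",
  "Neither like nor dislike" = "0",
  "Like somewhat"            = "1",
  "Like a great deal"        = "2"
)

other_like2 <- c("Like somewhat", "Dislike a great deal", "", NA)

# Indexing by name returns NA for labels that are not in the table.
other_like <- unname(like_map[other_like2])

# Restore empty strings so they are not converted to NA.
other_like[!is.na(other_like2) & other_like2 == ""] <- ""
print(other_like)
```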
#### 3. Calculate aot_total (Actively Open-Minded Thinking)
**Source Variables**: `aot01`, `aot02`, `aot03`, `aot04_r`, `aot05_r`, `aot06_r`, `aot07_r`, `aot08`
**Target Variable**: `aot_total`
**Calculation:**
1. Reverse code `aot04_r`, `aot05_r`, `aot06_r`, `aot07_r` by multiplying by -1
2. Calculate mean of all 8 variables: 4 original (`aot01`, `aot02`, `aot03`, `aot08`) + 4 reversed (`aot04_r`, `aot05_r`, `aot06_r`, `aot07_r`)
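
A sketch of the two steps on invented responses. It assumes the items are scored on a scale centred on 0, which is the case where multiplying by -1 reverses an item; the scale range is not stated here, and the handling of missing items (plain `rowMeans()` without `na.rm`) is also an assumption.

```r
toy <- data.frame(
  aot01 = c(2, 1), aot02 = c(1, 0), aot03 = c(2, -1),
  aot04_r = c(-2, 1), aot05_r = c(-1, 0), aot06_r = c(-2, 2), aot07_r = c(-1, -1),
  aot08 = c(1, 2)
)

forward_items  <- c("aot01", "aot02", "aot03", "aot08")
reversed_items <- c("aot04_r", "aot05_r", "aot06_r", "aot07_r")

# Step 1: reverse code by multiplying by -1.
reversed <- toy[, reversed_items] * -1

# Step 2: mean of the 4 forward items and the 4 reversed items.
toy$aot_total <- rowMeans(cbind(toy[, forward_items], reversed))
print(toy$aot_total)
```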
#### 4. Process CRT Questions → crt_correct and crt_int
**Source Variables**: `crt01`, `crt02`, `crt03`
**Target Variables**: `crt_correct`, `crt_int`
**CRT01:**
- "5 cents" → `crt_correct` = 1, `crt_int` = 0
- "10 cents" → `crt_correct` = 0, `crt_int` = 1
- Other values → `crt_correct` = 0, `crt_int` = 0
**CRT02:**
- "5 minutes" → `crt_correct` += 1, `crt_int` unchanged
- "100 minutes" → `crt_correct` unchanged, `crt_int` += 1
- Other values → both unchanged
**CRT03:**
- "47 days" → `crt_correct` += 1, `crt_int` unchanged
- "24 days" → `crt_correct` unchanged, `crt_int` += 1
- Other values → both unchanged
**Note**: `crt_correct` and `crt_int` are cumulative across all 3 questions (range: 0-3)
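
The cumulative scoring can be sketched with two answer keys, one for correct answers and one for intuitive (lure) answers. The helper `count_matches` and the toy responses are hypothetical, and exact string matching is an assumption.

```r
toy <- data.frame(
  crt01 = c("5 cents", "10 cents", "1 dollar"),
  crt02 = c("5 minutes", "100 minutes", "5 minutes"),
  crt03 = c("47 days", "24 days", ""),
  stringsAsFactors = FALSE
)

correct_key   <- c(crt01 = "5 cents",  crt02 = "5 minutes",   crt03 = "47 days")
intuitive_key <- c(crt01 = "10 cents", crt02 = "100 minutes", crt03 = "24 days")

# Hypothetical helper: count, per row, how many items match the key.
count_matches <- function(df, key) {
  total <- integer(nrow(df))
  for (item in names(key)) {
    hit <- !is.na(df[[item]]) & df[[item]] == key[[item]]
    total <- total + as.integer(hit)
  }
  total
}

toy$crt_correct <- count_matches(toy, correct_key)
toy$crt_int     <- count_matches(toy, intuitive_key)
print(toy)
```

A response that matches neither key, such as "1 dollar" or an empty string, adds to neither total.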
#### 5. Calculate icar_verbal
**Source Variables**: `verbal01`, `verbal02`, `verbal03`, `verbal04`, `verbal05`
**Target Variable**: `icar_verbal`
**Correct Answers:**
- `verbal01` = "5"
- `verbal02` = "8"
- `verbal03` = "It's impossible to tell"
- `verbal04` = "47"
- `verbal05` = "Sunday"
**Calculation**: Proportion correct = (number of correct responses) / 5
#### 6. Calculate icar_matrix
**Source Variables**: `matrix01`, `matrix02`, `matrix03`, `matrix04`, `matrix05`
**Target Variable**: `icar_matrix`
**Correct Answers:**
- `matrix01` = "D"
- `matrix02` = "E"
- `matrix03` = "B"
- `matrix04` = "B"
- `matrix05` = "D"
**Calculation**: Proportion correct = (number of correct responses) / 5
#### 7. Calculate icar_total
**Source Variables**: `verbal01`-`verbal05`, `matrix01`-`matrix05` (10 variables total)
**Target Variable**: `icar_total`
**Calculation**: Proportion correct across all 10 items = (number of correct responses) / 10
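
The three ICAR scores share one pattern: compare each response to its key and divide the number correct by the number of items. The sketch below uses the answer keys listed above; the helper `prop_correct`, the toy responses, and the choice to count empty or NA responses as incorrect are assumptions.

```r
verbal_key <- c(verbal01 = "5", verbal02 = "8",
                verbal03 = "It's impossible to tell",
                verbal04 = "47", verbal05 = "Sunday")
matrix_key <- c(matrix01 = "D", matrix02 = "E", matrix03 = "B",
                matrix04 = "B", matrix05 = "D")

# Hypothetical helper: proportion of items in `key` answered correctly.
prop_correct <- function(df, key) {
  n_correct <- numeric(nrow(df))
  for (item in names(key)) {
    n_correct <- n_correct + (!is.na(df[[item]]) & df[[item]] == key[[item]])
  }
  n_correct / length(key)
}

toy <- data.frame(
  verbal01 = c("5", "4"), verbal02 = c("8", "8"),
  verbal03 = c("It's impossible to tell", ""),
  verbal04 = c("47", "46"), verbal05 = c("Monday", "Sunday"),
  matrix01 = c("D", "A"), matrix02 = c("E", "E"), matrix03 = c("B", "C"),
  matrix04 = c("A", "B"), matrix05 = c("D", NA),
  stringsAsFactors = FALSE
)

toy$icar_verbal <- prop_correct(toy, verbal_key)
toy$icar_matrix <- prop_correct(toy, matrix_key)
toy$icar_total  <- prop_correct(toy, c(verbal_key, matrix_key))
print(toy[, c("icar_verbal", "icar_matrix", "icar_total")])
```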
#### 8. Recode demo_sex → sex
**Source Variable**: `demo_sex`
**Target Variable**: `sex`
**Recoding Rules:**
- "Male" (case-insensitive) → 0
- "Female" (case-insensitive) → 1
- Other values (e.g., "Prefer not to say") → 2
- Empty/NA → NA
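
A minimal sketch of these rules, using a hypothetical function `recode_sex` and made-up responses:

```r
# Hypothetical helper implementing the recoding rules listed above.
recode_sex <- function(x) {
  clean <- tolower(x)                   # case-insensitive comparison
  out <- rep(2, length(x))              # default: any other response
  out[clean %in% "male"]   <- 0
  out[clean %in% "female"] <- 1
  out[is.na(x) | clean %in% ""] <- NA   # empty or missing stays missing
  out
}

demo_sex <- c("Male", "female", "Prefer not to say", "", NA)
sex <- recode_sex(demo_sex)
print(sex)
```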
#### 9. Recode demo_edu → education
**Source Variable**: `demo_edu`
**Target Variable**: `education` (ordered factor)
**Recoding Rules:**
- "High School (or equivalent)" or "Trade School" → "HS_TS"
- "College Diploma/Certificate" or "University - Undergraduate" → "C_Ug"
- "University - Graduate (Masters)" or "University - PhD" or "Professional Degree (ex. JD/MD)" → "grad_prof"
- Empty/NA → NA
**Factor Levels**: `HS_TS` < `C_Ug` < `grad_prof` (ordered)
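
One way to sketch this recode is a lookup vector followed by `factor()` with `ordered = TRUE`. The name `edu_map` and the sample responses are illustrative assumptions.

```r
# Lookup table: survey response -> collapsed education category.
edu_map <- c(
  "High School (or equivalent)"     = "HS_TS",
  "Trade School"                    = "HS_TS",
  "College Diploma/Certificate"     = "C_Ug",
  "University - Undergraduate"      = "C_Ug",
  "University - Graduate (Masters)" = "grad_prof",
  "University - PhD"                = "grad_prof",
  "Professional Degree (ex. JD/MD)" = "grad_prof"
)

demo_edu <- c("Trade School", "University - PhD", "College Diploma/Certificate", "", NA)

# Empty strings and NA have no entry in the table, so they become NA.
education <- factor(
  unname(edu_map[demo_edu]),
  levels  = c("HS_TS", "C_Ug", "grad_prof"),
  ordered = TRUE
)
print(education)
```

Because the factor is ordered, comparisons such as `education[1] < education[2]` follow the level order.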
### Validation Checks
Each transformation includes:
1. **Variable Existence Check**: Verifies source and target variables exist
2. **Value Check**: Verifies expected values exist in source variables (warns about unexpected values)
3. **Post-Processing Verification**: Checks 5 random rows showing source values, target values, and calculations
### Output
- Updates existing target columns in `eohi3.csv`
- Creates backup `eohi3_2.csv` before processing
---
## Script Execution Order
These scripts should be run in the following order:
1. **datap 04 - combined vars.r** - Combines self/other variables into past/future variables
2. **datap 05 - ehi vars.r** - Calculates EHI variables from past/future differences
3. **datap 06 - mean vars.r** - Calculates mean variables for scales
4. **datap 07 - scales and recodes.r** - Recodes variables and calculates scale scores
Each script creates a backup (`eohi3_2.csv`) before processing and includes validation checks to ensure transformations are performed correctly.

343
eohi3/datap 04 - combined vars.r Normal file

@ -0,0 +1,343 @@
library(dplyr)
setwd("/home/ladmin/Documents/DND/EOHI/eohi3")
# Read the data (with check.names=FALSE to preserve original column names)
# Keep empty cells as empty strings, not NA
# Only convert the literal string "NA" to NA, not empty strings
df <- read.csv("eohi3.csv", stringsAsFactors = FALSE, check.names = FALSE, na.strings = "NA")
# =============================================================================
# 1. CREATE BACKUP
# =============================================================================
#file.copy("eohi3.csv", "eohi3_2.csv", overwrite = TRUE)
# =============================================================================
# 2. DEFINE VARIABLE MAPPINGS
# =============================================================================
# Past variables mapping: [self/other][VAL/PERS/PREF]_p5_[string] -> past_[val/pers/pref]_[string]
past_mappings <- list(
  # Values (VAL)
  "past_val_trad" = c("selfVAL_p5_trad", "otherVAL_p5_trad"),
  "past_val_autonomy" = c("selfVAL_p5_autonomy", "otherVAL_p5_autonomy"),
  "past_val_personal" = c("selfVAL_p5_personal", "otherVAL_p5_personal"),
  "past_val_justice" = c("selfVAL_p5_justice", "otherVAL_p5_justice"),
  "past_val_close" = c("selfVAL_p5_close", "otherVAL_p5_close"),
  "past_val_connect" = c("selfVAL_p5_connect", "otherVAL_p5_connect"),
  "past_val_DGEN" = c("selfVAL_p5_dgen", "otherVAL_p5_dgen"),
  # Personality (PERS)
  "past_pers_open" = c("selfPERS_p5_open", "otherPERS_p5_open"),
  "past_pers_goal" = c("selfPERS_p5_goal", "otherPERS_p5_goal"),
  "past_pers_social" = c("selfPERS_p5_social", "otherPERS_p5_social"),
  "past_pers_agree" = c("selfPERS_p5_agree", "otherPERS_p5_agree"),
  "past_pers_stress" = c("selfPERS_p5_stress", "otherPERS_p5_stress"),
  "past_pers_DGEN" = c("selfPERS_p5_dgen", "otherPERS_p5_dgen"),
  # Preferences (PREF)
  "past_pref_hobbies" = c("selfPREF_p5_hobbies", "otherPREF_p5_hobbies"),
  "past_pref_music" = c("selfPREF_p5_music", "otherPREF_p5_music"),
  "past_pref_dress" = c("selfPREF_p5_dress", "otherPREF_p5_dress"),
  "past_pref_exer" = c("selfPREF_p5_exer", "otherPREF_p5_exer"),
  "past_pref_food" = c("selfPREF_p5_food", "otherPREF_p5_food"),
  "past_pref_friends" = c("selfPREF_p5_friends", "otherPREF_p5_friends"),
  "past_pref_DGEN" = c("selfPREF_p5_dgen", "otherPREF_p5_dgen")
)
# Future variables mapping: [self/other][VAL/PERS/PREF]_f5_[string] -> fut_[val/pers/pref]_[string]
future_mappings <- list(
  # Values (VAL)
  "fut_val_trad" = c("selfVAL_f5_trad", "otherVAL_f5_trad"),
  "fut_val_autonomy" = c("selfVAL_f5_autonomy", "otherVAL_f5_autonomy"),
  "fut_val_personal" = c("selfVAL_f5_personal", "otherVAL_f5_personal"),
  "fut_val_justice" = c("selfVAL_f5_justice", "otherVAL_f5_justice"),
  "fut_val_close" = c("selfVAL_f5_close", "otherVAL_f5_close"),
  "fut_val_connect" = c("selfVAL_f5_connect", "otherVAL_f5_connect"),
  "fut_val_DGEN" = c("selfVAL_f5_dgen", "otherVAL_f5_dgen"),
  # Personality (PERS)
  "fut_pers_open" = c("selfPERS_f5_open", "otherPERS_f5_open"),
  "fut_pers_goal" = c("selfPERS_f5_goal", "otherPERS_f5_goal"),
  "fut_pers_social" = c("selfPERS_f5_social", "otherPERS_f5_social"),
  "fut_pers_agree" = c("selfPERS_f5_agree", "otherPERS_f5_agree"),
  "fut_pers_stress" = c("selfPERS_f5_stress", "otherPERS_f5_stress"),
  "fut_pers_DGEN" = c("selfPERS_f5_dgen", "otherPERS_f5_dgen"),
  # Preferences (PREF)
  "fut_pref_hobbies" = c("selfPREF_f5_hobbies", "otherPREF_f5_hobbies"),
  "fut_pref_music" = c("selfPREF_f5_music", "otherPREF_f5_music"),
  "fut_pref_dress" = c("selfPREF_f5_dress", "otherPREF_f5_dress"),
  "fut_pref_exer" = c("selfPREF_f5_exer", "otherPREF_f5_exer"),
  "fut_pref_food" = c("selfPREF_f5_food", "otherPREF_f5_food"),
  "fut_pref_friends" = c("selfPREF_f5_friends", "otherPREF_f5_friends"),
  "fut_pref_DGEN" = c("selfPREF_f5_dgen", "otherPREF_f5_dgen")
)
# =============================================================================
# 3. COMBINE VARIABLES
# =============================================================================
# Function to combine self and other variables
# For each row, values exist in either self OR other, never both
# NOTE: Column existence should be checked before calling this function
combine_vars <- function(df, self_col, other_col) {
  # Safety check: stop with an error if either source column is missing
  if (!self_col %in% names(df)) {
    stop(paste("ERROR: Column", self_col, "not found. This should have been caught earlier."))
  }
  if (!other_col %in% names(df)) {
    stop(paste("ERROR: Column", other_col, "not found. This should have been caught earlier."))
  }
  # Combine: use self value if not empty/NA, otherwise use other value
  # Handle both NA and empty strings
  result <- ifelse(
    !is.na(df[[self_col]]) & df[[self_col]] != "",
    df[[self_col]],
    ifelse(
      !is.na(df[[other_col]]) & df[[other_col]] != "",
      df[[other_col]],
      NA
    )
  )
  return(result)
}
# Apply past mappings
cat("\nCombining past variables...\n")
missing_cols <- list()
for (new_col in names(past_mappings)) {
  self_col <- past_mappings[[new_col]][1]
  other_col <- past_mappings[[new_col]][2]
  # Check if all required columns exist
  missing <- c()
  if (!new_col %in% names(df)) {
    missing <- c(missing, paste("target:", new_col))
  }
  if (!self_col %in% names(df)) {
    missing <- c(missing, paste("self:", self_col))
  }
  if (!other_col %in% names(df)) {
    missing <- c(missing, paste("other:", other_col))
  }
  if (length(missing) > 0) {
    missing_cols[[new_col]] <- missing
    warning(paste("Skipping", new_col, "- missing columns:", paste(missing, collapse = ", ")))
    next
  }
  # All columns exist, proceed with combination
  df[[new_col]] <- combine_vars(df, self_col, other_col)
  cat(paste(" Updated:", new_col, "\n"))
}
# Report any missing columns
if (length(missing_cols) > 0) {
  cat("\n⚠ Missing columns detected in PAST variables:\n")
  for (var in names(missing_cols)) {
    cat(paste(" ", var, ":", paste(missing_cols[[var]], collapse = ", "), "\n"))
  }
}
# Apply future mappings
cat("\nCombining future variables...\n")
missing_cols_future <- list()
for (new_col in names(future_mappings)) {
  self_col <- future_mappings[[new_col]][1]
  other_col <- future_mappings[[new_col]][2]
  # Check if all required columns exist
  missing <- c()
  if (!new_col %in% names(df)) {
    missing <- c(missing, paste("target:", new_col))
  }
  if (!self_col %in% names(df)) {
    missing <- c(missing, paste("self:", self_col))
  }
  if (!other_col %in% names(df)) {
    missing <- c(missing, paste("other:", other_col))
  }
  if (length(missing) > 0) {
    missing_cols_future[[new_col]] <- missing
    warning(paste("Skipping", new_col, "- missing columns:", paste(missing, collapse = ", ")))
    next
  }
  # All columns exist, proceed with combination
  df[[new_col]] <- combine_vars(df, self_col, other_col)
  cat(paste(" Updated:", new_col, "\n"))
}
# Report any missing columns
if (length(missing_cols_future) > 0) {
  cat("\n⚠ Missing columns detected in FUTURE variables:\n")
  for (var in names(missing_cols_future)) {
    cat(paste(" ", var, ":", paste(missing_cols_future[[var]], collapse = ", "), "\n"))
  }
}
# =============================================================================
# 4. VALIDATION CHECKS
# =============================================================================
cat("\n=== VALIDATION CHECKS ===\n\n")
# Check 1: Ensure no row has values in both self and other for the same variable
check_conflicts <- function(df, mappings) {
  conflicts <- data.frame()
  for (new_col in names(mappings)) {
    self_col <- mappings[[new_col]][1]
    other_col <- mappings[[new_col]][2]
    if (self_col %in% names(df) && other_col %in% names(df)) {
      # Find rows where both self and other have non-empty values
      both_filled <- !is.na(df[[self_col]]) & df[[self_col]] != "" &
        !is.na(df[[other_col]]) & df[[other_col]] != ""
      if (any(both_filled, na.rm = TRUE)) {
        conflict_rows <- which(both_filled)
        conflicts <- rbind(conflicts, data.frame(
          variable = new_col,
          self_col = self_col,
          other_col = other_col,
          n_conflicts = length(conflict_rows),
          example_rows = paste(head(conflict_rows, 5), collapse = ", ")
        ))
      }
    }
  }
  return(conflicts)
}
past_conflicts <- check_conflicts(df, past_mappings)
future_conflicts <- check_conflicts(df, future_mappings)
if (nrow(past_conflicts) > 0) {
  cat("WARNING: Found conflicts in PAST variables (both self and other have values):\n")
  print(past_conflicts)
} else {
  cat("✓ No conflicts found in PAST variables\n")
}
if (nrow(future_conflicts) > 0) {
  cat("\nWARNING: Found conflicts in FUTURE variables (both self and other have values):\n")
  print(future_conflicts)
} else {
  cat("✓ No conflicts found in FUTURE variables\n")
}
# Check 2: Verify that combined columns have values where expected
check_coverage <- function(df, mappings) {
  coverage <- data.frame()
  for (new_col in names(mappings)) {
    self_col <- mappings[[new_col]][1]
    other_col <- mappings[[new_col]][2]
    # Check if columns exist before counting
    self_exists <- self_col %in% names(df)
    other_exists <- other_col %in% names(df)
    target_exists <- new_col %in% names(df)
    # Count non-empty values in original columns (only if they exist)
    self_count <- if (self_exists) {
      sum(!is.na(df[[self_col]]) & df[[self_col]] != "", na.rm = TRUE)
    } else {
      NA
    }
    other_count <- if (other_exists) {
      sum(!is.na(df[[other_col]]) & df[[other_col]] != "", na.rm = TRUE)
    } else {
      NA
    }
    combined_count <- if (target_exists) {
      sum(!is.na(df[[new_col]]) & df[[new_col]] != "", na.rm = TRUE)
    } else {
      NA
    }
    # Combined should equal sum of self and other (since they don't overlap)
    expected_count <- if (!is.na(self_count) && !is.na(other_count)) {
      self_count + other_count
    } else {
      NA
    }
    match <- if (!is.na(combined_count) && !is.na(expected_count)) {
      combined_count == expected_count
    } else {
      NA
    }
    coverage <- rbind(coverage, data.frame(
      variable = new_col,
      self_non_empty = self_count,
      other_non_empty = other_count,
      combined_non_empty = combined_count,
      expected_non_empty = expected_count,
      match = match
    ))
  }
  return(coverage)
}
past_coverage <- check_coverage(df, past_mappings)
future_coverage <- check_coverage(df, future_mappings)
cat("\n=== COVERAGE CHECK ===\n")
cat("\nPAST variables:\n")
print(past_coverage)
cat("\nFUTURE variables:\n")
print(future_coverage)
# Check if all coverage matches
all_past_match <- all(past_coverage$match, na.rm = TRUE)
all_future_match <- all(future_coverage$match, na.rm = TRUE)
if (all_past_match && all_future_match) {
  cat("\n✓ All combined variables have correct coverage\n")
} else {
  cat("\n⚠ Some variables may have missing coverage - check the table above\n")
}
# Check 3: Sample check - verify a few rows manually
cat("\n=== SAMPLE ROW CHECK ===\n")
sample_rows <- min(5, nrow(df))
cat(paste("Checking first", sample_rows, "rows:\n\n"))
# seq_len() gives an empty sequence when there are no rows (1:0 would iterate over 1 and 0)
for (i in seq_len(sample_rows)) {
  cat(paste("Row", i, ":\n"))
  # Check one past variable
  test_var <- "past_val_trad"
  self_val <- if (past_mappings[[test_var]][1] %in% names(df)) df[i, past_mappings[[test_var]][1]] else NA
  other_val <- if (past_mappings[[test_var]][2] %in% names(df)) df[i, past_mappings[[test_var]][2]] else NA
  combined_val <- df[i, test_var]
  cat(sprintf(" %s: self=%s, other=%s, combined=%s\n",
              test_var,
              ifelse(is.na(self_val) || self_val == "", "empty", self_val),
              ifelse(is.na(other_val) || other_val == "", "empty", other_val),
              ifelse(is.na(combined_val) || combined_val == "", "empty", combined_val)))
}
# =============================================================================
# 5. SAVE UPDATED DATA
# =============================================================================
write.csv(df, "eohi3.csv", row.names = FALSE, na = "")
cat("Updated data saved to: eohi3.csv\n")
cat(paste("Total rows:", nrow(df), "\n"))
cat(paste("Total columns:", ncol(df), "\n"))

187
eohi3/datap 05 - ehi vars.r Normal file

@ -0,0 +1,187 @@
library(dplyr)
setwd("/home/ladmin/Documents/DND/EOHI/eohi3")
# Read the data (with check.names=FALSE to preserve original column names)
# Keep empty cells as empty strings, not NA
# Only convert the literal string "NA" to NA, not empty strings
df <- read.csv("eohi3.csv", stringsAsFactors = FALSE, check.names = FALSE, na.strings = "NA")
# =============================================================================
# 1. CREATE BACKUP
# =============================================================================
file.copy("eohi3.csv", "eohi3_2.csv", overwrite = TRUE)
# =============================================================================
# 2. DEFINE VARIABLE MAPPINGS
# =============================================================================
# Target variables (excluding those ending in _MEAN)
# Each target var = past_var - fut_var
ehi_mappings <- list(
  # Preferences (PREF)
  "ehi_pref_hobbies" = c("past_pref_hobbies", "fut_pref_hobbies"),
  "ehi_pref_music" = c("past_pref_music", "fut_pref_music"),
  "ehi_pref_dress" = c("past_pref_dress", "fut_pref_dress"),
  "ehi_pref_exer" = c("past_pref_exer", "fut_pref_exer"),
  "ehi_pref_food" = c("past_pref_food", "fut_pref_food"),
  "ehi_pref_friends" = c("past_pref_friends", "fut_pref_friends"),
  "ehi_pref_DGEN" = c("past_pref_DGEN", "fut_pref_DGEN"),
  # Personality (PERS)
  "ehi_pers_open" = c("past_pers_open", "fut_pers_open"),
  "ehi_pers_goal" = c("past_pers_goal", "fut_pers_goal"),
  "ehi_pers_social" = c("past_pers_social", "fut_pers_social"),
  "ehi_pers_agree" = c("past_pers_agree", "fut_pers_agree"),
  "ehi_pers_stress" = c("past_pers_stress", "fut_pers_stress"),
  "ehi_pers_DGEN" = c("past_pers_DGEN", "fut_pers_DGEN"),
  # Values (VAL)
  "ehi_val_trad" = c("past_val_trad", "fut_val_trad"),
  "ehi_val_autonomy" = c("past_val_autonomy", "fut_val_autonomy"),
  "ehi_val_personal" = c("past_val_personal", "fut_val_personal"),
  "ehi_val_justice" = c("past_val_justice", "fut_val_justice"),
  "ehi_val_close" = c("past_val_close", "fut_val_close"),
  "ehi_val_connect" = c("past_val_connect", "fut_val_connect"),
  "ehi_val_DGEN" = c("past_val_DGEN", "fut_val_DGEN")
)
# =============================================================================
# 3. CHECK IF TARGET VARIABLES EXIST
# =============================================================================
missing_targets <- c()
for (target_var in names(ehi_mappings)) {
  if (!target_var %in% names(df)) {
    missing_targets <- c(missing_targets, target_var)
    cat(paste("⚠ Target variable not found:", target_var, "\n"))
  }
}
if (length(missing_targets) > 0) {
  cat("\nERROR: The following target variables are missing from eohi3.csv:\n")
  for (var in missing_targets) {
    cat(paste(" -", var, "\n"))
  }
  stop("Cannot proceed without target variables. Please add them to the CSV file.")
}
# =============================================================================
# 4. CALCULATE EHI VARIABLES (past - future)
# =============================================================================
missing_source_cols <- list()
for (target_var in names(ehi_mappings)) {
  past_var <- ehi_mappings[[target_var]][1]
  fut_var <- ehi_mappings[[target_var]][2]
  # Check if source columns exist
  missing <- c()
  if (!past_var %in% names(df)) {
    missing <- c(missing, past_var)
  }
  if (!fut_var %in% names(df)) {
    missing <- c(missing, fut_var)
  }
  if (length(missing) > 0) {
    missing_source_cols[[target_var]] <- missing
    warning(paste("Skipping", target_var, "- missing source columns:", paste(missing, collapse = ", ")))
    next
  }
  # Convert to numeric, handling empty strings and NA
  past_vals <- as.numeric(ifelse(df[[past_var]] == "" | is.na(df[[past_var]]), NA, df[[past_var]]))
  fut_vals <- as.numeric(ifelse(df[[fut_var]] == "" | is.na(df[[fut_var]]), NA, df[[fut_var]]))
  # Calculate difference: past - future
  ehi_vals <- past_vals - fut_vals
  # Update target column
  df[[target_var]] <- ehi_vals
  cat(paste(" Calculated:", target_var, "=", past_var, "-", fut_var, "\n"))
}
# Report any missing source columns
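if (length(missing_source_cols) > 0) {
  # Header line so the listing below is not printed without context
  cat("\n⚠ Missing source columns detected:\n")
  for (var in names(missing_source_cols)) {
    cat(paste(" ", var, ":", paste(missing_source_cols[[var]], collapse = ", "), "\n"))
  }
}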
# =============================================================================
# 5. VALIDATION: CHECK 5 RANDOM ROWS
# =============================================================================
cat("\n=== VALIDATION: CHECKING 5 RANDOM ROWS ===\n\n")
# Set seed for reproducibility
set.seed(123)
# seq_len() is safe when the data frame has no rows
sample_rows <- sample(seq_len(nrow(df)), min(5, nrow(df)))
sample_rows <- sort(sample_rows)
for (i in sample_rows) {
  cat(paste("Row", i, ":\n"))
  # Check a few representative variables from each category
  test_vars <- c(
    "ehi_pref_hobbies",
    "ehi_pers_open",
    "ehi_val_trad"
  )
  for (target_var in test_vars) {
    if (target_var %in% names(ehi_mappings)) {
      past_var <- ehi_mappings[[target_var]][1]
      fut_var <- ehi_mappings[[target_var]][2]
      if (past_var %in% names(df) && fut_var %in% names(df)) {
        past_val <- df[i, past_var]
        fut_val <- df[i, fut_var]
        ehi_val <- df[i, target_var]
        # Convert to numeric for calculation check
        past_num <- as.numeric(ifelse(past_val == "" | is.na(past_val), NA, past_val))
        fut_num <- as.numeric(ifelse(fut_val == "" | is.na(fut_val), NA, fut_val))
        ehi_num <- as.numeric(ifelse(is.na(ehi_val), NA, ehi_val))
        # Calculate expected value
        expected <- if (!is.na(past_num) && !is.na(fut_num)) {
          past_num - fut_num
        } else {
          NA
        }
        # Check if calculation is correct
        match <- if (!is.na(expected) && !is.na(ehi_num)) {
          abs(expected - ehi_num) < 0.0001 # Allow for floating point precision
        } else {
          is.na(expected) && is.na(ehi_num)
        }
        cat(sprintf(" %s:\n", target_var))
        cat(sprintf(" %s = %s\n", past_var, ifelse(is.na(past_val) || past_val == "", "NA/empty", past_val)))
        cat(sprintf(" %s = %s\n", fut_var, ifelse(is.na(fut_val) || fut_val == "", "NA/empty", fut_val)))
        cat(sprintf(" %s = %s\n", target_var, ifelse(is.na(ehi_val), "NA", ehi_val)))
        cat(sprintf(" Expected: %s - %s = %s\n",
                    ifelse(is.na(past_num), "NA", past_num),
                    ifelse(is.na(fut_num), "NA", fut_num),
                    ifelse(is.na(expected), "NA", expected)))
        cat(sprintf(" Match: %s\n\n", ifelse(match, "✓", "✗ ERROR")))
      }
    }
  }
}
# =============================================================================
# 6. SAVE UPDATED DATA
# =============================================================================
# NOTE: the write below is active and overwrites eohi3.csv in place; only the status messages are commented out
# cat("\n=== SAVING DATA ===\n")
write.csv(df, "eohi3.csv", row.names = FALSE, na = "")
# cat("Updated data saved to: eohi3.csv\n")
# cat(paste("Total rows:", nrow(df), "\n"))
# cat(paste("Total columns:", ncol(df), "\n"))

@ -0,0 +1,225 @@
library(dplyr)
setwd("/home/ladmin/Documents/DND/EOHI/eohi3")
# Read the data (with check.names=FALSE to preserve original column names)
# Keep empty cells as empty strings, not NA
# Only convert the literal string "NA" to NA, not empty strings
df <- read.csv("eohi3.csv", stringsAsFactors = FALSE, check.names = FALSE, na.strings = "NA")
# =============================================================================
# 1. CREATE BACKUP
# =============================================================================
file.copy("eohi3.csv", "eohi3_2.csv", overwrite = TRUE)
# =============================================================================
# 2. DEFINE MEAN VARIABLE MAPPINGS
# =============================================================================
mean_mappings <- list(
# Past Preferences MEAN
"past_pref_MEAN" = c("past_pref_hobbies", "past_pref_music", "past_pref_dress",
"past_pref_exer", "past_pref_food", "past_pref_friends"),
# Future Preferences MEAN
"fut_pref_MEAN" = c("fut_pref_hobbies", "fut_pref_music", "fut_pref_dress",
"fut_pref_exer", "fut_pref_food", "fut_pref_friends"),
# Past Personality MEAN
"past_pers_MEAN" = c("past_pers_open", "past_pers_goal", "past_pers_social",
"past_pers_agree", "past_pers_stress"),
# Future Personality MEAN
"fut_pers_MEAN" = c("fut_pers_open", "fut_pers_goal", "fut_pers_social",
"fut_pers_agree", "fut_pers_stress"),
# Past Values MEAN
"past_val_MEAN" = c("past_val_trad", "past_val_autonomy", "past_val_personal",
"past_val_justice", "past_val_close", "past_val_connect"),
# Future Values MEAN
"fut_val_MEAN" = c("fut_val_trad", "fut_val_autonomy", "fut_val_personal",
"fut_val_justice", "fut_val_close", "fut_val_connect"),
# EHI Preferences MEAN
"ehi_pref_MEAN" = c("ehi_pref_hobbies", "ehi_pref_music", "ehi_pref_dress",
"ehi_pref_exer", "ehi_pref_food", "ehi_pref_friends"),
# EHI Personality MEAN
"ehi_pers_MEAN" = c("ehi_pers_open", "ehi_pers_goal", "ehi_pers_social",
"ehi_pers_agree", "ehi_pers_stress"),
# EHI Values MEAN
"ehi_val_MEAN" = c("ehi_val_trad", "ehi_val_autonomy", "ehi_val_personal",
"ehi_val_justice", "ehi_val_close", "ehi_val_connect")
)
# Additional means
additional_means <- list(
"ehiDS_mean" = c("ehi_pref_MEAN", "ehi_pers_MEAN", "ehi_val_MEAN"),
"ehiDGEN_mean" = c("ehi_pref_DGEN", "ehi_pers_DGEN", "ehi_val_DGEN")
)
# =============================================================================
# 3. CHECK IF VARIABLES EXIST
# =============================================================================
# Check source variables for mean_mappings
missing_source_vars <- list()
for (target_var in names(mean_mappings)) {
source_vars <- mean_mappings[[target_var]]
missing <- setdiff(source_vars, names(df))
if (length(missing) > 0) {
missing_source_vars[[target_var]] <- missing
cat(paste("⚠ Missing source variables for", target_var, ":", paste(missing, collapse = ", "), "\n"))
}
}
# Check source variables for additional_means
missing_additional_vars <- list()
for (target_var in names(additional_means)) {
source_vars <- additional_means[[target_var]]
missing <- setdiff(source_vars, names(df))
if (length(missing) > 0) {
missing_additional_vars[[target_var]] <- missing
cat(paste("⚠ Missing source variables for", target_var, ":", paste(missing, collapse = ", "), "\n"))
}
}
# Check if target variables exist
expected_targets <- c(names(mean_mappings), names(additional_means))
actual_targets <- names(df)
missing_targets <- setdiff(expected_targets, actual_targets)
if (length(missing_targets) > 0) {
cat("\nERROR: The following target variables are missing from eohi3.csv:\n")
for (var in missing_targets) {
cat(paste(" -", var, "\n"))
}
stop("Cannot proceed without target variables. Please add them to the CSV file.")
}
# =============================================================================
# 4. CALCULATE MEAN VARIABLES
# =============================================================================
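# Function to calculate row means, handling NA and empty strings
calculate_mean <- function(df, source_vars) {
# Convert each column to numeric separately, treating empty strings and "NA" as NA
# (column-wise conversion avoids apply() coercing mixed columns to a padded
# character matrix and collapsing single-row input to a vector)
numeric_cols <- lapply(df[, source_vars, drop = FALSE], function(x) {
x <- as.character(x)
as.numeric(ifelse(is.na(x) | x == "" | x == "NA", NA, x))
})
numeric_matrix <- do.call(cbind, numeric_cols)
# Calculate row means, ignoring NA values
row_means <- rowMeans(numeric_matrix, na.rm = TRUE)
# Rows with no valid values come back as NaN; store them as NA instead
row_means[is.nan(row_means)] <- NA
row_means
}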
# Calculate means for main mappings
for (target_var in names(mean_mappings)) {
source_vars <- mean_mappings[[target_var]]
# Check if all source variables exist
missing <- setdiff(source_vars, names(df))
if (length(missing) > 0) {
warning(paste("Skipping", target_var, "- missing source variables:", paste(missing, collapse = ", ")))
next
}
# Calculate mean
df[[target_var]] <- calculate_mean(df, source_vars)
cat(paste(" Calculated:", target_var, "from", length(source_vars), "variables\n"))
}
# Calculate additional means
for (target_var in names(additional_means)) {
source_vars <- additional_means[[target_var]]
# Check if all source variables exist
missing <- setdiff(source_vars, names(df))
if (length(missing) > 0) {
warning(paste("Skipping", target_var, "- missing source variables:", paste(missing, collapse = ", ")))
next
}
# Calculate mean
df[[target_var]] <- calculate_mean(df, source_vars)
cat(paste(" Calculated:", target_var, "from", length(source_vars), "variables\n"))
}
# =============================================================================
# 5. VALIDATION: CHECK 5 RANDOM ROWS
# =============================================================================
# Set seed for reproducibility
set.seed(123)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
sample_rows <- sort(sample_rows)
for (i in sample_rows) {
cat(paste("Row", i, ":\n"))
# Check a few representative mean variables
test_vars <- c(
"past_pref_MEAN",
"ehi_pref_MEAN",
"ehiDS_mean"
)
for (target_var in test_vars) {
# Determine which mapping to use
if (target_var %in% names(mean_mappings)) {
source_vars <- mean_mappings[[target_var]]
} else if (target_var %in% names(additional_means)) {
source_vars <- additional_means[[target_var]]
} else {
next
}
# Check if all source variables exist
if (!all(source_vars %in% names(df))) {
next
}
# Get values (unlist the one-row data frame so ifelse() works on a plain vector)
source_vals <- unlist(df[i, source_vars], use.names = FALSE)
target_val <- df[i, target_var]
# Convert to numeric for calculation
source_nums <- as.numeric(ifelse(source_vals == "" | is.na(source_vals) | source_vals == "NA", NA, source_vals))
target_num <- as.numeric(ifelse(is.na(target_val), NA, target_val))
# Calculate expected mean (ignoring NA)
expected <- mean(source_nums, na.rm = TRUE)
if (all(is.na(source_nums))) {
expected <- NA
}
# Check if calculation is correct
match <- if (!is.na(expected) && !is.na(target_num)) {
abs(expected - target_num) < 0.0001 # Allow for floating point precision
} else {
is.na(expected) && is.na(target_num)
}
cat(sprintf(" %s:\n", target_var))
cat(sprintf(" Source variables: %s\n", paste(source_vars, collapse = ", ")))
cat(sprintf(" Source values: %s\n", paste(ifelse(is.na(source_vals) | source_vals == "", "NA/empty", source_vals), collapse = ", ")))
cat(sprintf(" %s = %s\n", target_var, ifelse(is.na(target_val), "NA", round(target_val, 4))))
cat(sprintf(" Expected mean: %s\n", ifelse(is.na(expected), "NA", round(expected, 4))))
cat(sprintf(" Match: %s\n\n", ifelse(match, "✓", "✗ ERROR")))
}
}
# =============================================================================
# 6. SAVE UPDATED DATA
# =============================================================================
# NOTE: the write below is active and overwrites eohi3.csv in place (backup: eohi3_2.csv); only the status messages are commented out
write.csv(df, "eohi3.csv", row.names = FALSE, na = "")
# cat("Updated data saved to: eohi3.csv\n")
# cat(paste("Total rows:", nrow(df), "\n"))
# cat(paste("Total columns:", ncol(df), "\n"))

@ -0,0 +1,462 @@
library(dplyr)
setwd("/home/ladmin/Documents/DND/EOHI/eohi3")
# Read the data (with check.names=FALSE to preserve original column names)
# Keep empty cells as empty strings, not NA
# Only convert the literal string "NA" to NA, not empty strings
df <- read.csv("eohi3.csv", stringsAsFactors = FALSE, check.names = FALSE, na.strings = "NA")
# =============================================================================
# 1. CREATE BACKUP
# =============================================================================
file.copy("eohi3.csv", "eohi3_2.csv", overwrite = TRUE)
# =============================================================================
# HELPER FUNCTION: Check variable existence and values
# =============================================================================
check_vars_exist <- function(source_vars, target_vars) {
missing_source <- setdiff(source_vars, names(df))
missing_target <- setdiff(target_vars, names(df))
if (length(missing_source) > 0) {
stop(paste("Missing source variables:", paste(missing_source, collapse = ", ")))
}
if (length(missing_target) > 0) {
stop(paste("Missing target variables:", paste(missing_target, collapse = ", ")))
}
return(TRUE)
}
check_values_exist <- function(var_name, expected_values) {
unique_vals <- unique(df[[var_name]])
unique_vals <- unique_vals[!is.na(unique_vals) & unique_vals != ""]
missing_vals <- setdiff(expected_values, unique_vals)
extra_vals <- setdiff(unique_vals, expected_values)
if (length(missing_vals) > 0) {
cat(paste(" ⚠ Expected values not found in", var_name, ":", paste(missing_vals, collapse = ", "), "\n"))
}
if (length(extra_vals) > 0) {
cat(paste(" ⚠ Unexpected values found in", var_name, ":", paste(extra_vals, collapse = ", "), "\n"))
}
return(list(missing = missing_vals, extra = extra_vals))
}
# =============================================================================
# 2. RECODE other_length2 TO other_length
# =============================================================================
cat("\n=== 1. RECODING other_length2 TO other_length ===\n\n")
# Check variables exist
check_vars_exist("other_length2", "other_length")
# Check values in source
cat("Checking source variable values...\n")
length_vals <- unique(df$other_length2[!is.na(df$other_length2) & df$other_length2 != ""])
cat(paste(" Unique values in other_length2:", paste(length_vals, collapse = ", "), "\n"))
# Recode - handle "20+" as special case first, then convert to numeric for ranges
# Convert to numeric once, suppressing warnings for non-numeric values
num_length <- suppressWarnings(as.numeric(df$other_length2))
df$other_length <- ifelse(
is.na(df$other_length2),
NA,
ifelse(
df$other_length2 == "",
"",
ifelse(
df$other_length2 == "20+",
"20+",
ifelse(
!is.na(num_length) & num_length >= 5 & num_length <= 9,
"5-9",
ifelse(
!is.na(num_length) & num_length >= 10 & num_length <= 14,
"10-14",
ifelse(
!is.na(num_length) & num_length >= 15 & num_length <= 19,
"15-19",
NA
)
)
)
)
)
)
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(123)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
source_val <- df$other_length2[i]
target_val <- df$other_length[i]
cat(sprintf(" Row %d: other_length2 = %s -> other_length = %s\n",
i, ifelse(is.na(source_val), "NA", ifelse(source_val == "", "empty", source_val)),
ifelse(is.na(target_val), "NA", ifelse(target_val == "", "empty", target_val))))
}
# =============================================================================
# 3. RECODE other_like2 TO other_like
# =============================================================================
cat("\n=== 2. RECODING other_like2 TO other_like ===\n\n")
# Check variables exist
check_vars_exist("other_like2", "other_like")
# Check expected values exist
expected_like <- c("Dislike a great deal", "Dislike somewhat", "Neither like nor dislike",
"Like somewhat", "Like a great deal")
check_values_exist("other_like2", expected_like)
# Recode
df$other_like <- ifelse(
is.na(df$other_like2),
NA,
ifelse(
df$other_like2 == "",
"",
ifelse(
df$other_like2 == "Dislike a great deal",
"-2",
ifelse(
df$other_like2 == "Dislike somewhat",
"-1",
ifelse(
df$other_like2 == "Neither like nor dislike",
"0",
ifelse(
df$other_like2 == "Like somewhat",
"1",
ifelse(
df$other_like2 == "Like a great deal",
"2",
NA
)
)
)
)
)
)
)
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(456)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
source_val <- df$other_like2[i]
target_val <- df$other_like[i]
cat(sprintf(" Row %d: other_like2 = %s -> other_like = %s\n",
i, ifelse(is.na(source_val), "NA", ifelse(source_val == "", "empty", source_val)),
ifelse(is.na(target_val), "NA", ifelse(target_val == "", "empty", target_val))))
}
# =============================================================================
# 4. CALCULATE aot_total
# =============================================================================
cat("\n=== 3. CALCULATING aot_total ===\n\n")
# Check variables exist
aot_vars <- c("aot01", "aot02", "aot03", "aot04_r", "aot05_r", "aot06_r", "aot07_r", "aot08")
check_vars_exist(aot_vars, "aot_total")
# Reverse code aot04_r through aot07_r by flipping the sign
# (this is a valid reversal only if the response scale is centered on 0)
reverse_vars <- c("aot04_r", "aot05_r", "aot06_r", "aot07_r")
for (var in reverse_vars) {
df[[paste0(var, "_reversed")]] <- as.numeric(ifelse(
df[[var]] == "" | is.na(df[[var]]),
NA,
as.numeric(df[[var]]) * -1
))
}
# Calculate mean of all 8 variables (4 reversed + 4 original)
all_aot_vars <- c("aot01", "aot02", "aot03", "aot04_r_reversed", "aot05_r_reversed",
"aot06_r_reversed", "aot07_r_reversed", "aot08")
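# Convert to numeric matrix (column by column, so single-row data stays a matrix)
aot_numeric <- do.call(cbind, lapply(df[, all_aot_vars, drop = FALSE], function(x) {
as.numeric(ifelse(x == "" | is.na(x), NA, x))
}))
# Calculate mean; rows with no valid items yield NaN, which is stored as NA
df$aot_total <- rowMeans(aot_numeric, na.rm = TRUE)
df$aot_total[is.nan(df$aot_total)] <- NA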
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(789)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
# Unlist the one-row data frame so ifelse() works on a plain vector
aot_vals <- unlist(df[i, all_aot_vars], use.names = FALSE)
aot_nums <- as.numeric(ifelse(aot_vals == "" | is.na(aot_vals), NA, aot_vals))
expected_mean <- mean(aot_nums, na.rm = TRUE)
actual_mean <- df$aot_total[i]
cat(sprintf(" Row %d: aot_total = %s (expected: %s)\n",
i, ifelse(is.na(actual_mean), "NA", round(actual_mean, 4)),
ifelse(is.na(expected_mean), "NA", round(expected_mean, 4))))
}
# =============================================================================
# 5. PROCESS CRT QUESTIONS
# =============================================================================
cat("\n=== 4. PROCESSING CRT QUESTIONS ===\n\n")
# Check variables exist
check_vars_exist(c("crt01", "crt02", "crt03"), c("crt_correct", "crt_int"))
# Score CRT items. %in% treats NA responses as "no match", so unanswered items
# score 0 instead of propagating NA through ifelse()
# CRT01: "5 cents" = correct (1,0), "10 cents" = intuitive (0,1), else (0,0)
# CRT02: "5 minutes" = correct (1,0), "100 minutes" = intuitive (0,1), else (0,0)
# CRT03: "47 days" = correct (1,0), "24 days" = intuitive (0,1), else (0,0)
df$crt_correct <- (df$crt01 %in% "5 cents") + (df$crt02 %in% "5 minutes") + (df$crt03 %in% "47 days")
df$crt_int <- (df$crt01 %in% "10 cents") + (df$crt02 %in% "100 minutes") + (df$crt03 %in% "24 days")
# Check expected values exist
expected_crt01 <- c("5 cents", "10 cents")
expected_crt02 <- c("5 minutes", "100 minutes")
expected_crt03 <- c("47 days", "24 days")
check_values_exist("crt01", expected_crt01)
check_values_exist("crt02", expected_crt02)
check_values_exist("crt03", expected_crt03)
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(1011)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
cat(sprintf(" Row %d:\n", i))
# Per-item scores use %in% so that NA responses print as 0 rather than NA
cat(sprintf(" crt01 = %s -> crt_correct = %d, crt_int = %d\n",
ifelse(is.na(df$crt01[i]) || df$crt01[i] == "", "NA/empty", df$crt01[i]),
as.integer(df$crt01[i] %in% "5 cents"),
as.integer(df$crt01[i] %in% "10 cents")))
cat(sprintf(" crt02 = %s -> crt_correct = %d, crt_int = %d\n",
ifelse(is.na(df$crt02[i]) || df$crt02[i] == "", "NA/empty", df$crt02[i]),
as.integer(df$crt02[i] %in% "5 minutes"),
as.integer(df$crt02[i] %in% "100 minutes")))
cat(sprintf(" crt03 = %s -> crt_correct = %d, crt_int = %d\n",
ifelse(is.na(df$crt03[i]) || df$crt03[i] == "", "NA/empty", df$crt03[i]),
as.integer(df$crt03[i] %in% "47 days"),
as.integer(df$crt03[i] %in% "24 days")))
cat(sprintf(" Total: crt_correct = %d, crt_int = %d\n\n",
df$crt_correct[i], df$crt_int[i]))
}
# =============================================================================
# 6. CALCULATE icar_verbal
# =============================================================================
cat("\n=== 5. CALCULATING icar_verbal ===\n\n")
# Check variables exist
verbal_vars <- c("verbal01", "verbal02", "verbal03", "verbal04", "verbal05")
check_vars_exist(verbal_vars, "icar_verbal")
# Correct answers
correct_verbal <- c("5", "8", "It's impossible to tell", "47", "Sunday")
# Calculate proportion correct
verbal_responses <- df[, verbal_vars]
correct_count <- rowSums(
sapply(1:5, function(i) {
verbal_responses[, i] == correct_verbal[i]
}),
na.rm = TRUE
)
df$icar_verbal <- correct_count / 5
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(1213)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
responses <- df[i, verbal_vars]
# Recompute the score independently and compare it with the stored value
correct <- sum(sapply(1:5, function(j) responses[[j]] == correct_verbal[j]), na.rm = TRUE)
prop <- correct / 5
cat(sprintf(" Row %d: Correct = %d/5, icar_verbal = %s (expected: %s)\n",
i, correct, round(df$icar_verbal[i], 4), round(prop, 4)))
}
# =============================================================================
# 7. CALCULATE icar_matrix
# =============================================================================
cat("\n=== 6. CALCULATING icar_matrix ===\n\n")
# Check variables exist
matrix_vars <- c("matrix01", "matrix02", "matrix03", "matrix04", "matrix05")
check_vars_exist(matrix_vars, "icar_matrix")
# Correct answers
correct_matrix <- c("D", "E", "B", "B", "D")
# Calculate proportion correct
matrix_responses <- df[, matrix_vars]
correct_count <- rowSums(
sapply(1:5, function(i) {
matrix_responses[, i] == correct_matrix[i]
}),
na.rm = TRUE
)
df$icar_matrix <- correct_count / 5
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(1415)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
responses <- df[i, matrix_vars]
# Recompute the score independently and compare it with the stored value
correct <- sum(sapply(1:5, function(j) responses[[j]] == correct_matrix[j]), na.rm = TRUE)
prop <- correct / 5
cat(sprintf(" Row %d: Correct = %d/5, icar_matrix = %s (expected: %s)\n",
i, correct, round(df$icar_matrix[i], 4), round(prop, 4)))
}
# =============================================================================
# 8. CALCULATE icar_total
# =============================================================================
cat("\n=== 7. CALCULATING icar_total ===\n\n")
# Check variables exist
check_vars_exist(c(verbal_vars, matrix_vars), "icar_total")
# Calculate proportion correct across all 10 items
all_correct <- c(correct_verbal, correct_matrix)
all_responses <- df[, c(verbal_vars, matrix_vars)]
correct_count <- rowSums(
sapply(1:10, function(i) {
all_responses[, i] == all_correct[i]
}),
na.rm = TRUE
)
df$icar_total <- correct_count / 10
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(1617)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
responses <- df[i, c(verbal_vars, matrix_vars)]
# Recompute the score independently and compare it with the stored value
correct <- sum(sapply(1:10, function(j) responses[[j]] == all_correct[j]), na.rm = TRUE)
prop <- correct / 10
cat(sprintf(" Row %d: Correct = %d/10, icar_total = %s (expected: %s)\n",
i, correct, round(df$icar_total[i], 4), round(prop, 4)))
}
# =============================================================================
# 9. RECODE demo_sex TO sex
# =============================================================================
cat("\n=== 8. RECODING demo_sex TO sex ===\n\n")
# Check variables exist
check_vars_exist("demo_sex", "sex")
# Check values
sex_vals <- unique(df$demo_sex[!is.na(df$demo_sex) & df$demo_sex != ""])
cat(paste(" Unique values in demo_sex:", paste(sex_vals, collapse = ", "), "\n"))
# Recode: male = 0, female = 1, else = 2
df$sex <- ifelse(
is.na(df$demo_sex) | df$demo_sex == "",
NA,
ifelse(
tolower(df$demo_sex) == "male",
0,
ifelse(
tolower(df$demo_sex) == "female",
1,
2
)
)
)
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(1819)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
source_val <- df$demo_sex[i]
target_val <- df$sex[i]
cat(sprintf(" Row %d: demo_sex = %s -> sex = %s\n",
i, ifelse(is.na(source_val) || source_val == "", "NA/empty", source_val),
ifelse(is.na(target_val), "NA", target_val)))
}
# =============================================================================
# 10. RECODE demo_edu TO education
# =============================================================================
cat("\n=== 9. RECODING demo_edu TO education ===\n\n")
# Check variables exist
check_vars_exist("demo_edu", "education")
# Check values
edu_vals <- unique(df$demo_edu[!is.na(df$demo_edu) & df$demo_edu != ""])
cat(paste(" Unique values in demo_edu:", paste(edu_vals, collapse = ", "), "\n"))
# Recode
df$education <- ifelse(
is.na(df$demo_edu) | df$demo_edu == "",
NA,
ifelse(
df$demo_edu %in% c("High School (or equivalent)", "Trade School"),
"HS_TS",
ifelse(
df$demo_edu %in% c("College Diploma/Certificate", "University - Undergraduate"),
"C_Ug",
ifelse(
df$demo_edu %in% c("University - Graduate (Masters)", "University - PhD", "Professional Degree (ex. JD/MD)"),
"grad_prof",
NA
)
)
)
)
# Convert to ordered factor
df$education <- factor(df$education,
levels = c("HS_TS", "C_Ug", "grad_prof"),
ordered = TRUE)
# Verification check
cat("\nVerification (5 random rows):\n")
set.seed(2021)
sample_rows <- sample(1:nrow(df), min(5, nrow(df)))
for (i in sample_rows) {
source_val <- df$demo_edu[i]
target_val <- df$education[i]
cat(sprintf(" Row %d: demo_edu = %s -> education = %s\n",
i, ifelse(is.na(source_val) || source_val == "", "NA/empty", source_val),
ifelse(is.na(target_val), "NA", as.character(target_val))))
}
# =============================================================================
# 11. SAVE UPDATED DATA
# =============================================================================
# COMMENTED OUT: Uncomment when ready to save
# write.csv(df, "eohi3.csv", row.names = FALSE, na = "")
# cat("\nUpdated data saved to: eohi3.csv\n")
# cat(paste("Total rows:", nrow(df), "\n"))
# cat(paste("Total columns:", ncol(df), "\n"))

File diff suppressed because one or more lines are too long