hilitehero/README.md

# PDF Highlight Extractor

A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.

## Features

- **4-Color Support**: Extracts Yellow, Pink, Green, and Blue highlights
- **Smart Text Ordering**: Fixes PDF text extraction order issues using multiple methods
- **Hyphenation Merging**: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
- **Precise Boundaries**: Configurable overlap detection to avoid over-extraction
- **Multiple Extraction Methods**: Fallback system for maximum compatibility
- **Cross-page Support**: Handles highlights that span multiple pages
- **Test Mode**: Quick testing with default settings
- **Export Options**: JSON and CSV output formats

## Installation

### Prerequisites
- Python 3.7 or higher
- pip package manager

### Quick Installation

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd HiLiteHero
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:
   ```bash
   pip install PyMuPDF colorama
   ```

### Alternative Installation Methods

**Using virtual environment (recommended):**
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Using conda:**
```bash
conda create -n hilitehero python=3.9
conda activate hilitehero
pip install -r requirements.txt
```

### Verify Installation
```bash
python main.py --test
```
This should process the default test file and create a JSON output file.

## Quick Start with Makefile

The project includes a comprehensive Makefile for easy development and testing:

### Essential Commands

```bash
# Show all available commands
make help

# Quick test (recommended first run)
make test

# Interactive mode
make run

# Development mode (debug + interactive)
make dev

# Install dependencies
make install

# Clean up generated files
make clean
```

### Common Workflows

**First-time setup:**
```bash
make install    # Install dependencies
make test       # Verify everything works
```

**Development workflow:**
```bash
make dev        # Start development mode
make clean      # Clean up when done
```

**Batch processing:**
```bash
make batch      # Process default file silently
make batch-file FILE=document.pdf  # Process specific file
```

**Code quality:**
```bash
make format     # Format code
make lint       # Check code quality
make check      # Run all checks
```

### Advanced Makefile Usage

**Process specific pages:**
```bash
make run-pages FILE=document.pdf PAGES="1,3-5"
```

**Test different modes:**
```bash
make test-interactive  # Test with interactive review
make test-debug        # Test with debug output
make test-silent       # Test silently
```

**Project management:**
```bash
make status      # Show project status
make docs        # Show documentation
make list-pdfs   # List available PDF files
make list-outputs # Show recent outputs
```


## Dependencies

- PyMuPDF (fitz) - PDF processing and text extraction
- pdfplumber - Additional PDF annotation support
- colorama - Colored terminal output
- pandas - CSV export functionality

## Usage

### Quick Start

**Test Mode (Recommended for first-time users):**
```bash
python main.py --test
```
Uses default test file and automatically saves results to JSON.

**Interactive Mode:**
```bash
python main.py
```
Prompts for PDF file path and provides interactive review options.

**Process Specific PDF:**
```bash
python main.py path/to/your/document.pdf
```

### Command Line Options

| Flag | Description | Example |
|------|-------------|---------|
| `--test`, `-t` | Test mode with default settings | `python main.py -t` |
| `--interactive`, `-i` | Enable interactive review mode | `python main.py -i document.pdf` |
| `--pages`, `-p` | Process specific pages | `python main.py -p "1,3-5" doc.pdf` |
| `--silent`, `-s` | Minimal output, auto-save JSON | `python main.py -s` |
| `--debug`, `-d` | Enable detailed debug output | `python main.py -d document.pdf` |
| `--output-json` | Custom JSON output path | `python main.py --output-json results.json` |

### Usage Examples

**Basic extraction:**
```bash
python main.py document.pdf
```

**Process specific pages with interactive review:**
```bash
python main.py document.pdf -p "1,5-7" -i
```

**Silent mode for batch processing:**
```bash
python main.py document.pdf -s --output-json batch_results.json
```

**Debug mode for troubleshooting:**
```bash
python main.py document.pdf -d
```

**Test with custom output:**
```bash
python main.py -t --output-json test_results.json
```

### Interactive Review Mode

When using `-i` flag, you can:
- **[N]ext** - Move to next highlight
- **[P]rev** - Move to previous highlight
- **[U]p** - Move highlight up in order
- **[M]ove Down** - Move highlight down in order
- **[C]olor** - Change highlight color classification
- **[E]dit** - Edit highlight text
- **[D]elete** - Remove highlight
- **[O]pen Img** - View page image
- **[S]ave&Exit** - Save changes and exit
- **[Q]uit** - Quit without saving

## Output Formats

### Terminal Display
📄 Page 35
🎨 YELLOW
"We end with some specific suggestions for what we can do as linguists"
🎨 PINK (hyphen-merged)
"linguistics itself"

### JSON Export
{
    "highlights": [
        {
            "page": 35,
            "text": "We end with some specific suggestions",
            "color": "yellow",
            "type": "highlight"
        }
    ]
}

### CSV Export
Tabular format with columns: page, text, color, type, category

## Technical Features

### Text Ordering Algorithm
1. **Method A**: PyMuPDF built-in sorting
2. **Method B**: Text block extraction with geometric sorting
3. **Method C**: Enhanced word-level sorting with line detection

### Hyphenation Detection
- Same-page: Detects hyphens within 8-30 pixel line spacing
- Cross-page: Handles hyphenation across page boundaries
- Smart merging: Only merges clear hyphenation patterns

### Precision Control
- **Overlap Threshold**: 40% word overlap required for inclusion
- **Boundary Expansion**: +2 pixel expansion for edge words
- **Line Tolerance**: 5-pixel tolerance for same-line detection

## Troubleshooting

### Common Issues

**Text Order Problems**: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.

**Missing Words**: Lower the overlap threshold or check if highlights are too light/transparent.

**Over-extraction**: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.

**Installation Issues**:
- Ensure Python 3.7+ is installed
- Try using virtual environment: `make venv-install`
- Check dependencies: `make verify`

**Permission Errors**:
- On Linux/Mac: Ensure PDF files are readable
- On Windows: Run as administrator if needed

### Debug Output
Run with detailed logging to see extraction decisions:
```bash
python main.py --test --debug
# or
make test-debug
```

### Getting Help
```bash
# Show all available commands
make help

# Check project status
make status

# Verify installation
make verify
```

## Contributing

1. Create a feature branch from main
2. Make your changes
3. Test with sample PDFs
4. Submit a pull request

## License

MIT License

## Support

For issues or questions, please open a GitHub issue.

# PDF Highlight Extraction Process - Step by Step

## Phase 1: Initialization and Setup
1. **Script Startup**: Check command line arguments for test mode
2. **Path Resolution**: Determine PDF file path (default or user input)
3. **File Validation**: Verify PDF file exists and is accessible
4. **Object Creation**: Initialize PDFHighlightExtractor with file path

## Phase 2: PDF Analysis and Loading
1. **Document Opening**: Load PDF using PyMuPDF (fitz) library
2. **Page Iteration**: Loop through each page in the document
3. **Annotation Discovery**: Find all annotations on each page
4. **Type Filtering**: Identify highlight-type annotations specifically

## Phase 3: Color Classification
1. **Color Extraction**: Get RGB values from annotation properties
2. **Color Normalization**: Convert to 0-255 range if needed
3. **Color Mapping**: Classify into 4 categories (Yellow, Pink, Green, Blue)
4. **Unknown Filtering**: Skip annotations with unrecognized colors

## Phase 4: Text Extraction (Multi-Method Approach)

### Method A: Built-in Sorting
1. **Rectangle Expansion**: Add 2-pixel buffer around highlight area
2. **PyMuPDF Extraction**: Use page.get_text("text", sort=True)
3. **Text Cleaning**: Remove extra whitespace and normalize
4. **Success Check**: Return if valid text found

### Method B: Text Block Extraction
1. **Block Discovery**: Get text blocks from highlight area
2. **Geometric Sorting**: Sort blocks by Y-position, then X-position
3. **Block Combination**: Join block texts with spaces
4. **Quality Check**: Verify result makes sense

### Method C: Enhanced Word Sorting
1. **Word Collection**: Get all words intersecting highlight area
2. **Overlap Calculation**: Calculate intersection ratio for each word
3. **Threshold Filtering**: Include words with 40%+ overlap
4. **Line Detection**: Group words by Y-position (5-pixel tolerance)
5. **Line Sorting**: Sort lines top-to-bottom
6. **Word Sorting**: Sort words left-to-right within each line
7. **Text Assembly**: Combine words in proper reading order

## Phase 5: Hyphenation Detection and Merging
1. **Pattern Recognition**: Look for highlights ending with '-'
2. **Proximity Check**: Verify next highlight is same color and nearby
3. **Distance Validation**: Check reasonable line spacing (8-30 pixels)
4. **Page Handling**: Support both same-page and cross-page hyphenation
5. **Text Joining**: Remove hyphen and combine words seamlessly

## Phase 6: Data Organization
1. **Highlight Storage**: Create structured data objects for each highlight
2. **Sorting**: Order by page number, then Y-position, then X-position
3. **Merging**: Apply hyphenation merging where detected
4. **Categorization**: Separate annotations from background highlights

## Phase 7: Output Generation

### Terminal Display
1. **Page Grouping**: Organize results by page number
2. **Color Coding**: Apply terminal colors for visual distinction
3. **Status Indicators**: Show merge status (hyphen-merged, cross-page)
4. **Formatting**: Clean, readable text presentation

### File Export (Optional)
1. **JSON Generation**: Structure data with metadata
2. **CSV Creation**: Tabular format for analysis
3. **File Writing**: Save to specified output paths

## Phase 8: Cleanup and Reporting
1. **Resource Cleanup**: Close PDF document properly
2. **Statistics**: Report extraction counts and timing
3. **Status Messages**: Provide user feedback on results
4. **Memory Management**: Clean up temporary objects

## Error Handling Throughout
- **Try-Catch Blocks**: Graceful handling of PDF parsing errors
- **Fallback Methods**: Alternative extraction approaches
- **Validation Checks**: Verify data integrity at each step
- **User Feedback**: Clear error messages and debugging info

## Debug Information
- **Overlap Ratios**: Show word inclusion/exclusion decisions
- **Method Success**: Indicate which extraction method worked
- **Hyphenation Detection**: Log when word merging occurs
- **Performance Timing**: Track processing duration