# PDF Highlight Extractor

A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.

## Features

- **4-Color Support**: Extracts Yellow, Pink, Green, and Blue highlights
- **Smart Text Ordering**: Fixes PDF text extraction order issues using multiple methods
- **Hyphenation Merging**: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
- **Precise Boundaries**: Configurable overlap detection to avoid over-extraction
- **Multiple Extraction Methods**: Fallback system for maximum compatibility
- **Cross-page Support**: Handles highlights that span multiple pages
- **Test Mode**: Quick testing with default settings
- **Export Options**: JSON and CSV output formats

## Installation

### Prerequisites

- Python 3.7 or higher
- pip package manager

### Quick Installation

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd HiLiteHero
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:
   ```bash
   pip install PyMuPDF pdfplumber colorama pandas
   ```

### Alternative Installation Methods

**Using virtual environment (recommended):**
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Using conda:**
```bash
conda create -n hilitehero python=3.9
conda activate hilitehero
pip install -r requirements.txt
```

### Verify Installation

```bash
python main.py --test
```

This should process the default test file and create a JSON output file.
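If the test run fails, it can help to first confirm that the libraries listed under Dependencies are importable. The following is a minimal sketch, not part of the project: the helper name `missing_modules` is hypothetical, and the import names are assumptions (PyMuPDF is imported as `fitz`; the others are assumed to match their package names).

```python
"""Check that the libraries this README lists as dependencies can be found."""
import importlib.util

# Assumed import names: PyMuPDF installs the "fitz" module; the other
# three are assumed to be importable under their package names.
REQUIRED_MODULES = ["fitz", "pdfplumber", "colorama", "pandas"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

Any name reported as missing can then be installed with pip before retrying `python main.py --test`.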

## Quick Start with Makefile

The project includes a comprehensive Makefile for easy development and testing:

### Essential Commands

```bash
# Show all available commands
make help

# Quick test (recommended first run)
make test

# Interactive mode
make run

# Development mode (debug + interactive)
make dev

# Install dependencies
make install

# Clean up generated files
make clean
```

### Common Workflows

**First-time setup:**
```bash
make install  # Install dependencies
make test     # Verify everything works
```

**Development workflow:**
```bash
make dev      # Start development mode
make clean    # Clean up when done
```

**Batch processing:**
```bash
make batch                        # Process default file silently
make batch-file FILE=document.pdf # Process specific file
```

**Code quality:**
```bash
make format   # Format code
make lint     # Check code quality
make check    # Run all checks
```

### Advanced Makefile Usage

**Process specific pages:**
```bash
make run-pages FILE=document.pdf PAGES="1,3-5"
```

**Test different modes:**
```bash
make test-interactive  # Test with interactive review
make test-debug        # Test with debug output
make test-silent       # Test silently
```

**Project management:**
```bash
make status        # Show project status
make docs          # Show documentation
make list-pdfs     # List available PDF files
make list-outputs  # Show recent outputs
```

## Dependencies

- PyMuPDF (fitz) - PDF processing and text extraction
- pdfplumber - Additional PDF annotation support
- colorama - Colored terminal output
- pandas - CSV export functionality

## Usage

### Quick Start

**Test Mode (Recommended for first-time users):**
```bash
python main.py --test
```
Uses default test file and automatically saves results to JSON.

**Interactive Mode:**
```bash
python main.py
```
Prompts for PDF file path and provides interactive review options.

**Process Specific PDF:**
```bash
python main.py path/to/your/document.pdf
```

### Command Line Options

| Flag | Description | Example |
|------|-------------|---------|
| `--test`, `-t` | Test mode with default settings | `python main.py -t` |
| `--interactive`, `-i` | Enable interactive review mode | `python main.py -i document.pdf` |
| `--pages`, `-p` | Process specific pages | `python main.py -p "1,3-5" doc.pdf` |
| `--silent`, `-s` | Minimal output, auto-save JSON | `python main.py -s` |
| `--debug`, `-d` | Enable detailed debug output | `python main.py -d document.pdf` |
| `--output-json` | Custom JSON output path | `python main.py --output-json results.json` |

### Usage Examples

**Basic extraction:**
```bash
python main.py document.pdf
```

**Process specific pages with interactive review:**
```bash
python main.py document.pdf -p "1,5-7" -i
```

**Silent mode for batch processing:**
```bash
python main.py document.pdf -s --output-json batch_results.json
```

**Debug mode for troubleshooting:**
```bash
python main.py document.pdf -d
```

**Test with custom output:**
```bash
python main.py -t --output-json test_results.json
```

### Interactive Review Mode

When using the `-i` flag, you can:

- **[N]ext** - Move to next highlight
- **[P]rev** - Move to previous highlight
- **[U]p** - Move highlight up in order
- **[M]ove Down** - Move highlight down in order
- **[C]olor** - Change highlight color classification
- **[E]dit** - Edit highlight text
- **[D]elete** - Remove highlight
- **[O]pen Img** - View page image
- **[S]ave&Exit** - Save changes and exit
- **[Q]uit** - Quit without saving

## Output Formats

### Terminal Display

    📄 Page 35
    🎨 YELLOW
    "We end with some specific suggestions for what we can do as linguists"
    🎨 PINK (hyphen-merged)
    "linguistics itself"

### JSON Export

    {
      "highlights": [
        {
          "page": 35,
          "text": "We end with some specific suggestions",
          "color": "yellow",
          "type": "highlight"
        }
      ]
    }

### CSV Export

Tabular format with columns: page, text, color, type,
category

## Technical Features

### Text Ordering Algorithm

1. **Method A**: PyMuPDF built-in sorting
2. **Method B**: Text block extraction with geometric sorting
3. **Method C**: Enhanced word-level sorting with line detection

### Hyphenation Detection

- Same-page: Detects hyphens within 8-30 pixel line spacing
- Cross-page: Handles hyphenation across page boundaries
- Smart merging: Only merges clear hyphenation patterns

### Precision Control

- **Overlap Threshold**: 40% word overlap required for inclusion
- **Boundary Expansion**: +2 pixel expansion for edge words
- **Line Tolerance**: 5-pixel tolerance for same-line detection

## Troubleshooting

### Common Issues

**Text Order Problems**: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.

**Missing Words**: Lower the overlap threshold or check if highlights are too light/transparent.

**Over-extraction**: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.

**Installation Issues**:
- Ensure Python 3.7+ is installed
- Try using virtual environment: `make venv-install`
- Check dependencies: `make verify`

**Permission Errors**:
- On Linux/Mac: Ensure PDF files are readable
- On Windows: Run as administrator if needed

### Debug Output

Run with detailed logging to see extraction decisions:

```bash
python main.py --test --debug
# or
make test-debug
```

### Getting Help

```bash
# Show all available commands
make help

# Check project status
make status

# Verify installation
make verify
```

## Contributing

1. Create a feature branch from main
2. Make your changes
3. Test with sample PDFs
4. Submit a pull request

## License

MIT License

## Support

For issues or questions, please open a GitHub issue.

# PDF Highlight Extraction Process - Step by Step

## Phase 1: Initialization and Setup

1. **Script Startup**: Check command line arguments for test mode
2.
   **Path Resolution**: Determine PDF file path (default or user input)
3. **File Validation**: Verify PDF file exists and is accessible
4. **Object Creation**: Initialize `PDFHighlightExtractor` with file path

## Phase 2: PDF Analysis and Loading

1. **Document Opening**: Load PDF using PyMuPDF (fitz) library
2. **Page Iteration**: Loop through each page in the document
3. **Annotation Discovery**: Find all annotations on each page
4. **Type Filtering**: Identify highlight-type annotations specifically

## Phase 3: Color Classification

1. **Color Extraction**: Get RGB values from annotation properties
2. **Color Normalization**: Convert to 0-255 range if needed
3. **Color Mapping**: Classify into 4 categories (Yellow, Pink, Green, Blue)
4. **Unknown Filtering**: Skip annotations with unrecognized colors

## Phase 4: Text Extraction (Multi-Method Approach)

### Method A: Built-in Sorting

1. **Rectangle Expansion**: Add 2-pixel buffer around highlight area
2. **PyMuPDF Extraction**: Use `page.get_text("text", sort=True)`
3. **Text Cleaning**: Remove extra whitespace and normalize
4. **Success Check**: Return if valid text found

### Method B: Text Block Extraction

1. **Block Discovery**: Get text blocks from highlight area
2. **Geometric Sorting**: Sort blocks by Y-position, then X-position
3. **Block Combination**: Join block texts with spaces
4. **Quality Check**: Verify result makes sense

### Method C: Enhanced Word Sorting

1. **Word Collection**: Get all words intersecting highlight area
2. **Overlap Calculation**: Calculate intersection ratio for each word
3. **Threshold Filtering**: Include words with 40%+ overlap
4. **Line Detection**: Group words by Y-position (5-pixel tolerance)
5. **Line Sorting**: Sort lines top-to-bottom
6. **Word Sorting**: Sort words left-to-right within each line
7. **Text Assembly**: Combine words in proper reading order

## Phase 5: Hyphenation Detection and Merging

1. **Pattern Recognition**: Look for highlights ending with '-'
2.
   **Proximity Check**: Verify next highlight is same color and nearby
3. **Distance Validation**: Check reasonable line spacing (8-30 pixels)
4. **Page Handling**: Support both same-page and cross-page hyphenation
5. **Text Joining**: Remove hyphen and combine words seamlessly

## Phase 6: Data Organization

1. **Highlight Storage**: Create structured data objects for each highlight
2. **Sorting**: Order by page number, then Y-position, then X-position
3. **Merging**: Apply hyphenation merging where detected
4. **Categorization**: Separate annotations from background highlights

## Phase 7: Output Generation

### Terminal Display

1. **Page Grouping**: Organize results by page number
2. **Color Coding**: Apply terminal colors for visual distinction
3. **Status Indicators**: Show merge status (hyphen-merged, cross-page)
4. **Formatting**: Clean, readable text presentation

### File Export (Optional)

1. **JSON Generation**: Structure data with metadata
2. **CSV Creation**: Tabular format for analysis
3. **File Writing**: Save to specified output paths

## Phase 8: Cleanup and Reporting

1. **Resource Cleanup**: Close PDF document properly
2. **Statistics**: Report extraction counts and timing
3. **Status Messages**: Provide user feedback on results
4. **Memory Management**: Clean up temporary objects

## Error Handling Throughout

- **Try/Except Blocks**: Graceful handling of PDF parsing errors
- **Fallback Methods**: Alternative extraction approaches
- **Validation Checks**: Verify data integrity at each step
- **User Feedback**: Clear error messages and debugging info

## Debug Information

- **Overlap Ratios**: Show word inclusion/exclusion decisions
- **Method Success**: Indicate which extraction method worked
- **Hyphenation Detection**: Log when word merging occurs
- **Performance Timing**: Track processing duration
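
## Example Sketch: Color Classification (Phase 3)

To make the normalization and mapping steps of Phase 3 concrete, here is a minimal sketch that classifies an RGB triple by its nearest reference color. It is an illustration under assumptions, not the tool's actual code: the reference palette, the `MAX_DISTANCE` cutoff, and the function names are all hypothetical.

```python
# Assumed reference colors in 0-255 RGB; the tool's real palette may differ.
REFERENCE_COLORS = {
    "yellow": (255, 255, 0),
    "pink": (255, 105, 180),
    "green": (0, 255, 0),
    "blue": (0, 128, 255),
}
MAX_DISTANCE = 120.0  # assumed cutoff beyond which a color is "unknown"


def normalize_rgb(rgb):
    """Scale 0-1 components to 0-255; pass 0-255 values through.

    Heuristic: a triple whose components all lie in [0, 1] is treated
    as a 0-1 float color, so a near-black 0-255 triple is ambiguous.
    """
    if all(0.0 <= c <= 1.0 for c in rgb):
        return tuple(round(c * 255) for c in rgb)
    return tuple(int(c) for c in rgb)


def classify_color(rgb):
    """Return the nearest reference color name, or None if too far away."""
    r, g, b = normalize_rgb(rgb)
    best_name, best_dist = None, float("inf")
    for name, (rr, gg, bb) in REFERENCE_COLORS.items():
        # Euclidean distance in RGB space
        dist = ((r - rr) ** 2 + (g - gg) ** 2 + (b - bb) ** 2) ** 0.5
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= MAX_DISTANCE else None


if __name__ == "__main__":
    print(classify_color((1.0, 1.0, 0.0)))   # 0-1 floats
    print(classify_color((250, 240, 60)))    # 0-255 integers
```

Returning `None` for distant colors corresponds to the "Unknown Filtering" step: such annotations would simply be skipped.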
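
## Example Sketch: Word Sorting (Phase 4, Method C)

The steps of Method C (overlap filtering, line grouping, then left-to-right ordering) can be sketched in plain Python. This is a simplified illustration, not the tool's implementation: the function names are hypothetical, and each word is assumed to be a tuple starting with `(x0, y0, x1, y1, text)`, which matches the leading fields PyMuPDF returns from `page.get_text("words")`. The 40% threshold and 5-unit line tolerance come from the description above.

```python
OVERLAP_THRESHOLD = 0.40  # minimum share of a word's area inside the highlight
LINE_TOLERANCE = 5.0      # max difference in y0 for words on the same line


def overlap_ratio(word_rect, highlight_rect):
    """Fraction of the word's area that lies inside the highlight rectangle."""
    wx0, wy0, wx1, wy1 = word_rect
    hx0, hy0, hx1, hy1 = highlight_rect
    inter_w = min(wx1, hx1) - max(wx0, hx0)
    inter_h = min(wy1, hy1) - max(wy0, hy0)
    area = (wx1 - wx0) * (wy1 - wy0)
    if inter_w <= 0 or inter_h <= 0 or area <= 0:
        return 0.0
    return (inter_w * inter_h) / area


def words_in_reading_order(words, highlight_rect):
    """Keep sufficiently overlapping words and join them in reading order."""
    kept = [w for w in words
            if overlap_ratio(w[:4], highlight_rect) >= OVERLAP_THRESHOLD]
    kept.sort(key=lambda w: w[1])  # top to bottom by y0

    # Group into lines: a word joins the current line if its y0 is within
    # the tolerance of that line's first word.
    lines = []
    for word in kept:
        if lines and abs(word[1] - lines[-1][0][1]) <= LINE_TOLERANCE:
            lines[-1].append(word)
        else:
            lines.append([word])

    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda w: w[0]))  # left to right
    return " ".join(w[4] for w in ordered)
```

Measuring overlap against the word's own area (rather than the highlight's) means a word clipped by the highlight edge is dropped, which is one way to limit over-extraction.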
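
## Example Sketch: Hyphenation Merging (Phase 5)

The hyphenation rules in Phase 5 can be illustrated with a small merge pass over highlights that are already sorted in reading order. This is a sketch under assumptions rather than the tool's code: the dictionary keys (`page`, `y`, `color`, `text`), the `merge` status field, and the rule that a cross-page continuation must start on the immediately following page are all hypothetical. The 8-30 line-spacing window comes from the description above.

```python
MIN_LINE_GAP = 8.0   # smallest vertical gap treated as "next line"
MAX_LINE_GAP = 30.0  # largest vertical gap treated as "next line"


def should_merge(first, second):
    """Decide whether `second` continues a hyphenated word from `first`."""
    if not first["text"].rstrip().endswith("-"):
        return False
    if first["color"] != second["color"]:
        return False
    if first["page"] == second["page"]:
        gap = second["y"] - first["y"]
        return MIN_LINE_GAP <= gap <= MAX_LINE_GAP
    # Assumption: cross-page continuations start on the very next page.
    return second["page"] == first["page"] + 1


def merge_hyphenated(highlights):
    """Return a new list with hyphenated fragments joined; input is untouched."""
    merged = []
    tail = None  # last raw fragment folded into merged[-1]
    for item in highlights:
        if tail is not None and should_merge(tail, item):
            prev = merged[-1]
            # Drop the trailing hyphen and glue the continuation on.
            prev["text"] = prev["text"].rstrip()[:-1] + item["text"].lstrip()
            same_page = item["page"] == tail["page"]
            prev["merge"] = "hyphen-merged" if same_page else "cross-page"
        else:
            merged.append(dict(item))
        tail = item
    return merged
```

Note that this simple rule would also remove genuine hyphens at a line end (for example in "well-" + "known"); a fuller implementation would need an extra check, such as a dictionary lookup, to merge only clear hyphenation patterns.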