PDF Highlight Extractor
A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.
Features
- 4-Color Support: Extracts Yellow, Pink, Green, and Blue highlights
- Smart Text Ordering: Fixes PDF text extraction order issues using multiple methods
- Hyphenation Merging: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
- Precise Boundaries: Configurable overlap detection to avoid over-extraction
- Multiple Extraction Methods: Fallback system for maximum compatibility
- Cross-page Support: Handles highlights that span multiple pages
- Test Mode: Quick testing with default settings
- Export Options: JSON and CSV output formats
Installation
Prerequisites
- Python 3.7 or higher
- pip package manager
Quick Installation
- Clone the repository:
git clone <repository-url>
cd HiLiteHero
- Install dependencies:
pip install -r requirements.txt
Or install manually:
pip install PyMuPDF colorama
Alternative Installation Methods
Using a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Using conda:
conda create -n hilitehero python=3.9
conda activate hilitehero
pip install -r requirements.txt
Verify Installation
python main.py --test
This should process the default test file and create a JSON output file.
Quick Start with Makefile
The project includes a comprehensive Makefile for easy development and testing:
Essential Commands
# Show all available commands
make help
# Quick test (recommended first run)
make test
# Interactive mode
make run
# Development mode (debug + interactive)
make dev
# Install dependencies
make install
# Clean up generated files
make clean
Common Workflows
First-time setup:
make install # Install dependencies
make test # Verify everything works
Development workflow:
make dev # Start development mode
make clean # Clean up when done
Batch processing:
make batch # Process default file silently
make batch-file FILE=document.pdf # Process specific file
Code quality:
make format # Format code
make lint # Check code quality
make check # Run all checks
Advanced Makefile Usage
Process specific pages:
make run-pages FILE=document.pdf PAGES="1,3-5"
Test different modes:
make test-interactive # Test with interactive review
make test-debug # Test with debug output
make test-silent # Test silently
Project management:
make status # Show project status
make docs # Show documentation
make list-pdfs # List available PDF files
make list-outputs # Show recent outputs
Dependencies
- PyMuPDF (fitz) - PDF processing and text extraction
- pdfplumber - Additional PDF annotation support
- colorama - Colored terminal output
- pandas - CSV export functionality
Usage
Quick Start
Test Mode (Recommended for first-time users):
python main.py --test
Uses the default test file and automatically saves the results to JSON.
Interactive Mode:
python main.py
Prompts for PDF file path and provides interactive review options.
Process Specific PDF:
python main.py path/to/your/document.pdf
Command Line Options
| Flag | Description | Example |
|---|---|---|
| --test, -t | Test mode with default settings | python main.py -t |
| --interactive, -i | Enable interactive review mode | python main.py -i document.pdf |
| --pages, -p | Process specific pages | python main.py -p "1,3-5" doc.pdf |
| --silent, -s | Minimal output, auto-save JSON | python main.py -s |
| --debug, -d | Enable detailed debug output | python main.py -d document.pdf |
| --output-json | Custom JSON output path | python main.py --output-json results.json |
Usage Examples
Basic extraction:
python main.py document.pdf
Process specific pages with interactive review:
python main.py document.pdf -p "1,5-7" -i
Silent mode for batch processing:
python main.py document.pdf -s --output-json batch_results.json
Debug mode for troubleshooting:
python main.py document.pdf -d
Test with custom output:
python main.py -t --output-json test_results.json
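The --pages flag takes specs such as "1,3-5". The following is a minimal sketch of how a spec like that could be expanded into page numbers. The helper name parse_page_spec and its error handling are assumptions made for illustration, not the tool's actual parser.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page spec such as "1,3-5" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding spaces
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("1,3-5"))  # [1, 3, 4, 5]
```

Collecting pages in a set means that overlapping entries such as "3-5,4" yield each page only once.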
Interactive Review Mode
When using the -i flag, you can:
- [N]ext - Move to next highlight
- [P]rev - Move to previous highlight
- [U]p - Move highlight up in order
- [M]ove Down - Move highlight down in order
- [C]olor - Change highlight color classification
- [E]dit - Edit highlight text
- [D]elete - Remove highlight
- [O]pen Img - View page image
- [S]ave&Exit - Save changes and exit
- [Q]uit - Quit without saving
Output Formats
Terminal Display
📄 Page 35
🎨 YELLOW
"We end with some specific suggestions for what we can do as linguists"
🎨 PINK (hyphen-merged)
"linguistics itself"
JSON Export
{
  "highlights": [
    {
      "page": 35,
      "text": "We end with some specific suggestions",
      "color": "yellow",
      "type": "highlight"
    }
  ]
}
CSV Export
Tabular format with columns: page, text, color, type, category
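As a rough illustration of the two export formats, the sketch below serializes one highlight record to JSON and to CSV with the documented columns. It uses the standard-library csv module rather than pandas so that it runs without extra dependencies. The function names to_json and to_csv and the empty "category" value are assumptions.

```python
import csv
import io
import json

# Sample record shaped like the JSON example; "category" is left empty because
# its allowed values are not documented here (assumption).
highlights = [
    {"page": 35, "text": "We end with some specific suggestions",
     "color": "yellow", "type": "highlight", "category": ""},
]

COLUMNS = ["page", "text", "color", "type", "category"]


def to_json(records):
    """Serialize highlights under a top-level "highlights" key."""
    return json.dumps({"highlights": records}, indent=2, ensure_ascii=False)


def to_csv(records):
    """Return CSV text with a header row in the documented column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(highlights))
print(to_csv(highlights))
```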
Technical Features
Text Ordering Algorithm
- Method A: PyMuPDF built-in sorting
- Method B: Text block extraction with geometric sorting
- Method C: Enhanced word-level sorting with line detection
Hyphenation Detection
- Same-page: Merges a line-ending hyphen when the next line is 8-30 pixels below
- Cross-page: Handles hyphenation across page boundaries
- Smart merging: Only merges clear hyphenation patterns
Precision Control
- Overlap Threshold: 40% word overlap required for inclusion
- Boundary Expansion: +2 pixel expansion for edge words
- Line Tolerance: 5-pixel tolerance for same-line detection
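To make the overlap threshold and boundary expansion concrete, here is a minimal sketch of the inclusion test. It assumes that "word overlap" means the fraction of the word's bounding-box area covered by the highlight after expansion. The names overlap_ratio, expand, and is_included are hypothetical.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

OVERLAP_THRESHOLD = 0.40   # 40% of the word's area must be covered
BOUNDARY_EXPANSION = 2.0   # grow the highlight by 2 units on every side


def expand(rect: Rect, margin: float) -> Rect:
    """Return rect grown by margin on all four sides."""
    x0, y0, x1, y1 = rect
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlap_ratio(word: Rect, highlight: Rect) -> float:
    """Fraction of the word's area that lies inside the highlight."""
    wx0, wy0, wx1, wy1 = word
    hx0, hy0, hx1, hy1 = highlight
    width = min(wx1, hx1) - max(wx0, hx0)
    height = min(wy1, hy1) - max(wy0, hy0)
    word_area = (wx1 - wx0) * (wy1 - wy0)
    if width <= 0 or height <= 0 or word_area <= 0:
        return 0.0
    return (width * height) / word_area


def is_included(word: Rect, highlight: Rect) -> bool:
    """Apply the expansion first, then the 40% threshold."""
    expanded = expand(highlight, BOUNDARY_EXPANSION)
    return overlap_ratio(word, expanded) >= OVERLAP_THRESHOLD


highlight = (100.0, 200.0, 300.0, 212.0)
inside = (120.0, 201.0, 160.0, 211.0)    # fully covered
edge = (280.0, 201.0, 320.0, 211.0)      # a bit more than half covered
partial = (295.0, 201.0, 320.0, 211.0)   # mostly outside
print(is_included(inside, highlight), is_included(edge, highlight),
      is_included(partial, highlight))
```

Lowering OVERLAP_THRESHOLD makes the test more permissive, which is the adjustment to try when words at the edge of a highlight go missing.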
Troubleshooting
Common Issues
Text Order Problems: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.
Missing Words: Lower the overlap threshold or check if highlights are too light/transparent.
Over-extraction: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.
Installation Issues:
- Ensure Python 3.7+ is installed
- Try using a virtual environment:
make venv-install
- Check dependencies:
make verify
Permission Errors:
- On Linux/Mac: Ensure PDF files are readable
- On Windows: Run as administrator if needed
Debug Output
Run with detailed logging to see extraction decisions:
python main.py --test --debug
# or
make test-debug
Getting Help
# Show all available commands
make help
# Check project status
make status
# Verify installation
make verify
Contributing
- Create a feature branch from main
- Make your changes
- Test with sample PDFs
- Submit a pull request
License
MIT License
Support
For issues or questions, please open a GitHub issue.
PDF Highlight Extraction Process - Step by Step
Phase 1: Initialization and Setup
- Script Startup: Check command line arguments for test mode
- Path Resolution: Determine PDF file path (default or user input)
- File Validation: Verify PDF file exists and is accessible
- Object Creation: Initialize PDFHighlightExtractor with file path
Phase 2: PDF Analysis and Loading
- Document Opening: Load PDF using PyMuPDF (fitz) library
- Page Iteration: Loop through each page in the document
- Annotation Discovery: Find all annotations on each page
- Type Filtering: Identify highlight-type annotations specifically
Phase 3: Color Classification
- Color Extraction: Get RGB values from annotation properties
- Color Normalization: Convert to 0-255 range if needed
- Color Mapping: Classify into 4 categories (Yellow, Pink, Green, Blue)
- Unknown Filtering: Skip annotations with unrecognized colors
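The classification steps above could look roughly like the sketch below: normalize the color to 0-255, find the nearest reference color, and return None when nothing is close enough. The reference RGB values and the distance cutoff are illustrative assumptions, not the tool's actual palette.

```python
import math
from typing import Optional, Sequence, Tuple

# Assumed reference colors; the real tool may use different values or rules.
REFERENCE_COLORS = {
    "yellow": (255, 255, 0),
    "pink": (255, 105, 180),
    "green": (0, 255, 0),
    "blue": (0, 150, 255),
}
MAX_DISTANCE = 150.0  # assumed cutoff; anything farther counts as unknown


def normalize(rgb: Sequence[float]) -> Tuple[int, ...]:
    """Scale 0-1 components to 0-255; pass 0-255 components through.

    A triple whose components are all <= 1 is treated as 0-1 floats.
    """
    if all(0.0 <= c <= 1.0 for c in rgb):
        return tuple(round(c * 255) for c in rgb)
    return tuple(round(c) for c in rgb)


def classify(rgb: Sequence[float]) -> Optional[str]:
    """Return the nearest color name, or None for unrecognized colors."""
    r, g, b = normalize(rgb)
    best_name, best_dist = None, float("inf")
    for name, (rr, gg, bb) in REFERENCE_COLORS.items():
        dist = math.sqrt((r - rr) ** 2 + (g - gg) ** 2 + (b - bb) ** 2)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= MAX_DISTANCE else None


print(classify((1.0, 1.0, 0.0)))     # yellow
print(classify((255, 192, 203)))     # pink
print(classify((0.0, 0.0, 0.0)))     # None (black is not a highlight color)
```

Annotations for which classify returns None would be the ones skipped in the Unknown Filtering step.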
Phase 4: Text Extraction (Multi-Method Approach)
Method A: Built-in Sorting
- Rectangle Expansion: Add 2-pixel buffer around highlight area
- PyMuPDF Extraction: Use page.get_text("text", sort=True)
- Text Cleaning: Remove extra whitespace and normalize
- Success Check: Return if valid text found
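A minimal sketch of Method A follows. It assumes the highlight rectangle is passed to PyMuPDF through the clip argument of page.get_text, which the steps above do not state explicitly. The function name extract_sorted_text is hypothetical, and page can be any object that behaves like a PyMuPDF page.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
BUFFER = 2.0  # the 2-pixel buffer around the highlight area


def extract_sorted_text(page, rect: Rect) -> str:
    """Return cleaned text under rect, or "" when nothing usable is found."""
    x0, y0, x1, y1 = rect
    clip = (x0 - BUFFER, y0 - BUFFER, x1 + BUFFER, y1 + BUFFER)
    # sort=True asks PyMuPDF to return text in top-left to bottom-right order.
    raw = page.get_text("text", clip=clip, sort=True)
    return " ".join(raw.split())  # collapse newlines and repeated spaces


# Usage with PyMuPDF installed (not executed here):
#   import fitz
#   doc = fitz.open("document.pdf")
#   page = doc[0]
#   for annot in page.annots():
#       if annot.type[0] == 8:  # 8 is the highlight annotation type
#           print(extract_sorted_text(page, tuple(annot.rect)))
```

An empty return value is the signal to move on to the next extraction method.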
Method B: Text Block Extraction
- Block Discovery: Get text blocks from highlight area
- Geometric Sorting: Sort blocks by Y-position, then X-position
- Block Combination: Join block texts with spaces
- Quality Check: Verify result makes sense
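The geometric sorting in Method B can be sketched as follows. Blocks are modeled as (x0, y0, x1, y1, text) tuples, which mirrors the leading fields of the tuples returned by PyMuPDF's page.get_text("blocks"). The function name join_blocks is an assumption.

```python
from typing import List, Tuple

Block = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


def join_blocks(blocks: List[Block]) -> str:
    """Sort blocks by Y-position, then X-position, and join their text."""
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    parts = [" ".join(b[4].split()) for b in ordered]  # collapse whitespace
    return " ".join(p for p in parts if p)             # drop empty blocks


blocks = [
    (72.0, 140.0, 300.0, 152.0, "as linguists"),
    (72.0, 120.0, 300.0, 132.0, "We end with some\nspecific suggestions"),
    (72.0, 130.0, 300.0, 142.0, "for what we can do"),
]
print(join_blocks(blocks))
# We end with some specific suggestions for what we can do as linguists
```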
Method C: Enhanced Word Sorting
- Word Collection: Get all words intersecting highlight area
- Overlap Calculation: Calculate intersection ratio for each word
- Threshold Filtering: Include words with 40%+ overlap
- Line Detection: Group words by Y-position (5-pixel tolerance)
- Line Sorting: Sort lines top-to-bottom
- Word Sorting: Sort words left-to-right within each line
- Text Assembly: Combine words in proper reading order
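The line detection and sorting steps of Method C might be implemented along these lines. The sketch assumes the words have already passed the overlap filter and that each line is anchored at the Y-position of its first word. The function name words_in_reading_order is hypothetical.

```python
from typing import List, Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)
LINE_TOLERANCE = 5.0  # words within 5 units of a line's first word share it


def words_in_reading_order(words: List[Word]) -> str:
    """Group words into lines by Y-position, then read left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):  # top to bottom by y0
        if lines and abs(word[1] - lines[-1][0][1]) <= LINE_TOLERANCE:
            lines[-1].append(word)   # close enough: same line
        else:
            lines.append([word])     # start a new line
    ordered: List[Word] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda w: w[0]))  # left to right
    return " ".join(w[4] for w in ordered)


words = [
    (150.0, 101.5, 190.0, 112.0, "end"),
    (100.0, 100.0, 140.0, 111.0, "We"),
    (100.0, 114.0, 150.0, 125.0, "some"),
    (200.0, 99.0, 240.0, 110.0, "with"),
    (160.0, 115.0, 230.0, 126.0, "specific"),
]
print(words_in_reading_order(words))  # We end with some specific
```

Sorting purely by Y-position would put "with" before "We" here because its baseline sits one unit higher; grouping by tolerance first avoids that.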
Phase 5: Hyphenation Detection and Merging
- Pattern Recognition: Look for highlights ending with '-'
- Proximity Check: Verify next highlight is same color and nearby
- Distance Validation: Check reasonable line spacing (8-30 pixels)
- Page Handling: Support both same-page and cross-page hyphenation
- Text Joining: Remove hyphen and combine words seamlessly
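The merging rules above can be sketched as below. The Highlight dataclass, the lowercase-continuation check, and the "next page only" rule for cross-page merges are assumptions added for illustration; the 8-30 spacing window comes from the steps above.

```python
from dataclasses import dataclass
from typing import List

MIN_LINE_GAP = 8.0
MAX_LINE_GAP = 30.0


@dataclass
class Highlight:
    page: int
    y: float
    text: str
    color: str
    merged: bool = False


def should_merge(first: Highlight, second: Highlight) -> bool:
    """Decide whether second continues a word hyphenated at the end of first."""
    if not first.text.endswith("-") or first.color != second.color:
        return False
    if not second.text[:1].islower():
        return False  # assumed heuristic: a continuation starts lowercase
    if second.page == first.page:
        return MIN_LINE_GAP <= second.y - first.y <= MAX_LINE_GAP
    return second.page == first.page + 1  # assumed cross-page rule


def merge_hyphenated(highlights: List[Highlight]) -> List[Highlight]:
    """Merge adjacent highlights; expects input already in reading order."""
    result: List[Highlight] = []
    for current in highlights:
        if result and should_merge(result[-1], current):
            prev = result[-1]
            # Drop the trailing hyphen and keep the first part's position.
            result[-1] = Highlight(prev.page, prev.y,
                                   prev.text[:-1] + current.text,
                                   prev.color, merged=True)
        else:
            result.append(current)
    return result


items = [
    Highlight(35, 400.0, "lin-", "pink"),
    Highlight(35, 414.0, "guistics itself", "pink"),
    Highlight(35, 500.0, "well-", "yellow"),
    Highlight(35, 514.0, "known", "green"),   # different color: not merged
]
for h in merge_hyphenated(items):
    print(h.color, repr(h.text), h.merged)
```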
Phase 6: Data Organization
- Highlight Storage: Create structured data objects for each highlight
- Sorting: Order by page number, then Y-position, then X-position
- Merging: Apply hyphenation merging where detected
- Categorization: Separate annotations from background highlights
Phase 7: Output Generation
Terminal Display
- Page Grouping: Organize results by page number
- Color Coding: Apply terminal colors for visual distinction
- Status Indicators: Show merge status (hyphen-merged, cross-page)
- Formatting: Clean, readable text presentation
File Export (Optional)
- JSON Generation: Structure data with metadata
- CSV Creation: Tabular format for analysis
- File Writing: Save to specified output paths
Phase 8: Cleanup and Reporting
- Resource Cleanup: Close PDF document properly
- Statistics: Report extraction counts and timing
- Status Messages: Provide user feedback on results
- Memory Management: Clean up temporary objects
Error Handling Throughout
- Try/Except Blocks: Graceful handling of PDF parsing errors
- Fallback Methods: Alternative extraction approaches
- Validation Checks: Verify data integrity at each step
- User Feedback: Clear error messages and debugging info
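One way to combine the fallback methods with error handling is a small driver that tries each extractor in order and reports which one succeeded. This is a sketch under stated assumptions: extract_with_fallback and the three stand-in methods are hypothetical, and the real tool may structure this differently.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[], str]


def extract_with_fallback(
    methods: Iterable[Tuple[str, Extractor]],
) -> Tuple[Optional[str], str]:
    """Try each named extractor in order; return (method name, text)."""
    for name, method in methods:
        try:
            text = " ".join(method().split())  # normalize whitespace
        except Exception as exc:  # one failing method must not abort the run
            print(f"[debug] method {name} failed: {exc}")
            continue
        if text:
            return name, text
    return None, ""  # every method failed or returned nothing


# Stand-ins for the three extraction methods (illustrative only).
def method_a() -> str:
    raise RuntimeError("simulated parsing error")


def method_b() -> str:
    return "   "  # whitespace only counts as no result


def method_c() -> str:
    return "linguistics  itself"


result = extract_with_fallback(
    [("A", method_a), ("B", method_b), ("C", method_c)]
)
print(result)  # ('C', 'linguistics itself')
```

Returning the method name alongside the text supports the "Method Success" debug output without extra bookkeeping.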
Debug Information
- Overlap Ratios: Show word inclusion/exclusion decisions
- Method Success: Indicate which extraction method worked
- Hyphenation Detection: Log when word merging occurs
- Performance Timing: Track processing duration