# PDF Highlight Extractor

A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.

## Features

- **4-Color Support**: Extracts Yellow, Pink, Green, and Blue highlights
- **Smart Text Ordering**: Fixes PDF text extraction order issues using multiple methods
- **Hyphenation Merging**: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
- **Precise Boundaries**: Configurable overlap detection to avoid over-extraction
- **Multiple Extraction Methods**: Fallback system for maximum compatibility
- **Cross-page Support**: Handles highlights that span multiple pages
- **Test Mode**: Quick testing with default settings
- **Export Options**: JSON and CSV output formats

## Installation

### Prerequisites

- Python 3.7 or higher
- pip package manager

### Quick Installation

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd HiLiteHero
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

   Or install the packages listed under Dependencies manually:
   ```bash
   pip install PyMuPDF pdfplumber colorama pandas
   ```

### Alternative Installation Methods

**Using virtual environment (recommended):**
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

**Using conda:**
```bash
conda create -n hilitehero python=3.9
conda activate hilitehero
pip install -r requirements.txt
```

### Verify Installation
```bash
python main.py --test
```
This should process the default test file and create a JSON output file.

## Quick Start with Makefile

The project includes a comprehensive Makefile for easy development and testing:

### Essential Commands

```bash
# Show all available commands
make help

# Quick test (recommended first run)
make test

# Interactive mode
make run

# Development mode (debug + interactive)
make dev

# Install dependencies
make install

# Clean up generated files
make clean
```

### Common Workflows

**First-time setup:**
```bash
make install    # Install dependencies
make test       # Verify everything works
```

**Development workflow:**
```bash
make dev        # Start development mode
make clean      # Clean up when done
```

**Batch processing:**
```bash
make batch                          # Process default file silently
make batch-file FILE=document.pdf   # Process specific file
```

**Code quality:**
```bash
make format     # Format code
make lint       # Check code quality
make check      # Run all checks
```

### Advanced Makefile Usage

**Process specific pages:**
```bash
make run-pages FILE=document.pdf PAGES="1,3-5"
```

**Test different modes:**
```bash
make test-interactive   # Test with interactive review
make test-debug         # Test with debug output
make test-silent        # Test silently
```

**Project management:**
```bash
make status        # Show project status
make docs          # Show documentation
make list-pdfs     # List available PDF files
make list-outputs  # Show recent outputs
```

## Dependencies

- PyMuPDF (fitz) - PDF processing and text extraction
- pdfplumber - Additional PDF annotation support
- colorama - Colored terminal output
- pandas - CSV export functionality

## Usage

### Quick Start

**Test Mode (Recommended for first-time users):**
```bash
python main.py --test
```
Uses the default test file and automatically saves results to JSON.

**Interactive Mode:**
```bash
python main.py
```
Prompts for the PDF file path and provides interactive review options.

**Process Specific PDF:**
```bash
python main.py path/to/your/document.pdf
```

### Command Line Options

| Flag | Description | Example |
|------|-------------|---------|
| `--test`, `-t` | Test mode with default settings | `python main.py -t` |
| `--interactive`, `-i` | Enable interactive review mode | `python main.py -i document.pdf` |
| `--pages`, `-p` | Process specific pages | `python main.py -p "1,3-5" doc.pdf` |
| `--silent`, `-s` | Minimal output, auto-save JSON | `python main.py -s` |
| `--debug`, `-d` | Enable detailed debug output | `python main.py -d document.pdf` |
| `--output-json` | Custom JSON output path | `python main.py --output-json results.json` |

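As a rough illustration of how the flags in this table could be wired up, the sketch below builds an `argparse` parser and expands a page spec such as `"1,3-5"`. The function names `parse_pages` and `build_parser` are hypothetical and are not taken from `main.py`; the actual script may parse its arguments differently.

```python
import argparse


def parse_pages(spec):
    """Expand a page spec such as "1,3-5" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("pdf", nargs="?", help="path to the PDF file")
    parser.add_argument("-t", "--test", action="store_true")
    parser.add_argument("-i", "--interactive", action="store_true")
    parser.add_argument("-p", "--pages", type=parse_pages, default=None)
    parser.add_argument("-s", "--silent", action="store_true")
    parser.add_argument("-d", "--debug", action="store_true")
    parser.add_argument("--output-json", dest="output_json", default=None)
    return parser


args = build_parser().parse_args(["-p", "1,3-5", "-d", "doc.pdf"])
print(args.pages, args.debug, args.pdf)  # [1, 3, 4, 5] True doc.pdf
```

Passing `parse_pages` as the `type` means a malformed spec is reported by `argparse` as a normal usage error rather than a traceback.
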
### Usage Examples

**Basic extraction:**
```bash
python main.py document.pdf
```

**Process specific pages with interactive review:**
```bash
python main.py document.pdf -p "1,5-7" -i
```

**Silent mode for batch processing:**
```bash
python main.py document.pdf -s --output-json batch_results.json
```

**Debug mode for troubleshooting:**
```bash
python main.py document.pdf -d
```

**Test with custom output:**
```bash
python main.py -t --output-json test_results.json
```

### Interactive Review Mode

When using the `-i` flag, you can:
- **[N]ext** - Move to next highlight
- **[P]rev** - Move to previous highlight
- **[U]p** - Move highlight up in order
- **[M]ove Down** - Move highlight down in order
- **[C]olor** - Change highlight color classification
- **[E]dit** - Edit highlight text
- **[D]elete** - Remove highlight
- **[O]pen Img** - View page image
- **[S]ave&Exit** - Save changes and exit
- **[Q]uit** - Quit without saving

## Output Formats

### Terminal Display

    📄 Page 35
    🎨 YELLOW
    "We end with some specific suggestions for what we can do as linguists"
    🎨 PINK (hyphen-merged)
    "linguistics itself"

### JSON Export

    {
      "highlights": [
        {
          "page": 35,
          "text": "We end with some specific suggestions",
          "color": "yellow",
          "type": "highlight"
        }
      ]
    }

### CSV Export

Tabular format with columns: page, text, color, type, category

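A minimal sketch of producing both export formats from a list of highlight dictionaries is shown below. It uses only the standard library `json` and `csv` modules for portability, whereas the project lists pandas for CSV export. The helper names `to_json` and `to_csv` are hypothetical, and leaving a missing `category` empty is an assumption.

```python
import csv
import io
import json

# Column order taken from the CSV description above.
FIELDS = ["page", "text", "color", "type", "category"]


def to_json(highlights):
    """Wrap the highlights in the top-level "highlights" key and serialize."""
    return json.dumps({"highlights": highlights}, indent=2, ensure_ascii=False)


def to_csv(highlights):
    """Serialize highlights as CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for item in highlights:
        writer.writerow({field: item.get(field, "") for field in FIELDS})
    return buffer.getvalue()


sample = [
    {
        "page": 35,
        "text": "We end with some specific suggestions",
        "color": "yellow",
        "type": "highlight",
    }
]
print(to_json(sample))
print(to_csv(sample))
```
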
## Technical Features

### Text Ordering Algorithm
1. **Method A**: PyMuPDF built-in sorting
2. **Method B**: Text block extraction with geometric sorting
3. **Method C**: Enhanced word-level sorting with line detection

### Hyphenation Detection
- Same-page: Detects hyphens within 8-30 pixel line spacing
- Cross-page: Handles hyphenation across page boundaries
- Smart merging: Only merges clear hyphenation patterns

### Precision Control
- **Overlap Threshold**: 40% word overlap required for inclusion
- **Boundary Expansion**: +2 pixel expansion for edge words
- **Line Tolerance**: 5-pixel tolerance for same-line detection

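The overlap threshold and boundary expansion can be illustrated with a small sketch. It assumes that "overlap" means the fraction of a word's bounding-box area that falls inside the expanded highlight rectangle, which is one plausible reading rather than a confirmed detail of the tool. Rectangles are plain `(x0, y0, x1, y1)` tuples, and the names `overlap_ratio`, `expand`, and `select_words` are hypothetical.

```python
def overlap_ratio(word, highlight):
    """Fraction of the word's area covered by the highlight rectangle."""
    wx0, wy0, wx1, wy1 = word
    hx0, hy0, hx1, hy1 = highlight
    width = min(wx1, hx1) - max(wx0, hx0)
    height = min(wy1, hy1) - max(wy0, hy0)
    word_area = (wx1 - wx0) * (wy1 - wy0)
    if width <= 0 or height <= 0 or word_area <= 0:
        return 0.0
    return (width * height) / word_area


def expand(rect, margin=2.0):
    """Grow a rectangle by `margin` on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def select_words(words, highlight, threshold=0.40, margin=2.0):
    """Keep words whose overlap with the expanded highlight meets the threshold.

    Each word is an (x0, y0, x1, y1, text) tuple.
    """
    area = expand(highlight, margin)
    return [w[4] for w in words if overlap_ratio(w[:4], area) >= threshold]


highlight = (100, 100, 200, 112)
words = [
    (110, 100, 150, 112, "inside"),   # fully covered
    (180, 100, 220, 112, "half"),     # 22 of 40 units wide covered: 0.55
    (190, 100, 230, 112, "edge"),     # 12 of 40 units wide covered: 0.30
]
print(select_words(words, highlight))  # ['inside', 'half']
```

Lowering `threshold` in this sketch corresponds to the "Missing Words" advice in Troubleshooting: more partially covered words are kept, at the cost of possible over-extraction.
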
## Troubleshooting

### Common Issues

**Text Order Problems**: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.

**Missing Words**: Lower the overlap threshold or check if highlights are too light/transparent.

**Over-extraction**: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.

**Installation Issues**:
- Ensure Python 3.7+ is installed
- Try using virtual environment: `make venv-install`
- Check dependencies: `make verify`

**Permission Errors**:
- On Linux/Mac: Ensure PDF files are readable
- On Windows: Run as administrator if needed

### Debug Output
Run with detailed logging to see extraction decisions:
```bash
python main.py --test --debug
# or
make test-debug
```

### Getting Help
```bash
# Show all available commands
make help

# Check project status
make status

# Verify installation
make verify
```

## Contributing

1. Create a feature branch from main
2. Make your changes
3. Test with sample PDFs
4. Submit a pull request

## License

MIT License

## Support

For issues or questions, please open a GitHub issue.

# PDF Highlight Extraction Process - Step by Step

## Phase 1: Initialization and Setup
1. **Script Startup**: Check command line arguments for test mode
2. **Path Resolution**: Determine PDF file path (default or user input)
3. **File Validation**: Verify PDF file exists and is accessible
4. **Object Creation**: Initialize `PDFHighlightExtractor` with file path

## Phase 2: PDF Analysis and Loading
1. **Document Opening**: Load PDF using PyMuPDF (fitz) library
2. **Page Iteration**: Loop through each page in the document
3. **Annotation Discovery**: Find all annotations on each page
4. **Type Filtering**: Identify highlight-type annotations specifically

## Phase 3: Color Classification
1. **Color Extraction**: Get RGB values from annotation properties
2. **Color Normalization**: Convert to 0-255 range if needed
3. **Color Mapping**: Classify into 4 categories (Yellow, Pink, Green, Blue)
4. **Unknown Filtering**: Skip annotations with unrecognized colors

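One way to implement these four steps is nearest-reference-color matching, sketched below. The reference RGB values, the distance cutoff, and the function names are all assumptions chosen for illustration; the tool's actual color rules may differ. The normalization treats a triple whose components all lie in 0-1 as floats, which is ambiguous for values such as `(1, 1, 1)`.

```python
# Hypothetical reference colors in 0-255 RGB.
REFERENCE_COLORS = {
    "yellow": (255, 255, 0),
    "pink": (255, 105, 180),
    "green": (0, 255, 0),
    "blue": (0, 128, 255),
}
MAX_DISTANCE = 120.0  # assumed cutoff; colors farther than this are "unknown"


def normalize_rgb(rgb):
    """Convert 0-1 floats to the 0-255 range; pass 0-255 values through."""
    if all(0.0 <= c <= 1.0 for c in rgb):
        return tuple(round(c * 255) for c in rgb)
    return tuple(int(c) for c in rgb)


def classify_color(rgb):
    """Return the closest color name, or None if nothing is close enough."""
    if rgb is None:
        return None
    r, g, b = normalize_rgb(rgb)
    best_name, best_dist = None, float("inf")
    for name, (rr, gg, bb) in REFERENCE_COLORS.items():
        dist = ((r - rr) ** 2 + (g - gg) ** 2 + (b - bb) ** 2) ** 0.5
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= MAX_DISTANCE else None


print(classify_color((1.0, 1.0, 0.0)))   # yellow
print(classify_color((255, 192, 203)))   # pink
print(classify_color((0.5, 0.5, 0.5)))   # None
```
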
## Phase 4: Text Extraction (Multi-Method Approach)

### Method A: Built-in Sorting
1. **Rectangle Expansion**: Add 2-pixel buffer around highlight area
2. **PyMuPDF Extraction**: Use `page.get_text("text", sort=True)`
3. **Text Cleaning**: Remove extra whitespace and normalize
4. **Success Check**: Return if valid text found

### Method B: Text Block Extraction
1. **Block Discovery**: Get text blocks from highlight area
2. **Geometric Sorting**: Sort blocks by Y-position, then X-position
3. **Block Combination**: Join block texts with spaces
4. **Quality Check**: Verify result makes sense

### Method C: Enhanced Word Sorting
1. **Word Collection**: Get all words intersecting highlight area
2. **Overlap Calculation**: Calculate intersection ratio for each word
3. **Threshold Filtering**: Include words with 40%+ overlap
4. **Line Detection**: Group words by Y-position (5-pixel tolerance)
5. **Line Sorting**: Sort lines top-to-bottom
6. **Word Sorting**: Sort words left-to-right within each line
7. **Text Assembly**: Combine words in proper reading order

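The line detection and sorting steps of Method C (steps 4 to 7) can be sketched as follows, assuming the words have already passed the overlap filter. Words are `(x0, y0, x1, y1, text)` tuples, matching the leading fields of what PyMuPDF's `page.get_text("words")` returns. The function name `order_words` and the choice of the first word's top edge as the line's reference position are assumptions.

```python
def order_words(words, line_tolerance=5.0):
    """Return the words' text in reading order.

    Words whose top edge lies within `line_tolerance` of a line's
    reference position are treated as belonging to that line.
    """
    lines = []  # each entry: [reference_y, [word, ...]]
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word[1], [word]])

    ordered = []
    for _, line_words in lines:  # lines are already top-to-bottom
        ordered.extend(w[4] for w in sorted(line_words, key=lambda w: w[0]))
    return " ".join(ordered)


words = [
    (150, 101.0, 190, 113, "world"),
    (100, 100.0, 140, 112, "hello"),
    (140, 116.5, 170, 128, "line"),
    (100, 115.0, 130, 127, "next"),
]
print(order_words(words))  # hello world next line
```

Sorting by exact Y-position alone would put "world" after "hello" here only by luck; grouping with a tolerance keeps slightly misaligned baselines on the same line before ordering left to right.
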
## Phase 5: Hyphenation Detection and Merging
1. **Pattern Recognition**: Look for highlights ending with '-'
2. **Proximity Check**: Verify next highlight is same color and nearby
3. **Distance Validation**: Check reasonable line spacing (8-30 pixels)
4. **Page Handling**: Support both same-page and cross-page hyphenation
5. **Text Joining**: Remove hyphen and combine words seamlessly

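A minimal sketch of the merging logic, under stated assumptions: highlights are dictionaries with `page`, `y` (top of the highlight), `color`, and `text` keys, they are already in reading order, and a cross-page continuation is simply a same-colored highlight on the following page. The names `should_merge` and `merge_hyphenated` and the `merged` marker are hypothetical.

```python
def should_merge(first, second, min_gap=8.0, max_gap=30.0):
    """Decide whether `second` continues a hyphenated word from `first`."""
    if not first["text"].rstrip().endswith("-"):
        return False
    if first["color"] != second["color"]:
        return False
    if second["page"] == first["page"] + 1:
        return True  # assumed cross-page rule: continuation on the next page
    if second["page"] != first["page"]:
        return False
    gap = second["y"] - first["y"]
    return min_gap <= gap <= max_gap  # plausible distance to the next line


def merge_hyphenated(highlights):
    """Join hyphen-split highlights, dropping the trailing hyphen."""
    merged = []
    for item in highlights:
        if merged and should_merge(merged[-1], item):
            previous = merged[-1]
            joined = dict(previous)
            joined["text"] = previous["text"].rstrip()[:-1] + item["text"].lstrip()
            same_page = item["page"] == previous["page"]
            joined["merged"] = "hyphen-merged" if same_page else "cross-page"
            merged[-1] = joined
        else:
            merged.append(item)
    return merged


sample = [
    {"page": 35, "y": 100.0, "color": "pink", "text": "lin-"},
    {"page": 35, "y": 114.0, "color": "pink", "text": "guistics itself"},
    {"page": 35, "y": 300.0, "color": "yellow", "text": "unrelated"},
]
for h in merge_hyphenated(sample):
    print(h["text"], h.get("merged", ""))
```

A rule this simple would also join genuinely hyphenated compounds such as "well-" + "known" into "wellknown", which is presumably why the tool only merges clear hyphenation patterns.
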
## Phase 6: Data Organization
1. **Highlight Storage**: Create structured data objects for each highlight
2. **Sorting**: Order by page number, then Y-position, then X-position
3. **Merging**: Apply hyphenation merging where detected
4. **Categorization**: Separate annotations from background highlights

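The sorting step amounts to a three-part sort key. The sketch below also groups the sorted highlights by page, which is convenient for page-by-page display. The dictionary keys `page`, `y`, and `x` and the function name `organize` are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def organize(highlights):
    """Sort by page, then Y, then X, and group the result by page number."""
    ordered = sorted(highlights, key=lambda h: (h["page"], h["y"], h["x"]))
    return {
        page: list(items)
        for page, items in groupby(ordered, key=itemgetter("page"))
    }


sample = [
    {"page": 2, "y": 50.0, "x": 72.0, "text": "third"},
    {"page": 1, "y": 300.0, "x": 72.0, "text": "second"},
    {"page": 1, "y": 100.0, "x": 200.0, "text": "first"},
]
for page, items in organize(sample).items():
    print(page, [h["text"] for h in items])
```

`groupby` only groups adjacent items, so it has to run on the sorted list rather than on the raw input.
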
## Phase 7: Output Generation

### Terminal Display
1. **Page Grouping**: Organize results by page number
2. **Color Coding**: Apply terminal colors for visual distinction
3. **Status Indicators**: Show merge status (hyphen-merged, cross-page)
4. **Formatting**: Clean, readable text presentation

### File Export (Optional)
1. **JSON Generation**: Structure data with metadata
2. **CSV Creation**: Tabular format for analysis
3. **File Writing**: Save to specified output paths

## Phase 8: Cleanup and Reporting
1. **Resource Cleanup**: Close PDF document properly
2. **Statistics**: Report extraction counts and timing
3. **Status Messages**: Provide user feedback on results
4. **Memory Management**: Clean up temporary objects

## Error Handling Throughout
- **Try/Except Blocks**: Graceful handling of PDF parsing errors
- **Fallback Methods**: Alternative extraction approaches
- **Validation Checks**: Verify data integrity at each step
- **User Feedback**: Clear error messages and debugging info

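The combination of exception handling and fallback methods could be organized as a simple chain that tries each extraction method in turn. This is a sketch of the general pattern, not the tool's actual control flow; `extract_with_fallback` and the three stand-in methods are hypothetical.

```python
def extract_with_fallback(methods, *args):
    """Try each (name, callable) in order and return the first usable text.

    Returns (text, method_name, errors). A method that raises or yields
    only whitespace is skipped; raised errors are collected for debugging.
    """
    errors = []
    for name, method in methods:
        try:
            text = method(*args)
        except Exception as exc:  # keep going: a later method may succeed
            errors.append(f"{name}: {exc}")
            continue
        cleaned = " ".join(text.split()) if text else ""
        if cleaned:
            return cleaned, name, errors
    return "", None, errors


# Stand-ins for Methods A, B, and C.
def method_a(_rect):
    raise RuntimeError("sort failed")


def method_b(_rect):
    return "   "


def method_c(_rect):
    return "We end  with some\nspecific suggestions"


text, used, errors = extract_with_fallback(
    [("A", method_a), ("B", method_b), ("C", method_c)],
    (72, 100, 300, 112),
)
print(used, repr(text), errors)
```

Returning the method name alongside the text gives debug output a direct way to report which extraction method worked.
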
## Debug Information
- **Overlap Ratios**: Show word inclusion/exclusion decisions
- **Method Success**: Indicate which extraction method worked
- **Hyphenation Detection**: Log when word merging occurs
- **Performance Timing**: Track processing duration