2025-07-11 09:32:01 -08:00

PDF Highlight Extractor

A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.

Features

  • 4-Color Support: Extracts Yellow, Pink, Green, and Blue highlights
  • Smart Text Ordering: Fixes PDF text extraction order issues using multiple methods
  • Hyphenation Merging: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
  • Precise Boundaries: Configurable overlap detection to avoid over-extraction
  • Multiple Extraction Methods: Fallback system for maximum compatibility
  • Cross-page Support: Handles highlights that span multiple pages
  • Test Mode: Quick testing with default settings
  • Export Options: JSON and CSV output formats

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Quick Installation

  1. Clone the repository:

    git clone <repository-url>
    cd HiLiteHero
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    Or install manually:

    pip install PyMuPDF colorama
    

Alternative Installation Methods

Using virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Using conda:

conda create -n hilitehero python=3.9
conda activate hilitehero
pip install -r requirements.txt

Verify Installation

python main.py --test

This should process the default test file and create a JSON output file.

Quick Start with Makefile

The project includes a comprehensive Makefile for easy development and testing:

Essential Commands

# Show all available commands
make help

# Quick test (recommended first run)
make test

# Interactive mode
make run

# Development mode (debug + interactive)
make dev

# Install dependencies
make install

# Clean up generated files
make clean

Common Workflows

First-time setup:

make install    # Install dependencies
make test       # Verify everything works

Development workflow:

make dev        # Start development mode
make clean      # Clean up when done

Batch processing:

make batch      # Process default file silently
make batch-file FILE=document.pdf  # Process specific file

Code quality:

make format     # Format code
make lint       # Check code quality
make check      # Run all checks

Advanced Makefile Usage

Process specific pages:

make run-pages FILE=document.pdf PAGES="1,3-5"

Test different modes:

make test-interactive  # Test with interactive review
make test-debug        # Test with debug output
make test-silent       # Test silently

Project management:

make status      # Show project status
make docs        # Show documentation
make list-pdfs   # List available PDF files
make list-outputs # Show recent outputs

Dependencies

  • PyMuPDF (fitz) - PDF processing and text extraction
  • pdfplumber - Additional PDF annotation support
  • colorama - Colored terminal output
  • pandas - CSV export functionality

Usage

Quick Start

Test Mode (Recommended for first-time users):

python main.py --test

Uses default test file and automatically saves results to JSON.

Interactive Mode:

python main.py

Prompts for PDF file path and provides interactive review options.

Process Specific PDF:

python main.py path/to/your/document.pdf

Command Line Options

Flag Description Example
--test, -t Test mode with default settings python main.py -t
--interactive, -i Enable interactive review mode python main.py -i document.pdf
--pages, -p Process specific pages python main.py -p "1,3-5" doc.pdf
--silent, -s Minimal output, auto-save JSON python main.py -s
--debug, -d Enable detailed debug output python main.py -d document.pdf
--output-json Custom JSON output path python main.py --output-json results.json

Usage Examples

Basic extraction:

python main.py document.pdf

Process specific pages with interactive review:

python main.py document.pdf -p "1,5-7" -i

Silent mode for batch processing:

python main.py document.pdf -s --output-json batch_results.json

Debug mode for troubleshooting:

python main.py document.pdf -d

Test with custom output:

python main.py -t --output-json test_results.json

Interactive Review Mode

When using -i flag, you can:

  • [N]ext - Move to next highlight
  • [P]rev - Move to previous highlight
  • [U]p - Move highlight up in order
  • [M]ove Down - Move highlight down in order
  • [C]olor - Change highlight color classification
  • [E]dit - Edit highlight text
  • [D]elete - Remove highlight
  • [O]pen Img - View page image
  • [S]ave&Exit - Save changes and exit
  • [Q]uit - Quit without saving

Output Formats

Terminal Display

📄 Page 35 🎨 YELLOW "We end with some specific suggestions for what we can do as linguists" 🎨 PINK (hyphen-merged) "linguistics itself"

JSON Export

{ "highlights": [ { "page": 35, "text": "We end with some specific suggestions", "color": "yellow", "type": "highlight" } ] }

CSV Export

Tabular format with columns: page, text, color, type, category

Technical Features

Text Ordering Algorithm

  1. Method A: PyMuPDF built-in sorting
  2. Method B: Text block extraction with geometric sorting
  3. Method C: Enhanced word-level sorting with line detection

Hyphenation Detection

  • Same-page: Detects hyphens within 8-30 pixel line spacing
  • Cross-page: Handles hyphenation across page boundaries
  • Smart merging: Only merges clear hyphenation patterns

Precision Control

  • Overlap Threshold: 40% word overlap required for inclusion
  • Boundary Expansion: +2 pixel expansion for edge words
  • Line Tolerance: 5-pixel tolerance for same-line detection

Troubleshooting

Common Issues

Text Order Problems: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.

Missing Words: Lower the overlap threshold or check if highlights are too light/transparent.

Over-extraction: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.

Installation Issues:

  • Ensure Python 3.7+ is installed
  • Try using virtual environment: make venv-install
  • Check dependencies: make verify

Permission Errors:

  • On Linux/Mac: Ensure PDF files are readable
  • On Windows: Run as administrator if needed

Debug Output

Run with detailed logging to see extraction decisions:

python main.py --test --debug
# or
make test-debug

Getting Help

# Show all available commands
make help

# Check project status
make status

# Verify installation
make verify

Contributing

  1. Create a feature branch from main
  2. Make your changes
  3. Test with sample PDFs
  4. Submit a pull request

License

MIT License

Support

For issues or questions, please open a GitHub issue.

PDF Highlight Extraction Process - Step by Step

Phase 1: Initialization and Setup

  1. Script Startup: Check command line arguments for test mode
  2. Path Resolution: Determine PDF file path (default or user input)
  3. File Validation: Verify PDF file exists and is accessible
  4. Object Creation: Initialize PDFHighlightExtractor with file path

Phase 2: PDF Analysis and Loading

  1. Document Opening: Load PDF using PyMuPDF (fitz) library
  2. Page Iteration: Loop through each page in the document
  3. Annotation Discovery: Find all annotations on each page
  4. Type Filtering: Identify highlight-type annotations specifically

Phase 3: Color Classification

  1. Color Extraction: Get RGB values from annotation properties
  2. Color Normalization: Convert to 0-255 range if needed
  3. Color Mapping: Classify into 4 categories (Yellow, Pink, Green, Blue)
  4. Unknown Filtering: Skip annotations with unrecognized colors

Phase 4: Text Extraction (Multi-Method Approach)

Method A: Built-in Sorting

  1. Rectangle Expansion: Add 2-pixel buffer around highlight area
  2. PyMuPDF Extraction: Use page.get_text("text", sort=True)
  3. Text Cleaning: Remove extra whitespace and normalize
  4. Success Check: Return if valid text found

Method B: Text Block Extraction

  1. Block Discovery: Get text blocks from highlight area
  2. Geometric Sorting: Sort blocks by Y-position, then X-position
  3. Block Combination: Join block texts with spaces
  4. Quality Check: Verify result makes sense

Method C: Enhanced Word Sorting

  1. Word Collection: Get all words intersecting highlight area
  2. Overlap Calculation: Calculate intersection ratio for each word
  3. Threshold Filtering: Include words with 40%+ overlap
  4. Line Detection: Group words by Y-position (5-pixel tolerance)
  5. Line Sorting: Sort lines top-to-bottom
  6. Word Sorting: Sort words left-to-right within each line
  7. Text Assembly: Combine words in proper reading order

Phase 5: Hyphenation Detection and Merging

  1. Pattern Recognition: Look for highlights ending with '-'
  2. Proximity Check: Verify next highlight is same color and nearby
  3. Distance Validation: Check reasonable line spacing (8-30 pixels)
  4. Page Handling: Support both same-page and cross-page hyphenation
  5. Text Joining: Remove hyphen and combine words seamlessly

Phase 6: Data Organization

  1. Highlight Storage: Create structured data objects for each highlight
  2. Sorting: Order by page number, then Y-position, then X-position
  3. Merging: Apply hyphenation merging where detected
  4. Categorization: Separate annotations from background highlights

Phase 7: Output Generation

Terminal Display

  1. Page Grouping: Organize results by page number
  2. Color Coding: Apply terminal colors for visual distinction
  3. Status Indicators: Show merge status (hyphen-merged, cross-page)
  4. Formatting: Clean, readable text presentation

File Export (Optional)

  1. JSON Generation: Structure data with metadata
  2. CSV Creation: Tabular format for analysis
  3. File Writing: Save to specified output paths

Phase 8: Cleanup and Reporting

  1. Resource Cleanup: Close PDF document properly
  2. Statistics: Report extraction counts and timing
  3. Status Messages: Provide user feedback on results
  4. Memory Management: Clean up temporary objects

Error Handling Throughout

  • Try-Catch Blocks: Graceful handling of PDF parsing errors
  • Fallback Methods: Alternative extraction approaches
  • Validation Checks: Verify data integrity at each step
  • User Feedback: Clear error messages and debugging info

Debug Information

  • Overlap Ratios: Show word inclusion/exclusion decisions
  • Method Success: Indicate which extraction method worked
  • Hyphenation Detection: Log when word merging occurs
  • Performance Timing: Track processing duration
Description
No description provided
Readme 2.9 MiB
Languages
Python 85.3%
Makefile 14.7%