ilia/hilitehero

PDF Highlight Extractor

A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.

Features

4-Color Support: Extracts Yellow, Pink, Green, and Blue highlights
Smart Text Ordering: Fixes PDF text extraction order issues using multiple methods
Hyphenation Merging: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
Precise Boundaries: Configurable overlap detection to avoid over-extraction
Multiple Extraction Methods: Fallback system for maximum compatibility
Cross-page Support: Handles highlights that span multiple pages
Test Mode: Quick testing with default settings
Export Options: JSON and CSV output formats

Installation

Clone the repository and change into its directory:

git clone <repository-url>
cd pdf-highlight-extractor

Install required packages: pip install PyMuPDF pdfplumber colorama pandas

Dependencies

PyMuPDF (fitz) - PDF processing and text extraction
pdfplumber - Additional PDF annotation support
colorama - Colored terminal output
pandas - CSV export functionality

Usage

Quick Test Mode

python highlight_extractor.py --test

Uses the default file /mnt/c/Users/admin/Downloads/test2.pdf and prints the results to the terminal without exporting them.

Interactive Mode

python highlight_extractor.py

Prompts for PDF file path and output options.

Command Line Flags

--test, -t, or test - Enable test mode with defaults
No flags - Full interactive mode
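The flag handling above can be sketched as a plain check on the argument list. This is a minimal sketch, assuming the script inspects `sys.argv` directly; `TEST_FLAGS` and `is_test_mode` are hypothetical names, not taken from the script.

```python
import sys

# The three spellings listed above all enable test mode.
TEST_FLAGS = {"--test", "-t", "test"}

def is_test_mode(argv):
    """Return True if any argument after the script name is a test flag."""
    return any(arg in TEST_FLAGS for arg in argv[1:])

# With no flags the script would fall through to interactive mode.
mode = "test" if is_test_mode(sys.argv) else "interactive"
print(f"Running in {mode} mode")
```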

Output Formats

Terminal Display

📄 Page 35
🎨 YELLOW "We end with some specific suggestions for what we can do as linguists"
🎨 PINK (hyphen-merged) "linguistics itself"

JSON Export

{
  "highlights": [
    {
      "page": 35,
      "text": "We end with some specific suggestions",
      "color": "yellow",
      "type": "highlight"
    }
  ]
}
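A structure like this can be produced with the standard `json` module. The sketch below assumes each highlight is held as a plain dict; `highlights_to_json` is a hypothetical helper name.

```python
import json

def highlights_to_json(highlights):
    """Serialize highlight dicts under a top-level "highlights" key."""
    # ensure_ascii=False keeps non-ASCII characters from the PDF readable.
    return json.dumps({"highlights": highlights}, indent=2, ensure_ascii=False)

sample = [
    {
        "page": 35,
        "text": "We end with some specific suggestions",
        "color": "yellow",
        "type": "highlight",
    }
]
print(highlights_to_json(sample))
```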

CSV Export

Tabular format with columns: page, text, color, type, category
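The column layout can be illustrated with the standard `csv` module. Note that the tool itself lists pandas for CSV export; this sketch swaps in the standard library to stay dependency-free. The helper name and the sample `category` value are assumptions.

```python
import csv
import io

COLUMNS = ["page", "text", "color", "type", "category"]

def highlights_to_csv(highlights):
    """Return CSV text with one row per highlight and a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for h in highlights:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({col: h.get(col, "") for col in COLUMNS})
    return buffer.getvalue()

rows = [{"page": 35, "text": "linguistics itself", "color": "pink",
         "type": "highlight", "category": "annotation"}]  # hypothetical category value
print(highlights_to_csv(rows))
```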

Technical Features

Text Ordering Algorithm

Method A: PyMuPDF built-in sorting
Method B: Text block extraction with geometric sorting
Method C: Enhanced word-level sorting with line detection

Hyphenation Detection

Same-page: Detects hyphens within 8-30 pixel line spacing
Cross-page: Handles hyphenation across page boundaries
Smart merging: Only merges clear hyphenation patterns

Precision Control

Overlap Threshold: 40% word overlap required for inclusion
Boundary Expansion: +2 pixel expansion for edge words
Line Tolerance: 5-pixel tolerance for same-line detection
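The overlap threshold and boundary expansion can be sketched as a rectangle-intersection test. This assumes the ratio is measured against the word's own area and that rectangles are `(x0, y0, x1, y1)` tuples; the function names are hypothetical.

```python
def overlap_ratio(word_rect, highlight_rect, expand=2.0):
    """Fraction of the word's area covered by the expanded highlight rectangle."""
    wx0, wy0, wx1, wy1 = word_rect
    hx0, hy0, hx1, hy1 = highlight_rect
    # Grow the highlight on every side so edge words are not clipped.
    hx0, hy0, hx1, hy1 = hx0 - expand, hy0 - expand, hx1 + expand, hy1 + expand
    inter_w = max(0.0, min(wx1, hx1) - max(wx0, hx0))
    inter_h = max(0.0, min(wy1, hy1) - max(wy0, hy0))
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0

def is_included(word_rect, highlight_rect, threshold=0.40):
    """Apply the 40% inclusion rule."""
    return overlap_ratio(word_rect, highlight_rect) >= threshold

highlight = (100, 100, 200, 112)
print(is_included((180, 100, 220, 112), highlight))  # mostly inside the highlight
print(is_included((190, 100, 230, 112), highlight))  # mostly outside the highlight
```

Lowering `threshold` admits more edge words, which is the adjustment suggested under Troubleshooting for missing words.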

Troubleshooting

Common Issues

Text Order Problems: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.

Missing Words: Lower the overlap threshold or check if highlights are too light/transparent.

Over-extraction: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.

Debug Output

Run with detailed logging to see extraction decisions: python highlight_extractor.py --test

Contributing

Create a feature branch from main
Make your changes
Test with sample PDFs
Submit a pull request

License

MIT License

Support

For issues or questions, please open a GitHub issue.

PDF Highlight Extraction Process - Step by Step

Phase 1: Initialization and Setup

Script Startup: Check command line arguments for test mode
Path Resolution: Determine PDF file path (default or user input)
File Validation: Verify PDF file exists and is accessible
Object Creation: Initialize PDFHighlightExtractor with file path

Phase 2: PDF Analysis and Loading

Document Opening: Load PDF using PyMuPDF (fitz) library
Page Iteration: Loop through each page in the document
Annotation Discovery: Find all annotations on each page
Type Filtering: Identify highlight-type annotations specifically

Phase 3: Color Classification

Color Extraction: Get RGB values from annotation properties
Color Normalization: Convert components to the 0-255 range if needed (PyMuPDF reports annotation colors as floats between 0 and 1)
Color Mapping: Classify into 4 categories (Yellow, Pink, Green, Blue)
Unknown Filtering: Skip annotations with unrecognized colors
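One way to sketch the classification step is a nearest-reference-color lookup. The reference RGB values and the `max_distance` cutoff below are illustrative assumptions, not the tool's actual thresholds.

```python
# Hypothetical reference colors in 0-255 RGB.
REFERENCE_COLORS = {
    "yellow": (255, 255, 0),
    "pink": (255, 105, 180),
    "green": (0, 255, 0),
    "blue": (0, 128, 255),
}

def normalize_rgb(rgb):
    """Scale 0-1 float components to 0-255; pass 0-255 values through."""
    if all(0.0 <= c <= 1.0 for c in rgb):
        return tuple(round(c * 255) for c in rgb)
    return tuple(int(c) for c in rgb)

def classify_color(rgb, max_distance=150.0):
    """Return the nearest color name, or None if nothing is close enough."""
    r, g, b = normalize_rgb(rgb)
    best_name, best_dist = None, float("inf")
    for name, (rr, gg, bb) in REFERENCE_COLORS.items():
        dist = ((r - rr) ** 2 + (g - gg) ** 2 + (b - bb) ** 2) ** 0.5
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Unrecognized colors are reported as None so the caller can skip them.
    return best_name if best_dist <= max_distance else None

print(classify_color((1.0, 1.0, 0.0)))
```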

Phase 4: Text Extraction (Multi-Method Approach)

Method A: Built-in Sorting

Rectangle Expansion: Add 2-pixel buffer around highlight area
PyMuPDF Extraction: Use page.get_text("text", sort=True)
Text Cleaning: Remove extra whitespace and normalize
Success Check: Return if valid text found
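Method A might look roughly like the sketch below. It assumes the expanded rectangle is passed as the `clip` argument of PyMuPDF's `page.get_text`; the helper name and the tuple-based rectangle are assumptions.

```python
def extract_with_builtin_sort(page, rect, pad=2.0):
    """Extract text inside `rect` using PyMuPDF's own reading-order sort.

    `page` is expected to be a PyMuPDF page; `rect` is (x0, y0, x1, y1).
    Returns cleaned text, or None so the caller can try the next method.
    """
    x0, y0, x1, y1 = rect
    clip = (x0 - pad, y0 - pad, x1 + pad, y1 + pad)  # 2-unit buffer on each side
    raw = page.get_text("text", clip=clip, sort=True)
    cleaned = " ".join(raw.split())  # collapse newlines and repeated spaces
    return cleaned or None
```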

Method B: Text Block Extraction

Block Discovery: Get text blocks from highlight area
Geometric Sorting: Sort blocks by Y-position, then X-position
Block Combination: Join block texts with spaces
Quality Check: Verify result makes sense
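The geometric sort in Method B can be sketched on the tuples returned by PyMuPDF's `page.get_text("blocks")`, which have the form `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 0 for text. The helper name and the rounding of the Y coordinate are assumptions.

```python
def join_blocks(blocks):
    """Sort text blocks top-to-bottom, then left-to-right, and join them."""
    # Keep non-empty text blocks only (block_type 0); image blocks are type 1.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    # Round Y so blocks on the same baseline compare equal before X is used.
    text_blocks.sort(key=lambda b: (round(b[1], 1), b[0]))
    return " ".join(" ".join(b[4].split()) for b in text_blocks)

blocks = [
    (300, 100, 400, 112, "second\n", 1, 0),
    (50, 100, 150, 112, "first ", 0, 0),
    (50, 130, 150, 142, "third", 2, 0),
]
print(join_blocks(blocks))
```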

Method C: Enhanced Word Sorting

Word Collection: Get all words intersecting highlight area
Overlap Calculation: Calculate intersection ratio for each word
Threshold Filtering: Include words with 40%+ overlap
Line Detection: Group words by Y-position (5-pixel tolerance)
Line Sorting: Sort lines top-to-bottom
Word Sorting: Sort words left-to-right within each line
Text Assembly: Combine words in proper reading order
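The line-detection and sorting steps of Method C can be sketched as follows. Words are assumed to be tuples starting with `(x0, y0, x1, y1, text)`, as returned by PyMuPDF's `page.get_text("words")`, and already filtered by overlap. Comparing each word against the first word of the current line is one possible grouping rule, not necessarily the tool's.

```python
def order_words(words, line_tolerance=5.0):
    """Group words into lines by Y position, then read each line left-to-right."""
    lines = []  # each entry: [reference_y, [words on that line]]
    for word in sorted(words, key=lambda w: w[1]):
        if lines and abs(word[1] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word[1], [word]])
    ordered = []
    for _, line_words in lines:  # lines are already top-to-bottom
        ordered.extend(w[4] for w in sorted(line_words, key=lambda w: w[0]))
    return " ".join(ordered)

words = [
    (120, 101.5, 150, 112, "with"),
    (50, 100, 80, 112, "We"),
    (85, 99, 115, 112, "end"),
    (50, 115, 90, 127, "some"),
]
print(order_words(words))
```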

Phase 5: Hyphenation Detection and Merging

Pattern Recognition: Look for highlights ending with '-'
Proximity Check: Verify next highlight is same color and nearby
Distance Validation: Check reasonable line spacing (8-30 pixels)
Page Handling: Support both same-page and cross-page hyphenation
Text Joining: Remove hyphen and combine words seamlessly
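The merging rules above can be sketched with a small data class. The `Highlight` fields, the helper names, and the rule that a cross-page continuation must sit on the immediately following page are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Highlight:
    page: int
    y: float
    text: str
    color: str
    merged: bool = False

def can_merge(first, second, min_gap=8.0, max_gap=30.0):
    """Check the hyphen, color, and spacing conditions for a merge."""
    if not first.text.rstrip().endswith("-") or first.color != second.color:
        return False
    if second.page == first.page:
        return min_gap <= second.y - first.y <= max_gap
    return second.page == first.page + 1  # assumed cross-page rule

def merge_hyphenated(highlights):
    """Join 'lin-' + 'guistics' style pairs; input must be in reading order."""
    result = []
    for h in highlights:
        if result and can_merge(result[-1], h):
            prev = result[-1]
            joined = prev.text.rstrip()[:-1] + h.text.lstrip()  # drop the hyphen
            result[-1] = Highlight(prev.page, prev.y, joined, prev.color, True)
        else:
            result.append(h)
    return result

items = [Highlight(35, 100, "lin-", "pink"), Highlight(35, 114, "guistics itself", "pink")]
print(merge_hyphenated(items)[0].text)
```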

Phase 6: Data Organization

Highlight Storage: Create structured data objects for each highlight
Sorting: Order by page number, then Y-position, then X-position
Merging: Apply hyphenation merging where detected
Categorization: Separate annotations from background highlights
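The sorting step reduces to a three-part sort key. A minimal sketch, assuming each highlight is a dict with hypothetical `page`, `y`, and `x` keys:

```python
def sort_highlights(highlights):
    """Order by page number, then Y position, then X position."""
    return sorted(highlights, key=lambda h: (h["page"], h["y"], h["x"]))

unsorted = [
    {"page": 2, "y": 50.0, "x": 72.0, "text": "third"},
    {"page": 1, "y": 300.0, "x": 200.0, "text": "second"},
    {"page": 1, "y": 300.0, "x": 72.0, "text": "first"},
]
print([h["text"] for h in sort_highlights(unsorted)])
```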

Phase 7: Output Generation

Terminal Display

Page Grouping: Organize results by page number
Color Coding: Apply terminal colors for visual distinction
Status Indicators: Show merge status (hyphen-merged, cross-page)
Formatting: Clean, readable text presentation
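The page grouping and status indicators can be sketched without the color layer. The tool uses colorama for terminal colors; this sketch omits it and builds only the text. The dict keys and the `format_highlights` name are assumptions.

```python
from itertools import groupby

def format_highlights(highlights):
    """Build the terminal listing: one page header, then one line per highlight."""
    lines = []
    ordered = sorted(highlights, key=lambda h: h["page"])  # stable within a page
    for page, group in groupby(ordered, key=lambda h: h["page"]):
        lines.append(f"📄 Page {page}")
        for h in group:
            status = f" ({h['status']})" if h.get("status") else ""
            lines.append(f'🎨 {h["color"].upper()}{status} "{h["text"]}"')
    return "\n".join(lines)

demo = [
    {"page": 35, "color": "yellow", "text": "We end with some specific suggestions"},
    {"page": 35, "color": "pink", "text": "linguistics itself", "status": "hyphen-merged"},
]
print(format_highlights(demo))
```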

File Export (Optional)

JSON Generation: Structure data with metadata
CSV Creation: Tabular format for analysis
File Writing: Save to specified output paths

Phase 8: Cleanup and Reporting

Resource Cleanup: Close PDF document properly
Statistics: Report extraction counts and timing
Status Messages: Provide user feedback on results
Memory Management: Clean up temporary objects

Error Handling Throughout

Try/Except Blocks: Graceful handling of PDF parsing errors
Fallback Methods: Alternative extraction approaches
Validation Checks: Verify data integrity at each step
User Feedback: Clear error messages and debugging info
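The fallback behavior can be sketched as a loop over extraction methods that moves on whenever one raises or returns nothing usable. `extract_with_fallbacks` and the stand-in methods below are hypothetical; the real methods would be the three extraction approaches from Phase 4.

```python
def extract_with_fallbacks(methods, *args):
    """Try each (name, function) pair in order; return the first usable text."""
    for name, method in methods:
        try:
            text = method(*args)
        except Exception as exc:  # one failing method should not abort the run
            print(f"{name} failed: {exc}")
            continue
        if text and text.strip():
            return name, text.strip()
    return None, ""

def broken_method(rect):
    raise ValueError("could not parse text layer")

def empty_method(rect):
    return "   "

def working_method(rect):
    return " linguistics itself "

chain = [("A", broken_method), ("B", empty_method), ("C", working_method)]
print(extract_with_fallbacks(chain, (0, 0, 10, 10)))
```

Returning the method name alongside the text also covers the "Method Success" debug item, since the caller can log which method produced the result.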

Debug Information

Overlap Ratios: Show word inclusion/exclusion decisions
Method Success: Indicate which extraction method worked
Hyphenation Detection: Log when word merging occurs
Performance Timing: Track processing duration