add documentation and update readme.

fix extraction issues. Add test param.
2025-05-27 11:55:20 -04:00 · 2025-05-27 11:55:20 -04:00 · 70e3b66c95
commit 70e3b66c95
parent cf79da27b4
2 changed files with 912 additions and 540 deletions
--- a/README.md
+++ b/README.md
@ -1 +1,204 @@
 # PDF Highlight Extractor
+
+A Python tool for extracting highlighted text from PDF files with precise text ordering and intelligent hyphenation handling.
+
+## Features
+
+- **4-Color Support**: Extracts Yellow, Pink, Green, and Blue highlights
+- **Smart Text Ordering**: Fixes PDF text extraction order issues using multiple methods
+- **Hyphenation Merging**: Automatically combines hyphenated words across lines ("lin-" + "guistics" → "linguistics")
+- **Precise Boundaries**: Configurable overlap detection to avoid over-extraction
+- **Multiple Extraction Methods**: Fallback system for maximum compatibility
+- **Cross-page Support**: Handles highlights that span multiple pages
+- **Test Mode**: Quick testing with default settings
+- **Export Options**: JSON and CSV output formats
+
+## Installation
+
+Clone the repository:
+git clone <repository-url>
+cd pdf-highlight-extractor
+
+Install required packages:
+pip install PyMuPDF pdfplumber colorama pandas
+
+
+## Dependencies
+
+- PyMuPDF (fitz) - PDF processing and text extraction
+- pdfplumber - Additional PDF annotation support
+- colorama - Colored terminal output
+- pandas - CSV export functionality
+
+## Usage
+
+### Quick Test Mode
+python highlight_extractor.py --test
+
+Uses default file: `/mnt/c/Users/admin/Downloads/test2.pdf` and displays results only.
+
+### Interactive Mode
+python highlight_extractor.py
+
+Prompts for PDF file path and output options.
+
+### Command Line Flags
+- `--test`, `-t`, or `test` - Enable test mode with defaults
+- No flags - Full interactive mode
+
+## Output Formats
+
+### Terminal Display
+📄 Page 35
+🎨 YELLOW
+"We end with some specific suggestions for what we can do as linguists"
+🎨 PINK (hyphen-merged)
+"linguistics itself"
+
+### JSON Export
+{
+    "highlights": [
+        {
+            "page": 35,
+            "text": "We end with some specific suggestions",
+            "color": "yellow",
+            "type": "highlight"
+        }
+    ]
+}
+
+### CSV Export
+Tabular format with columns: page, text, color, type, category
+
+## Technical Features
+
+### Text Ordering Algorithm
+1. **Method A**: PyMuPDF built-in sorting
+2. **Method B**: Text block extraction with geometric sorting
+3. **Method C**: Enhanced word-level sorting with line detection
+
+### Hyphenation Detection
+- Same-page: Detects hyphens within 8-30 pixel line spacing
+- Cross-page: Handles hyphenation across page boundaries
+- Smart merging: Only merges clear hyphenation patterns
+
+### Precision Control
+- **Overlap Threshold**: 40% word overlap required for inclusion
+- **Boundary Expansion**: +2 pixel expansion for edge words
+- **Line Tolerance**: 5-pixel tolerance for same-line detection
+
+## Troubleshooting
+
+### Common Issues
+
+**Text Order Problems**: The tool uses multiple methods to fix PDF text ordering issues. If text still appears scrambled, the PDF may have complex layout encoding.
+
+**Missing Words**: Lower the overlap threshold or check if highlights are too light/transparent.
+
+**Over-extraction**: The tool is designed to avoid this, but very close text might be included. Check highlight precision in your PDF.
+
+### Debug Output
+Run with detailed logging to see extraction decisions:
+python highlight_extractor.py --test
+
+## Contributing
+
+1. Create a feature branch from main
+2. Make your changes
+3. Test with sample PDFs
+4. Submit a pull request
+
+## License
+
+MIT License
+
+## Support
+
+For issues or questions, please open a GitHub issue.
+
+# PDF Highlight Extraction Process - Step by Step
+
+## Phase 1: Initialization and Setup
+1. **Script Startup**: Check command line arguments for test mode
+2. **Path Resolution**: Determine PDF file path (default or user input)
+3. **File Validation**: Verify PDF file exists and is accessible
+4. **Object Creation**: Initialize PDFHighlightExtractor with file path
+
+## Phase 2: PDF Analysis and Loading
+1. **Document Opening**: Load PDF using PyMuPDF (fitz) library
+2. **Page Iteration**: Loop through each page in the document
+3. **Annotation Discovery**: Find all annotations on each page
+4. **Type Filtering**: Identify highlight-type annotations specifically
+
+## Phase 3: Color Classification
+1. **Color Extraction**: Get RGB values from annotation properties
+2. **Color Normalization**: Convert to 0-255 range if needed
+3. **Color Mapping**: Classify into 4 categories (Yellow, Pink, Green, Blue)
+4. **Unknown Filtering**: Skip annotations with unrecognized colors
+
+## Phase 4: Text Extraction (Multi-Method Approach)
+
+### Method A: Built-in Sorting
+1. **Rectangle Expansion**: Add 2-pixel buffer around highlight area
+2. **PyMuPDF Extraction**: Use page.get_text("text", sort=True)
+3. **Text Cleaning**: Remove extra whitespace and normalize
+4. **Success Check**: Return if valid text found
+
+### Method B: Text Block Extraction  
+1. **Block Discovery**: Get text blocks from highlight area
+2. **Geometric Sorting**: Sort blocks by Y-position, then X-position
+3. **Block Combination**: Join block texts with spaces
+4. **Quality Check**: Verify result makes sense
+
+### Method C: Enhanced Word Sorting
+1. **Word Collection**: Get all words intersecting highlight area
+2. **Overlap Calculation**: Calculate intersection ratio for each word
+3. **Threshold Filtering**: Include words with 40%+ overlap
+4. **Line Detection**: Group words by Y-position (5-pixel tolerance)
+5. **Line Sorting**: Sort lines top-to-bottom
+6. **Word Sorting**: Sort words left-to-right within each line
+7. **Text Assembly**: Combine words in proper reading order
+
+## Phase 5: Hyphenation Detection and Merging
+1. **Pattern Recognition**: Look for highlights ending with '-'
+2. **Proximity Check**: Verify next highlight is same color and nearby
+3. **Distance Validation**: Check reasonable line spacing (8-30 pixels)
+4. **Page Handling**: Support both same-page and cross-page hyphenation
+5. **Text Joining**: Remove hyphen and combine words seamlessly
+
+## Phase 6: Data Organization
+1. **Highlight Storage**: Create structured data objects for each highlight
+2. **Sorting**: Order by page number, then Y-position, then X-position
+3. **Merging**: Apply hyphenation merging where detected
+4. **Categorization**: Separate annotations from background highlights
+
+## Phase 7: Output Generation
+
+### Terminal Display
+1. **Page Grouping**: Organize results by page number
+2. **Color Coding**: Apply terminal colors for visual distinction
+3. **Status Indicators**: Show merge status (hyphen-merged, cross-page)
+4. **Formatting**: Clean, readable text presentation
+
+### File Export (Optional)
+1. **JSON Generation**: Structure data with metadata
+2. **CSV Creation**: Tabular format for analysis
+3. **File Writing**: Save to specified output paths
+
+## Phase 8: Cleanup and Reporting
+1. **Resource Cleanup**: Close PDF document properly
+2. **Statistics**: Report extraction counts and timing
+3. **Status Messages**: Provide user feedback on results
+4. **Memory Management**: Clean up temporary objects
+
+## Error Handling Throughout
+- **Try-Catch Blocks**: Graceful handling of PDF parsing errors
+- **Fallback Methods**: Alternative extraction approaches
+- **Validation Checks**: Verify data integrity at each step
+- **User Feedback**: Clear error messages and debugging info
+
+## Debug Information
+- **Overlap Ratios**: Show word inclusion/exclusion decisions
+- **Method Success**: Indicate which extraction method worked
+- **Hyphenation Detection**: Log when word merging occurs
+- **Performance Timing**: Track processing duration
--- a/main.py
+++ b/main.py