tanyar09 673f84d388 Add Indeed parsing strategy and enhance job search parser

- Introduced a new Indeed parsing strategy to support job extraction from Indeed, including advanced filtering options.
- Updated job search parser to include Indeed in the site strategies, allowing for combined searches with other job sites.
- Enhanced README documentation with detailed usage instructions for the Indeed parser, including examples for keyword and location filtering.
- Improved logging for Indeed parsing to provide insights into job extraction processes and potential CAPTCHA handling.

2025-12-18 14:01:06 -05:00


Job Search Parser - Job Market Intelligence

A specialized parser for job market intelligence that tracks job postings, market trends, and competitive signals, with a focus on tech roles and industry insights.

🎯 Purpose

The Job Search Parser is designed to:

Track Job Market Trends: Monitor demand for specific roles and skills
Competitive Intelligence: Analyze salary ranges and requirements
Industry Insights: Track hiring patterns across different sectors
Skill Gap Analysis: Identify in-demand technologies and frameworks
Market Demand Forecasting: Predict job market trends

🚀 Features

Core Functionality

Multi-Source Aggregation: Collect job data from multiple platforms
Role-Specific Tracking: Focus on tech roles and emerging positions
Skill Analysis: Extract and categorize required skills
Salary Intelligence: Track compensation ranges and trends
Company Intelligence: Monitor hiring companies and patterns

Advanced Features

Market Trend Analysis: Identify growing and declining job categories
Geographic Distribution: Track job distribution by location
Experience Level Analysis: Entry, mid, senior level tracking
Remote Work Trends: Monitor remote/hybrid work patterns
Technology Stack Tracking: Framework and tool popularity

🌐 Supported Job Sites

✅ Implemented Parsers

SkipTheDrive Parser

Remote job board specializing in work-from-home positions.

Features:

Keyword-based job search with relevance sorting
Job type filtering (full-time, part-time, contract)
Multi-page result parsing with pagination
Featured/sponsored job identification
AI-powered job relevance analysis
Automatic duplicate detection

Usage:

# Parse SkipTheDrive for QA automation jobs
node index.js --sites=skipthedrive --keywords="automation qa,qa engineer"

# Filter by job type
JOB_TYPES="full time,contract" node index.js --sites=skipthedrive

# Run demo with limited results
node index.js --sites=skipthedrive --demo

LinkedIn Jobs Parser

Professional network job postings with comprehensive job data.

Features:

LinkedIn authentication support
Keyword-based job search
Location filtering (both LinkedIn location and post-extraction filter)
Multi-page result parsing with pagination
Job type and experience level extraction
Automatic duplicate detection
Infinite scroll handling

Requirements:

LinkedIn credentials (username and password) must be set in the .env file:

LINKEDIN_USERNAME=******@gmail.com
LINKEDIN_PASSWORD=***
LINKEDIN_JOB_LOCATION=Canada  # Optional: LinkedIn location filter

Usage:

# Search LinkedIn jobs
node index.js --sites=linkedin --keywords="software engineer,developer"

# Search with location filter
node index.js --sites=linkedin --keywords="co-op" --location="Ontario"

# Search with date filter (jobs posted after specific date)
node index.js --sites=linkedin --keywords="co-op" --min-date="2025-12-01"

# Combine filters
node index.js --sites=linkedin --keywords="co-op" --location="Ontario" --min-date="2025-12-01"

# Combine multiple sites
node index.js --sites=linkedin,skipthedrive,indeed --keywords="intern,co-op"

# Use AND logic - jobs must match ALL keywords (e.g., "co-op" AND "summer 2026")
node index.js --sites=linkedin --keywords="co-op,summer 2026" --and

# Use grouped AND/OR logic - (co-op OR intern) AND (summer 2026)
# Use | (pipe) for OR within groups, , (comma) to separate AND groups
node index.js --sites=linkedin --keywords="co-op|intern,summer 2026" --and

# Multiple AND groups - (co-op OR intern) AND (summer 2026) AND (remote)
node index.js --sites=linkedin --keywords="co-op|intern,summer 2026,remote" --and

Date Filter Notes:

The date filter uses LinkedIn's f_TPR parameter to filter at the LinkedIn level before parsing
Format: YYYY-MM-DD (e.g., 2025-12-01)
LinkedIn supports relative timeframes up to ~30 days
For dates older than 30 days, LinkedIn may limit results to the maximum supported timeframe
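As a rough illustration of the notes above, the sketch below turns a YYYY-MM-DD minimum date into a relative window for f_TPR. It assumes the commonly seen `r<seconds>` value form (for example `r86400` for the last 24 hours) and caps the window at 30 days. `minDateToTpr` is a hypothetical helper, not the parser's actual function.

```javascript
// Hypothetical helper: convert a YYYY-MM-DD minimum date into an f_TPR value.
// Assumption: f_TPR accepts "r<seconds>", a relative window ending now.
const MAX_WINDOW_DAYS = 30; // cap taken from the notes above
const SECONDS_PER_DAY = 86400;

function minDateToTpr(minDate, now = new Date()) {
  const start = new Date(`${minDate}T00:00:00Z`);
  if (Number.isNaN(start.getTime())) {
    throw new Error(`Invalid date, expected YYYY-MM-DD: ${minDate}`);
  }
  const elapsedSeconds = Math.ceil((now.getTime() - start.getTime()) / 1000);
  const maxSeconds = MAX_WINDOW_DAYS * SECONDS_PER_DAY;
  // Clamp to at least one day and at most the 30-day maximum.
  const seconds = Math.min(Math.max(elapsedSeconds, SECONDS_PER_DAY), maxSeconds);
  return `r${seconds}`;
}
```

With this clamping, any date older than 30 days yields the same 30-day window, which matches the limitation described above.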

Indeed Parser

Comprehensive job aggregator with extensive job listings.

Features:

Keyword-based job search
Location filtering (both Indeed location and post-extraction filter)
Multi-page result parsing with pagination
Salary information extraction
Date filtering (jobs posted within the last 30 days)
Automatic duplicate detection
Job type and experience level support

Usage:

# Search Indeed jobs
node index.js --sites=indeed --keywords="software engineer,developer"

# Search with location filter
node index.js --sites=indeed --keywords="co-op" --location="Ontario"

# Search with date filter (jobs posted after specific date)
node index.js --sites=indeed --keywords="co-op" --min-date="2025-12-01"

# Combine filters
node index.js --sites=indeed --keywords="co-op" --location="Ontario" --min-date="2025-12-01"

# Combine multiple sites
node index.js --sites=indeed,linkedin --keywords="intern,co-op"

# Use AND logic - jobs must match ALL keywords
node index.js --sites=indeed --keywords="co-op,summer 2026" --and

# Use grouped AND/OR logic - (co-op OR intern) AND (summer 2026)
node index.js --sites=indeed --keywords="co-op|intern,summer 2026" --and

Date Filter Notes:

The date filter is converted to Indeed's fromage parameter (days ago)
Format: YYYY-MM-DD (e.g., 2025-12-01)
Indeed supports up to 30 days for date filtering
For dates older than 30 days, Indeed limits results to the maximum supported timeframe
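The conversion described above could look roughly like the following. This is a minimal sketch, assuming the date is interpreted as UTC midnight and rounded up to whole days. `minDateToFromage` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: convert a YYYY-MM-DD minimum date into a "days ago" count
// suitable for a fromage-style parameter.
const MAX_FROMAGE_DAYS = 30; // limit stated in the notes above
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function minDateToFromage(minDate, now = new Date()) {
  const start = new Date(`${minDate}T00:00:00Z`);
  if (Number.isNaN(start.getTime())) {
    throw new Error(`Invalid date, expected YYYY-MM-DD: ${minDate}`);
  }
  const days = Math.ceil((now.getTime() - start.getTime()) / MS_PER_DAY);
  // Keep the value between 1 day and the 30-day maximum.
  return Math.min(Math.max(days, 1), MAX_FROMAGE_DAYS);
}
```

Passing `now` explicitly keeps the helper easy to test. In normal use it defaults to the current time.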

CAPTCHA/Verification Handling:

Indeed may show CAPTCHA or human verification pages when it detects automated access
If you encounter CAPTCHA errors, try:
1. Run in non-headless mode: set HEADLESS=false in the .env file so you can solve the CAPTCHA manually
2. Wait a few minutes between runs to avoid rate limiting
3. Use a different IP address or VPN if available
4. Reduce the number of pages or keywords per run
The parser automatically detects CAPTCHA pages and reports them with helpful error messages
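For readers curious how such detection might work, here is a simplified sketch. The marker phrases and both function names are assumptions made for illustration. The real parser may rely on different signals, such as page selectors or redirects.

```javascript
// Hypothetical marker phrases; real verification pages may use different wording.
const CAPTCHA_MARKERS = ["captcha", "verify you are human", "unusual traffic"];

// Return true when the page text contains any of the marker phrases.
function looksLikeCaptcha(pageText) {
  const text = String(pageText || "").toLowerCase();
  return CAPTCHA_MARKERS.some((marker) => text.includes(marker));
}

// Build an error message that repeats the tips listed above.
function captchaHelpMessage(headless) {
  const tips = [
    "Wait a few minutes between runs to avoid rate limiting",
    "Use a different IP address or VPN if available",
    "Reduce the number of pages or keywords per run",
  ];
  if (headless) {
    tips.unshift("Set HEADLESS=false in .env so you can solve the CAPTCHA manually");
  }
  return `CAPTCHA or verification page detected. Try:\n- ${tips.join("\n- ")}`;
}
```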

🚧 Planned Parsers

Glassdoor: Jobs with company reviews and salary data
Monster: Traditional job board
SimplyHired: Job aggregator with salary estimates
AngelList: Startup and tech jobs
Remote.co: Dedicated remote work jobs
FlexJobs: Flexible and remote positions

📦 Installation

# Install dependencies
npm install

# Run tests
npm test

# Run demo
node demo.js

🔧 Configuration

Environment Variables

Create a .env file in the parser directory:

# Job Search Configuration
SEARCH_KEYWORDS=software engineer,developer,programmer
# For grouped AND/OR logic, use pipe (|) for OR within groups and comma (,) for AND groups:
# SEARCH_KEYWORDS=co-op|intern,summer 2026,remote  # (co-op OR intern) AND (summer 2026) AND (remote)
USE_AND_LOGIC=false  # Set to "true" to enable AND logic (required for grouped keywords)
LOCATION_FILTER=Ontario,Canada
MAX_PAGES=5

# LinkedIn Configuration (required for LinkedIn jobs)
LINKEDIN_USERNAME=your_email@example.com
LINKEDIN_PASSWORD=your_password
LINKEDIN_JOB_LOCATION=Canada  # Optional: LinkedIn location search

# Date Filter (LinkedIn and Indeed - applied at the site level before parsing)
MIN_DATE=2025-12-01  # Format: YYYY-MM-DD (jobs posted after this date)

# Analysis Configuration
ENABLE_AI_ANALYSIS=false
HEADLESS=true

# Output Configuration
OUTPUT_FORMAT=json  # Options: "json", "csv", or "both"
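The variables above might be read into a configuration object along these lines. This is a sketch only: `loadConfig` is a hypothetical function, and the fallback values for MAX_PAGES and HEADLESS, as well as treating 0 or "all" as unlimited for MAX_PAGES, are assumptions borrowed from the sample file and the --max-pages option.

```javascript
// Hypothetical loader: reads the variables shown above from an env-like object.
function loadConfig(env = process.env) {
  // Split a comma-separated value into trimmed, non-empty items.
  const list = (value) =>
    (value || "")
      .split(",")
      .map((item) => item.trim())
      .filter(Boolean);

  const rawPages = (env.MAX_PAGES || "5").trim().toLowerCase();
  // Assumption: 0 or "all" means unlimited, mirroring --max-pages.
  const maxPages =
    rawPages === "all" || rawPages === "0" ? Infinity : Number.parseInt(rawPages, 10);
  if (Number.isNaN(maxPages) || maxPages < 0) {
    throw new Error(`MAX_PAGES must be a non-negative number or "all": ${env.MAX_PAGES}`);
  }

  const outputFormat = (env.OUTPUT_FORMAT || "json").trim().toLowerCase();
  if (!["json", "csv", "both"].includes(outputFormat)) {
    throw new Error(`OUTPUT_FORMAT must be json, csv, or both: ${env.OUTPUT_FORMAT}`);
  }

  return {
    keywords: list(env.SEARCH_KEYWORDS),
    useAndLogic: (env.USE_AND_LOGIC || "false").trim().toLowerCase() === "true",
    locations: list(env.LOCATION_FILTER),
    maxPages,
    minDate: env.MIN_DATE || null,
    headless: (env.HEADLESS || "true").trim().toLowerCase() !== "false",
    outputFormat,
  };
}
```

Validating OUTPUT_FORMAT and MAX_PAGES up front means a typo in .env fails fast instead of surfacing midway through a run.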

Keyword Examples in .env:

# Simple OR logic (default) - matches ANY keyword
SEARCH_KEYWORDS=co-op,intern
USE_AND_LOGIC=false

# Simple AND logic - matches ALL keywords
SEARCH_KEYWORDS=co-op,summer 2026
USE_AND_LOGIC=true

# Grouped AND/OR logic - (co-op OR intern) AND (summer 2026) AND (remote)
SEARCH_KEYWORDS=co-op|intern,summer 2026,remote
USE_AND_LOGIC=true
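The three modes above can be expressed with one small matcher. The sketch below is illustrative: it assumes case-insensitive substring matching against the job text, and `parseKeywordGroups` and `matchesKeywords` are hypothetical names rather than the parser's real functions.

```javascript
// Split a keyword string into AND groups (comma) of OR alternatives (pipe).
function parseKeywordGroups(keywords) {
  return keywords
    .split(",")
    .map((group) =>
      group
        .split("|")
        .map((term) => term.trim().toLowerCase())
        .filter(Boolean)
    )
    .filter((group) => group.length > 0);
}

// Hypothetical matcher: case-insensitive substring test against the job text.
function matchesKeywords(text, keywords, useAndLogic = false) {
  const haystack = text.toLowerCase();
  const groups = parseKeywordGroups(keywords);
  const groupMatches = (group) => group.some((term) => haystack.includes(term));
  // AND logic: every group must match. OR logic: any group is enough.
  return useAndLogic ? groups.every(groupMatches) : groups.some(groupMatches);
}
```

For example, `matchesKeywords("Software Intern - Summer 2026 (Remote)", "co-op|intern,summer 2026,remote", true)` is true because each of the three groups has at least one matching term.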

Command Line Options

# Basic usage
node index.js

# Select sites to parse
node index.js --sites=linkedin,skipthedrive,indeed

# Search keywords
node index.js --keywords="software engineer,developer"

# Location filter
node index.js --location="Ontario"

# Max pages to parse
node index.js --max-pages=10

# Exclude rejected results
node index.js --no-rejected

# Output format (json, csv, or both)
node index.js --output=csv
node index.js --output=both

# Date filter (LinkedIn and Indeed - applied at the site level)
node index.js --sites=linkedin --min-date="2025-12-01"

# Use AND logic for keywords (all keywords must match)
node index.js --sites=linkedin --keywords="co-op,summer 2026" --and

# Use grouped AND/OR logic: (co-op OR intern) AND (summer 2026)
# Use | (pipe) for OR within groups, , (comma) to separate AND groups
node index.js --sites=linkedin --keywords="co-op|intern,summer 2026" --and

# Multiple AND groups: (co-op OR intern) AND (summer 2026) AND (remote)
node index.js --sites=linkedin --keywords="co-op|intern,summer 2026,remote" --and

Available Options:

--sites="site1,site2": Job sites to parse (linkedin, skipthedrive, indeed)
--keywords="keyword1,keyword2": Search keywords
- Use | (pipe) to separate OR keywords within a group: "co-op|intern" means "co-op" OR "intern"
- Use , (comma) to separate AND groups when using --and: "co-op|intern,summer 2026" means (co-op OR intern) AND (summer 2026)
--location="LOCATION": Location filter
--max-pages=NUMBER: Maximum pages to parse (0 or "all" for unlimited)
--min-date="YYYY-MM-DD": Minimum posted date filter (LinkedIn and Indeed - applied at the site level before parsing)
--no-rejected or --exclude-rejected: Exclude rejected results from output
--output=FORMAT or --format=FORMAT: Output format - "json", "csv", or "both" (default: "json")
--and or --all-keywords: Use AND logic for keywords (all keywords must match). Default is OR logic (any keyword matches)
- When combined with | (pipe) in keywords, enables grouped AND/OR logic
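To make the option list concrete, here is one way the flags and their aliases could be parsed from `process.argv.slice(2)`. This is a hedged sketch: `parseArgs` and the shape of the returned object are assumptions, and the project may well use an argument-parsing library instead.

```javascript
// Hypothetical parser for the flags listed above; not the project's actual code.
const ALIASES = {
  "exclude-rejected": "no-rejected",
  format: "output",
  "all-keywords": "and",
};

function parseArgs(argv) {
  const options = {
    sites: [],
    keywords: null,
    location: null,
    maxPages: null,
    minDate: null,
    noRejected: false,
    output: "json", // documented default
    and: false, // documented default is OR logic
  };
  for (const arg of argv) {
    if (!arg.startsWith("--")) continue;
    const eq = arg.indexOf("=");
    const rawName = eq === -1 ? arg.slice(2) : arg.slice(2, eq);
    const value = eq === -1 ? "" : arg.slice(eq + 1);
    const name = ALIASES[rawName] || rawName;
    switch (name) {
      case "sites":
        options.sites = value.split(",").map((s) => s.trim()).filter(Boolean);
        break;
      case "keywords":
        options.keywords = value;
        break;
      case "location":
        options.location = value;
        break;
      case "max-pages":
        // 0 or "all" means unlimited
        options.maxPages =
          value === "all" || value === "0" ? Infinity : Number.parseInt(value, 10);
        break;
      case "min-date":
        options.minDate = value;
        break;
      case "no-rejected":
        options.noRejected = true;
        break;
      case "output":
        options.output = value;
        break;
      case "and":
        options.and = true;
        break;
      default:
        throw new Error(`Unknown option: --${rawName}`);
    }
  }
  return options;
}
```

Note that the shell removes the surrounding quotes, so `--keywords="co-op|intern,summer 2026"` arrives as a single argument containing the pipe, comma, and space.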

📊 Keywords

Role-Specific Keywords

Place keyword CSV files in the keywords/ directory:

job-search-parser/
├── keywords/
│   ├── job-search-keywords.csv     # General job search terms
│   ├── tech-roles.csv              # Technology roles
│   ├── data-roles.csv              # Data science roles
│   ├── management-roles.csv        # Management positions
│   └── emerging-roles.csv          # Emerging job categories
└── index.js

Tech Roles Keywords

keyword
software engineer
frontend developer
backend developer
full stack developer
data scientist
machine learning engineer
devops engineer
site reliability engineer
cloud architect
security engineer
mobile developer
iOS developer
Android developer
react developer
vue developer
angular developer
node.js developer
python developer
java developer
golang developer
rust developer
data engineer
analytics engineer

Data Science Keywords

keyword
data scientist
machine learning engineer
data analyst
business analyst
data engineer
analytics engineer
ML engineer
AI engineer
statistician
quantitative analyst
research scientist
data architect
BI developer
ETL developer
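Keyword files in this one-column format are simple to load. The following sketch assumes a `keyword` header row and one term per line, as in the listings above. `parseKeywordCsv` is a hypothetical helper. It also removes duplicates, since the two sample lists share several terms such as "data scientist" and "data engineer".

```javascript
// Hypothetical helper: parse a one-column keyword CSV with a "keyword" header row.
function parseKeywordCsv(csvText) {
  const lines = csvText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
  // Drop the header row if present.
  if (lines.length > 0 && lines[0].toLowerCase() === "keyword") {
    lines.shift();
  }
  // Remove case-insensitive duplicates while preserving the first spelling.
  const seen = new Set();
  return lines.filter((keyword) => {
    const key = keyword.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

A file could then be read with Node's built-in `fs.readFileSync("keywords/tech-roles.csv", "utf8")` and passed to this function.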

📈 Usage Examples

Basic Job Search

# Standard job market analysis
node index.js

# Specific tech roles
node index.js --roles="software engineer,data scientist"

# Geographic focus
node index.js --locations="Toronto,Vancouver,Calgary"

Advanced Analysis

# Senior level positions
node index.js --experience="senior" --salary-min=100000

# Remote work opportunities
node index.js --remote="remote" --roles="frontend developer"

# Trend analysis
node index.js --trends --skills --output=results/trends.json

Market Intelligence

# Salary analysis
node index.js --salary-min=80000 --salary-max=150000

# Skill gap analysis
node index.js --skills --roles="machine learning engineer"

# Competitive intelligence
node index.js --companies="Google,Microsoft,Amazon"

📊 Output Format

JSON Structure

{
  "metadata": {
    "timestamp": "2024-01-15T10:30:00Z",
    "search_parameters": {
      "roles": ["software engineer", "data scientist"],
      "locations": ["Toronto", "Vancouver"],
      "experience_levels": ["mid", "senior"],
      "remote_preference": ["remote", "hybrid"]
    },
    "total_jobs_found": 1250,
    "analysis_duration_seconds": 45
  },
  "market_overview": {
    "total_jobs": 1250,
    "average_salary": 95000,
    "salary_range": {
      "min": 65000,
      "max": 180000,
      "median": 92000
    },
    "remote_distribution": {
      "remote": 45,
      "hybrid": 35,
      "onsite": 20
    },
    "experience_distribution": {
      "entry": 15,
      "mid": 45,
      "senior": 40
    }
  },
  "trends": {
    "growing_skills": [
      { "skill": "React", "growth_rate": 25 },
      { "skill": "Python", "growth_rate": 18 },
      { "skill": "AWS", "growth_rate": 22 }
    ],
    "declining_skills": [
      { "skill": "jQuery", "growth_rate": -12 },
      { "skill": "PHP", "growth_rate": -8 }
    ],
    "emerging_roles": ["AI Engineer", "DevSecOps Engineer", "Data Engineer"]
  },
  "jobs": [
    {
      "id": "job_1",
      "title": "Senior Software Engineer",
      "company": "TechCorp",
      "location": "Toronto, Ontario",
      "remote_type": "hybrid",
      "salary": {
        "min": 100000,
        "max": 140000,
        "currency": "CAD"
      },
      "required_skills": ["React", "Node.js", "TypeScript", "AWS"],
      "preferred_skills": ["GraphQL", "Docker", "Kubernetes"],
      "experience_level": "senior",
      "job_url": "https://example.com/job/1",
      "posted_date": "2024-01-10T09:00:00Z",
      "scraped_at": "2024-01-15T10:30:00Z"
    }
  ],
  "analysis": {
    "skill_demand": {
      "React": { "count": 45, "avg_salary": 98000 },
      "Python": { "count": 38, "avg_salary": 102000 },
      "AWS": { "count": 32, "avg_salary": 105000 }
    },
    "company_insights": {
      "top_hirers": [
        { "company": "TechCorp", "jobs": 25 },
        { "company": "StartupXYZ", "jobs": 18 }
      ],
      "salary_leaders": [
        { "company": "BigTech", "avg_salary": 120000 },
        { "company": "FinTech", "avg_salary": 115000 }
      ]
    }
  }
}

CSV Output

The parser can generate CSV files for easy spreadsheet analysis. Use --output=csv or OUTPUT_FORMAT=csv to export results as CSV.

CSV Columns:

jobId: Unique job identifier
title: Job title
company: Company name
location: Job location
jobUrl: Link to job posting
postedDate: Date job was posted
description: Job description
jobType: Type of job (full-time, part-time, contract, etc.)
experienceLevel: Required experience level
keyword: Search keyword that matched
extractedAt: Timestamp when job was extracted
source: Source site (e.g., "linkedin-jobs", "skipthedrive")
aiRelevant: AI analysis relevance (Yes/No)
aiConfidence: AI confidence score (0-1)
aiReasoning: AI reasoning for relevance
aiContext: AI analysis context
aiModel: AI model used for analysis
aiAnalyzedAt: Timestamp of AI analysis

Example CSV Output:

jobId,title,company,location,jobUrl,postedDate,description,jobType,experienceLevel,keyword,extractedAt,source,aiRelevant,aiConfidence,aiReasoning,aiContext,aiModel,aiAnalyzedAt
4344137241,Web Applications Co-op/Intern,Nokia,Kanata ON (Hybrid),https://www.linkedin.com/jobs/view/4344137241,,"Web Applications Co-op/Intern",,,co-op,2025-12-17T04:50:05.600Z,linkedin-jobs,Yes,0.8,"The post mentions a co-op/intern position",co-op and internship opportunities for First year Math students,mistral,2025-12-17T04:58:33.479Z
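For anyone post-processing or regenerating these files, the sketch below shows one way to serialize job objects into this column layout. It is an illustration, not the project's exporter: `escapeCsvField` and `toCsv` are hypothetical names, and this version quotes a field only when needed, whereas the project's own writer may quote fields differently.

```javascript
// Column order copied from the header row above.
const CSV_COLUMNS = [
  "jobId", "title", "company", "location", "jobUrl", "postedDate",
  "description", "jobType", "experienceLevel", "keyword", "extractedAt",
  "source", "aiRelevant", "aiConfidence", "aiReasoning", "aiContext",
  "aiModel", "aiAnalyzedAt",
];

// Quote a field only when it contains a comma, quote, or line break (RFC 4180 style).
function escapeCsvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize jobs to CSV text; missing properties become empty fields.
function toCsv(jobs) {
  const rows = jobs.map((job) =>
    CSV_COLUMNS.map((column) => escapeCsvField(job[column])).join(",")
  );
  return [CSV_COLUMNS.join(","), ...rows].join("\n");
}
```

Emitting an empty field for every missing property keeps each row at 18 columns, so rows stay aligned with the header even when values such as jobType or experienceLevel are absent.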

Usage:

# Export as CSV only
node index.js --output=csv

# Export both JSON and CSV
node index.js --output=both

# Using environment variable
OUTPUT_FORMAT=csv node index.js

🔒 Security & Best Practices

Data Privacy

Respect job site terms of service
Implement appropriate rate limiting
Store data securely and responsibly
Anonymize sensitive information

Rate Limiting

Implement delays between requests
Respect API rate limits
Use multiple data sources
Monitor for blocking/detection
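A delay between requests can be as simple as the sketch below. It assumes the REQUEST_DELAY variable from the Troubleshooting section is a value in milliseconds, and it adds random jitter so requests are not evenly spaced. `nextDelayMs`, `sleep`, and `runSequentially` are hypothetical helpers.

```javascript
// Hypothetical helper: base delay plus random jitter, in milliseconds.
function nextDelayMs(baseMs, jitterRatio = 0.5, random = Math.random) {
  return Math.round(baseMs + baseMs * jitterRatio * random());
}

// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run async tasks one at a time, pausing between consecutive tasks.
// Assumption: REQUEST_DELAY is in milliseconds; falls back to 2000 when unset.
async function runSequentially(tasks, baseMs = Number(process.env.REQUEST_DELAY) || 2000) {
  const results = [];
  for (let i = 0; i < tasks.length; i += 1) {
    results.push(await tasks[i]());
    if (i < tasks.length - 1) await sleep(nextDelayMs(baseMs));
  }
  return results;
}
```

Each task is a function returning a promise, for example one page fetch per task, so no request starts until the previous one has finished and the pause has elapsed.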

Legal Compliance

Educational and research purposes only
Respect website terms of service
Implement data retention policies
Monitor for legal changes

🧪 Testing

Run Tests

# All tests
npm test

# Specific test suites
npm test -- --testNamePattern="JobSearch"
npm test -- --testNamePattern="Analysis"
npm test -- --testNamePattern="Trends"

Test Coverage

npm run test:coverage

🚀 Performance Optimization

Recommended Settings

Fast Analysis

node index.js --roles="software engineer" --locations="Toronto"

Comprehensive Analysis

node index.js --trends --skills --experience="all"

Focused Intelligence

node index.js --salary-min=80000 --remote="remote" --trends

Performance Tips

Use specific role filters to reduce data volume
Implement caching for repeated searches
Use parallel processing for multiple sources
Optimize data storage and retrieval

🔧 Troubleshooting

Common Issues

Rate Limiting

# Reduce request frequency
export REQUEST_DELAY=2000
node index.js

Data Source Issues

# Use specific sources
node index.js --sites="linkedin,indeed"

# Check source availability
node index.js --test-sources

Output Issues

# Check output directory
mkdir -p results
node index.js --output=results/analysis.json

# Verify file permissions
chmod 755 results/

📈 Monitoring & Analytics

Key Metrics

Job Volume: Total jobs found per search
Salary Trends: Average and median salary changes
Skill Demand: Most requested skills
Remote Adoption: Remote work trend analysis
Market Velocity: Job posting frequency

Dashboard Integration

Real-time market monitoring
Trend visualization
Salary benchmarking
Skill gap analysis
Competitive intelligence

🤝 Contributing

Development Setup

Fork the repository
Create feature branch
Add tests for new functionality
Ensure all tests pass
Submit pull request

Code Standards

Follow existing code style
Add JSDoc comments
Maintain test coverage
Update documentation

📄 License

This parser is part of the LinkedOut platform and follows the same licensing terms.

Note: This tool is designed for educational and research purposes. Always respect website terms of service and implement appropriate rate limiting and ethical usage practices.
