remove legacy files and restructure project for modular AI analysis and job market intelligence

This commit is contained in:
ilia 2025-08-15 01:05:02 -08:00
parent ead5cdef15
commit ef9720abf2
40 changed files with 9823 additions and 1986 deletions

README.md

@@ -1,278 +1,406 @@
# LinkedOut - LinkedIn Posts Scraper
A Node.js application that automates LinkedIn login and scrapes posts containing specific keywords. The tool is designed to help track job market trends, layoffs, and open work opportunities by monitoring LinkedIn content.
## Features
- **Automated LinkedIn Login**: Uses Playwright to automate browser interactions
- **Keyword-based Search**: Searches for posts containing keywords from CSV files or CLI
- **Flexible Keyword Sources**: Supports multiple CSV files in `keywords/` or CLI-only mode
- **Configurable Search Parameters**: Customizable date ranges, sorting options, city, and scroll behavior
- **Duplicate Detection**: Prevents duplicate posts and profiles in results
- **Clean Text Processing**: Removes hashtags, emojis, and URLs from post content
- **Timestamped Results**: Saves results to JSON files with timestamps
- **Command-line Overrides**: Support for runtime parameter adjustments
- **Enhanced Geographic Location Validation**: Validates user locations against 200+ Canadian cities with smart matching
- **Local AI Analysis (Ollama)**: Free, private, and fast post-processing with local LLMs
- **Flexible Processing**: Disable features, run AI analysis immediately, or process results later
## Prerequisites
- Node.js (v18 or higher; the local AI analyzer relies on the built-in `fetch`)
- Valid LinkedIn account credentials
- [Ollama](https://ollama.ai/) with a model (free, private, local AI)
## Installation
1. Clone the repository or download the files
2. Install dependencies:
```bash
npm install
```
3. Copy the configuration template and customize:
```bash
cp env-config.example .env
```
4. Edit `.env` with your settings (see Configuration section below)
## Configuration
### Environment Variables (.env file)
Create a `.env` file from `env-config.example`:
```env
# LinkedIn Credentials (Required)
LINKEDIN_USERNAME=your_email@example.com
LINKEDIN_PASSWORD=your_password
# Basic Settings
HEADLESS=true
KEYWORDS=keywords-layoff.csv # Just the filename; always looks in keywords/ unless path is given
DATE_POSTED=past-week
SORT_BY=date_posted
CITY=Toronto
WHEELS=5
# Enhanced Location Filtering
LOCATION_FILTER=Ontario,Manitoba
ENABLE_LOCATION_CHECK=true
# Local AI Analysis (Ollama)
ENABLE_LOCAL_AI=true
OLLAMA_MODEL=mistral
OLLAMA_HOST=http://localhost:11434
RUN_LOCAL_AI_AFTER_SCRAPING=false # true = run after scraping, false = run manually
AI_CONTEXT=job layoffs and workforce reduction
AI_CONFIDENCE=0.7
AI_BATCH_SIZE=3
```
### Configuration Options
#### Required
- `LINKEDIN_USERNAME`: Your LinkedIn email/username
- `LINKEDIN_PASSWORD`: Your LinkedIn password
#### Basic Settings
- `HEADLESS`: Browser headless mode (`true`/`false`, default: `true`)
- `KEYWORDS`: CSV file name (default: `keywords-layoff.csv` in `keywords/` folder)
- `DATE_POSTED`: Filter by date (`past-24h`, `past-week`, `past-month`, or empty)
- `SORT_BY`: Sort results (`relevance` or `date_posted`)
- `CITY`: Search location (default: `Toronto`)
- `WHEELS`: Number of scrolls to load posts (default: `5`)
#### Enhanced Location Filtering
- `LOCATION_FILTER`: Geographic filter - supports multiple provinces/cities:
- Single: `Ontario` or `Toronto`
- Multiple: `Ontario,Manitoba` or `Toronto,Vancouver`
- `ENABLE_LOCATION_CHECK`: Enable location validation (`true`/`false`)
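Conceptually, `LOCATION_FILTER` is a comma-separated list of terms matched against each profile's location string. A minimal sketch of that idea (the function name and exact matching rules here are illustrative; the real validation in `location-utils.js` is smarter):

```javascript
// Hypothetical sketch of comma-separated LOCATION_FILTER matching.
function matchesLocationFilter(profileLocation, locationFilter) {
  // No filter configured: accept every location.
  if (!locationFilter) return true;
  const terms = locationFilter
    .split(",")
    .map((t) => t.trim().toLowerCase())
    .filter(Boolean);
  const location = profileLocation.toLowerCase();
  // A profile passes if any filter term appears in its location string.
  return terms.some((term) => location.includes(term));
}

console.log(matchesLocationFilter("Toronto, Ontario, Canada", "Ontario,Manitoba")); // true
console.log(matchesLocationFilter("Vancouver, BC, Canada", "Ontario,Manitoba"));    // false
```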
#### Local AI Analysis (Ollama)
- `ENABLE_LOCAL_AI=true`: Enable local AI analysis
- `OLLAMA_MODEL`: Model to use (auto-detects available models: `mistral`, `llama2`, `codellama`, etc.)
- `OLLAMA_HOST`: Ollama server URL (default: `http://localhost:11434`)
- `RUN_LOCAL_AI_AFTER_SCRAPING`: Run AI immediately after scraping (`true`/`false`)
- `AI_CONTEXT`: Context for analysis (e.g., `job layoffs`)
- `AI_CONFIDENCE`: Minimum confidence threshold (0.0-1.0, default: 0.7)
- `AI_BATCH_SIZE`: Posts per batch (default: 3)
## Usage
### Demo Mode
For testing and demonstration purposes, you can run the interactive demo:
```bash
# Run interactive demo (simulates scraping with fake data)
npm run demo
# Or directly:
node demo.js
```
The demo mode:
- Uses fake, anonymized data for safety
- Walks through all configuration options interactively
- Shows available Ollama models for selection
- Demonstrates the complete workflow without actual LinkedIn scraping
- Perfect for creating documentation, GIFs, or testing configurations
### Basic Commands
```bash
# Standard scraping with configured settings
node linkedout.js
# Visual mode (see browser)
node linkedout.js --headless=false
# Use only these keywords (ignore CSV)
node linkedout.js --keyword="layoff,downsizing"
# Add extra keywords to CSV/CLI list
node linkedout.js --add-keyword="hiring freeze,open to work"
# Override city and date
node linkedout.js --city="Vancouver" --date_posted=past-month
# Custom output file
node linkedout.js --output=results/myfile.json
# Skip location and AI filtering (fastest)
node linkedout.js --no-location --no-ai
# Run AI analysis immediately after scraping
node linkedout.js --ai-after
# Show help
node linkedout.js --help
```
### All Command-line Options
- `--headless=true|false`: Override browser headless mode
- `--keyword="kw1,kw2"`: Use only these keywords (comma-separated, overrides CSV)
- `--add-keyword="kw1,kw2"`: Add extra keywords to CSV/CLI list
- `--city="CityName"`: Override city
- `--date_posted=VALUE`: Override date posted (past-24h, past-week, past-month, or empty)
- `--sort_by=VALUE`: Override sort by (date_posted or relevance)
- `--location_filter=VALUE`: Override location filter
- `--output=FILE`: Output file name
- `--no-location`: Disable location filtering
- `--no-ai`: Disable AI analysis
- `--ai-after`: Run local AI analysis after scraping
- `--help, -h`: Show help message
### Keyword Files
- Place all keyword CSVs in the `keywords/` folder
- Example: `keywords/keywords-layoff.csv`, `keywords/keywords-open-work.csv`
- Custom CSV format: header `keyword` with one keyword per line
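For example, a minimal `keywords/keywords-layoff.csv` might look like this (the keywords themselves are just illustrations):

```csv
keyword
layoff
layoffs
workforce reduction
hiring freeze
```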
### Local AI Analysis Commands
After scraping, you can run AI analysis on the results:
```bash
# Analyze latest results
node ai-analyzer-local.js --context="job layoffs"
# Analyze specific file
node ai-analyzer-local.js --input=results/results-2024-01-15.json --context="hiring"
# Use different model (auto-detects available models)
node ai-analyzer-local.js --model=llama2 --context="remote work"
# Change confidence and batch size
node ai-analyzer-local.js --context="job layoffs" --confidence=0.8 --batch-size=5
# Check available models
ollama list
```
## Workflow Examples
### 1. First Time Setup (Demo Mode)
```bash
# Run interactive demo to test configuration
npm run demo
```
### 2. Quick Start (All Features)
```bash
node linkedout.js --ai-after
```
### 3. Fast Scraping Only
```bash
node linkedout.js --no-location --no-ai
```
### 4. Location-Only Filtering
```bash
node linkedout.js --no-ai
```
### 5. Test Different AI Contexts
```bash
node linkedout.js --no-ai
node ai-analyzer-local.js --context="job layoffs"
node ai-analyzer-local.js --context="hiring opportunities"
node ai-analyzer-local.js --context="remote work"
```
## Project Structure
```
linkedout/
├── .env # Your configuration (create from template)
├── env-config.example # Configuration template
├── linkedout.js # Main scraper
├── demo.js # Interactive demo with fake data
├── ai-analyzer-local.js # Free local AI analyzer (Ollama)
├── location-utils.js # Enhanced location utilities
├── package.json # Dependencies
├── keywords/ # All keyword CSVs go here
│ ├── keywords-layoff.csv
│ └── keywords-open-work.csv
├── results/ # Output directory
└── README.md # This documentation
```
## Legal & Security
- **Credentials**: Store securely in `.env`, add to `.gitignore`
- **LinkedIn ToS**: Respect rate limits and usage guidelines
- **Privacy**: Local AI keeps all data on your machine
- **Usage**: Educational and research purposes only
## Dependencies
- `playwright`: Browser automation
- `dotenv`: Environment variables
- `csv-parser`: CSV file reading
- Built-in: `fs`, `path`, `child_process`
## Support
For issues:
1. Check this README
2. Verify `.env` configuration
3. Test with `--headless=false` for debugging
4. Check Ollama status: `ollama list`
# Job Market Intelligence Platform
A comprehensive platform for job market intelligence with **integrated AI-powered insights**. Built with modular architecture for extensibility and maintainability.
## 🏗️ Architecture Overview
```
job-market-intelligence/
├── ai-analyzer/ # Shared core utilities (logger, AI, location, text) + CLI tool
├── linkedin-parser/ # LinkedIn-specific scraper with integrated AI analysis
├── job-search-parser/ # Job search intelligence
└── docs/ # Documentation
```
## 🚀 Quick Start
### Prerequisites
- Node.js 18+
- Playwright browser automation
- LinkedIn account credentials
- Optional: Ollama for local AI analysis
### Installation
```bash
npm install
npx playwright install chromium
```
### Basic Usage
```bash
# Run LinkedIn parser with integrated AI analysis
cd linkedin-parser && npm start
# Run LinkedIn parser with specific keywords
cd linkedin-parser && npm run start:custom
# Run LinkedIn parser without AI analysis
cd linkedin-parser && npm run start:no-ai
# Run job search parser
cd job-search-parser && npm start
# Analyze existing results with AI (CLI)
cd linkedin-parser && npm run analyze:latest
# Analyze with custom context
cd linkedin-parser && npm run analyze:layoff
# Run demo workflow
node demo.js
```
## 📦 Core Components
### 1. AI Analyzer (`ai-analyzer/`)
**Shared utilities and CLI tool used by all parsers**
- **Logger**: Consistent logging across all components
- **Text Processing**: Keyword matching, text cleaning
- **Location Validation**: Geographic filtering and validation
- **AI Integration**: Local Ollama support with integrated analysis
- **CLI Tool**: Command-line interface for standalone AI analysis
- **Test Utilities**: Shared testing helpers
**Key Features:**
- Configurable log levels with color support
- Intelligent text processing and keyword matching
- Geographic location validation against filters
- **Integrated AI analysis**: AI results embedded in data structure
- **CLI tool**: Standalone analysis with flexible options
- Comprehensive test coverage
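The text cleaning mentioned above (stripping URLs, hashtags, and emojis before keyword matching) can be sketched as follows; the regexes are illustrative, not the package's actual implementation:

```javascript
// Illustrative text cleaner: strips URLs, hashtags, and emoji,
// then collapses whitespace. The real package may differ in details.
function cleanPostText(text) {
  return text
    .replace(/https?:\/\/\S+/g, "")               // URLs
    .replace(/#[\p{L}\p{N}_]+/gu, "")             // hashtags
    .replace(/[\p{Extended_Pictographic}]/gu, "") // emoji
    .replace(/\s+/g, " ")                         // collapse whitespace
    .trim();
}

console.log(cleanPostText("Big news 🚨 we are #hiring! https://example.com/jobs"));
// → "Big news we are !"
```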
### 2. LinkedIn Parser (`linkedin-parser/`)
**Specialized LinkedIn content scraper with integrated AI analysis**
- Automated LinkedIn login and navigation
- Keyword-based post searching
- Profile location validation
- Duplicate detection and filtering
- **Automatic AI analysis integrated into results**
- Configurable search parameters
**Key Features:**
- Browser automation with Playwright
- Geographic filtering by city/region
- Date range filtering (24h, week, month)
- **Integrated AI-powered content relevance analysis**
- **Single JSON output with embedded AI insights**
- **Two output files: results (with AI) and rejected posts**
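Duplicate detection can be approximated with a `Set` keyed on profile link plus text; this is a sketch of the idea, not the parser's exact key:

```javascript
// Illustrative deduplication: keep the first occurrence of each
// (profileLink, text) pair. The real parser's key may differ.
function dedupePosts(posts) {
  const seen = new Set();
  return posts.filter((post) => {
    const key = `${post.profileLink}::${post.text}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const unique = dedupePosts([
  { profileLink: "https://linkedin.com/in/a", text: "We are downsizing." },
  { profileLink: "https://linkedin.com/in/a", text: "We are downsizing." },
  { profileLink: "https://linkedin.com/in/b", text: "We are downsizing." },
]);
console.log(unique.length); // 2
```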
### 3. Job Search Parser (`job-search-parser/`)
**Job market intelligence and analysis**
- Job posting aggregation
- Role-specific keyword tracking
- Market trend analysis
- Salary and requirement insights
**Key Features:**
- Tech role keyword tracking
- Industry-specific analysis
- Market demand insights
- Competitive intelligence
### 4. AI Analysis CLI (`ai-analyzer/cli.js`)
**Command-line tool for AI analysis of any results JSON file**
- Analyze any results JSON file from LinkedIn parser or other sources
- **Integrated analysis**: AI results embedded back into original JSON
- Custom analysis context and AI models
- Comprehensive analysis summary and statistics
- Flexible input format support
**Key Features:**
- Works with any JSON results file
- **Integrated output**: AI analysis embedded in original structure
- Custom analysis contexts
- Detailed relevance scoring
- Confidence level analysis
- Summary statistics and insights
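"Integrated output" amounts to merging an `aiAnalysis` object into each result in place, rather than writing a separate analysis file. A rough sketch (field names follow the sample JSON in this README; the merge helper itself is hypothetical):

```javascript
// Illustrative merge: embed an aiAnalysis object into each result,
// preserving the original structure (field names match the sample output).
function embedAiAnalysis(results, analyses, { context, model }) {
  return results.map((post, i) => ({
    ...post,
    aiAnalysis: {
      isRelevant: analyses[i].isRelevant,
      confidence: analyses[i].confidence,
      reasoning: analyses[i].reasoning,
      context,
      model,
      analyzedAt: new Date().toISOString(),
    },
  }));
}
```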
## 🔧 Configuration
### Environment Variables
Create a `.env` file in the root directory:
```env
# LinkedIn Credentials
LINKEDIN_USERNAME=your_email@example.com
LINKEDIN_PASSWORD=your_password
# Search Configuration
CITY=Toronto
DATE_POSTED=past-week
SORT_BY=date_posted
WHEELS=5
# Location Filtering
LOCATION_FILTER=Ontario,Manitoba
ENABLE_LOCATION_CHECK=true
# AI Analysis
ENABLE_AI_ANALYSIS=true
AI_CONTEXT="job market analysis and trends"
OLLAMA_MODEL=mistral
# Keywords
KEYWORDS=keywords-layoff.csv
```
### Command Line Options
```bash
# LinkedIn Parser Options
--headless=true|false # Browser headless mode
--keyword="kw1,kw2" # Specific keywords
--add-keyword="kw1,kw2" # Additional keywords
--no-location # Disable location filtering
--no-ai # Disable AI analysis
# Job Search Parser Options
--help # Show parser-specific help
# AI Analysis CLI Options
--input=FILE # Input JSON file
--output=FILE # Output file
--context="description" # Custom AI analysis context
--model=MODEL # Ollama model
--latest # Use latest results file
--dir=PATH # Directory to look for results
```
## 📊 Output Formats
### LinkedIn Parser Output
The LinkedIn parser now generates **two main files** with **integrated AI analysis**:
#### 1. Main Results with AI Analysis (`linkedin-results-YYYY-MM-DD-HH-MM.json`)
```json
{
"metadata": {
"timestamp": "2024-01-15T10:30:00Z",
"totalPosts": 45,
"rejectedPosts": 12,
"aiAnalysisEnabled": true,
"aiAnalysisCompleted": true,
"aiContext": "job market analysis and trends",
"aiModel": "mistral",
"locationFilter": "Ontario,Manitoba"
},
"results": [
{
"keyword": "layoff",
"text": "Cleaned post content...",
"profileLink": "https://linkedin.com/in/johndoe",
"location": "Toronto, Ontario, Canada",
"locationValid": true,
"locationMatchedFilter": "Ontario",
"locationReasoning": "Location matches filter",
"timestamp": "2024-01-15T10:30:00Z",
"source": "linkedin",
"parser": "linkedout-parser",
"aiAnalysis": {
"isRelevant": true,
"confidence": 0.9,
"reasoning": "Post discusses job market conditions and layoffs",
"context": "job market analysis and trends",
"model": "mistral",
"analyzedAt": "2024-01-15T10:30:00Z"
}
}
]
}
```
#### 2. Rejected Posts (`linkedin-rejected-YYYY-MM-DD-HH-MM.json`)
```json
[
{
"rejected": true,
"reason": "Location filter failed: Location not in filter",
"keyword": "layoff",
"text": "Post content...",
"profileLink": "https://linkedin.com/in/janedoe",
"location": "Vancouver, BC, Canada",
"timestamp": "2024-01-15T10:30:00Z"
}
]
```
### AI Analysis CLI Output
The CLI tool creates **integrated results** with AI analysis embedded:
#### Re-analyzed Results (`original-filename-ai.json`)
```json
{
"metadata": {
"timestamp": "2024-01-15T10:30:00Z",
"totalPosts": 45,
"aiAnalysisUpdated": "2024-01-15T11:00:00Z",
"aiContext": "layoff analysis",
"aiModel": "mistral"
},
"results": [
{
"keyword": "layoff",
"text": "Post content...",
"profileLink": "https://linkedin.com/in/johndoe",
"location": "Toronto, Ontario, Canada",
"aiAnalysis": {
"isRelevant": true,
"confidence": 0.9,
"reasoning": "Post mentions layoffs and workforce reduction",
"context": "layoff analysis",
"model": "mistral",
"analyzedAt": "2024-01-15T11:00:00Z"
}
}
]
}
```
## 🧪 Testing
### Run All Tests
```bash
npm test
```
### Run Specific Test Suites
```bash
# AI Analyzer tests
cd ai-analyzer && npm test
# LinkedIn Parser tests
cd linkedin-parser && npm test
# Job Search Parser tests
cd job-search-parser && npm test
```
## 🔒 Security & Legal
### Security Best Practices
- Store credentials in `.env` file (never commit)
- Use environment variables for sensitive data
- Implement rate limiting to avoid detection
- Respect LinkedIn's Terms of Service
### Legal Compliance
- Educational/research purposes only
- Respect rate limits and usage policies
- Monitor LinkedIn ToS changes
- Implement data retention policies
## 🚀 Advanced Features
### AI-Powered Analysis
- **Local AI**: Ollama integration for privacy
- **Integrated Analysis**: AI results embedded in data structure
- **Automatic Analysis**: Runs after parsing completes
- **Context Analysis**: Relevance scoring
- **Confidence Scoring**: AI confidence levels for each post
- **CLI Tool**: Standalone analysis with flexible options
### Geographic Intelligence
- **Location Validation**: Profile location verification
- **Regional Filtering**: City/state/country filtering
- **Geographic Analysis**: Location-based insights
### Data Processing
- **Duplicate Detection**: Intelligent deduplication
- **Content Cleaning**: Remove hashtags, URLs, emojis
- **Metadata Extraction**: Author, engagement, timing data
- **Integrated AI**: AI insights embedded in each result
## 📈 Performance Optimization
### Recommended Settings
- **Headless Mode**: Faster execution
- **Location Filtering**: Reduces false positives
- **AI Analysis**: Improves result quality (enabled by default)
- **Batch Processing**: Efficient data handling
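Batch processing here just means slicing the result list into fixed-size chunks before handing them to the AI analyzer, equivalent to the batch loop the analyzer runs with `AI_BATCH_SIZE`:

```javascript
// Split an array into fixed-size batches (the same slicing the analyzer's
// batch loop performs with AI_BATCH_SIZE).
function toBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

console.log(toBatches([1, 2, 3, 4, 5, 6, 7], 3)); // [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7 ] ]
```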
### Monitoring
- Real-time progress indicators
- Detailed logging with configurable levels
- Performance metrics tracking
- Error handling and recovery
## 🤝 Contributing
### Development Setup
1. Fork the repository
2. Create feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit pull request
### Code Standards
- Follow existing code style
- Add JSDoc comments
- Maintain test coverage
- Update documentation
## 📄 License
This project is for educational and research purposes. Please respect LinkedIn's Terms of Service and use responsibly.
## 🆘 Support
### Common Issues
- **Browser Issues**: Ensure Playwright is installed
- **Login Problems**: Check credentials in `.env`
- **Rate Limiting**: Implement delays between requests
- **Location Filtering**: Verify location filter format
- **AI Analysis**: Ensure Ollama is running for AI features
### Getting Help
- Check the component-specific READMEs
- Review the demo files for examples
- Examine the test files for usage patterns
- Open an issue with detailed error information
## 🆕 What's New
- **Integrated AI Analysis**: AI results are now embedded directly in the results JSON
- **No Separate Files**: No more separate AI analysis files to manage
- **CLI Tool**: Standalone AI analysis with flexible options
- **Rich Context**: Each post includes detailed AI insights
- **Flexible Re-analysis**: Easy to re-analyze with different contexts
- **Backward Compatible**: Original data structure preserved
---
**Note**: This tool is designed for educational and research purposes. Always respect LinkedIn's Terms of Service and implement appropriate rate limiting and ethical usage practices.


@@ -1,545 +0,0 @@
#!/usr/bin/env node
/**
* Local AI Post-Processing Analyzer for LinkedOut
*
* Uses Ollama for completely FREE local AI analysis.
*
* FEATURES:
* - Analyze LinkedOut results for context relevance (layoffs, hiring, etc.)
* - Works on latest or specified results file
* - Batch processing for speed
* - Configurable context, model, confidence, batch size
* - CLI and .env configuration
* - 100% local, private, and free
*
* USAGE:
* node ai-analyzer-local.js [options]
*
* COMMAND-LINE OPTIONS:
* --input=<file> Input JSON file (default: latest in results/)
* --context=<text> AI context to analyze against (required)
* --confidence=<num> Minimum confidence threshold (0.0-1.0, default: 0.7)
* --model=<name> Ollama model to use (default: llama2)
* --batch-size=<num> Number of posts to process at once (default: 3)
* --output=<file> Output file (default: adds -ai-local suffix)
* --help, -h Show this help message
*
* EXAMPLES:
* node ai-analyzer-local.js --context="job layoffs"
* node ai-analyzer-local.js --input=results/results-2024-01-15.json --context="hiring"
* node ai-analyzer-local.js --model=mistral --context="remote work"
* node ai-analyzer-local.js --context="job layoffs" --confidence=0.8 --batch-size=5
*
* ENVIRONMENT VARIABLES (.env file):
* AI_CONTEXT, AI_CONFIDENCE, AI_BATCH_SIZE, OLLAMA_MODEL, OLLAMA_HOST
* See README for full list.
*
* OUTPUT:
* - Saves to results/ with -ai-local suffix unless --output is specified
*
* DEPENDENCIES:
* - Ollama (https://ollama.ai/)
* - Node.js 18+ (built-ins: fs, path; global fetch)
*
* SECURITY & LEGAL:
* - All analysis is local, no data leaves your machine
* - Use responsibly for educational/research purposes
*/
require("dotenv").config();
const fs = require("fs");
const path = require("path");
// Configuration from environment and command line
const DEFAULT_CONTEXT =
process.env.AI_CONTEXT || "job layoffs and workforce reduction";
const DEFAULT_CONFIDENCE = parseFloat(process.env.AI_CONFIDENCE || "0.7");
const DEFAULT_BATCH_SIZE = parseInt(process.env.AI_BATCH_SIZE || "3", 10);
const DEFAULT_MODEL = process.env.OLLAMA_MODEL || "llama2";
const OLLAMA_HOST = process.env.OLLAMA_HOST || "http://localhost:11434";
// Parse command line arguments
const args = process.argv.slice(2);
let inputFile = null;
let context = DEFAULT_CONTEXT;
let confidenceThreshold = DEFAULT_CONFIDENCE;
let batchSize = DEFAULT_BATCH_SIZE;
let model = DEFAULT_MODEL;
let outputFile = null;
for (const arg of args) {
if (arg.startsWith("--input=")) {
inputFile = arg.split("=")[1];
} else if (arg.startsWith("--context=")) {
context = arg.split("=")[1];
} else if (arg.startsWith("--confidence=")) {
confidenceThreshold = parseFloat(arg.split("=")[1]);
} else if (arg.startsWith("--batch-size=")) {
batchSize = parseInt(arg.split("=")[1], 10);
} else if (arg.startsWith("--model=")) {
model = arg.split("=")[1];
} else if (arg.startsWith("--output=")) {
outputFile = arg.split("=")[1];
}
}
if (!context) {
console.error("❌ Error: No AI context specified");
console.error('Use --context="your context" or set AI_CONTEXT in .env');
process.exit(1);
}
/**
* Check if Ollama is running and the model is available
*/
async function checkOllamaStatus() {
try {
// Check if Ollama is running
const response = await fetch(`${OLLAMA_HOST}/api/tags`);
if (!response.ok) {
throw new Error(`Ollama not running on ${OLLAMA_HOST}`);
}
const data = await response.json();
const availableModels = data.models.map((m) => m.name);
console.log(`🤖 Ollama is running`);
console.log(
`📦 Available models: ${availableModels
.map((m) => m.split(":")[0])
.join(", ")}`
);
// Check if requested model is available
const modelExists = availableModels.some((m) => m.startsWith(model));
if (!modelExists) {
console.error(`❌ Model "${model}" not found`);
console.error(`💡 Install it with: ollama pull ${model}`);
console.error(
`💡 Or choose from: ${availableModels
.map((m) => m.split(":")[0])
.join(", ")}`
);
process.exit(1);
}
console.log(`✅ Using model: ${model}`);
return true;
} catch (error) {
console.error("❌ Error connecting to Ollama:", error.message);
console.error("💡 Make sure Ollama is installed and running:");
console.error(" 1. Install: https://ollama.ai/");
console.error(" 2. Start: ollama serve");
console.error(` 3. Install model: ollama pull ${model}`);
process.exit(1);
}
}
/**
* Find the most recent results file if none specified
*/
function findLatestResultsFile() {
const resultsDir = "results";
if (!fs.existsSync(resultsDir)) {
throw new Error("Results directory not found. Run the scraper first.");
}
const files = fs
.readdirSync(resultsDir)
.filter(
(f) =>
f.startsWith("results-") && f.endsWith(".json") && !f.includes("-ai-")
)
.sort()
.reverse();
if (files.length === 0) {
throw new Error("No results files found. Run the scraper first.");
}
return path.join(resultsDir, files[0]);
}
/**
* Analyze multiple posts using local Ollama
*/
async function analyzeBatch(posts, context, model) {
console.log(`🤖 Analyzing batch of ${posts.length} posts with ${model}...`);
try {
const prompt = `You are an expert at analyzing LinkedIn posts for relevance to specific contexts.
CONTEXT TO MATCH: "${context}"
Analyze these ${
posts.length
} LinkedIn posts and determine if each relates to the context above.
POSTS:
${posts
.map(
(post, i) => `
POST ${i + 1}:
"${post.text.substring(0, 400)}${post.text.length > 400 ? "..." : ""}"
`
)
.join("")}
For each post, provide:
- Is it relevant to "${context}"? (YES/NO)
- Confidence level (0.0 to 1.0)
- Brief reasoning
Respond in this EXACT format for each post:
POST 1: YES/NO | 0.X | brief reason
POST 2: YES/NO | 0.X | brief reason
POST 3: YES/NO | 0.X | brief reason
Examples:
- For layoff context: "laid off 50 employees" = YES | 0.9 | mentions layoffs
- For hiring context: "we're hiring developers" = YES | 0.8 | job posting
- Unrelated content = NO | 0.1 | not relevant to context`;
const response = await fetch(`${OLLAMA_HOST}/api/generate`, {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false,
options: {
temperature: 0.3,
top_p: 0.9,
},
}),
});
if (!response.ok) {
throw new Error(
`Ollama API error: ${response.status} ${response.statusText}`
);
}
const data = await response.json();
const aiResponse = data.response.trim();
// Parse the response
const analyses = [];
const lines = aiResponse.split("\n").filter((line) => line.trim());
for (let i = 0; i < posts.length; i++) {
let analysis = {
postIndex: i + 1,
isRelevant: false,
confidence: 0.5,
reasoning: "Could not parse AI response",
};
// Look for lines that match "POST X:" pattern
const postPattern = new RegExp(`POST\\s*${i + 1}:?\\s*(.+)`, "i");
for (const line of lines) {
const match = line.match(postPattern);
if (match) {
const content = match[1].trim();
// Parse: YES/NO | 0.X | reasoning
const parts = content.split("|").map((p) => p.trim());
if (parts.length >= 3) {
analysis.isRelevant = parts[0].toUpperCase().includes("YES");
analysis.confidence = Math.max(
0,
Math.min(1, parseFloat(parts[1]) || 0.5)
);
analysis.reasoning = parts[2] || "No reasoning provided";
} else {
// Fallback parsing
analysis.isRelevant =
content.toUpperCase().includes("YES") ||
content.toLowerCase().includes("relevant");
analysis.confidence = 0.6;
analysis.reasoning = content.substring(0, 100);
}
break;
}
}
analyses.push(analysis);
}
// If we didn't get enough analyses, fill in defaults
while (analyses.length < posts.length) {
analyses.push({
postIndex: analyses.length + 1,
isRelevant: false,
confidence: 0.3,
reasoning: "AI response parsing failed",
});
}
return analyses;
} catch (error) {
console.error(`❌ Error in batch AI analysis: ${error.message}`);
// Fallback: mark all as relevant with low confidence
return posts.map((_, i) => ({
postIndex: i + 1,
isRelevant: true,
confidence: 0.3,
reasoning: `Analysis failed: ${error.message}`,
}));
}
}
/**
* Analyze a single post using local Ollama (fallback)
*/
async function analyzeSinglePost(text, context, model) {
const prompt = `Analyze this LinkedIn post for relevance to: "${context}"
Post: "${text}"
Is this post relevant to "${context}"? Provide:
1. YES or NO
2. Confidence (0.0 to 1.0)
3. Brief reason
Format: YES/NO | 0.X | reason`;
try {
const response = await fetch(`${OLLAMA_HOST}/api/generate`, {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false,
options: {
temperature: 0.3,
},
}),
});
if (!response.ok) {
throw new Error(`Ollama API error: ${response.status}`);
}
const data = await response.json();
const aiResponse = data.response.trim();
// Parse response
const parts = aiResponse.split("|").map((p) => p.trim());
if (parts.length >= 3) {
return {
isRelevant: parts[0].toUpperCase().includes("YES"),
confidence: Math.max(0, Math.min(1, parseFloat(parts[1]) || 0.5)),
reasoning: parts[2],
};
} else {
// Fallback parsing
return {
isRelevant:
aiResponse.toLowerCase().includes("yes") ||
aiResponse.toLowerCase().includes("relevant"),
confidence: 0.6,
reasoning: aiResponse.substring(0, 100),
};
}
} catch (error) {
return {
isRelevant: true, // Default to include on error
confidence: 0.3,
reasoning: `Analysis failed: ${error.message}`,
};
}
}
/**
* Main processing function
*/
async function main() {
try {
console.log("🚀 LinkedOut Local AI Analyzer Starting...");
console.log(`📊 Context: "${context}"`);
console.log(`🎯 Confidence Threshold: ${confidenceThreshold}`);
console.log(`📦 Batch Size: ${batchSize}`);
console.log(`🤖 Model: ${model}`);
// Check Ollama status
await checkOllamaStatus();
// Determine input file
if (!inputFile) {
inputFile = findLatestResultsFile();
console.log(`📂 Using latest results file: ${inputFile}`);
} else {
console.log(`📂 Using specified file: ${inputFile}`);
}
// Load results
if (!fs.existsSync(inputFile)) {
throw new Error(`Input file not found: ${inputFile}`);
}
const rawData = fs.readFileSync(inputFile, "utf-8");
const results = JSON.parse(rawData);
if (!Array.isArray(results) || results.length === 0) {
throw new Error("No posts found in input file");
}
console.log(`📋 Loaded ${results.length} posts for analysis`);
// Process in batches
const processedResults = [];
let totalRelevant = 0;
let totalProcessed = 0;
for (let i = 0; i < results.length; i += batchSize) {
const batch = results.slice(i, i + batchSize);
console.log(
`\n📦 Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(
results.length / batchSize
)} (${batch.length} posts)`
);
const analyses = await analyzeBatch(batch, context, model);
// Apply analyses to posts
for (let j = 0; j < batch.length; j++) {
const post = batch[j];
const analysis = analyses[j];
const enhancedPost = {
...post,
aiRelevant: analysis.isRelevant,
aiConfidence: analysis.confidence,
aiReasoning: analysis.reasoning,
aiModel: model,
aiAnalyzedAt: new Date().toLocaleString("en-CA", {
year: "numeric",
month: "2-digit",
day: "2-digit",
hour: "2-digit",
minute: "2-digit",
second: "2-digit",
hour12: false,
}),
aiType: "local-ollama",
aiProcessed: true,
};
// Apply confidence threshold
if (analysis.confidence >= confidenceThreshold) {
if (analysis.isRelevant) {
processedResults.push(enhancedPost);
totalRelevant++;
}
} else {
// Include low-confidence posts but flag them
enhancedPost.lowConfidence = true;
processedResults.push(enhancedPost);
}
totalProcessed++;
console.log(
` ${
analysis.isRelevant ? "✅" : "❌"
} Post ${totalProcessed}: ${analysis.confidence.toFixed(
2
)} confidence - ${analysis.reasoning.substring(0, 100)}...`
);
}
// Small delay between batches to be nice to the system
if (i + batchSize < results.length) {
console.log("⏳ Brief pause...");
await new Promise((resolve) => setTimeout(resolve, 500));
}
}
// Determine output file
if (!outputFile) {
const inputBasename = path.basename(inputFile, ".json");
const inputDir = path.dirname(inputFile);
outputFile = path.join(inputDir, `${inputBasename}-ai-local.json`);
}
// Save results
fs.writeFileSync(
outputFile,
JSON.stringify(processedResults, null, 2),
"utf-8"
);
console.log("\n🎉 Local AI Analysis Complete!");
console.log(`📊 Results:`);
console.log(` Total posts processed: ${totalProcessed}`);
console.log(` Relevant posts found: ${totalRelevant}`);
console.log(` Final results saved: ${processedResults.length}`);
console.log(`📁 Output saved to: ${outputFile}`);
console.log(`💰 Cost: $0.00 (completely free!)`);
} catch (error) {
console.error("❌ Error:", error.message);
process.exit(1);
}
}
// Show help if requested
if (args.includes("--help") || args.includes("-h")) {
console.log(`
LinkedOut Local AI Analyzer (Ollama)
🚀 FREE local AI analysis - No API costs, complete privacy!
Usage: node ai-analyzer-local.js [options]
Options:
--input=<file> Input JSON file (default: latest in results/)
--context=<text> AI context to analyze against (required)
--confidence=<num> Minimum confidence threshold (0.0-1.0, default: 0.7)
--model=<name> Ollama model to use (default: llama2)
--batch-size=<num> Number of posts to process at once (default: 3)
--output=<file> Output file (default: adds -ai-local suffix)
--help, -h Show this help message
Examples:
node ai-analyzer-local.js --context="job layoffs"
node ai-analyzer-local.js --model=mistral --context="hiring opportunities"
node ai-analyzer-local.js --context="remote work" --confidence=0.8
Prerequisites:
1. Install Ollama: https://ollama.ai/
2. Install a model: ollama pull llama2
3. Start Ollama: ollama serve
Popular Models:
- llama2 (good general purpose)
- mistral (fast and accurate)
- codellama (good for technical content)
- llama2:13b (more accurate, slower)
Environment Variables:
AI_CONTEXT Default context for analysis
AI_CONFIDENCE Default confidence threshold
AI_BATCH_SIZE Default batch size
OLLAMA_MODEL Default model (llama2, mistral, etc.)
OLLAMA_HOST Ollama host (default: http://localhost:11434)
`);
process.exit(0);
}
// Run the analyzer
main();
module.exports = {
analyzeSinglePost,
analyzeBatch,
};

ai-analyzer/README.md Normal file
@@ -0,0 +1,558 @@
# AI Analyzer - Core Utilities Package
Shared utilities and core functionality used by all LinkedOut parsers. This package provides consistent logging, text processing, location validation, AI integration, and a **command-line interface for AI analysis**.
## 🎯 Purpose
The AI Analyzer serves as the foundation for all LinkedOut components, providing:
- **Consistent Logging**: Unified logging system across all parsers
- **Text Processing**: Keyword matching, content cleaning, and analysis
- **Location Validation**: Geographic filtering and location intelligence
- **AI Integration**: Local Ollama support with integrated analysis
- **CLI Tool**: Command-line interface for standalone AI analysis
- **Test Utilities**: Shared testing helpers and mocks
## 📦 Components
### 1. Logger (`src/logger.js`)
Configurable logging system with color support and level controls.
```javascript
const { logger } = require("ai-analyzer");
// Basic logging
logger.info("Processing started");
logger.warning("Rate limit approaching");
logger.error("Connection failed");
// Convenience methods with emoji prefixes
logger.step("🚀 Starting scrape");
logger.search("🔍 Searching for keywords");
logger.ai("🧠 Running AI analysis");
logger.location("📍 Validating location");
logger.file("📄 Saving results");
```
**Features:**
- Configurable log levels (debug, info, warning, error, success)
- Color-coded output with chalk
- Emoji prefixes for better UX
- Silent mode for production
- Timestamp formatting
### 2. Text Utilities (`src/text-utils.js`)
Text processing and keyword matching utilities.
```javascript
const { cleanText, containsAnyKeyword } = require("ai-analyzer");
// Clean text content
const cleaned = cleanText(
"Check out this #awesome post! https://example.com 🚀"
);
// Result: "Check out this awesome post!"
// Check for keyword matches
const keywords = ["layoff", "downsizing", "RIF"];
const hasMatch = containsAnyKeyword(text, keywords);
```
**Features:**
- Remove hashtags, URLs, and emojis
- Case-insensitive keyword matching
- Multiple keyword detection
- Text normalization
### 3. Location Utilities (`src/location-utils.js`)
Geographic location validation and filtering.
```javascript
const {
parseLocationFilters,
validateLocationAgainstFilters,
extractLocationFromProfile,
} = require("ai-analyzer");
// Parse location filter string
const filters = parseLocationFilters("Ontario,Manitoba,Toronto");
// Validate location against filters
const isValid = validateLocationAgainstFilters(
"Toronto, Ontario, Canada",
filters
);
// Extract location from profile text
const location = extractLocationFromProfile(
"Software Engineer at Tech Corp • Toronto, Ontario"
);
```
**Features:**
- Geographic filter parsing
- Location validation against 200+ Canadian cities
- Profile location extraction
- Smart location matching
### 4. AI Utilities (`src/ai-utils.js`)
AI-powered content analysis with **integrated results**.
```javascript
const { analyzeBatch, checkOllamaStatus } = require("ai-analyzer");
// Check AI availability
const aiAvailable = await checkOllamaStatus("mistral");
// Analyze posts with AI (returns analysis results)
const analysis = await analyzeBatch(posts, "job market analysis", "mistral");
// Integrate AI analysis into results
const resultsWithAI = posts.map((post, index) => ({
...post,
aiAnalysis: {
isRelevant: analysis[index].isRelevant,
confidence: analysis[index].confidence,
reasoning: analysis[index].reasoning,
context: "job market analysis",
model: "mistral",
analyzedAt: new Date().toISOString(),
},
}));
```
**Features:**
- Ollama integration for local AI
- Batch processing for efficiency
- Confidence scoring
- Context-aware analysis
- **Integrated results**: AI analysis embedded in data structure
### 5. CLI Tool (`cli.js`)
Command-line interface for standalone AI analysis.
```bash
# Analyze latest results file
node cli.js --latest --dir=results
# Analyze specific file
node cli.js --input=results.json
# Analyze with custom context
node cli.js --input=results.json --context="layoff analysis"
# Analyze with different model
node cli.js --input=results.json --model=mistral
# Show help
node cli.js --help
```
**Features:**
- **Integrated Analysis**: AI results embedded back into original JSON
- **Flexible Input**: Support for various JSON formats
- **Context Switching**: Easy re-analysis with different contexts
- **Model Selection**: Choose different Ollama models
- **Directory Support**: Specify results directory with `--dir`
### 6. Test Utilities (`src/test-utils.js`)
Shared testing helpers and mocks.
```javascript
const { createMockPost, createMockProfile } = require("ai-analyzer");
// Create test data
const mockPost = createMockPost({
content: "Test post content",
author: "John Doe",
location: "Toronto, Ontario",
});
```
## 🚀 Installation
```bash
# Install dependencies
npm install
# Run tests
npm test
# Run specific test suites
npm test -- --testNamePattern="Logger"
```
## 📋 CLI Reference
### Basic Usage
```bash
# Analyze latest results file
node cli.js --latest --dir=results
# Analyze specific file
node cli.js --input=results.json
# Analyze with custom output
node cli.js --input=results.json --output=analysis.json
```
### Options
```bash
--input=FILE # Input JSON file
--output=FILE # Output file (default: original-ai.json)
--context="description" # Analysis context (default: "job market analysis and trends")
--model=MODEL # Ollama model (default: mistral)
--latest # Use latest results file from directory
--dir=PATH # Directory to look for results (default: 'results')
--help, -h # Show help
```
### Examples
```bash
# Analyze latest LinkedIn results
cd linkedin-parser
node ../ai-analyzer/cli.js --latest --dir=results
# Analyze with layoff context
node cli.js --input=results.json --context="layoff analysis"
# Analyze with different model
node cli.js --input=results.json --model=llama3
# Analyze from project root
node ai-analyzer/cli.js --latest --dir=linkedin-parser/results
```
### Output Format
The CLI integrates AI analysis directly into the original JSON structure:
```json
{
"metadata": {
"timestamp": "2025-07-21T02:00:08.561Z",
"totalPosts": 10,
"aiAnalysisUpdated": "2025-07-21T02:48:42.487Z",
"aiContext": "job market analysis and trends",
"aiModel": "mistral"
},
"results": [
{
"keyword": "layoff",
"text": "Post content...",
"aiAnalysis": {
"isRelevant": true,
"confidence": 0.9,
"reasoning": "Post discusses job market conditions",
"context": "job market analysis and trends",
"model": "mistral",
"analyzedAt": "2025-07-21T02:48:42.487Z"
}
}
]
}
```
## 📋 API Reference
### Logger Class
```javascript
const { Logger } = require("ai-analyzer");
// Create custom logger
const logger = new Logger({
debug: false,
colors: true,
});
// Configure levels
logger.setLevel("debug", true);
logger.silent(); // Disable all logging
logger.verbose(); // Enable all logging
```
### Text Processing
```javascript
const { cleanText, containsAnyKeyword } = require('ai-analyzer');
// Clean text
cleanText(text: string): string
// Check keywords
containsAnyKeyword(text: string, keywords: string[]): boolean
```
### Location Validation
```javascript
const {
parseLocationFilters,
validateLocationAgainstFilters,
extractLocationFromProfile
} = require('ai-analyzer');
// Parse filters
parseLocationFilters(filterString: string): string[]
// Validate location
validateLocationAgainstFilters(location: string, filters: string[]): boolean
// Extract from profile
extractLocationFromProfile(profileText: string): string | null
```
### AI Analysis
```javascript
const { analyzeBatch, checkOllamaStatus, findLatestResultsFile } = require('ai-analyzer');
// Check AI availability
checkOllamaStatus(model?: string, ollamaHost?: string): Promise<boolean>
// Analyze posts
analyzeBatch(posts: Post[], context: string, model?: string): Promise<AnalysisResult[]>
// Find latest results file
findLatestResultsFile(resultsDir?: string): string
```
## 🧪 Testing
### Run All Tests
```bash
npm test
```
### Test Coverage
```bash
npm run test:coverage
```
### Specific Test Suites
```bash
# Logger tests
npm test -- --testNamePattern="Logger"
# Text utilities tests
npm test -- --testNamePattern="Text"
# Location utilities tests
npm test -- --testNamePattern="Location"
# AI utilities tests
npm test -- --testNamePattern="AI"
```
## 🔧 Configuration
### Environment Variables
```env
# AI Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=mistral
AI_CONTEXT="job market analysis and trends"
# Logging Configuration
LOG_LEVEL=info
LOG_COLORS=true
# Location Configuration
LOCATION_FILTER=Ontario,Manitoba
ENABLE_LOCATION_CHECK=true
```
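The variables above are read from `process.env` with fallbacks. A minimal sketch of such a config loader (the `loadConfig` helper is hypothetical; the default values mirror the ones documented above, and `require("dotenv").config()` would populate `process.env` from a `.env` file first):

```javascript
// Hypothetical config loader mirroring the documented defaults.
// In the real parsers, require("dotenv").config() runs before this.
function loadConfig(env = process.env) {
  return {
    ollamaHost: env.OLLAMA_HOST || "http://localhost:11434",
    model: env.OLLAMA_MODEL || "mistral",
    context: env.AI_CONTEXT || "job market analysis and trends",
    // "Ontario,Manitoba" -> ["Ontario", "Manitoba"]; empty string -> []
    locationFilter: (env.LOCATION_FILTER || "")
      .split(",")
      .map((s) => s.trim())
      .filter(Boolean),
    // Location check is on unless explicitly disabled
    enableLocationCheck: env.ENABLE_LOCATION_CHECK !== "false",
  };
}

const config = loadConfig({ LOCATION_FILTER: "Ontario,Manitoba" });
console.log(config.model); // "mistral" (fallback when OLLAMA_MODEL is unset)
```

Keeping every fallback in one function makes it easy to see which settings a parser actually honors.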
### Logger Configuration
```javascript
const logger = new Logger({
debug: true, // Enable debug logging
info: true, // Enable info logging
warning: true, // Enable warning logging
error: true, // Enable error logging
success: true, // Enable success logging
colors: true, // Enable color output
});
```
## 📊 Usage Examples
### Basic Logging Setup
```javascript
const { logger } = require("ai-analyzer");
// Configure for production
if (process.env.NODE_ENV === "production") {
logger.setLevel("debug", false);
logger.setLevel("info", true);
}
// Use throughout your application
logger.step("Starting LinkedIn scrape");
logger.info("Found 150 posts");
logger.warning("Rate limit approaching");
logger.success("Scraping completed successfully");
```
### Text Processing Pipeline
```javascript
const { cleanText, containsAnyKeyword } = require("ai-analyzer");
function processPost(post) {
// Clean the content
const cleanedContent = cleanText(post.content);
// Check for keywords
const keywords = ["layoff", "downsizing", "RIF"];
const hasKeywords = containsAnyKeyword(cleanedContent, keywords);
return {
...post,
cleanedContent,
hasKeywords,
};
}
```
### Location Validation
```javascript
const {
parseLocationFilters,
validateLocationAgainstFilters,
} = require("ai-analyzer");
// Setup location filtering
const locationFilters = parseLocationFilters("Ontario,Manitoba,Toronto");
// Validate each post
function validatePost(post) {
const isValidLocation = validateLocationAgainstFilters(
post.author.location,
locationFilters
);
return isValidLocation ? post : null;
}
```
### AI Analysis Integration
```javascript
const { analyzeBatch, checkOllamaStatus } = require("ai-analyzer");
async function analyzePosts(posts) {
try {
// Check AI availability
const aiAvailable = await checkOllamaStatus("mistral");
if (!aiAvailable) {
logger.warning("AI not available - skipping analysis");
return posts;
}
// Run AI analysis
const analysis = await analyzeBatch(
posts,
"job market analysis",
"mistral"
);
// Integrate AI analysis into results
const resultsWithAI = posts.map((post, index) => ({
...post,
aiAnalysis: {
isRelevant: analysis[index].isRelevant,
confidence: analysis[index].confidence,
reasoning: analysis[index].reasoning,
context: "job market analysis",
model: "mistral",
analyzedAt: new Date().toISOString(),
},
}));
return resultsWithAI;
} catch (error) {
logger.error("AI analysis failed:", error.message);
return posts; // Return original posts if AI fails
}
}
```
### CLI Integration
```javascript
// In your parser's package.json scripts
{
"scripts": {
"analyze:latest": "node ../ai-analyzer/cli.js --latest --dir=results",
"analyze:layoff": "node ../ai-analyzer/cli.js --latest --dir=results --context=\"layoff analysis\"",
"analyze:trends": "node ../ai-analyzer/cli.js --latest --dir=results --context=\"job market trends\""
}
}
```
## 🔒 Security & Best Practices
### Credential Management
- Store API keys in environment variables
- Never commit sensitive data to version control
- Use `.env` files for local development
### Rate Limiting
- Implement delays between AI API calls
- Respect service provider rate limits
- Use batch processing to minimize requests
### Error Handling
- Always wrap AI calls in try-catch blocks
- Provide fallback behavior when services fail
- Log errors with appropriate detail levels
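The rate-limiting and error-handling guidance above can be sketched as a small helper (a hypothetical `processInBatches`; `analyzeFn` stands in for a real Ollama call, and the 500 ms default pause is an assumption mirroring the analyzer's batch delay):

```javascript
// Sketch of batched processing with a pause between batches and a
// fallback when the AI call fails. All names here are illustrative.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processInBatches(items, analyzeFn, batchSize = 3, pauseMs = 500) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    try {
      results.push(...(await analyzeFn(batch)));
    } catch (error) {
      // Fallback behavior: keep the items unanalyzed rather than fail the run
      results.push(...batch.map((item) => ({ ...item, aiAnalysis: null })));
    }
    // Be nice to the local model between batches
    if (i + batchSize < items.length) await delay(pauseMs);
  }
  return results;
}
```

The try-catch around each batch means one bad response costs you at most `batchSize` analyses, not the whole run.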
## 🤝 Contributing
### Development Setup
1. Fork the repository
2. Create feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit pull request
### Code Standards
- Follow existing code style
- Add JSDoc comments for all functions
- Maintain test coverage above 90%
- Update documentation for new features
## 📄 License
This package is part of the LinkedOut platform and follows the same licensing terms.
---
**Note**: This package is designed to be used as a dependency by other LinkedOut components. Aside from the CLI tool, it provides core utilities and is not intended to be used standalone.

ai-analyzer/cli.js Normal file
@@ -0,0 +1,250 @@
#!/usr/bin/env node
/**
* AI Analyzer CLI
*
* Command-line interface for the ai-analyzer package
* Can be used by any parser to analyze JSON files
*/
const fs = require("fs");
const path = require("path");
// Import AI utilities from this package
const {
logger,
analyzeBatch,
checkOllamaStatus,
findLatestResultsFile,
} = require("./index");
// Default configuration
const DEFAULT_CONTEXT =
process.env.AI_CONTEXT || "job market analysis and trends";
const DEFAULT_MODEL = process.env.OLLAMA_MODEL || "mistral";
const DEFAULT_RESULTS_DIR = "results";
// Parse command line arguments
const args = process.argv.slice(2);
let inputFile = null;
let outputFile = null;
let context = DEFAULT_CONTEXT;
let model = DEFAULT_MODEL;
let findLatest = false;
let resultsDir = DEFAULT_RESULTS_DIR;
for (const arg of args) {
if (arg.startsWith("--input=")) {
inputFile = arg.split("=")[1];
} else if (arg.startsWith("--output=")) {
outputFile = arg.split("=")[1];
} else if (arg.startsWith("--context=")) {
context = arg.split("=")[1];
} else if (arg.startsWith("--model=")) {
model = arg.split("=")[1];
} else if (arg.startsWith("--dir=")) {
resultsDir = arg.split("=")[1];
} else if (arg === "--latest") {
findLatest = true;
} else if (arg === "--help" || arg === "-h") {
console.log(`
AI Analyzer CLI
Usage: node cli.js [options]
Options:
--input=FILE Input JSON file
--output=FILE Output file (default: original filename with -ai suffix)
--context="description" Analysis context (default: "${DEFAULT_CONTEXT}")
--model=MODEL Ollama model (default: ${DEFAULT_MODEL})
--latest Use latest results file from results directory
--dir=PATH Directory to look for results (default: 'results')
--help, -h Show this help
Examples:
node cli.js --input=results.json
node cli.js --latest --dir=results
node cli.js --input=results.json --context="job trends" --model=mistral
Environment Variables:
AI_CONTEXT Default analysis context
OLLAMA_MODEL Default Ollama model
`);
process.exit(0);
}
}
async function main() {
try {
// Determine input file
if (findLatest) {
try {
inputFile = findLatestResultsFile(resultsDir);
logger.info(`Found latest results file: ${inputFile}`);
} catch (error) {
logger.error(
`❌ No results files found in '${resultsDir}': ${error.message}`
);
logger.info(`💡 To create results files:`);
logger.info(
` 1. Run a parser first (e.g., npm start in linkedin-parser)`
);
logger.info(` 2. Or provide a specific file with --input=FILE`);
logger.info(` 3. Or create a sample JSON file to test with`);
process.exit(1);
}
}
// If inputFile is a relative path and --dir is set, resolve it
if (inputFile && !path.isAbsolute(inputFile) && !fs.existsSync(inputFile)) {
const candidate = path.join(resultsDir, inputFile);
if (fs.existsSync(candidate)) {
inputFile = candidate;
}
}
if (!inputFile) {
logger.error("❌ Input file required. Use --input=FILE or --latest");
logger.info(`💡 Examples:`);
logger.info(` node cli.js --input=results.json`);
logger.info(` node cli.js --latest --dir=results`);
logger.info(` node cli.js --help`);
process.exit(1);
}
// Load input file
logger.step(`Loading input file: ${inputFile}`);
if (!fs.existsSync(inputFile)) {
throw new Error(`Input file not found: ${inputFile}`);
}
// `let`: reassigned below when the input is a bare array
let data = JSON.parse(fs.readFileSync(inputFile, "utf-8"));
// Extract posts from different formats
let posts = [];
if (data.results && Array.isArray(data.results)) {
posts = data.results;
logger.info(`Found ${posts.length} items in results array`);
} else if (Array.isArray(data)) {
posts = data;
logger.info(`Found ${posts.length} items in array`);
} else {
throw new Error("Invalid JSON format - need array or {results: [...]}");
}
if (posts.length === 0) {
throw new Error("No items found to analyze");
}
// Check AI availability
logger.step("Checking AI availability");
const aiAvailable = await checkOllamaStatus(model);
if (!aiAvailable) {
throw new Error(
`AI not available. Make sure Ollama is running and model '${model}' is installed.`
);
}
// Check if results already have AI analysis
const hasExistingAI = posts.some((post) => post.aiAnalysis);
if (hasExistingAI) {
logger.info(
`📋 Results already contain AI analysis - will update with new context`
);
}
// Prepare data for analysis
const analysisData = posts.map((post) => ({
text: post.text || post.content || post.post || "",
location: post.location || "Unknown",
keyword: post.keyword || "Unknown",
timestamp: post.timestamp || new Date().toISOString(),
}));
// Run analysis
logger.step(`Running AI analysis with context: "${context}"`);
const analysis = await analyzeBatch(analysisData, context, model);
// Integrate AI analysis back into the original results
const updatedPosts = posts.map((post, index) => {
const aiResult = analysis[index];
return {
...post,
aiAnalysis: {
isRelevant: aiResult.isRelevant,
confidence: aiResult.confidence,
reasoning: aiResult.reasoning,
context: context,
model: model,
analyzedAt: new Date().toISOString(),
},
};
});
// Update the original data structure
if (data.results && Array.isArray(data.results)) {
data.results = updatedPosts;
// Update metadata
data.metadata = data.metadata || {};
data.metadata.aiAnalysisUpdated = new Date().toISOString();
data.metadata.aiContext = context;
data.metadata.aiModel = model;
} else {
// If it's a simple array, create a proper structure
data = {
metadata: {
timestamp: new Date().toISOString(),
totalItems: updatedPosts.length,
aiContext: context,
aiModel: model,
analysisType: "cli",
},
results: updatedPosts,
};
}
// Generate output filename if not provided
if (!outputFile) {
// Use the original filename with -ai suffix
const originalName = path.basename(inputFile, path.extname(inputFile));
outputFile = path.join(
path.dirname(inputFile),
`${originalName}-ai.json`
);
}
// Save updated results back to file
fs.writeFileSync(outputFile, JSON.stringify(data, null, 2));
// Show summary
const relevant = analysis.filter((a) => a.isRelevant).length;
const irrelevant = analysis.filter((a) => !a.isRelevant).length;
const avgConfidence =
analysis.reduce((sum, a) => sum + a.confidence, 0) / analysis.length;
logger.success("✅ AI analysis completed and integrated");
logger.info(`📊 Context: "${context}"`);
logger.info(`📈 Total items analyzed: ${analysis.length}`);
logger.info(
`✅ Relevant items: ${relevant} (${(
(relevant / analysis.length) *
100
).toFixed(1)}%)`
);
logger.info(
`❌ Irrelevant items: ${irrelevant} (${(
(irrelevant / analysis.length) *
100
).toFixed(1)}%)`
);
logger.info(`🎯 Average confidence: ${avgConfidence.toFixed(2)}`);
logger.file(`🧠 Updated results saved to: ${outputFile}`);
} catch (error) {
logger.error(`❌ Analysis failed: ${error.message}`);
process.exit(1);
}
}
// Run the CLI
main();

ai-analyzer/demo.js Normal file
@@ -0,0 +1,346 @@
/**
* AI Analyzer Demo
*
* Demonstrates all the core utilities provided by the ai-analyzer package:
* - Logger functionality
* - Text processing utilities
* - Location validation
* - AI analysis capabilities
* - Test utilities
*/
const {
logger,
Logger,
cleanText,
containsAnyKeyword,
parseLocationFilters,
validateLocationAgainstFilters,
extractLocationFromProfile,
analyzeBatch,
} = require("./index");
// Terminal colors for demo output
const colors = {
reset: "\x1b[0m",
bright: "\x1b[1m",
cyan: "\x1b[36m",
green: "\x1b[32m",
yellow: "\x1b[33m",
blue: "\x1b[34m",
magenta: "\x1b[35m",
red: "\x1b[31m",
};
const demo = {
title: (text) =>
console.log(`\n${colors.bright}${colors.cyan}${text}${colors.reset}`),
section: (text) =>
console.log(`\n${colors.bright}${colors.magenta}${text}${colors.reset}`),
success: (text) => console.log(`${colors.green}${text}${colors.reset}`),
info: (text) => console.log(`${colors.blue} ${text}${colors.reset}`),
warning: (text) => console.log(`${colors.yellow}⚠️ ${text}${colors.reset}`),
error: (text) => console.log(`${colors.red}${text}${colors.reset}`),
code: (text) => console.log(`${colors.cyan}${text}${colors.reset}`),
};
async function runDemo() {
demo.title("=== AI Analyzer Demo ===");
demo.info(
"This demo showcases all the core utilities provided by the ai-analyzer package."
);
demo.info("Press Enter to continue through each section...\n");
await waitForEnter();
// 1. Logger Demo
await demonstrateLogger();
// 2. Text Processing Demo
await demonstrateTextProcessing();
// 3. Location Validation Demo
await demonstrateLocationValidation();
// 4. AI Analysis Demo
await demonstrateAIAnalysis();
// 5. Integration Demo
await demonstrateIntegration();
demo.title("=== Demo Complete ===");
demo.success("All ai-analyzer utilities demonstrated successfully!");
demo.info("Check the README.md for detailed API documentation.");
}
async function demonstrateLogger() {
demo.section("1. Logger Utilities");
demo.info(
"The logger provides consistent logging across all parsers with configurable levels and color support."
);
demo.code("// Using default logger");
logger.info("This is an info message");
logger.warning("This is a warning message");
logger.error("This is an error message");
logger.success("This is a success message");
logger.debug("This is a debug message (if enabled)");
demo.code("// Convenience methods with emoji prefixes");
logger.step("Starting demo process");
logger.search("Searching for keywords");
logger.ai("Running AI analysis");
logger.location("Validating location");
logger.file("Saving results");
demo.code("// Custom logger configuration");
const customLogger = new Logger({
debug: false,
colors: true,
});
customLogger.info("Custom logger with debug disabled");
customLogger.debug("This won't show");
demo.code("// Silent mode");
const silentLogger = new Logger();
silentLogger.silent();
silentLogger.info("This won't show");
silentLogger.verbose(); // Re-enable all levels
await waitForEnter();
}
async function demonstrateTextProcessing() {
demo.section("2. Text Processing Utilities");
demo.info(
"Text utilities provide content cleaning and keyword matching capabilities."
);
const sampleTexts = [
"Check out this #awesome post! https://example.com 🚀",
"Just got #laidoff from my job. Looking for new opportunities!",
"Company is #downsizing and I'm affected. #RIF #layoff",
"Great news! We're #hiring new developers! 🎉",
];
demo.code("// Text cleaning examples:");
sampleTexts.forEach((text, index) => {
const cleaned = cleanText(text);
demo.info(`Original: ${text}`);
demo.success(`Cleaned: ${cleaned}`);
console.log();
});
demo.code("// Keyword matching:");
const keywords = ["layoff", "downsizing", "RIF", "hiring"];
sampleTexts.forEach((text, index) => {
const hasMatch = containsAnyKeyword(text, keywords);
const matchedKeywords = keywords.filter((keyword) =>
text.toLowerCase().includes(keyword.toLowerCase())
);
demo.info(
`Text ${index + 1}: ${hasMatch ? "✅" : "❌"} ${
matchedKeywords.join(", ") || "No matches"
}`
);
});
await waitForEnter();
}
async function demonstrateLocationValidation() {
demo.section("3. Location Validation Utilities");
demo.info(
"Location utilities provide geographic filtering and validation capabilities."
);
demo.code("// Location filter parsing:");
const filterStrings = [
"Ontario,Manitoba",
"Toronto,Vancouver",
"British Columbia,Alberta",
"Canada",
];
filterStrings.forEach((filterString) => {
const filters = parseLocationFilters(filterString);
demo.info(`Filter: "${filterString}"`);
demo.success(`Parsed: [${filters.join(", ")}]`);
console.log();
});
demo.code("// Location validation examples:");
const testLocations = [
{ location: "Toronto, Ontario, Canada", filters: ["Ontario"] },
{ location: "Vancouver, BC", filters: ["British Columbia"] },
{ location: "Calgary, Alberta", filters: ["Ontario"] },
{ location: "Montreal, Quebec", filters: ["Ontario", "Manitoba"] },
{ location: "New York, NY", filters: ["Ontario"] },
];
testLocations.forEach(({ location, filters }) => {
const isValid = validateLocationAgainstFilters(location, filters);
demo.info(`Location: "${location}"`);
demo.info(`Filters: [${filters.join(", ")}]`);
demo.success(`Valid: ${isValid ? "✅ Yes" : "❌ No"}`);
console.log();
});
demo.code("// Profile location extraction:");
const profileTexts = [
"Software Engineer at Tech Corp • Toronto, Ontario",
"Product Manager • Vancouver, BC",
"Data Scientist • Remote",
"CEO at Startup Inc • Montreal, Quebec, Canada",
];
profileTexts.forEach((profileText) => {
const location = extractLocationFromProfile(profileText);
demo.info(`Profile: "${profileText}"`);
demo.success(`Extracted: "${location || "No location found"}"`);
console.log();
});
await waitForEnter();
}
async function demonstrateAIAnalysis() {
demo.section("4. AI Analysis Utilities");
demo.info(
"AI utilities provide content analysis using local Ollama models."
);
// Mock posts for demo
const mockPosts = [
{
id: "1",
content:
"Just got laid off from my software engineering role. Looking for new opportunities in Toronto.",
author: "John Doe",
location: "Toronto, Ontario",
},
{
id: "2",
content:
"Our company is downsizing and I'm affected. This is really tough news.",
author: "Jane Smith",
location: "Vancouver, BC",
},
{
id: "3",
content:
"We're hiring! Looking for talented developers to join our team.",
author: "Bob Wilson",
location: "Calgary, Alberta",
},
];
demo.code("// Mock AI analysis (simulated):");
demo.info("In a real scenario, this would call the local Ollama API");
mockPosts.forEach((post, index) => {
demo.info(`Post ${index + 1}: ${post.content.substring(0, 50)}...`);
demo.success(
`Analysis: Relevant to job layoffs (confidence: 0.${85 + index * 5})`
);
console.log();
});
demo.code("// Batch analysis simulation:");
demo.info("Processing batch of 3 posts...");
await simulateProcessing();
demo.success("Batch analysis completed!");
await waitForEnter();
}
async function demonstrateIntegration() {
demo.section("5. Integration Example");
demo.info("Here's how all utilities work together in a real scenario:");
const samplePost = {
id: "demo-1",
content:
"Just got #laidoff from my job at TechCorp! Looking for new opportunities in #Toronto. This is really tough but I'm staying positive! 🚀",
author: "Demo User",
location: "Toronto, Ontario, Canada",
};
demo.code("// Processing pipeline:");
// 1. Log the start
logger.step("Processing new post");
// 2. Clean the text
const cleanedContent = cleanText(samplePost.content);
logger.info(`Cleaned content: ${cleanedContent}`);
// 3. Check for keywords
const keywords = ["layoff", "downsizing", "RIF"];
const hasKeywords = containsAnyKeyword(cleanedContent, keywords);
logger.search(`Keyword match: ${hasKeywords ? "Found" : "Not found"}`);
// 4. Validate location
const locationFilters = parseLocationFilters("Ontario,Manitoba");
const isValidLocation = validateLocationAgainstFilters(
samplePost.location,
locationFilters
);
logger.location(`Location valid: ${isValidLocation ? "Yes" : "No"}`);
// 5. Simulate AI analysis
if (hasKeywords && isValidLocation) {
logger.ai("Running AI analysis...");
await simulateProcessing();
logger.success("Post accepted and analyzed!");
} else {
logger.warning("Post rejected - doesn't meet criteria");
}
await waitForEnter();
}
// Helper functions
function waitForEnter() {
return new Promise((resolve) => {
const readline = require("readline");
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
rl.question("\nPress Enter to continue...", () => {
rl.close();
resolve();
});
});
}
async function simulateProcessing() {
return new Promise((resolve) => {
const dots = [".", "..", "..."];
let i = 0;
const interval = setInterval(() => {
process.stdout.write(`\rProcessing${dots[i]}`);
i = (i + 1) % dots.length;
}, 500);
setTimeout(() => {
clearInterval(interval);
process.stdout.write("\r");
resolve();
}, 2000);
});
}
// Run the demo if this file is executed directly
if (require.main === module) {
runDemo().catch((error) => {
demo.error(`Demo failed: ${error.message}`);
process.exit(1);
});
}
module.exports = { runDemo };

ai-analyzer/index.js Normal file
@@ -0,0 +1,22 @@
/**
* ai-analyzer - Core utilities for parsers
* Main entry point that exports all modules
*/
// Export all utilities with clean namespace
module.exports = {
// Logger utilities
...require("./src/logger"),
// AI analysis utilities
...require("./src/ai-utils"),
// Text processing utilities
...require("./src/text-utils"),
// Location validation utilities
...require("./src/location-utils"),
// Test utilities
...require("./src/test-utils"),
};

ai-analyzer/package-lock.json generated Normal file
File diff suppressed because it is too large

ai-analyzer/package.json Normal file
@@ -0,0 +1,32 @@
{
"name": "ai-analyzer",
"version": "1.0.0",
"description": "Reusable core utilities for parsers: AI analysis, location validation, logging, and text processing",
"main": "index.js",
"bin": {
"ai-analyzer": "./cli.js"
},
"scripts": {
"test": "jest",
"cli": "node cli.js"
},
"keywords": [
"parser",
"ai",
"location",
"logging",
"scraper",
"ollama"
],
"author": "",
"license": "ISC",
"type": "commonjs",
"dependencies": {
"chalk": "^4.1.2",
"dotenv": "^17.0.0",
"csv-parser": "^3.2.0"
},
"devDependencies": {
"jest": "^29.7.0"
}
}

ai-analyzer/src/ai-utils.js Normal file
@@ -0,0 +1,301 @@
const { logger } = require("./logger");
/**
* AI Analysis utilities for post processing with Ollama
* Extracted from ai-analyzer-local.js for reuse across parsers
*/
/**
* Check if Ollama is running and the model is available
*/
async function checkOllamaStatus(
model = "mistral",
ollamaHost = "http://localhost:11434"
) {
try {
// Check if Ollama is running
const response = await fetch(`${ollamaHost}/api/tags`);
if (!response.ok) {
throw new Error(`Ollama not running on ${ollamaHost}`);
}
const data = await response.json();
const availableModels = data.models.map((m) => m.name);
logger.ai("Ollama is running");
logger.info(
`📦 Available models: ${availableModels
.map((m) => m.split(":")[0])
.join(", ")}`
);
// Check if requested model is available
const modelExists = availableModels.some((m) => m.startsWith(model));
if (!modelExists) {
logger.error(`Model "${model}" not found`);
logger.error(`💡 Install it with: ollama pull ${model}`);
logger.error(
`💡 Or choose from: ${availableModels
.map((m) => m.split(":")[0])
.join(", ")}`
);
return false;
}
logger.success(`Using model: ${model}`);
return true;
} catch (error) {
logger.error(`Error connecting to Ollama: ${error.message}`);
logger.error("💡 Make sure Ollama is installed and running:");
logger.error(" 1. Install: https://ollama.ai/");
logger.error(" 2. Start: ollama serve");
logger.error(` 3. Install model: ollama pull ${model}`);
return false;
}
}
/**
* Analyze multiple posts using local Ollama
*/
async function analyzeBatch(
posts,
context,
model = "mistral",
ollamaHost = "http://localhost:11434"
) {
logger.ai(`Analyzing batch of ${posts.length} posts with ${model}...`);
try {
const prompt = `You are an expert at analyzing LinkedIn posts for relevance to specific contexts.
CONTEXT TO MATCH: "${context}"
Analyze these ${
posts.length
} LinkedIn posts and determine if each relates to the context above.
POSTS:
${posts
.map(
(post, i) => `
POST ${i + 1}:
"${post.text.substring(0, 400)}${post.text.length > 400 ? "..." : ""}"
`
)
.join("")}
For each post, provide:
- Is it relevant to "${context}"? (YES/NO)
- Confidence level (0.0 to 1.0)
- Brief reasoning
Respond in this EXACT format for each post:
POST 1: YES/NO | 0.X | brief reason
POST 2: YES/NO | 0.X | brief reason
POST 3: YES/NO | 0.X | brief reason
Examples:
- For layoff context: "laid off 50 employees" = YES | 0.9 | mentions layoffs
- For hiring context: "we're hiring developers" = YES | 0.8 | job posting
- Unrelated content = NO | 0.1 | not relevant to context`;
const response = await fetch(`${ollamaHost}/api/generate`, {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false,
options: {
temperature: 0.3,
top_p: 0.9,
},
}),
});
if (!response.ok) {
throw new Error(
`Ollama API error: ${response.status} ${response.statusText}`
);
}
const data = await response.json();
const aiResponse = data.response.trim();
// Parse the response
const analyses = [];
const lines = aiResponse.split("\n").filter((line) => line.trim());
for (let i = 0; i < posts.length; i++) {
let analysis = {
postIndex: i + 1,
isRelevant: false,
confidence: 0.5,
reasoning: "Could not parse AI response",
};
// Look for lines that match "POST X:" pattern
const postPattern = new RegExp(`POST\\s*${i + 1}:?\\s*(.+)`, "i");
for (const line of lines) {
const match = line.match(postPattern);
if (match) {
const content = match[1].trim();
// Parse: YES/NO | 0.X | reasoning
const parts = content.split("|").map((p) => p.trim());
if (parts.length >= 3) {
analysis.isRelevant = parts[0].toUpperCase().includes("YES");
analysis.confidence = Math.max(
0,
Math.min(1, parseFloat(parts[1]) || 0.5)
);
analysis.reasoning = parts[2] || "No reasoning provided";
} else {
// Fallback parsing
analysis.isRelevant =
content.toUpperCase().includes("YES") ||
content.toLowerCase().includes("relevant");
analysis.confidence = 0.6;
analysis.reasoning = content.substring(0, 100);
}
break;
}
}
analyses.push(analysis);
}
// If we didn't get enough analyses, fill in defaults
while (analyses.length < posts.length) {
analyses.push({
postIndex: analyses.length + 1,
isRelevant: false,
confidence: 0.3,
reasoning: "AI response parsing failed",
});
}
return analyses;
} catch (error) {
logger.error(`Error in batch AI analysis: ${error.message}`);
// Fallback: mark all as relevant with low confidence
return posts.map((_, i) => ({
postIndex: i + 1,
isRelevant: true,
confidence: 0.3,
reasoning: `Analysis failed: ${error.message}`,
}));
}
}
/**
* Analyze a single post using local Ollama (fallback)
*/
async function analyzeSinglePost(
text,
context,
model = "mistral",
ollamaHost = "http://localhost:11434"
) {
const prompt = `Analyze this LinkedIn post for relevance to: "${context}"
Post: "${text}"
Is this post relevant to "${context}"? Provide:
1. YES or NO
2. Confidence (0.0 to 1.0)
3. Brief reason
Format: YES/NO | 0.X | reason`;
try {
const response = await fetch(`${ollamaHost}/api/generate`, {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
model: model,
prompt: prompt,
stream: false,
options: {
temperature: 0.3,
},
}),
});
if (!response.ok) {
throw new Error(`Ollama API error: ${response.status}`);
}
const data = await response.json();
const aiResponse = data.response.trim();
// Parse response
const parts = aiResponse.split("|").map((p) => p.trim());
if (parts.length >= 3) {
return {
isRelevant: parts[0].toUpperCase().includes("YES"),
confidence: Math.max(0, Math.min(1, parseFloat(parts[1]) || 0.5)),
reasoning: parts[2],
};
} else {
// Fallback parsing
return {
isRelevant:
aiResponse.toLowerCase().includes("yes") ||
aiResponse.toLowerCase().includes("relevant"),
confidence: 0.6,
reasoning: aiResponse.substring(0, 100),
};
}
} catch (error) {
return {
isRelevant: true, // Default to include on error
confidence: 0.3,
reasoning: `Analysis failed: ${error.message}`,
};
}
}
/**
* Find the most recent results file if none specified
*/
function findLatestResultsFile(resultsDir = "results") {
const fs = require("fs");
const path = require("path");
if (!fs.existsSync(resultsDir)) {
throw new Error("Results directory not found. Run the scraper first.");
}
const files = fs
.readdirSync(resultsDir)
.filter(
(f) =>
(f.startsWith("results-") || f.startsWith("linkedin-results-")) &&
f.endsWith(".json") &&
!f.includes("-ai-")
)
.sort()
.reverse();
if (files.length === 0) {
throw new Error("No results files found. Run the scraper first.");
}
return path.join(resultsDir, files[0]);
}
module.exports = {
checkOllamaStatus,
analyzeBatch,
analyzeSinglePost,
findLatestResultsFile,
};
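
The strict `POST N: YES/NO | 0.X | reason` format that `analyzeBatch` asks the model for is what makes the response machine-parseable. A minimal standalone sketch of the per-line parsing (mirroring the logic above, including its fallbacks; `parseAnalysisLine` is illustrative, not part of the module):

```javascript
// Parse one response line in the "POST N: YES/NO | 0.X | reason" format,
// falling back to safe defaults when the line is missing or malformed.
function parseAnalysisLine(line, postNumber) {
  const match = line.match(new RegExp(`POST\\s*${postNumber}:?\\s*(.+)`, "i"));
  if (!match) {
    return { isRelevant: false, confidence: 0.5, reasoning: "Could not parse AI response" };
  }
  const parts = match[1].split("|").map((p) => p.trim());
  return {
    isRelevant: parts[0].toUpperCase().includes("YES"),
    confidence: Math.max(0, Math.min(1, parseFloat(parts[1]) || 0.5)),
    reasoning: parts[2] || "No reasoning provided",
  };
}

console.log(parseAnalysisLine("POST 1: YES | 0.9 | mentions layoffs", 1));
```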


@@ -1,19 +1,16 @@
/**
* Enhanced Location Filtering Utilities - Improved Version
*
* Place all keyword CSVs in the keywords/ folder for use with LinkedOut.
*
* These utilities provide:
* - Comprehensive city/province lookup for Canada
* - Fast O(1) city-to-province matching
* - Flexible location filter parsing and validation
* - Used by linkedout.js for profile location validation
* - Used by parsers for profile location validation
*
* USAGE (for developers):
* const { parseLocationFilters, validateLocationAgainstFilters, extractLocationFromProfile } = require('./location-utils');
*
* See linkedout.js for integration details.
* const { parseLocationFilters, validateLocationAgainstFilters, extractLocationFromProfile } = require('ai-analyzer');
*/
// Suppress D-Bus notification errors in WSL
process.env.NO_AT_BRIDGE = "1";
process.env.DBUS_SESSION_BUS_ADDRESS = "/dev/null";
@@ -893,7 +890,7 @@ for (const [province, cities] of Object.entries(CITIES_BY_PROVINCE)) {
}
}
// Province name variations and abbreviations (unchanged)
// Province name variations and abbreviations
const PROVINCE_VARIATIONS = {
ontario: ["ontario", "ont", "on"],
manitoba: ["manitoba", "man", "mb"],

123
ai-analyzer/src/logger.js Normal file

@@ -0,0 +1,123 @@
const chalk = require("chalk");
/**
* Configurable logger with color support and level controls
* Can enable/disable different log levels: debug, info, warning, error, success
*/
class Logger {
constructor(options = {}) {
this.levels = {
debug: options.debug !== false,
info: options.info !== false,
warning: options.warning !== false,
error: options.error !== false,
success: options.success !== false,
};
this.colors = options.colors !== false;
}
_formatMessage(level, message, prefix = "") {
const timestamp = new Date().toLocaleTimeString();
const fullMessage = `${prefix}${message}`;
if (!this.colors) {
return `[${timestamp}] [${level.toUpperCase()}] ${fullMessage}`;
}
switch (level) {
case "debug":
return chalk.gray(`[${timestamp}] [DEBUG] ${fullMessage}`);
case "info":
return chalk.blue(`[${timestamp}] [INFO] ${fullMessage}`);
case "warning":
return chalk.yellow(`[${timestamp}] [WARNING] ${fullMessage}`);
case "error":
return chalk.red(`[${timestamp}] [ERROR] ${fullMessage}`);
case "success":
return chalk.green(`[${timestamp}] [SUCCESS] ${fullMessage}`);
default:
return `[${timestamp}] [${level.toUpperCase()}] ${fullMessage}`;
}
}
debug(message) {
if (this.levels.debug) {
console.log(this._formatMessage("debug", message));
}
}
info(message) {
if (this.levels.info) {
console.log(this._formatMessage("info", message));
}
}
warning(message) {
if (this.levels.warning) {
console.warn(this._formatMessage("warning", message));
}
}
error(message) {
if (this.levels.error) {
console.error(this._formatMessage("error", message));
}
}
success(message) {
if (this.levels.success) {
console.log(this._formatMessage("success", message));
}
}
// Convenience methods with emoji prefixes for better UX
step(message) {
this.info(`🚀 ${message}`);
}
search(message) {
this.info(`🔍 ${message}`);
}
ai(message) {
this.info(`🧠 ${message}`);
}
location(message) {
this.info(`📍 ${message}`);
}
file(message) {
this.info(`📄 ${message}`);
}
// Configure logger levels at runtime
setLevel(level, enabled) {
if (this.levels.hasOwnProperty(level)) {
this.levels[level] = enabled;
}
}
// Disable all logging
silent() {
Object.keys(this.levels).forEach((level) => {
this.levels[level] = false;
});
}
// Enable all logging
verbose() {
Object.keys(this.levels).forEach((level) => {
this.levels[level] = true;
});
}
}
// Create default logger instance
const logger = new Logger();
// Export both the class and default instance
module.exports = {
Logger,
logger,
};
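
The plain-output path (`colors: false`) boils down to a timestamp-and-level prefix. A standalone sketch of that formatting, assumed equivalent to the no-color branch of `_formatMessage` above (not the module itself):

```javascript
// Standalone sketch of the no-color formatting path:
// "[<local time>] [<LEVEL>] <message>".
function formatMessage(level, message) {
  const timestamp = new Date().toLocaleTimeString();
  return `[${timestamp}] [${level.toUpperCase()}] ${message}`;
}

console.log(formatMessage("info", "Scrape started"));
```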


@@ -0,0 +1,124 @@
/**
* Shared test utilities for parsers
* Common mocks, helpers, and test data
*/
/**
* Mock Playwright page object for testing
*/
function createMockPage() {
return {
goto: jest.fn().mockResolvedValue(undefined),
waitForSelector: jest.fn().mockResolvedValue(undefined),
$$: jest.fn().mockResolvedValue([]),
$: jest.fn().mockResolvedValue(null),
textContent: jest.fn().mockResolvedValue(""),
close: jest.fn().mockResolvedValue(undefined),
};
}
/**
* Mock fetch for AI API calls
*/
function createMockFetch(response = {}) {
return jest.fn().mockResolvedValue({
ok: true,
status: 200,
json: jest.fn().mockResolvedValue(response),
...response,
});
}
/**
* Sample test data for posts
*/
const samplePosts = [
{
text: "We are laying off 100 employees due to economic downturn.",
keyword: "layoff",
profileLink: "https://linkedin.com/in/test-user-1",
},
{
text: "Exciting opportunity! We are hiring senior developers for our team.",
keyword: "hiring",
profileLink: "https://linkedin.com/in/test-user-2",
},
];
/**
* Sample location test data
*/
const sampleLocations = [
"Toronto, Ontario, Canada",
"Vancouver, BC",
"Calgary, Alberta",
"Montreal, Quebec",
"Halifax, Nova Scotia",
];
/**
* Common test assertions
*/
function expectValidPost(post) {
expect(post).toHaveProperty("text");
expect(post).toHaveProperty("keyword");
expect(post).toHaveProperty("profileLink");
expect(typeof post.text).toBe("string");
expect(post.text.length).toBeGreaterThan(0);
}
function expectValidAIAnalysis(analysis) {
expect(analysis).toHaveProperty("isRelevant");
expect(analysis).toHaveProperty("confidence");
expect(analysis).toHaveProperty("reasoning");
expect(typeof analysis.isRelevant).toBe("boolean");
expect(analysis.confidence).toBeGreaterThanOrEqual(0);
expect(analysis.confidence).toBeLessThanOrEqual(1);
}
function expectValidLocation(location) {
expect(typeof location).toBe("string");
expect(location.length).toBeGreaterThan(0);
}
/**
* Test environment setup
*/
function setupTestEnv() {
// Mock environment variables
process.env.NODE_ENV = "test";
process.env.OLLAMA_HOST = "http://localhost:11434";
process.env.AI_CONTEXT = "test context";
// Suppress console output during tests
jest.spyOn(console, "log").mockImplementation(() => {});
jest.spyOn(console, "error").mockImplementation(() => {});
jest.spyOn(console, "warn").mockImplementation(() => {});
}
/**
* Clean up test environment
*/
function teardownTestEnv() {
// Restore console
console.log.mockRestore();
console.error.mockRestore();
console.warn.mockRestore();
// Clear environment
delete process.env.NODE_ENV;
delete process.env.OLLAMA_HOST;
delete process.env.AI_CONTEXT;
}
module.exports = {
createMockPage,
createMockFetch,
samplePosts,
sampleLocations,
expectValidPost,
expectValidAIAnalysis,
expectValidLocation,
setupTestEnv,
teardownTestEnv,
};


@@ -0,0 +1,107 @@
/**
* Text processing utilities for cleaning and validating content
* Extracted from linkedout.js for reuse across parsers
*/
/**
* Clean text by removing hashtags, URLs, emojis, and normalizing whitespace
*/
function cleanText(text) {
if (!text || typeof text !== "string") {
return "";
}
// Remove hashtags
text = text.replace(/#\w+/g, "");
// Remove hashtag mentions
text = text.replace(/\bhashtag\b/gi, "");
text = text.replace(/hashtag-\w+/gi, "");
// Remove URLs
text = text.replace(/https?:\/\/[^\s]+/g, "");
// Remove emojis (Unicode ranges for common emoji)
text = text.replace(
/[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F1E0}-\u{1F1FF}]/gu,
""
);
// Normalize whitespace
text = text.replace(/\s+/g, " ").trim();
return text;
}
/**
* Check if text contains any of the specified keywords (case insensitive)
*/
function containsAnyKeyword(text, keywords) {
if (!text || !Array.isArray(keywords)) {
return false;
}
const lowerText = text.toLowerCase();
return keywords.some((keyword) => lowerText.includes(keyword.toLowerCase()));
}
/**
* Validate if text meets basic quality criteria
*/
function isValidText(text, minLength = 30) {
if (!text || typeof text !== "string") {
return false;
}
// Check minimum length
if (text.length < minLength) {
return false;
}
// Check if text contains alphanumeric characters
if (!/[a-zA-Z0-9]/.test(text)) {
return false;
}
return true;
}
/**
* Extract domain from URL
*/
function extractDomain(url) {
if (!url || typeof url !== "string") {
return null;
}
try {
const urlObj = new URL(url);
return urlObj.hostname;
} catch (error) {
return null;
}
}
/**
* Normalize URL by removing query parameters and fragments
*/
function normalizeUrl(url) {
if (!url || typeof url !== "string") {
return "";
}
try {
const urlObj = new URL(url);
return `${urlObj.protocol}//${urlObj.hostname}${urlObj.pathname}`;
} catch (error) {
return url;
}
}
module.exports = {
cleanText,
containsAnyKeyword,
isValidText,
extractDomain,
normalizeUrl,
};
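
A quick worked example of what `cleanText` strips. This sketch reimplements the same regex steps inline so it runs standalone; it mirrors, but is not, the exported function:

```javascript
// Standalone reimplementation of cleanText's steps, in the same order:
// hashtags, literal "hashtag" mentions, URLs, common emoji, whitespace.
function cleanText(text) {
  if (!text || typeof text !== "string") return "";
  return text
    .replace(/#\w+/g, "")
    .replace(/\bhashtag\b/gi, "")
    .replace(/hashtag-\w+/gi, "")
    .replace(/https?:\/\/[^\s]+/g, "")
    .replace(
      /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F1E0}-\u{1F1FF}]/gu,
      ""
    )
    .replace(/\s+/g, " ")
    .trim();
}

console.log(cleanText("Big news! #layoffs hit https://example.com today 🚀"));
// → "Big news! hit today"
```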


@@ -0,0 +1,194 @@
/**
* Test file for logger functionality
*/
const { Logger, logger } = require("../src/logger");
describe("Logger", () => {
let consoleSpy;
let consoleWarnSpy;
let consoleErrorSpy;
beforeEach(() => {
consoleSpy = jest.spyOn(console, "log").mockImplementation();
consoleWarnSpy = jest.spyOn(console, "warn").mockImplementation();
consoleErrorSpy = jest.spyOn(console, "error").mockImplementation();
});
afterEach(() => {
consoleSpy.mockRestore();
consoleWarnSpy.mockRestore();
consoleErrorSpy.mockRestore();
});
test("should create default logger instance", () => {
expect(logger).toBeDefined();
expect(logger).toBeInstanceOf(Logger);
});
test("should log info messages", () => {
logger.info("Test message");
expect(consoleSpy).toHaveBeenCalled();
});
test("should create custom logger with disabled levels", () => {
const customLogger = new Logger({ debug: false });
customLogger.debug("This should not log");
expect(consoleSpy).not.toHaveBeenCalled();
});
test("should use emoji prefixes for convenience methods", () => {
logger.step("Test step");
logger.ai("Test AI");
logger.location("Test location");
expect(consoleSpy).toHaveBeenCalledTimes(3);
});
test("should configure levels at runtime", () => {
const customLogger = new Logger();
customLogger.setLevel("debug", false);
customLogger.debug("This should not log");
expect(consoleSpy).not.toHaveBeenCalled();
});
test("should go silent when requested", () => {
const customLogger = new Logger();
customLogger.silent();
customLogger.info("This should not log");
customLogger.error("This should not log");
expect(consoleSpy).not.toHaveBeenCalled();
expect(consoleErrorSpy).not.toHaveBeenCalled();
});
// Additional test cases for comprehensive coverage
test("should log warning messages", () => {
logger.warning("Test warning");
expect(consoleWarnSpy).toHaveBeenCalled();
});
test("should log error messages", () => {
logger.error("Test error");
expect(consoleErrorSpy).toHaveBeenCalled();
});
test("should log success messages", () => {
logger.success("Test success");
expect(consoleSpy).toHaveBeenCalled();
});
test("should log debug messages", () => {
logger.debug("Test debug");
expect(consoleSpy).toHaveBeenCalled();
});
test("should respect disabled warning level", () => {
const customLogger = new Logger({ warning: false });
customLogger.warning("This should not log");
expect(consoleWarnSpy).not.toHaveBeenCalled();
});
test("should respect disabled error level", () => {
const customLogger = new Logger({ error: false });
customLogger.error("This should not log");
expect(consoleErrorSpy).not.toHaveBeenCalled();
});
test("should respect disabled success level", () => {
const customLogger = new Logger({ success: false });
customLogger.success("This should not log");
expect(consoleSpy).not.toHaveBeenCalled();
});
test("should respect disabled info level", () => {
const customLogger = new Logger({ info: false });
customLogger.info("This should not log");
expect(consoleSpy).not.toHaveBeenCalled();
});
test("should test all convenience methods", () => {
logger.step("Test step");
logger.search("Test search");
logger.ai("Test AI");
logger.location("Test location");
logger.file("Test file");
expect(consoleSpy).toHaveBeenCalledTimes(5);
});
test("should enable all levels with verbose method", () => {
const customLogger = new Logger({ debug: false, info: false });
customLogger.verbose();
customLogger.debug("This should log");
customLogger.info("This should log");
expect(consoleSpy).toHaveBeenCalledTimes(2);
});
test("should handle setLevel with invalid level gracefully", () => {
const customLogger = new Logger();
expect(() => {
customLogger.setLevel("invalid", false);
}).not.toThrow();
});
test("should format messages with timestamps", () => {
logger.info("Test message");
const loggedMessage = consoleSpy.mock.calls[0][0];
    expect(loggedMessage).toMatch(/\[\d{1,2}:\d{2}:\d{2}(?:\s?[AP]M)?\]/);
});
test("should include level in formatted messages", () => {
logger.info("Test message");
const loggedMessage = consoleSpy.mock.calls[0][0];
expect(loggedMessage).toContain("[INFO]");
});
test("should disable colors when colors option is false", () => {
const customLogger = new Logger({ colors: false });
customLogger.info("Test message");
const loggedMessage = consoleSpy.mock.calls[0][0];
// Should not contain ANSI color codes
expect(loggedMessage).not.toMatch(/\u001b\[/);
});
test("should enable colors by default", () => {
logger.info("Test message");
const loggedMessage = consoleSpy.mock.calls[0][0];
// Should contain ANSI color codes
expect(loggedMessage).toMatch(/\u001b\[/);
});
test("should handle multiple level configurations", () => {
const customLogger = new Logger({
debug: false,
info: true,
warning: false,
error: true,
success: false,
});
customLogger.debug("Should not log");
customLogger.info("Should log");
customLogger.warning("Should not log");
customLogger.error("Should log");
customLogger.success("Should not log");
expect(consoleSpy).toHaveBeenCalledTimes(1);
expect(consoleErrorSpy).toHaveBeenCalledTimes(1);
expect(consoleWarnSpy).not.toHaveBeenCalled();
});
test("should handle empty or undefined messages", () => {
expect(() => {
logger.info("");
logger.info(undefined);
logger.info(null);
}).not.toThrow();
});
test("should handle complex message objects", () => {
const testObj = { key: "value", nested: { data: "test" } };
expect(() => {
logger.info(testObj);
}).not.toThrow();
});
});


@@ -0,0 +1,94 @@
/**
* Authentication Manager
*
* Handles login/authentication for different sites
*/
class AuthManager {
constructor(coreParser) {
this.coreParser = coreParser;
}
/**
* Authenticate to a specific site
*/
async authenticate(site, credentials, pageId = "default") {
const strategies = {
linkedin: this.authenticateLinkedIn.bind(this),
// Add more auth strategies as needed
};
const strategy = strategies[site.toLowerCase()];
if (!strategy) {
throw new Error(`No authentication strategy found for site: ${site}`);
}
return await strategy(credentials, pageId);
}
/**
* LinkedIn authentication strategy
*/
async authenticateLinkedIn(credentials, pageId = "default") {
const { username, password } = credentials;
if (!username || !password) {
throw new Error("LinkedIn authentication requires username and password");
}
const page = this.coreParser.getPage(pageId);
if (!page) {
throw new Error(`Page with ID '${pageId}' not found`);
}
try {
// Navigate to LinkedIn login
await this.coreParser.navigateTo("https://www.linkedin.com/login", {
pageId,
});
// Fill credentials
await page.fill('input[name="session_key"]', username);
await page.fill('input[name="session_password"]', password);
// Submit form
await page.click('button[type="submit"]');
// Wait for successful login (profile image appears)
await page.waitForSelector("img.global-nav__me-photo", {
timeout: 15000,
});
return true;
} catch (error) {
throw new Error(`LinkedIn authentication failed: ${error.message}`);
}
}
/**
* Check if currently authenticated to a site
*/
async isAuthenticated(site, pageId = "default") {
const page = this.coreParser.getPage(pageId);
if (!page) {
return false;
}
const checkers = {
linkedin: async () => {
try {
await page.waitForSelector("img.global-nav__me-photo", {
timeout: 2000,
});
return true;
} catch {
return false;
}
},
};
const checker = checkers[site.toLowerCase()];
return checker ? await checker() : false;
}
}
module.exports = AuthManager;
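
The site-to-strategy dispatch in `authenticate` is a plain lookup table keyed by lowercased site name. A minimal standalone sketch of that pattern (the `demo` strategy is hypothetical and exists only for illustration):

```javascript
// Minimal sketch of AuthManager's strategy-lookup dispatch.
async function authenticate(site, credentials, strategies) {
  const strategy = strategies[site.toLowerCase()];
  if (!strategy) {
    throw new Error(`No authentication strategy found for site: ${site}`);
  }
  return strategy(credentials);
}

const strategies = {
  // Hypothetical strategy: succeeds only when both fields are present.
  demo: async ({ username, password }) => Boolean(username && password),
};

authenticate("DEMO", { username: "u", password: "p" }, strategies).then(console.log);
// prints: true
```

Adding support for a new site is then a one-line change to the `strategies` table rather than a new conditional branch.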

131
core-parser/navigation.js Normal file

@@ -0,0 +1,131 @@
/**
* Navigation Manager
*
* Handles page navigation with error handling, retries, and logging
*/
class NavigationManager {
constructor(coreParser) {
this.coreParser = coreParser;
}
/**
* Navigate to URL with comprehensive error handling
*/
async navigateTo(url, options = {}) {
const {
pageId = "default",
waitUntil = "domcontentloaded",
retries = 1,
retryDelay = 2000,
timeout = this.coreParser.config.timeout,
} = options;
const page = this.coreParser.getPage(pageId);
if (!page) {
throw new Error(`Page with ID '${pageId}' not found`);
}
let lastError;
for (let attempt = 0; attempt <= retries; attempt++) {
try {
console.log(
`🌐 Navigating to: ${url} (attempt ${attempt + 1}/${retries + 1})`
);
await page.goto(url, {
waitUntil,
timeout,
});
console.log(`✅ Navigation successful: ${url}`);
return true;
} catch (error) {
lastError = error;
console.warn(
`⚠️ Navigation attempt ${attempt + 1} failed: ${error.message}`
);
if (attempt < retries) {
console.log(`🔄 Retrying in ${retryDelay}ms...`);
await this.delay(retryDelay);
}
}
}
// All attempts failed
const errorMessage = `Navigation failed after ${retries + 1} attempts: ${
lastError.message
}`;
    console.error(errorMessage);
throw new Error(errorMessage);
}
/**
* Navigate and wait for specific selector
*/
async navigateAndWaitFor(url, selector, options = {}) {
await this.navigateTo(url, options);
const { pageId = "default", timeout = this.coreParser.config.timeout } =
options;
const page = this.coreParser.getPage(pageId);
try {
await page.waitForSelector(selector, { timeout });
console.log(`✅ Selector found: ${selector}`);
return true;
} catch (error) {
console.warn(`⚠️ Selector not found: ${selector} - ${error.message}`);
return false;
}
}
/**
* Check if current page has specific content
*/
async hasContent(content, options = {}) {
const { pageId = "default", timeout = 5000 } = options;
const page = this.coreParser.getPage(pageId);
try {
await page.waitForFunction(
(text) => document.body.innerText.includes(text),
content,
{ timeout }
);
return true;
} catch {
return false;
}
}
/**
* Utility delay function
*/
async delay(ms) {
return new Promise((resolve) => setTimeout(resolve, ms));
}
/**
* Get current page URL
*/
getCurrentUrl(pageId = "default") {
const page = this.coreParser.getPage(pageId);
return page ? page.url() : null;
}
/**
* Take screenshot for debugging
*/
async screenshot(filepath, pageId = "default") {
const page = this.coreParser.getPage(pageId);
if (page) {
await page.screenshot({ path: filepath });
console.log(`📸 Screenshot saved: ${filepath}`);
}
}
}
module.exports = NavigationManager;
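
The retry loop in `navigateTo` generalizes to any async operation: attempt, catch, delay, retry, then surface the last error. A self-contained sketch of the same structure (`withRetries` is illustrative, not part of the module):

```javascript
// Generic retry helper mirroring navigateTo's attempt/delay structure.
async function withRetries(fn, { retries = 1, retryDelay = 10 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  throw new Error(`Failed after ${retries + 1} attempts: ${lastError.message}`);
}

// Fails once, then succeeds on the second attempt.
let calls = 0;
withRetries(() => {
  calls += 1;
  if (calls === 1) throw new Error("transient");
  return "ok";
}).then(console.log);
// prints: ok
```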

27
core-parser/package.json Normal file

@@ -0,0 +1,27 @@
{
"name": "core-parser",
"version": "1.0.0",
"description": "Core browser automation and parsing engine for all parsers",
"main": "index.js",
"scripts": {
"test": "jest",
"install:browsers": "npx playwright install chromium"
},
"keywords": [
"parser",
"playwright",
"browser",
"automation",
"core"
],
"author": "Job Market Intelligence Team",
"license": "ISC",
"type": "commonjs",
"dependencies": {
"playwright": "^1.53.2",
"dotenv": "^17.0.0"
},
"devDependencies": {
"jest": "^29.7.0"
}
}

414
demo.js

@@ -1,414 +0,0 @@
const fs = require("fs");
const path = require("path");
const readline = require("readline");
// Terminal colors for better readability
const colors = {
reset: "\x1b[0m",
bright: "\x1b[1m",
dim: "\x1b[2m",
red: "\x1b[31m",
green: "\x1b[32m",
yellow: "\x1b[33m",
blue: "\x1b[34m",
magenta: "\x1b[35m",
cyan: "\x1b[36m",
white: "\x1b[37m",
bgRed: "\x1b[41m",
bgGreen: "\x1b[42m",
bgYellow: "\x1b[43m",
bgBlue: "\x1b[44m",
};
// Helper functions for colored output
const log = {
title: (text) =>
console.log(`${colors.bright}${colors.cyan}${text}${colors.reset}`),
success: (text) => console.log(`${colors.green}${text}${colors.reset}`),
info: (text) => console.log(`${colors.blue} ${text}${colors.reset}`),
warning: (text) => console.log(`${colors.yellow}⚠️ ${text}${colors.reset}`),
error: (text) => console.log(`${colors.red}${text}${colors.reset}`),
highlight: (text) =>
console.log(`${colors.bright}${colors.yellow}${text}${colors.reset}`),
step: (text) =>
console.log(`${colors.bright}${colors.magenta}🚀 ${text}${colors.reset}`),
file: (text) => console.log(`${colors.cyan}📄 ${text}${colors.reset}`),
ai: (text) =>
console.log(`${colors.bright}${colors.blue}🧠 ${text}${colors.reset}`),
search: (text) => console.log(`${colors.green}🔍 ${text}${colors.reset}`),
};
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
terminal: true,
});
function prompt(question, defaultVal) {
return new Promise((resolve) => {
rl.question(`${question} (default: ${defaultVal}): `, (answer) => {
resolve(answer.trim() || defaultVal);
});
});
}
/**
* Fetch available Ollama models from the local instance
*/
async function getAvailableModels() {
// For demo purposes, just mock 3 popular models
log.info("Simulating Ollama model detection...");
await new Promise((r) => setTimeout(r, 500)); // Simulate API call
const mockModels = ["mistral", "llama2", "codellama"];
log.success(`Found ${mockModels.length} available models`);
return mockModels;
}
/**
* Interactive model selection with available models
*/
async function selectModel(availableModels) {
log.highlight("\n📦 Available Ollama models:");
availableModels.forEach((model, index) => {
console.log(
` ${colors.bright}${index + 1}.${colors.reset} ${colors.cyan}${model}${
colors.reset
}`
);
});
const defaultModel = availableModels.includes("mistral")
? "mistral"
: availableModels[0];
const selection = await prompt(
`${colors.bright}Choose model (1-${availableModels.length} or model name)${colors.reset}`,
defaultModel
);
// Check if it's a number selection
const num = parseInt(selection);
if (num >= 1 && num <= availableModels.length) {
const selectedModel = availableModels[num - 1];
log.success(`Selected model: ${selectedModel}`);
return selectedModel;
}
// Check if it's a valid model name
if (availableModels.includes(selection)) {
log.success(`Selected model: ${selection}`);
return selection;
}
// Default fallback
log.success(`Using default model: ${defaultModel}`);
return defaultModel;
}
async function main() {
log.title("=== LinkedOut Demo Workflow ===");
log.info(
"This is a simulated demo for creating a GIF. It uses fake data and anonymizes personal information."
);
log.highlight("Press Enter to accept defaults.\n");
// Prompt for all possible settings based on linkedout.js configurations
const headless = await prompt("Headless mode", "true");
const keywordsSource = await prompt(
"Keywords source (CSV file or comma-separated)",
"keywords-layoff.csv"
);
const addKeywords = await prompt("Additional keywords (comma-separated)", "");
const city = await prompt("City", "Toronto");
const date_posted = await prompt(
"Date posted (past-24h, past-week, past-month, or empty)",
"past-week"
);
const sort_by = await prompt(
"Sort by (date_posted or relevance)",
"date_posted"
);
const wheels = await prompt("Number of scrolls", "5");
const location_filter = await prompt(
"Location filter (e.g., Ontario,Manitoba)",
"Ontario"
);
const enable_location = await prompt("Enable location check", "true");
const output = await prompt(
"Output file (without extension)",
"demo-results"
);
const enable_ai = await prompt("Enable local AI", "true");
const run_ai_after = await prompt("Run AI after scraping", "true");
const ai_context = await prompt(
"AI context",
"job layoffs and workforce reduction"
);
// Get available models and let user choose
const availableModels = await getAvailableModels();
const ollama_model = await selectModel(availableModels);
const ai_confidence = await prompt("AI confidence threshold", "0.7");
const ai_batch_size = await prompt("AI batch size", "3");
// Simulate loading keywords (only use first 2 for demo)
let keywords = ["layoff", "downsizing"]; // Default demo keywords - only 2 for demo
if (keywordsSource !== "keywords-layoff.csv") {
keywords = keywordsSource
.split(",")
.map((k) => k.trim())
.slice(0, 2);
}
if (addKeywords) {
keywords = keywords.concat(addKeywords.split(",").map((k) => k.trim()));
}
log.step(`Starting demo scrape with ${keywords.length} keywords...`);
log.info(`🌍 City: ${city}, Date: ${date_posted}, Sort: ${sort_by}`);
log.info(
`🔄 Scrolls: ${wheels}, Location filter: ${location_filter || "None"}`
);
// Simulate browser launch and login
await new Promise((r) => setTimeout(r, 500));
log.step("Launching browser" + (headless === "true" ? " (headless)" : ""));
await new Promise((r) => setTimeout(r, 500));
log.step("Logging in to LinkedIn...");
// Simulate scraping for each keyword
const fakePosts = [];
const rejectedPosts = [];
// Define specific numbers for each keyword
const keywordData = {
layoff: { found: 3, accepted: 2, rejected: 1 },
downsizing: { found: 2, accepted: 1, rejected: 1 },
};
for (const keyword of keywords) {
await new Promise((r) => setTimeout(r, 300));
const data = keywordData[keyword] || { found: 2, accepted: 1, rejected: 1 };
log.search(`Searching for "${keyword}"...`);
log.info(`Found ${data.found} posts, checking profiles for location...`);
// Add specific number of accepted posts per keyword
for (let i = 0; i < data.accepted; i++) {
const location =
enable_location === "true"
? i % 2 === 0
? "Toronto, Ontario, Canada"
: "Calgary, Alberta, Canada"
: undefined;
let text;
if (keyword === "layoff") {
text =
i === 0
? "Long considered a local success story, Calgary robotics company Attabotics is restructuring as it deals with insolvency. It has terminated 192 of its 203 employees, keeping a skeleton crew of only 11 as it navigates the road ahead."
: "I'm working to report on the recent game industry layoffs and I'm hoping to connect with anyone connected to or impacted by the recent mass layoffs. Please feel free to contact me either here or anonymously.";
} else {
text =
"Thinking about downsizing your home in Alberta? It's not just a change of address—it's a smart financial move and a big step toward enjoying retirement! Here's what you need to know about tapping into home equity and saving on monthly bills.";
}
fakePosts.push({
keyword,
text: text,
profileLink: `https://www.linkedin.com/in/demo-user-${Math.random()
.toString(36)
.slice(2)}`,
timestamp:
new Date().toISOString().split("T")[0] +
", " +
new Date().toLocaleTimeString("en-CA", { hour12: false }),
location,
locationValid: location ? true : undefined,
locationMatchedFilter: location
? location.includes("Ontario")
? "ontario"
: "alberta"
: undefined,
locationReasoning: location
? `Direct match: "${
location.includes("Ontario") ? "ontario" : "alberta"
}" found in "${location}"`
: undefined,
aiProcessed: false,
});
}
// Add specific rejected posts per keyword
for (let i = 0; i < data.rejected; i++) {
if (keyword === "layoff") {
rejectedPosts.push({
rejected: true,
reason:
'Location filter failed: Location "Vancouver, British Columbia, Canada" does not match any of: ontario, alberta',
keyword: "layoff",
text: "Sad to announce that our Vancouver tech startup is going through a difficult restructuring. We've had to make the tough decision to lay off 30% of our engineering team. These are incredibly talented people and I'm happy to provide recommendations.",
profileLink: "https://www.linkedin.com/in/demo-vancouver-user",
location: "Vancouver, British Columbia, Canada",
locationReasoning:
'Location "Vancouver, British Columbia, Canada" does not match any of: ontario, alberta',
timestamp: new Date().toISOString(),
});
} else {
rejectedPosts.push({
rejected: true,
reason: "No profile link",
keyword: "downsizing",
text: "The days of entering retirement mortgage-free are fading fast — even for older Canadians. A recent Royal LePage survey reveals nearly 1 in 3 Canadians retiring in the next 2 years will still carry a mortgage. Contact us and let's talk about planning smarter — whether you're 25 or 65.",
profileLink: "",
timestamp: new Date().toISOString(),
});
}
}
log.success(
`${data.accepted} posts accepted, ❌ ${data.rejected} posts rejected`
);
}
log.success(`Found ${fakePosts.length} demo posts total`);
// Simulate location validation if enabled
if (enable_location === "true" && location_filter) {
await new Promise((r) => setTimeout(r, 500));
log.step("Validating locations against filter...");
}
// Simulate saving results
const timestamp =
new Date().toISOString().split("T")[0] +
"-" +
new Date().toISOString().split("T")[1].split(".")[0].replace(/:/g, "-");
// Save main results file
let resultsFile = output
? `results/${output}.json`
: `results/demo-results-${timestamp}.json`;
fs.mkdirSync(path.dirname(resultsFile), { recursive: true });
fs.writeFileSync(resultsFile, JSON.stringify(fakePosts, null, 2));
log.file(`Saved demo results to ${resultsFile}`);
// Save rejected posts file
let rejectedFile = output
? `results/${output}-rejected.json`
: `results/demo-results-${timestamp}-rejected.json`;
fs.writeFileSync(rejectedFile, JSON.stringify(rejectedPosts, null, 2));
log.file(`Saved demo rejected posts to ${rejectedFile}`);
const newFiles = [resultsFile, rejectedFile];
// Simulate AI analysis if enabled and set to run after
let aiFile;
if (enable_ai === "true" && run_ai_after === "true") {
await new Promise((r) => setTimeout(r, 500));
log.ai(`Running local AI analysis with model ${ollama_model}...`);
log.info(
`Context: "${ai_context}", Confidence: ${ai_confidence}, Batch size: ${ai_batch_size}`
);
await new Promise((r) => setTimeout(r, 800));
// Fake AI processing with realistic examples
const aiResults = fakePosts.map((post, index) => {
let isRelevant, confidence, reasoning;
if (post.keyword === "layoff") {
if (index === 0) {
// First layoff post - highly relevant
isRelevant = true;
confidence = 0.94;
reasoning =
"The post clearly states that a company has terminated 192 of its 203 employees as part of restructuring due to insolvency, which is directly related to job layoffs and workforce reduction.";
} else {
// Second layoff post - highly relevant
isRelevant = true;
confidence = 0.92;
reasoning =
"Post explicitly discusses game industry layoffs and mass layoffs, which directly relates to job layoffs and workforce reduction.";
}
} else {
// Downsizing post - not relevant to job layoffs
isRelevant = false;
confidence = 0.25;
reasoning =
"The post discusses downsizing a home and financial considerations for retirement, which are not directly related to job layoffs or workforce reduction.";
}
return {
...post,
aiProcessed: true,
aiRelevant: isRelevant,
aiConfidence: Math.round(confidence * 100) / 100, // Round to 2 decimal places
aiReasoning: reasoning,
aiModel: ollama_model,
aiAnalyzedAt:
new Date().toISOString().split("T")[0] +
", " +
new Date().toLocaleTimeString("en-CA", { hour12: false }),
aiType: "local-ollama",
...(confidence < parseFloat(ai_confidence)
? { lowConfidence: true }
: {}),
};
});
aiFile = output
? `results/${output}-ai.json`
: `results/demo-ai-${timestamp}.json`;
fs.writeFileSync(aiFile, JSON.stringify(aiResults, null, 2));
log.file(`Saved demo AI results to ${aiFile}`);
newFiles.push(aiFile);
}
// List new files
log.title("\n=== Demo Complete ===");
log.highlight("New JSON files created:");
newFiles.forEach((file) => log.file(file));
log.info(
"\nYou can right-click the file paths in your terminal or copy them to open in your IDE."
);
// Show examples of what each file contains
log.title("\n=== File Contents Examples ===");
log.highlight("\n📄 Main Results File (accepted posts):");
log.info("Contains posts that passed all filters:");
console.log(
`${colors.dim}${JSON.stringify(fakePosts.slice(0, 1), null, 2)}${
colors.reset
}`
);
log.highlight("\n🚫 Rejected Posts File:");
log.info("Contains posts that were filtered out:");
console.log(
`${colors.dim}${JSON.stringify(rejectedPosts.slice(0, 1), null, 2)}${
colors.reset
}`
);
if (enable_ai === "true" && run_ai_after === "true") {
log.highlight("\n🧠 AI Analysis File:");
log.info("Contains posts with AI relevance analysis:");
const aiResults = JSON.parse(fs.readFileSync(aiFile, "utf-8"));
console.log(
`${colors.dim}${JSON.stringify(aiResults.slice(0, 1), null, 2)}${
colors.reset
}`
);
log.highlight("\nKey AI Features Demonstrated:");
log.success("✅ aiRelevant: true/false based on context analysis");
log.success("✅ aiConfidence: rounded to 2 decimal places (0.00-1.00)");
log.success("✅ aiReasoning: detailed explanation of relevance decision");
log.success(
"✅ Location filtering: shows why posts were accepted/rejected"
);
}
rl.close();
}
main();

497
job-search-parser/README.md Normal file
@@ -0,0 +1,497 @@
# Job Search Parser - Job Market Intelligence
A specialized parser for job market intelligence: it tracks job postings, market trends, and competitive signals, with a focus on tech roles and industry insights.
## 🎯 Purpose
The Job Search Parser is designed to:
- **Track Job Market Trends**: Monitor demand for specific roles and skills
- **Competitive Intelligence**: Analyze salary ranges and requirements
- **Industry Insights**: Track hiring patterns across different sectors
- **Skill Gap Analysis**: Identify in-demand technologies and frameworks
- **Market Demand Forecasting**: Predict job market trends
## 🚀 Features
### Core Functionality
- **Multi-Source Aggregation**: Collect job data from multiple platforms
- **Role-Specific Tracking**: Focus on tech roles and emerging positions
- **Skill Analysis**: Extract and categorize required skills
- **Salary Intelligence**: Track compensation ranges and trends
- **Company Intelligence**: Monitor hiring companies and patterns
### Advanced Features
- **Market Trend Analysis**: Identify growing and declining job categories
- **Geographic Distribution**: Track job distribution by location
- **Experience Level Analysis**: Entry, mid, senior level tracking
- **Remote Work Trends**: Monitor remote/hybrid work patterns
- **Technology Stack Tracking**: Framework and tool popularity
## 🌐 Supported Job Sites
### ✅ Implemented Parsers
#### SkipTheDrive Parser
Remote job board specializing in work-from-home positions.
**Features:**
- Keyword-based job search with relevance sorting
- Job type filtering (full-time, part-time, contract)
- Multi-page result parsing with pagination
- Featured/sponsored job identification
- AI-powered job relevance analysis
- Automatic duplicate detection
**Usage:**
```bash
# Parse SkipTheDrive for QA automation jobs
node index.js --sites=skipthedrive --keywords="automation qa,qa engineer"
# Filter by job type
JOB_TYPES="full time,contract" node index.js --sites=skipthedrive
# Run demo with limited results
node index.js --sites=skipthedrive --demo
```
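The multi-page parsing and duplicate-detection steps above can be sketched as a simple keyed filter. This is a minimal sketch, assuming each parsed job carries a `job_url` field as in the JSON output format later in this document:

```javascript
// Minimal duplicate-detection sketch: keep only the first job seen per URL.
// Assumes each job object has a `job_url` field (see the Output Format section).
function dedupeJobs(jobs) {
  const seen = new Set();
  return jobs.filter((job) => {
    if (!job.job_url || seen.has(job.job_url)) return false;
    seen.add(job.job_url);
    return true;
  });
}
```

Running the filter once per page keeps memory bounded to one `Set` of URLs while pagination proceeds.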
### 🚧 Planned Parsers
- **Indeed**: Comprehensive job aggregator
- **Glassdoor**: Jobs with company reviews and salary data
- **Monster**: Traditional job board
- **SimplyHired**: Job aggregator with salary estimates
- **LinkedIn Jobs**: Professional network job postings
- **AngelList**: Startup and tech jobs
- **Remote.co**: Dedicated remote work jobs
- **FlexJobs**: Flexible and remote positions
## 📦 Installation
```bash
# Install dependencies
npm install
# Run tests
npm test
# Run demo
node demo.js
```
## 🔧 Configuration
### Environment Variables
Create a `.env` file in the parser directory:
```env
# Job Search Configuration
SEARCH_SOURCES=linkedin,indeed,glassdoor
TARGET_ROLES=software engineer,data scientist,product manager
LOCATION_FILTER=Toronto,Vancouver,Calgary
EXPERIENCE_LEVELS=entry,mid,senior
REMOTE_PREFERENCE=remote,hybrid,onsite
# Analysis Configuration
ENABLE_SALARY_ANALYSIS=true
ENABLE_SKILL_ANALYSIS=true
ENABLE_TREND_ANALYSIS=true
MIN_SALARY=50000
MAX_SALARY=200000
# Output Configuration
OUTPUT_FORMAT=json,csv
SAVE_RAW_DATA=true
ANALYSIS_INTERVAL=daily
```
### Command Line Options
```bash
# Basic usage
node index.js
# Specific roles
node index.js --roles="frontend developer,backend developer"
# Geographic focus
node index.js --locations="Toronto,Vancouver"
# Experience level
node index.js --experience="senior"
# Output format
node index.js --output=results/job-market-analysis.json
```
**Available Options:**
- `--roles="role1,role2"`: Target job roles
- `--locations="city1,city2"`: Geographic focus
- `--experience="entry|mid|senior"`: Experience level
- `--remote="remote|hybrid|onsite"`: Remote work preference
- `--salary-min=NUMBER`: Minimum salary filter
- `--salary-max=NUMBER`: Maximum salary filter
- `--output=FILE`: Output filename
- `--format=json|csv`: Output format
- `--trends`: Enable trend analysis
- `--skills`: Enable skill analysis
## 📊 Keywords
### Role-Specific Keywords
Place keyword CSV files in the `keywords/` directory:
```
job-search-parser/
├── keywords/
│ ├── job-search-keywords.csv # General job search terms
│ ├── tech-roles.csv # Technology roles
│ ├── data-roles.csv # Data science roles
│ ├── management-roles.csv # Management positions
│ └── emerging-roles.csv # Emerging job categories
└── index.js
```
### Tech Roles Keywords
```csv
keyword
software engineer
frontend developer
backend developer
full stack developer
data scientist
machine learning engineer
devops engineer
site reliability engineer
cloud architect
security engineer
mobile developer
iOS developer
Android developer
react developer
vue developer
angular developer
node.js developer
python developer
java developer
golang developer
rust developer
data engineer
analytics engineer
```
### Data Science Keywords
```csv
keyword
data scientist
machine learning engineer
data analyst
business analyst
data engineer
analytics engineer
ML engineer
AI engineer
statistician
quantitative analyst
research scientist
data architect
BI developer
ETL developer
```
## 📈 Usage Examples
### Basic Job Search
```bash
# Standard job market analysis
node index.js
# Specific tech roles
node index.js --roles="software engineer,data scientist"
# Geographic focus
node index.js --locations="Toronto,Vancouver,Calgary"
```
### Advanced Analysis
```bash
# Senior level positions
node index.js --experience="senior" --salary-min=100000
# Remote work opportunities
node index.js --remote="remote" --roles="frontend developer"
# Trend analysis
node index.js --trends --skills --output=results/trends.json
```
### Market Intelligence
```bash
# Salary analysis
node index.js --salary-min=80000 --salary-max=150000
# Skill gap analysis
node index.js --skills --roles="machine learning engineer"
# Competitive intelligence
node index.js --companies="Google,Microsoft,Amazon"
```
## 📊 Output Format
### JSON Structure
```json
{
"metadata": {
"timestamp": "2024-01-15T10:30:00Z",
"search_parameters": {
"roles": ["software engineer", "data scientist"],
"locations": ["Toronto", "Vancouver"],
"experience_levels": ["mid", "senior"],
"remote_preference": ["remote", "hybrid"]
},
"total_jobs_found": 1250,
"analysis_duration_seconds": 45
},
"market_overview": {
"total_jobs": 1250,
"average_salary": 95000,
"salary_range": {
"min": 65000,
"max": 180000,
"median": 92000
},
"remote_distribution": {
"remote": 45,
"hybrid": 35,
"onsite": 20
},
"experience_distribution": {
"entry": 15,
"mid": 45,
"senior": 40
}
},
"trends": {
"growing_skills": [
{ "skill": "React", "growth_rate": 25 },
{ "skill": "Python", "growth_rate": 18 },
{ "skill": "AWS", "growth_rate": 22 }
],
"declining_skills": [
{ "skill": "jQuery", "growth_rate": -12 },
{ "skill": "PHP", "growth_rate": -8 }
],
"emerging_roles": ["AI Engineer", "DevSecOps Engineer", "Data Engineer"]
},
"jobs": [
{
"id": "job_1",
"title": "Senior Software Engineer",
"company": "TechCorp",
"location": "Toronto, Ontario",
"remote_type": "hybrid",
"salary": {
"min": 100000,
"max": 140000,
"currency": "CAD"
},
"required_skills": ["React", "Node.js", "TypeScript", "AWS"],
"preferred_skills": ["GraphQL", "Docker", "Kubernetes"],
"experience_level": "senior",
"job_url": "https://example.com/job/1",
"posted_date": "2024-01-10T09:00:00Z",
"scraped_at": "2024-01-15T10:30:00Z"
}
],
"analysis": {
"skill_demand": {
"React": { "count": 45, "avg_salary": 98000 },
"Python": { "count": 38, "avg_salary": 102000 },
"AWS": { "count": 32, "avg_salary": 105000 }
},
"company_insights": {
"top_hirers": [
{ "company": "TechCorp", "jobs": 25 },
{ "company": "StartupXYZ", "jobs": 18 }
],
"salary_leaders": [
{ "company": "BigTech", "avg_salary": 120000 },
{ "company": "FinTech", "avg_salary": 115000 }
]
}
}
}
```
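The `analysis.skill_demand` section can be derived from the `jobs` array. A hedged sketch (field names follow the JSON structure above; using the midpoint of each salary range as that job's salary estimate is an assumption for illustration):

```javascript
// Sketch: aggregate per-skill job counts and average salaries from parsed jobs.
// Each job's salary estimate is the midpoint of its posted range (an assumption).
function skillDemand(jobs) {
  const totals = {};
  for (const job of jobs) {
    const midpoint = (job.salary.min + job.salary.max) / 2;
    for (const skill of job.required_skills) {
      const entry = (totals[skill] ??= { count: 0, salarySum: 0 });
      entry.count += 1;
      entry.salarySum += midpoint;
    }
  }
  const result = {};
  for (const [skill, { count, salarySum }] of Object.entries(totals)) {
    result[skill] = { count, avg_salary: Math.round(salarySum / count) };
  }
  return result;
}
```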
### CSV Output
The parser can also generate CSV files for easy analysis:
```csv
job_id,title,company,location,remote_type,salary_min,salary_max,required_skills,experience_level,posted_date
job_1,Senior Software Engineer,TechCorp,Toronto,hybrid,100000,140000,"React,Node.js,TypeScript",senior,2024-01-10
job_2,Data Scientist,DataCorp,Vancouver,remote,90000,130000,"Python,SQL,ML",mid,2024-01-09
```
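Flattening job objects into the CSV layout above can be sketched like this; quoting any field that contains a comma (such as the skills list) is the key detail:

```javascript
// Sketch: serialize job objects into the CSV layout shown above.
// Any field containing a comma (e.g. the joined skills list) is quoted.
function csvField(value) {
  const s = String(value);
  return s.includes(",") ? `"${s}"` : s;
}

function jobsToCsv(jobs) {
  const header =
    "job_id,title,company,location,remote_type,salary_min,salary_max,required_skills,experience_level,posted_date";
  const rows = jobs.map((job) =>
    [
      job.id,
      job.title,
      job.company,
      job.location,
      job.remote_type,
      job.salary.min,
      job.salary.max,
      job.required_skills.join(","),
      job.experience_level,
      job.posted_date.split("T")[0],
    ]
      .map(csvField)
      .join(",")
  );
  return [header, ...rows].join("\n");
}
```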
## 🔒 Security & Best Practices
### Data Privacy
- Respect job site terms of service
- Implement appropriate rate limiting
- Store data securely and responsibly
- Anonymize sensitive information
### Rate Limiting
- Implement delays between requests
- Respect API rate limits
- Use multiple data sources
- Monitor for blocking/detection
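The delay-between-requests guidance can be sketched as a small wrapper; `REQUEST_DELAY` mirrors the environment variable used in the Troubleshooting section, and the wrapper itself is an illustrative sketch, not the parser's actual throttling code:

```javascript
// Sketch: space out successive calls to a request function by REQUEST_DELAY ms.
const REQUEST_DELAY = parseInt(process.env.REQUEST_DELAY, 10) || 2000;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function withRateLimit(fn, delayMs = REQUEST_DELAY) {
  let last = 0;
  return async (...args) => {
    const wait = last + delayMs - Date.now();
    if (wait > 0) await sleep(wait); // enforce the minimum gap between calls
    last = Date.now();
    return fn(...args);
  };
}
```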
### Legal Compliance
- Educational and research purposes only
- Respect website terms of service
- Implement data retention policies
- Monitor for legal changes
## 🧪 Testing
### Run Tests
```bash
# All tests
npm test
# Specific test suites
npm test -- --testNamePattern="JobSearch"
npm test -- --testNamePattern="Analysis"
npm test -- --testNamePattern="Trends"
```
### Test Coverage
```bash
npm run test:coverage
```
## 🚀 Performance Optimization
### Recommended Settings
#### Fast Analysis
```bash
node index.js --roles="software engineer" --locations="Toronto"
```
#### Comprehensive Analysis
```bash
node index.js --trends --skills --experience="all"
```
#### Focused Intelligence
```bash
node index.js --salary-min=80000 --remote="remote" --trends
```
### Performance Tips
- Use specific role filters to reduce data volume
- Implement caching for repeated searches
- Use parallel processing for multiple sources
- Optimize data storage and retrieval
## 🔧 Troubleshooting
### Common Issues
#### Rate Limiting
```bash
# Reduce request frequency
export REQUEST_DELAY=2000
node index.js
```
#### Data Source Issues
```bash
# Use specific sources
node index.js --sources="linkedin,indeed"
# Check source availability
node index.js --test-sources
```
#### Output Issues
```bash
# Check output directory
mkdir -p results
node index.js --output=results/analysis.json
# Verify file permissions
chmod 755 results/
```
## 📈 Monitoring & Analytics
### Key Metrics
- **Job Volume**: Total jobs found per search
- **Salary Trends**: Average and median salary changes
- **Skill Demand**: Most requested skills
- **Remote Adoption**: Remote work trend analysis
- **Market Velocity**: Job posting frequency
### Dashboard Integration
- Real-time market monitoring
- Trend visualization
- Salary benchmarking
- Skill gap analysis
- Competitive intelligence
## 🤝 Contributing
### Development Setup
1. Fork the repository
2. Create feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit pull request
### Code Standards
- Follow existing code style
- Add JSDoc comments
- Maintain test coverage
- Update documentation
## 📄 License
This parser is part of the LinkedOut platform and follows the same licensing terms.
---
**Note**: This tool is designed for educational and research purposes. Always respect website terms of service and implement appropriate rate limiting and ethical usage practices.

543
job-search-parser/demo.js Normal file
@@ -0,0 +1,543 @@
/**
* Job Search Parser Demo
*
* Demonstrates the Job Search Parser's capabilities for job market intelligence,
* trend analysis, and competitive insights.
*
* This demo uses simulated data for demonstration purposes.
*/
const { logger } = require("../ai-analyzer");
const fs = require("fs");
const path = require("path");
// Terminal colors for demo output
const colors = {
reset: "\x1b[0m",
bright: "\x1b[1m",
cyan: "\x1b[36m",
green: "\x1b[32m",
yellow: "\x1b[33m",
blue: "\x1b[34m",
magenta: "\x1b[35m",
red: "\x1b[31m",
};
const demo = {
title: (text) =>
console.log(`\n${colors.bright}${colors.cyan}${text}${colors.reset}`),
section: (text) =>
console.log(`\n${colors.bright}${colors.magenta}${text}${colors.reset}`),
  success: (text) => console.log(`${colors.green}✅ ${text}${colors.reset}`),
  info: (text) => console.log(`${colors.blue}ℹ️ ${text}${colors.reset}`),
  warning: (text) => console.log(`${colors.yellow}⚠️ ${text}${colors.reset}`),
  error: (text) => console.log(`${colors.red}❌ ${text}${colors.reset}`),
  code: (text) => console.log(`${colors.cyan}${text}${colors.reset}`),
};
// Mock job data for demonstration
const mockJobs = [
{
id: "job_1",
title: "Senior Software Engineer",
company: "TechCorp",
location: "Toronto, Ontario",
remote_type: "hybrid",
salary: { min: 100000, max: 140000, currency: "CAD" },
required_skills: ["React", "Node.js", "TypeScript", "AWS"],
preferred_skills: ["GraphQL", "Docker", "Kubernetes"],
experience_level: "senior",
job_url: "https://example.com/job/1",
posted_date: "2024-01-10T09:00:00Z",
scraped_at: "2024-01-15T10:30:00Z",
},
{
id: "job_2",
title: "Data Scientist",
company: "DataCorp",
location: "Vancouver, British Columbia",
remote_type: "remote",
salary: { min: 90000, max: 130000, currency: "CAD" },
required_skills: ["Python", "SQL", "Machine Learning", "Statistics"],
preferred_skills: ["TensorFlow", "PyTorch", "AWS"],
experience_level: "mid",
job_url: "https://example.com/job/2",
posted_date: "2024-01-09T14:30:00Z",
scraped_at: "2024-01-15T10:30:00Z",
},
{
id: "job_3",
title: "Frontend Developer",
company: "StartupXYZ",
location: "Calgary, Alberta",
remote_type: "onsite",
salary: { min: 70000, max: 95000, currency: "CAD" },
required_skills: ["React", "JavaScript", "CSS", "HTML"],
preferred_skills: ["Vue.js", "TypeScript", "Webpack"],
experience_level: "entry",
job_url: "https://example.com/job/3",
posted_date: "2024-01-08T11:15:00Z",
scraped_at: "2024-01-15T10:30:00Z",
},
{
id: "job_4",
title: "DevOps Engineer",
company: "CloudTech",
location: "Toronto, Ontario",
remote_type: "hybrid",
salary: { min: 95000, max: 125000, currency: "CAD" },
required_skills: ["Docker", "Kubernetes", "AWS", "Linux"],
preferred_skills: ["Terraform", "Jenkins", "Prometheus"],
experience_level: "senior",
job_url: "https://example.com/job/4",
posted_date: "2024-01-07T16:45:00Z",
scraped_at: "2024-01-15T10:30:00Z",
},
{
id: "job_5",
title: "Machine Learning Engineer",
company: "AI Solutions",
location: "Vancouver, British Columbia",
remote_type: "remote",
salary: { min: 110000, max: 150000, currency: "CAD" },
required_skills: ["Python", "TensorFlow", "PyTorch", "ML"],
preferred_skills: ["AWS", "Docker", "Kubernetes", "Spark"],
experience_level: "senior",
job_url: "https://example.com/job/5",
posted_date: "2024-01-06T10:20:00Z",
scraped_at: "2024-01-15T10:30:00Z",
},
];
async function runDemo() {
demo.title("=== Job Search Parser Demo ===");
demo.info(
"This demo showcases the Job Search Parser's capabilities for job market intelligence."
);
demo.info("All data shown is simulated for demonstration purposes.");
demo.info("Press Enter to continue through each section...\n");
await waitForEnter();
// 1. Configuration Demo
await demonstrateConfiguration();
// 2. Job Search Process Demo
await demonstrateJobSearch();
// 3. Market Analysis Demo
await demonstrateMarketAnalysis();
// 4. Trend Analysis Demo
await demonstrateTrendAnalysis();
// 5. Skill Analysis Demo
await demonstrateSkillAnalysis();
// 6. Competitive Intelligence Demo
await demonstrateCompetitiveIntelligence();
// 7. Output Generation Demo
await demonstrateOutputGeneration();
demo.title("=== Demo Complete ===");
demo.success("Job Search Parser demo completed successfully!");
demo.info("Check the README.md for detailed usage instructions.");
}
async function demonstrateConfiguration() {
demo.section("1. Configuration Setup");
demo.info(
"The Job Search Parser uses environment variables and command-line options for configuration."
);
demo.code("// Environment Variables (.env file)");
demo.info("SEARCH_SOURCES=linkedin,indeed,glassdoor");
demo.info("TARGET_ROLES=software engineer,data scientist,product manager");
demo.info("LOCATION_FILTER=Toronto,Vancouver,Calgary");
demo.info("EXPERIENCE_LEVELS=entry,mid,senior");
demo.info("REMOTE_PREFERENCE=remote,hybrid,onsite");
demo.info("ENABLE_SALARY_ANALYSIS=true");
demo.info("ENABLE_SKILL_ANALYSIS=true");
demo.info("ENABLE_TREND_ANALYSIS=true");
demo.code("// Command Line Options");
demo.info('node index.js --roles="frontend developer,backend developer"');
demo.info('node index.js --locations="Toronto,Vancouver"');
demo.info('node index.js --experience="senior" --salary-min=100000');
demo.info('node index.js --remote="remote" --trends --skills');
await waitForEnter();
}
async function demonstrateJobSearch() {
demo.section("2. Job Search Process");
demo.info(
"The parser searches multiple job platforms for relevant positions."
);
const searchSources = ["LinkedIn", "Indeed", "Glassdoor"];
const targetRoles = [
"Software Engineer",
"Data Scientist",
"Frontend Developer",
];
demo.code("// Multi-source job search");
for (const source of searchSources) {
logger.search(`Searching ${source} for job postings...`);
await simulateSearch();
const jobsFound = Math.floor(Math.random() * 200) + 50;
logger.success(`Found ${jobsFound} jobs on ${source}`);
}
demo.code("// Role-specific filtering");
for (const role of targetRoles) {
logger.info(`Filtering for ${role} positions...`);
await simulateProcessing();
const roleJobs = Math.floor(Math.random() * 30) + 10;
logger.success(`Found ${roleJobs} ${role} positions`);
}
await waitForEnter();
}
async function demonstrateMarketAnalysis() {
demo.section("3. Market Analysis");
demo.info(
"The parser analyzes market trends, salary ranges, and job distribution."
);
demo.code("// Market overview analysis");
logger.info("Analyzing market overview...");
await simulateProcessing();
const marketOverview = {
total_jobs: 1250,
average_salary: 95000,
salary_range: { min: 65000, max: 180000, median: 92000 },
remote_distribution: { remote: 45, hybrid: 35, onsite: 20 },
experience_distribution: { entry: 15, mid: 45, senior: 40 },
};
demo.success(`Total jobs found: ${marketOverview.total_jobs}`);
demo.info(
`Average salary: $${marketOverview.average_salary.toLocaleString()}`
);
demo.info(
`Salary range: $${marketOverview.salary_range.min.toLocaleString()} - $${marketOverview.salary_range.max.toLocaleString()}`
);
demo.info(
`Remote work: ${marketOverview.remote_distribution.remote}% remote, ${marketOverview.remote_distribution.hybrid}% hybrid`
);
demo.code("// Geographic distribution");
const locations = {
Toronto: 45,
Vancouver: 30,
Calgary: 15,
Other: 10,
};
Object.entries(locations).forEach(([location, percentage]) => {
demo.info(`${location}: ${percentage}% of jobs`);
});
await waitForEnter();
}
async function demonstrateTrendAnalysis() {
demo.section("4. Trend Analysis");
demo.info(
"The parser identifies growing and declining skills and emerging roles."
);
demo.code("// Skill trend analysis");
logger.info("Analyzing skill trends...");
await simulateProcessing();
const growingSkills = [
{ skill: "React", growth_rate: 25 },
{ skill: "Python", growth_rate: 18 },
{ skill: "AWS", growth_rate: 22 },
{ skill: "TypeScript", growth_rate: 30 },
{ skill: "Docker", growth_rate: 15 },
];
const decliningSkills = [
{ skill: "jQuery", growth_rate: -12 },
{ skill: "PHP", growth_rate: -8 },
{ skill: "Angular", growth_rate: -5 },
];
demo.success("Growing skills:");
growingSkills.forEach((skill) => {
demo.info(` ${skill.skill}: +${skill.growth_rate}% growth`);
});
demo.warning("Declining skills:");
decliningSkills.forEach((skill) => {
demo.info(` ${skill.skill}: ${skill.growth_rate}% decline`);
});
demo.code("// Emerging roles");
const emergingRoles = [
"AI Engineer",
"DevSecOps Engineer",
"Data Engineer",
"Cloud Architect",
"Site Reliability Engineer",
];
demo.success("Emerging roles:");
emergingRoles.forEach((role) => {
demo.info(` ${role}`);
});
await waitForEnter();
}
async function demonstrateSkillAnalysis() {
demo.section("5. Skill Analysis");
demo.info("The parser analyzes skill demand and salary correlation.");
demo.code("// Skill demand analysis");
logger.info("Analyzing skill demand...");
await simulateProcessing();
const skillDemand = {
React: { count: 45, avg_salary: 98000 },
Python: { count: 38, avg_salary: 102000 },
AWS: { count: 32, avg_salary: 105000 },
TypeScript: { count: 28, avg_salary: 95000 },
Docker: { count: 25, avg_salary: 103000 },
"Machine Learning": { count: 22, avg_salary: 115000 },
};
demo.success("Top in-demand skills:");
Object.entries(skillDemand)
.sort((a, b) => b[1].count - a[1].count)
.forEach(([skill, data]) => {
demo.info(
` ${skill}: ${
data.count
} jobs, avg salary $${data.avg_salary.toLocaleString()}`
);
});
demo.code("// Salary correlation analysis");
const salaryCorrelation = [
{ skill: "Machine Learning", correlation: 0.85 },
{ skill: "AWS", correlation: 0.78 },
{ skill: "Docker", correlation: 0.72 },
{ skill: "Python", correlation: 0.68 },
{ skill: "React", correlation: 0.65 },
];
demo.success("Skills with highest salary correlation:");
salaryCorrelation.forEach((item) => {
demo.info(
` ${item.skill}: ${(item.correlation * 100).toFixed(0)}% correlation`
);
});
await waitForEnter();
}
async function demonstrateCompetitiveIntelligence() {
demo.section("6. Competitive Intelligence");
demo.info(
"The parser provides insights into company hiring patterns and salary competitiveness."
);
demo.code("// Company hiring analysis");
logger.info("Analyzing company hiring patterns...");
await simulateProcessing();
const topHirers = [
{ company: "TechCorp", jobs: 25, avg_salary: 105000 },
{ company: "StartupXYZ", jobs: 18, avg_salary: 95000 },
{ company: "DataCorp", jobs: 15, avg_salary: 110000 },
{ company: "CloudTech", jobs: 12, avg_salary: 115000 },
{ company: "AI Solutions", jobs: 10, avg_salary: 120000 },
];
demo.success("Top hiring companies:");
topHirers.forEach((company) => {
demo.info(
` ${company.company}: ${
company.jobs
} jobs, avg salary $${company.avg_salary.toLocaleString()}`
);
});
demo.code("// Salary competitiveness");
const salaryLeaders = [
{ company: "BigTech", avg_salary: 120000, market_position: "leader" },
{ company: "FinTech", avg_salary: 115000, market_position: "leader" },
{ company: "AI Solutions", avg_salary: 120000, market_position: "leader" },
{
company: "StartupXYZ",
avg_salary: 95000,
market_position: "competitive",
},
{ company: "TechCorp", avg_salary: 105000, market_position: "competitive" },
];
demo.success("Salary leaders:");
salaryLeaders.forEach((company) => {
const position = company.market_position === "leader" ? "🏆" : "📊";
demo.info(
` ${position} ${
company.company
}: $${company.avg_salary.toLocaleString()}`
);
});
await waitForEnter();
}
async function demonstrateOutputGeneration() {
demo.section("7. Output Generation");
demo.info(
"Results are saved in multiple formats with comprehensive analysis."
);
demo.code("// Generating comprehensive report");
logger.file("Generating job market analysis report...");
const outputData = {
metadata: {
timestamp: new Date().toISOString(),
search_parameters: {
roles: ["software engineer", "data scientist", "frontend developer"],
locations: ["Toronto", "Vancouver", "Calgary"],
experience_levels: ["entry", "mid", "senior"],
remote_preference: ["remote", "hybrid", "onsite"],
},
total_jobs_found: 1250,
analysis_duration_seconds: 45,
},
market_overview: {
total_jobs: 1250,
average_salary: 95000,
salary_range: { min: 65000, max: 180000, median: 92000 },
remote_distribution: { remote: 45, hybrid: 35, onsite: 20 },
experience_distribution: { entry: 15, mid: 45, senior: 40 },
},
trends: {
growing_skills: [
{ skill: "React", growth_rate: 25 },
{ skill: "Python", growth_rate: 18 },
{ skill: "AWS", growth_rate: 22 },
],
declining_skills: [
{ skill: "jQuery", growth_rate: -12 },
{ skill: "PHP", growth_rate: -8 },
],
emerging_roles: ["AI Engineer", "DevSecOps Engineer", "Data Engineer"],
},
jobs: mockJobs,
analysis: {
skill_demand: {
React: { count: 45, avg_salary: 98000 },
Python: { count: 38, avg_salary: 102000 },
AWS: { count: 32, avg_salary: 105000 },
},
company_insights: {
top_hirers: [
{ company: "TechCorp", jobs: 25 },
{ company: "StartupXYZ", jobs: 18 },
],
salary_leaders: [
{ company: "BigTech", avg_salary: 120000 },
{ company: "FinTech", avg_salary: 115000 },
],
},
},
};
// Save to demo file
const outputPath = path.join(__dirname, "demo-job-analysis.json");
fs.writeFileSync(outputPath, JSON.stringify(outputData, null, 2));
demo.success(`Analysis report saved to: ${outputPath}`);
demo.info(`Total jobs analyzed: ${outputData.metadata.total_jobs_found}`);
demo.info(
`Analysis duration: ${outputData.metadata.analysis_duration_seconds} seconds`
);
demo.code("// Output formats available");
demo.info("📁 JSON: Comprehensive analysis with metadata");
demo.info("📊 CSV: Tabular data for spreadsheet analysis");
demo.info("📈 Charts: Visual trend analysis");
demo.info("📋 Summary: Executive summary report");
await waitForEnter();
}
// Helper functions
function waitForEnter() {
return new Promise((resolve) => {
const readline = require("readline");
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
rl.question("\nPress Enter to continue...", () => {
rl.close();
resolve();
});
});
}
async function simulateSearch() {
return new Promise((resolve) => {
const steps = [
"Connecting to source",
"Searching jobs",
"Filtering results",
"Extracting data",
];
let i = 0;
const interval = setInterval(() => {
if (i < steps.length) {
logger.info(steps[i]);
i++;
} else {
clearInterval(interval);
resolve();
}
}, 600);
});
}
async function simulateProcessing() {
return new Promise((resolve) => {
const dots = [".", "..", "..."];
let i = 0;
const interval = setInterval(() => {
process.stdout.write(`\rProcessing${dots[i]}`);
i = (i + 1) % dots.length;
}, 500);
setTimeout(() => {
clearInterval(interval);
process.stdout.write("\r");
resolve();
}, 2000);
});
}
// Run the demo if this file is executed directly
if (require.main === module) {
runDemo().catch((error) => {
demo.error(`Demo failed: ${error.message}`);
process.exit(1);
});
}
module.exports = { runDemo };

229
job-search-parser/index.js Normal file
@@ -0,0 +1,229 @@
#!/usr/bin/env node
/**
* Job Search Parser - Refactored
*
* Uses core-parser for browser management and site-specific strategies for parsing logic
*/
const path = require("path");
const fs = require("fs");
const CoreParser = require("../core-parser");
const { skipthedriveStrategy } = require("./strategies/skipthedrive-strategy");
const { logger, analyzeBatch, checkOllamaStatus } = require("../ai-analyzer");
// Load environment variables
require("dotenv").config({ path: path.join(__dirname, ".env") });
// Configuration from environment
const HEADLESS = process.env.HEADLESS !== "false";
const SEARCH_KEYWORDS =
process.env.SEARCH_KEYWORDS || "software engineer,developer,programmer";
const LOCATION_FILTER = process.env.LOCATION_FILTER;
const ENABLE_AI_ANALYSIS = process.env.ENABLE_AI_ANALYSIS === "true";
const MAX_PAGES = parseInt(process.env.MAX_PAGES, 10) || 5;
// Available site strategies
const SITE_STRATEGIES = {
skipthedrive: skipthedriveStrategy,
// Add more site strategies here
// indeed: indeedStrategy,
// glassdoor: glassdoorStrategy,
};
/**
* Parse command line arguments
*/
function parseArguments() {
const args = process.argv.slice(2);
const options = {
sites: ["skipthedrive"], // default
keywords: null,
locationFilter: null,
maxPages: MAX_PAGES,
};
args.forEach((arg) => {
if (arg.startsWith("--sites=")) {
options.sites = arg
.split("=")[1]
.split(",")
.map((s) => s.trim());
} else if (arg.startsWith("--keywords=")) {
options.keywords = arg
.split("=")[1]
.split(",")
.map((k) => k.trim());
} else if (arg.startsWith("--location=")) {
options.locationFilter = arg.split("=")[1];
} else if (arg.startsWith("--max-pages=")) {
options.maxPages = parseInt(arg.split("=")[1]) || MAX_PAGES;
}
});
return options;
}
/**
* Main job search parser function
*/
async function startJobSearchParser(options = {}) {
const cliOptions = parseArguments();
const finalOptions = { ...cliOptions, ...options };
const coreParser = new CoreParser({
headless: HEADLESS,
timeout: 30000,
});
try {
logger.step("🚀 Job Search Parser Starting...");
// Parse keywords
const keywords =
finalOptions.keywords || SEARCH_KEYWORDS.split(",").map((k) => k.trim());
const locationFilter = finalOptions.locationFilter || LOCATION_FILTER;
const sites = finalOptions.sites;
logger.info(`📦 Selected job sites: ${sites.join(", ")}`);
logger.info(`🔍 Search Keywords: ${keywords.join(", ")}`);
logger.info(`📍 Location Filter: ${locationFilter || "None"}`);
logger.info(
`🧠 AI Analysis: ${ENABLE_AI_ANALYSIS ? "Enabled" : "Disabled"}`
);
const allResults = [];
const allRejectedResults = [];
const siteResults = {};
// Process each selected site
for (const site of sites) {
const strategy = SITE_STRATEGIES[site];
if (!strategy) {
logger.error(`❌ Unknown site strategy: ${site}`);
continue;
}
try {
logger.step(`\n🌐 Parsing ${site}...`);
const startTime = Date.now();
const parseResult = await strategy(coreParser, {
keywords,
locationFilter,
maxPages: finalOptions.maxPages,
});
const { results, rejectedResults, summary } = parseResult;
const duration = ((Date.now() - startTime) / 1000).toFixed(2);
// Collect results
allResults.push(...results);
allRejectedResults.push(...rejectedResults);
siteResults[site] = {
count: results.length,
rejected: rejectedResults.length,
duration: `${duration}s`,
summary,
};
logger.success(
`${site} completed in ${duration}s - Found ${results.length} jobs`
);
} catch (error) {
logger.error(`${site} parsing failed: ${error.message}`);
siteResults[site] = {
count: 0,
rejected: 0,
duration: "0s",
error: error.message,
};
}
}
// AI Analysis if enabled
let analysisResults = null;
if (ENABLE_AI_ANALYSIS && allResults.length > 0) {
logger.step("🧠 Running AI Analysis...");
const ollamaStatus = await checkOllamaStatus();
if (ollamaStatus.available) {
analysisResults = await analyzeBatch(allResults, {
context:
"Job market analysis focusing on job postings, skills, and trends",
});
logger.success(
`✅ AI Analysis completed for ${allResults.length} jobs`
);
} else {
logger.warning("⚠️ Ollama not available, skipping AI analysis");
}
}
// Save results
const outputData = {
metadata: {
extractedAt: new Date().toISOString(),
parser: "job-search-parser",
version: "2.0.0",
sites: sites,
keywords: keywords.join(", "),
locationFilter,
analysisResults,
},
results: allResults,
rejectedResults: allRejectedResults,
siteResults,
};
const resultsDir = path.join(__dirname, "results");
if (!fs.existsSync(resultsDir)) {
fs.mkdirSync(resultsDir, { recursive: true });
}
const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
const filename = `job-search-results-${timestamp}.json`;
const filepath = path.join(resultsDir, filename);
fs.writeFileSync(filepath, JSON.stringify(outputData, null, 2));
// Final summary
logger.step("\n📊 Job Search Parser Summary");
logger.success(`✅ Total jobs found: ${allResults.length}`);
logger.info(`❌ Total rejected: ${allRejectedResults.length}`);
logger.info(`📁 Results saved to: ${filepath}`);
logger.info("\n📈 Results by site:");
for (const [site, stats] of Object.entries(siteResults)) {
if (stats.error) {
logger.error(` ${site}: ERROR - ${stats.error}`);
} else {
logger.info(
` ${site}: ${stats.count} jobs found, ${stats.rejected} rejected (${stats.duration})`
);
}
}
logger.success("\n✅ Job Search Parser completed successfully!");
return outputData;
} catch (error) {
logger.error(`❌ Job Search Parser failed: ${error.message}`);
throw error;
} finally {
await coreParser.cleanup();
}
}
// CLI handling
if (require.main === module) {
startJobSearchParser()
.then(() => process.exit(0))
.catch((error) => {
console.error("Fatal error:", error.message);
process.exit(1);
});
}
module.exports = { startJobSearchParser };

View File

@@ -0,0 +1,9 @@
keyword
qa automation
automation test
sdet
qa lead
automation lead
playwright
cypress
quality assurance engineer

View File

@@ -0,0 +1,28 @@
{
"name": "job-search-parser",
"version": "1.0.0",
"description": "Job search parser for multiple job sites using ai-analyzer core",
"main": "index.js",
"scripts": {
"start": "node index.js",
"test": "jest",
"demo": "node demo.js",
"parse:skipthedrive": "node parsers/skipthedrive-demo.js"
},
"keywords": [
"job",
"search",
"parser",
"scraper",
"ai"
],
"author": "",
"license": "ISC",
"type": "commonjs",
"dependencies": {
"ai-analyzer": "file:../ai-analyzer",
"core-parser": "file:../core-parser",
"dotenv": "^17.0.0",
"csv-parser": "^3.0.0"
}
}

View File

@@ -0,0 +1,129 @@
#!/usr/bin/env node
/**
* SkipTheDrive Parser Demo
*
* Demonstrates the SkipTheDrive job parser functionality
*/
const { parseSkipTheDrive } = require("./skipthedrive");
const fs = require("fs");
const path = require("path");
const { logger } = require("../../ai-analyzer");
// Load environment variables
require("dotenv").config({ path: path.join(__dirname, "..", ".env") });
async function runDemo() {
logger.step("🚀 SkipTheDrive Parser Demo");
// Demo configuration
const options = {
// Search for QA automation jobs by default
keywords: process.env.SEARCH_KEYWORDS?.split(",").map((k) => k.trim()) || [
"automation qa",
"qa engineer",
"test automation",
],
// Job type filters - can be: "part time", "full time", "contract"
jobTypes: process.env.JOB_TYPES?.split(",").map((t) => t.trim()) || [],
// Location filter (optional)
locationFilter: process.env.LOCATION_FILTER || "",
// Maximum pages to parse
maxPages: parseInt(process.env.MAX_PAGES) || 3,
// Browser headless mode
headless: process.env.HEADLESS !== "false",
// AI analysis
enableAI: process.env.ENABLE_AI_ANALYSIS !== "false",
aiContext: "remote QA and test automation job opportunities",
};
logger.info("Configuration:");
logger.info(`- Keywords: ${options.keywords.join(", ")}`);
logger.info(
`- Job Types: ${
options.jobTypes.length > 0 ? options.jobTypes.join(", ") : "All types"
}`
);
logger.info(`- Location Filter: ${options.locationFilter || "None"}`);
logger.info(`- Max Pages: ${options.maxPages}`);
logger.info(`- Headless: ${options.headless}`);
logger.info(`- AI Analysis: ${options.enableAI}`);
logger.info("\nStarting parser...");
try {
const startTime = Date.now();
const results = await parseSkipTheDrive(options);
const duration = ((Date.now() - startTime) / 1000).toFixed(2);
// Save results
const timestamp = new Date()
.toISOString()
.replace(/[:.]/g, "-")
.slice(0, -5);
const resultsDir = path.join(__dirname, "..", "results");
if (!fs.existsSync(resultsDir)) {
fs.mkdirSync(resultsDir, { recursive: true });
}
const resultsFile = path.join(
resultsDir,
`skipthedrive-results-${timestamp}.json`
);
fs.writeFileSync(resultsFile, JSON.stringify(results, null, 2));
// Display summary
logger.step("\n📊 Parsing Summary:");
logger.info(`- Duration: ${duration} seconds`);
logger.info(`- Jobs Found: ${results.results.length}`);
logger.info(`- Jobs Rejected: ${results.rejectedResults.length}`);
logger.file(`- Results saved to: ${resultsFile}`);
// Show sample results
if (results.results.length > 0) {
logger.info("\n🔍 Sample Jobs Found:");
results.results.slice(0, 5).forEach((job, index) => {
logger.info(`\n${index + 1}. ${job.title}`);
logger.info(` Company: ${job.company}`);
logger.info(` Posted: ${job.daysAgo}`);
logger.info(` Featured: ${job.isFeatured ? "Yes" : "No"}`);
logger.info(` URL: ${job.jobUrl}`);
if (job.aiAnalysis) {
logger.ai(
` AI Relevant: ${job.aiAnalysis.isRelevant ? "Yes" : "No"} (${(
job.aiAnalysis.confidence * 100
).toFixed(0)}% confidence)`
);
}
});
}
// Show rejection reasons
if (results.rejectedResults.length > 0) {
const rejectionReasons = {};
results.rejectedResults.forEach((job) => {
rejectionReasons[job.reason] = (rejectionReasons[job.reason] || 0) + 1;
});
logger.info("\n❌ Rejection Reasons:");
Object.entries(rejectionReasons).forEach(([reason, count]) => {
logger.info(` ${reason}: ${count}`);
});
}
} catch (error) {
logger.error(`\n❌ Demo failed: ${error.message}`);
process.exit(1);
}
}
// Run the demo
runDemo().catch((err) => {
logger.error(`Fatal error: ${err.message}`);
process.exit(1);
});

View File

@@ -0,0 +1,332 @@
/**
* SkipTheDrive Job Parser
*
* Parses remote job listings from SkipTheDrive.com
* Supports keyword search, job type filters, and pagination
*/
const { chromium } = require("playwright");
const path = require("path");
// Import from ai-analyzer core package
const {
logger,
cleanText,
containsAnyKeyword,
parseLocationFilters,
validateLocationAgainstFilters,
extractLocationFromProfile,
analyzeBatch,
checkOllamaStatus,
} = require("../../ai-analyzer");
/**
* Build search URL for SkipTheDrive
* @param {string} keyword - Search keyword
* @param {string} orderBy - Sort order (date, relevance)
* @param {Array<string>} jobTypes - Job types to filter (part time, full time, contract)
* @returns {string} - Formatted search URL
*/
function buildSearchUrl(keyword, orderBy = "date", jobTypes = []) {
let url = `https://www.skipthedrive.com/?s=${encodeURIComponent(keyword)}`;
if (orderBy) {
url += `&orderby=${orderBy}`;
}
// Add job type filters
jobTypes.forEach((type) => {
url += `&jobtype=${encodeURIComponent(type)}`;
});
return url;
}
/**
* Extract job data from a single job listing element
* @param {Element} article - Job listing DOM element
* @returns {Object} - Extracted job data
*/
async function extractJobData(article) {
try {
// Extract job title and URL
const titleElement = await article.$("h2.post-title a");
const title = titleElement ? await titleElement.textContent() : "";
const jobUrl = titleElement ? await titleElement.getAttribute("href") : "";
// Extract date
const dateElement = await article.$("time.post-date");
const datePosted = dateElement
? await dateElement.getAttribute("datetime")
: "";
const dateText = dateElement ? await dateElement.textContent() : "";
// Extract company name
const companyElement = await article.$(
".custom_fields_company_name_display_search_results"
);
let company = companyElement ? await companyElement.textContent() : "";
company = company.replace(/^\s*[^\s]+\s*/, "").trim(); // Remove icon
// Extract days ago
const daysAgoElement = await article.$(
".custom_fields_job_date_display_search_results"
);
let daysAgo = daysAgoElement ? await daysAgoElement.textContent() : "";
daysAgo = daysAgo.replace(/^\s*[^\s]+\s*/, "").trim(); // Remove icon
// Extract job description excerpt
const excerptElement = await article.$(".excerpt_part");
const description = excerptElement
? await excerptElement.textContent()
: "";
// Check if featured/sponsored
const featuredElement = await article.$(".custom_fields_sponsored_job");
const isFeatured = !!featuredElement;
// Extract job ID from article ID
const articleId = await article.getAttribute("id");
const jobId = articleId ? articleId.replace("post-", "") : "";
return {
jobId,
title: cleanText(title),
company: cleanText(company),
jobUrl,
datePosted,
dateText: cleanText(dateText),
daysAgo: cleanText(daysAgo),
description: cleanText(description),
isFeatured,
source: "skipthedrive",
timestamp: new Date().toISOString(),
};
} catch (error) {
logger.error(`Error extracting job data: ${error.message}`);
return null;
}
}
/**
* Parse SkipTheDrive job listings
* @param {Object} options - Parser options
* @returns {Promise<Array>} - Array of parsed job listings
*/
async function parseSkipTheDrive(options = {}) {
const {
keywords = process.env.SEARCH_KEYWORDS?.split(",").map((k) => k.trim()) || [
"software engineer",
"developer",
],
jobTypes = process.env.JOB_TYPES?.split(",").map((t) => t.trim()) || [],
locationFilter = process.env.LOCATION_FILTER || "",
maxPages = parseInt(process.env.MAX_PAGES) || 5,
headless = process.env.HEADLESS !== "false",
enableAI = process.env.ENABLE_AI_ANALYSIS === "true",
aiContext = process.env.AI_CONTEXT || "remote job opportunities analysis",
} = options;
logger.step("Starting SkipTheDrive parser...");
logger.info(`🔍 Keywords: ${keywords.join(", ")}`);
logger.info(
`📋 Job Types: ${jobTypes.length > 0 ? jobTypes.join(", ") : "All"}`
);
logger.info(`📍 Location Filter: ${locationFilter || "None"}`);
logger.info(`📄 Max Pages: ${maxPages}`);
const browser = await chromium.launch({
headless,
args: [
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
],
});
const context = await browser.newContext({
userAgent:
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
});
const results = [];
const rejectedResults = [];
const seenJobs = new Set();
try {
// Search for each keyword
for (const keyword of keywords) {
logger.info(`\n🔍 Searching for: ${keyword}`);
const searchUrl = buildSearchUrl(keyword, "date", jobTypes);
const page = await context.newPage();
try {
logger.info(
`Attempting navigation to: ${searchUrl} at ${new Date().toISOString()}`
);
await page.goto(searchUrl, {
waitUntil: "domcontentloaded",
timeout: 30000,
});
logger.info(
`Navigation completed successfully at ${new Date().toISOString()}`
);
// Wait for job listings to load
logger.info("Waiting for selector #loops-wrapper");
await page
.waitForSelector("#loops-wrapper", { timeout: 5000 })
.catch(() => {
logger.warning(`No results found for keyword: ${keyword}`);
});
logger.info("Selector wait completed");
let currentPage = 1;
let hasNextPage = true;
while (hasNextPage && currentPage <= maxPages) {
logger.info(`📄 Processing page ${currentPage} for "${keyword}"`);
// Extract all job articles on current page
const jobArticles = await page.$$("article[id^='post-']");
logger.info(
`Found ${jobArticles.length} job listings on page ${currentPage}`
);
for (const article of jobArticles) {
const jobData = await extractJobData(article);
if (!jobData || seenJobs.has(jobData.jobId)) {
continue;
}
seenJobs.add(jobData.jobId);
// Add keyword that found this job
jobData.searchKeyword = keyword;
// Validate job against keywords
const fullText = `${jobData.title} ${jobData.description} ${jobData.company}`;
if (!containsAnyKeyword(fullText, keywords)) {
rejectedResults.push({
...jobData,
rejected: true,
reason: "Keywords not found in job listing",
});
continue;
}
// Location validation (if enabled)
if (locationFilter) {
const locationFilters = parseLocationFilters(locationFilter);
// For SkipTheDrive, most jobs are remote, but we can check the title/description
const locationValid =
fullText.toLowerCase().includes("remote") ||
locationFilters.some((filter) =>
fullText.toLowerCase().includes(filter.toLowerCase())
);
if (!locationValid) {
rejectedResults.push({
...jobData,
rejected: true,
reason: "Location requirements not met",
});
continue;
}
jobData.locationValid = locationValid;
}
logger.success(`✅ Found: ${jobData.title} at ${jobData.company}`);
results.push(jobData);
}
// Check for next page
const nextPageLink = await page.$("a.nextp");
if (nextPageLink && currentPage < maxPages) {
logger.info("📄 Moving to next page...");
await nextPageLink.click();
await page.waitForLoadState("domcontentloaded");
await page.waitForTimeout(2000); // Wait for content to load
currentPage++;
} else {
hasNextPage = false;
}
}
} catch (error) {
logger.error(`Error processing keyword "${keyword}": ${error.message}`);
} finally {
await page.close();
}
}
logger.success(`\n✅ Parsing complete!`);
logger.info(`📊 Total jobs found: ${results.length}`);
logger.info(`❌ Rejected jobs: ${rejectedResults.length}`);
// Run AI analysis if enabled
let aiAnalysis = null;
if (enableAI && results.length > 0) {
logger.step("Running AI analysis on job listings...");
const ollamaStatus = await checkOllamaStatus();
if (ollamaStatus.available) {
const analysisData = results.map((job) => ({
text: `${job.title} at ${job.company}. ${job.description}`,
metadata: {
jobId: job.jobId,
company: job.company,
daysAgo: job.daysAgo,
},
}));
aiAnalysis = await analyzeBatch(analysisData, { context: aiContext });
// Merge AI analysis with results
results.forEach((job, index) => {
if (aiAnalysis && aiAnalysis[index]) {
job.aiAnalysis = {
isRelevant: aiAnalysis[index].isRelevant,
confidence: aiAnalysis[index].confidence,
reasoning: aiAnalysis[index].reasoning,
};
}
});
logger.success("✅ AI analysis completed");
} else {
logger.warning("⚠️ AI not available - skipping analysis");
}
}
return {
results,
rejectedResults,
metadata: {
source: "skipthedrive",
totalJobs: results.length,
rejectedJobs: rejectedResults.length,
keywords: keywords,
jobTypes: jobTypes,
locationFilter: locationFilter,
aiAnalysisEnabled: enableAI,
aiAnalysisCompleted: !!aiAnalysis,
timestamp: new Date().toISOString(),
},
};
} catch (error) {
logger.error(`Fatal error in SkipTheDrive parser: ${error.message}`);
throw error;
} finally {
await browser.close();
}
}
// Export the parser
module.exports = {
parseSkipTheDrive,
buildSearchUrl,
extractJobData,
};

View File

@@ -0,0 +1,302 @@
/**
* SkipTheDrive Parsing Strategy
*
* Uses core-parser for browser management and ai-analyzer for utilities
*/
const {
logger,
cleanText,
containsAnyKeyword,
validateLocationAgainstFilters,
} = require("ai-analyzer");
/**
* SkipTheDrive URL builder
*/
function buildSearchUrl(keyword, orderBy = "date", jobTypes = []) {
const baseUrl = "https://www.skipthedrive.com/";
const params = new URLSearchParams({
s: keyword,
orderby: orderBy,
});
if (jobTypes && jobTypes.length > 0) {
params.append("job_type", jobTypes.join(","));
}
return `${baseUrl}?${params.toString()}`;
}
/**
* SkipTheDrive parsing strategy function
*/
async function skipthedriveStrategy(coreParser, options = {}) {
const {
keywords = ["software engineer", "developer", "programmer"],
locationFilter = null,
maxPages = 5,
jobTypes = [],
} = options;
const results = [];
const rejectedResults = [];
const seenJobs = new Set();
try {
// Create main page
const page = await coreParser.createPage("skipthedrive-main");
logger.info("🚀 Starting SkipTheDrive parser...");
logger.info(`🔍 Keywords: ${keywords.join(", ")}`);
logger.info(`📍 Location Filter: ${locationFilter || "None"}`);
logger.info(`📄 Max Pages: ${maxPages}`);
// Search for each keyword
for (const keyword of keywords) {
logger.info(`\n🔍 Searching for: ${keyword}`);
const searchUrl = buildSearchUrl(keyword, "date", jobTypes);
try {
// Navigate to search results
await coreParser.navigateTo(searchUrl, {
pageId: "skipthedrive-main",
retries: 2,
timeout: 30000,
});
// Wait for job listings to load
const hasResults = await coreParser
.waitForSelector(
"#loops-wrapper",
{
timeout: 5000,
},
"skipthedrive-main"
)
.catch(() => {
logger.warning(`No results found for keyword: ${keyword}`);
return false;
});
if (!hasResults) {
continue;
}
// Process multiple pages
let currentPage = 1;
let hasNextPage = true;
while (hasNextPage && currentPage <= maxPages) {
logger.info(`📄 Processing page ${currentPage} for "${keyword}"`);
// Extract jobs from current page
const pageJobs = await extractJobsFromPage(
page,
keyword,
locationFilter
);
for (const job of pageJobs) {
// Skip duplicates
if (seenJobs.has(job.jobId)) continue;
seenJobs.add(job.jobId);
// Validate location if filtering enabled
if (locationFilter) {
const locationValid = validateLocationAgainstFilters(
job.location,
locationFilter
);
if (!locationValid) {
rejectedResults.push({
...job,
rejectionReason: "Location filter mismatch",
});
continue;
}
}
results.push(job);
}
// Check for next page
hasNextPage = await hasNextPageAvailable(page);
if (hasNextPage && currentPage < maxPages) {
await navigateToNextPage(page, currentPage + 1);
currentPage++;
// Wait for new page to load
await page.waitForTimeout(2000);
} else {
hasNextPage = false;
}
}
} catch (error) {
logger.error(`Error processing keyword "${keyword}": ${error.message}`);
}
}
logger.info(
`🎯 SkipTheDrive parsing completed: ${results.length} jobs found, ${rejectedResults.length} rejected`
);
return {
results,
rejectedResults,
summary: {
totalJobs: results.length,
totalRejected: rejectedResults.length,
keywords: keywords.join(", "),
locationFilter,
source: "skipthedrive",
},
};
} catch (error) {
logger.error(`❌ SkipTheDrive parsing failed: ${error.message}`);
throw error;
}
}
/**
* Extract jobs from current page
*/
async function extractJobsFromPage(page, keyword, locationFilter) {
const jobs = [];
try {
// Get all job article elements
const jobElements = await page.$$("article.job_listing");
for (const jobElement of jobElements) {
try {
const job = await extractJobData(jobElement, keyword);
if (job) {
jobs.push(job);
}
} catch (error) {
logger.warning(`Failed to extract job data: ${error.message}`);
}
}
} catch (error) {
logger.error(`Failed to extract jobs from page: ${error.message}`);
}
return jobs;
}
/**
* Extract data from individual job element
*/
async function extractJobData(jobElement, keyword) {
try {
// Extract job ID
const articleId = (await jobElement.getAttribute("id")) || "";
const jobId = articleId ? articleId.replace("post-", "") : "";
// Extract title
const titleElement = await jobElement.$(".job_listing-title a");
const title = titleElement
? cleanText(await titleElement.textContent())
: "";
const jobUrl = titleElement ? await titleElement.getAttribute("href") : "";
// Extract company
const companyElement = await jobElement.$(".company");
const company = companyElement
? cleanText(await companyElement.textContent())
: "";
// Extract location
const locationElement = await jobElement.$(".location");
const location = locationElement
? cleanText(await locationElement.textContent())
: "";
// Extract date posted
const dateElement = await jobElement.$(".job-date");
const dateText = dateElement
? cleanText(await dateElement.textContent())
: "";
// Extract description
const descElement = await jobElement.$(".job_listing-description");
const description = descElement
? cleanText(await descElement.textContent())
: "";
// Check if featured
const featuredElement = await jobElement.$(".featured");
const isFeatured = featuredElement !== null;
// Parse date
let datePosted = null;
let daysAgo = null;
if (dateText) {
const match = dateText.match(/(\d+)\s+days?\s+ago/);
if (match) {
daysAgo = parseInt(match[1]);
const date = new Date();
date.setDate(date.getDate() - daysAgo);
datePosted = date.toISOString().split("T")[0];
}
}
return {
jobId,
title,
company,
location,
jobUrl,
datePosted,
dateText,
daysAgo,
description,
isFeatured,
keyword,
extractedAt: new Date().toISOString(),
source: "skipthedrive",
};
} catch (error) {
logger.warning(`Error extracting job data: ${error.message}`);
return null;
}
}
/**
* Check if next page is available
*/
async function hasNextPageAvailable(page) {
try {
const nextButton = await page.$(".next-page");
return nextButton !== null;
} catch {
return false;
}
}
/**
* Navigate to next page
*/
async function navigateToNextPage(page, pageNumber) {
try {
const nextButton = await page.$(".next-page");
if (nextButton) {
await nextButton.click();
}
} catch (error) {
logger.warning(
`Failed to navigate to page ${pageNumber}: ${error.message}`
);
}
}
module.exports = {
skipthedriveStrategy,
buildSearchUrl,
extractJobsFromPage,
extractJobData,
};

View File

@@ -1,2 +0,0 @@
keyword
fired

315
linkedin-parser/README.md Normal file
View File

@@ -0,0 +1,315 @@
# LinkedIn Parser
LinkedIn posts parser with **integrated AI analysis** using the ai-analyzer core package. AI analysis is now embedded directly into the results JSON file.
## 🚀 Quick Start
```bash
# Install dependencies
npm install
# Run with default settings (AI analysis integrated into results)
npm start
# Run without AI analysis
npm run start:no-ai
```
## 📋 Available Scripts
### Parser Modes
```bash
# Basic parsing with integrated AI analysis
npm start
# Parsing without AI analysis
npm run start:no-ai
# Headless browser mode
npm run start:headless
# Visible browser mode (for debugging)
npm run start:visible
# Disable location filtering
npm run start:no-location
# Custom keywords
npm run start:custom
```
### Testing
```bash
# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage
```
### AI Analysis (CLI)
```bash
# Analyze latest results file with default context
npm run analyze:latest
# Analyze latest results file for layoffs
npm run analyze:layoff
# Analyze latest results file for job market trends
npm run analyze:trends
# Analyze specific file (requires --input parameter)
npm run analyze -- --input=results.json
```
### Utilities
```bash
# Show help
npm run help
# Run demo
npm run demo
# Install Playwright browser
npm run install:playwright
```
## 🔧 Configuration
### Environment Variables
Create a `.env` file in the `linkedin-parser` directory:
```env
# LinkedIn Credentials
LINKEDIN_USERNAME=your_email@example.com
LINKEDIN_PASSWORD=your_password
# Search Configuration
CITY=Toronto
DATE_POSTED=past-week
SORT_BY=date_posted
WHEELS=5
# Location Filtering
LOCATION_FILTER=Ontario,Manitoba
ENABLE_LOCATION_CHECK=true
# AI Analysis
ENABLE_AI_ANALYSIS=true
AI_CONTEXT="job market analysis and trends"
OLLAMA_MODEL=mistral
# Browser Configuration
HEADLESS=true
```
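Since every `.env` value arrives as a string, the parsers in this repo interpret boolean flags with string comparisons. A minimal sketch of that pattern — `readConfig` is an illustrative helper, not part of the parser's API; note the asymmetry the codebase uses (`HEADLESS` defaults to true unless it is exactly `"false"`, while `ENABLE_AI_ANALYSIS` must be exactly `"true"` to switch on):

```javascript
// Interpret string-valued .env flags the way the parsers above do.
function readConfig(env) {
  return {
    headless: env.HEADLESS !== "false", // default: true
    aiAnalysis: env.ENABLE_AI_ANALYSIS === "true", // default: false
    // "Ontario,Manitoba" -> ["Ontario", "Manitoba"]
    locationFilters: (env.LOCATION_FILTER || "")
      .split(",")
      .map((s) => s.trim())
      .filter(Boolean),
  };
}

console.log(
  readConfig({
    HEADLESS: "false",
    ENABLE_AI_ANALYSIS: "true",
    LOCATION_FILTER: "Ontario,Manitoba",
  })
);
```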
### Command Line Options
```bash
# Browser options
--headless=true|false # Browser headless mode
--keyword="kw1,kw2" # Specific keywords
--add-keyword="kw" # Additional keywords
--no-location # Disable location filtering
--no-ai # Disable AI analysis
```
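A minimal sketch of how flags in this style can be parsed from `process.argv.slice(2)` — `parseFlags` is a hypothetical helper shown for illustration; the real parser has its own argument handling:

```javascript
// Parse --key=value and bare --no-* flags into an options object.
function parseFlags(argv) {
  const options = { headless: true, keywords: [], locationCheck: true, ai: true };
  for (const arg of argv) {
    if (arg.startsWith("--headless=")) {
      options.headless = arg.split("=")[1] !== "false";
    } else if (arg.startsWith("--keyword=") || arg.startsWith("--add-keyword=")) {
      // Comma-separated lists: --keyword="kw1,kw2" (shell strips the quotes)
      options.keywords.push(...arg.split("=")[1].split(",").map((k) => k.trim()));
    } else if (arg === "--no-location") {
      options.locationCheck = false;
    } else if (arg === "--no-ai") {
      options.ai = false;
    }
  }
  return options;
}

console.log(parseFlags(["--keyword=layoff,hiring", "--no-ai"]));
```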
## 📊 Output Files
The parser generates two main files:
1. **`linkedin-results-YYYY-MM-DD-HH-MM.json`** - Main results with **integrated AI analysis**
2. **`linkedin-rejected-YYYY-MM-DD-HH-MM.json`** - Rejected posts with reasons
### Results Structure
Each result in the JSON file now includes AI analysis:
```json
{
"metadata": {
"timestamp": "2025-07-21T02:00:08.561Z",
"totalPosts": 10,
"aiAnalysisEnabled": true,
"aiAnalysisCompleted": true,
"aiContext": "job market analysis and trends",
"aiModel": "mistral"
},
"results": [
{
"keyword": "layoff",
"text": "Post content...",
"profileLink": "https://linkedin.com/in/user",
"location": "Toronto, Ontario",
"aiAnalysis": {
"isRelevant": true,
"confidence": 0.9,
"reasoning": "Post discusses job market conditions and hiring",
"context": "job market analysis and trends",
"model": "mistral",
"analyzedAt": "2025-07-21T02:48:42.487Z"
}
}
]
}
```
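Because the AI verdict is embedded in each result, post-processing is a plain filter over the JSON. A sketch under the field names shown in the sample above (`filterRelevant` is illustrative, not part of the parser):

```javascript
// Keep only posts the AI marked relevant at or above a confidence threshold.
// Posts parsed with --no-ai have no aiAnalysis field and are dropped.
function filterRelevant(resultsFile, minConfidence = 0.8) {
  return resultsFile.results.filter(
    (post) =>
      post.aiAnalysis &&
      post.aiAnalysis.isRelevant &&
      post.aiAnalysis.confidence >= minConfidence
  );
}

const sample = {
  results: [
    { keyword: "layoff", aiAnalysis: { isRelevant: true, confidence: 0.9 } },
    { keyword: "layoff", aiAnalysis: { isRelevant: false, confidence: 0.95 } },
    { keyword: "hiring" }, // parsed without AI analysis
  ],
};
console.log(filterRelevant(sample).length); // prints 1
```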
## 🧠 AI Analysis Workflow
### Automatic Integration
AI analysis runs automatically after parsing completes and is **embedded directly into the results JSON** (unless disabled with `--no-ai`).
### Manual Re-analysis
You can re-analyze existing results with different contexts using the CLI:
```bash
# Analyze latest results with default context
npm run analyze:latest
# Analyze latest results for layoffs
npm run analyze:layoff
# Analyze latest results for job market trends
npm run analyze:trends
# Analyze specific file with custom context
node ../ai-analyzer/cli.js --input=results.json --context="custom analysis"
```
### CLI Options
The AI analyzer CLI supports:
```bash
--input=FILE # Input JSON file
--output=FILE # Output file (default: original-ai.json)
--context="description" # Analysis context
--model=MODEL # Ollama model (default: mistral)
--latest # Use latest results file
--dir=PATH # Directory to look for results
```
## 🎯 Use Cases
### Basic Usage
```bash
# Run parser with integrated AI analysis
npm start
```
### Testing Different Keywords
```bash
# Test with custom keywords
npm run start:custom
```
### Debugging
```bash
# Run with visible browser
npm run start:visible
# Run without location filtering
npm run start:no-location
```
### Re-analyzing Data
```bash
# After running parser, re-analyze with different contexts
npm run analyze:layoff
npm run analyze:trends
# Analyze specific file
node ../ai-analyzer/cli.js --input=results/linkedin-results-2025-07-20-18-00.json
```
## 🔍 Troubleshooting
### Common Issues
1. **Missing credentials**
```bash
# Check .env file exists and has credentials
cat .env
```
2. **Browser issues**
```bash
# Install Playwright browser
npm run install:playwright
```
3. **AI not available**
```bash
# Make sure Ollama is running
ollama list
# Install mistral model if needed
ollama pull mistral
```
4. **No results found**
```bash
# Try different keywords
npm run start:custom
```
5. **CLI can't find results**
```bash
# Make sure you're in the linkedin-parser directory
cd linkedin-parser
npm run analyze:latest
```
## 📁 Project Structure
```
linkedin-parser/
├── index.js # Main parser with integrated AI analysis
├── package.json # Dependencies and scripts
├── .env # Configuration (create this)
├── keywords/ # Keyword CSV files
└── results/ # Output files (created automatically)
├── linkedin-results-*.json # Results with integrated AI analysis
└── linkedin-rejected-*.json # Rejected posts
```
## 🤝 Integration
This parser integrates with:
- **ai-analyzer**: Core AI utilities and CLI analysis tool
- **job-search-parser**: Job market intelligence (separate module)
### AI Analysis Package
The `ai-analyzer` package provides:
- **Library functions**: `analyzeBatch`, `checkOllamaStatus`, etc.
- **CLI tool**: `cli.js` for standalone analysis
- **Reusable components**: For other parsers in the ecosystem
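The library functions above can also be driven directly from another script. A sketch of handing parser results to batch analysis, using the `{ text, metadata }` item shape the SkipTheDrive parser builds before calling `analyzeBatch` — `toAnalysisInput` is an illustrative helper, not part of the package API:

```javascript
// Reshape parsed posts into the { text, metadata } items expected by analyzeBatch.
function toAnalysisInput(results) {
  return results.map((post) => ({
    text: post.text,
    metadata: { profileLink: post.profileLink, keyword: post.keyword },
  }));
}

// Usage — commented out because it needs a running Ollama instance:
// const { analyzeBatch, checkOllamaStatus } = require("ai-analyzer");
// const status = await checkOllamaStatus();
// if (status.available) {
//   const analysis = await analyzeBatch(toAnalysisInput(results), {
//     context: "job market analysis and trends",
//   });
// }

console.log(
  toAnalysisInput([
    { text: "Post content", profileLink: "https://linkedin.com/in/user", keyword: "layoff" },
  ])
);
```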
## 🆕 What's New
- **Integrated AI Analysis**: AI results are now embedded directly in the results JSON
- **No Separate Files**: No more separate AI analysis files to manage
- **Rich Context**: Each post includes detailed AI insights
- **Flexible Re-analysis**: Easy to re-analyze with different contexts
- **Backward Compatible**: Original data structure preserved

412
linkedin-parser/demo.js Normal file
View File

@@ -0,0 +1,412 @@
/**
* LinkedIn Parser Demo
*
* Demonstrates the LinkedIn Parser's capabilities for scraping LinkedIn content
* with keyword-based searching, location filtering, and AI analysis.
*
* This demo uses simulated data for safety and demonstration purposes.
*/
const { logger } = require("../ai-analyzer");
const fs = require("fs");
const path = require("path");
// Terminal colors for demo output
const colors = {
reset: "\x1b[0m",
bright: "\x1b[1m",
cyan: "\x1b[36m",
green: "\x1b[32m",
yellow: "\x1b[33m",
blue: "\x1b[34m",
magenta: "\x1b[35m",
red: "\x1b[31m",
};
const demo = {
title: (text) =>
console.log(`\n${colors.bright}${colors.cyan}${text}${colors.reset}`),
section: (text) =>
console.log(`\n${colors.bright}${colors.magenta}${text}${colors.reset}`),
success: (text) => console.log(`${colors.green}${text}${colors.reset}`),
info: (text) => console.log(`${colors.blue}ℹ️ ${text}${colors.reset}`),
warning: (text) => console.log(`${colors.yellow}⚠️ ${text}${colors.reset}`),
error: (text) => console.log(`${colors.red}${text}${colors.reset}`),
code: (text) => console.log(`${colors.cyan}${text}${colors.reset}`),
};
// Mock data for demonstration
const mockPosts = [
{
id: "post_1",
content:
"Just got laid off from my software engineering role at TechCorp. Looking for new opportunities in Toronto. This is really tough but I'm staying positive!",
original_content:
"Just got #laidoff from my software engineering role at TechCorp! Looking for new opportunities in #Toronto. This is really tough but I'm staying positive! 🚀",
author: {
name: "John Doe",
title: "Software Engineer",
company: "TechCorp",
location: "Toronto, Ontario, Canada",
profile_url: "https://linkedin.com/in/johndoe",
},
engagement: { likes: 45, comments: 12, shares: 3 },
metadata: {
post_date: "2024-01-10T14:30:00Z",
scraped_at: "2024-01-15T10:30:00Z",
search_keyword: "layoff",
location_validated: true,
},
},
{
id: "post_2",
content:
"Our company is downsizing and I'm affected. This is really tough news but I'm grateful for the time I had here.",
original_content:
"Our company is #downsizing and I'm affected. This is really tough news but I'm grateful for the time I had here. #RIF #layoff",
author: {
name: "Jane Smith",
title: "Product Manager",
company: "StartupXYZ",
location: "Vancouver, British Columbia, Canada",
profile_url: "https://linkedin.com/in/janesmith",
},
engagement: { likes: 23, comments: 8, shares: 1 },
metadata: {
post_date: "2024-01-09T16:45:00Z",
scraped_at: "2024-01-15T10:30:00Z",
search_keyword: "downsizing",
location_validated: true,
},
},
{
id: "post_3",
content:
"Open to work! Looking for new opportunities in software development. I have 5 years of experience in React, Node.js, and cloud technologies.",
original_content:
"Open to work! Looking for new opportunities in software development. I have 5 years of experience in #React, #NodeJS, and #cloud technologies. #opentowork #jobsearch",
author: {
name: "Bob Wilson",
title: "Full Stack Developer",
company: "Freelance",
location: "Calgary, Alberta, Canada",
profile_url: "https://linkedin.com/in/bobwilson",
},
engagement: { likes: 67, comments: 15, shares: 8 },
metadata: {
post_date: "2024-01-08T11:20:00Z",
scraped_at: "2024-01-15T10:30:00Z",
search_keyword: "open to work",
location_validated: true,
},
},
];
async function runDemo() {
demo.title("=== LinkedIn Parser Demo ===");
demo.info(
"This demo showcases the LinkedIn Parser's capabilities for scraping LinkedIn content."
);
demo.info("All data shown is simulated for demonstration purposes.");
demo.info("Press Enter to continue through each section...\n");
await waitForEnter();
// 1. Configuration Demo
await demonstrateConfiguration();
// 2. Keyword Loading Demo
await demonstrateKeywordLoading();
// 3. Search Process Demo
await demonstrateSearchProcess();
// 4. Location Filtering Demo
await demonstrateLocationFiltering();
// 5. AI Analysis Demo
await demonstrateAIAnalysis();
// 6. Output Generation Demo
await demonstrateOutputGeneration();
demo.title("=== Demo Complete ===");
demo.success("LinkedIn Parser demo completed successfully!");
demo.info("Check the README.md for detailed usage instructions.");
}
async function demonstrateConfiguration() {
demo.section("1. Configuration Setup");
demo.info(
"The LinkedIn Parser uses environment variables and command-line options for configuration."
);
demo.code("// Environment Variables (.env file)");
demo.info("LINKEDIN_USERNAME=your_email@example.com");
demo.info("LINKEDIN_PASSWORD=your_password");
demo.info("CITY=Toronto");
demo.info("DATE_POSTED=past-week");
demo.info("SORT_BY=date_posted");
demo.info("WHEELS=5");
demo.info("LOCATION_FILTER=Ontario,Manitoba");
demo.info("ENABLE_LOCATION_CHECK=true");
demo.info("ENABLE_LOCAL_AI=true");
demo.info('AI_CONTEXT="job layoffs and workforce reduction"');
demo.info("OLLAMA_MODEL=mistral");
demo.code("// Command Line Options");
demo.info('node index.js --keyword="layoff,downsizing" --city="Vancouver"');
demo.info("node index.js --no-location --no-ai");
demo.info("node index.js --output=results/my-results.json");
demo.info("node index.js --ai-after");
await waitForEnter();
}
async function demonstrateKeywordLoading() {
demo.section("2. Keyword Loading");
demo.info(
"Keywords can be loaded from CSV files or specified via command line."
);
// Simulate loading keywords from CSV
demo.code("// Loading keywords from CSV file");
logger.step("Loading keywords from keywords/linkedin-keywords.csv");
const keywords = [
"layoff",
"downsizing",
"reduction in force",
"RIF",
"termination",
"job loss",
"workforce reduction",
"open to work",
"actively seeking",
"job search",
];
demo.success(`Loaded ${keywords.length} keywords from CSV file`);
demo.info("Keywords: " + keywords.slice(0, 5).join(", ") + "...");
demo.code("// Command line keyword override");
demo.info('node index.js --keyword="layoff,downsizing"');
demo.info('node index.js --add-keyword="hiring freeze"');
await waitForEnter();
}
async function demonstrateSearchProcess() {
demo.section("3. Search Process Simulation");
demo.info(
"The parser performs automated LinkedIn searches for each keyword."
);
const keywords = ["layoff", "downsizing", "open to work"];
for (const keyword of keywords) {
demo.code(`// Searching for keyword: "${keyword}"`);
logger.search(`Searching for "${keyword}" in Toronto`);
// Simulate search process
await simulateSearch();
const foundCount = Math.floor(Math.random() * 50) + 10;
const acceptedCount = Math.floor(foundCount * 0.3);
logger.info(`Found ${foundCount} posts, checking profiles for location...`);
logger.success(`Accepted ${acceptedCount} posts after location validation`);
console.log();
}
await waitForEnter();
}
async function demonstrateLocationFiltering() {
demo.section("4. Location Filtering");
demo.info(
"Posts are filtered based on author location using geographic validation."
);
demo.code("// Location filter configuration");
demo.info("LOCATION_FILTER=Ontario,Manitoba");
demo.info("ENABLE_LOCATION_CHECK=true");
demo.code("// Location validation examples");
const testLocations = [
{ location: "Toronto, Ontario, Canada", valid: true },
{ location: "Vancouver, British Columbia, Canada", valid: false },
{ location: "Calgary, Alberta, Canada", valid: false },
{ location: "Winnipeg, Manitoba, Canada", valid: true },
{ location: "New York, NY, USA", valid: false },
];
testLocations.forEach(({ location, valid }) => {
logger.location(`Checking location: ${location}`);
if (valid) {
logger.success(`✅ Location valid - post accepted`);
} else {
logger.warning(`❌ Location invalid - post rejected`);
}
});
await waitForEnter();
}
async function demonstrateAIAnalysis() {
demo.section("5. AI Analysis");
demo.info(
"Posts can be analyzed using local Ollama or OpenAI for relevance scoring."
);
demo.code("// AI analysis configuration");
demo.info("ENABLE_LOCAL_AI=true");
demo.info('AI_CONTEXT="job layoffs and workforce reduction"');
demo.info("OLLAMA_MODEL=mistral");
demo.code("// Analyzing posts with AI");
logger.ai("Starting AI analysis of accepted posts...");
for (let i = 0; i < mockPosts.length; i++) {
const post = mockPosts[i];
logger.info(`Analyzing post ${i + 1}: ${post.content.substring(0, 50)}...`);
// Simulate AI analysis
await simulateProcessing();
const relevanceScore = 0.7 + Math.random() * 0.3;
const confidence = 0.8 + Math.random() * 0.2;
logger.success(
`Relevance: ${relevanceScore.toFixed(
2
)}, Confidence: ${confidence.toFixed(2)}`
);
// Add AI analysis to post
post.ai_analysis = {
relevance_score: relevanceScore,
confidence: confidence,
context_match: relevanceScore > 0.7,
analysis_text: `This post discusses ${post.metadata.search_keyword} and is relevant to the search context.`,
};
}
await waitForEnter();
}
async function demonstrateOutputGeneration() {
demo.section("6. Output Generation");
demo.info("Results are saved to JSON files with comprehensive metadata.");
demo.code("// Generating output file");
logger.file("Saving results to JSON file...");
const outputData = {
metadata: {
timestamp: new Date().toISOString(),
keywords: ["layoff", "downsizing", "open to work"],
city: "Toronto",
date_posted: "past-week",
sort_by: "date_posted",
total_posts_found: 150,
accepted_posts: mockPosts.length,
rejected_posts: 147,
processing_time_seconds: 180,
},
posts: mockPosts,
};
// Save to demo file
const outputPath = path.join(__dirname, "demo-results.json");
fs.writeFileSync(outputPath, JSON.stringify(outputData, null, 2));
demo.success(`Results saved to: ${outputPath}`);
demo.info(`Total posts processed: ${outputData.metadata.total_posts_found}`);
demo.info(`Posts accepted: ${outputData.metadata.accepted_posts}`);
demo.info(`Posts rejected: ${outputData.metadata.rejected_posts}`);
demo.code("// Output file structure");
demo.info("📁 demo-results.json");
demo.info(" ├── metadata");
demo.info(" │ ├── timestamp");
demo.info(" │ ├── keywords");
demo.info(" │ ├── city");
demo.info(" │ ├── total_posts_found");
demo.info(" │ ├── accepted_posts");
demo.info(" │ └── processing_time_seconds");
demo.info(" └── posts[]");
demo.info(" ├── id");
demo.info(" ├── content");
demo.info(" ├── author");
demo.info(" ├── engagement");
demo.info(" ├── ai_analysis");
demo.info(" └── metadata");
await waitForEnter();
}
// Helper functions
function waitForEnter() {
return new Promise((resolve) => {
const readline = require("readline");
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
rl.question("\nPress Enter to continue...", () => {
rl.close();
resolve();
});
});
}
async function simulateSearch() {
return new Promise((resolve) => {
const steps = [
"Launching browser",
"Logging in",
"Navigating to search",
"Loading results",
];
let i = 0;
const interval = setInterval(() => {
if (i < steps.length) {
logger.info(steps[i]);
i++;
} else {
clearInterval(interval);
resolve();
}
}, 800);
});
}
async function simulateProcessing() {
return new Promise((resolve) => {
const dots = [".", "..", "..."];
let i = 0;
const interval = setInterval(() => {
process.stdout.write(`\rProcessing${dots[i]}`);
i = (i + 1) % dots.length;
}, 500);
setTimeout(() => {
clearInterval(interval);
process.stdout.write("\r");
resolve();
}, 1500);
});
}
// Run the demo if this file is executed directly
if (require.main === module) {
runDemo().catch((error) => {
demo.error(`Demo failed: ${error.message}`);
process.exit(1);
});
}
module.exports = { runDemo };

linkedin-parser/index.js (new file, 146 lines)

@@ -0,0 +1,146 @@
#!/usr/bin/env node
/**
* LinkedIn Parser - Refactored
*
* Uses core-parser for browser management and linkedin-strategy for parsing logic
*/
const path = require("path");
const fs = require("fs");
const CoreParser = require("../core-parser");
const { linkedinStrategy } = require("./strategies/linkedin-strategy");
const { logger, analyzeBatch, checkOllamaStatus } = require("ai-analyzer");
// Load environment variables
require("dotenv").config({ path: path.join(__dirname, ".env") });
// Configuration from environment
const LINKEDIN_USERNAME = process.env.LINKEDIN_USERNAME;
const LINKEDIN_PASSWORD = process.env.LINKEDIN_PASSWORD;
const HEADLESS = process.env.HEADLESS !== "false";
const SEARCH_KEYWORDS =
process.env.SEARCH_KEYWORDS || "layoff,downsizing,job cuts";
const LOCATION_FILTER = process.env.LOCATION_FILTER;
const ENABLE_AI_ANALYSIS = process.env.ENABLE_AI_ANALYSIS === "true";
const MAX_RESULTS = parseInt(process.env.MAX_RESULTS) || 50;
/**
* Main LinkedIn parser function
*/
async function startLinkedInParser(options = {}) {
const coreParser = new CoreParser({
headless: HEADLESS,
timeout: 30000,
});
try {
logger.step("🚀 LinkedIn Parser Starting...");
// Validate credentials
if (!LINKEDIN_USERNAME || !LINKEDIN_PASSWORD) {
throw new Error(
"LinkedIn credentials not found. Please set LINKEDIN_USERNAME and LINKEDIN_PASSWORD in .env file"
);
}
// Parse keywords
const keywords = SEARCH_KEYWORDS.split(",").map((k) => k.trim());
logger.info(`🔍 Search Keywords: ${keywords.join(", ")}`);
logger.info(`📍 Location Filter: ${LOCATION_FILTER || "None"}`);
logger.info(
`🧠 AI Analysis: ${ENABLE_AI_ANALYSIS ? "Enabled" : "Disabled"}`
);
logger.info(`📊 Max Results: ${MAX_RESULTS}`);
// Run LinkedIn parsing strategy
const parseResult = await linkedinStrategy(coreParser, {
keywords,
locationFilter: LOCATION_FILTER,
maxResults: MAX_RESULTS,
credentials: {
username: LINKEDIN_USERNAME,
password: LINKEDIN_PASSWORD,
},
});
const { results, rejectedResults, summary } = parseResult;
// AI Analysis if enabled
let analysisResults = null;
if (ENABLE_AI_ANALYSIS && results.length > 0) {
logger.step("🧠 Running AI Analysis...");
const ollamaStatus = await checkOllamaStatus();
if (ollamaStatus.available) {
analysisResults = await analyzeBatch(results, {
context:
"LinkedIn posts analysis focusing on job market trends and layoffs",
});
logger.success(`✅ AI Analysis completed for ${results.length} posts`);
} else {
logger.warning("⚠️ Ollama not available, skipping AI analysis");
}
}
// Save results
const outputData = {
metadata: {
extractedAt: new Date().toISOString(),
parser: "linkedin-parser",
version: "2.0.0",
summary,
analysisResults,
},
results,
rejectedResults,
};
const resultsDir = path.join(__dirname, "results");
if (!fs.existsSync(resultsDir)) {
fs.mkdirSync(resultsDir, { recursive: true });
}
const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
const filename = `linkedin-results-${timestamp}.json`;
const filepath = path.join(resultsDir, filename);
fs.writeFileSync(filepath, JSON.stringify(outputData, null, 2));
// Final summary
logger.success("✅ LinkedIn parsing completed successfully!");
logger.info(`📊 Total posts found: ${results.length}`);
logger.info(`❌ Total rejected: ${rejectedResults.length}`);
logger.info(`📁 Results saved to: ${filepath}`);
return outputData;
} catch (error) {
logger.error(`❌ LinkedIn parser failed: ${error.message}`);
throw error;
} finally {
await coreParser.cleanup();
}
}
// CLI handling
if (require.main === module) {
const args = process.argv.slice(2);
const options = {};
// Parse command line arguments
args.forEach((arg) => {
if (arg.startsWith("--")) {
const [key, value] = arg.slice(2).split("=");
options[key] = value || true;
}
});
startLinkedInParser(options)
.then(() => process.exit(0))
.catch((error) => {
console.error("Fatal error:", error.message);
process.exit(1);
});
}
module.exports = { startLinkedInParser };

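The CLI block in `index.js` above turns `--key=value` flags into an options object, with bare flags becoming `true`. A standalone sketch of that same parsing logic (extracted for illustration; the in-file version mutates a shared `options`):

```javascript
// Mirrors the argument parsing in index.js: "--key=value" -> { key: "value" },
// bare "--flag" -> { flag: true }. Non "--" arguments are ignored.
function parseArgs(args) {
  const options = {};
  args.forEach((arg) => {
    if (arg.startsWith("--")) {
      const [key, value] = arg.slice(2).split("=");
      options[key] = value || true;
    }
  });
  return options;
}

console.log(parseArgs(["--keyword=layoff,downsizing", "--no-ai"]));
// { keyword: 'layoff,downsizing', 'no-ai': true }
```

One quirk worth knowing: because of `value || true`, an explicitly empty value like `--output=` also comes back as `true`, not an empty string.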
keywords CSV (modified)

@@ -1,34 +1,51 @@
keyword
layoff
terminated
termination
redundancy
redundancies
restructuring
acquisition
actively seeking
bankruptcy
business realignment
career transition
company closure
company reorganization
cost cutting
workforce reduction
job cuts
job loss
department closure
downsizing
furlough
separation
outplacement
workforce adjustment
rightsizing
business realignment
organizational change
position elimination
role elimination
job elimination
staff reduction
headcount reduction
voluntary separation
hiring
hiring freeze
involuntary separation
job cuts
job elimination
job loss
job opportunity
job search
layoff
looking for opportunities
mass layoff
company reorganization
department closure
site closure
plant closure
merger
new position
new role
office closure
open to work
organizational change
outplacement
plant closure
position elimination
recruiting
reduction in force
redundancies
redundancy
restructuring
rightsizing
RIF
role elimination
separation
site closure
staff reduction
terminated
termination
voluntary separation
workforce adjustment
workforce optimization
workforce reduction
workforce transition

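The keyword files above are one-column CSVs with a `keyword` header and one term per line. The scraper itself loads them with the `csv-parser` package; the following is a dependency-free sketch of the same format, for illustration only:

```javascript
// Dependency-free sketch of the keyword-CSV format used above: a "keyword"
// header row followed by one term per line. Blank lines are skipped.
function parseKeywordCsv(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter(Boolean);
  if (lines[0] !== "keyword") {
    throw new Error('Expected a "keyword" header row');
  }
  return lines.slice(1);
}

const sample = "keyword\nlayoff\nhiring freeze\nopen to work\n";
console.log(parseKeywordCsv(sample));
// [ 'layoff', 'hiring freeze', 'open to work' ]
```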
linkedin-parser/package.json (new file)

@@ -0,0 +1,42 @@
{
"name": "linkedout-parser",
"version": "1.0.0",
"description": "LinkedIn posts parser using ai-analyzer core",
"main": "index.js",
"scripts": {
"start": "node index.js",
"start:no-ai": "node index.js --no-ai",
"start:headless": "node index.js --headless=true",
"start:visible": "node index.js --headless=false",
"start:no-location": "node index.js --no-location",
"start:custom": "node index.js --keyword=\"layoff,downsizing\"",
"test": "jest",
"test:watch": "jest --watch",
"test:coverage": "jest --coverage",
"demo": "node demo.js",
"analyze": "node ../ai-analyzer/cli.js --dir=results",
"analyze:latest": "node ../ai-analyzer/cli.js --latest --dir=results",
"analyze:layoff": "node ../ai-analyzer/cli.js --latest --dir=results --context=\"layoff analysis\"",
"analyze:trends": "node ../ai-analyzer/cli.js --latest --dir=results --context=\"job market trends\"",
"help": "node index.js --help",
"install:playwright": "npx playwright install chromium"
},
"keywords": [
"linkedin",
"parser",
"scraper",
"ai"
],
"author": "",
"license": "ISC",
"type": "commonjs",
"dependencies": {
"ai-analyzer": "file:../ai-analyzer",
"core-parser": "file:../core-parser",
"dotenv": "^17.0.0",
"csv-parser": "^3.2.0"
},
"devDependencies": {
"jest": "^29.0.0"
}
}

View File

@ -0,0 +1,230 @@
/**
* LinkedIn Parsing Strategy
*
* Uses core-parser for browser management and ai-analyzer for utilities
*/
const {
logger,
cleanText,
containsAnyKeyword,
validateLocationAgainstFilters,
extractLocationFromProfile,
} = require("ai-analyzer");
/**
* LinkedIn parsing strategy function
*/
async function linkedinStrategy(coreParser, options = {}) {
const {
keywords = ["layoff", "downsizing", "job cuts"],
locationFilter = null,
maxResults = 50,
credentials = {},
} = options;
const results = [];
const rejectedResults = [];
const seenPosts = new Set();
const seenProfiles = new Set();
try {
// Create main page
const page = await coreParser.createPage("linkedin-main");
// Authenticate to LinkedIn
logger.info("🔐 Authenticating to LinkedIn...");
await coreParser.authenticate("linkedin", credentials, "linkedin-main");
logger.info("✅ LinkedIn authentication successful");
// Search for posts with each keyword
for (const keyword of keywords) {
logger.info(`🔍 Searching LinkedIn for: "${keyword}"`);
const searchUrl = `https://www.linkedin.com/search/results/content/?keywords=${encodeURIComponent(
keyword
)}&sortBy=date_posted`;
await coreParser.navigateTo(searchUrl, {
pageId: "linkedin-main",
retries: 2,
});
// Wait for search results
const hasResults = await coreParser.navigationManager.navigateAndWaitFor(
searchUrl,
".search-results-container",
{ pageId: "linkedin-main", timeout: 10000 }
);
if (!hasResults) {
logger.warning(`No search results found for keyword: ${keyword}`);
continue;
}
// Extract posts from current page
const posts = await extractPostsFromPage(page, keyword);
for (const post of posts) {
// Skip duplicates
if (seenPosts.has(post.postId)) continue;
seenPosts.add(post.postId);
// Validate location if filtering enabled
if (locationFilter) {
const locationValid = validateLocationAgainstFilters(
post.location || post.profileLocation,
locationFilter
);
if (!locationValid) {
rejectedResults.push({
...post,
rejectionReason: "Location filter mismatch",
});
continue;
}
}
results.push(post);
if (results.length >= maxResults) {
logger.info(`📊 Reached maximum results limit: ${maxResults}`);
break;
}
}
if (results.length >= maxResults) break;
}
logger.info(
`🎯 LinkedIn parsing completed: ${results.length} posts found, ${rejectedResults.length} rejected`
);
return {
results,
rejectedResults,
summary: {
totalPosts: results.length,
totalRejected: rejectedResults.length,
keywords: keywords.join(", "),
locationFilter,
},
};
} catch (error) {
logger.error(`❌ LinkedIn parsing failed: ${error.message}`);
throw error;
}
}
/**
* Extract posts from current search results page
*/
async function extractPostsFromPage(page, keyword) {
const posts = [];
try {
// Get all post elements
const postElements = await page.$$(".feed-shared-update-v2");
for (const postElement of postElements) {
try {
const post = await extractPostData(postElement, keyword);
if (post) {
posts.push(post);
}
} catch (error) {
logger.warning(`Failed to extract post data: ${error.message}`);
}
}
} catch (error) {
logger.error(`Failed to extract posts from page: ${error.message}`);
}
return posts;
}
/**
* Extract data from individual post element
*/
async function extractPostData(postElement, keyword) {
try {
// Extract post ID
const postId = (await postElement.getAttribute("data-urn")) || "";
// Extract author info
const authorElement = await postElement.$(".feed-shared-actor__name");
const authorName = authorElement
? cleanText(await authorElement.textContent())
: "";
const authorLinkElement = await postElement.$(".feed-shared-actor__name a");
const authorUrl = authorLinkElement
? await authorLinkElement.getAttribute("href")
: "";
// Extract post content
const contentElement = await postElement.$(".feed-shared-text");
const content = contentElement
? cleanText(await contentElement.textContent())
: "";
// Extract timestamp
const timeElement = await postElement.$(
".feed-shared-actor__sub-description time"
);
const timestamp = timeElement
? await timeElement.getAttribute("datetime")
: "";
// Extract engagement metrics
const likesElement = await postElement.$(".social-counts-reactions__count");
const likesText = likesElement
? cleanText(await likesElement.textContent())
: "0";
const commentsElement = await postElement.$(
".social-counts-comments__count"
);
const commentsText = commentsElement
? cleanText(await commentsElement.textContent())
: "0";
// Check if post contains relevant keywords
const isRelevant = containsAnyKeyword(content, [keyword]);
if (!isRelevant) {
return null; // Skip irrelevant posts
}
return {
postId: cleanText(postId),
authorName,
authorUrl,
content,
timestamp,
keyword,
likes: extractNumber(likesText),
comments: extractNumber(commentsText),
extractedAt: new Date().toISOString(),
source: "linkedin",
};
} catch (error) {
logger.warning(`Error extracting post data: ${error.message}`);
return null;
}
}
/**
* Extract numbers from text (e.g., "15 likes" -> 15)
*/
function extractNumber(text) {
const match = text.match(/\d+/);
return match ? parseInt(match[0]) : 0;
}
module.exports = {
linkedinStrategy,
extractPostsFromPage,
extractPostData,
};

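The `extractNumber` helper in the strategy above takes the first run of digits from engagement text such as "15 likes". A quick sketch of its behavior, including a caveat the current regex has with grouped counts:

```javascript
// Same logic as extractNumber in linkedin-strategy.js: first digit run, else 0.
// Caveat: /\d+/ stops at the first non-digit, so a grouped count like "1,234"
// yields only the digits before the comma.
function extractNumber(text) {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 0;
}

console.log(extractNumber("15 likes")); // 15
console.log(extractNumber("no count")); // 0
console.log(extractNumber("1,234"));    // 1
```

If LinkedIn renders large counts with separators, stripping commas first (`text.replace(/,/g, "")`) before matching would avoid the truncation.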
linkedout.js (deleted, 648 lines)

@@ -1,648 +0,0 @@
/**
* LinkedIn Posts Scraper (LinkedOut)
*
* A comprehensive tool for scraping LinkedIn posts based on keyword searches.
* Designed to track job market trends, layoffs, and open work opportunities
* by monitoring LinkedIn content automatically.
*
* FEATURES:
* - Automated LinkedIn login with browser automation
* - Keyword-based post searching from CSV files or CLI
* - Configurable search parameters (date, location, sorting)
* - Duplicate detection for posts and profiles
* - Text cleaning (removes hashtags, URLs, emojis)
* - Timestamped JSON output files
* - Command-line parameter overrides (see below)
* - Enhanced geographic location validation
* - Optional local AI-powered context analysis (Ollama)
*
* USAGE:
* node linkedout.js [options]
*
* COMMAND-LINE OPTIONS:
* --headless=true|false Override browser headless mode
* --keyword="kw1,kw2" Use only these keywords (comma-separated, overrides CSV)
* --add-keyword="kw1,kw2" Add extra keywords to CSV/CLI list
* --city="CityName" Override city
* --date_posted=VALUE Override date posted (past-24h, past-week, past-month, or empty)
* --sort_by=VALUE Override sort by (date_posted or relevance)
* --location_filter=VALUE Override location filter
* --output=FILE Output file name
* --no-location Disable location filtering
* --no-ai Disable AI analysis
* --ai-after Run local AI analysis after scraping
* --help, -h Show this help message
*
* EXAMPLES:
* node linkedout.js # Standard scraping
* node linkedout.js --headless=false # Visual mode
* node linkedout.js --keyword="layoff,downsizing" # Only these keywords
* node linkedout.js --add-keyword="hiring freeze" # Add extra keyword(s)
* node linkedout.js --city="Vancouver" --date_posted=past-month
* node linkedout.js --output=results/myfile.json
* node linkedout.js --no-location --no-ai # Fastest, no filters
* node linkedout.js --ai-after # Run AI after scraping
*
* POST-PROCESSING AI ANALYSIS:
* node ai-analyzer-local.js --context="job layoffs" # Run on latest results file
* node ai-analyzer-local.js --input=results/results-2024-01-15.json --context="hiring"
*
* ENVIRONMENT VARIABLES (.env file):
* KEYWORDS=keywords-layoff.csv (filename only, always looks in keywords/ folder unless path is given)
* See README for full list.
*
* OUTPUT:
* - Saves to results/results-YYYY-MM-DD-HH-MM.json (or as specified by --output)
* - Enhanced format with optional location validation and local AI analysis
*
* KEYWORD FILES:
* - Place all keyword CSVs in the keywords/ folder
* - keywords-layoff.csv: 33+ layoff-related terms
* - keywords-open-work.csv: Terms for finding people open to work
* - Custom CSV format: header "keyword" with one keyword per line
*
* DEPENDENCIES:
* - playwright: Browser automation
* - dotenv: Environment variable management
* - csv-parser: CSV file parsing
* - Node.js built-ins: fs, path, child_process
*
* SECURITY & LEGAL:
* - Store credentials securely in .env file
* - Respect LinkedIn's Terms of Service
* - Use responsibly for educational/research purposes
* - Consider rate limiting and LinkedIn API for production use
*/
//process.env.PLAYWRIGHT_BROWSERS_PATH = "0";
// Suppress D-Bus notification errors in WSL
process.env.NO_AT_BRIDGE = "1";
process.env.DBUS_SESSION_BUS_ADDRESS = "/dev/null";
const { chromium } = require("playwright");
const fs = require("fs");
const path = require("path");
require("dotenv").config();
const csv = require("csv-parser");
const { spawn } = require("child_process");
// Core configuration
const DATE_POSTED = process.env.DATE_POSTED || "past-week";
const SORT_BY = process.env.SORT_BY || "date_posted";
const WHEELS = parseInt(process.env.WHEELS) || 5;
const CITY = process.env.CITY || "Toronto";
// Location filtering configuration
const LOCATION_FILTER = process.env.LOCATION_FILTER || "";
const ENABLE_LOCATION_CHECK = process.env.ENABLE_LOCATION_CHECK === "true";
// Local AI analysis configuration
const ENABLE_LOCAL_AI = process.env.ENABLE_LOCAL_AI === "true";
const RUN_LOCAL_AI_AFTER_SCRAPING =
process.env.RUN_LOCAL_AI_AFTER_SCRAPING === "true";
const AI_CONTEXT =
process.env.AI_CONTEXT || "job layoffs and workforce reduction";
// Import enhanced location utilities
const {
parseLocationFilters,
validateLocationAgainstFilters,
extractLocationFromProfile,
} = require("./location-utils");
// Read credentials
const LINKEDIN_USERNAME = process.env.LINKEDIN_USERNAME;
const LINKEDIN_PASSWORD = process.env.LINKEDIN_PASSWORD;
let HEADLESS = process.env.HEADLESS === "true";
// Parse command-line arguments
const args = process.argv.slice(2);
let cliKeywords = null; // If set, only use these
let additionalKeywords = [];
let disableLocation = false;
let disableAI = false;
let runAIAfter = RUN_LOCAL_AI_AFTER_SCRAPING;
let cliCity = null;
let cliDatePosted = null;
let cliSortBy = null;
let cliLocationFilter = null;
let cliOutput = null;
let showHelp = false;
for (const arg of args) {
if (arg.startsWith("--headless=")) {
const val = arg.split("=")[1].toLowerCase();
HEADLESS = val === "true";
}
if (arg.startsWith("--keyword=")) {
cliKeywords = arg
.split("=")[1]
.split(",")
.map((k) => k.trim())
.filter(Boolean);
}
if (arg.startsWith("--add-keyword=")) {
additionalKeywords = additionalKeywords.concat(
arg
.split("=")[1]
.split(",")
.map((k) => k.trim())
.filter(Boolean)
);
}
if (arg === "--no-location") {
disableLocation = true;
}
if (arg === "--no-ai") {
disableAI = true;
}
if (arg === "--ai-after") {
runAIAfter = true;
}
if (arg.startsWith("--city=")) {
cliCity = arg.split("=")[1];
}
if (arg.startsWith("--date_posted=")) {
cliDatePosted = arg.split("=")[1];
}
if (arg.startsWith("--sort_by=")) {
cliSortBy = arg.split("=")[1];
}
if (arg.startsWith("--location_filter=")) {
cliLocationFilter = arg.split("=")[1];
}
if (arg.startsWith("--output=")) {
cliOutput = arg.split("=")[1];
}
if (arg === "--help" || arg === "-h") {
showHelp = true;
}
}
if (showHelp) {
console.log(
`\nLinkedOut - LinkedIn Posts Scraper\n\nUsage: node linkedout.js [options]\n\nOptions:\n --headless=true|false Override browser headless mode\n --keyword="kw1,kw2" Use only these keywords (comma-separated, overrides CSV)\n --add-keyword="kw1,kw2" Add extra keywords to CSV list\n --city="CityName" Override city\n --date_posted=VALUE Override date posted (past-24h, past-week, past-month or '')\n --sort_by=VALUE Override sort by (date_posted or relevance)\n --location_filter=VALUE Override location filter\n --output=FILE Output file name\n --no-location Disable location filtering\n --no-ai Disable AI analysis\n --ai-after Run local AI analysis after scraping\n --help, -h Show this help message\n\nExamples:\n node linkedout.js --keyword="layoff,downsizing"\n node linkedout.js --add-keyword="hiring freeze"\n node linkedout.js --city="Vancouver" --date_posted=past-month\n node linkedout.js --output=results/myfile.json\n`
);
process.exit(0);
}
// Use CLI overrides if provided
const EFFECTIVE_CITY = cliCity || CITY;
const EFFECTIVE_DATE_POSTED = cliDatePosted || DATE_POSTED;
const EFFECTIVE_SORT_BY = cliSortBy || SORT_BY;
const EFFECTIVE_LOCATION_FILTER = cliLocationFilter || LOCATION_FILTER;
// Read keywords from CSV or CLI
const keywords = [];
let keywordEnv = process.env.KEYWORDS || "keywords-layoff.csv";
let csvPath = path.join(
process.cwd(),
keywordEnv.includes("/") ? keywordEnv : `keywords/${keywordEnv}`
);
function loadKeywordsAndStart() {
if (cliKeywords) {
// Only use CLI keywords
cliKeywords.forEach((k) => keywords.push(k));
if (additionalKeywords.length > 0) {
additionalKeywords.forEach((k) => keywords.push(k));
}
startScraper();
} else {
// Load from CSV, then add any additional keywords
fs.createReadStream(csvPath)
.pipe(csv())
.on("data", (row) => {
if (row.keyword) keywords.push(row.keyword.trim());
})
.on("end", () => {
if (keywords.length === 0) {
console.error("No keywords found in csv");
process.exit(1);
}
if (additionalKeywords.length > 0) {
additionalKeywords.forEach((k) => keywords.push(k));
console.log(
`Added additional keywords: ${additionalKeywords.join(", ")}`
);
}
startScraper();
});
}
}
if (!LINKEDIN_USERNAME || !LINKEDIN_PASSWORD) {
throw new Error("Missing LinkedIn credentials in .env file.");
}
function cleanText(text) {
text = text.replace(/#\w+/g, "");
text = text.replace(/\bhashtag\b/gi, "");
text = text.replace(/hashtag-\w+/gi, "");
text = text.replace(/https?:\/\/[^\s]+/g, "");
text = text.replace(
/[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F1E0}-\u{1F1FF}]/gu,
""
);
text = text.replace(/\s+/g, " ").trim();
return text;
}
function buildSearchUrl(keyword, city) {
let url = `https://www.linkedin.com/search/results/content/?keywords=${encodeURIComponent(
keyword + " " + city
)}`;
if (EFFECTIVE_DATE_POSTED)
url += `&datePosted=${encodeURIComponent(`"${EFFECTIVE_DATE_POSTED}"`)}`;
if (EFFECTIVE_SORT_BY)
url += `&sortBy=${encodeURIComponent(`"${EFFECTIVE_SORT_BY}"`)}`;
url += `&origin=FACETED_SEARCH`;
return url;
}
function containsAnyKeyword(text, keywords) {
return keywords.some((k) => text.toLowerCase().includes(k.toLowerCase()));
}
/**
* Enhanced profile location validation with smart waiting (no timeouts)
* Uses a new tab to avoid disrupting the main scraping flow
*/
async function validateProfileLocation(
context,
profileLink,
locationFilterString
) {
if (!locationFilterString || !ENABLE_LOCATION_CHECK || disableLocation) {
return {
isValid: true,
location: "Not checked",
matchedFilter: null,
reasoning: "Location check disabled",
error: null,
};
}
let profilePage = null;
try {
console.log(`🌍 Checking profile location: ${profileLink}`);
// Create a new page/tab for profile validation
profilePage = await context.newPage();
await profilePage.goto(profileLink, {
waitUntil: "domcontentloaded",
timeout: 10000,
});
// Always use smart waiting for key profile elements
await Promise.race([
profilePage.waitForSelector("h1", { timeout: 3000 }),
profilePage.waitForSelector("[data-field='experience_section']", {
timeout: 3000,
}),
profilePage.waitForSelector(".pv-text-details__left-panel", {
timeout: 3000,
}),
]);
// Use enhanced location extraction
const location = await extractLocationFromProfile(profilePage);
if (!location) {
return {
isValid: false,
location: "Location not found",
matchedFilter: null,
reasoning: "Could not extract location from profile",
error: "Location extraction failed",
};
}
// Parse location filters
const locationFilters = parseLocationFilters(locationFilterString);
// Validate against filters
const validationResult = validateLocationAgainstFilters(
location,
locationFilters
);
return {
isValid: validationResult.isValid,
location,
matchedFilter: validationResult.matchedFilter,
reasoning: validationResult.reasoning,
error: validationResult.isValid ? null : validationResult.reasoning,
};
} catch (error) {
console.error(`❌ Error checking profile location: ${error.message}`);
return {
isValid: false,
location: "Error checking location",
matchedFilter: null,
reasoning: `Error: ${error.message}`,
error: error.message,
};
} finally {
// Always close the profile page to clean up
if (profilePage) {
try {
await profilePage.close();
} catch (closeError) {
console.error(`⚠️ Error closing profile page: ${closeError.message}`);
}
}
}
}
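The real `parseLocationFilters` / `validateLocationAgainstFilters` live in a separate module and handle 200+ Canadian cities; the hypothetical mini-version below only illustrates the comma-separated filter contract and the `{ isValid, matchedFilter, reasoning }` result shape the function above consumes:

```javascript
// Hypothetical simplified stand-ins — NOT the real implementations,
// just the interface contract: comma-separated filters in, match result out.
function parseFilters(filterString) {
  return filterString
    .split(",")
    .map((f) => f.trim().toLowerCase())
    .filter(Boolean);
}

function matchLocation(location, filters) {
  const loc = location.toLowerCase();
  const matched = filters.find((f) => loc.includes(f));
  return {
    isValid: Boolean(matched),
    matchedFilter: matched || null,
    reasoning: matched ? `Location contains "${matched}"` : "No filter matched",
  };
}

console.log(matchLocation("Toronto, Ontario, Canada", parseFilters("toronto, vancouver")));
// { isValid: true, matchedFilter: 'toronto', reasoning: 'Location contains "toronto"' }
```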
/**
* Run local AI analysis after scraping is complete
*/
async function runPostScrapingLocalAI(resultsFile) {
if (disableAI || !ENABLE_LOCAL_AI || !runAIAfter) {
return;
}
console.log("\n🧠 Starting post-scraping local AI analysis...");
const analyzerScript = "ai-analyzer-local.js";
const args = [`--input=${resultsFile}`, `--context=${AI_CONTEXT}`];
console.log(`🚀 Running: node ${analyzerScript} ${args.join(" ")}`);
return new Promise((resolve, reject) => {
const child = spawn("node", [analyzerScript, ...args], {
stdio: "inherit",
cwd: process.cwd(),
});
child.on("close", (code) => {
if (code === 0) {
console.log("✅ Local AI analysis completed successfully");
resolve();
} else {
console.error(`❌ Local AI analysis failed with code ${code}`);
reject(new Error(`Local AI analysis process exited with code ${code}`));
}
});
child.on("error", (error) => {
console.error(`❌ Failed to run local AI analysis: ${error.message}`);
reject(error);
});
});
}
async function startScraper() {
console.log("\n🚀 LinkedOut Scraper Starting...");
console.log(`📊 Keywords: ${keywords.length}`);
console.log(
`🌍 Location Filter: ${
ENABLE_LOCATION_CHECK && !disableLocation
? LOCATION_FILTER || "None"
: "Disabled"
}`
);
console.log(
`🧠 Local AI Analysis: ${
ENABLE_LOCAL_AI && !disableAI
? runAIAfter
? "After scraping"
: "Manual"
: "Disabled"
}`
);
const browser = await chromium.launch({
headless: HEADLESS,
args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
const context = await browser.newContext();
const page = await Promise.race([
context.newPage(),
new Promise((_, reject) =>
setTimeout(() => reject(new Error("newPage timeout")), 10000)
),
]).catch((err) => {
console.error("Failed to create new page:", err);
process.exit(1);
});
let scrapeError = null;
try {
await page.goto("https://www.linkedin.com/login");
await page.fill('input[name="session_key"]', LINKEDIN_USERNAME);
await page.fill('input[name="session_password"]', LINKEDIN_PASSWORD);
await page.click('button[type="submit"]');
await page.waitForSelector("img.global-nav__me-photo", {
timeout: 15000,
});
const seenPosts = new Set();
const seenProfiles = new Set();
const results = [];
const rejectedResults = [];
for (const keyword of keywords) {
const searchUrl = buildSearchUrl(keyword, EFFECTIVE_CITY);
await page.goto(searchUrl, { waitUntil: "load" });
try {
await page.waitForSelector(".feed-shared-update-v2", {
timeout: 3000,
});
} catch (error) {
console.log(
`---\nNo posts found for keyword: ${keyword}\nCity: ${EFFECTIVE_CITY}\nDate posted: ${EFFECTIVE_DATE_POSTED}\nSort by: ${EFFECTIVE_SORT_BY}`
);
continue;
}
for (let i = 0; i < WHEELS; i++) {
await page.mouse.wheel(0, 1000);
await page.waitForTimeout(1000);
}
const postContainers = await page.$$(".feed-shared-update-v2");
for (const container of postContainers) {
let text = "";
const textHandle = await container.$(
"div.update-components-text, span.break-words"
);
if (textHandle) {
text = (await textHandle.textContent()) || "";
text = cleanText(text);
}
if (
!text ||
seenPosts.has(text) ||
text.length < 30 ||
!/[a-zA-Z0-9]/.test(text)
) {
rejectedResults.push({
rejected: true,
reason: !text
? "No text"
: seenPosts.has(text)
? "Duplicate post"
: text.length < 30
? "Text too short"
: "No alphanumeric content",
keyword,
text,
profileLink: null,
timestamp: new Date().toISOString(),
});
continue;
}
seenPosts.add(text);
let profileLink = "";
const profileLinkElement = await container.$('a[href*="/in/"]');
if (profileLinkElement) {
profileLink = await profileLinkElement.getAttribute("href");
if (profileLink && !profileLink.startsWith("http")) {
profileLink = `https://www.linkedin.com${profileLink}`;
}
profileLink = profileLink.split("?")[0];
}
if (!profileLink || seenProfiles.has(profileLink)) {
rejectedResults.push({
rejected: true,
reason: !profileLink ? "No profile link" : "Duplicate profile",
keyword,
text,
profileLink,
timestamp: new Date().toISOString(),
});
continue;
}
seenProfiles.add(profileLink);
// Double-check keyword presence
if (!containsAnyKeyword(text, keywords)) {
rejectedResults.push({
rejected: true,
reason: "Keyword not present",
keyword,
text,
profileLink,
timestamp: new Date().toISOString(),
});
continue;
}
console.log("---");
console.log("Keyword:", keyword);
console.log("Post:", text.substring(0, 100) + "...");
console.log("Profile:", profileLink);
// Enhanced location validation
const locationCheck = await validateProfileLocation(
context,
profileLink,
EFFECTIVE_LOCATION_FILTER
);
console.log("📍 Location:", locationCheck.location);
console.log("🎯 Match:", locationCheck.reasoning);
if (!locationCheck.isValid) {
rejectedResults.push({
rejected: true,
reason: `Location filter failed: ${locationCheck.error}`,
keyword,
text,
profileLink,
location: locationCheck.location,
locationReasoning: locationCheck.reasoning,
timestamp: new Date().toISOString(),
});
console.log(
"❌ Skipping - Location filter failed:",
locationCheck.error
);
continue;
}
console.log("✅ Post passed all filters");
results.push({
keyword,
text,
profileLink,
location: locationCheck.location,
locationValid: locationCheck.isValid,
locationMatchedFilter: locationCheck.matchedFilter,
locationReasoning: locationCheck.reasoning,
timestamp: new Date().toLocaleString("en-CA", {
year: "numeric",
month: "2-digit",
day: "2-digit",
hour: "2-digit",
minute: "2-digit",
second: "2-digit",
hour12: false,
}),
aiProcessed: false,
});
}
}
const now = new Date();
const timestamp =
cliOutput ||
`${now.getFullYear()}-${String(now.getMonth() + 1).padStart(
2,
"0"
)}-${String(now.getDate()).padStart(2, "0")}-${String(
now.getHours()
).padStart(2, "0")}-${String(now.getMinutes()).padStart(2, "0")}`;
const resultsDir = "results";
const resultsFile = `${resultsDir}/results-${timestamp}.json`;
const rejectedFile = `${resultsDir}/results-${timestamp}-rejected.json`;
if (!fs.existsSync(resultsDir)) {
fs.mkdirSync(resultsDir);
}
fs.writeFileSync(resultsFile, JSON.stringify(results, null, 2), "utf-8");
fs.writeFileSync(
rejectedFile,
JSON.stringify(rejectedResults, null, 2),
"utf-8"
);
console.log(`\n🎉 Scraping Complete!`);
console.log(`📊 Saved ${results.length} posts to ${resultsFile}`);
console.log(
`📋 Saved ${rejectedResults.length} rejected posts to ${rejectedFile}`
);
// Run local AI analysis if requested
if (runAIAfter && results.length > 0 && !scrapeError) {
try {
await runPostScrapingLocalAI(resultsFile);
} catch (error) {
console.error(
"⚠️ Local AI analysis failed, but scraping completed successfully"
);
}
}
console.log(`\n💡 Next steps:`);
console.log(` 📋 Review results in ${resultsFile}`);
if (!runAIAfter && !disableAI) {
console.log(` 🧠 Local AI Analysis:`);
console.log(` node ai-analyzer-local.js --context="${AI_CONTEXT}"`);
console.log(
` node ai-analyzer-local.js --input=${resultsFile} --context="your context"`
);
}
} catch (err) {
scrapeError = err;
console.error("Error:", err);
} finally {
await browser.close();
}
}
loadKeywordsAndStart();

package.json

@ -1,22 +1,54 @@
{
"name": "linkedin-scraper",
"name": "job-market-intelligence",
"version": "1.0.0",
"description": "",
"main": "index.js",
"description": "Job Market Intelligence Platform - Modular parsers for comprehensive job market insights with built-in AI analysis",
"main": "linkedin-parser/index.js",
"scripts": {
"test": "node test/all-tests.js",
"test:location-utils": "node test/location-utils.test.js",
"test:linkedout": "node test/linkedout.test.js",
"test:ai-analyzer": "node test/ai-analyzer.test.js",
"demo": "node demo.js"
"demo": "node demo.js",
"demo:ai-analyzer": "node ai-analyzer/demo.js",
"demo:linkedin-parser": "node linkedin-parser/demo.js",
"demo:job-search-parser": "node job-search-parser/demo.js",
"demo:all": "npm run demo && npm run demo:ai-analyzer && npm run demo:linkedin-parser && npm run demo:job-search-parser",
"start": "node linkedin-parser/index.js",
"start:linkedin": "node linkedin-parser/index.js",
"start:jobs": "node job-search-parser/index.js",
"start:linkedin-no-ai": "node linkedin-parser/index.js --no-ai",
"install:playwright": "npx playwright install chromium"
},
"keywords": [],
"author": "",
"keywords": [
"job-market",
"intelligence",
"linkedin",
"scraper",
"ai-analysis",
"data-intelligence",
"market-research",
"automation",
"playwright",
"ollama",
"openai"
],
"author": "Job Market Intelligence Team",
"license": "ISC",
"type": "commonjs",
"dependencies": {
"ai-analyzer": "file:./ai-analyzer",
"core-parser": "file:./core-parser",
"csv-parser": "^3.2.0",
"dotenv": "^17.0.0",
"playwright": "^1.53.2"
}
"dotenv": "^17.0.0"
},
"engines": {
"node": ">=18.0.0"
},
"repository": {
"type": "git",
"url": "https://github.com/your-username/job-market-intelligence.git"
},
"bugs": {
"url": "https://github.com/your-username/job-market-intelligence/issues"
},
"homepage": "https://github.com/your-username/job-market-intelligence#readme"
}

sample-data.json


@ -0,0 +1,34 @@
{
"results": [
{
"text": "Just got laid off from my software engineering role. Looking for new opportunities in the Toronto area.",
"location": "Toronto, Ontario, Canada",
"keyword": "layoff",
"timestamp": "2024-01-15T10:30:00Z"
},
{
"text": "Excited to share that I'm starting a new position as a Senior Developer at TechCorp!",
"location": "Vancouver, BC, Canada",
"keyword": "hiring",
"timestamp": "2024-01-15T11:00:00Z"
},
{
"text": "Our company is going through a restructuring and unfortunately had to let go of 50 employees.",
"location": "Montreal, Quebec, Canada",
"keyword": "layoff",
"timestamp": "2024-01-15T11:30:00Z"
},
{
"text": "Beautiful weather today! Perfect for a walk in the park.",
"location": "Calgary, Alberta, Canada",
"keyword": "weather",
"timestamp": "2024-01-15T12:00:00Z"
},
{
"text": "We're hiring! Looking for talented developers to join our growing team.",
"location": "Ottawa, Ontario, Canada",
"keyword": "hiring",
"timestamp": "2024-01-15T12:30:00Z"
}
]
}
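Downstream consumers read the `results` array from files of this shape. A small sketch of how those posts might be grouped by keyword before analysis, using a trimmed inline copy of the sample entries above (the grouping helper is illustrative, not part of the codebase):

```javascript
// Inline subset of sample-data.json's results, trimmed for brevity.
const results = [
  { text: "Just got laid off from my software engineering role.", keyword: "layoff" },
  { text: "We're hiring! Looking for talented developers.", keyword: "hiring" },
  { text: "Our company is going through a restructuring.", keyword: "layoff" },
];

// Group post texts under their matched keyword.
const byKeyword = results.reduce((acc, post) => {
  (acc[post.keyword] ||= []).push(post.text);
  return acc;
}, {});

console.log(Object.keys(byKeyword));  // [ 'layoff', 'hiring' ]
console.log(byKeyword.layoff.length); // 2
```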

test/ai-analyzer.test.js

@ -1,6 +1,6 @@
const fs = require("fs");
const assert = require("assert");
const { analyzeSinglePost } = require("../ai-analyzer-local");
const { analyzeSinglePost, checkOllamaStatus } = require("../ai-analyzer");
console.log("AI Analyzer logic tests");
@ -12,20 +12,69 @@ const context = "job layoffs and workforce reduction";
const model = "mistral"; // or your default model
(async () => {
// Check if Ollama is available
const ollamaAvailable = await checkOllamaStatus(model);
if (!ollamaAvailable) {
console.log("SKIP: Ollama not available - skipping AI analyzer tests");
console.log("PASS: AI analyzer tests skipped (Ollama not running)");
return;
}
console.log(`Testing AI analyzer with ${aiResults.length} posts...`);
for (let i = 0; i < aiResults.length; i++) {
const post = aiResults[i];
console.log(`Testing post ${i + 1}: "${post.text.substring(0, 50)}..."`);
const aiOutput = await analyzeSinglePost(post.text, context, model);
assert.strictEqual(
aiOutput.isRelevant,
post.aiRelevant,
`Post ${i} relevance mismatch: expected ${post.aiRelevant}, got ${aiOutput.isRelevant}`
);
// Test that the function returns the expected structure
assert(
Math.abs(aiOutput.confidence - post.aiConfidence) < 0.05,
`Post ${i} confidence mismatch: expected ${post.aiConfidence}, got ${aiOutput.confidence}`
typeof aiOutput === "object" && aiOutput !== null,
`Post ${i} output is not an object`
);
assert(
typeof aiOutput.isRelevant === "boolean",
`Post ${i} isRelevant is not a boolean: ${typeof aiOutput.isRelevant}`
);
assert(
typeof aiOutput.confidence === "number",
`Post ${i} confidence is not a number: ${typeof aiOutput.confidence}`
);
assert(
typeof aiOutput.reasoning === "string",
`Post ${i} reasoning is not a string: ${typeof aiOutput.reasoning}`
);
// Test that confidence is within valid range
assert(
aiOutput.confidence >= 0 && aiOutput.confidence <= 1,
`Post ${i} confidence out of range: ${aiOutput.confidence} (should be 0-1)`
);
// Test that reasoning exists and is not empty
assert(
aiOutput.reasoning && aiOutput.reasoning.length > 0,
`Post ${i} missing or empty reasoning`
);
// Test that relevance is a boolean value
assert(
aiOutput.isRelevant === true || aiOutput.isRelevant === false,
`Post ${i} isRelevant is not a valid boolean: ${aiOutput.isRelevant}`
);
console.log(
` ✓ Post ${i + 1}: relevant=${aiOutput.isRelevant}, confidence=${
aiOutput.confidence
}`
);
}
console.log(
"PASS: AI analyzer matches expected relevance and confidence for all test posts."
"PASS: AI analyzer returns valid structure and values for all test posts."
);
})();

test/linkedout.test.js

@ -1,29 +0,0 @@
const fs = require("fs");
const assert = require("assert");
console.log("LinkedOut main logic tests");
const testData = JSON.parse(
fs.readFileSync(__dirname + "/test-data.json", "utf-8")
);
const results = testData.positive;
const rejected = testData.negative;
// Positive: All results should have aiProcessed === false or true, and a keyword
results.forEach((post, i) => {
assert(post.keyword, `Result ${i} missing keyword`);
assert(post.text && post.text.length > 0, `Result ${i} missing text`);
// Only check that profileLink is non-empty
assert(
post.profileLink && post.profileLink.length > 0,
`Result ${i} missing or empty profileLink`
);
});
console.log("PASS: All positive results have required fields.");
// Negative: Rejected results should have 'rejected: true' and a reason
rejected.forEach((rej, i) => {
assert(rej.rejected === true, `Rejected ${i} missing rejected:true`);
assert(rej.reason && rej.reason.length > 0, `Rejected ${i} missing reason`);
});
console.log("PASS: All rejected results have rejected:true and a reason.");

test/location-utils.test.js

@ -2,7 +2,7 @@ const assert = require("assert");
const {
parseLocationFilters,
validateLocationAgainstFilters,
} = require("../location-utils");
} = require("../ai-analyzer");
console.log("Location Utils tests");


@ -1,19 +0,0 @@
console.log("START!");
const { chromium } = require("playwright");
(async () => {
console.log("browser!");
const browser = await chromium.launch({
headless: true,
args: ["--no-sandbox", "--disable-setuid-sandbox"],
});
console.log("new page!");
const page = await browser.newPage();
console.log("GOTO!");
await page.goto("https://example.com");
console.log("Success!");
await browser.close();
})();