This commit is contained in:
ilia 2025-07-03 21:43:45 -04:00
parent b62854909b
commit a04e9fb374
3 changed files with 806 additions and 806 deletions

README.md
# LinkedOut - LinkedIn Posts Scraper

A Node.js application that automates LinkedIn login and scrapes posts containing specific keywords. The tool is designed to help track job market trends, layoffs, and open work opportunities by monitoring LinkedIn content.

## Features

- **Automated LinkedIn Login**: Uses Playwright to automate browser interactions
- **Keyword-based Search**: Searches for posts containing keywords from CSV files or the CLI
- **Flexible Keyword Sources**: Supports multiple CSV files in `keywords/` or CLI-only mode
- **Configurable Search Parameters**: Customizable date ranges, sorting options, city, and scroll behavior
- **Duplicate Detection**: Prevents duplicate posts and profiles in results
- **Clean Text Processing**: Removes hashtags, emojis, and URLs from post content
- **Timestamped Results**: Saves results to JSON files with timestamps
- **Command-line Overrides**: Supports runtime parameter adjustments
- **Enhanced Geographic Location Validation**: Validates user locations against 200+ Canadian cities with smart matching
- **Local AI Analysis (Ollama)**: Free, private, and fast post-processing with local LLMs
- **Flexible Processing**: Disable features, run AI analysis immediately, or process results later
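The "Clean Text Processing" step could look roughly like the sketch below. The function name `cleanPostText` and the exact regexes are assumptions for illustration; the real implementation in `linkedout.js` may differ.

```javascript
// Hypothetical sketch of the text-cleaning feature:
// strips URLs, hashtags, and emoji, then collapses whitespace.
function cleanPostText(text) {
  return text
    .replace(/https?:\/\/\S+/g, "")             // drop URLs
    .replace(/#[\p{L}\p{N}_]+/gu, "")           // drop hashtags
    .replace(/\p{Extended_Pictographic}/gu, "") // drop emoji
    .replace(/\s+/g, " ")                       // collapse whitespace
    .trim();
}

console.log(cleanPostText("Big news 🚀 #layoffs at ACME https://example.com today"));
// → "Big news at ACME today"
```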
## Prerequisites

- Node.js (v14 or higher)
- Valid LinkedIn account credentials
- [Ollama](https://ollama.ai/) with a model (free, private, local AI)

## Installation

1. Clone the repository or download the files
2. Install dependencies:
   ```bash
   npm install
   ```
3. Copy the configuration template and customize it:
   ```bash
   cp env-config.example .env
   ```
4. Edit `.env` with your settings (see the Configuration section below)
## Configuration

### Environment Variables (.env file)

Create a `.env` file from `env-config.example`:

```env
# LinkedIn Credentials (Required)
LINKEDIN_USERNAME=your_email@example.com
LINKEDIN_PASSWORD=your_password

# Basic Settings
HEADLESS=true
KEYWORDS=keywords-layoff.csv  # Just the filename; always looks in keywords/ unless a path is given
DATE_POSTED=past-week
SORT_BY=date_posted
CITY=Toronto
WHEELS=5

# Enhanced Location Filtering
LOCATION_FILTER=Ontario,Manitoba
ENABLE_LOCATION_CHECK=true

# Local AI Analysis (Ollama)
ENABLE_LOCAL_AI=true
OLLAMA_MODEL=mistral
OLLAMA_HOST=http://localhost:11434
RUN_LOCAL_AI_AFTER_SCRAPING=false  # true = run after scraping, false = run manually
AI_CONTEXT=job layoffs and workforce reduction
AI_CONFIDENCE=0.7
AI_BATCH_SIZE=3
```
### Configuration Options

#### Required

- `LINKEDIN_USERNAME`: Your LinkedIn email/username
- `LINKEDIN_PASSWORD`: Your LinkedIn password

#### Basic Settings

- `HEADLESS`: Browser headless mode (`true`/`false`, default: `true`)
- `KEYWORDS`: CSV file name (default: `keywords-layoff.csv` in the `keywords/` folder)
- `DATE_POSTED`: Filter by date (`past-24h`, `past-week`, `past-month`, or empty)
- `SORT_BY`: Sort results (`relevance` or `date_posted`)
- `CITY`: Search location (default: `Toronto`)
- `WHEELS`: Number of scrolls used to load posts (default: `5`)

#### Enhanced Location Filtering

- `LOCATION_FILTER`: Geographic filter; supports multiple provinces/cities:
  - Single: `Ontario` or `Toronto`
  - Multiple: `Ontario,Manitoba` or `Toronto,Vancouver`
- `ENABLE_LOCATION_CHECK`: Enable location validation (`true`/`false`)
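A minimal sketch of how `LOCATION_FILTER` could be applied. The real matching in `location-utils.js` validates against 200+ Canadian cities with smart matching; this illustrative version (the name `matchesLocationFilter` is hypothetical) only does case-insensitive substring checks.

```javascript
// Hypothetical sketch: accept a user location if it mentions any
// of the comma-separated entries in LOCATION_FILTER.
function matchesLocationFilter(userLocation, filter) {
  if (!filter) return true; // no filter configured → accept everything
  const loc = userLocation.toLowerCase();
  return filter
    .split(",")
    .map((f) => f.trim().toLowerCase())
    .some((f) => f && loc.includes(f));
}

console.log(matchesLocationFilter("Toronto, Ontario, Canada", "Ontario,Manitoba")); // → true
console.log(matchesLocationFilter("Berlin, Germany", "Ontario,Manitoba"));          // → false
```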
#### Local AI Analysis (Ollama)

- `ENABLE_LOCAL_AI`: Enable local AI analysis (`true`/`false`)
- `OLLAMA_MODEL`: Model to use (`mistral`, `llama2`, `codellama`)
- `OLLAMA_HOST`: Ollama server URL (default: `http://localhost:11434`)
- `RUN_LOCAL_AI_AFTER_SCRAPING`: Run AI analysis immediately after scraping (`true`/`false`)
- `AI_CONTEXT`: Context for analysis (e.g., `job layoffs`)
- `AI_CONFIDENCE`: Minimum confidence threshold (`0.0`–`1.0`, default: `0.7`)
- `AI_BATCH_SIZE`: Posts per batch (default: `3`)
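The settings above might be read with defaults roughly like this. The function name `loadConfig` is hypothetical, and defaults not stated in this README (e.g. for `SORT_BY`) are assumptions; in the real tool, `dotenv` would populate `process.env` first.

```javascript
// Hypothetical sketch of reading the .env settings with README defaults.
function loadConfig(env) {
  return {
    headless: (env.HEADLESS ?? "true") === "true",
    keywordsFile: env.KEYWORDS ?? "keywords-layoff.csv",
    datePosted: env.DATE_POSTED ?? "past-week",   // assumed default
    sortBy: env.SORT_BY ?? "date_posted",         // assumed default
    city: env.CITY ?? "Toronto",
    wheels: parseInt(env.WHEELS ?? "5", 10),
    aiConfidence: parseFloat(env.AI_CONFIDENCE ?? "0.7"),
    aiBatchSize: parseInt(env.AI_BATCH_SIZE ?? "3", 10),
  };
}

console.log(loadConfig({ CITY: "Vancouver", WHEELS: "8" }).wheels); // → 8
```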
## Usage

### Basic Commands

```bash
# Standard scraping with configured settings
node linkedout.js

# Visual mode (see the browser)
node linkedout.js --headless=false

# Use only these keywords (ignore the CSV)
node linkedout.js --keyword="layoff,downsizing"

# Add extra keywords to the CSV/CLI list
node linkedout.js --add-keyword="hiring freeze,open to work"

# Override city and date
node linkedout.js --city="Vancouver" --date_posted=past-month

# Custom output file
node linkedout.js --output=results/myfile.json

# Skip location and AI filtering (fastest)
node linkedout.js --no-location --no-ai

# Run AI analysis immediately after scraping
node linkedout.js --ai-after

# Show help
node linkedout.js --help
```
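When `--output` is omitted, results go to a timestamped JSON file in `results/`. A minimal sketch of how such a default name could be built; the exact naming scheme and the helper name `defaultOutputFile` are assumptions, not the actual implementation.

```javascript
// Hypothetical sketch of the default timestamped output path.
function defaultOutputFile(now = new Date()) {
  // ISO timestamp with ":" and "." replaced so it is filename-safe.
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return `results/results-${stamp}.json`;
}

console.log(defaultOutputFile(new Date(Date.UTC(2024, 0, 15, 10, 30, 0))));
// → results/results-2024-01-15T10-30-00-000Z.json
```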
### All Command-line Options

- `--headless=true|false`: Override browser headless mode
- `--keyword="kw1,kw2"`: Use only these keywords (comma-separated, overrides the CSV)
- `--add-keyword="kw1,kw2"`: Add extra keywords to the CSV/CLI list
- `--city="CityName"`: Override the city
- `--date_posted=VALUE`: Override date posted (`past-24h`, `past-week`, `past-month`, or empty)
- `--sort_by=VALUE`: Override sorting (`date_posted` or `relevance`)
- `--location_filter=VALUE`: Override the location filter
- `--output=FILE`: Output file name
- `--no-location`: Disable location filtering
- `--no-ai`: Disable AI analysis
- `--ai-after`: Run local AI analysis after scraping
- `--help`, `-h`: Show the help message

### Keyword Files

- Place all keyword CSVs in the `keywords/` folder
- Examples: `keywords/keywords-layoff.csv`, `keywords/keywords-open-work.csv`
- Custom CSV format: a `keyword` header with one keyword per line
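To illustrate the expected CSV shape (a `keyword` header, then one keyword per line), here is a minimal hand-rolled parser. The project itself uses the `csv-parser` package; `parseKeywordCsv` is a hypothetical name used only for this sketch.

```javascript
// Hypothetical sketch of parsing a keyword CSV of the documented shape.
function parseKeywordCsv(csvText) {
  const [header, ...rows] = csvText.trim().split(/\r?\n/);
  if (header.trim() !== "keyword") {
    throw new Error('Expected a "keyword" header');
  }
  return rows.map((r) => r.trim()).filter(Boolean);
}

const sample = "keyword\nlayoff\ndownsizing\nhiring freeze\n";
console.log(parseKeywordCsv(sample)); // → [ 'layoff', 'downsizing', 'hiring freeze' ]
```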
### Local AI Analysis Commands

After scraping, you can run AI analysis on the results:

```bash
# Analyze the latest results
node ai-analyzer-local.js --context="job layoffs"

# Analyze a specific file
node ai-analyzer-local.js --input=results/results-2024-01-15.json --context="hiring"

# Use a different model
node ai-analyzer-local.js --model=llama2 --context="remote work"

# Change the confidence threshold and batch size
node ai-analyzer-local.js --context="job layoffs" --confidence=0.8 --batch-size=5
```
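How `--batch-size` and `--confidence` might interact can be sketched as two small pure functions: posts are grouped into batches before being sent to Ollama, and results below the threshold are dropped. The names `toBatches` and `keepConfident` are hypothetical; the actual logic lives in `ai-analyzer-local.js`.

```javascript
// Hypothetical sketch of batching (AI_BATCH_SIZE) and
// confidence filtering (AI_CONFIDENCE) around the Ollama calls.
function toBatches(posts, batchSize) {
  const batches = [];
  for (let i = 0; i < posts.length; i += batchSize) {
    batches.push(posts.slice(i, i + batchSize));
  }
  return batches;
}

function keepConfident(results, minConfidence) {
  return results.filter((r) => r.confidence >= minConfidence);
}

console.log(toBatches(["a", "b", "c", "d", "e"], 3)); // → [ [ 'a', 'b', 'c' ], [ 'd', 'e' ] ]
console.log(keepConfident([{ confidence: 0.9 }, { confidence: 0.4 }], 0.7).length); // → 1
```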
## Workflow Examples

### 1. Quick Start (All Features)

```bash
node linkedout.js --ai-after
```

### 2. Fast Scraping Only

```bash
node linkedout.js --no-location --no-ai
```

### 3. Location-Only Filtering

```bash
node linkedout.js --no-ai
```

### 4. Test Different AI Contexts

```bash
node linkedout.js --no-ai
node ai-analyzer-local.js --context="job layoffs"
node ai-analyzer-local.js --context="hiring opportunities"
node ai-analyzer-local.js --context="remote work"
```
## Project Structure

```
linkedout/
├── .env                   # Your configuration (create from template)
├── env-config.example     # Configuration template
├── linkedout.js           # Main scraper
├── ai-analyzer-local.js   # Free local AI analyzer (Ollama)
├── location-utils.js      # Enhanced location utilities
├── package.json           # Dependencies
├── keywords/              # All keyword CSVs go here
│   ├── keywords-layoff.csv
│   └── keywords-open-work.csv
├── results/               # Output directory
└── README.md              # This documentation
```
## Legal & Security

- **Credentials**: Store them securely in `.env` and add it to `.gitignore`
- **LinkedIn ToS**: Respect rate limits and usage guidelines
- **Privacy**: Local AI keeps all data on your machine
- **Usage**: Educational and research purposes only

## Dependencies

- `playwright`: Browser automation
- `dotenv`: Environment variables
- `csv-parser`: CSV file reading
- Built-in: `fs`, `path`, `child_process`

## Support

For issues:

1. Check this README
2. Verify your `.env` configuration
3. Test with `--headless=false` for debugging
4. Check Ollama status: `ollama list`

```js
console.log("START!");
const { chromium } = require("playwright");

(async () => {
  console.log("browser!");
  const browser = await chromium.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  console.log("new page!");
  const page = await browser.newPage();
  console.log("GOTO!");
  await page.goto("https://example.com");
  console.log("Success!");
  await browser.close();
})();
```