---
id: jobspy
title: JobSpy Extractor
description: How the JobSpy Python wrapper is orchestrated and normalized.
sidebar_position: 3
---
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Original websites:
- [indeed.com](https://www.indeed.com)
- [linkedin.com/jobs](https://www.linkedin.com/jobs)
- [glassdoor.com](https://www.glassdoor.com)
## Big picture
The orchestrator runs the JobSpy Python script once per search term. Each run writes its results to a JSON file, which the orchestrator then ingests and normalizes into the internal job shape.
## 1) Inputs and defaults
Key environment variables:
- `JOBSPY_SITES` (default: `indeed,linkedin`)
- `JOBSPY_SEARCH_TERM` (default: `web developer`)
- `JOBSPY_LOCATION` (default: `UK`)
- `JOBSPY_RESULTS_WANTED` (default: `200`)
- `JOBSPY_HOURS_OLD` (default: `72`)
- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)
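
As a rough illustration of how these defaults might be applied, the sketch below reads the variables from a plain environment record. The `JobSpyConfig` shape, the helper names, and the set of spellings treated as false are assumptions made for this example, not the service's actual code.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
}

type Env = Record<string, string | undefined>;

// Return the trimmed value, or the fallback when the variable is unset or blank.
function strOr(value: string | undefined, fallback: string): string {
  const trimmed = (value ?? "").trim();
  return trimmed === "" ? fallback : trimmed;
}

// Parse a positive integer, falling back on missing or invalid input.
function intOr(value: string | undefined, fallback: number): number {
  const parsed = parseInt(value ?? "", 10);
  return !isNaN(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumption: "0", "false", "no", and "off" (any case) mean false; anything else is true.
function boolOr(value: string | undefined, fallback: boolean): boolean {
  const text = (value ?? "").trim().toLowerCase();
  if (text === "") return fallback;
  return ["0", "false", "no", "off"].indexOf(text) === -1;
}

// Build the config from an env-like record, using the defaults listed above.
function readJobSpyConfig(env: Env): JobSpyConfig {
  return {
    sites: strOr(env["JOBSPY_SITES"], "indeed,linkedin")
      .split(",")
      .map((site) => site.trim().toLowerCase())
      .filter((site) => site.length > 0),
    searchTerm: strOr(env["JOBSPY_SEARCH_TERM"], "web developer"),
    location: strOr(env["JOBSPY_LOCATION"], "UK"),
    resultsWanted: intOr(env["JOBSPY_RESULTS_WANTED"], 200),
    hoursOld: intOr(env["JOBSPY_HOURS_OLD"], 72),
    countryIndeed: strOr(env["JOBSPY_COUNTRY_INDEED"], "UK"),
    linkedinFetchDescription: boolOr(env["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"], true),
  };
}
```

In a Node process this could be called as `readJobSpyConfig(process.env)`; taking the record as a parameter keeps the function easy to test.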
## 2) Orchestrator flow
The service in `orchestrator/src/server/services/jobspy.ts`:
- Builds the search-term list from the UI or from environment variables
- Runs Python once per term, each run with its own unique output file
- Reads the JSON output and maps each row to `CreateJobInput`
- De-dupes results by `jobUrl`
- Deletes temporary output files on a best-effort basis
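
The de-duplication step in the list above can be sketched as follows. This is a minimal example assuming that the first occurrence of each `jobUrl` wins; the `dedupeByJobUrl` name is hypothetical, and the generic constraint stands in for the real `CreateJobInput` type, of which only the `jobUrl` field is taken from this page.

```typescript
// Keep the first job seen for each jobUrl and drop later duplicates.
// Assumption: URLs are compared as exact strings, with no further normalization.
function dedupeByJobUrl<T extends { jobUrl: string }>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    if (seen.has(job.jobUrl)) {
      continue; // Already collected from an earlier term or site.
    }
    seen.add(job.jobUrl);
    unique.push(job);
  }
  return unique;
}
```

Because each search term runs separately, the same posting can appear in several output files, so de-duplicating after all files are merged is the simplest place to do it.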
## 3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata such as skills, ratings, and remote flags when available
- Skips rows with an invalid site or a missing URL
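
A minimal sketch of these rules, under several assumptions: the snake_case input keys (`site`, `job_url`, `title`, `min_amount`, `max_amount`) are assumed to mirror the JobSpy output columns, `MappedJob` is a hypothetical subset of the internal job shape, and "normalizing" a salary range is interpreted here as coercing both bounds to numbers and swapping them if they are inverted. The real service may normalize differently.

```typescript
// Hypothetical raw row as parsed from the JSON output.
type RawRow = Record<string, unknown>;

// Hypothetical subset of the internal job shape.
interface MappedJob {
  source: string;
  jobUrl: string;
  title: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

// Sites named on this page; treated as the valid set for this sketch.
const VALID_SITES = ["indeed", "linkedin", "glassdoor"];

// Convert blank or non-string values to null, otherwise return the trimmed string.
function emptyToNull(value: unknown): string | null {
  if (typeof value !== "string") return null;
  const trimmed = value.trim();
  return trimmed === "" ? null : trimmed;
}

// Coerce numbers and numeric strings to a finite number, otherwise null.
function toNumberOrNull(value: unknown): number | null {
  if (typeof value === "number") return isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const parsed = Number(value);
    return isFinite(parsed) ? parsed : null;
  }
  return null;
}

// Map one raw row, or return null when the row should be skipped.
function mapRow(row: RawRow): MappedJob | null {
  const site = emptyToNull(row["site"]);
  const jobUrl = emptyToNull(row["job_url"]);
  if (site === null || VALID_SITES.indexOf(site) === -1 || jobUrl === null) {
    return null; // Invalid site or missing URL.
  }
  let salaryMin = toNumberOrNull(row["min_amount"]);
  let salaryMax = toNumberOrNull(row["max_amount"]);
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    [salaryMin, salaryMax] = [salaryMax, salaryMin]; // Repair inverted bounds.
  }
  return { source: site, jobUrl, title: emptyToNull(row["title"]), salaryMin, salaryMax };
}
```

Returning `null` for skipped rows lets the caller filter them out in one pass before de-duplication.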
## Notes
- `JOBSPY_SEARCH_TERMS` accepts either a JSON array or text delimited by `|`, commas, or newlines.
- Set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to speed up runs.
- Temp output files are stored under `data/imports/`.
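
The two accepted formats for `JOBSPY_SEARCH_TERMS` could be handled along these lines. This is a sketch only: `parseSearchTerms` is a hypothetical name, and falling back to delimiter splitting when the JSON is malformed is an assumption rather than documented behavior.

```typescript
// Parse a search-terms value that is either a JSON array of strings
// or text delimited by "|", commas, or newlines.
function parseSearchTerms(raw: string | undefined): string[] {
  const text = (raw ?? "").trim();
  if (text === "") return [];

  // Try JSON first when the value looks like an array.
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      if (Array.isArray(parsed)) {
        return parsed
          .filter((term): term is string => typeof term === "string")
          .map((term) => term.trim())
          .filter((term) => term.length > 0);
      }
    } catch {
      // Not valid JSON; fall through to delimiter splitting (assumed behavior).
    }
  }

  // Split on any of the supported delimiters and drop blank entries.
  return text
    .split(/[|,\n]/)
    .map((term) => term.trim())
    .filter((term) => term.length > 0);
}
```

For example, `parseSearchTerms('["web developer", "frontend engineer"]')` and `parseSearchTerms("web developer|frontend engineer")` both yield the same two terms.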