64 lines
2.4 KiB
Markdown

---
id: jobspy
title: JobSpy Extractor
description: How the JobSpy Python wrapper is orchestrated and normalized.
sidebar_position: 3
---
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Original websites:
- [indeed.com](https://www.indeed.com)
- [linkedin.com/jobs](https://www.linkedin.com/jobs)
- [glassdoor.com](https://www.glassdoor.com)
## Big picture
JobSpy runs as a Python script per search term, writes JSON, then orchestrator ingests and normalizes into internal job shape.
## 1) Inputs and defaults
Key environment variables:
- `JOBSPY_SITES` (default: `indeed,linkedin`)
- `JOBSPY_SEARCH_TERM` (default: `web developer`)
- `JOBSPY_LOCATION` (default: `UK`)
- `JOBSPY_RESULTS_WANTED` (default: `200`)
- `JOBSPY_HOURS_OLD` (default: `72`)
- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)
- `JOBSPY_IS_REMOTE` (unset by default)
## 2) Orchestrator flow
The service in `orchestrator/src/server/services/jobspy.ts`:
- Builds search-term list from UI or env
- Runs Python once per term with unique output file
- Reads JSON and maps to `CreateJobInput`
- De-dupes by `jobUrl`
- Deletes temp output files best-effort
## 3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata like skills, ratings, remote flags when available
- Skips rows with invalid site or missing URL
## Notes
- `JOBSPY_SEARCH_TERMS` can be JSON array or `|`, comma, newline-delimited text.
- Set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to speed runs.
- Temp output files are stored under `data/imports/`.
- If workplace type is only `Remote`, JobSpy runs with `JOBSPY_IS_REMOTE=true`.
- If workplace type includes `Hybrid` or `Onsite`, JobSpy cannot enforce those filters precisely, so the JobSpy-backed sources run without a workplace-type filter and may return broader results.
## Common Problems
- `Hybrid` or `Onsite` was selected, but Indeed, LinkedIn, or Glassdoor still returned remote jobs.
JobSpy only supports a strict remote toggle. Any workplace-type selection that includes `Hybrid` or `Onsite` broadens those source results.
- A run returned fewer LinkedIn descriptions than expected.
`JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` disables description fetching to speed up runs.
- Different cities need different workplace-type filters.
This is not supported in the current automatic-run flow. JobSpy receives one global workplace-type selection per run/query invocation.