---
id: jobspy
title: JobSpy Extractor
description: How the JobSpy Python wrapper is orchestrated and normalized.
sidebar_position: 3
---

A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.

## Big picture

JobSpy runs as a Python script once per search term and writes its results to a JSON file. The orchestrator then ingests that file and normalizes each row into the internal job shape.

## 1) Inputs and defaults

Key environment variables:

- `JOBSPY_SITES` (default: `indeed,linkedin`)
- `JOBSPY_SEARCH_TERM` (default: `web developer`)
- `JOBSPY_LOCATION` (default: `UK`)
- `JOBSPY_RESULTS_WANTED` (default: `200`)
- `JOBSPY_HOURS_OLD` (default: `72`)
- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)

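As a rough illustration, the defaults above could be resolved along these lines. This is a minimal sketch, not the orchestrator's actual code: `readJobSpyConfig`, the `JobSpyConfig` shape, and the rule that `0`, `false`, `no`, or `off` disable a flag are assumptions, and the environment is passed in as a plain object so the example stays self-contained.

```typescript
// Hypothetical config shape; field names are illustrative, not taken from the service.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
}

type Env = Record<string, string | undefined>;

// Return the trimmed value, or the fallback when the variable is unset or blank.
function str(env: Env, key: string, fallback: string): string {
  const value = env[key]?.trim();
  return value ? value : fallback;
}

// Parse a positive integer, falling back when the value is missing or invalid.
function int(env: Env, key: string, fallback: number): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumption: "0", "false", "no", and "off" disable a flag; any other value enables it.
function bool(env: Env, key: string, fallback: boolean): boolean {
  const value = env[key]?.trim().toLowerCase();
  if (!value) return fallback;
  return !["0", "false", "no", "off"].includes(value);
}

function readJobSpyConfig(env: Env): JobSpyConfig {
  return {
    // Comma-separated site list, with blanks removed.
    sites: str(env, "JOBSPY_SITES", "indeed,linkedin")
      .split(",")
      .map((site) => site.trim())
      .filter((site) => site.length > 0),
    searchTerm: str(env, "JOBSPY_SEARCH_TERM", "web developer"),
    location: str(env, "JOBSPY_LOCATION", "UK"),
    resultsWanted: int(env, "JOBSPY_RESULTS_WANTED", 200),
    hoursOld: int(env, "JOBSPY_HOURS_OLD", 72),
    countryIndeed: str(env, "JOBSPY_COUNTRY_INDEED", "UK"),
    linkedinFetchDescription: bool(env, "JOBSPY_LINKEDIN_FETCH_DESCRIPTION", true),
  };
}
```

In a Node process, `process.env` could be passed as the argument; unset or invalid values fall back to the documented defaults.
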
## 2) Orchestrator flow

The service in `orchestrator/src/server/services/jobspy.ts`:

- Builds the search-term list from the UI or env
- Runs Python once per term, each with a unique output file
- Reads the JSON and maps each row to `CreateJobInput`
- De-dupes by `jobUrl`
- Deletes temp output files on a best-effort basis

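The steps above can be sketched as a small loop. This is an outline under stated assumptions, not the code in `jobspy.ts`: `runTerm` is a hypothetical callback standing in for "run Python with a unique output file, read the JSON, map the rows, delete the temp file", the `Job` type models only the fields needed here, and keeping the first job per URL is an assumed tie-break.

```typescript
// Minimal stand-in for CreateJobInput; only jobUrl matters for de-duplication here.
interface Job {
  jobUrl: string;
  title: string;
}

// Keep the first job seen for each URL, preserving the original order.
// Assumption: "first one wins" when two terms return the same posting.
function dedupeByJobUrl(jobs: Job[]): Job[] {
  const seen = new Set<string>();
  const unique: Job[] = [];
  for (const job of jobs) {
    if (seen.has(job.jobUrl)) continue;
    seen.add(job.jobUrl);
    unique.push(job);
  }
  return unique;
}

// Hypothetical runner: executes one scrape for a term and returns its mapped rows.
type RunTerm = (term: string) => Promise<Job[]>;

// Run each term in turn, then de-dupe the combined results by jobUrl.
// Assumption: terms run sequentially; the real service may schedule them differently.
async function collectJobs(terms: string[], runTerm: RunTerm): Promise<Job[]> {
  const all: Job[] = [];
  for (const term of terms) {
    const jobs = await runTerm(term);
    all.push(...jobs);
  }
  return dedupeByJobUrl(all);
}
```

De-duplicating after all terms have run matters because overlapping search terms often return the same posting.
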
## 3) Mapping and cleanup

- Normalizes salary ranges
- Converts empty values to `null`
- Keeps metadata such as skills, ratings, and remote flags when available
- Skips rows with an invalid site or a missing URL

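A hedged sketch of what this cleanup might look like follows. The raw field names (`site`, `job_url`, `title`, `min_amount`, `max_amount`), the `MappedJob` shape, and the choice to reorder swapped salary bounds are illustrative assumptions; the actual `CreateJobInput` type and normalization rules live in the service.

```typescript
type RawRow = Record<string, unknown>;

// Hypothetical output shape; the real CreateJobInput has more fields.
interface MappedJob {
  source: string;
  jobUrl: string;
  title: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

// Sites named in this doc; anything else is treated as invalid.
const VALID_SITES = ["indeed", "linkedin", "glassdoor"];

// Treat undefined, null, blank strings, and NaN as "empty".
function emptyToNull(value: unknown): unknown {
  if (value === undefined || value === null) return null;
  if (typeof value === "string" && value.trim() === "") return null;
  if (typeof value === "number" && Number.isNaN(value)) return null;
  return value;
}

// Coerce to a finite number, or null when empty or unparseable.
function toNumber(value: unknown): number | null {
  const cleaned = emptyToNull(value);
  if (cleaned === null) return null;
  const parsed = Number(cleaned);
  return Number.isFinite(parsed) ? parsed : null;
}

// Coerce to trimmed text, or null when empty.
function toText(value: unknown): string | null {
  const cleaned = emptyToNull(value);
  return cleaned === null ? null : String(cleaned).trim();
}

// Map one raw row; returns null for rows that should be skipped.
function mapRow(row: RawRow): MappedJob | null {
  const site = toText(row["site"])?.toLowerCase() ?? null;
  const jobUrl = toText(row["job_url"]);
  if (site === null || !VALID_SITES.includes(site) || jobUrl === null) {
    return null;
  }

  let salaryMin = toNumber(row["min_amount"]);
  let salaryMax = toNumber(row["max_amount"]);
  // Assumption: a range with swapped bounds is reordered rather than dropped.
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    [salaryMin, salaryMax] = [salaryMax, salaryMin];
  }

  return { source: site, jobUrl, title: toText(row["title"]), salaryMin, salaryMax };
}
```

Returning `null` for skipped rows lets the caller filter them out in one pass before de-duplication.
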
## Notes

- `JOBSPY_SEARCH_TERMS` can be a JSON array or text delimited by `|`, commas, or newlines.
- Set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to speed up runs.
- Temp output files are stored under `data/imports/`.
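
The accepted `JOBSPY_SEARCH_TERMS` formats could be handled roughly as follows. `parseSearchTerms` and `tryJsonArray` are hypothetical names, and the order of checks (JSON first, then delimiters) is an assumption, not a description of the service's parser.

```typescript
// Try to read the text as a JSON array; return null when it is not one.
function tryJsonArray(text: string): string[] | null {
  if (!text.startsWith("[")) return null;
  try {
    const parsed: unknown = JSON.parse(text);
    return Array.isArray(parsed) ? parsed.map((item) => String(item)) : null;
  } catch {
    // Not valid JSON: the caller falls back to delimiter splitting.
    return null;
  }
}

// Parse a raw JOBSPY_SEARCH_TERMS value into a clean list of terms.
function parseSearchTerms(raw: string | undefined): string[] {
  const text = (raw ?? "").trim();
  if (text === "") return [];

  // Fall back to splitting on pipes, commas, and newlines.
  const parts = tryJsonArray(text) ?? text.split(/[|,\n]/);

  return parts.map((term) => term.trim()).filter((term) => term.length > 0);
}
```

For example, `parseSearchTerms('["web developer", "backend"]')` and `parseSearchTerms("web developer| backend")` both yield the two terms `web developer` and `backend`.
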