diff --git a/documentation/extractors/README.md b/documentation/extractors/README.md
index acc49d7..f65c101 100644
--- a/documentation/extractors/README.md
+++ b/documentation/extractors/README.md
@@ -3,3 +3,4 @@
 Technical breakdowns of how each extractor works.
 
 - Gradcracker: `gradcracker.md`
+- JobSpy: `jobspy.md`
diff --git a/documentation/extractors/jobspy.md b/documentation/extractors/jobspy.md
new file mode 100644
index 0000000..488a5dd
--- /dev/null
+++ b/documentation/extractors/jobspy.md
@@ -0,0 +1,43 @@
+# JobSpy Extractor (How It Works)
+
+This is a simple walkthrough of the JobSpy extractor used for Indeed and LinkedIn.
+
+## Big picture
+
+JobSpy is a Python library. We wrap it in a tiny Python script, run it once per search term, then ingest the JSON it writes into our database format.
+
+## 1) Inputs and defaults
+
+The Python wrapper (`extractors/jobspy/scrape_jobs.py`) reads environment variables and falls back to sensible defaults:
+
+- `JOBSPY_SITES` (default: `indeed,linkedin`)
+- `JOBSPY_SEARCH_TERM` (default: `web developer`)
+- `JOBSPY_LOCATION` (default: `UK`)
+- `JOBSPY_RESULTS_WANTED` (default: `200`)
+- `JOBSPY_HOURS_OLD` (default: `72`)
+- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
+- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)
+
+It writes output to both CSV and JSON files. The JSON is what we ingest.
+
+## 2) Orchestrator flow
+
+The Node service (`orchestrator/src/server/services/jobspy.ts`) controls the run:
+
+- Builds a list of search terms (from the UI, or `JOBSPY_SEARCH_TERMS` env).
+- Runs the Python script once per search term with a unique output filename.
+- Reads the JSON file, maps each row to our internal `CreateJobInput` shape.
+- De-dupes by `jobUrl` so the same listing only appears once.
+- Deletes the CSV/JSON files after ingesting (best effort).
+
+## 3) Mapping and cleanup
+
+The mapper normalizes fields like salary ranges, converts empty values to null, and keeps extra metadata (skills, company rating, remote flag, etc.) when available.
+
+If a row is missing a valid site (`indeed` or `linkedin`) or a job URL, it gets skipped.
+
+## Notes
+
+- If `JOBSPY_SEARCH_TERMS` is a JSON array, it will be parsed as-is. Otherwise it can be a `|`, comma, or newline-separated list.
+- LinkedIn descriptions are optional and can slow the crawl; set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to disable.
+- Output files are stored under `data/imports/` before being cleaned up.
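
Review note: the `JOBSPY_SEARCH_TERMS` parsing and the `jobUrl` de-duplication described in the new doc can be sketched roughly as below. This is an illustration only, not the code in this PR: the real logic lives in the Node service (`orchestrator/src/server/services/jobspy.ts`) and may differ. The helper names `parse_search_terms` and `dedupe_by_job_url` are hypothetical. Keeping the first occurrence of a duplicate, and dropping non-string JSON items, are assumptions the doc does not state.

```python
import json
import re


def parse_search_terms(raw: str) -> list[str]:
    """Hypothetical helper: split a JOBSPY_SEARCH_TERMS-style value into terms."""
    raw = raw.strip()
    if not raw:
        return []
    # A JSON array is taken as-is, so terms may contain commas or pipes.
    # Assumption: only non-empty string items are kept.
    if raw.startswith("["):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, list):
            return [t.strip() for t in parsed if isinstance(t, str) and t.strip()]
    # Otherwise split on "|", comma, or newline and drop empty pieces.
    return [part.strip() for part in re.split(r"[|,\n]", raw) if part.strip()]


def dedupe_by_job_url(jobs: list[dict]) -> list[dict]:
    """Hypothetical helper: keep one row per jobUrl, skipping rows without one."""
    seen: set[str] = set()
    unique: list[dict] = []
    for job in jobs:
        url = job.get("jobUrl")
        if not url or url in seen:
            continue  # missing URL or already ingested
        seen.add(url)
        unique.append(job)  # assumption: the first occurrence wins
    return unique
```

For example, `parse_search_terms("web developer|data engineer")` yields two terms, while a JSON value such as `["developer, senior"]` stays a single term because the array is not split further.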