JobSpy Extractor (How It Works)
This is a simple walkthrough of the JobSpy extractor used for Indeed, LinkedIn, and Glassdoor.
Big picture
JobSpy is a Python library. We wrap it in a tiny Python script, run it once per search term, then ingest the JSON it writes into our database format.
1) Inputs and defaults
The Python wrapper (extractors/jobspy/scrape_jobs.py) reads environment variables and falls back to sensible defaults:
- JOBSPY_SITES (default: indeed,linkedin)
- JOBSPY_SEARCH_TERM (default: web developer)
- JOBSPY_LOCATION (default: UK)
- JOBSPY_RESULTS_WANTED (default: 200)
- JOBSPY_HOURS_OLD (default: 72)
- JOBSPY_COUNTRY_INDEED (default: UK)
- JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)
It writes output to both CSV and JSON files. The JSON is what we ingest.
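As a rough illustration of the "environment variable with a default" pattern described above, here is a minimal sketch. The variable names and defaults come from this document; the function names (load_config, as_bool), the returned dict layout, and the rule for which strings count as false are assumptions made for the example, not a copy of scrape_jobs.py.

```python
import os

# Defaults as listed in this document; the dict layout is illustrative.
DEFAULTS = {
    "JOBSPY_SITES": "indeed,linkedin",
    "JOBSPY_SEARCH_TERM": "web developer",
    "JOBSPY_LOCATION": "UK",
    "JOBSPY_RESULTS_WANTED": "200",
    "JOBSPY_HOURS_OLD": "72",
    "JOBSPY_COUNTRY_INDEED": "UK",
    "JOBSPY_LINKEDIN_FETCH_DESCRIPTION": "true",
}


def as_bool(value):
    # Assumption: "0", "false", "no" and "" mean off; anything else means on.
    return value.strip().lower() not in {"0", "false", "no", ""}


def load_config(env=None):
    """Read settings from env (os.environ by default), falling back to DEFAULTS."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "sites": [s.strip() for s in raw["JOBSPY_SITES"].split(",") if s.strip()],
        "search_term": raw["JOBSPY_SEARCH_TERM"],
        "location": raw["JOBSPY_LOCATION"],
        "results_wanted": int(raw["JOBSPY_RESULTS_WANTED"]),
        "hours_old": int(raw["JOBSPY_HOURS_OLD"]),
        "country_indeed": raw["JOBSPY_COUNTRY_INDEED"],
        "linkedin_fetch_description": as_bool(raw["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"]),
    }
```

Passing a plain dict instead of os.environ, for example load_config({"JOBSPY_LOCATION": "London"}), makes the fallback behavior easy to check without touching the real environment.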
2) Orchestrator flow
The Node service (orchestrator/src/server/services/jobspy.ts) controls the run:
- Builds a list of search terms (from the UI, or JOBSPY_SEARCH_TERMS env).
- Runs the Python script once per search term with a unique output filename.
- Reads the JSON file, maps each row to our internal CreateJobInput shape.
- De-dupes by jobUrl so the same listing only appears once.
- Deletes the CSV/JSON files after ingesting (best effort).
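The real orchestrator is a Node service, so the following Python is only a sketch of the same loop, written to show the order of the steps. The names run_terms and run_scraper, the jobspy_ filename prefix, and the assumption that rows already carry a jobUrl key (the mapping step is skipped here) are all hypothetical.

```python
import json
import os
import tempfile
import uuid


def run_terms(search_terms, run_scraper, out_dir):
    """Run the scraper once per term, ingest its JSON, de-dupe by jobUrl.

    run_scraper(term, json_path, csv_path) is a hypothetical stand-in for
    spawning the Python wrapper; it is expected to write both files.
    """
    seen = set()
    jobs = []
    for term in search_terms:
        # Unique output filename per run so terms never overwrite each other.
        stem = os.path.join(out_dir, f"jobspy_{uuid.uuid4().hex}")
        json_path, csv_path = stem + ".json", stem + ".csv"
        run_scraper(term, json_path, csv_path)
        try:
            with open(json_path, encoding="utf-8") as fh:
                rows = json.load(fh)
            for row in rows:
                url = row.get("jobUrl")
                if not url or url in seen:
                    continue  # skip rows without a URL and repeat listings
                seen.add(url)
                jobs.append(row)
        finally:
            # Best-effort cleanup: a failed delete must not fail the run.
            for path in (json_path, csv_path):
                try:
                    os.remove(path)
                except OSError:
                    pass
    return jobs


# Demo with a fake scraper that returns one overlapping listing per term.
def fake_scraper(term, json_path, csv_path):
    rows = [
        {"jobUrl": "https://example.com/shared"},
        {"jobUrl": f"https://example.com/{term}"},
    ]
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh)
    open(csv_path, "w").close()


with tempfile.TemporaryDirectory() as tmp:
    jobs = run_terms(["frontend", "backend"], fake_scraper, tmp)
    leftover = os.listdir(tmp)
```

Keeping the seen set outside the per-term loop is what makes the de-dupe work across search terms rather than only within one result file.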
3) Mapping and cleanup
The mapper normalizes fields like salary ranges, converts empty values to null, and keeps extra metadata (skills, company rating, remote flag, etc.) when available.
If a row is missing a valid site (indeed, linkedin, or glassdoor) or a job URL, it gets skipped.
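A minimal sketch of the skip and empty-to-null rules, again in Python for illustration even though the actual mapper lives in the Node service. The input keys (site, job_url, title, min_amount, max_amount, is_remote) and the output keys other than jobUrl are assumptions about the row shape, and the salary handling here just passes cleaned values through rather than reproducing the real normalization.

```python
VALID_SITES = {"indeed", "linkedin", "glassdoor"}


def clean(value):
    """Convert empty values to None: None, NaN floats, and blank strings."""
    if value is None:
        return None
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return None
    if isinstance(value, str) and not value.strip():
        return None
    return value


def map_row(row):
    """Return a mapped job dict, or None when the row should be skipped."""
    site = clean(row.get("site"))
    job_url = clean(row.get("job_url"))
    # Skip rows without a recognised site or without a job URL.
    if not isinstance(site, str) or site.lower() not in VALID_SITES or not job_url:
        return None
    return {
        "source": site.lower(),
        "jobUrl": job_url,
        "title": clean(row.get("title")),
        "salaryMin": clean(row.get("min_amount")),
        "salaryMax": clean(row.get("max_amount")),
        "isRemote": clean(row.get("is_remote")),
    }
```

Returning None for skipped rows lets the caller filter them out in one pass, for example [m for m in map(map_row, rows) if m is not None].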
Notes
- If JOBSPY_SEARCH_TERMS is a JSON array, it will be parsed as-is. Otherwise it can be a |, comma, or newline-separated list.
- LinkedIn descriptions are optional and can slow the crawl; set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to disable.
- Output files are stored under data/imports/ before being cleaned up.
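The JOBSPY_SEARCH_TERMS parsing rule from the first note can be sketched as follows. This is an illustration in Python rather than the orchestrator's own code; the function name parse_search_terms is hypothetical, and falling back to delimiter splitting when the value looks like JSON but fails to parse is an assumption.

```python
import json
import re


def parse_search_terms(raw):
    """Parse a JSON array as-is, else split on '|', ',' or newlines."""
    text = (raw or "").strip()
    if not text:
        return []
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            parsed = None  # assumption: fall through to delimiter splitting
        if isinstance(parsed, list):
            return parsed
    # Split on any of the three delimiters and drop blank entries.
    return [part.strip() for part in re.split(r"[|,\n]", text) if part.strip()]
```

One consequence of the delimiter form is that a search term containing a comma has to be supplied through the JSON array form instead.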