Shaheer Sarfaraz 4e1ea28301
Enable Glassdoor as a JobSpy source (#126)
* feat(shared): add glassdoor to job source model

* feat(jobspy): support glassdoor site in scraper and discovery

* feat(pipeline): include glassdoor in source selection and API schema

* feat(ui): add glassdoor toggle to jobspy settings and run estimates

* test/docs: cover glassdoor jobspy integration end-to-end

* fix(jobspy): make glassdoor always-on without settings toggle

* fix(jobspy): fallback glassdoor when location is country-level

* refactor(jobspy): drop direct pandas usage in wrapper

* feat(pipeline): gate glassdoor by supported countries

* fix(jobspy): restore pandas output and keep glassdoor disable copy

* fix(jobspy): map country-level glassdoor searches to city fallbacks

* feat(ui): require glassdoor city for country-level runs
2026-02-10 17:57:49 +00:00

JobSpy Extractor (How It Works)

This is a simple walkthrough of the JobSpy extractor used for Indeed, LinkedIn, and Glassdoor.

Big picture

JobSpy is a Python library. We wrap it in a tiny Python script, run it once per search term, then ingest the JSON it writes into our database format.

1) Inputs and defaults

The Python wrapper (extractors/jobspy/scrape_jobs.py) reads the following environment variables and falls back to the listed default when one is unset:

  • JOBSPY_SITES (default: indeed,linkedin)
  • JOBSPY_SEARCH_TERM (default: web developer)
  • JOBSPY_LOCATION (default: UK)
  • JOBSPY_RESULTS_WANTED (default: 200)
  • JOBSPY_HOURS_OLD (default: 72)
  • JOBSPY_COUNTRY_INDEED (default: UK)
  • JOBSPY_LINKEDIN_FETCH_DESCRIPTION (default: true)

It writes output to both CSV and JSON files. The JSON is what we ingest.
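
As a rough illustration of the defaults listed above, the environment lookup could look something like the sketch below. The names DEFAULTS and read_config are hypothetical and not taken from scrape_jobs.py, and the rule for reading the boolean flag (treating 1, true, or yes as enabled) is an assumption.

```python
import os

# Defaults copied from the list above. DEFAULTS and read_config are
# hypothetical names used only for this sketch.
DEFAULTS = {
    "JOBSPY_SITES": "indeed,linkedin",
    "JOBSPY_SEARCH_TERM": "web developer",
    "JOBSPY_LOCATION": "UK",
    "JOBSPY_RESULTS_WANTED": "200",
    "JOBSPY_HOURS_OLD": "72",
    "JOBSPY_COUNTRY_INDEED": "UK",
    "JOBSPY_LINKEDIN_FETCH_DESCRIPTION": "true",
}


def read_config(env=None):
    """Return typed settings, using the default for any unset or empty variable."""
    env = os.environ if env is None else env
    raw = {key: env.get(key) or default for key, default in DEFAULTS.items()}
    # Assumption: 1/true/yes (case-insensitive) enable the flag, anything else disables it.
    fetch_flag = raw["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"].strip().lower()
    return {
        "sites": [s.strip() for s in raw["JOBSPY_SITES"].split(",") if s.strip()],
        "search_term": raw["JOBSPY_SEARCH_TERM"],
        "location": raw["JOBSPY_LOCATION"],
        "results_wanted": int(raw["JOBSPY_RESULTS_WANTED"]),
        "hours_old": int(raw["JOBSPY_HOURS_OLD"]),
        "country_indeed": raw["JOBSPY_COUNTRY_INDEED"],
        "linkedin_fetch_description": fetch_flag in {"1", "true", "yes"},
    }
```

Passing a plain dict instead of os.environ, for example read_config({"JOBSPY_SITES": "glassdoor"}), makes the fallback behaviour easy to check without touching the real environment.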

2) Orchestrator flow

The Node service (orchestrator/src/server/services/jobspy.ts) controls the run:

  • Builds a list of search terms (from the UI, or JOBSPY_SEARCH_TERMS env).
  • Runs the Python script once per search term with a unique output filename.
  • Reads the JSON file and maps each row to our internal CreateJobInput shape.
  • De-dupes by jobUrl so the same listing only appears once.
  • Deletes the CSV/JSON files after ingesting (best effort).
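
The de-dupe and cleanup steps above can be sketched as follows. The real service is TypeScript; Python is used here only to illustrate the logic. The function names are hypothetical, and keeping the first occurrence of each jobUrl is an assumption about which duplicate wins.

```python
import os


def dedupe_by_job_url(jobs):
    """Keep the first job seen for each jobUrl, preserving input order."""
    seen = set()
    unique = []
    for job in jobs:
        url = job["jobUrl"]
        if url in seen:
            continue  # same listing already collected from an earlier term or site
        seen.add(url)
        unique.append(job)
    return unique


def remove_quietly(paths):
    """Best-effort cleanup: a missing or locked file must not fail the run."""
    for path in paths:
        try:
            os.remove(path)
        except OSError:
            pass
```

Because each search term runs separately, the same listing can come back more than once; de-duping after all terms are merged is what keeps it from appearing twice.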

3) Mapping and cleanup

The mapper normalizes fields like salary ranges, converts empty values to null, and keeps extra metadata (skills, company rating, remote flag, etc.) when available.

If a row is missing a valid site (indeed, linkedin, or glassdoor) or a job URL, it gets skipped.
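
A minimal sketch of the skip rule and empty-value cleanup described above, written in Python for illustration (the actual mapper lives in the TypeScript service). The input keys site, job_url, title, and company, and the output keys other than jobUrl, are assumptions made for this example and do not reflect the full CreateJobInput shape.

```python
VALID_SITES = {"indeed", "linkedin", "glassdoor"}


def clean(value):
    """Convert empty values (None or blank strings) to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


def map_row(row):
    """Map one scraped row to a job dict, or return None if it must be skipped."""
    site = clean(row.get("site"))
    job_url = clean(row.get("job_url"))
    # Skip rows without a recognised site or without a job URL.
    if not isinstance(site, str) or site not in VALID_SITES or job_url is None:
        return None
    return {
        "source": site,          # hypothetical output field name
        "jobUrl": job_url,
        "title": clean(row.get("title")),
        "employer": clean(row.get("company")),  # hypothetical output field name
    }
```

Returning None for skipped rows lets the caller filter them out in one pass, for example [m for m in map(map_row, rows) if m is not None].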

Notes

  • If JOBSPY_SEARCH_TERMS is a JSON array, it is parsed as-is. Otherwise it is treated as a list separated by |, commas, or newlines.
  • LinkedIn descriptions are optional and can slow the crawl; set JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0 to disable.
  • Output files are stored under data/imports/ before being cleaned up.
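
The JOBSPY_SEARCH_TERMS rule from the first note could be implemented roughly as below. This is a Python sketch of the idea, not the orchestrator's TypeScript code; the function name is hypothetical, and falling back to delimiter splitting when the JSON is malformed is an assumption.

```python
import json
import re


def parse_search_terms(raw):
    """Parse a JSON array as-is, else split on |, commas, or newlines."""
    text = raw.strip()
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            parsed = None  # assumption: malformed JSON falls through to splitting
        if isinstance(parsed, list):
            return [term for term in parsed if isinstance(term, str)]
    # Delimited form: trim each piece and drop empty ones.
    return [part.strip() for part in re.split(r"[|,\n]", text) if part.strip()]
```

One consequence of splitting on commas is that a term containing a comma has to be supplied through the JSON array form.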