| id | title | description | sidebar_position |
|---|---|---|---|
| jobspy | JobSpy Extractor | How the JobSpy Python wrapper is orchestrated and normalized. | 3 |
A walkthrough of the JobSpy extractor for Indeed, LinkedIn, and Glassdoor.
Original websites:
Big picture
JobSpy runs as a Python script, once per search term, and writes its results as JSON. The orchestrator then ingests that output and normalizes it into the internal job shape.
1) Inputs and defaults
Key environment variables:
- `JOBSPY_SITES` (default: `indeed,linkedin`)
- `JOBSPY_SEARCH_TERM` (default: `web developer`)
- `JOBSPY_LOCATION` (default: `UK`)
- `JOBSPY_RESULTS_WANTED` (default: `200`)
- `JOBSPY_HOURS_OLD` (default: `72`)
- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)
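As a rough illustration of how these defaults might be resolved, the sketch below reads the variables from a plain key/value map. The `JobSpyConfig` shape, the helper names, and the handling of `false` as a disabled value are assumptions made for this example, not the service's actual code.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface JobSpyConfig {
  sites: string[];
  searchTerm: string;
  location: string;
  resultsWanted: number;
  hoursOld: number;
  countryIndeed: string;
  linkedinFetchDescription: boolean;
}

type Env = Record<string, string | undefined>;

// Parse a positive integer; fall back to the default on missing or invalid input.
function intOr(value: string | undefined, fallback: number): number {
  const parsed = parseInt(value ?? "", 10);
  return !isNaN(parsed) && parsed > 0 ? parsed : fallback;
}

// Unset or blank keeps the default; "0" and "false" turn the flag off
// ("false" is an assumption); any other value turns it on.
function boolOr(value: string | undefined, fallback: boolean): boolean {
  const text = (value ?? "").trim().toLowerCase();
  if (text === "") return fallback;
  return !(text === "0" || text === "false");
}

function readJobSpyConfig(env: Env): JobSpyConfig {
  return {
    sites: (env["JOBSPY_SITES"] || "indeed,linkedin")
      .split(",")
      .map((site) => site.trim())
      .filter((site) => site.length > 0),
    searchTerm: env["JOBSPY_SEARCH_TERM"] || "web developer",
    location: env["JOBSPY_LOCATION"] || "UK",
    resultsWanted: intOr(env["JOBSPY_RESULTS_WANTED"], 200),
    hoursOld: intOr(env["JOBSPY_HOURS_OLD"], 72),
    countryIndeed: env["JOBSPY_COUNTRY_INDEED"] || "UK",
    linkedinFetchDescription: boolOr(env["JOBSPY_LINKEDIN_FETCH_DESCRIPTION"], true),
  };
}
```

Passing the environment in as an argument keeps the function easy to test; in a Node process the caller would typically hand it `process.env`.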
2) Orchestrator flow
The service in orchestrator/src/server/services/jobspy.ts:
- Builds search-term list from UI or env
- Runs Python once per term with unique output file
- Reads JSON and maps to `CreateJobInput`
- De-dupes by `jobUrl`
- Deletes temp output files best-effort
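The steps above can be sketched roughly as follows. This is a simplified outline under stated assumptions, not the code in `jobspy.ts`: the `JobLike` type, the `FlowDeps` interface, the `collectJobs` name, and the output file naming are all hypothetical, and the side effects (spawning Python, reading and deleting files) are injected so the control flow stays visible.

```typescript
// Hypothetical minimal job shape; the real CreateJobInput has more fields.
interface JobLike {
  jobUrl: string;
  title: string;
}

// Injected side effects (assumed names), so the sketch has no Node-specific imports.
interface FlowDeps {
  runPython: (term: string, outputPath: string) => Promise<void>;
  readJson: (outputPath: string) => Promise<JobLike[]>;
  removeFile: (outputPath: string) => Promise<void>;
}

async function collectJobs(terms: string[], deps: FlowDeps): Promise<JobLike[]> {
  const byUrl = new Map<string, JobLike>();
  let index = 0;
  for (const term of terms) {
    // One unique output file per term (the naming scheme is an assumption).
    const outputPath = `data/imports/jobspy-${Date.now()}-${index}.json`;
    index += 1;
    try {
      await deps.runPython(term, outputPath);
      for (const job of await deps.readJson(outputPath)) {
        // De-dupe by jobUrl: the first occurrence wins.
        if (!byUrl.has(job.jobUrl)) byUrl.set(job.jobUrl, job);
      }
    } finally {
      // Best-effort cleanup: a rejected delete must not fail the run.
      await deps.removeFile(outputPath).catch(() => undefined);
    }
  }
  return Array.from(byUrl.values());
}
```

Keeping the de-dupe map outside the loop means a job found under two different search terms is still reported once.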
3) Mapping and cleanup
- Normalizes salary ranges
- Converts empty values to null
- Keeps metadata like skills, ratings, remote flags when available
- Skips rows with invalid site or missing URL
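A minimal sketch of this kind of cleanup is shown below. The `RawRow` and `MappedJob` shapes, the `KNOWN_SITES` list, and the rule of reordering a reversed salary range are assumptions for illustration; the real mapping targets `CreateJobInput` and keeps more metadata.

```typescript
// Hypothetical raw row; column names are assumed for this example.
interface RawRow {
  site?: string | null;
  job_url?: string | null;
  title?: string | null;
  min_amount?: number | string | null;
  max_amount?: number | string | null;
}

// Hypothetical, trimmed-down stand-in for the internal job shape.
interface MappedJob {
  source: string;
  jobUrl: string;
  title: string | null;
  salaryMin: number | null;
  salaryMax: number | null;
}

// Sites named in this document; the real allow-list may differ.
const KNOWN_SITES = ["indeed", "linkedin", "glassdoor"];

// Convert missing or whitespace-only strings to null.
function emptyToNull(value: string | null | undefined): string | null {
  const trimmed = (value ?? "").trim();
  return trimmed === "" ? null : trimmed;
}

// Convert missing, blank, or non-numeric values to null.
function toNumberOrNull(value: number | string | null | undefined): number | null {
  if (value === null || value === undefined) return null;
  if (typeof value === "string" && value.trim() === "") return null;
  const parsed = Number(value);
  return isNaN(parsed) ? null : parsed;
}

function mapRow(row: RawRow): MappedJob | null {
  const site = emptyToNull(row.site);
  const jobUrl = emptyToNull(row.job_url);
  // Skip rows with an invalid site or a missing URL.
  if (site === null || jobUrl === null) return null;
  if (KNOWN_SITES.indexOf(site.toLowerCase()) === -1) return null;

  let salaryMin = toNumberOrNull(row.min_amount);
  let salaryMax = toNumberOrNull(row.max_amount);
  // Assumed normalization: keep the range ordered when both bounds exist.
  if (salaryMin !== null && salaryMax !== null && salaryMin > salaryMax) {
    [salaryMin, salaryMax] = [salaryMax, salaryMin];
  }

  return {
    source: site.toLowerCase(),
    jobUrl,
    title: emptyToNull(row.title),
    salaryMin,
    salaryMax,
  };
}
```

Returning `null` for rejected rows lets the caller drop them with a single filter before de-duplication.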
Notes
- `JOBSPY_SEARCH_TERMS` can be a JSON array or `|`, comma, or newline-delimited text.
- Set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to speed up runs.
- Temp output files are stored under `data/imports/`.
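To make the accepted `JOBSPY_SEARCH_TERMS` formats concrete, here is one way such a value could be parsed. The `parseSearchTerms` function is a hypothetical sketch; the fallback to delimiter splitting when the JSON is malformed is an assumption rather than documented behavior.

```typescript
// Split on any of the supported delimiters: pipe, comma, or newline.
function splitDelimited(text: string): string[] {
  return text.split(/[|,\n]/);
}

function parseSearchTerms(raw: string | undefined): string[] {
  const text = (raw ?? "").trim();
  if (text === "") return [];

  let parts = splitDelimited(text);
  if (text.charAt(0) === "[") {
    try {
      const parsed: unknown = JSON.parse(text);
      // Only accept a JSON array; anything else keeps the delimiter split.
      if (Array.isArray(parsed)) parts = parsed.map((item) => String(item));
    } catch {
      // Not valid JSON (assumed fallback): keep the delimiter-split result.
    }
  }

  // Trim each term and drop empties left by trailing or doubled delimiters.
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}
```

For example, `parseSearchTerms('["web developer", "data engineer"]')` and `parseSearchTerms("web developer|data engineer")` both yield the same two terms.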