jobspy extractor
This commit is contained in:
parent
976f7b4878
commit
3021ea0390
@ -3,3 +3,4 @@
|
|||||||
Technical breakdowns of how each extractor works.
|
Technical breakdowns of how each extractor works.
|
||||||
|
|
||||||
- Gradcracker: `gradcracker.md`
|
- Gradcracker: `gradcracker.md`
|
||||||
|
- JobSpy: `jobspy.md`
|
||||||
|
|||||||
43
documentation/extractors/jobspy.md
Normal file
43
documentation/extractors/jobspy.md
Normal file
@ -0,0 +1,43 @@
|
|||||||
|
# JobSpy Extractor (How It Works)
|
||||||
|
|
||||||
|
This is a simple walkthrough of the JobSpy extractor used for Indeed and LinkedIn.
|
||||||
|
|
||||||
|
## Big picture
|
||||||
|
|
||||||
|
JobSpy is a Python library. We wrap it in a tiny Python script, run it once per search term, then ingest the JSON it writes into our database format.
|
||||||
|
|
||||||
|
## 1) Inputs and defaults
|
||||||
|
|
||||||
|
The Python wrapper (`extractors/jobspy/scrape_jobs.py`) reads environment variables and falls back to sensible defaults:
|
||||||
|
|
||||||
|
- `JOBSPY_SITES` (default: `indeed,linkedin`)
|
||||||
|
- `JOBSPY_SEARCH_TERM` (default: `web developer`)
|
||||||
|
- `JOBSPY_LOCATION` (default: `UK`)
|
||||||
|
- `JOBSPY_RESULTS_WANTED` (default: `200`)
|
||||||
|
- `JOBSPY_HOURS_OLD` (default: `72`)
|
||||||
|
- `JOBSPY_COUNTRY_INDEED` (default: `UK`)
|
||||||
|
- `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` (default: `true`)
|
||||||
|
|
||||||
|
It writes output to both CSV and JSON files. The JSON is what we ingest.
|
||||||
|
|
||||||
|
## 2) Orchestrator flow
|
||||||
|
|
||||||
|
The Node service (`orchestrator/src/server/services/jobspy.ts`) controls the run:
|
||||||
|
|
||||||
|
- Builds a list of search terms (from the UI, or `JOBSPY_SEARCH_TERMS` env).
|
||||||
|
- Runs the Python script once per search term with a unique output filename.
|
||||||
|
- Reads the JSON file, maps each row to our internal `CreateJobInput` shape.
|
||||||
|
- De-dupes by `jobUrl` so the same listing only appears once.
|
||||||
|
- Deletes the CSV/JSON files after ingesting (best effort).
|
||||||
|
|
||||||
|
## 3) Mapping and cleanup
|
||||||
|
|
||||||
|
The mapper normalizes fields like salary ranges, converts empty values to null, and keeps extra metadata (skills, company rating, remote flag, etc.) when available.
|
||||||
|
|
||||||
|
If a row is missing a valid site (`indeed` or `linkedin`) or a job URL, it gets skipped.
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- If `JOBSPY_SEARCH_TERMS` is a JSON array, it will be parsed as-is. Otherwise it can be a `|`, comma, or newline-separated list.
|
||||||
|
- LinkedIn descriptions are optional and can slow the crawl; set `JOBSPY_LINKEDIN_FETCH_DESCRIPTION=0` to disable.
|
||||||
|
- Output files are stored under `data/imports/` before being cleaned up.
|
||||||
Loading…
x
Reference in New Issue
Block a user