Jobber/overview.md at 3fee6e0befdaabe64aafe9b414e83bdeaae158ff

Shaheer Sarfaraz 82e142a8a8

Auto-Registering Extractor System (#223)

2026-02-21 17:44:07 +00:00

| id | title | description | sidebar_position |
| --- | --- | --- | --- |
| overview | Extractors Overview | Technical index of supported extractors and how they work. | 1 |

This page helps you choose the right extractor for your run, understand key constraints, and navigate to detailed technical guides.

Extractor integrations are registered through manifests and loaded automatically at orchestrator startup. Runtime discovery scans only `extractors/*/manifest.ts` and `extractors/*/src/manifest.ts`; it does not read manifests from `orchestrator/**`. Extractor-specific run logic should also stay in `extractors/<name>/` so that the orchestrator remains source-agnostic. To add a new source, follow Add an Extractor.
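The discovery rule above can be illustrated with a small path filter. This is a minimal sketch, not the project's actual discovery code: `findManifestPaths`, `ManifestLocation`, and `MANIFEST_PATTERN` are hypothetical names, and the sketch assumes paths are given relative to the repository root.

```typescript
// Hypothetical helper illustrating the documented scan rule; the real
// registry code may be structured differently.
// Matches extractors/<name>/manifest.ts and extractors/<name>/src/manifest.ts only.
const MANIFEST_PATTERN = /^extractors\/([^/]+)\/(?:src\/)?manifest\.ts$/;

interface ManifestLocation {
  extractor: string; // directory name under extractors/
  path: string; // normalized, repo-relative manifest path
}

function findManifestPaths(relativePaths: string[]): ManifestLocation[] {
  const locations: ManifestLocation[] = [];
  for (const raw of relativePaths) {
    // Normalize Windows separators so the pattern sees forward slashes.
    const path = raw.replace(/\\/g, "/");
    const match = MANIFEST_PATTERN.exec(path);
    if (match) {
      locations.push({ extractor: match[1], path });
    }
  }
  return locations;
}
```

Under this rule a path such as `orchestrator/src/manifest.ts` is never picked up, which mirrors the statement that discovery does not read manifests from `orchestrator/**`.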

Extractor chooser

| Extractor | Best use case | Core constraints/dependencies | Notable controls | Output/behavior notes |
| --- | --- | --- | --- | --- |
| Gradcracker | UK graduate roles from Gradcracker | Crawling stability depends on page structure and anti-bot behavior; tuned for low concurrency | `GRADCRACKER_SEARCH_TERMS`, `GRADCRACKER_MAX_JOBS_PER_TERM`, `JOBOPS_SKIP_APPLY_FOR_EXISTING` | Scrapes listing metadata, then detail pages and apply URL resolution |
| JobSpy | Multi-source discovery (Indeed, LinkedIn, Glassdoor) | Requires Python wrapper execution per term; source availability and quality vary by site/location | `JOBSPY_SITES`, `JOBSPY_SEARCH_TERMS`, `JOBSPY_RESULTS_WANTED`, `JOBSPY_HOURS_OLD`, `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` | Produces JSON per term, then orchestrator normalizes and de-duplicates by `jobUrl` |
| Adzuna | API-based multi-country discovery with low scraping overhead | Requires valid App ID/App Key; country must be in Adzuna-supported list | `ADZUNA_APP_ID`, `ADZUNA_APP_KEY`, `ADZUNA_MAX_JOBS_PER_TERM` | API pagination to dataset output; orchestrator maps progress and de-duplicates by `sourceJobId`/`jobUrl` |
| Hiring Cafe | Browser-backed discovery using Hiring Cafe search APIs | Subject to upstream anti-bot checks; uses browser context and encoded search-state payloads | `HIRING_CAFE_SEARCH_TERMS`, `HIRING_CAFE_COUNTRY`, `HIRING_CAFE_MAX_JOBS_PER_TERM`, `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` | Uses existing pipeline term/country/budget knobs and maps directly to normalized jobs |
| UKVisaJobs | UK visa sponsorship-focused roles | Requires authenticated session and periodic token/cookie refresh | `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`, `UKVISAJOBS_MAX_JOBS`, `UKVISAJOBS_SEARCH_KEYWORD` | API pagination + dataset output; orchestrator de-duplicates and may fetch missing descriptions |
| Manual Import | One-off jobs not covered by scrapers | Inference quality depends on model/provider and input quality; some URLs cannot be fetched reliably | App/API endpoints (`/api/manual-jobs/infer`, `/api/manual-jobs/import`) | Accepts text/HTML/URL, runs inference, then saves and scores job after review |
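Several rows above mention de-duplication by `sourceJobId` and/or `jobUrl`. The following is a minimal sketch of one way such a step could work, assuming a job counts as a duplicate when either key has been seen before. The `DiscoveredJob` shape and the `dedupeJobs` function are hypothetical and are not taken from the orchestrator's code.

```typescript
// Hypothetical normalized job shape; only the fields needed for the sketch.
interface DiscoveredJob {
  source: string; // e.g. "adzuna", "jobspy"
  jobUrl: string;
  sourceJobId?: string; // not every source provides one
  title: string;
}

// Keeps the first occurrence of each job. A job is treated as a duplicate
// if its URL, or its source-scoped ID, was already seen (an assumption).
function dedupeJobs(jobs: DiscoveredJob[]): DiscoveredJob[] {
  const seen = new Set<string>();
  const unique: DiscoveredJob[] = [];
  for (const job of jobs) {
    const keys = [`url:${job.jobUrl}`];
    if (job.sourceJobId) {
      keys.push(`id:${job.source}:${job.sourceJobId}`);
    }
    if (keys.some((key) => seen.has(key))) {
      continue; // already recorded under at least one key
    }
    keys.forEach((key) => seen.add(key));
    unique.push(job);
  }
  return unique;
}
```

Scoping the ID key by source avoids treating two different boards' jobs as duplicates just because their numeric IDs collide; whether the orchestrator does the same is not stated in this page.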

Which extractor should I use?

Use JobSpy for broad first-pass sourcing across common boards.
Use Adzuna when you want API-first discovery in supported non-UK markets.
Use Hiring Cafe when you want another term/country-driven source without adding credentials.
Use Gradcracker when targeting graduate pipelines in the UK.
Use UKVisaJobs for sponsorship-specific UK searches.
Use Manual Import when you already have a specific posting and need direct import.

Many runs combine sources: broad discovery first, then manual import for high-priority jobs that scraping misses.
