ilia c840f289e1
feat(extractors): expand catalog, smoke coverage, and sourcing docs
Adds Arc.dev, BC T-Net, Eluta, iCIMS tenants, QAJobsBoard, and SmartRecruiters
manifests with registry/settings/UI wiring; registers full extractor list in
smoke-extractors and documents supplementary board access paths. Aligns Careerjet
v4 with the url query parameter and fixes strict typing in QAJobsBoard.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-15 22:36:23 -04:00


id: overview
title: Extractors Overview
description: Technical index of supported extractors and how they work.
sidebar_position: 1

This page helps you choose the right extractor for your run, understand key constraints, and navigate to detailed technical guides.

Extractor integrations are now registered through manifests and loaded automatically at orchestrator startup. Runtime discovery only scans extractors/*/(manifest.ts|src/manifest.ts) and does not read manifests from orchestrator/**. Extractor-specific run logic should also remain in extractors/<name>/ so orchestrator stays source-agnostic. To add a new source, follow Add an Extractor.
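As a rough illustration of the discovery rule above, the sketch below checks whether a repo-relative path matches the documented pattern. The helper name `isDiscoverableManifest` and the regular expression are assumptions made for this example; they are not the orchestrator's actual loader.

```typescript
// Hypothetical helper mirroring the documented pattern
// extractors/*/(manifest.ts|src/manifest.ts). Not the real loader code.
const MANIFEST_PATTERN = /^extractors\/[^/]+\/(?:src\/)?manifest\.ts$/;

function isDiscoverableManifest(repoRelativePath: string): boolean {
  // Normalize Windows separators so the check behaves the same on every OS.
  const normalized = repoRelativePath.replace(/\\/g, "/");
  return MANIFEST_PATTERN.test(normalized);
}

const candidates: string[] = [
  "extractors/adzuna/manifest.ts",
  "extractors/gradcracker/src/manifest.ts",
  "orchestrator/src/manifest.ts", // ignored: lives under orchestrator/**
  "extractors/adzuna/lib/manifest.ts", // ignored: not one of the two allowed locations
];

const discovered: string[] = candidates.filter(isDiscoverableManifest);
console.log(discovered);
```

Only the first two candidates pass, which is why a manifest placed under orchestrator/** would never be registered.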

Extractor chooser

| Extractor | Best use case | Core constraints/dependencies | Notable controls | Output/behavior notes |
| --- | --- | --- | --- | --- |
| Gradcracker | UK graduate roles from Gradcracker | Crawling stability depends on page structure and anti-bot behavior; tuned for low concurrency | GRADCRACKER_SEARCH_TERMS, GRADCRACKER_MAX_JOBS_PER_TERM, JOBOPS_SKIP_APPLY_FOR_EXISTING | Scrapes listing metadata, then detail pages and apply URL resolution |
| JobSpy | Multi-source discovery (Indeed, LinkedIn, Glassdoor) | Requires Python wrapper execution per term; source availability and quality vary by site/location | JOBSPY_SITES, JOBSPY_SEARCH_TERMS, JOBSPY_RESULTS_WANTED, JOBSPY_HOURS_OLD, JOBSPY_LINKEDIN_FETCH_DESCRIPTION | Produces JSON per term, then orchestrator normalizes and de-duplicates by jobUrl |
| Adzuna | API-based multi-country discovery with low scraping overhead | Requires valid App ID/App Key; country must be in Adzuna-supported list | ADZUNA_APP_ID, ADZUNA_APP_KEY, ADZUNA_MAX_JOBS_PER_TERM | API pagination to dataset output; orchestrator maps progress and de-duplicates by sourceJobId/jobUrl |
| Hiring Cafe | Browser-backed discovery using Hiring Cafe search APIs | Subject to upstream anti-bot checks; uses browser context and encoded search-state payloads | HIRING_CAFE_SEARCH_TERMS, HIRING_CAFE_COUNTRY, HIRING_CAFE_MAX_JOBS_PER_TERM, HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS | Uses existing pipeline term/country/budget knobs and maps directly to normalized jobs |
| startup.jobs | Startup-focused discovery through the published startup-jobs-scraper package | No credentials required; detail enrichment depends on Playwright browser binaries being installed | existing pipeline searchTerms, selected country/cities, jobspyResultsWanted; npx playwright install for fresh environments | Algolia-backed search plus detail-page enrichment via package import; orchestrator maps normalized records and de-duplicates by jobUrl |
| UKVisaJobs | UK visa sponsorship-focused roles | Requires authenticated session and periodic token/cookie refresh | UKVISAJOBS_EMAIL, UKVISAJOBS_PASSWORD, UKVISAJOBS_MAX_JOBS, UKVISAJOBS_SEARCH_KEYWORD | API pagination + dataset output; orchestrator de-dupes and may fetch missing descriptions |
| SmartRecruiters | Enterprise employers on SmartRecruiters public boards | No auth; needs configured company identifiers; one HTTP round-trip per posting for apply URLs + descriptions | SMARTRECRUITERS_COMPANIES, SMARTRECRUITERS_MAX_JOBS_PER_COMPANY | Paginates the public Posting API, filters by pipeline terms, normalizes to CreateJobInput |
| iCIMS tenants (HTML) | Large employers on iCIMS portals | No auth; HTML search varies by tenant: maintain explicit tenant hosts | ICIMS_TENANTS, Settings: icimsTenants, icimsMaxJobsPerTenant, icimsMaxPagesPerSearch | Fetches /jobs/search with iframe-style params, parses listing links, caps per tenant |
| BC T-Net (RSS) | BC tech aggregate via T-Net | Canada-only; free RSS (default feed built-in); optional extra feeds | BCTENET_RSS_URLS, Settings: bctenetRssUrls, bctenetMaxJobsPerTerm | Fetches RSS item blocks, normalizes quirky CDATA link fragments, filters by pipeline terms |
| Eluta | Canadian listings aggregated from employer career sites (RSS) | Canada-only source (skipped when search geography is not Canada); RSS location strings must be set | ELUTA_RSS_LOCATIONS, ELUTA_MAX_JOBS_PER_TERM | Fetches one or more eluta.ca RSS feeds, filters by terms, de-duplicates by guid/URL |
| QAJobsBoard | QA / SDET / automation-heavy board (global JSON feed) | No auth; geography skew is manual/filter downstream | qajobsboardMaxJobsPerTerm | Fetches JobBoardly JSON, filters by pipeline terms |
| Arc.dev | Remote roles from Arc.dev listing pages (tool-tagged paths) | Parses SSR __NEXT_DATA__; relies on stable Next payload | ARC_REMOTE_JOBS_PATHS (seeds defaults), arcRemoteJobsPaths, arcMaxJobsPerPath | Merges Arc-managed + external rows; dedupes by URL |
| Manual Import | One-off jobs not covered by scrapers | Inference quality depends on model/provider and input quality; some URLs cannot be fetched reliably | App/API endpoints (/api/manual-jobs/infer, /api/manual-jobs/import) | Accepts text/HTML/URL, runs inference, then saves and scores job after review |
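Several rows above mention de-duplication by sourceJobId and/or jobUrl. The following is a minimal sketch of that idea, assuming a simplified record shape. The `JobRecord` interface, the `dedupeJobs` function, and the rule that a match on either key counts as a duplicate are all assumptions for illustration; the orchestrator's real logic may differ.

```typescript
// Hypothetical, simplified record; the real normalized type likely carries more fields.
interface JobRecord {
  source: string;
  jobUrl: string;
  sourceJobId?: string;
  title: string;
}

// Assumption: a job is a duplicate if either its URL or its
// source-scoped ID has already been seen. The first occurrence wins.
function dedupeJobs(jobs: JobRecord[]): JobRecord[] {
  const seen = new Set<string>();
  const unique: JobRecord[] = [];
  for (const job of jobs) {
    const keys: string[] = [`url:${job.jobUrl.trim()}`];
    if (job.sourceJobId) {
      keys.push(`id:${job.source}:${job.sourceJobId}`);
    }
    if (keys.some((key) => seen.has(key))) continue;
    keys.forEach((key) => seen.add(key));
    unique.push(job);
  }
  return unique;
}

const sampleJobs: JobRecord[] = [
  { source: "adzuna", sourceJobId: "1", jobUrl: "https://example.com/a", title: "QA Engineer" },
  { source: "adzuna", sourceJobId: "1", jobUrl: "https://example.com/a?ref=x", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/a", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/b", title: "SDET" },
];

const uniqueJobs = dedupeJobs(sampleJobs);
console.log(uniqueJobs.map((job) => job.jobUrl));
```

In this sample the second row is dropped because its source ID repeats and the third because its URL repeats, leaving the first and last rows.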

Which extractor should I use?

  • Use JobSpy for broad first-pass sourcing across common boards.
  • Use Adzuna when you want API-first discovery in supported non-UK markets.
  • Use Hiring Cafe when you want another term/country-driven source without adding credentials.
  • Use startup.jobs when you want startup-heavy listings without maintaining another scraper locally.
  • Use Gradcracker when targeting graduate pipelines in the UK.
  • Use UKVisaJobs for sponsorship-specific UK searches.
  • Use SmartRecruiters when you can list your target employers' public SmartRecruiters company identifiers.
  • Use iCIMS tenants when you can list target *.icims.com career hosts (anonymous portal HTML search).
  • Use BC T-Net for British Columbia tech RSS listings (runs only when search geography is Canada).
  • Use Eluta for Canadian employer-direct listings via RSS (set metro/province location strings).
  • Use QAJobsBoard or Arc.dev when you want QA- or remote-stack-focused feeds without extra credentials.
  • Use Manual Import when you already have a specific posting and need direct import.

Many runs combine sources: broad discovery first, then manual import for high-priority jobs that scraping misses.
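The Canada-only behavior noted for BC T-Net and Eluta can be pictured as a filter applied before a run starts. This sketch is illustrative only: the `SourceSpec` shape, the `onlyCountry` field, the source ids, and `selectRunnableSources` are hypothetical names, not part of the JobOps codebase.

```typescript
// Hypothetical descriptor; onlyCountry undefined means the source runs for any geography.
interface SourceSpec {
  id: string;
  onlyCountry?: string;
}

const sources: SourceSpec[] = [
  { id: "jobspy" },
  { id: "bctnet", onlyCountry: "canada" },
  { id: "eluta", onlyCountry: "canada" },
  { id: "qajobsboard" },
];

// Keep unrestricted sources, plus restricted ones whose country matches the search geography.
function selectRunnableSources(all: SourceSpec[], searchCountry: string): SourceSpec[] {
  const country = searchCountry.trim().toLowerCase();
  return all.filter(
    (source) => source.onlyCountry === undefined || source.onlyCountry === country,
  );
}

const ukRun = selectRunnableSources(sources, "United Kingdom").map((s) => s.id);
const canadaRun = selectRunnableSources(sources, "Canada").map((s) => s.id);
console.log(ukRun, canadaRun);
```

With a UK search geography only the unrestricted sources remain; with Canada all four are eligible.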

QA-focused boards (shipped extractors)

  • QAJobsBoard — Large QA-oriented index via public JSON; filter geography downstream.
  • Arc.dev — Remote feeds (e.g. Playwright / Cypress paths); good for vetted remote slices.

Canadian QA contracting firms (reference)

Staffing and consultancy firms that frequently post QA automation contracts — scrape hints and CLI probes: Canadian / NA QA contracting firms.

Canadian employers — QA-strong ATS (reference)

Direct ATS JSON / extractor wiring for well-known Canadian tech brands (Ashby, Greenhouse, Lever, Workday, SmartRecruiters): Canadian companies — strong QA orgs and scrapable ATS.

Supplementary job boards

Some boards are credential-gated, approval-gated, or scraping-hostile — see Supplementary sources — access notes for realistic paths (Careerjet, Reed, Job Bank XML policy, sponsorship data sources, etc.).

JobOps ships BC T-Net and iCIMS tenant HTML extractors for two cases that are usually workable without vendor contracts; everything else in the old “long tail” list still lands best via Manual Import until someone promotes it to a manifest.

Still common manual-import targets

  • Wellfound (formerly AngelList), Otta, Welcome to the Jungle, Dice, Job Bank (unless you qualify for syndication), regional boards without stable feeds — use Manual Import or an external tool, then normalize here.