---
id: overview
title: Extractors Overview
description: Technical index of supported extractors and how they work.
sidebar_position: 1
---

This page helps you choose the right extractor for your run, understand key constraints, and navigate to detailed technical guides.

Extractor integrations are registered through manifests and loaded automatically at orchestrator startup. Runtime discovery scans only `extractors/*/(manifest.ts|src/manifest.ts)` and does not read manifests from `orchestrator/**`. Keep extractor-specific run logic in `extractors/<name>/` as well, so the orchestrator stays source-agnostic. To add a new source, follow [Add an Extractor](/docs/next/workflows/add-an-extractor).
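To make the discovery rule concrete, here is a minimal sketch of a path check that mirrors the pattern above. The helper name `isDiscoverableManifest`, the regular expression, and the sample paths are illustrative assumptions, not the orchestrator's actual loader code.

```typescript
// Hypothetical helper: returns true when a repo-relative path matches
// `extractors/*/(manifest.ts|src/manifest.ts)`, the pattern described above.
const MANIFEST_PATTERN = /^extractors\/[^/]+\/(?:src\/)?manifest\.ts$/;

function isDiscoverableManifest(relativePath: string): boolean {
  // Normalize Windows separators so the check behaves the same on every OS.
  const normalized = relativePath.replace(/\\/g, "/");
  return MANIFEST_PATTERN.test(normalized);
}

// Illustrative paths only; they are not a listing of the real repository.
const candidates: string[] = [
  "extractors/adzuna/manifest.ts", // matches: manifest at the extractor root
  "extractors/jobspy/src/manifest.ts", // matches: manifest under src/
  "orchestrator/src/extractors/manifest.ts", // ignored: lives under orchestrator/
  "extractors/adzuna/src/nested/manifest.ts", // ignored: nested too deeply
];

const discovered = candidates.filter(isDiscoverableManifest);
console.log(discovered);
```

Under these assumptions, a manifest placed anywhere other than the extractor root or its `src/` directory would not be picked up at startup.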

## Extractor chooser

| Extractor | Best use case | Core constraints/dependencies | Notable controls | Output/behavior notes |
| --- | --- | --- | --- | --- |
| [Gradcracker](/docs/next/extractors/gradcracker) | UK graduate roles from Gradcracker | Crawling stability depends on page structure and anti-bot behavior; tuned for low concurrency | `GRADCRACKER_SEARCH_TERMS`, `GRADCRACKER_MAX_JOBS_PER_TERM`, `JOBOPS_SKIP_APPLY_FOR_EXISTING` | Scrapes listing metadata, then detail pages and apply URL resolution |
| [JobSpy](/docs/next/extractors/jobspy) | Multi-source discovery (Indeed, LinkedIn, Glassdoor) | Requires Python wrapper execution per term; source availability and quality vary by site/location | `JOBSPY_SITES`, `JOBSPY_SEARCH_TERMS`, `JOBSPY_RESULTS_WANTED`, `JOBSPY_HOURS_OLD`, `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` | Produces JSON per term, then orchestrator normalizes and de-duplicates by `jobUrl` |
| [Adzuna](/docs/next/extractors/adzuna) | API-based multi-country discovery with low scraping overhead | Requires valid App ID/App Key; country must be in Adzuna-supported list | `ADZUNA_APP_ID`, `ADZUNA_APP_KEY`, `ADZUNA_MAX_JOBS_PER_TERM` | API pagination to dataset output; orchestrator maps progress and de-duplicates by `sourceJobId`/`jobUrl` |
| [Hiring Cafe](/docs/next/extractors/hiring-cafe) | Browser-backed discovery using Hiring Cafe search APIs | Subject to upstream anti-bot checks; uses browser context and encoded search-state payloads | `HIRING_CAFE_SEARCH_TERMS`, `HIRING_CAFE_COUNTRY`, `HIRING_CAFE_MAX_JOBS_PER_TERM`, `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` | Uses existing pipeline term/country/budget knobs and maps directly to normalized jobs |
| [startup.jobs](/docs/next/extractors/startup-jobs) | Startup-focused discovery through the published `startup-jobs-scraper` package | No credentials required; detail enrichment depends on Playwright browser binaries being installed | existing pipeline `searchTerms`, selected country/cities, `jobspyResultsWanted`; `npx playwright install` for fresh environments | Algolia-backed search plus detail-page enrichment via package import; orchestrator maps normalized records and de-duplicates by `jobUrl` |
| [UKVisaJobs](/docs/next/extractors/ukvisajobs) | UK visa sponsorship-focused roles | Requires authenticated session and periodic token/cookie refresh | `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`, `UKVISAJOBS_MAX_JOBS`, `UKVISAJOBS_SEARCH_KEYWORD` | API pagination to dataset output; orchestrator de-duplicates and may fetch missing descriptions |
| [SmartRecruiters](/docs/next/extractors/smartrecruiters) | Enterprise employers on SmartRecruiters public boards | No auth; needs configured company identifiers; one HTTP round-trip per posting for apply URLs + descriptions | `SMARTRECRUITERS_COMPANIES`, `SMARTRECRUITERS_MAX_JOBS_PER_COMPANY` | Paginates the public Posting API, filters by pipeline terms, normalizes to `CreateJobInput` |
| iCIMS tenants (HTML) | Large employers on iCIMS portals | No auth; HTML search varies by tenant — maintain explicit tenant hosts | `ICIMS_TENANTS`, Settings: `icimsTenants`, `icimsMaxJobsPerTenant`, `icimsMaxPagesPerSearch` | Fetches `/jobs/search` with iframe-style params, parses listing links, caps per tenant |
| BC T-Net (RSS) | BC tech aggregate via T-Net | Canada-only; free RSS (default feed built-in); optional extra feeds | `BCTENET_RSS_URLS`, Settings: `bctenetRssUrls`, `bctenetMaxJobsPerTerm` | Fetches RSS item blocks, normalizes quirky CDATA link fragments, filters by pipeline terms |
| [Eluta](/docs/next/extractors/eluta) | Canadian listings aggregated from employer career sites (RSS) | Canada-only source (skipped when search geography is not Canada); RSS `location` strings must be set | `ELUTA_RSS_LOCATIONS`, `ELUTA_MAX_JOBS_PER_TERM` | Fetches one or more `eluta.ca` RSS feeds, filters by terms, de-duplicates by guid/URL |
| [QAJobsBoard](/docs/next/extractors/qajobsboard) | QA / SDET / automation-heavy board (global JSON feed) | No auth; the feed is global, so geography must be handled manually or filtered downstream | `qajobsboardMaxJobsPerTerm` | Fetches JobBoardly JSON, filters by pipeline terms |
| [Arc.dev](/docs/next/extractors/arcdev) | Remote roles from Arc.dev listing pages (tool-tagged paths) | Parses SSR `__NEXT_DATA__`; relies on a stable Next.js payload | `ARC_REMOTE_JOBS_PATHS` (seeds defaults), `arcRemoteJobsPaths`, `arcMaxJobsPerPath` | Merges Arc-managed + external rows; de-duplicates by URL |
| [Manual Import](/docs/next/extractors/manual) | One-off jobs not covered by scrapers | Inference quality depends on model/provider and input quality; some URLs cannot be fetched reliably | App/API endpoints (`/api/manual-jobs/infer`, `/api/manual-jobs/import`) | Accepts text/HTML/URL, runs inference, then saves and scores job after review |
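Several rows above mention de-duplication by `sourceJobId` and/or `jobUrl`. The sketch below shows one way such a pass could work; the `JobRecord` shape and the `dedupeJobs` helper are assumptions made for illustration and may not match the orchestrator's real types or its exact matching rules.

```typescript
// Illustrative record shape; the real normalized job type may differ.
interface JobRecord {
  source: string;
  sourceJobId?: string;
  jobUrl: string;
  title: string;
}

// Hypothetical helper: keeps the first occurrence of each job and treats a
// later record as a duplicate when either its source-scoped ID or its URL
// has already been seen.
function dedupeJobs(jobs: JobRecord[]): JobRecord[] {
  const seen = new Set<string>();
  const unique: JobRecord[] = [];
  for (const job of jobs) {
    const keys: string[] = [`url:${job.jobUrl}`];
    if (job.sourceJobId) {
      keys.push(`id:${job.source}:${job.sourceJobId}`);
    }
    if (keys.some((key) => seen.has(key))) {
      continue; // duplicate by ID or by URL
    }
    for (const key of keys) {
      seen.add(key);
    }
    unique.push(job);
  }
  return unique;
}

// Made-up sample data for demonstration.
const sample: JobRecord[] = [
  { source: "adzuna", sourceJobId: "1", jobUrl: "https://example.com/a", title: "QA Engineer" },
  { source: "adzuna", sourceJobId: "1", jobUrl: "https://example.com/b", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/a", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/c", title: "SDET" },
];

const uniqueJobs = dedupeJobs(sample);
console.log(uniqueJobs.map((job) => job.jobUrl));
```

Checking both keys means a posting that reappears under a different URL, or without an ID, is still collapsed into the first record seen.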

## Which extractor should I use?

- Use **JobSpy** for broad first-pass sourcing across common boards.
- Use **Adzuna** when you want API-first discovery in supported non-UK markets.
- Use **Hiring Cafe** when you want another term/country-driven source without adding credentials.
- Use **startup.jobs** when you want startup-heavy listings without maintaining another scraper locally.
- Use **Gradcracker** when targeting graduate pipelines in the UK.
- Use **UKVisaJobs** for sponsorship-specific UK searches.
- Use **SmartRecruiters** when you can list target employers’ public SmartRecruiters company identifiers.
- Use **iCIMS tenants** when you can list target `*.icims.com` career hosts (anonymous portal HTML search).
- Use **BC T-Net** for British Columbia tech RSS listings (runs only when search geography is Canada).
- Use **Eluta** for Canadian employer-direct listings via RSS (set metro/province `location` strings).
- Use **QAJobsBoard** or **Arc.dev** when you want QA- or remote-stack-focused feeds without extra credentials.
- Use **Manual Import** when you already have a specific posting and need direct import.

Many runs combine sources: broad discovery first, then manual import for high-priority jobs that scraping misses.
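The feed-based sources above are described as filtering by pipeline terms, and the Canadian ones as running only when the search geography is Canada. A rough sketch of both checks follows; the function names, the case-insensitive substring match, and the `"canada"` string comparison are all assumptions for illustration, since the actual extractors may gate and match differently.

```typescript
// Hypothetical gate: Canada-only sources are skipped for any other geography.
function shouldRunCanadaOnlySource(searchCountry: string): boolean {
  return searchCountry.trim().toLowerCase() === "canada";
}

// Hypothetical term filter: keeps an item when any non-empty search term
// appears in its title (case-insensitive substring match).
function matchesAnyTerm(title: string, searchTerms: string[]): boolean {
  const haystack = title.toLowerCase();
  return searchTerms.some((term) => {
    const needle = term.trim().toLowerCase();
    return needle.length > 0 && haystack.includes(needle);
  });
}

// Made-up feed titles and terms for demonstration.
const feedTitles: string[] = [
  "Senior QA Automation Engineer",
  "Backend Developer",
  "SDET (Playwright)",
];
const terms: string[] = ["qa automation", "sdet"];

const kept = shouldRunCanadaOnlySource("Canada")
  ? feedTitles.filter((title) => matchesAnyTerm(title, terms))
  : [];
console.log(kept);
```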

### QA-focused boards (shipped extractors)

- **[QAJobsBoard](/docs/next/extractors/qajobsboard)** — Large QA-oriented index via public JSON; filter geography downstream.
- **[Arc.dev](/docs/next/extractors/arcdev)** — Remote feeds (e.g. Playwright / Cypress paths); good for vetted remote slices.

### Canadian QA contracting firms (reference)

For staffing and consultancy firms that frequently post QA automation contracts, including scrape hints and CLI probes, see **[Canadian / NA QA contracting firms](/docs/next/extractors/qa-contract-staffing-canada)**.

### Canadian employers — QA-strong ATS (reference)

For direct ATS JSON and extractor wiring covering well-known Canadian tech brands (Ashby, Greenhouse, Lever, Workday, SmartRecruiters), see **[Canadian companies — strong QA orgs and scrapable ATS](/docs/next/extractors/canadian-companies-qa-ats)**.

## Supplementary job boards

Some boards are **credential-gated**, **approval-gated**, or **scraping-hostile** — see **[Supplementary sources — access notes](/docs/next/extractors/supplementary-sources-access-notes)** for realistic paths (Careerjet, Reed, Job Bank XML policy, sponsorship data sources, etc.).

JobOps ships **BC T-Net** and **iCIMS tenant HTML** extractors for two cases that are usually workable without vendor contracts. Every other board from the old “long tail” list is still best handled via **[Manual Import](/docs/next/extractors/manual)** until it is promoted to a manifest-registered extractor.

### Still common manual-import targets

- **Wellfound** (formerly AngelList), **Otta**, **Welcome to the Jungle**, **Dice**, **Job Bank** (unless you qualify for syndication), regional boards without stable feeds — use Manual Import or an external tool, then normalize here.

## Related extractor docs

- [Gradcracker](/docs/next/extractors/gradcracker)
- [JobSpy](/docs/next/extractors/jobspy)
- [Adzuna](/docs/next/extractors/adzuna)
- [Hiring Cafe](/docs/next/extractors/hiring-cafe)
- [startup.jobs](/docs/next/extractors/startup-jobs)
- [UKVisaJobs](/docs/next/extractors/ukvisajobs)
- [SmartRecruiters](/docs/next/extractors/smartrecruiters)
- [Supplementary sources — access notes](/docs/next/extractors/supplementary-sources-access-notes)
- [Eluta](/docs/next/extractors/eluta)
- [QAJobsBoard](/docs/next/extractors/qajobsboard)
- [Arc.dev](/docs/next/extractors/arcdev)
- [Canadian / NA QA contracting firms](/docs/next/extractors/qa-contract-staffing-canada)
- [Canadian companies — QA-strong ATS](/docs/next/extractors/canadian-companies-qa-ats)
- [Manual Import](/docs/next/extractors/manual)
- [Add an Extractor](/docs/next/workflows/add-an-extractor)