---
id: overview
title: Extractors Overview
description: Technical index of supported extractors and how they work.
sidebar_position: 1
---

This page helps you choose the right extractor for your run, understand key constraints, and navigate to detailed technical guides.

Extractor integrations are now registered through manifests and loaded automatically at orchestrator startup. Runtime discovery only scans `extractors/*/(manifest.ts|src/manifest.ts)` and does not read manifests from `orchestrator/**`. Extractor-specific run logic should also remain in `extractors/<name>/` so that the orchestrator stays source-agnostic. To add a new source, follow [Add an Extractor](/docs/next/workflows/add-an-extractor).

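The discovery rule above can be read as a path filter. The following is a minimal sketch of that rule, not the orchestrator's actual loader: `isDiscoverableManifest`, `discoverableManifests`, and the sample paths are hypothetical names chosen for illustration.

```typescript
// Hypothetical helper (an assumption, not the real loader): mirrors the
// documented pattern extractors/*/(manifest.ts|src/manifest.ts).
const MANIFEST_PATH = /^extractors\/[^/]+\/(?:src\/)?manifest\.ts$/;

function isDiscoverableManifest(repoRelativePath: string): boolean {
  // Normalize Windows separators so the check behaves the same on every OS.
  const normalized = repoRelativePath.replace(/\\/g, "/");
  return MANIFEST_PATH.test(normalized);
}

function discoverableManifests(paths: readonly string[]): string[] {
  return paths.filter(isDiscoverableManifest);
}

// Illustrative paths only; the directory names are assumptions.
const candidates = [
  "extractors/adzuna/manifest.ts",          // matches: manifest at extractor root
  "extractors/gradcracker/src/manifest.ts", // matches: manifest under src/
  "orchestrator/src/manifest.ts",           // ignored: outside extractors/
  "extractors/adzuna/src/lib/manifest.ts",  // ignored: nested deeper than src/
];
const found = discoverableManifests(candidates);
```

Under these assumptions, only the first two paths would be picked up, which is why a manifest placed under `orchestrator/**` is never loaded.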
## Extractor chooser

| Extractor | Best use case | Core constraints/dependencies | Notable controls | Output/behavior notes |
| --- | --- | --- | --- | --- |
| [Gradcracker](/docs/next/extractors/gradcracker) | UK graduate roles from Gradcracker | Crawling stability depends on page structure and anti-bot behavior; tuned for low concurrency | `GRADCRACKER_SEARCH_TERMS`, `GRADCRACKER_MAX_JOBS_PER_TERM`, `JOBOPS_SKIP_APPLY_FOR_EXISTING` | Scrapes listing metadata, then detail pages and apply URL resolution |
| [JobSpy](/docs/next/extractors/jobspy) | Multi-source discovery (Indeed, LinkedIn, Glassdoor) | Requires Python wrapper execution per term; source availability and quality vary by site/location | `JOBSPY_SITES`, `JOBSPY_SEARCH_TERMS`, `JOBSPY_RESULTS_WANTED`, `JOBSPY_HOURS_OLD`, `JOBSPY_LINKEDIN_FETCH_DESCRIPTION` | Produces JSON per term, then orchestrator normalizes and de-duplicates by `jobUrl` |
| [Adzuna](/docs/next/extractors/adzuna) | API-based multi-country discovery with low scraping overhead | Requires valid App ID/App Key; country must be in Adzuna-supported list | `ADZUNA_APP_ID`, `ADZUNA_APP_KEY`, `ADZUNA_MAX_JOBS_PER_TERM` | API pagination to dataset output; orchestrator maps progress and de-duplicates by `sourceJobId`/`jobUrl` |
| [Hiring Cafe](/docs/next/extractors/hiring-cafe) | Browser-backed discovery using Hiring Cafe search APIs | Subject to upstream anti-bot checks; uses browser context and encoded search-state payloads | `HIRING_CAFE_SEARCH_TERMS`, `HIRING_CAFE_COUNTRY`, `HIRING_CAFE_MAX_JOBS_PER_TERM`, `HIRING_CAFE_DATE_FETCHED_PAST_N_DAYS` | Uses existing pipeline term/country/budget knobs and maps directly to normalized jobs |
| [startup.jobs](/docs/next/extractors/startup-jobs) | Startup-focused discovery through the published `startup-jobs-scraper` package | No credentials required; detail enrichment depends on Playwright browser binaries being installed | Existing pipeline `searchTerms`, selected country/cities, `jobspyResultsWanted`; `npx playwright install` for fresh environments | Algolia-backed search plus detail-page enrichment via package import; orchestrator maps normalized records and de-duplicates by `jobUrl` |
| [UKVisaJobs](/docs/next/extractors/ukvisajobs) | UK visa sponsorship-focused roles | Requires authenticated session and periodic token/cookie refresh | `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`, `UKVISAJOBS_MAX_JOBS`, `UKVISAJOBS_SEARCH_KEYWORD` | API pagination to dataset output; orchestrator de-duplicates and may fetch missing descriptions |
| [SmartRecruiters](/docs/next/extractors/smartrecruiters) | Enterprise employers on SmartRecruiters public boards | No auth; needs configured company identifiers; one HTTP round-trip per posting for apply URLs and descriptions | `SMARTRECRUITERS_COMPANIES`, `SMARTRECRUITERS_MAX_JOBS_PER_COMPANY` | Paginates the public Posting API, filters by pipeline terms, normalizes to `CreateJobInput` |
| iCIMS tenants (HTML) | Large employers on iCIMS portals | No auth; HTML search varies by tenant, so maintain an explicit list of tenant hosts | `ICIMS_TENANTS`, Settings: `icimsTenants`, `icimsMaxJobsPerTenant`, `icimsMaxPagesPerSearch` | Fetches `/jobs/search` with iframe-style params, parses listing links, caps per tenant |
| BC T-Net (RSS) | British Columbia tech listings aggregated by T-Net | Canada-only; free RSS (default feed built-in); optional extra feeds | `BCTENET_RSS_URLS`, Settings: `bctenetRssUrls`, `bctenetMaxJobsPerTerm` | Fetches RSS item blocks, normalizes quirky CDATA link fragments, filters by pipeline terms |
| [Eluta](/docs/next/extractors/eluta) | Canadian listings aggregated from employer career sites (RSS) | Canada-only source (skipped when search geography is not Canada); RSS `location` strings must be set | `ELUTA_RSS_LOCATIONS`, `ELUTA_MAX_JOBS_PER_TERM` | Fetches one or more `eluta.ca` RSS feeds, filters by terms, de-duplicates by guid/URL |
| [QAJobsBoard](/docs/next/extractors/qajobsboard) | QA / SDET / automation-heavy board (global JSON feed) | No auth; the feed is global with no built-in geography filter, so filter by location downstream | `qajobsboardMaxJobsPerTerm` | Fetches JobBoardly JSON, filters by pipeline terms |
| [Arc.dev](/docs/next/extractors/arcdev) | Remote roles from Arc.dev listing pages (tool-tagged paths) | Parses SSR `__NEXT_DATA__`; relies on a stable Next.js payload | `ARC_REMOTE_JOBS_PATHS` (seeds defaults), `arcRemoteJobsPaths`, `arcMaxJobsPerPath` | Merges Arc-managed and external rows; de-duplicates by URL |
| [Manual Import](/docs/next/extractors/manual) | One-off jobs not covered by scrapers | Inference quality depends on model/provider and input quality; some URLs cannot be fetched reliably | App/API endpoints (`/api/manual-jobs/infer`, `/api/manual-jobs/import`) | Accepts text/HTML/URL, runs inference, then saves and scores job after review |

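Several rows mention de-duplication by `sourceJobId` and/or `jobUrl`. The sketch below shows one plausible way to combine the two keys, keeping the first record seen. It is an illustration under assumptions, not the orchestrator's implementation: the `JobRecord` shape and the `dedupeJobs` function are hypothetical, and only the `sourceJobId` and `jobUrl` field names come from the table.

```typescript
// Illustrative record shape (assumption); the real normalized job has more fields.
interface JobRecord {
  source: string;
  jobUrl: string;
  sourceJobId?: string;
  title: string;
}

// Hypothetical helper: drops a job if either its source-scoped ID or its URL
// has already been seen. The first occurrence wins.
function dedupeJobs(jobs: readonly JobRecord[]): JobRecord[] {
  const seen = new Set<string>();
  const unique: JobRecord[] = [];
  for (const job of jobs) {
    const keys = [`url:${job.jobUrl.trim()}`];
    if (job.sourceJobId) {
      keys.push(`id:${job.source}:${job.sourceJobId}`);
    }
    if (keys.some((key) => seen.has(key))) continue; // duplicate under either key
    for (const key of keys) seen.add(key);
    unique.push(job);
  }
  return unique;
}

// Sample data is made up for the example.
const sampleJobs: JobRecord[] = [
  { source: "adzuna", sourceJobId: "123", jobUrl: "https://example.com/jobs/1", title: "QA Engineer" },
  { source: "adzuna", sourceJobId: "123", jobUrl: "https://example.com/jobs/1?ref=feed", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/jobs/1", title: "QA Engineer" },
  { source: "jobspy", jobUrl: "https://example.com/jobs/2", title: "SDET" },
];
const uniqueJobs = dedupeJobs(sampleJobs);
```

In this sample the second record is dropped by ID and the third by URL, leaving two jobs. Scoping the ID key by `source` is a design choice of the sketch, made so that two boards reusing the same numeric ID do not collide.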
## Which extractor should I use?

- Use **JobSpy** for broad first-pass sourcing across common boards.
- Use **Adzuna** when you want API-first discovery in supported non-UK markets.
- Use **Hiring Cafe** when you want another term/country-driven source without adding credentials.
- Use **startup.jobs** when you want startup-heavy listings without maintaining another scraper locally.
- Use **Gradcracker** when targeting graduate pipelines in the UK.
- Use **UKVisaJobs** for sponsorship-specific UK searches.
- Use **SmartRecruiters** when you can list target employers’ public SmartRecruiters company identifiers.
- Use **iCIMS tenants** when you can list target `*.icims.com` career hosts (anonymous portal HTML search).
- Use **BC T-Net** for British Columbia tech RSS listings (runs only when search geography is Canada).
- Use **Eluta** for Canadian employer-direct listings via RSS (set metro/province `location` strings).
- Use **QAJobsBoard** or **Arc.dev** when you want QA- or remote-stack-focused feeds without extra credentials.
- Use **Manual Import** when you already have a specific posting and need direct import.

Many runs combine sources: broad discovery first, then manual import for high-priority jobs that scraping misses.

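The Canada-only behavior of BC T-Net and Eluta can be pictured as a geography gate applied before a run. This is a rough sketch under assumptions: the `SourceRule` shape, the `runnableSources` function, and the source IDs are hypothetical and are not the registry's real keys.

```typescript
// Hypothetical rule shape: `onlyCountry` marks a source tied to one geography.
interface SourceRule {
  id: string;
  onlyCountry?: string;
}

// Illustrative IDs (assumptions). Only the Canada-only restriction on
// BC T-Net and Eluta comes from the documentation above.
const SOURCE_RULES: readonly SourceRule[] = [
  { id: "jobspy" },
  { id: "adzuna" },
  { id: "bctnet", onlyCountry: "canada" },
  { id: "eluta", onlyCountry: "canada" },
];

function runnableSources(requested: readonly string[], searchCountry: string): string[] {
  const country = searchCountry.trim().toLowerCase();
  return requested.filter((id) => {
    const rule = SOURCE_RULES.find((r) => r.id === id);
    if (!rule) return false; // unknown source: nothing to run
    return rule.onlyCountry === undefined || rule.onlyCountry === country;
  });
}

const ukRun = runnableSources(["jobspy", "bctnet", "eluta"], "United Kingdom");
const canadaRun = runnableSources(["jobspy", "bctnet", "eluta"], "Canada");
```

With these rules, a United Kingdom search keeps only the unrestricted source, while a Canada search keeps all three.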
### QA-focused boards (shipped extractors)

- **[QAJobsBoard](/docs/next/extractors/qajobsboard)**: Large QA-oriented index via public JSON; filter geography downstream.
- **[Arc.dev](/docs/next/extractors/arcdev)**: Remote feeds (e.g. Playwright / Cypress paths); good for vetted remote slices.

### Canadian QA contracting firms (reference)

For staffing and consultancy firms that frequently post QA automation contracts, see **[Canadian / NA QA contracting firms](/docs/next/extractors/qa-contract-staffing-canada)**, which collects scrape hints and CLI probes.

### Canadian employers — QA-strong ATS (reference)

For direct ATS JSON and extractor wiring for well-known Canadian tech brands (Ashby, Greenhouse, Lever, Workday, SmartRecruiters), see **[Canadian companies — strong QA orgs and scrapable ATS](/docs/next/extractors/canadian-companies-qa-ats)**.

## Supplementary job boards

Some boards are **credential-gated**, **approval-gated**, or **scraping-hostile**. See **[Supplementary sources — access notes](/docs/next/extractors/supplementary-sources-access-notes)** for realistic paths (Careerjet, Reed, Job Bank XML policy, sponsorship data sources, etc.).

JobOps ships **BC T-Net** and **iCIMS tenant HTML** extractors for two cases that are usually workable without vendor contracts. Everything else in the old “long tail” list is still best handled through **[Manual Import](/docs/next/extractors/manual)** until someone promotes it to a manifest.

### Still common manual-import targets

- **Wellfound** (formerly AngelList), **Otta**, **Welcome to the Jungle**, **Dice**, **Job Bank** (unless you qualify for syndication), and regional boards without stable feeds: use Manual Import or an external tool, then normalize here.

## Related extractor docs

- [Gradcracker](/docs/next/extractors/gradcracker)
- [JobSpy](/docs/next/extractors/jobspy)
- [Adzuna](/docs/next/extractors/adzuna)
- [Hiring Cafe](/docs/next/extractors/hiring-cafe)
- [startup.jobs](/docs/next/extractors/startup-jobs)
- [UKVisaJobs](/docs/next/extractors/ukvisajobs)
- [SmartRecruiters](/docs/next/extractors/smartrecruiters)
- [Supplementary sources — access notes](/docs/next/extractors/supplementary-sources-access-notes)
- [Eluta](/docs/next/extractors/eluta)
- [QAJobsBoard](/docs/next/extractors/qajobsboard)
- [Arc.dev](/docs/next/extractors/arcdev)
- [Canadian / NA QA contracting firms](/docs/next/extractors/qa-contract-staffing-canada)
- [Canadian companies — QA-strong ATS](/docs/next/extractors/canadian-companies-qa-ats)
- [Manual Import](/docs/next/extractors/manual)
- [Add an Extractor](/docs/next/workflows/add-an-extractor)