Jobber/docs-site/docs/workflows/add-an-extractor.md
ilia c840f289e1
feat(extractors): expand catalog, smoke coverage, and sourcing docs
Adds Arc.dev, BC T-Net, Eluta, iCIMS tenants, QAJobsBoard, and SmartRecruiters
manifests with registry/settings/UI wiring; registers full extractor list in
smoke-extractors and documents supplementary board access paths. Aligns Careerjet
v4 with the url query parameter and fixes strict typing in QAJobsBoard.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-15 22:36:23 -04:00


---
id: add-an-extractor
title: Add an Extractor
description: How to add a new extractor using the manifest contract and shared extractor catalog.
sidebar_position: 2
---
## What it is
This guide explains how to add a new extractor that is auto-registered at orchestrator startup.
The extractor runtime is discovered from a local `manifest.ts` file, and the source is type-safe across API/client through the shared catalog in `shared/src/extractors/index.ts`.
Extractor manifests must live in extractor packages under `extractors/<name>/` only. Do not add manifest files inside `orchestrator/`.
Extractor run logic should also live in the extractor package so that the orchestrator stays extractor-agnostic.
## Why it exists
Before the manifest contract, adding an extractor required touching multiple orchestrator files.
With the manifest system, contributors only need to:
1. Add a manifest in their extractor package.
2. Add the new source id to the shared typed catalog.
That keeps runtime wiring dynamic while preserving compile-time safety in API and client code.
## How to use it
1. Create your extractor package under `extractors/<name>/`.
2. Add a `manifest.ts` in the extractor package root (or `src/manifest.ts`).
   - Valid locations are only `extractors/<name>/manifest.ts` or `extractors/<name>/src/manifest.ts`.
   - `orchestrator/**/manifest.ts` is not used for extractor discovery.
3. Export a manifest with:
   - `id`
   - `displayName`
   - `providesSources`
   - `requiredEnvVars` (optional)
   - `run(context)` that returns `{ success, jobs, error? }`
4. Add the new source id to `shared/src/extractors/index.ts`:
   - append to `EXTRACTOR_SOURCE_IDS`
   - add an entry in `EXTRACTOR_SOURCE_METADATA`
5. Ensure your extractor maps output to `CreateJobInput[]`.
6. Register it in `scripts/smoke-extractors.ts` (`ALL_TARGETS`): add one row per manifest so that `npx tsx scripts/smoke-extractors.ts` exercises every shipped extractor. Sources that need API keys report `SKIP` until their env vars are set.
7. Run the full CI checks.
Example manifest:
```ts
import type { ExtractorManifest } from "@shared/types/extractors";

export const manifest: ExtractorManifest = {
  id: "myextractor",
  displayName: "My Extractor",
  providesSources: ["myextractor"],
  requiredEnvVars: ["MYEXTRACTOR_API_KEY"],
  async run(context) {
    // context.searchTerms, context.settings, context.onProgress, context.shouldCancel
    const jobs = [];
    return { success: true, jobs };
  },
};

export default manifest;
```
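To make the `CreateJobInput[]` mapping step more concrete, here is a minimal sketch of how raw upstream records might be turned into job inputs before `run` returns them. `RawPosting`, `JobInputSketch`, and `toJobInputs` are hypothetical names used only for illustration. The real `CreateJobInput` fields are defined in the shared package and may differ from the stand-in below.

```ts
// Hypothetical raw record shape returned by an upstream job board API.
interface RawPosting {
  id: string | number;
  title?: string;
  company?: string;
  url?: string;
}

// Hypothetical stand-in for CreateJobInput; check the shared package for the real fields.
interface JobInputSketch {
  source: string;
  sourceJobId: string;
  title: string;
  employer: string;
  jobUrl: string;
}

// Map raw postings to job inputs, dropping records that lack a title or URL.
function toJobInputs(source: string, raw: RawPosting[]): JobInputSketch[] {
  const jobs: JobInputSketch[] = [];
  for (const posting of raw) {
    if (!posting.title || !posting.url) continue;
    jobs.push({
      source,
      sourceJobId: String(posting.id),
      title: posting.title.trim(),
      employer: posting.company?.trim() || "Unknown",
      jobUrl: posting.url,
    });
  }
  return jobs;
}
```

Keeping the mapping in a small pure function like this makes it easy to unit test without calling the upstream API.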
Subprocess extractors are supported. Keep subprocess spawning inside `run(context)` so that the orchestrator depends only on the manifest contract.
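As a rough illustration of a subprocess extractor, the sketch below wraps Node's `child_process.spawn` and converts the outcome into the `{ success, jobs, error? }` result shape. It assumes the child process prints a JSON array of jobs to stdout. That convention, and the helper names `runCommand` and `runViaSubprocess`, are assumptions for this example and not part of the project's contract.

```ts
import { spawn } from "node:child_process";

// Run a command and resolve with its stdout; reject on spawn errors or a non-zero exit.
function runCommand(command: string, args: string[]): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    child.stdin.end(); // the child gets no input in this sketch
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.stderr.on("data", (chunk: string) => {
      stderr += chunk;
    });
    child.on("error", reject);
    child.on("close", (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`exit code ${code}: ${stderr.trim()}`));
    });
  });
}

// Hypothetical body for run(context): failures become an error result instead of a throw.
async function runViaSubprocess(
  command: string,
  args: string[],
): Promise<{ success: boolean; jobs: unknown[]; error?: string }> {
  try {
    const parsed: unknown = JSON.parse(await runCommand(command, args));
    if (!Array.isArray(parsed)) {
      return { success: false, jobs: [], error: "subprocess did not print a JSON array" };
    }
    return { success: true, jobs: parsed };
  } catch (err) {
    return { success: false, jobs: [], error: err instanceof Error ? err.message : String(err) };
  }
}
```

A manifest could call `runViaSubprocess` from its `run(context)` and map the parsed records to `CreateJobInput[]` before returning them.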
## Common problems
### Extractor not discovered at startup
- Check file path: `extractors/<name>/manifest.ts` or `extractors/<name>/src/manifest.ts`.
- Ensure the file exports `default` or named `manifest`.
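When debugging discovery, a small script can confirm which packages have a manifest at one of the two documented locations. The sketch below only mirrors the path rule stated in this guide. It is not the orchestrator's actual discovery code, `findManifests` is a hypothetical helper, and preferring the root `manifest.ts` when both files exist is an assumption.

```ts
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// The two manifest locations this guide lists as valid, relative to the package root.
const MANIFEST_CANDIDATES = ["manifest.ts", join("src", "manifest.ts")];

// For each package directory under extractorsRoot, return the manifest path found, or null.
function findManifests(extractorsRoot: string): Record<string, string | null> {
  const result: Record<string, string | null> = {};
  for (const entry of readdirSync(extractorsRoot, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const found = MANIFEST_CANDIDATES
      .map((candidate) => join(extractorsRoot, entry.name, candidate))
      .find((candidatePath) => existsSync(candidatePath));
    result[entry.name] = found ?? null;
  }
  return result;
}
```

Calling `findManifests("extractors")` from the repository root and looking for `null` values gives a quick list of packages that would be skipped under the path rule.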
### Source compiles in extractor but fails in API/client
- Add the new source id to `shared/src/extractors/index.ts`.
- Confirm metadata exists for that source id.
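The compile-time link between the two catalog entries can be sketched as follows. This is an illustrative shape only: the constant names match the ones mentioned in this guide, but the entries, the `SourceMetadata` fields, and the `isExtractorSourceId` guard are assumptions, and the real definitions in `shared/src/extractors/index.ts` may look different.

```ts
// Hypothetical slice of the catalog; the real list lives in shared/src/extractors/index.ts.
const EXTRACTOR_SOURCE_IDS = ["adzuna", "myextractor"] as const;

// Union of the literal ids above: "adzuna" | "myextractor".
type ExtractorSourceId = (typeof EXTRACTOR_SOURCE_IDS)[number];

// Hypothetical metadata shape; the real fields may differ.
interface SourceMetadata {
  label: string;
}

// Record<ExtractorSourceId, ...> makes the compiler reject a missing or misspelled id.
const EXTRACTOR_SOURCE_METADATA: Record<ExtractorSourceId, SourceMetadata> = {
  adzuna: { label: "Adzuna" },
  myextractor: { label: "My Extractor" },
};

// Runtime guard for ids that arrive from untyped input such as request bodies.
function isExtractorSourceId(value: string): value is ExtractorSourceId {
  return (EXTRACTOR_SOURCE_IDS as readonly string[]).includes(value);
}
```

Under this pattern, appending an id without a matching metadata entry fails type checking, which is the kind of error this section describes.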
### Smoke connectivity
After wiring settings/env, run:
```bash
npx tsx scripts/smoke-extractors.ts myextractor
```
Or run the full suite, which may take several minutes because JobSpy invokes Python and Hiring Cafe / startup.jobs may need browser dependencies:
```bash
npx tsx scripts/smoke-extractors.ts
```
Keep `ALL_TARGETS` in that script aligned with manifests under each `extractors/<name>/` package (`manifest.ts` or `src/manifest.ts`).
### Source appears in shared catalog but is unavailable at runtime
- The manifest was not loaded successfully.
- Check startup logs for registry warnings.
### Source requires credentials but never returns jobs
- Add and validate `requiredEnvVars`.
- Verify your manifest `run(context)` reads settings/env values correctly.
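One way to make a missing credential visible is to check the declared variables at the start of `run(context)` and return an explicit error. The helper below is a minimal sketch. `missingEnvVars` is a hypothetical name, and whether the orchestrator already performs a similar check based on `requiredEnvVars` is not covered here.

```ts
// Return the names from `required` that are unset or blank in `env`.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined> = process.env,
): string[] {
  return required.filter((name) => !env[name]?.trim());
}

// Hypothetical use at the top of run(context): fail fast with a readable error.
function credentialError(required: readonly string[]): string | undefined {
  const missing = missingEnvVars(required);
  return missing.length > 0 ? `Missing env vars: ${missing.join(", ")}` : undefined;
}
```

Returning this string as `error` with `success: false` tends to be easier to diagnose than an empty `jobs` array.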
## Related pages
- [Extractors Overview](/docs/next/extractors/overview)
- [Adzuna Extractor](/docs/next/extractors/adzuna)
- [Hiring Cafe Extractor](/docs/next/extractors/hiring-cafe)
- [UKVisaJobs Extractor](/docs/next/extractors/ukvisajobs)