Jobber/docs-site/docs/workflows/add-an-extractor.md
ilia c840f289e1
2026-05-15 22:36:23 -04:00


id: add-an-extractor
title: Add an Extractor
description: How to add a new extractor using the manifest contract and shared extractor catalog.
sidebar_position: 2

What it is

This guide explains how to add a new extractor that is auto-registered at orchestrator startup.

The orchestrator discovers each extractor at runtime from a manifest.ts file in the extractor's own package. Source ids stay type-safe across the API and client through the shared catalog in shared/src/extractors/index.ts.

Extractor manifests must live in extractor packages under extractors/<name>/ only. Do not add manifest files inside orchestrator/. Extractor run logic should also live in the extractor package so orchestrator stays extractor-agnostic.
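The path rule above can be sketched as a small check. This is only an illustration of the two allowed locations, not the orchestrator's actual discovery code; the helper names manifestCandidates and isValidManifestPath are hypothetical.

```typescript
// Hypothetical helpers: they encode the two manifest locations this guide
// allows, relative to the repository root. Not the real discovery code.
function manifestCandidates(extractorName: string): string[] {
  return [
    `extractors/${extractorName}/manifest.ts`,
    `extractors/${extractorName}/src/manifest.ts`,
  ];
}

// True only for extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
function isValidManifestPath(path: string): boolean {
  return /^extractors\/[^/]+\/(src\/)?manifest\.ts$/.test(path);
}
```

A manifest under orchestrator/ or in a deeper folder of the extractor package fails this check, which matches the rule that only the package root and src/ are considered.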

Why it exists

Before the manifest contract, adding an extractor required touching multiple orchestrator files.

With the manifest system, contributors only need to:

  1. Add a manifest in their extractor package.
  2. Add the new source id to the shared typed catalog.

That keeps runtime wiring dynamic while preserving compile-time safety in API and client code.
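One way a typed catalog can provide that compile-time safety is sketched below. The names EXTRACTOR_SOURCE_IDS and EXTRACTOR_SOURCE_METADATA come from this guide, but the ids listed, the metadata shape (a single label field), and the isExtractorSourceId guard are assumptions for illustration; the real file may look different.

```typescript
// Illustrative sketch only; the real catalog is shared/src/extractors/index.ts.
const EXTRACTOR_SOURCE_IDS = ["adzuna", "gradcracker", "myextractor"] as const;

// Union type derived from the tuple: "adzuna" | "gradcracker" | "myextractor".
type ExtractorSourceId = (typeof EXTRACTOR_SOURCE_IDS)[number];

// Assumed metadata shape; only `label` is modelled here.
interface ExtractorSourceMetadata {
  label: string;
}

// Record<ExtractorSourceId, ...> makes the compiler reject a missing entry,
// so appending an id without adding metadata fails type checking.
const EXTRACTOR_SOURCE_METADATA: Record<ExtractorSourceId, ExtractorSourceMetadata> = {
  adzuna: { label: "Adzuna" },
  gradcracker: { label: "Gradcracker" },
  myextractor: { label: "My Extractor" },
};

// Hypothetical runtime guard for untrusted input such as query parameters.
function isExtractorSourceId(value: string): value is ExtractorSourceId {
  return (EXTRACTOR_SOURCE_IDS as readonly string[]).includes(value);
}
```

Deriving the union from a single as const tuple keeps the id list and the type in one place, so API and client code that switch on the source id get exhaustiveness checking without a second declaration.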

How to use it

  1. Create your extractor package under extractors/<name>/.
  2. Add a manifest.ts in the extractor package root (or src/manifest.ts).
    • Valid locations are only extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
    • orchestrator/**/manifest.ts is not used for extractor discovery.
  3. Export a manifest with:
    • id
    • displayName
    • providesSources
    • requiredEnvVars (optional)
    • run(context) that returns { success, jobs, error? }
  4. Add the new source id to shared/src/extractors/index.ts:
    • append to EXTRACTOR_SOURCE_IDS
    • add an entry in EXTRACTOR_SOURCE_METADATA
  5. Ensure your extractor maps output to CreateJobInput[].
  6. Register it in ALL_TARGETS in scripts/smoke-extractors.ts: add one row per manifest so that npx tsx scripts/smoke-extractors.ts exercises every shipped extractor. Sources that need API keys are reported as SKIP until their env vars are set.
  7. Run the full CI checks.
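Step 5 above (mapping output to CreateJobInput[]) might look roughly like the following. JobInputSketch is a stand-in for CreateJobInput, and both its fields and the RawListing shape are hypothetical; check the real type in the shared package before copying anything.

```typescript
// Stand-in for CreateJobInput; the real type lives in the shared package
// and its fields may differ from these illustrative ones.
interface JobInputSketch {
  source: string;
  title: string;
  employer: string;
  jobUrl: string;
}

// Hypothetical raw record, as an upstream job board API might return it.
interface RawListing {
  name?: string;
  company?: string;
  link?: string;
}

// Convert raw listings, dropping any record without a title or URL.
function toJobInputs(source: string, listings: RawListing[]): JobInputSketch[] {
  const jobs: JobInputSketch[] = [];
  for (const listing of listings) {
    const title = listing.name?.trim();
    const jobUrl = listing.link?.trim();
    if (!title || !jobUrl) continue; // skip incomplete records
    jobs.push({
      source,
      title,
      employer: listing.company?.trim() || "Unknown",
      jobUrl,
    });
  }
  return jobs;
}
```

Keeping the mapping in a pure function like this makes it easy to unit test without network access, and run(context) can then return { success: true, jobs: toJobInputs(...) }.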

Example manifest:

import type { ExtractorManifest } from "@shared/types/extractors";

export const manifest: ExtractorManifest = {
  id: "myextractor",
  displayName: "My Extractor",
  providesSources: ["myextractor"],
  requiredEnvVars: ["MYEXTRACTOR_API_KEY"],
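  async run(context) {
    // context exposes searchTerms, settings, onProgress, and shouldCancel.
    // Fetch listings here and map them to CreateJobInput[] before returning.
    // The empty literal is typed from the manifest's return type, which avoids
    // the implicit any[] that an untyped `const jobs = []` triggers under strict settings.
    return { success: true, jobs: [] };
  },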
};

export default manifest;

Subprocess extractors are supported. Keep subprocess spawning inside run(context) so orchestrator only depends on the manifest contract.
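A minimal sketch of a subprocess-backed extractor follows, assuming the child process prints one JSON object per line on stdout. That output format, the JobLike shape, and the helper names parseJobLines and runSubprocess are all assumptions for illustration; only spawn from node:child_process is a real API.

```typescript
import { spawn } from "node:child_process";

// Hypothetical job shape; the real CreateJobInput type has its own fields.
interface JobLike {
  title: string;
  url: string;
}

interface RunResult {
  success: boolean;
  jobs: JobLike[];
  error?: string;
}

// Parse newline-delimited JSON emitted by the child process (assumed format).
function parseJobLines(stdout: string): JobLike[] {
  return stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as JobLike);
}

// Spawn the child, collect its output, and translate the exit code into the
// { success, jobs, error? } shape that run(context) returns.
function runSubprocess(command: string, args: string[]): Promise<RunResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    let stdout = "";
    let stderr = "";
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => { stdout += chunk; });
    child.stderr.on("data", (chunk: string) => { stderr += chunk; });
    // Fires when the command cannot be started at all (for example, not found).
    child.on("error", (err) => resolve({ success: false, jobs: [], error: err.message }));
    child.on("close", (code) => {
      if (code !== 0) {
        resolve({ success: false, jobs: [], error: stderr.trim() || `exit code ${code}` });
        return;
      }
      try {
        resolve({ success: true, jobs: parseJobLines(stdout) });
      } catch (err) {
        resolve({ success: false, jobs: [], error: `invalid output: ${String(err)}` });
      }
    });
  });
}
```

Inside a manifest, run(context) could simply await runSubprocess(...) and return the result, so the orchestrator never needs to know that a child process is involved.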

Common problems

Extractor not discovered at startup

  • Check file path: extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
  • Ensure the file exports the manifest as the default export or as a named export called manifest.
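To debug an export problem, it can help to check the loaded module the way a registry plausibly would. The sketch below is an assumption about that logic, not the orchestrator's code: ManifestLike is a reduced stand-in for ExtractorManifest, and checking default before the named export is a guess at the order.

```typescript
// Reduced stand-in for ExtractorManifest; the real type has more fields.
interface ManifestLike {
  id: string;
  run: (context: unknown) => Promise<unknown>;
}

// Structural check: an object with a string id and a run function.
function isManifestLike(value: unknown): value is ManifestLike {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate.id === "string" && typeof candidate.run === "function";
}

// Try the default export first, then a named `manifest` export (assumed order).
function pickManifest(moduleExports: Record<string, unknown>): ManifestLike | undefined {
  for (const key of ["default", "manifest"]) {
    const candidate = moduleExports[key];
    if (isManifestLike(candidate)) return candidate;
  }
  return undefined;
}
```

If pickManifest returns undefined for your module's exports, the file is probably exporting under a different name or exporting something that lacks id or run.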

Source compiles in extractor but fails in API/client

  • Add the new source id to shared/src/extractors/index.ts.
  • Confirm metadata exists for that source id.

Smoke connectivity

After wiring settings/env, run:

npx tsx scripts/smoke-extractors.ts myextractor

Or run the full suite, which may take several minutes (JobSpy invokes Python, and Hiring Cafe and startup.jobs may need browser dependencies):

npx tsx scripts/smoke-extractors.ts

Keep ALL_TARGETS in that script aligned with manifests under each extractors/<name>/ package (manifest.ts or src/manifest.ts).
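A drift check between manifests and smoke targets could be as simple as comparing two id lists. The sketch below assumes you can obtain both lists as plain string arrays; the actual shape of ALL_TARGETS is not shown in this guide, and compareTargets is a hypothetical helper.

```typescript
interface AlignmentReport {
  missingFromTargets: string[]; // manifests with no smoke row
  staleTargets: string[]; // smoke rows with no matching manifest
}

// Compare manifest ids against smoke-target ids in both directions.
function compareTargets(
  manifestIds: readonly string[],
  targetIds: readonly string[],
): AlignmentReport {
  const manifests = new Set(manifestIds);
  const targets = new Set(targetIds);
  return {
    missingFromTargets: manifestIds.filter((id) => !targets.has(id)),
    staleTargets: targetIds.filter((id) => !manifests.has(id)),
  };
}
```

Both lists should come back empty when the smoke script and the extractor packages are in sync.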

Source appears in shared catalog but is unavailable at runtime

  • The manifest was not loaded successfully.
  • Check startup logs for registry warnings.

Source requires credentials but never returns jobs

  • Add and validate requiredEnvVars.
  • Verify your manifest run(context) reads settings/env values correctly.
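When credentials are the suspect, a quick check for missing variables can narrow things down. This is a hedged sketch: missingEnvVars is a hypothetical helper, and treating blank values as missing is an assumption about what the extractor needs.

```typescript
// Return the names from `required` that are unset or blank in `env`.
// Passing env explicitly keeps the function easy to test.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling it with the manifest's requiredEnvVars (or an empty array when the field is omitted) and process.env at the top of run(context) lets the extractor return { success: false, jobs: [], error } with a clear message instead of silently producing no jobs.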