Jobber/docs-site/docs/workflows/add-an-extractor.md
ilia f5179304c1
2026-05-16 11:41:29 -04:00


id: add-an-extractor
title: Add an Extractor
description: How to add a new extractor using the manifest contract and shared extractor catalog.
sidebar_position: 2

What it is

This guide explains how to add a new extractor that is auto-registered at orchestrator startup.

The orchestrator discovers each extractor's runtime from a local manifest.ts file. Source ids stay type-safe across the API and client through the shared catalog in shared/src/extractors/index.ts.

Extractor manifests must live in extractor packages under extractors/<name>/ only. Do not add manifest files inside orchestrator/. Extractor run logic should also live in the extractor package so orchestrator stays extractor-agnostic.
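
The location rule above can be expressed as a small path check. This is a minimal sketch for illustration: isValidManifestPath is a hypothetical helper, not the orchestrator's actual discovery code, and it assumes paths are given relative to the repository root.

```typescript
// Hypothetical helper: returns true only for the two manifest locations
// this guide names as valid. Not the orchestrator's real discovery logic.
export function isValidManifestPath(relativePath: string): boolean {
  // Normalise Windows separators so the check behaves the same on any platform.
  const normalised = relativePath.replace(/\\/g, "/");
  // Matches extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
  return /^extractors\/[^/]+\/(src\/)?manifest\.ts$/.test(normalised);
}
```

Under this sketch, a manifest placed anywhere under orchestrator/ or nested deeper than src/ is rejected.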

Why it exists

Before the manifest contract, adding an extractor required touching multiple orchestrator files.

With the manifest system, contributors only need to:

  1. Add a manifest in their extractor package.
  2. Add the new source id to the shared typed catalog.

That keeps runtime wiring dynamic while preserving compile-time safety in API and client code.

How to use it

  1. Create your extractor package under extractors/<name>/.
  2. Add a manifest.ts in the extractor package root (or src/manifest.ts).
    • Valid locations are only extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
    • orchestrator/**/manifest.ts is not used for extractor discovery.
  3. Export a manifest with:
    • id
    • displayName
    • providesSources
    • requiredEnvVars (optional)
    • run(context) that returns { success, jobs, error? }
  4. Add the new source id to shared/src/extractors/index.ts:
    • append to EXTRACTOR_SOURCE_IDS
    • add an entry in EXTRACTOR_SOURCE_METADATA
  5. Ensure your extractor maps output to CreateJobInput[].
  6. Register it in scripts/smoke-extractors.ts (ALL_TARGETS): add one row per manifest so that npx tsx scripts/smoke-extractors.ts exercises every shipped extractor. Sources that need API keys are reported as SKIP until their env vars are set.
  7. Run the full CI checks.
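
Step 4 can be sketched as follows. The exact shape of shared/src/extractors/index.ts is not shown in this guide, so everything here is an assumption: the "existingsource" id is a placeholder, the label field is a hypothetical metadata field, and the real catalog may carry different data.

```typescript
// Assumed shape of the shared catalog; the real file may differ.
// "existingsource" is a placeholder for ids already in the list.
export const EXTRACTOR_SOURCE_IDS = ["existingsource", "myextractor"] as const;

// Union of every id in the list: "existingsource" | "myextractor".
export type ExtractorSourceId = (typeof EXTRACTOR_SOURCE_IDS)[number];

// Hypothetical metadata shape, for illustration only.
interface ExtractorSourceMetadata {
  label: string;
}

// Typing the map as Record<ExtractorSourceId, ...> turns a missing
// metadata entry into a compile-time error.
export const EXTRACTOR_SOURCE_METADATA: Record<ExtractorSourceId, ExtractorSourceMetadata> = {
  existingsource: { label: "Existing Source" },
  myextractor: { label: "My Extractor" },
};
```

With a Record keyed by the id union, appending an id without adding its metadata fails type checking instead of surfacing later at runtime.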

Example manifest:

import type { ExtractorManifest } from "@shared/types/extractors";

export const manifest: ExtractorManifest = {
  id: "myextractor",
  displayName: "My Extractor",
  providesSources: ["myextractor"],
  requiredEnvVars: ["MYEXTRACTOR_API_KEY"],
  async run(context) {
    // context exposes searchTerms, settings, onProgress, and shouldCancel.
    // Fetch listings here and map them to CreateJobInput[] before returning.
    return { success: true, jobs: [] };
  },
};

export default manifest;

Subprocess extractors are supported. Keep subprocess spawning inside run(context) so orchestrator only depends on the manifest contract.
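
A subprocess-backed run could look roughly like the sketch below. It assumes the child process prints a single JSON array of jobs on stdout. RunResult, parseJobs, and runSubprocess are hypothetical names used only for illustration, and the shipped extractors may exchange data differently.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape, mirroring the { success, jobs, error? } contract.
interface RunResult {
  success: boolean;
  jobs: unknown[];
  error?: string;
}

// Parses the child's stdout, assumed here to be one JSON array of jobs.
export function parseJobs(stdout: string): RunResult {
  try {
    const parsed: unknown = JSON.parse(stdout);
    if (!Array.isArray(parsed)) {
      return { success: false, jobs: [], error: "Expected a JSON array" };
    }
    return { success: true, jobs: parsed };
  } catch (err) {
    return { success: false, jobs: [], error: String(err) };
  }
}

// Spawns the child and resolves once it exits. Failures are reported
// through the result object, so the promise never rejects.
export function runSubprocess(command: string, args: string[]): Promise<RunResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.stderr.on("data", (chunk: string) => {
      stderr += chunk;
    });
    // Fires when the process could not be started at all.
    child.on("error", (err) => resolve({ success: false, jobs: [], error: err.message }));
    child.on("close", (code) => {
      if (code === 0) {
        resolve(parseJobs(stdout));
      } else {
        resolve({ success: false, jobs: [], error: stderr || `Exited with code ${code}` });
      }
    });
  });
}
```

A manifest's run(context) could then return the result of runSubprocess after mapping the parsed jobs to CreateJobInput[], which keeps the spawn details out of the orchestrator.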

Common problems

Extractor not discovered at startup

  • Check file path: extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
  • Ensure the file has a default export or a named export called manifest.
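
To make the export check concrete, here is one way a loader might pick the manifest out of an imported module. This is a sketch under assumptions: pickManifest and ManifestLike are hypothetical, the real registry may validate more fields, and preferring the default export over the named one is a choice made here, not documented behaviour.

```typescript
// Minimal stand-in for the manifest type; the real ExtractorManifest has more fields.
interface ManifestLike {
  id: string;
  run: (context: unknown) => Promise<unknown>;
}

// Runtime check that a value has at least an id string and a run function.
function isManifestLike(value: unknown): value is ManifestLike {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate.id === "string" && typeof candidate.run === "function";
}

// Hypothetical helper: takes a loaded module and returns its manifest,
// trying the default export first and the named export second.
export function pickManifest(mod: { default?: unknown; manifest?: unknown }): ManifestLike | undefined {
  if (isManifestLike(mod.default)) return mod.default;
  if (isManifestLike(mod.manifest)) return mod.manifest;
  return undefined;
}
```

A module that exports neither form yields undefined, which corresponds to the "not discovered" symptom described here.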

Source compiles in extractor but fails in API/client

  • Add the new source id to shared/src/extractors/index.ts.
  • Confirm metadata exists for that source id.

Smoke connectivity

After wiring settings/env, run:

npm run smoke:extractors -- myextractor

Or run the full suite. It may take several minutes because JobSpy invokes Python, and Gradcracker and Hiring Cafe need Camoufox (fetch it with npx camoufox-js fetch):

npm run smoke:extractors

Keep ALL_TARGETS in that script aligned with manifests under each extractors/<name>/ package (manifest.ts or src/manifest.ts).
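
One way to catch drift between ALL_TARGETS and the manifests is a simple set comparison. The sketch below works on plain id lists because the real row shape of ALL_TARGETS is not shown in this guide. diffTargets is a hypothetical helper and is not part of scripts/smoke-extractors.ts.

```typescript
// Hypothetical helper: compares smoke-target ids with manifest ids and
// reports what each side is missing.
export function diffTargets(
  targetIds: readonly string[],
  manifestIds: readonly string[],
): { missingTargets: string[]; staleTargets: string[] } {
  const targets = new Set(targetIds);
  const manifests = new Set(manifestIds);
  return {
    // Manifests that have no smoke target yet.
    missingTargets: manifestIds.filter((id) => !targets.has(id)),
    // Smoke targets whose manifest no longer exists.
    staleTargets: targetIds.filter((id) => !manifests.has(id)),
  };
}
```

Both lists being empty means the smoke script and the extractor packages agree.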

Source appears in shared catalog but is unavailable at runtime

  • The manifest was not loaded successfully.
  • Check startup logs for registry warnings.

Source requires credentials but never returns jobs

  • Add and validate requiredEnvVars.
  • Verify your manifest run(context) reads settings/env values correctly.
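
A quick way to rule out missing credentials is to check the declared variables before doing any work. The following is a minimal sketch: missingEnvVars is a hypothetical helper, and treating blank values as missing is an assumption made here.

```typescript
// Hypothetical helper: lists required variables that are unset or blank.
export function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined> = process.env,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}
```

Inside run(context), an extractor could return { success: false, jobs: [], error } naming the missing variables instead of silently returning no jobs.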