---
id: add-an-extractor
title: Add an Extractor
description: How to add a new extractor using the manifest contract and shared extractor catalog.
sidebar_position: 2
---

## What it is

This guide explains how to add a new extractor that is auto-registered at orchestrator startup.

The extractor runtime is discovered from a local `manifest.ts` file, and the source is type-safe across API/client through the shared catalog in `shared/src/extractors/index.ts`.

Extractor manifests must live in extractor packages under `extractors/<name>/` only. Do not add manifest files inside `orchestrator/`.
Extractor run logic should also live in the extractor package so the orchestrator stays extractor-agnostic.

## Why it exists

Without a manifest contract, adding extractors required touching multiple orchestrator files.

With the manifest system, contributors only need to:

1. Add a manifest in their extractor package.
2. Add the new source id to the shared typed catalog.

That keeps runtime wiring dynamic while preserving compile-time safety in API and client code.

## How to use it

1. Create your extractor package under `extractors/<name>/`.
2. Add a `manifest.ts` in the extractor package root (or `src/manifest.ts`).
   - Valid locations are only `extractors/<name>/manifest.ts` or `extractors/<name>/src/manifest.ts`.
   - `orchestrator/**/manifest.ts` is not used for extractor discovery.
3. Export a manifest with:
   - `id`
   - `displayName`
   - `providesSources`
   - `requiredEnvVars` (optional)
   - `run(context)` that returns `{ success, jobs, error? }`
4. Add the new source id to `shared/src/extractors/index.ts`:
   - append to `EXTRACTOR_SOURCE_IDS`
   - add an entry in `EXTRACTOR_SOURCE_METADATA`
5. Ensure your extractor maps output to `CreateJobInput[]`.
6. Register it in `scripts/smoke-extractors.ts` (`ALL_TARGETS`): add one row per manifest so `npx tsx scripts/smoke-extractors.ts` exercises every shipped extractor (keyed sources `SKIP` until env vars exist).
7. Run the full CI checks.

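For step 4, the catalog change could look roughly like the sketch below. It assumes `EXTRACTOR_SOURCE_IDS` is a readonly tuple and `EXTRACTOR_SOURCE_METADATA` is a record keyed by source id. The `label` field, the `ExtractorSourceId` type name, the `isExtractorSourceId` helper, and the existing `"adzuna"` entry are hypothetical, so check `shared/src/extractors/index.ts` for the real shapes.

```ts
// Sketch only: the real definitions live in shared/src/extractors/index.ts
// and may be shaped differently.
const EXTRACTOR_SOURCE_IDS = [
  "adzuna", // illustrative existing id
  "myextractor", // new id appended at the end
] as const;

// Hypothetical union type derived from the tuple.
type ExtractorSourceId = (typeof EXTRACTOR_SOURCE_IDS)[number];

// Keying the record by ExtractorSourceId makes the compiler reject a source id
// that has no metadata entry.
const EXTRACTOR_SOURCE_METADATA: Record<ExtractorSourceId, { label: string }> = {
  adzuna: { label: "Adzuna" },
  myextractor: { label: "My Extractor" },
};

// Hypothetical runtime guard for values that arrive as plain strings.
function isExtractorSourceId(value: string): value is ExtractorSourceId {
  return EXTRACTOR_SOURCE_IDS.some((id) => id === value);
}
```

With this shape, forgetting the metadata entry for a newly appended id is a compile error rather than a runtime surprise.
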
Example manifest:

```ts
import type { ExtractorManifest } from "@shared/types/extractors";

export const manifest: ExtractorManifest = {
  id: "myextractor",
  displayName: "My Extractor",
  providesSources: ["myextractor"],
  requiredEnvVars: ["MYEXTRACTOR_API_KEY"],
  async run(context) {
    // context.searchTerms, context.settings, context.onProgress, context.shouldCancel
    const jobs = [];
    return { success: true, jobs };
  },
};

export default manifest;
```

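Step 5 usually happens inside `run(context)`. The sketch below shows one way to map a raw API payload to job inputs. `RawListing`, `JobInputSketch`, `toJobInputs`, and every field name are hypothetical stand-ins, since the real `CreateJobInput` fields are defined in the shared package and are not reproduced here.

```ts
// Hypothetical shape of one item returned by the upstream API.
interface RawListing {
  id: string;
  title?: string;
  company?: string;
  url?: string;
}

// Stand-in for CreateJobInput; the real type has its own fields.
interface JobInputSketch {
  source: string;
  title: string;
  employer: string;
  jobUrl: string;
}

// Map raw listings to job inputs, skipping items that lack a title or URL.
function toJobInputs(source: string, listings: RawListing[]): JobInputSketch[] {
  const jobs: JobInputSketch[] = [];
  for (const item of listings) {
    if (!item.title || !item.url) continue;
    jobs.push({
      source,
      title: item.title.trim(),
      employer: item.company?.trim() || "Unknown",
      jobUrl: item.url,
    });
  }
  return jobs;
}
```

Typing the array explicitly, rather than starting from a bare `[]`, lets the compiler flag a mapped field that does not match the target type.
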
Subprocess extractors are supported. Keep subprocess spawning inside `run(context)` so orchestrator only depends on the manifest contract.

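A minimal sketch of a subprocess-backed `run`, assuming the child process prints a JSON array of jobs on stdout and exits with code 0 on success. The `runChild` and `parseChildOutput` helpers, the `RunResult` type, and the output format are all assumptions for illustration, not part of the manifest contract.

```ts
import { spawn } from "node:child_process";

// Local stand-in for the { success, jobs, error? } result shape.
interface RunResult {
  success: boolean;
  jobs: unknown[];
  error?: string;
}

// Turn the child's exit code and stdout into a RunResult.
function parseChildOutput(exitCode: number | null, stdout: string): RunResult {
  if (exitCode !== 0) {
    return { success: false, jobs: [], error: `child exited with code ${exitCode}` };
  }
  try {
    const parsed: unknown = JSON.parse(stdout);
    if (!Array.isArray(parsed)) {
      return { success: false, jobs: [], error: "expected a JSON array on stdout" };
    }
    return { success: true, jobs: parsed };
  } catch {
    return { success: false, jobs: [], error: "child output was not valid JSON" };
  }
}

// Spawn the child, collect stdout, and resolve once it closes.
// Spawn failures (for example, command not found) resolve as a failed result.
function runChild(command: string, args: string[]): Promise<RunResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "inherit"] });
    let stdout = "";
    child.stdout.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => {
      stdout += chunk;
    });
    child.on("error", (err) => {
      resolve({ success: false, jobs: [], error: err.message });
    });
    child.on("close", (code) => {
      resolve(parseChildOutput(code, stdout));
    });
  });
}
```

Inside the manifest, `run(context)` would call `runChild` with the command for your extractor and map the parsed items to job inputs before returning.
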
## Common problems

### Extractor not discovered at startup

- Check file path: `extractors/<name>/manifest.ts` or `extractors/<name>/src/manifest.ts`.
- Ensure the file exports `default` or named `manifest`.

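The export rule can be pictured with the sketch below: a loader that accepts either a default export or a named `manifest` export. The `pickManifest` function, the `ManifestLike` type, and the choice to try the default export first are assumptions for illustration. The orchestrator's actual registry code may validate more fields or resolve the two exports differently.

```ts
// Hypothetical minimal view of a manifest, based on the fields listed in this guide.
interface ManifestLike {
  id: string;
  displayName: string;
  providesSources: string[];
  run: (context: unknown) => unknown;
}

// Structural check: does this value look like a manifest?
function isManifestLike(value: unknown): value is ManifestLike {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["id"] === "string" &&
    typeof v["displayName"] === "string" &&
    Array.isArray(v["providesSources"]) &&
    typeof v["run"] === "function"
  );
}

// Accept `export default manifest` or `export const manifest`.
function pickManifest(mod: Record<string, unknown>): ManifestLike | undefined {
  const candidates = [mod["default"], mod["manifest"]];
  for (const candidate of candidates) {
    if (isManifestLike(candidate)) return candidate;
  }
  return undefined;
}
```

A module that exports the manifest under any other name yields `undefined` here, which mirrors the "not discovered" symptom.
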
### Source compiles in extractor but fails in API/client

- Add the new source id to `shared/src/extractors/index.ts`.
- Confirm metadata exists for that source id.

### Smoke connectivity

After wiring settings/env, run:

```bash
npm run smoke:extractors -- myextractor
```

Or run the full suite, which may take several minutes (JobSpy invokes Python; Gradcracker / Hiring Cafe need Camoufox: `npx camoufox-js fetch`):

```bash
npm run smoke:extractors
```

Keep `ALL_TARGETS` in that script aligned with manifests under each `extractors/<name>/` package (`manifest.ts` or `src/manifest.ts`).

### Source appears in shared catalog but is unavailable at runtime

- The manifest was not loaded successfully.
- Check startup logs for registry warnings.

### Source requires credentials but never returns jobs

- Add and validate `requiredEnvVars`.
- Verify your manifest `run(context)` reads settings/env values correctly.

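One way to make a missing credential fail loudly instead of silently returning no jobs is to check the required variables at the top of `run(context)`. The `missingEnvVars` and `credentialCheck` helpers below are hypothetical, and they take the environment as a parameter (for example `process.env`) so they can be tested without touching real settings.

```ts
// Return the names from `required` that are unset or blank in `env`.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Build an early failure result in the { success, jobs, error? } shape,
// or return undefined when every required variable is present.
function credentialCheck(
  required: readonly string[],
  env: Record<string, string | undefined>,
): { success: false; jobs: never[]; error: string } | undefined {
  const missing = missingEnvVars(required, env);
  if (missing.length === 0) return undefined;
  return { success: false, jobs: [], error: `Missing env vars: ${missing.join(", ")}` };
}
```

Returning the failure from `run(context)` surfaces the missing variable name in the `error` field rather than an empty job list.
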
## Related pages

- [Extractors Overview](/docs/next/extractors/overview)
- [Adzuna Extractor](/docs/next/extractors/adzuna)
- [Hiring Cafe Extractor](/docs/next/extractors/hiring-cafe)
- [UKVisaJobs Extractor](/docs/next/extractors/ukvisajobs)