| id | title | description | sidebar_position |
|---|---|---|---|
| add-an-extractor | Add an Extractor | How to add a new extractor using the manifest contract and shared extractor catalog. | 2 |
What it is
This guide explains how to add a new extractor that is auto-registered at orchestrator startup.
The extractor runtime is discovered from a local manifest.ts file, and each source id stays type-safe across API and client code through the shared catalog in shared/src/extractors/index.ts.
Extractor manifests must live in extractor packages under extractors/<name>/ only. Do not add manifest files inside orchestrator/.
Extractor run logic should also live in the extractor package so orchestrator stays extractor-agnostic.
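The location rule above can be expressed as a small path check. This is only a sketch: `isExtractorManifestPath` is a hypothetical helper, not the orchestrator's actual discovery code, and it assumes repo-relative paths with forward slashes.

```typescript
// Hypothetical helper (not part of the codebase): returns true only for the
// two manifest locations this guide lists as valid.
// Assumes a repo-relative path that uses forward slashes.
const MANIFEST_PATH_PATTERN = /^extractors\/[^/]+\/(src\/)?manifest\.ts$/;

function isExtractorManifestPath(repoRelativePath: string): boolean {
  return MANIFEST_PATH_PATTERN.test(repoRelativePath);
}
```

Under these assumptions, extractors/myextractor/src/manifest.ts passes the check, while any manifest.ts under orchestrator/ does not.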
Why it exists
Without a manifest contract, adding extractors required touching multiple orchestrator files.
With the manifest system, contributors only need to:
- Add a manifest in their extractor package.
- Add the new source id to the shared typed catalog.
That keeps runtime wiring dynamic while preserving compile-time safety in API and client code.
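One common way to get that compile-time safety is a const array of ids with a derived union type. The sketch below reuses the names EXTRACTOR_SOURCE_IDS and EXTRACTOR_SOURCE_METADATA from this guide, but the entries, the `label` field, and the `isExtractorSourceId` guard are assumptions; the real contents of shared/src/extractors/index.ts may differ.

```typescript
// Illustrative stand-ins only: the real catalog's entries and metadata
// fields are not shown in this guide.
const EXTRACTOR_SOURCE_IDS = ["existingsource", "myextractor"] as const;

// Union type derived from the array: "existingsource" | "myextractor".
type ExtractorSourceId = (typeof EXTRACTOR_SOURCE_IDS)[number];

// Record<ExtractorSourceId, ...> makes the compiler reject a missing entry,
// so an id appended above without metadata fails to compile.
const EXTRACTOR_SOURCE_METADATA: Record<ExtractorSourceId, { label: string }> = {
  existingsource: { label: "Existing Source" },
  myextractor: { label: "My Extractor" },
};

// Hypothetical runtime guard for ids arriving as plain strings.
function isExtractorSourceId(value: string): value is ExtractorSourceId {
  return (EXTRACTOR_SOURCE_IDS as readonly string[]).includes(value);
}
```

With this shape, API and client code can accept ExtractorSourceId instead of string, and a typo in a source id becomes a compile error rather than a runtime surprise.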
How to use it
- Create your extractor package under extractors/<name>/.
- Add a manifest.ts in the extractor package root (or src/manifest.ts).
  - Valid locations are only extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
  - orchestrator/**/manifest.ts is not used for extractor discovery.
- Export a manifest with:
  - id
  - displayName
  - providesSources
  - requiredEnvVars (optional)
  - run(context) that returns { success, jobs, error? }
- Add the new source id to shared/src/extractors/index.ts:
  - append to EXTRACTOR_SOURCE_IDS
  - add an entry in EXTRACTOR_SOURCE_METADATA
- Ensure your extractor maps output to CreateJobInput[].
- Register it in scripts/smoke-extractors.ts (ALL_TARGETS): add one row per manifest so npx tsx scripts/smoke-extractors.ts exercises every shipped extractor (keyed sources SKIP until env vars exist).
- Run the full CI checks.
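The mapping step in the list above might look like the sketch below. Everything here is hypothetical: `RawPosting` stands for whatever your upstream API returns, and `JobInputSketch` is a trimmed stand-in with assumed field names, not the real CreateJobInput type.

```typescript
// Hypothetical upstream payload shape.
interface RawPosting {
  position?: string;
  company?: string;
  link?: string;
  city?: string;
}

// Trimmed stand-in for CreateJobInput; the real type's fields may differ.
interface JobInputSketch {
  source: string;
  title: string;
  employer: string;
  jobUrl: string;
  location?: string;
}

function toJobInputs(source: string, postings: RawPosting[]): JobInputSketch[] {
  const jobs: JobInputSketch[] = [];
  for (const posting of postings) {
    // Skip rows missing the fields this sketch treats as mandatory.
    if (!posting.position || !posting.link) continue;
    jobs.push({
      source,
      title: posting.position.trim(),
      employer: posting.company?.trim() || "Unknown",
      jobUrl: posting.link,
      // Only set location when the upstream row has one.
      ...(posting.city ? { location: posting.city } : {}),
    });
  }
  return jobs;
}
```

Keeping the mapping in one pure function makes it easy to unit test without network access.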
Example manifest:
import type { ExtractorManifest } from "@shared/types/extractors";

export const manifest: ExtractorManifest = {
  id: "myextractor",
  displayName: "My Extractor",
  providesSources: ["myextractor"],
  requiredEnvVars: ["MYEXTRACTOR_API_KEY"],
  async run(context) {
    // context.searchTerms, context.settings, context.onProgress, context.shouldCancel
    const jobs = [];
    return { success: true, jobs };
  },
};

export default manifest;
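A fuller run body could loop over the search terms, honor cancellation, and report progress. The sketch below is self-contained and uses local stand-in types: the signatures assumed for shouldCancel and onProgress, the progress payload, and the `fetchTerm` callback are all assumptions, since the real context type is not shown in this guide.

```typescript
// Local stand-ins; the real context and result types live in the shared package.
interface RunContextSketch {
  searchTerms: string[];
  shouldCancel?: () => boolean; // assumed signature
  onProgress?: (update: { termsProcessed: number; termsTotal: number }) => void; // assumed payload
}

interface RunResultSketch {
  success: boolean;
  jobs: Array<{ title: string }>;
  error?: string;
}

async function runSketch(
  context: RunContextSketch,
  fetchTerm: (term: string) => Promise<Array<{ title: string }>>, // hypothetical fetcher
): Promise<RunResultSketch> {
  const jobs: Array<{ title: string }> = [];
  const total = context.searchTerms.length;
  try {
    for (let index = 0; index < total; index += 1) {
      // Stop early, keeping what was collected so far, when cancellation is requested.
      if (context.shouldCancel?.()) break;
      jobs.push(...(await fetchTerm(context.searchTerms[index])));
      context.onProgress?.({ termsProcessed: index + 1, termsTotal: total });
    }
    return { success: true, jobs };
  } catch (err) {
    // Report failures through the result shape instead of throwing.
    return { success: false, jobs: [], error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning { success: false, error } rather than throwing is a design choice in this sketch that matches the { success, jobs, error? } result shape described earlier.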
Subprocess extractors are supported. Keep subprocess spawning inside run(context) so orchestrator only depends on the manifest contract.
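As an illustration of spawning inside run(context), the sketch below starts a child process with Node's child_process.spawn and parses one JSON object per stdout line. `runJsonLines` and the JSON-lines output convention are assumptions made for this example, not a protocol the existing extractors are known to use.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: runs a command and parses one JSON value per stdout line.
function runJsonLines(command: string, args: string[]): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    child.stdout.setEncoding("utf8");
    child.stderr.setEncoding("utf8");
    child.stdout.on("data", (chunk: string) => { stdout += chunk; });
    child.stderr.on("data", (chunk: string) => { stderr += chunk; });
    // Fired when the process cannot be started (for example, command not found).
    child.on("error", reject);
    child.on("close", (code) => {
      if (code !== 0) {
        reject(new Error(`subprocess exited with code ${code}: ${stderr.trim()}`));
        return;
      }
      try {
        const rows = stdout
          .split("\n")
          .filter((line) => line.trim() !== "")
          .map((line) => JSON.parse(line) as unknown);
        resolve(rows);
      } catch (err) {
        reject(err);
      }
    });
  });
}
```

A manifest's run(context) could await such a helper, map the rows to jobs, and convert a rejection into { success: false, jobs: [], error }, so the orchestrator never sees the subprocess.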
Common problems
Extractor not discovered at startup
- Check file path: extractors/<name>/manifest.ts or extractors/<name>/src/manifest.ts.
- Ensure the file exports default or named manifest.
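To check the export shape locally, a loader step along these lines can help. It is a sketch, not the orchestrator's registry code: `pickManifest` and `isManifestLike` are hypothetical, and preferring the default export over the named one is an assumption.

```typescript
// Minimal structural check; the real ExtractorManifest type has more fields.
interface ManifestLike {
  id: string;
  run: (...args: never[]) => unknown;
}

function isManifestLike(value: unknown): value is ManifestLike {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["id"] === "string" && typeof candidate["run"] === "function";
}

// Accepts a module namespace object (for example, the result of a dynamic import)
// and returns the default export or the named `manifest` export, whichever is valid.
function pickManifest(moduleExports: Record<string, unknown>): ManifestLike | undefined {
  const candidates: unknown[] = [moduleExports["default"], moduleExports["manifest"]];
  return candidates.find(isManifestLike);
}
```

If pickManifest returns undefined for your module, the file most likely exports the manifest under a different name.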
Source compiles in extractor but fails in API/client
- Add the new source id to shared/src/extractors/index.ts.
- Confirm metadata exists for that source id.
Smoke connectivity
After wiring settings/env, run:
npm run smoke:extractors -- myextractor
Or run the full suite, which may take several minutes (JobSpy invokes Python; Gradcracker and Hiring Cafe need Camoufox, fetched with npx camoufox-js fetch):
npm run smoke:extractors
Keep ALL_TARGETS in that script aligned with manifests under each extractors/<name>/ package (manifest.ts or src/manifest.ts).
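A small drift check can catch a manifest that was never added to the smoke script. The function below is hypothetical and works on plain id lists, because the actual row shape of ALL_TARGETS is not shown in this guide.

```typescript
// Hypothetical drift check: returns manifest ids that have no smoke target.
function findUnregisteredManifests(
  manifestIds: readonly string[],
  smokeTargetIds: readonly string[],
): string[] {
  const registered = new Set(smokeTargetIds);
  // filter() returns a new array, so sorting does not mutate the input.
  return manifestIds.filter((id) => !registered.has(id)).sort();
}
```

An empty result means every manifest id has a matching smoke target under this sketch's assumptions.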
Source appears in shared catalog but is unavailable at runtime
- The manifest was not loaded successfully.
- Check startup logs for registry warnings.
Source requires credentials but never returns jobs
- Add and validate requiredEnvVars.
- Verify your manifest run(context) reads settings/env values correctly.
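One way to validate credentials early is to compute which required variables are missing before making any request. `missingEnvVars` is a hypothetical helper, and treating whitespace-only values as missing is an assumption of this sketch.

```typescript
// Hypothetical helper: lists required variables that are unset or blank.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Example use inside run(context): fail fast with a readable error.
function credentialError(required: readonly string[], env: Record<string, string | undefined>): string | undefined {
  const missing = missingEnvVars(required, env);
  return missing.length > 0 ? `Missing required env vars: ${missing.join(", ")}` : undefined;
}
```

A run(context) implementation could call credentialError(["MYEXTRACTOR_API_KEY"], process.env) and return { success: false, jobs: [], error } when it yields a message, which makes a silent "no jobs" result easier to diagnose.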