UK Visa Jobs Extractor

Fetches job listings from my.ukvisajobs.com for roles that may offer work visa sponsorship.

Setup

npm install

If the Playwright browser download was skipped in your environment, install Firefox manually:

npx playwright install firefox

If Camoufox assets are missing, fetch them:

npx camoufox-js fetch

Configuration

Set the following environment variables:

Variable                    Description
UKVISAJOBS_EMAIL            Login email for automatic token refresh
UKVISAJOBS_PASSWORD         Login password for automatic token refresh
UKVISAJOBS_HEADLESS         Set to false to show the browser (default: true)
UKVISAJOBS_MAX_JOBS         Maximum jobs to fetch (default: 50, max: 200)
UKVISAJOBS_SEARCH_KEYWORD   Optional search filter
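
As a rough illustration of how these variables might be interpreted, here is a minimal TypeScript sketch. The UkVisaJobsConfig interface and the readConfig function are hypothetical names, not the extractor's actual code. The sketch also assumes that any UKVISAJOBS_HEADLESS value other than "false" keeps the browser hidden, and that out-of-range or unparsable UKVISAJOBS_MAX_JOBS values are clamped to 1..200 or replaced by the default of 50.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface UkVisaJobsConfig {
  email?: string;
  password?: string;
  headless: boolean;
  maxJobs: number;
  searchKeyword?: string;
}

type Env = Record<string, string | undefined>;

const DEFAULT_MAX_JOBS = 50; // documented default
const MAX_JOBS_LIMIT = 200; // documented upper bound

function readConfig(env: Env): UkVisaJobsConfig {
  // Fall back to the default when the variable is unset or not a number.
  const parsed = Number.parseInt(env.UKVISAJOBS_MAX_JOBS ?? "", 10);
  const requested = Number.isNaN(parsed) ? DEFAULT_MAX_JOBS : parsed;

  return {
    email: env.UKVISAJOBS_EMAIL,
    password: env.UKVISAJOBS_PASSWORD,
    // Assumption: only the literal string "false" shows the browser.
    headless: env.UKVISAJOBS_HEADLESS !== "false",
    // Assumption: values are clamped rather than rejected.
    maxJobs: Math.min(Math.max(requested, 1), MAX_JOBS_LIMIT),
    // Treat an empty string the same as an unset filter.
    searchKeyword: env.UKVISAJOBS_SEARCH_KEYWORD || undefined,
  };
}
```

In a Node.js process this would be called as readConfig(process.env). Passing the environment in as a parameter keeps the function easy to test in isolation.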

Automatic login & cache

The extractor will:

  1. Launch a Camoufox (Playwright Firefox) browser and sign in
  2. Navigate to the open jobs page and capture the token/cookies
  3. Cache the session to storage/ukvisajobs-auth.json
  4. Reuse the cached values until the API reports an expired token, then refresh
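
The reuse-then-refresh flow in the steps above can be sketched roughly as follows. This is not the extractor's implementation: Session, SessionStore, ApiResult, and withSession are hypothetical names. The login callback stands in for the Camoufox sign-in, and the store stands in for reading and writing storage/ukvisajobs-auth.json.

```typescript
// Hypothetical session shape: whatever token/cookies the login captures.
interface Session {
  token: string;
  cookies: string[];
}

// Stand-in for the on-disk cache (storage/ukvisajobs-auth.json).
interface SessionStore {
  load(): Promise<Session | null>;
  save(session: Session): Promise<void>;
}

// An API call either yields a value or reports that the token expired.
type ApiResult<T> = { expired: false; value: T } | { expired: true };

async function withSession<T>(
  store: SessionStore,
  login: () => Promise<Session>,
  call: (session: Session) => Promise<ApiResult<T>>,
): Promise<T> {
  // Reuse the cached session if there is one; otherwise sign in once.
  let session = await store.load();
  if (session === null) {
    session = await login();
    await store.save(session);
  }

  let result = await call(session);
  if (result.expired) {
    // The API rejected the cached token: sign in again and retry once.
    session = await login();
    await store.save(session);
    result = await call(session);
  }
  if (result.expired) {
    throw new Error("Token still reported as expired after refresh");
  }
  return result.value;
}
```

Retrying only once after a refresh avoids a login loop if the credentials themselves are wrong.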

Running

npm start

Output is written to storage/datasets/default/ as JSON files.