Shaheer Sarfaraz d34a9f041b
Hiring cafe extractor (#192)
* feat(hiringcafe): register new source across shared/server/client enums

* feat(hiringcafe-extractor): add browser-backed Hiring Cafe dataset extractor

* feat(orchestrator): integrate Hiring Cafe discovery service into pipeline

* feat(orchestrator-ui): add Hiring Cafe to source availability and run estimates

* chore(hiringcafe): wire CI/docker and add extractor documentation

* chore(format): apply biome formatting for Hiring Cafe integration

* add original websites

* coomints

* number or null
2026-02-19 12:51:55 +00:00

1.8 KiB

id, title, description, sidebar_position
id title description sidebar_position
ukvisajobs UKVisaJobs Extractor Authenticated session flow, API pagination, and orchestrator ingestion. 5

UKVisaJobs is the most complex extractor because authenticated sessions are required.

Original website: my.ukvisajobs.com

Big picture

Two layers:

  1. extractors/ukvisajobs/src/main.ts handles login/API calls and dataset output.
  2. orchestrator/src/server/services/ukvisajobs.ts executes extractor and ingests/de-dupes output.

1) Authentication and session cache

Session cache file:

  • extractors/ukvisajobs/storage/ukvisajobs-auth.json

Flow:

  • Reuse cached token/cookies when valid
  • Re-login with Playwright + Camoufox when needed
  • Refresh and retry on token-expired responses

Force refresh:

  • UKVISAJOBS_REFRESH_ONLY=1

2) API requests

Endpoint:

  • https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data

Each request includes auth token + session cookies and paginates (15 jobs/page).

3) Mapping

  • Normalizes salary from min/max/interval
  • Builds fallback visa description when content missing
  • Maps job_link to both jobUrl and applicationLink

4) Output dataset

Written to:

  • extractors/ukvisajobs/storage/datasets/default/

Includes per-job JSON files and combined jobs.json.

5) Orchestrator flow

  • Spawns extractor (npx tsx src/main.ts)
  • Runs terms sequentially with delay
  • De-dupes by sourceJobId (fallback jobUrl)
  • Fetches detail pages when descriptions are too short

Controls

  • UKVISAJOBS_EMAIL, UKVISAJOBS_PASSWORD
  • UKVISAJOBS_HEADLESS
  • UKVISAJOBS_MAX_JOBS (default 50, max 200)
  • UKVISAJOBS_SEARCH_KEYWORD

Practical notes

  • Deleting auth cache forces next run to re-login.
  • Low-concurrency/polite scraping by design.
  • If extractor breaks, check session refresh path first.