Shaheer Sarfaraz d34a9f041b
Hiring cafe extractor (#192)
* feat(hiringcafe): register new source across shared/server/client enums

* feat(hiringcafe-extractor): add browser-backed Hiring Cafe dataset extractor

* feat(orchestrator): integrate Hiring Cafe discovery service into pipeline

* feat(orchestrator-ui): add Hiring Cafe to source availability and run estimates

* chore(hiringcafe): wire CI/docker and add extractor documentation

* chore(format): apply biome formatting for Hiring Cafe integration

* add original websites

* coomints

* number or null
2026-02-19 12:51:55 +00:00

---
id: ukvisajobs
title: UKVisaJobs Extractor
description: Authenticated session flow, API pagination, and orchestrator ingestion.
sidebar_position: 5
---
UKVisaJobs is the most complex extractor because it requires an authenticated session.
Original website: [my.ukvisajobs.com](https://my.ukvisajobs.com)
## Big picture
Two layers:
1. `extractors/ukvisajobs/src/main.ts` handles login, API calls, and dataset output.
2. `orchestrator/src/server/services/ukvisajobs.ts` runs the extractor, then ingests and de-dupes its output.
## 1) Authentication and session cache
Session cache file:
- `extractors/ukvisajobs/storage/ukvisajobs-auth.json`
Flow:
- Reuse cached token/cookies when valid
- Re-login with Playwright + Camoufox when needed
- Refresh and retry on token-expired responses
Force refresh:
- `UKVISAJOBS_REFRESH_ONLY=1`
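
The flow above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the `CachedSession` shape (including `expiresAtMs`), the `ApiResult` type, and the `chooseAuthAction` and `withTokenRefresh` helpers are all hypothetical names, and the real cache file may store different fields.

```typescript
// Hypothetical shape of the cached session; the real file format may differ.
interface CachedSession {
  token: string;
  cookies: string[];
  expiresAtMs: number; // assumed expiry timestamp, not confirmed by the source
}

type AuthAction = "reuse" | "login";

// Decide whether the cached session can be reused or a fresh login is needed.
function chooseAuthAction(
  cache: CachedSession | null,
  nowMs: number,
  forceRefresh: boolean,
): AuthAction {
  if (forceRefresh || cache === null) return "login";
  return cache.expiresAtMs > nowMs ? "reuse" : "login";
}

// Hypothetical result type: either a payload or a token-expired signal.
type ApiResult<T> = { kind: "ok"; value: T } | { kind: "token-expired" };

// Run a request and, if the token has expired, log in again and retry once.
async function withTokenRefresh<T>(
  session: CachedSession,
  request: (s: CachedSession) => Promise<ApiResult<T>>,
  login: () => Promise<CachedSession>,
): Promise<T> {
  const first = await request(session);
  if (first.kind === "ok") return first.value;

  const second = await request(await login());
  if (second.kind === "ok") return second.value;
  throw new Error("token still expired after re-login");
}
```

Retrying only once keeps a persistently failing login from looping; whether the real extractor caps retries this way is an assumption.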
## 2) API requests
Endpoint:
- `https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data`
Each request carries the auth token and session cookies. Results are paginated at 15 jobs per page.
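
As a rough sketch of what a 15-jobs-per-page limit implies for a run, the helpers below compute how many pages a job cap needs and collect results page by page. The `fetchPage` callback, the 1-based page numbering, and the stop-on-short-page rule are assumptions made for illustration; the real request parameters are not described here.

```typescript
const PAGE_SIZE = 15; // page size stated in this document

// Number of pages needed to cover `maxJobs` results.
function pagesNeeded(maxJobs: number, pageSize: number = PAGE_SIZE): number {
  if (maxJobs <= 0) return 0;
  return Math.ceil(maxJobs / pageSize);
}

// Collect up to `maxJobs` items, stopping early when a page comes back short.
async function collectJobs<T>(
  fetchPage: (page: number) => Promise<T[]>, // hypothetical page fetcher
  maxJobs: number,
): Promise<T[]> {
  const jobs: T[] = [];
  const lastPage = pagesNeeded(maxJobs);
  // 1-based page numbering is an assumption.
  for (let page = 1; page <= lastPage; page++) {
    const batch = await fetchPage(page);
    jobs.push(...batch);
    if (batch.length < PAGE_SIZE) break; // short page: no more results
  }
  return jobs.slice(0, maxJobs);
}
```

With the default cap of 50 jobs this works out to 4 requests per search term, and 14 at the maximum of 200.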
## 3) Mapping
- Normalizes salary from min/max/interval
- Builds a fallback visa description when content is missing
- Maps `job_link` to both `jobUrl` and `applicationLink`
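
A mapping step along these lines might look like the sketch below. Only `job_link`, `jobUrl`, and `applicationLink` come from this document; every other field name (`min_salary`, `max_salary`, `salary_interval`, `jobDescription`, and so on), the salary string format, and the fallback wording are hypothetical.

```typescript
// Hypothetical raw record; only `job_link` is named in this document.
interface RawJob {
  title: string;
  job_link: string;
  description: string | null;
  min_salary: number | null;
  max_salary: number | null;
  salary_interval: string | null;
}

interface MappedJob {
  title: string;
  jobUrl: string;
  applicationLink: string;
  salary: string | null;
  jobDescription: string;
}

// Collapse min/max/interval into one display string, or null if no bounds exist.
function formatSalary(
  min: number | null,
  max: number | null,
  interval: string | null,
): string | null {
  const bounds = [min, max].filter((v): v is number => v !== null);
  if (bounds.length === 0) return null;
  const range =
    bounds.length === 2 && min !== max ? `${min}-${max}` : `${bounds[0]}`;
  return interval ? `${range} per ${interval}` : range;
}

function mapJob(raw: RawJob): MappedJob {
  const description = raw.description?.trim();
  return {
    title: raw.title,
    // The same source link feeds both output fields.
    jobUrl: raw.job_link,
    applicationLink: raw.job_link,
    salary: formatSalary(raw.min_salary, raw.max_salary, raw.salary_interval),
    // Fallback text is illustrative; the real wording is not documented here.
    jobDescription: description
      ? description
      : `Visa sponsorship role: ${raw.title}. See ${raw.job_link} for details.`,
  };
}
```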
## 4) Output dataset
Written to:
- `extractors/ukvisajobs/storage/datasets/default/`
Includes per-job JSON files and a combined `jobs.json`.
## 5) Orchestrator flow
- Spawns extractor (`npx tsx src/main.ts`)
- Runs search terms sequentially with a delay between them
- De-dupes by `sourceJobId` (fallback `jobUrl`)
- Fetches detail pages when descriptions are too short
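
The de-dupe rule above can be illustrated with a small sketch. The `IngestedJob` type, the key prefixes, and the 100-character `MIN_DESCRIPTION_LENGTH` threshold are assumptions for the example; the orchestrator's actual cutoff for "too short" is not stated here.

```typescript
// Hypothetical minimal job shape for ingestion.
interface IngestedJob {
  sourceJobId?: string | null;
  jobUrl: string;
  jobDescription?: string | null;
}

// Prefer `sourceJobId` as the identity; fall back to `jobUrl` when it is absent.
function dedupeKey(job: IngestedJob): string {
  return job.sourceJobId ? `id:${job.sourceJobId}` : `url:${job.jobUrl}`;
}

// Keep the first occurrence of each key, preserving input order.
function dedupeJobs<T extends IngestedJob>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    const key = dedupeKey(job);
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(job);
  }
  return unique;
}

const MIN_DESCRIPTION_LENGTH = 100; // hypothetical threshold

// Flag jobs whose description is short enough to warrant a detail-page fetch.
function needsDetailFetch(job: IngestedJob): boolean {
  return (job.jobDescription ?? "").trim().length < MIN_DESCRIPTION_LENGTH;
}
```

Prefixing the key with `id:` or `url:` keeps an ID from ever colliding with a URL string.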
## Controls
- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`
- `UKVISAJOBS_HEADLESS`
- `UKVISAJOBS_MAX_JOBS` (default 50, max 200)
- `UKVISAJOBS_SEARCH_KEYWORD`
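
One way these controls could be read and validated is sketched below. The `readConfig` and `parseMaxJobs` helpers and the `UkVisaJobsConfig` type are hypothetical, as are the choices to fall back to the default on invalid input and to throw when credentials are missing. `UKVISAJOBS_HEADLESS` is left out because its accepted values are not documented here.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_MAX_JOBS = 50; // default stated in this document
const MAX_JOBS_CAP = 200; // maximum stated in this document

// Parse the job cap; invalid or non-positive input falls back to the default (assumed behavior).
function parseMaxJobs(raw: string | undefined): number {
  const parsed = parseInt(raw ?? "", 10);
  if (isNaN(parsed) || parsed <= 0) return DEFAULT_MAX_JOBS;
  return Math.min(parsed, MAX_JOBS_CAP);
}

interface UkVisaJobsConfig {
  email: string;
  password: string;
  maxJobs: number;
  searchKeyword: string | null;
}

function readConfig(env: Env): UkVisaJobsConfig {
  const email = env["UKVISAJOBS_EMAIL"];
  const password = env["UKVISAJOBS_PASSWORD"];
  if (!email || !password) {
    // Failing fast on missing credentials is an assumption, not documented behavior.
    throw new Error("UKVISAJOBS_EMAIL and UKVISAJOBS_PASSWORD must be set");
  }
  return {
    email,
    password,
    maxJobs: parseMaxJobs(env["UKVISAJOBS_MAX_JOBS"]),
    searchKeyword: env["UKVISAJOBS_SEARCH_KEYWORD"] ?? null,
  };
}
```

In a Node process this would be called as `readConfig(process.env)`.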
## Practical notes
- Deleting the auth cache forces the next run to log in again.
- Scraping is low-concurrency and polite by design.
- If the extractor breaks, check the session refresh path first.