Shaheer Sarfaraz 1f929dfc7f
Create the setup for the documentation page (#171)
2026-02-15 22:50:52 +00:00

---
id: ukvisajobs
title: UKVisaJobs Extractor
description: Authenticated session flow, API pagination, and orchestrator ingestion.
sidebar_position: 5
---
UKVisaJobs is the most complex extractor because authenticated sessions are required.
## Big picture
Two layers:
1. `extractors/ukvisajobs/src/main.ts` handles login/API calls and dataset output.
2. `orchestrator/src/server/services/ukvisajobs.ts` runs the extractor, then ingests and de-dupes its output.
## 1) Authentication and session cache
Session cache file:
- `extractors/ukvisajobs/storage/ukvisajobs-auth.json`
Flow:
- Reuse cached token/cookies when valid
- Re-login with Playwright + Camoufox when needed
- Refresh and retry on token-expired responses
Force refresh:
- `UKVISAJOBS_REFRESH_ONLY=1`
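
The reuse, re-login, and retry flow above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the `Session` shape, `withSession`, and the injected `login`, `request`, and `isTokenExpired` callbacks are all hypothetical names, and the real cache file layout may differ.

```typescript
// Hypothetical session shape; the real ukvisajobs-auth.json layout may differ.
interface Session {
  token: string;
  cookies: string[];
}

// Runs `request` with the cached session when one exists, otherwise logs in first.
// If the request fails with a token-expired error, refreshes once and retries.
async function withSession<T>(
  cached: Session | null,
  login: () => Promise<Session>,
  request: (session: Session) => Promise<T>,
  isTokenExpired: (err: unknown) => boolean,
): Promise<{ result: T; session: Session }> {
  let session = cached ?? (await login());
  try {
    return { result: await request(session), session };
  } catch (err) {
    if (!isTokenExpired(err)) throw err; // unrelated failures propagate unchanged
    session = await login(); // refresh the session, then retry exactly once
    return { result: await request(session), session };
  }
}
```

Returning the session alongside the result lets the caller write the refreshed token and cookies back to the cache file.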
## 2) API requests
Endpoint:
- `https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data`
Each request includes the auth token and session cookies. Results are paginated at 15 jobs per page.
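
A pagination loop over this endpoint might look like the sketch below. `fetchPage` is a hypothetical stand-in for the authenticated API call, and both the 1-based page numbering and the "short page means last page" stop condition are assumptions rather than documented API behaviour.

```typescript
const PAGE_SIZE = 15; // from the text above: 15 jobs per page

// Collects jobs page by page until `maxJobs` is reached or the results run out.
// `fetchPage` is a hypothetical wrapper around the authenticated request.
async function collectJobs<T>(
  fetchPage: (page: number) => Promise<T[]>,
  maxJobs: number,
): Promise<T[]> {
  const jobs: T[] = [];
  for (let page = 1; jobs.length < maxJobs; page++) {
    const batch = await fetchPage(page);
    jobs.push(...batch);
    if (batch.length < PAGE_SIZE) break; // a short page is treated as the last one
  }
  return jobs.slice(0, maxJobs); // trim any overshoot from the final page
}
```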
## 3) Mapping
- Normalizes salary from min/max/interval
- Builds a fallback visa description when the listing content is missing
- Maps `job_link` to both `jobUrl` and `applicationLink`
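
The mapping rules above could be sketched like this. Only `job_link`, `jobUrl`, and `applicationLink` come from the text; the other field names, the salary string format, and the fallback wording are hypothetical and chosen purely for illustration.

```typescript
// Hypothetical raw API shape; only `job_link` is named in the text above.
interface RawJob {
  job_link: string;
  description?: string;
  salary_min?: number;
  salary_max?: number;
  salary_interval?: string;
}

interface MappedJob {
  jobUrl: string;
  applicationLink: string;
  salary: string | null;
  description: string;
}

// Placeholder wording; the extractor's real fallback text is not shown here.
const FALLBACK_DESCRIPTION = "Visa sponsorship role listed on UKVisaJobs.";

// Collapses min/max/interval into one display string, or null when both bounds are absent.
function formatSalary(min?: number, max?: number, interval?: string): string | null {
  if (min == null && max == null) return null;
  const range =
    min != null && max != null && min !== max ? `${min}-${max}` : `${min ?? max}`;
  return interval ? `${range} per ${interval}` : range;
}

function mapJob(raw: RawJob): MappedJob {
  return {
    jobUrl: raw.job_link, // the same link feeds both fields
    applicationLink: raw.job_link,
    salary: formatSalary(raw.salary_min, raw.salary_max, raw.salary_interval),
    description: raw.description?.trim() || FALLBACK_DESCRIPTION,
  };
}
```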
## 4) Output dataset
Written to:
- `extractors/ukvisajobs/storage/datasets/default/`
Includes per-job JSON files and combined `jobs.json`.
## 5) Orchestrator flow
- Spawns the extractor (`npx tsx src/main.ts`)
- Runs search terms sequentially, with a delay between runs
- De-dupes by `sourceJobId` (fallback `jobUrl`)
- Fetches detail pages when descriptions are too short
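
The de-dupe step can be illustrated with a small sketch. The `IngestedJob` shape and `dedupeJobs` name are hypothetical; only the key rule (`sourceJobId`, falling back to `jobUrl`) comes from the list above.

```typescript
// Hypothetical minimal shape; real job records carry more fields.
interface IngestedJob {
  sourceJobId?: string | null;
  jobUrl: string;
}

// Keeps the first occurrence of each job, keyed by sourceJobId when present
// and by jobUrl otherwise.
function dedupeJobs<T extends IngestedJob>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    // Prefix the key so an ID can never collide with a URL.
    const key = job.sourceJobId ? `id:${job.sourceJobId}` : `url:${job.jobUrl}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(job);
  }
  return unique;
}
```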
## Controls
- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`
- `UKVISAJOBS_HEADLESS`
- `UKVISAJOBS_MAX_JOBS` (default 50, max 200)
- `UKVISAJOBS_SEARCH_KEYWORD`
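
As an illustration of the `UKVISAJOBS_MAX_JOBS` bounds, a parser along these lines would apply the default of 50 and the cap of 200. `resolveMaxJobs` is a hypothetical helper, and falling back to the default for invalid or non-positive values is an assumption about how the extractor behaves.

```typescript
const DEFAULT_MAX_JOBS = 50; // documented default
const MAX_JOBS_CAP = 200; // documented maximum

// Parses the raw environment value, falling back to the default when it is
// unset or invalid, and clamping it to the documented maximum.
function resolveMaxJobs(raw: string | undefined): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed) || parsed < 1) return DEFAULT_MAX_JOBS;
  return Math.min(parsed, MAX_JOBS_CAP);
}
```

In a Node process this would typically be called as `resolveMaxJobs(process.env.UKVISAJOBS_MAX_JOBS)`.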
## Practical notes
- Deleting the auth cache file forces the next run to log in again.
- Scraping is low-concurrency and polite by design.
- If the extractor breaks, check the session refresh path first.