Jobber/documentation/extractors/ukvisajobs.md

# UKVisaJobs Extractor (How It Works)

This is a plain-English walkthrough of the UK Visa Jobs extractor. It's the most complex one because the site requires an authenticated session before the API will return jobs.

## Big picture

There are two layers:

1) `extractors/ukvisajobs/src/main.ts` handles logging in, talking to the UKVisaJobs API, and writing a Crawlee-style dataset.
2) `orchestrator/src/server/services/ukvisajobs.ts` runs that extractor, reads the dataset, de-dupes results, and optionally enriches descriptions.

## 1) Authentication and session cache

The API requires a token + cookies. The extractor keeps these in a cache file:

- `extractors/ukvisajobs/storage/ukvisajobs-auth.json`

Flow:

- If there's a cached session, it uses it.
- If not, it launches a real browser (Playwright + Camoufox), logs in with `UKVISAJOBS_EMAIL` and `UKVISAJOBS_PASSWORD`, then captures the auth cookies + token.
- It stores those values in the cache file for reuse.

You can force a refresh with:

- `UKVISAJOBS_REFRESH_ONLY=1`

If the API responds with an "expired" token error, it will automatically re-login and retry.

## 2) API requests

Once authenticated, it posts to:

- `https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data`

Each request:

- Includes the auth token in a form field.
- Includes cookies in the header (`csrf_token`, `ci_session`, `authToken`).
- Filters by search keyword if provided.
- Uses pagination (15 jobs per page).

## 3) Job mapping

The extractor normalizes the raw API data into the project's job shape:

- Salary is built from min/max values and interval.
- Visa-related flags are turned into a short fallback description if the job has no real description.
- The `job_link` becomes both `jobUrl` and `applicationLink`.

## 4) Output dataset

The extractor writes the results to:

- `extractors/ukvisajobs/storage/datasets/default/`

It mirrors Crawlee's dataset format:

- One JSON file per job.
- A combined `jobs.json` containing all jobs.

## 5) Orchestrator flow (how the app uses it)

When the pipeline runs:

- The server spawns the extractor as a child process (`npx tsx src/main.ts`).
- It can run multiple search terms sequentially (with a short delay between them).
- It reads the dataset and de-dupes by `sourceJobId` (or `jobUrl` fallback).
- If a job's description is missing or too short, it makes a direct HTTP request to the job URL and extracts plain text.
  - This is effectively a curl-style fetch of the job page to fill in the JD for scoring and summarization.

## Controls and limits

Key environment variables:

- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD` (required for auth refresh)
- `UKVISAJOBS_HEADLESS` (set `false` to show the browser)
- `UKVISAJOBS_MAX_JOBS` (default 50, max 200)
- `UKVISAJOBS_SEARCH_KEYWORD` (single keyword filter)

The UI also lets you set max jobs and search terms via the pipeline settings.

## Practical notes

- If you remove the auth cache file, the next run will re-login.
- The extractor is intentionally polite: it runs low concurrency and adds short delays.
- If the API or session changes on the UKVisaJobs side, the refresh logic is the first thing to check.