2026-01-16 00:50:43 +00:00

3.1 KiB

UKVisaJobs Extractor (How It Works)

This is a plain-English walkthrough of the UK Visa Jobs extractor. It's the most complex one because the site requires an authenticated session before the API will return jobs.

Big picture

There are two layers:

  1. extractors/ukvisajobs/src/main.ts handles logging in, talking to the UKVisaJobs API, and writing a Crawlee-style dataset.
  2. orchestrator/src/server/services/ukvisajobs.ts runs that extractor, reads the dataset, de-dupes results, and optionally enriches descriptions.

1) Authentication and session cache

The API requires a token + cookies. The extractor keeps these in a cache file:

  • extractors/ukvisajobs/storage/ukvisajobs-auth.json

Flow:

  • If there's a cached session, it uses it.
  • If not, it launches a real browser (Playwright + Camoufox), logs in with UKVISAJOBS_EMAIL and UKVISAJOBS_PASSWORD, then captures the auth cookies + token.
  • It stores those values in the cache file for reuse.

You can force a refresh with:

  • UKVISAJOBS_REFRESH_ONLY=1

If the API responds with an "expired" token error, it will automatically re-login and retry.

2) API requests

Once authenticated, it posts to:

  • https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data

Each request:

  • Includes the auth token in a form field.
  • Includes cookies in the header (csrf_token, ci_session, authToken).
  • Filters by search keyword if provided.
  • Uses pagination (15 jobs per page).

3) Job mapping

The extractor normalizes the raw API data into the project's job shape:

  • Salary is built from min/max values and interval.
  • Visa-related flags are turned into a short fallback description if the job has no real description.
  • The job_link becomes both jobUrl and applicationLink.

4) Output dataset

The extractor writes the results to:

  • extractors/ukvisajobs/storage/datasets/default/

It mirrors Crawlee's dataset format:

  • One JSON file per job.
  • A combined jobs.json containing all jobs.

5) Orchestrator flow (how the app uses it)

When the pipeline runs:

  • The server spawns the extractor as a child process (npx tsx src/main.ts).
  • It can run multiple search terms sequentially (with a short delay between them).
  • It reads the dataset and de-dupes by sourceJobId (or jobUrl fallback).
  • If a job's description is missing or too short, it makes a direct HTTP request to the job URL and extracts plain text.
    • This is effectively a curl-style fetch of the job page to fill in the JD for scoring and summarization.

Controls and limits

Key environment variables:

  • UKVISAJOBS_EMAIL, UKVISAJOBS_PASSWORD (required for auth refresh)
  • UKVISAJOBS_HEADLESS (set false to show the browser)
  • UKVISAJOBS_MAX_JOBS (default 50, max 200)
  • UKVISAJOBS_SEARCH_KEYWORD (single keyword filter)

The UI also lets you set max jobs and search terms via the pipeline settings.

Practical notes

  • If you remove the auth cache file, the next run will re-login.
  • The extractor is intentionally polite: it runs low concurrency and adds short delays.
  • If the API or session changes on the UKVisaJobs side, the refresh logic is the first thing to check.