3.1 KiB
3.1 KiB
UKVisaJobs Extractor (How It Works)
This is a plain-English walkthrough of the UK Visa Jobs extractor. It's the most complex one because the site requires an authenticated session before the API will return jobs.
Big picture
There are two layers:
extractors/ukvisajobs/src/main.tshandles logging in, talking to the UKVisaJobs API, and writing a Crawlee-style dataset.orchestrator/src/server/services/ukvisajobs.tsruns that extractor, reads the dataset, de-dupes results, and optionally enriches descriptions.
1) Authentication and session cache
The API requires a token + cookies. The extractor keeps these in a cache file:
extractors/ukvisajobs/storage/ukvisajobs-auth.json
Flow:
- If there's a cached session, it uses it.
- If not, it launches a real browser (Playwright + Camoufox), logs in with
UKVISAJOBS_EMAILandUKVISAJOBS_PASSWORD, then captures the auth cookies + token. - It stores those values in the cache file for reuse.
You can force a refresh with:
UKVISAJOBS_REFRESH_ONLY=1
If the API responds with an "expired" token error, it will automatically re-login and retry.
2) API requests
Once authenticated, it posts to:
https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data
Each request:
- Includes the auth token in a form field.
- Includes cookies in the header (
csrf_token,ci_session,authToken). - Filters by search keyword if provided.
- Uses pagination (15 jobs per page).
3) Job mapping
The extractor normalizes the raw API data into the project's job shape:
- Salary is built from min/max values and interval.
- Visa-related flags are turned into a short fallback description if the job has no real description.
- The
job_linkbecomes bothjobUrlandapplicationLink.
4) Output dataset
The extractor writes the results to:
extractors/ukvisajobs/storage/datasets/default/
It mirrors Crawlee's dataset format:
- One JSON file per job.
- A combined
jobs.jsoncontaining all jobs.
5) Orchestrator flow (how the app uses it)
When the pipeline runs:
- The server spawns the extractor as a child process (
npx tsx src/main.ts). - It can run multiple search terms sequentially (with a short delay between them).
- It reads the dataset and de-dupes by
sourceJobId(orjobUrlfallback). - If a job's description is missing or too short, it makes a direct HTTP request to the job URL and extracts plain text.
- This is effectively a curl-style fetch of the job page to fill in the JD for scoring and summarization.
Controls and limits
Key environment variables:
UKVISAJOBS_EMAIL,UKVISAJOBS_PASSWORD(required for auth refresh)UKVISAJOBS_HEADLESS(setfalseto show the browser)UKVISAJOBS_MAX_JOBS(default 50, max 200)UKVISAJOBS_SEARCH_KEYWORD(single keyword filter)
The UI also lets you set max jobs and search terms via the pipeline settings.
Practical notes
- If you remove the auth cache file, the next run will re-login.
- The extractor is intentionally polite: it runs low concurrency and adds short delays.
- If the API or session changes on the UKVisaJobs side, the refresh logic is the first thing to check.