uk visa jobs extractor docs
parent
3021ea0390
commit
554d5b176e
@@ -4,3 +4,4 @@ Technical breakdowns of how each extractor works.

- Gradcracker: `gradcracker.md`
- JobSpy: `jobspy.md`
- UKVisaJobs: `ukvisajobs.md`

87
documentation/extractors/ukvisajobs.md
Normal file
@@ -0,0 +1,87 @@
# UKVisaJobs Extractor (How It Works)

This is a plain-English walkthrough of the UK Visa Jobs extractor. It's the most complex one because the site requires an authenticated session before the API will return jobs.

## Big picture

There are two layers:

1) `extractors/ukvisajobs/src/main.ts` handles logging in, talking to the UKVisaJobs API, and writing a Crawlee-style dataset.
2) `orchestrator/src/server/services/ukvisajobs.ts` runs that extractor, reads the dataset, de-dupes results, and optionally enriches descriptions.

## 1) Authentication and session cache

The API requires an auth token plus session cookies. The extractor keeps these in a cache file:

- `extractors/ukvisajobs/storage/ukvisajobs-auth.json`

Flow:

- If there's a cached session, the extractor uses it.
- If not, it launches a real browser (Playwright + Camoufox), logs in with `UKVISAJOBS_EMAIL` and `UKVISAJOBS_PASSWORD`, then captures the auth cookies + token.
- It stores those values in the cache file for reuse.

You can force a refresh with:

- `UKVISAJOBS_REFRESH_ONLY=1`

If the API responds with an "expired" token error, the extractor automatically logs in again and retries the request.
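
The cache-then-login flow above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the `Session` shape, the function names, the `forceRefresh` flag, the injected `login` and `isExpired` callbacks, and the choice to retry exactly once are all assumptions made for the example.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical shape of the cached session; the real cache file may differ.
interface Session {
  token: string;
  cookies: Record<string, string>;
}

// Read the cache file; return null if it is missing, unreadable, or malformed.
function readCachedSession(cachePath: string): Session | null {
  try {
    const parsed = JSON.parse(fs.readFileSync(cachePath, "utf8"));
    const valid =
      typeof parsed?.token === "string" &&
      typeof parsed?.cookies === "object" &&
      parsed.cookies !== null;
    return valid ? (parsed as Session) : null;
  } catch {
    return null;
  }
}

// Persist the session so later runs can skip the browser login.
function writeCachedSession(cachePath: string, session: Session): void {
  fs.mkdirSync(path.dirname(cachePath), { recursive: true });
  fs.writeFileSync(cachePath, JSON.stringify(session, null, 2));
}

// `login` stands in for the browser-based login step (assumption).
async function getSession(
  cachePath: string,
  login: () => Promise<Session>,
  forceRefresh = false,
): Promise<Session> {
  if (!forceRefresh) {
    const cached = readCachedSession(cachePath);
    if (cached) return cached;
  }
  const fresh = await login();
  writeCachedSession(cachePath, fresh);
  return fresh;
}

// Run `call` with a session; on an expired-token error, log in again and
// retry once (the single retry is an assumption of this sketch).
async function withAuthRetry<T>(
  cachePath: string,
  login: () => Promise<Session>,
  call: (session: Session) => Promise<T>,
  isExpired: (err: unknown) => boolean,
): Promise<T> {
  const session = await getSession(cachePath, login);
  try {
    return await call(session);
  } catch (err) {
    if (!isExpired(err)) throw err;
    const fresh = await getSession(cachePath, login, true);
    return call(fresh);
  }
}
```

Injecting `login` keeps the cache logic independent of the browser automation, so it can be exercised without launching Playwright.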

## 2) API requests

Once authenticated, the extractor sends POST requests to:

- `https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data`

Each request:

- Includes the auth token in a form field.
- Includes cookies in the header (`csrf_token`, `ci_session`, `authToken`).
- Filters by search keyword if provided.
- Uses pagination (15 jobs per page).
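
As a rough sketch of what one such request might look like: the endpoint, the cookie names, and the page size come from the notes above, but the form field names (`token`, `page`, `keyword`), the URL-encoded body, and the helper names are hypothetical and may not match the real API.

```typescript
const API_URL = "https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data";
const PAGE_SIZE = 15; // jobs per page, as described above

// Hypothetical session shape used only for this example.
interface Session {
  token: string;
  cookies: Record<string, string>;
}

interface JobsRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Serialize cookies into a single Cookie header value.
function cookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join("; ");
}

// Build the POST request for one page of results.
// Field names below are assumptions, not the documented API contract.
function buildFetchJobsRequest(session: Session, page: number, keyword?: string): JobsRequest {
  const form = new URLSearchParams();
  form.set("token", session.token); // hypothetical field name
  form.set("page", String(page)); // hypothetical field name
  if (keyword) form.set("keyword", keyword); // hypothetical field name
  return {
    url: API_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/x-www-form-urlencoded",
        Cookie: cookieHeader(session.cookies),
      },
      body: form.toString(),
    },
  };
}

// Number of pages to request in order to collect up to `maxJobs` jobs.
function pagesNeeded(maxJobs: number): number {
  return Math.ceil(maxJobs / PAGE_SIZE);
}
```

With 15 jobs per page, a limit of 50 jobs means requesting 4 pages and trimming the final page to fit.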

## 3) Job mapping

The extractor normalizes the raw API data into the project's job shape:

- Salary is built from min/max values and interval.
- Visa-related flags are turned into a short fallback description if the job has no real description.
- The `job_link` becomes both `jobUrl` and `applicationLink`.
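
The mapping rules above might look something like this in code. Only `job_link`, `jobUrl`, and `applicationLink` are names taken from this document; every other field name, the salary string format, and the wording of the fallback description are assumptions for illustration.

```typescript
// Hypothetical raw API record; only `job_link` is named in the notes above.
interface RawJob {
  id: string | number;
  title: string;
  job_link: string;
  description?: string | null;
  salary_min?: number | null;
  salary_max?: number | null;
  salary_interval?: string | null;
  visa_flags?: Record<string, boolean>;
}

// Hypothetical subset of the project's job shape.
interface Job {
  sourceJobId: string;
  title: string;
  jobUrl: string;
  applicationLink: string;
  salary?: string;
  description: string;
}

// Build a salary string from whichever of min/max is present, plus the interval.
function formatSalary(
  min?: number | null,
  max?: number | null,
  interval?: string | null,
): string | undefined {
  const amounts = [min, max].filter((v): v is number => typeof v === "number");
  if (amounts.length === 0) return undefined;
  const range = amounts.length === 2 && min !== max ? `${min}-${max}` : String(amounts[0]);
  return interval ? `${range} per ${interval}` : range;
}

// Turn the enabled visa flags into a short stand-in description.
function fallbackDescription(flags?: Record<string, boolean>): string {
  const enabled = Object.entries(flags ?? {})
    .filter(([, on]) => on)
    .map(([name]) => name.replace(/_/g, " "));
  return enabled.length > 0 ? `Visa details: ${enabled.join(", ")}.` : "No description provided.";
}

function mapJob(raw: RawJob): Job {
  return {
    sourceJobId: String(raw.id),
    title: raw.title,
    jobUrl: raw.job_link, // the same link feeds both fields
    applicationLink: raw.job_link,
    salary: formatSalary(raw.salary_min, raw.salary_max, raw.salary_interval),
    description: raw.description?.trim() || fallbackDescription(raw.visa_flags),
  };
}
```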

## 4) Output dataset

The extractor writes the results to:

- `extractors/ukvisajobs/storage/datasets/default/`

It mirrors Crawlee's dataset format:

- One JSON file per job.
- A combined `jobs.json` containing all jobs.
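
A minimal sketch of writing that layout, assuming zero-padded sequential file names in the style Crawlee uses for local datasets. The `writeDataset` helper and the exact naming are assumptions for this example, not a copy of the extractor's code.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Write one JSON file per job plus a combined jobs.json.
// Returns the file names written, in order.
function writeDataset(datasetDir: string, jobs: unknown[]): string[] {
  fs.mkdirSync(datasetDir, { recursive: true });
  const written: string[] = [];

  jobs.forEach((job, index) => {
    // Assumed naming: 000000001.json, 000000002.json, ...
    const name = `${String(index + 1).padStart(9, "0")}.json`;
    fs.writeFileSync(path.join(datasetDir, name), JSON.stringify(job, null, 2));
    written.push(name);
  });

  // Combined file holding every job as a single array.
  fs.writeFileSync(path.join(datasetDir, "jobs.json"), JSON.stringify(jobs, null, 2));
  written.push("jobs.json");
  return written;
}
```

The combined file lets a consumer read everything with a single parse, while the per-job files keep the directory compatible with tools that expect one record per file.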

## 5) Orchestrator flow (how the app uses it)

When the pipeline runs:

- The server spawns the extractor as a child process (`npx tsx src/main.ts`).
- It can run multiple search terms sequentially (with a short delay between them).
- It reads the dataset and de-dupes by `sourceJobId` (or `jobUrl` fallback).
- If a job's description is missing or too short, it makes a direct HTTP request to the job URL and extracts plain text.
  - This is effectively a curl-style fetch of the job page to fill in the job description for scoring and summarization.
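
The de-dupe and enrichment steps could be sketched along these lines. The `MIN_DESCRIPTION_LENGTH` threshold, the regex-based HTML stripping, and the function names are assumptions; the orchestrator's real cutoff and text extraction may differ.

```typescript
// Hypothetical subset of the job shape used by the orchestrator.
interface Job {
  sourceJobId?: string;
  jobUrl: string;
  description?: string;
}

// Assumed cutoff for "too short"; the real threshold is not stated here.
const MIN_DESCRIPTION_LENGTH = 100;

// Keep the first job seen for each key: sourceJobId, or jobUrl as a fallback.
function dedupeJobs<T extends Job>(jobs: T[]): T[] {
  const seen = new Map<string, T>();
  for (const job of jobs) {
    const key = job.sourceJobId || job.jobUrl;
    if (!seen.has(key)) seen.set(key, job);
  }
  return Array.from(seen.values());
}

function needsEnrichment(job: Job): boolean {
  return (job.description ?? "").trim().length < MIN_DESCRIPTION_LENGTH;
}

// Crude HTML-to-text: drop script/style blocks, strip tags, collapse whitespace.
function htmlToText(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Fetch the job page directly and use its text as the description.
// Uses the global fetch available in Node 18+.
async function enrichDescription<T extends Job>(job: T): Promise<T> {
  if (!needsEnrichment(job)) return job;
  const response = await fetch(job.jobUrl);
  if (!response.ok) return job; // leave the job unchanged on HTTP errors
  const text = htmlToText(await response.text());
  return text ? { ...job, description: text } : job;
}
```

Keying on `sourceJobId` first means the same posting found under two search terms collapses to one entry even if its URL carries different query parameters.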

## Controls and limits

Key environment variables:

- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD` (required for auth refresh)
- `UKVISAJOBS_HEADLESS` (set `false` to show the browser)
- `UKVISAJOBS_MAX_JOBS` (default 50, max 200)
- `UKVISAJOBS_SEARCH_KEYWORD` (single keyword filter)

The UI also lets you set max jobs and search terms via the pipeline settings.
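
For illustration, the environment variables could be read into a config object like this. The variable names, the default of 50, and the cap of 200 come from the list above; the `ExtractorConfig` shape, the `readConfig` helper, and the treatment of invalid or non-positive values (falling back to the default) are assumptions.

```typescript
const DEFAULT_MAX_JOBS = 50;
const MAX_JOBS_CAP = 200;

// Hypothetical config shape for this example.
interface ExtractorConfig {
  email?: string;
  password?: string;
  headless: boolean;
  maxJobs: number;
  searchKeyword?: string;
}

function readConfig(env: Record<string, string | undefined>): ExtractorConfig {
  // Invalid or non-positive values fall back to the default (assumption).
  const parsed = Number.parseInt(env.UKVISAJOBS_MAX_JOBS ?? "", 10);
  const requested = Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_JOBS;

  return {
    email: env.UKVISAJOBS_EMAIL,
    password: env.UKVISAJOBS_PASSWORD,
    // Headless unless explicitly set to "false".
    headless: env.UKVISAJOBS_HEADLESS !== "false",
    // Clamp to the documented maximum.
    maxJobs: Math.min(requested, MAX_JOBS_CAP),
    searchKeyword: env.UKVISAJOBS_SEARCH_KEYWORD?.trim() || undefined,
  };
}
```

In a Node process this would typically be called as `readConfig(process.env)`; taking the environment as a parameter keeps the parsing easy to test.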

## Practical notes

- If you remove the auth cache file, the next run will log in again.
- The extractor is intentionally polite: it runs with low concurrency and adds short delays between requests.
- If the API or session handling changes on the UKVisaJobs side, the refresh logic is the first thing to check.