diff --git a/documentation/extractors/README.md b/documentation/extractors/README.md index f65c101..f6f5196 100644 --- a/documentation/extractors/README.md +++ b/documentation/extractors/README.md @@ -4,3 +4,4 @@ Technical breakdowns of how each extractor works. - Gradcracker: `gradcracker.md` - JobSpy: `jobspy.md` +- UKVisaJobs: `ukvisajobs.md` diff --git a/documentation/extractors/ukvisajobs.md b/documentation/extractors/ukvisajobs.md new file mode 100644 index 0000000..eff5986 --- /dev/null +++ b/documentation/extractors/ukvisajobs.md @@ -0,0 +1,87 @@ +# UKVisaJobs Extractor (How It Works) + +This is a plain-English walkthrough of the UK Visa Jobs extractor. It's the most complex one because the site requires an authenticated session before the API will return jobs. + +## Big picture + +There are two layers: + +1) `extractors/ukvisajobs/src/main.ts` handles logging in, talking to the UKVisaJobs API, and writing a Crawlee-style dataset. +2) `orchestrator/src/server/services/ukvisajobs.ts` runs that extractor, reads the dataset, de-dupes results, and optionally enriches descriptions. + +## 1) Authentication and session cache + +The API requires a token + cookies. The extractor keeps these in a cache file: + +- `extractors/ukvisajobs/storage/ukvisajobs-auth.json` + +Flow: + +- If there's a cached session, it uses it. +- If not, it launches a real browser (Playwright + Camoufox), logs in with `UKVISAJOBS_EMAIL` and `UKVISAJOBS_PASSWORD`, then captures the auth cookies + token. +- It stores those values in the cache file for reuse. + +You can force a refresh with: + +- `UKVISAJOBS_REFRESH_ONLY=1` + +If the API responds with an "expired" token error, it will automatically re-login and retry. + +## 2) API requests + +Once authenticated, it posts to: + +- `https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data` + +Each request: + +- Includes the auth token in a form field. +- Includes cookies in the header (`csrf_token`, `ci_session`, `authToken`). +- Filters by search keyword if provided. +- Uses pagination (15 jobs per page). + +## 3) Job mapping + +The extractor normalizes the raw API data into the project's job shape: + +- Salary is built from min/max values and interval. +- Visa-related flags are turned into a short fallback description if the job has no real description. +- The `job_link` becomes both `jobUrl` and `applicationLink`. + +## 4) Output dataset + +The extractor writes the results to: + +- `extractors/ukvisajobs/storage/datasets/default/` + +It mirrors Crawlee's dataset format: + +- One JSON file per job. +- A combined `jobs.json` containing all jobs. + +## 5) Orchestrator flow (how the app uses it) + +When the pipeline runs: + +- The server spawns the extractor as a child process (`npx tsx src/main.ts`). +- It can run multiple search terms sequentially (with a short delay between them). +- It reads the dataset and de-dupes by `sourceJobId` (or `jobUrl` fallback). +- If a job's description is missing or too short, it makes a direct HTTP request to the job URL and extracts plain text. + - This is effectively a curl-style fetch of the job page to fill in the JD for scoring and summarization. + +## Controls and limits + +Key environment variables: + +- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD` (required for auth refresh) +- `UKVISAJOBS_HEADLESS` (set `false` to show the browser) +- `UKVISAJOBS_MAX_JOBS` (default 50, max 200) +- `UKVISAJOBS_SEARCH_KEYWORD` (single keyword filter) + +The UI also lets you set max jobs and search terms via the pipeline settings. + +## Practical notes + +- If you remove the auth cache file, the next run will re-login. +- The extractor is intentionally polite: it runs low concurrency and adds short delays. +- If the API or session changes on the UKVisaJobs side, the refresh logic is the first thing to check.