| id | title | description | sidebar_position |
|---|---|---|---|
| ukvisajobs | UKVisaJobs Extractor | Authenticated session flow, API pagination, and orchestrator ingestion. | 5 |
UKVisaJobs is the most complex extractor because authenticated sessions are required.
Original website: my.ukvisajobs.com
Big picture
Two layers:
- `extractors/ukvisajobs/src/main.ts` handles login/API calls and dataset output.
- `orchestrator/src/server/services/ukvisajobs.ts` executes extractor and ingests/de-dupes output.
1) Authentication and session cache
Session cache file:
extractors/ukvisajobs/storage/ukvisajobs-auth.json
Flow:
- Reuse cached token/cookies when valid
- Re-login with Playwright + Camoufox when needed
- Refresh and retry on token-expired responses
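The flow above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the `CachedSession` shape (including `expiresAt`), the `ApiResult` type, and the `login`/`request` callbacks are all hypothetical names introduced for the example.

```typescript
// Hypothetical shape of the cached session; the real auth file may differ.
interface CachedSession {
  token: string;
  cookies: string[];
  expiresAt: number; // epoch milliseconds (assumed field)
}

// Hypothetical result type: the API either returns data or signals an expired token.
interface ApiResult<T> {
  tokenExpired: boolean;
  data?: T;
}

// Decide whether the cached session can be reused or a fresh login is needed.
function chooseAuthAction(
  cached: CachedSession | null,
  now: number,
): "reuse" | "login" {
  return cached !== null && cached.expiresAt > now ? "reuse" : "login";
}

// Run a request; if the response reports an expired token, log in again and retry once.
async function withRefresh<T>(
  session: CachedSession,
  login: () => Promise<CachedSession>,
  request: (s: CachedSession) => Promise<ApiResult<T>>,
): Promise<ApiResult<T>> {
  const first = await request(session);
  if (!first.tokenExpired) return first;
  const fresh = await login();
  return request(fresh);
}
```

Retrying only once keeps a persistently failing login from looping indefinitely; whether the real extractor bounds retries this way is an assumption.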
Force refresh:
UKVISAJOBS_REFRESH_ONLY=1
2) API requests
Endpoint:
https://my.ukvisajobs.com/ukvisa-api/api/fetch-jobs-data
Each request sends the auth token and session cookies. Results are paginated at 15 jobs per page.
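As a rough illustration of the pagination arithmetic, the sketch below computes how many pages are needed for a given job limit and loops over them. The `fetchPage` callback, 1-based page numbering, and stopping on a short page are assumptions made for the example; only the page size of 15 comes from the text.

```typescript
const PAGE_SIZE = 15; // from the text: 15 jobs per page

// Number of pages to request in order to collect up to maxJobs results.
function pagesNeeded(maxJobs: number, pageSize: number = PAGE_SIZE): number {
  if (maxJobs <= 0) return 0;
  return Math.ceil(maxJobs / pageSize);
}

// Fetch pages sequentially until the limit is reached or results run out.
// fetchPage is a hypothetical callback wrapping the authenticated API call.
async function fetchAll<T>(
  fetchPage: (page: number) => Promise<T[]>,
  maxJobs: number,
): Promise<T[]> {
  const jobs: T[] = [];
  const total = pagesNeeded(maxJobs);
  for (let page = 1; page <= total; page++) {
    const batch = await fetchPage(page);
    jobs.push(...batch);
    if (batch.length < PAGE_SIZE) break; // a short page suggests no more results
  }
  return jobs.slice(0, maxJobs);
}
```

With the default limit of 50 jobs this requests 4 pages, and with the maximum of 200 it requests 14.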
3) Mapping
- Normalizes salary from min/max/interval
- Builds fallback visa description when content missing
- Maps `job_link` to both `jobUrl` and `applicationLink`
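A minimal sketch of these mapping steps is shown below. Only `job_link`, `jobUrl`, and `applicationLink` come from the text; the other raw field names, the salary string format, and the fallback description wording are assumptions for illustration.

```typescript
// Hypothetical raw record; only job_link is named in the text.
interface RawJob {
  title: string;
  job_link: string;
  description?: string | null;
  min_salary?: number | null;
  max_salary?: number | null;
  salary_interval?: string | null;
}

// Combine min/max/interval into one display string, or null when no salary is given.
function normalizeSalary(
  min: number | null | undefined,
  max: number | null | undefined,
  interval: string | null | undefined,
): string | null {
  const lo = min ?? null;
  const hi = max ?? null;
  if (lo === null && hi === null) return null;
  const range =
    lo !== null && hi !== null && lo !== hi ? `${lo}-${hi}` : `${lo ?? hi}`;
  return interval ? `${range} per ${interval}` : range;
}

function mapJob(raw: RawJob) {
  const description = raw.description ? raw.description.trim() : "";
  return {
    title: raw.title,
    // job_link feeds both URL fields, as described above.
    jobUrl: raw.job_link,
    applicationLink: raw.job_link,
    salary: normalizeSalary(raw.min_salary, raw.max_salary, raw.salary_interval),
    // Assumed fallback wording, used when the listing has no content.
    description: description || `Visa-sponsored role: ${raw.title}`,
  };
}
```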
4) Output dataset
Written to:
extractors/ukvisajobs/storage/datasets/default/
Includes per-job JSON files and combined jobs.json.
5) Orchestrator flow
- Spawns extractor (`npx tsx src/main.ts`)
- Runs terms sequentially with delay
- De-dupes by `sourceJobId` (fallback `jobUrl`)
- Fetches detail pages when descriptions are too short
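The de-dupe step could look roughly like the sketch below, keyed on `sourceJobId` with `jobUrl` as the fallback. The `Job` interface and the first-occurrence-wins policy are assumptions; the orchestrator's actual implementation may differ.

```typescript
// Minimal job shape for the example; the real type has more fields.
interface Job {
  sourceJobId?: string | null;
  jobUrl: string;
}

// Keep the first occurrence of each job, keyed by sourceJobId or, failing that, jobUrl.
function dedupeJobs<T extends Job>(jobs: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const job of jobs) {
    // Prefix the key so an ID can never collide with a URL.
    const key = job.sourceJobId ? `id:${job.sourceJobId}` : `url:${job.jobUrl}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(job);
  }
  return unique;
}
```

Because search terms run sequentially, the same listing can surface under several terms, which is why a key-based pass over the combined results is useful.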
Controls
- `UKVISAJOBS_EMAIL`, `UKVISAJOBS_PASSWORD`
- `UKVISAJOBS_HEADLESS`
- `UKVISAJOBS_MAX_JOBS` (default 50, max 200)
- `UKVISAJOBS_SEARCH_KEYWORD`
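To illustrate the `UKVISAJOBS_MAX_JOBS` bounds, the sketch below resolves the value from an environment map. The function name is hypothetical, and treating missing, non-numeric, or non-positive values as the default is an assumption; only the default of 50 and the cap of 200 come from the text.

```typescript
const DEFAULT_MAX_JOBS = 50; // documented default
const HARD_MAX_JOBS = 200; // documented maximum

// Resolve the job limit from an environment map such as process.env.
function resolveMaxJobs(env: Record<string, string | undefined>): number {
  const raw = env["UKVISAJOBS_MAX_JOBS"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  // Assumed behavior: fall back to the default for invalid or non-positive input.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_MAX_JOBS;
  return Math.min(parsed, HARD_MAX_JOBS);
}
```

Passing the environment as a parameter rather than reading a global keeps the function easy to test.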
Practical notes
- Deleting auth cache forces next run to re-login.
- Low-concurrency/polite scraping by design.
- If extractor breaks, check session refresh path first.