Gradcracker Scraper (How It Works)
This is a plain-English walkthrough of the Gradcracker extractor in extractors/gradcracker.
Big picture
The scraper builds a list of Gradcracker search URLs, visits each list page, extracts job cards, then opens each job's detail page to grab the full description and the external application link.
1) Build search URLs
- It starts with a fixed set of UK regions (e.g. London & South East, West Midlands, South West).
- It uses default role terms like web-development and software-systems.
- If you set GRADCRACKER_SEARCH_TERMS, those replace the defaults (JSON array of strings).
- Every role is combined with every location to form a Gradcracker search URL, sorted by newest first.
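The role-by-location combination above can be sketched roughly as follows. This is an illustration only, not the extractor's actual code: the location slugs, the URL_TEMPLATE layout, the order=newest query parameter, and the function names are all assumptions made up for the example. Only the two default role terms and the GRADCRACKER_SEARCH_TERMS variable come from this document.

```python
import json
import os
from itertools import product

# Default role terms named in this document.
DEFAULT_ROLES = ["web-development", "software-systems"]

# Hypothetical slugs and URL layout; the real ones live in the extractor.
LOCATIONS = ["london-and-south-east", "west-midlands", "south-west"]
URL_TEMPLATE = "https://www.gradcracker.com/search/{role}/jobs-in-{location}?order=newest"


def load_roles(env=None):
    """Return role terms from GRADCRACKER_SEARCH_TERMS, or the defaults."""
    env = os.environ if env is None else env
    raw = env.get("GRADCRACKER_SEARCH_TERMS")
    if not raw:
        return list(DEFAULT_ROLES)
    terms = json.loads(raw)
    if not isinstance(terms, list) or not all(isinstance(t, str) for t in terms):
        raise ValueError("GRADCRACKER_SEARCH_TERMS must be a JSON array of strings")
    return terms


def build_search_urls(roles, locations):
    """Combine every role with every location into one search URL each."""
    return [
        URL_TEMPLATE.format(role=role, location=location)
        for role, location in product(roles, locations)
    ]
```

With two roles and three locations this yields six URLs, one per pair.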
2) Crawl list pages
On each list page it:
- Waits for the job cards to load (article[wire:key]).
- Scrapes basic fields from each card: title, employer, employer URL, discipline, deadline, salary, location, degree required, and start date.
- Queues each job's detail page for deeper scraping.
Optional controls:
- GRADCRACKER_MAX_JOBS_PER_TERM caps how many jobs are queued per role term.
- JOBOPS_SKIP_APPLY_FOR_EXISTING=1 and JOBOPS_EXISTING_JOB_URLS (or JOBOPS_EXISTING_JOB_URLS_FILE) let it skip jobs you already know about.
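As a rough sketch of how these controls might be read, the helpers below load the known job URLs and apply the per-term cap. The environment variable names come from this document. The rest is assumed for illustration: that both the inline value and the file hold a JSON array of URL strings, that the inline variable takes precedence over the file, and the helper names themselves.

```python
import json
import os
from pathlib import Path


def load_existing_urls(env=None):
    """Return the set of already-known job URLs, or an empty set if skipping is off."""
    env = os.environ if env is None else env
    if env.get("JOBOPS_SKIP_APPLY_FOR_EXISTING") != "1":
        return set()
    raw = env.get("JOBOPS_EXISTING_JOB_URLS")
    if not raw:
        # Assumption: the file holds the same JSON array as the inline variable.
        path = env.get("JOBOPS_EXISTING_JOB_URLS_FILE")
        raw = Path(path).read_text(encoding="utf-8") if path else "[]"
    return set(json.loads(raw))


def cap_jobs(job_urls, max_jobs=None):
    """Keep at most max_jobs URLs for one role term; None means no cap."""
    return list(job_urls) if max_jobs is None else list(job_urls)[:max_jobs]


def is_known(job_url, existing_urls):
    """True when the job was seen before and the skip rule applies to it."""
    return job_url in existing_urls
```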
3) Crawl job detail pages
On each job page it:
- Waits for the main content block (.body-content).
- Saves the full description text.
- Looks for the Apply button and clicks it to capture the final application URL.
- Handles both popup windows and same-tab redirects.
- Waits for the URL to stabilize before recording it.
- Skips the Apply click if the job is already known (same env rules as above).
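One simple way to "wait for the URL to stabilize" is to poll it until it stops changing for a few consecutive reads. The sketch below shows that idea in isolation. The function name, the poll count, the interval, and the timeout are assumptions for illustration, and the extractor may use a different strategy.

```python
import time


def wait_for_stable_url(get_url, stable_polls=3, interval=0.5, timeout=15.0,
                        sleep=time.sleep):
    """Poll get_url() until it returns the same value stable_polls times in a row.

    Returns the last URL seen, even if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    last = get_url()
    streak = 1
    while streak < stable_polls and time.monotonic() < deadline:
        sleep(interval)
        current = get_url()
        if current == last:
            streak += 1
        else:
            # A redirect happened: restart the count from the new URL.
            last, streak = current, 1
    return last
```

With Playwright's Python API the callable could be something like lambda: page.url, where page is either the popup or the original tab, depending on how the Apply click behaved.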
4) Progress reporting (optional)
If JOBOPS_EMIT_PROGRESS=1 is set, the extractor prints structured progress lines that the orchestrator can stream into the UI.
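The exact line format is not described here, so the following is only a guess at what a structured progress line could look like: a fixed prefix followed by one JSON object per line. The JOBOPS_PROGRESS prefix, the field names, and both helper functions are hypothetical. Only the JOBOPS_EMIT_PROGRESS=1 switch is taken from this document.

```python
import json
import os
import sys

# Hypothetical marker so an orchestrator could tell progress lines from other logs.
PROGRESS_PREFIX = "JOBOPS_PROGRESS "


def format_progress(event, **fields):
    """Build one progress line: the prefix plus a JSON object with sorted keys."""
    return PROGRESS_PREFIX + json.dumps({"event": event, **fields}, sort_keys=True)


def emit_progress(event, env=None, stream=None, **fields):
    """Print a progress line only when JOBOPS_EMIT_PROGRESS=1; report whether it did."""
    env = os.environ if env is None else env
    if env.get("JOBOPS_EMIT_PROGRESS") != "1":
        return False
    out = sys.stdout if stream is None else stream
    # Flush immediately so a parent process can stream lines as they appear.
    print(format_progress(event, **fields), file=out, flush=True)
    return True
```

One JSON object per line keeps parsing on the consuming side trivial: read a line, strip the prefix, decode.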
Notes
- The crawler runs with Playwright + Crawlee, launched through Camoufox to look more like a real browser.
- Concurrency is kept low (1 or 2) and timeouts are generous to reduce flakiness.