| id | title | description | sidebar_position |
|---|---|---|---|
| gradcracker | Gradcracker Extractor | How the Gradcracker crawler builds search URLs and extracts jobs. | 2 |
A plain-English walkthrough of the Gradcracker extractor in extractors/gradcracker.
Original website: gradcracker.com
Big picture
The crawler builds search URLs, scrapes listing pages, then opens job details for descriptions and apply URLs.
1) Build search URLs
- Combines UK regions with role terms.
- Defaults include roles such as `web-development` and `software-systems`.
- `GRADCRACKER_SEARCH_TERMS` overrides defaults.
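The region-by-role combination described above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: the function names, the query-parameter names (`region`, `role`), the base URL, and the accepted formats of `GRADCRACKER_SEARCH_TERMS` (JSON array or comma-separated list) are all assumptions.

```typescript
// Hypothetical sketch: names, URL pattern, and env formats are assumptions.
const DEFAULT_ROLES: string[] = ["web-development", "software-systems"];

type SearchEnv = Record<string, string | undefined>;

// Read role terms from the override variable, falling back to the defaults.
// Accepts a JSON array or a comma-separated list (both formats are assumed).
function resolveRoles(env: SearchEnv): string[] {
  const raw = (env["GRADCRACKER_SEARCH_TERMS"] ?? "").trim();
  if (raw.length === 0) return DEFAULT_ROLES;
  let terms: string[];
  try {
    const parsed: unknown = JSON.parse(raw);
    terms = Array.isArray(parsed) ? parsed.map(String) : raw.split(",");
  } catch {
    // Not valid JSON: treat the value as a comma-separated list.
    terms = raw.split(",");
  }
  const cleaned = terms.map((t) => t.trim()).filter((t) => t.length > 0);
  return cleaned.length > 0 ? cleaned : DEFAULT_ROLES;
}

// Pair every region with every role term, producing one URL per pair.
function buildSearchUrls(regions: string[], roles: string[], baseUrl: string): string[] {
  const urls: string[] = [];
  for (const region of regions) {
    for (const role of roles) {
      urls.push(
        `${baseUrl}?region=${encodeURIComponent(region)}&role=${encodeURIComponent(role)}`,
      );
    }
  }
  return urls;
}
```

With two regions and two role terms this yields four search URLs, which is why the number of terms has a direct effect on crawl time.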
2) Crawl list pages
- Waits for job cards (`article[wire:key]`).
- Extracts title, employer, discipline, deadline, salary, location, degree, start date.
- Queues job detail pages.
Controls:
- `GRADCRACKER_MAX_JOBS_PER_TERM`
- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1`
- `JOBOPS_EXISTING_JOB_URLS` / `JOBOPS_EXISTING_JOB_URLS_FILE`
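One way these controls could interact is sketched below. The semantics are assumptions made for illustration: that `GRADCRACKER_MAX_JOBS_PER_TERM` caps how many listed jobs are queued per search term, that `JOBOPS_EXISTING_JOB_URLS` holds a JSON array of URLs, and that `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` skips apply-URL resolution for jobs already in that set. The file-based variant (`JOBOPS_EXISTING_JOB_URLS_FILE`) is left out, and all type and function names are hypothetical.

```typescript
// Hypothetical sketch: variable formats and semantics are assumptions.
type ControlEnv = Record<string, string | undefined>;

interface ListedJob {
  title: string;
  url: string;
}

interface CrawlControls {
  maxJobsPerTerm: number;
  skipApplyForExisting: boolean;
  existingUrls: Set<string>;
}

function readControls(env: ControlEnv): CrawlControls {
  const max = Number.parseInt(env["GRADCRACKER_MAX_JOBS_PER_TERM"] ?? "", 10);
  let existing: string[] = [];
  try {
    const parsed: unknown = JSON.parse(env["JOBOPS_EXISTING_JOB_URLS"] ?? "[]");
    if (Array.isArray(parsed)) existing = parsed.map(String);
  } catch {
    // Malformed value: treat it as "no known jobs".
  }
  return {
    // No valid cap means unlimited.
    maxJobsPerTerm: Number.isFinite(max) && max > 0 ? max : Infinity,
    skipApplyForExisting: env["JOBOPS_SKIP_APPLY_FOR_EXISTING"] === "1",
    existingUrls: new Set(existing),
  };
}

// Cap the jobs for one search term and flag which ones still need
// their apply URL resolved on the detail page.
function planDetailVisits(
  jobs: ListedJob[],
  controls: CrawlControls,
): { job: ListedJob; resolveApply: boolean }[] {
  return jobs.slice(0, controls.maxJobsPerTerm).map((job) => ({
    job,
    resolveApply: !(controls.skipApplyForExisting && controls.existingUrls.has(job.url)),
  }));
}
```

Skipping apply resolution for known jobs avoids the button click and redirect wait on each repeat run, which is likely the slowest part of a detail visit.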
3) Crawl detail pages
- Waits for `.body-content`
- Captures full description text
- Clicks apply button to resolve final application URL
- Handles popup and same-tab redirects
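The popup versus same-tab handling can be sketched as below. This is an illustration under stated assumptions, not the extractor's code: `PageLike` is a hypothetical minimal interface covering only the methods used here (Playwright's `Page` offers methods with these names), and the apply-button selector and timeout are caller-supplied placeholders.

```typescript
// Hypothetical minimal view of a browser page; only the methods used below.
interface PageLike {
  url(): string;
  click(selector: string): Promise<void>;
  waitForEvent(event: "popup", options?: { timeout?: number }): Promise<PageLike>;
  waitForLoadState(state?: "load" | "domcontentloaded"): Promise<void>;
}

// Click the apply button and return the URL it leads to, or null if the
// click neither opened a popup nor navigated the current tab.
async function resolveApplyUrl(
  page: PageLike,
  applySelector: string,
  popupTimeoutMs = 3000,
): Promise<string | null> {
  const startUrl = page.url();

  // Start listening before the click so a fast popup is not missed.
  // A timeout rejection is mapped to null, meaning "no popup appeared".
  const popupPromise: Promise<PageLike | null> = page
    .waitForEvent("popup", { timeout: popupTimeoutMs })
    .catch(() => null);

  await page.click(applySelector);

  const popup = await popupPromise;
  if (popup !== null) {
    // New tab or window: read the URL once it has started loading.
    await popup.waitForLoadState("domcontentloaded");
    return popup.url();
  }

  // Same-tab case: the click may have navigated the current page.
  await page.waitForLoadState("domcontentloaded");
  const endUrl = page.url();
  return endUrl !== startUrl ? endUrl : null;
}
```

A trade-off in this sketch is that the same-tab path always waits out the full popup timeout before checking the current URL, so a short timeout keeps detail pages moving at the cost of missing slow popups.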
4) Progress reporting
Set `JOBOPS_EMIT_PROGRESS=1` to emit structured progress lines that the orchestrator UI can consume.
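As a rough illustration of what "structured progress lines" could look like, the sketch below writes one prefixed JSON object per line and parses it back on the consumer side. The prefix, the field names, and both function names are hypothetical, and the extractor's real line format may differ.

```typescript
// Hypothetical line format: a fixed prefix followed by a JSON payload.
const PROGRESS_PREFIX = "JOBOPS_PROGRESS ";

type ProgressEnv = Record<string, string | undefined>;

interface CrawlProgress {
  phase: "list" | "detail";
  current: number;
  total: number;
}

// Producer side: returns a line to print, or null when reporting is disabled.
function formatProgressLine(progress: CrawlProgress, env: ProgressEnv): string | null {
  if (env["JOBOPS_EMIT_PROGRESS"] !== "1") return null;
  return PROGRESS_PREFIX + JSON.stringify(progress);
}

// Consumer side: returns the payload, or null for ordinary log lines.
function parseProgressLine(line: string): CrawlProgress | null {
  if (!line.startsWith(PROGRESS_PREFIX)) return null;
  try {
    return JSON.parse(line.slice(PROGRESS_PREFIX.length)) as CrawlProgress;
  } catch {
    return null;
  }
}
```

A fixed prefix lets a consumer separate progress events from regular crawler logs on the same output stream without parsing every line as JSON.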
Notes
- Uses Playwright + Crawlee via Camoufox.
- Low concurrency and longer timeouts for stability.