gradcracker technical breakdown
parent e204f48235
commit 976f7b4878
@@ -32,6 +32,8 @@ Essential variables in `.env`:
- `/extractors`: Specialized scrapers (Gradcracker, JobSpy, UKVisaJobs).
- `/resume-generator`: Python script for RxResume PDF automation.
- `/data`: Persistent storage for SQLite DB and generated PDFs.

Technical breakdowns here: `documentation/extractors/README.md`
2. Put your exported RXResume JSON at `resume-generator/base.json`.
3. Start: `docker compose up -d --build`
4. Open:
documentation/extractors/README.md
Normal file
@@ -0,0 +1,5 @@
# Extractors

Technical breakdowns of how each extractor works.

- Gradcracker: `gradcracker.md`
documentation/extractors/gradcracker.md
Normal file
@@ -0,0 +1,47 @@
# Gradcracker Scraper (How It Works)

This is a plain-English walkthrough of the Gradcracker extractor in `extractors/gradcracker`.

## Big picture

The scraper builds a list of Gradcracker search URLs, visits each list page, extracts job cards, then opens each job's detail page to grab the full description and the external application link.

## 1) Build search URLs

- It starts with a fixed set of UK regions (e.g. London & South East, West Midlands, South West).
- It uses default role terms like `web-development` and `software-systems`.
- If `GRADCRACKER_SEARCH_TERMS` is set (a JSON array of strings), its values replace the default role terms.
- Every role is combined with every location to form a Gradcracker search URL, sorted by newest first.
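
The role-by-location combination above can be sketched roughly as follows. This is an illustration only, not the extractor's actual code: the function names, the slug lists, the fallback on malformed JSON, and especially the URL pattern (a placeholder on `example.invalid`) are assumptions. The environment is passed in as a plain object to keep the sketch self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical slugs: the walkthrough names these regions and roles only as examples.
const DEFAULT_LOCATIONS = ["london-and-south-east", "west-midlands", "south-west"];
const DEFAULT_ROLES = ["web-development", "software-systems"];

// Use GRADCRACKER_SEARCH_TERMS (a JSON array of strings) when present and valid,
// otherwise fall back to the defaults (the fallback behaviour is an assumption).
function resolveRoles(env: Env): string[] {
  const raw = env["GRADCRACKER_SEARCH_TERMS"];
  if (!raw) return DEFAULT_ROLES;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (Array.isArray(parsed) && parsed.length > 0 && parsed.every((t) => typeof t === "string")) {
      return parsed as string[];
    }
  } catch {
    // Malformed JSON: fall through to the defaults.
  }
  return DEFAULT_ROLES;
}

// Combine every role with every location. The URL shape below is a placeholder,
// not Gradcracker's real route.
function buildSearchUrls(env: Env): string[] {
  const urls: string[] = [];
  for (const role of resolveRoles(env)) {
    for (const location of DEFAULT_LOCATIONS) {
      urls.push(
        `https://example.invalid/search/${encodeURIComponent(role)}/${encodeURIComponent(location)}?order=newest`,
      );
    }
  }
  return urls;
}
```

With the placeholder lists above, two default roles and three regions yield six URLs; setting `GRADCRACKER_SEARCH_TERMS='["data-science"]'` would yield three.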

## 2) Crawl list pages

On each list page it:

- Waits for the job cards to load (`article[wire:key]`).
- Scrapes basic fields from each card: title, employer, employer URL, discipline, deadline, salary, location, degree required, and start date.
- Queues each job's detail page for deeper scraping.

Optional controls:

- `GRADCRACKER_MAX_JOBS_PER_TERM` caps how many jobs are queued per role term.
- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` and `JOBOPS_EXISTING_JOB_URLS` (or `JOBOPS_EXISTING_JOB_URLS_FILE`) let it skip jobs you already know about.
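
A minimal sketch of how these two controls might combine when deciding which jobs to queue. Several details are assumptions not stated in this walkthrough: that `JOBOPS_EXISTING_JOB_URLS` holds a JSON array of URLs, that known jobs are filtered out before the cap is applied, and all function names. The file-based variant is left out.

```typescript
type Env = Record<string, string | undefined>;

// Known job URLs, honoured only when the skip flag is "1".
// Assumes JOBOPS_EXISTING_JOB_URLS is a JSON array of strings.
function loadExistingUrls(env: Env): Set<string> {
  if (env["JOBOPS_SKIP_APPLY_FOR_EXISTING"] !== "1") return new Set<string>();
  try {
    const parsed: unknown = JSON.parse(env["JOBOPS_EXISTING_JOB_URLS"] ?? "[]");
    const urls = Array.isArray(parsed) ? parsed.filter((u): u is string => typeof u === "string") : [];
    return new Set<string>(urls);
  } catch {
    return new Set<string>();
  }
}

// Per-term cap; a missing or invalid value means "no cap".
function parseCap(env: Env): number {
  const cap = Number(env["GRADCRACKER_MAX_JOBS_PER_TERM"]);
  return Number.isInteger(cap) && cap > 0 ? cap : Infinity;
}

// Drop known jobs first, then apply the cap (this ordering is an assumption).
function selectJobsToQueue(jobUrls: string[], env: Env): string[] {
  const existing = loadExistingUrls(env);
  return jobUrls.filter((url) => !existing.has(url)).slice(0, parseCap(env));
}
```

Filtering before capping means the cap counts only new jobs; capping first would be an equally plausible design and would change which jobs get through.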

## 3) Crawl job detail pages

On each job page it:

- Waits for the main content block (`.body-content`).
- Saves the full description text.
- Looks for the Apply button and clicks it to capture the final application URL.
- Handles both popup windows and same-tab redirects.
- Waits for the URL to stabilize before recording it.
- Skips the Apply click if the job is already known (same environment variables as in step 2).
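
One way to express "wait for the URL to stabilize" is to sample the page URL at intervals and accept it once it has stayed unchanged for several consecutive samples. The sketch below shows only that decision logic on samples already collected; in a crawler the samples would come from repeatedly reading Playwright's `page.url()` on the popup or the redirected tab. The function name and the threshold are hypothetical, and the walkthrough does not say how the extractor actually detects stability.

```typescript
// Return the first URL that appears `requiredRepeats` times in a row,
// or undefined if the samples never settle.
function pickStableUrl(samples: string[], requiredRepeats = 3): string | undefined {
  let run = 0;
  let previous: string | undefined;
  for (const url of samples) {
    // Extend the current run if the URL is unchanged, otherwise start a new run.
    run = url === previous ? run + 1 : 1;
    previous = url;
    if (run >= requiredRepeats) return url;
  }
  return undefined;
}
```

A redirect chain sampled as `["a", "b", "b", "c", "c", "c"]` settles on `"c"`, while `["a", "b"]` never settles and returns `undefined`, which a caller could treat as a timeout.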

## 4) Progress reporting (optional)

If `JOBOPS_EMIT_PROGRESS=1` is set, the extractor prints structured progress lines that the orchestrator can stream into the UI.
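
The exact line format is not described here. As a hedged illustration, a structured progress line could be a fixed prefix followed by JSON, which lets a consumer separate progress from ordinary log output. The prefix, the payload fields, and both function names below are assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical payload shape; the real extractor may report different fields.
interface ScrapeProgress {
  event: string;
  term?: string;
  processed?: number;
  total?: number;
}

// Hypothetical marker that distinguishes progress lines from other output.
const PROGRESS_PREFIX = "JOBOPS_PROGRESS ";

// Producer side: returns the line to print, or undefined when reporting is off.
function formatProgressLine(env: Env, progress: ScrapeProgress): string | undefined {
  if (env["JOBOPS_EMIT_PROGRESS"] !== "1") return undefined;
  return PROGRESS_PREFIX + JSON.stringify(progress);
}

// Consumer side: returns the payload for progress lines, undefined for anything else.
function parseProgressLine(line: string): ScrapeProgress | undefined {
  if (!line.startsWith(PROGRESS_PREFIX)) return undefined;
  return JSON.parse(line.slice(PROGRESS_PREFIX.length)) as ScrapeProgress;
}
```

Keeping each event on a single line means the orchestrator can parse the extractor's output line by line without buffering.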

## Notes

- The crawler runs with Playwright + Crawlee, launched through Camoufox to look more like a real browser.
- Concurrency is kept low (1 or 2) and timeouts are generous to reduce flakiness.
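
For orientation, low concurrency and generous timeouts could be expressed through Crawlee's `PlaywrightCrawler` options along these lines. The option names are Crawlee's; the values are illustrative guesses rather than the extractor's real settings, and the Camoufox launch setup is omitted.

```typescript
// Illustrative values only; the extractor's real settings are not given in this walkthrough.
const crawlerOptions = {
  maxConcurrency: 2, // at most two pages in flight at once
  maxRequestRetries: 2, // retry a failed page a couple of times
  navigationTimeoutSecs: 60, // allow slow page loads
  requestHandlerTimeoutSecs: 180, // room for the Apply click and redirects
};
```

An object like this would be spread into the crawler constructor alongside the request handler, e.g. `new PlaywrightCrawler({ ...crawlerOptions, requestHandler })`.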