From 976f7b48789d33908c0ec7c6b46e7a224a9bcb4e Mon Sep 17 00:00:00 2001
From: DaKheera47
Date: Fri, 16 Jan 2026 00:40:01 +0000
Subject: [PATCH] gradcracker technical breakdown

---
 README.md                               |  2 ++
 documentation/extractors/README.md      |  5 +++
 documentation/extractors/gradcracker.md | 47 +++++++++++++++++++++++++
 3 files changed, 54 insertions(+)
 create mode 100644 documentation/extractors/README.md
 create mode 100644 documentation/extractors/gradcracker.md

diff --git a/README.md b/README.md
index a0fc08e..ff766c0 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,8 @@ Essential variables in `.env`:
 - `/extractors`: Specialized scrapers (Gradcracker, JobSpy, UKVisaJobs).
 - `/resume-generator`: Python script for RxResume PDF automation.
 - `/data`: Persistent storage for SQLite DB and generated PDFs.
+
+Technical breakdowns here: `documentation/extractors/README.md`
 2. Put your exported RXResume JSON at `resume-generator/base.json`.
 3. Start: `docker compose up -d --build`
 4. Open:
diff --git a/documentation/extractors/README.md b/documentation/extractors/README.md
new file mode 100644
index 0000000..acc49d7
--- /dev/null
+++ b/documentation/extractors/README.md
@@ -0,0 +1,5 @@
+# Extractors
+
+Technical breakdowns of how each extractor works.
+
+- Gradcracker: `gradcracker.md`
diff --git a/documentation/extractors/gradcracker.md b/documentation/extractors/gradcracker.md
new file mode 100644
index 0000000..ab98046
--- /dev/null
+++ b/documentation/extractors/gradcracker.md
@@ -0,0 +1,47 @@
+# Gradcracker Scraper (How It Works)
+
+This is a plain-English walkthrough of the Gradcracker extractor in `extractors/gradcracker`.
+
+## Big picture
+
+The scraper builds a list of Gradcracker search URLs, visits each list page, extracts job cards, then opens each job's detail page to grab the full description and the external application link.
+
+## 1) Build search URLs
+
+- It starts with a fixed set of UK regions (e.g. London & South East, West Midlands, South West).
+- It uses default role terms like `web-development` and `software-systems`.
+- If you set `GRADCRACKER_SEARCH_TERMS`, those replace the defaults (JSON array of strings).
+- Every role is combined with every location to form a Gradcracker search URL, sorted by newest first.
+
+## 2) Crawl list pages
+
+On each list page it:
+
+- Waits for the job cards to load (`article[wire:key]`).
+- Scrapes basic fields from each card: title, employer, employer URL, discipline, deadline, salary, location, degree required, and start date.
+- Queues each job's detail page for deeper scraping.
+
+Optional controls:
+
+- `GRADCRACKER_MAX_JOBS_PER_TERM` caps how many jobs are queued per role term.
+- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` and `JOBOPS_EXISTING_JOB_URLS` (or `JOBOPS_EXISTING_JOB_URLS_FILE`) let it skip jobs you already know about.
+
+## 3) Crawl job detail pages
+
+On each job page it:
+
+- Waits for the main content block (`.body-content`).
+- Saves the full description text.
+- Looks for the Apply button and clicks it to capture the final application URL.
+  - Handles both popup windows and same-tab redirects.
+  - Waits for the URL to stabilize before recording it.
+- Skips the Apply click if the job is already known (same env rules as above).
+
+## 4) Progress reporting (optional)
+
+If `JOBOPS_EMIT_PROGRESS=1` is set, the extractor prints structured progress lines that the orchestrator can stream into the UI.
+
+## Notes
+
+- The crawler runs with Playwright + Crawlee, launched through Camoufox to look more like a real browser.
+- Concurrency is kept low (1 or 2) and timeouts are generous to reduce flakiness.
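
Step 1 of the walkthrough (every role term combined with every region, with `GRADCRACKER_SEARCH_TERMS` replacing the defaults) could be sketched roughly as follows. This is an illustrative Python sketch, not the extractor's actual code: the region slugs, the URL layout, the `sort=newest` query parameter, and the helper names `load_role_terms` and `build_search_urls` are all assumptions made for the example.

```python
import itertools
import json
import os
from urllib.parse import quote

# Assumed values for illustration only. The real region list, default
# terms, and URL layout live in extractors/gradcracker and may differ.
DEFAULT_ROLE_TERMS = ["web-development", "software-systems"]
REGIONS = ["london-and-south-east", "west-midlands", "south-west"]
BASE_URL = "https://www.gradcracker.com/search"  # hypothetical path


def load_role_terms(env=None):
    """Return role terms, letting GRADCRACKER_SEARCH_TERMS replace the defaults."""
    env = os.environ if env is None else env
    raw = env.get("GRADCRACKER_SEARCH_TERMS")
    if not raw:
        return list(DEFAULT_ROLE_TERMS)
    terms = json.loads(raw)
    # The walkthrough says the variable holds a JSON array of strings.
    if not isinstance(terms, list) or not all(isinstance(t, str) for t in terms):
        raise ValueError("GRADCRACKER_SEARCH_TERMS must be a JSON array of strings")
    return terms


def build_search_urls(role_terms, regions):
    """Pair every role term with every region (Cartesian product)."""
    return [
        # "sort=newest" stands in for whatever parameter the site
        # really uses to order results by newest first.
        f"{BASE_URL}/{quote(role)}/{quote(region)}?sort=newest"
        for role, region in itertools.product(role_terms, regions)
    ]


if __name__ == "__main__":
    for url in build_search_urls(load_role_terms(), REGIONS):
        print(url)
```

With the two default terms and the three example regions this yields six URLs. Setting `GRADCRACKER_SEARCH_TERMS='["data-science"]'` would yield three, since the override replaces the defaults instead of extending them.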