From 976f7b48789d33908c0ec7c6b46e7a224a9bcb4e Mon Sep 17 00:00:00 2001
From: DaKheera47 <shaheergoodboy@gmail.com>
Date: Fri, 16 Jan 2026 00:40:01 +0000
Subject: [PATCH] gradcracker technical breakdown

---
 README.md                               |  2 ++
 documentation/extractors/README.md      |  5 +++
 documentation/extractors/gradcracker.md | 47 +++++++++++++++++++++++++
 3 files changed, 54 insertions(+)
 create mode 100644 documentation/extractors/README.md
 create mode 100644 documentation/extractors/gradcracker.md

diff --git a/README.md b/README.md
index a0fc08e..ff766c0 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,8 @@ Essential variables in `.env`:
 - `/extractors`: Specialized scrapers (Gradcracker, JobSpy, UKVisaJobs).
 - `/resume-generator`: Python script for RxResume PDF automation.
 - `/data`: Persistent storage for SQLite DB and generated PDFs.
+
+Technical breakdowns of each extractor live in `documentation/extractors/README.md`.
 2. Put your exported RXResume JSON at `resume-generator/base.json`.
 3. Start: `docker compose up -d --build`
 4. Open:
diff --git a/documentation/extractors/README.md b/documentation/extractors/README.md
new file mode 100644
index 0000000..acc49d7
--- /dev/null
+++ b/documentation/extractors/README.md
@@ -0,0 +1,5 @@
+# Extractors
+
+Technical breakdowns of how each extractor works.
+
+- Gradcracker: `gradcracker.md`
diff --git a/documentation/extractors/gradcracker.md b/documentation/extractors/gradcracker.md
new file mode 100644
index 0000000..ab98046
--- /dev/null
+++ b/documentation/extractors/gradcracker.md
@@ -0,0 +1,47 @@
+# Gradcracker Scraper (How It Works)
+
+This is a plain-English walkthrough of the Gradcracker extractor in `extractors/gradcracker`.
+
+## Big picture
+
+The scraper builds a list of Gradcracker search URLs, visits each list page, extracts job cards, then opens each job's detail page to grab the full description and the external application link.
+
+## 1) Build search URLs
+
+- It starts with a fixed set of UK regions (e.g. London & South East, West Midlands, South West).
+- It uses default role terms like `web-development` and `software-systems`.
+- If you set `GRADCRACKER_SEARCH_TERMS` (a JSON array of strings), its values replace the defaults.
+- Every role is combined with every location to form a Gradcracker search URL, sorted by newest first.
+
+## 2) Crawl list pages
+
+On each list page it:
+
+- Waits for the job cards to load (`article[wire:key]`).
+- Scrapes basic fields from each card: title, employer, employer URL, discipline, deadline, salary, location, degree required, and start date.
+- Queues each job's detail page for deeper scraping.
+
+Optional controls:
+
+- `GRADCRACKER_MAX_JOBS_PER_TERM` caps how many jobs are queued per role term.
+- `JOBOPS_SKIP_APPLY_FOR_EXISTING=1` and `JOBOPS_EXISTING_JOB_URLS` (or `JOBOPS_EXISTING_JOB_URLS_FILE`) let it skip jobs you already know about.
+
+## 3) Crawl job detail pages
+
+On each job page it:
+
+- Waits for the main content block (`.body-content`).
+- Saves the full description text.
+- Looks for the Apply button and clicks it to capture the final application URL.
+  - Handles both popup windows and same-tab redirects.
+  - Waits for the URL to stabilize before recording it.
+- Skips the Apply click if the job is already known (controlled by the same `JOBOPS_SKIP_APPLY_FOR_EXISTING` and `JOBOPS_EXISTING_JOB_URLS` settings described in step 2).
+
+## 4) Progress reporting (optional)
+
+If `JOBOPS_EMIT_PROGRESS=1` is set, the extractor prints structured progress lines that the orchestrator can stream into the UI.
+
+## Notes
+
+- The crawler runs with Playwright + Crawlee, launched through Camoufox to look more like a real browser.
+- Concurrency is kept low (1 or 2) and timeouts are generous to reduce flakiness.
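
The URL-building step described in section 1 of `gradcracker.md` can be sketched in TypeScript as follows. This is a minimal illustration, not the extractor's actual code: the function names `resolveRoleTerms` and `buildSearchUrls`, the base URL, the path layout, and the `order=newest` query parameter are all hypothetical, since the patch does not show Gradcracker's real URL format. Falling back to the defaults on malformed or empty input is also an assumption.

```typescript
// Default role terms. The documentation names these two as examples;
// the extractor's full default list may differ.
const DEFAULT_ROLE_TERMS: string[] = ["web-development", "software-systems"];

// Resolve the role terms from the raw value of GRADCRACKER_SEARCH_TERMS.
// The value is expected to be a JSON array of strings. Falling back to the
// defaults when it is unset, malformed, or empty is an assumption.
function resolveRoleTerms(raw: string | undefined): string[] {
  if (!raw) return [...DEFAULT_ROLE_TERMS];
  try {
    const parsed: unknown = JSON.parse(raw);
    if (Array.isArray(parsed)) {
      const terms = parsed.filter(
        (t): t is string => typeof t === "string" && t.trim() !== ""
      );
      if (terms.length > 0) return terms;
    }
  } catch {
    // Malformed JSON: fall through to the defaults.
  }
  return [...DEFAULT_ROLE_TERMS];
}

// Combine every role with every location (a Cartesian product).
// The path layout and the `order` parameter are placeholders, not
// Gradcracker's real URL scheme.
function buildSearchUrls(
  baseUrl: string,
  roles: string[],
  locations: string[]
): string[] {
  const urls: string[] = [];
  for (const role of roles) {
    for (const location of locations) {
      const path = `${encodeURIComponent(role)}/${encodeURIComponent(location)}`;
      urls.push(`${baseUrl}/${path}?order=newest`);
    }
  }
  return urls;
}
```

With two role terms and three regions this yields six search URLs, so the number of list pages to crawl grows as roles × locations. That growth is one reason a per-term cap such as `GRADCRACKER_MAX_JOBS_PER_TERM` is useful.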