readme

2025-12-14 18:42:13 +00:00 · 2025-12-14 18:42:13 +00:00 · cdd6956eb7
commit cdd6956eb7
parent 87804480fc
1 changed files with 126 additions and 86 deletions
--- a/README.md
+++ b/README.md
@ -1,126 +1,166 @@
-# Job Ops 🚀
+# Job Ops

-Automated job discovery, scoring, and resume generation pipeline.
+Automated job discovery -> AI suitability scoring -> tailored resume PDFs -> a dashboard to review/apply (with optional Notion sync).

-## Features
+## How it works (pipeline)

- **Job Crawler** - Discovers jobs from Gradcracker and other sources
- **AI Scoring** - Ranks jobs by suitability using OpenRouter API
- **Resume Generator** - Creates tailored PDFs via RXResume automation
- **Dashboard UI** - React-based interface for reviewing and applying
+1. **Crawl**: `job-extractor` (Crawlee + Playwright + Camoufox) visits Gradcracker search pages, opens each job page, extracts structured fields + the job description, and captures the real application URL by clicking the apply button (skipped for already-known jobs).
+2. **Import + dedupe**: `orchestrator` reads the Crawlee dataset (`job-extractor/storage/datasets/default/*.json`) and inserts new jobs into SQLite (`jobs.job_url` is unique).
+3. **Score**: `orchestrator` scores up to 50 unprocessed jobs via OpenRouter (cached as `suitabilityScore`/`suitabilityReason`).
+4. **Select**: take the top `N` jobs above `minSuitabilityScore`.
+5. **Process**: for each selected job:
+   - generate a tailored resume summary via OpenRouter (stored on the job)
+   - generate a PDF by injecting the summary into `resume-generator/base.json`, writing a temp resume JSON, then running `resume-generator/rxresume_automation.py` (Playwright automates `rxresu.me` import -> export)
+6. **Review/apply**: the React dashboard shows job status, score, links, and PDFs; clicking `Mark Applied` optionally creates a Notion page.

-## Quick Start with Docker
+Live progress is streamed to the UI via Server-Sent Events at `GET /api/pipeline/progress` (the crawler emits stdout lines prefixed with `JOBOPS_PROGRESS`; the orchestrator forwards them).

-### 1. Configure Environment
+## Architecture (Mermaid)

-```bash
-# Copy the example env file
-cp .env.example .env
+```mermaid
+flowchart LR
+  subgraph UI[User Interface]
+    DASH[React Dashboard]
+  end

-# Edit with your credentials
-nano .env
+  subgraph ORCH[Orchestrator (Node/TS)]
+    API[Express API<br/>/api/*]
+    PIPE[Pipeline Runner]
+    DB[(SQLite<br/>jobs.db)]
+    PDFS[(PDFs<br/>pdfs/)]
+  end
+
+  subgraph CRAWL[job-extractor (Crawlee/Playwright)]
+    C1[Seed search URLs<br/>(locations x roles)]
+    C2[Parse list pages<br/>enqueue job pages]
+    C3[Parse job pages<br/>extract JD + apply URL]
+    DS[(Crawlee dataset<br/>storage/datasets/default)]
+  end
+
+  subgraph EXT[External Services]
+    GC[Gradcracker]
+    OR[OpenRouter]
+    RX[rxresu.me]
+    NO[Notion (optional)]
+    N8N[n8n / cron (optional)]
+  end
+
+  N8N -->|POST /api/webhook/trigger| API
+  DASH -->|REST| API
+  API -->|REST JSON| DASH
+  DASH -->|SSE connect| API
+  API -->|progress events| DASH
+
+  API -->|run pipeline| PIPE
+
+  PIPE -->|spawn| CRAWL
+  C1 --> GC
+  C2 --> GC
+  C3 --> GC
+  CRAWL --> DS
+  API -->|read| DS
+  API --> DB
+
+  PIPE -->|score + summary| OR
+  PIPE -->|spawn python| RX
+  RX -->|export| PDFS
+  API -->|serve /pdfs/*| PDFS
+
+  API -->|create page| NO
 ```

-Required environment variables:
- `OPENROUTER_API_KEY` - Get from [openrouter.ai/keys](https://openrouter.ai/keys)
- `RXRESUME_EMAIL` - Your [rxresu.me](https://rxresu.me) account email
- `RXRESUME_PASSWORD` - Your RXResume password
+## Repo layout

-### 2. Add Your Base Resume
-
-Place your resume JSON at `resume-generator/base.json`.
-You can export this from RXResume.
-
-### 3. Run
-
-```bash
-# Build and start
-docker compose up -d
-
-# View logs
-docker compose logs -f
-
-# Stop
-docker compose down
+```
+job-ops/
+  orchestrator/                 # Express API + React dashboard + pipeline
+    src/server/                 # API routes, pipeline, DB, services
+    src/client/                 # UI (polls jobs, listens to SSE progress)
+    src/shared/                 # shared types (Job, PipelineRun, etc.)
+  job-extractor/                # Crawlee crawler (Gradcracker)
+  resume-generator/             # Python Playwright automation for rxresu.me
+    base.json                   # your exported base resume (template)
+  data/                         # persisted runtime artifacts (Docker default)
+    jobs.db                     # SQLite database
+    pdfs/                       # generated PDFs (resume_<jobId>.pdf)
+  docker-compose.yml            # single-container deployment
+  Dockerfile                    # builds orchestrator + installs browsers
 ```

-### 4. Access
+## Data model (SQLite)

- **Dashboard**: http://localhost:3001
- **API**: http://localhost:3001/api
- **Health**: http://localhost:3001/health
+- `jobs`
+  - from crawl: `title`, `employer`, `jobUrl`, `applicationLink`, `deadline`, `salary`, `location`, `jobDescription`, etc.
+  - enrichments: `status` (`discovered` -> `processing` -> `ready` -> `applied`/`rejected`), `suitabilityScore`, `suitabilityReason`, `tailoredSummary`, `pdfPath`, `notionPageId`
+- `pipeline_runs`: audit log of runs (`running`/`completed`/`failed`, counts, error)

-## Data Persistence
+## Running (Docker)

-All data is stored in the `./data` directory:
- `data/jobs.db` - SQLite database
- `data/pdfs/` - Generated resume PDFs
+1. Create `.env` at repo root (`cp .env.example .env`) and set:
+   - `OPENROUTER_API_KEY`
+   - `RXRESUME_EMAIL`, `RXRESUME_PASSWORD`
+   - optional: `NOTION_API_KEY`, `NOTION_DATABASE_ID`, `WEBHOOK_SECRET`
+2. Put your exported RXResume JSON at `resume-generator/base.json`.
+3. Start: `docker compose up -d --build`
+4. Open:
+   - Dashboard/UI: `http://localhost:3005`
+   - API: `http://localhost:3005/api`
+   - Health: `http://localhost:3005/health`

-## Development
+Persistent data lives in `./data` (bind-mounted into the container).

-### Without Docker
+## Running (local dev)
+
+Prereqs: Node 20+, Python 3.10+, Playwright browsers (Firefox).
+
+Install Node deps (both packages):

 ```bash
-# Install dependencies
 cd orchestrator && npm install
 cd ../job-extractor && npm install
+```

-# Set up Python environment for resume generator
-cd ../resume-generator
-python3 -m venv .venv
-source .venv/bin/activate
-pip install playwright
-playwright install chromium
+Configure the orchestrator env + DB:

-# Run orchestrator (from orchestrator folder)
+```bash
 cd ../orchestrator
-cp .env.example .env  # Configure your env
+cp .env.example .env
 npm run db:migrate
 npm run dev
 ```

-### Build Docker Image
+Set up the resume generator (used for PDF export):

 ```bash
-docker build -t job-ops:latest .
+cd ../resume-generator
+python -m venv .venv
+# Windows PowerShell:
+.\.venv\Scripts\Activate.ps1
+# macOS/Linux:
+# source .venv/bin/activate
+pip install playwright
+python -m playwright install firefox
 ```

-### Push to Docker Hub
+If you're on Windows, set `PYTHON_PATH` in `orchestrator/.env` to your venv python (e.g. `..\resume-generator\.venv\Scripts\python.exe`) or use Docker/WSL.

-```bash
-docker tag job-ops:latest yourusername/job-ops:latest
-docker push yourusername/job-ops:latest
-```
+Dev URLs:
+- API: `http://localhost:3001/api`
+- UI (Vite): `http://localhost:5173`

-## API Endpoints
+## Key endpoints

-| Method | Endpoint | Description |
-|--------|----------|-------------|
-| GET | `/api/jobs` | List all jobs |
-| GET | `/api/jobs/:id` | Get job details |
-| PATCH | `/api/jobs/:id` | Update job |
-| POST | `/api/jobs/:id/process` | Generate resume for job |
-| POST | `/api/jobs/:id/apply` | Mark as applied |
-| POST | `/api/jobs/:id/reject` | Skip job |
-| POST | `/api/jobs/process-discovered` | Process all discovered jobs |
-| GET | `/api/pipeline/status` | Pipeline status |
-| POST | `/api/pipeline/run` | Trigger pipeline |
-| GET | `/api/pipeline/progress` | SSE progress stream |
-| DELETE | `/api/database` | Clear all data |
+- Jobs: `GET /api/jobs`, `POST /api/jobs/:id/process`, `POST /api/jobs/:id/apply`, `POST /api/jobs/:id/reject`, `POST /api/jobs/process-discovered`
+- Pipeline: `POST /api/pipeline/run`, `GET /api/pipeline/status`, `GET /api/pipeline/progress` (SSE)
+- Webhook: `POST /api/webhook/trigger` (optional auth via `WEBHOOK_SECRET`)
+- Ops: `DELETE /api/database` (wipes DB)

-## Architecture
+## Notes / sharp edges

-```
-job-ops/
-├── orchestrator/       # Node.js backend + React frontend
-│   ├── src/server/    # Express API, services, DB
-│   └── src/client/    # React dashboard
-├── job-extractor/      # Crawlee-based job crawler
-├── resume-generator/   # Python Playwright automation
-├── data/               # SQLite DB + generated PDFs
-├── Dockerfile
-└── docker-compose.yml
-```
+- **Crawl targets**: edit `job-extractor/src/main.ts` to change the Gradcracker location/role matrix.
+- **Notion sync is schema-dependent**: `orchestrator/src/server/services/notion.ts` assumes property names; adjust to match your Notion database.
+- **Pipeline config knobs**: `POST /api/pipeline/run` accepts `{ topN, minSuitabilityScore }`; `PIPELINE_TOP_N`/`PIPELINE_MIN_SCORE` are used by `npm run pipeline:run` (CLI runner).
+- **Anti-bot reality**: crawling is headless + "humanized", but sites can still block; expect occasional flakiness.

 ## License