readme

2025-12-14 18:42:13 +00:00 · 2025-12-14 18:42:13 +00:00 · cdd6956eb7
commit cdd6956eb7
parent 87804480fc
1 changed files with 126 additions and 86 deletions
--- a/README.md
+++ b/README.md
@ -1,126 +1,166 @@
-# Job Ops 🚀
+# Job Ops
-Automated job discovery, scoring, and resume generation pipeline.
+Automated job discovery -> AI suitability scoring -> tailored resume PDFs -> a dashboard to review/apply (with optional Notion sync).
-## Features
+## How it works (pipeline)
- **Job Crawler** - Discovers jobs from Gradcracker and other sources
+1. **Crawl**: `job-extractor` (Crawlee + Playwright + Camoufox) visits Gradcracker search pages, opens each job page, extracts structured fields + the job description, and captures the real application URL by clicking the apply button (skipped for already-known jobs).
- **AI Scoring** - Ranks jobs by suitability using OpenRouter API
+2. **Import + dedupe**: `orchestrator` reads the Crawlee dataset (`job-extractor/storage/datasets/default/*.json`) and inserts new jobs into SQLite (`jobs.job_url` is unique).
- **Resume Generator** - Creates tailored PDFs via RXResume automation
+3. **Score**: `orchestrator` scores up to 50 unprocessed jobs via OpenRouter (cached as `suitabilityScore`/`suitabilityReason`).
- **Dashboard UI** - React-based interface for reviewing and applying
+4. **Select**: take the top `N` jobs above `minSuitabilityScore`.
 5. **Process**: for each selected job:
   - generate a tailored resume summary via OpenRouter (stored on the job)
   - generate a PDF by injecting the summary into `resume-generator/base.json`, writing a temp resume JSON, then running `resume-generator/rxresume_automation.py` (Playwright automates `rxresu.me` import -> export)
 6. **Review/apply**: the React dashboard shows job status, score, links, and PDFs; clicking `Mark Applied` optionally creates a Notion page.
-## Quick Start with Docker
+Live progress is streamed to the UI via Server-Sent Events at `GET /api/pipeline/progress` (the crawler emits stdout lines prefixed with `JOBOPS_PROGRESS`; the orchestrator forwards them).
-### 1. Configure Environment
+## Architecture (Mermaid)
-```bash
+```mermaid
-# Copy the example env file
+flowchart LR
-cp .env.example .env
+  subgraph UI[User Interface]
    DASH[React Dashboard]
  end
-# Edit with your credentials
+  subgraph ORCH[Orchestrator (Node/TS)]
-nano .env
+    API[Express API<br/>/api/*]
    PIPE[Pipeline Runner]
    DB[(SQLite<br/>jobs.db)]
    PDFS[(PDFs<br/>pdfs/)]
  end
  subgraph CRAWL[job-extractor (Crawlee/Playwright)]
    C1[Seed search URLs<br/>(locations x roles)]
    C2[Parse list pages<br/>enqueue job pages]
    C3[Parse job pages<br/>extract JD + apply URL]
    DS[(Crawlee dataset<br/>storage/datasets/default)]
  end
  subgraph EXT[External Services]
    GC[Gradcracker]
    OR[OpenRouter]
    RX[rxresu.me]
    NO[Notion (optional)]
    N8N[n8n / cron (optional)]
  end
  N8N -->|POST /api/webhook/trigger| API
  DASH -->|REST| API
  API -->|REST JSON| DASH
  DASH -->|SSE connect| API
  API -->|progress events| DASH
  API -->|run pipeline| PIPE
  PIPE -->|spawn| CRAWL
  C1 --> GC
  C2 --> GC
  C3 --> GC
  CRAWL --> DS
  API -->|read| DS
  API --> DB
  PIPE -->|score + summary| OR
  PIPE -->|spawn python| RX
  RX -->|export| PDFS
  API -->|serve /pdfs/*| PDFS
  API -->|create page| NO
 ```
-Required environment variables:
+## Repo layout
 - `OPENROUTER_API_KEY` - Get from [openrouter.ai/keys](https://openrouter.ai/keys)
 - `RXRESUME_EMAIL` - Your [rxresu.me](https://rxresu.me) account email
 - `RXRESUME_PASSWORD` - Your RXResume password
-### 2. Add Your Base Resume
+```
-
+job-ops/
-Place your resume JSON at `resume-generator/base.json`.
+  orchestrator/                 # Express API + React dashboard + pipeline
-You can export this from RXResume.
+    src/server/                 # API routes, pipeline, DB, services
-
+    src/client/                 # UI (polls jobs, listens to SSE progress)
-### 3. Run
+    src/shared/                 # shared types (Job, PipelineRun, etc.)
-
+  job-extractor/                # Crawlee crawler (Gradcracker)
-```bash
+  resume-generator/             # Python Playwright automation for rxresu.me
-# Build and start
+    base.json                   # your exported base resume (template)
-docker compose up -d
+  data/                         # persisted runtime artifacts (Docker default)
-
+    jobs.db                     # SQLite database
-# View logs
+    pdfs/                       # generated PDFs (resume_<jobId>.pdf)
-docker compose logs -f
+  docker-compose.yml            # single-container deployment
-
+  Dockerfile                    # builds orchestrator + installs browsers
 # Stop
 docker compose down
 ```
-### 4. Access
+## Data model (SQLite)
- **Dashboard**: http://localhost:3001
+- `jobs`
- **API**: http://localhost:3001/api
+  - from crawl: `title`, `employer`, `jobUrl`, `applicationLink`, `deadline`, `salary`, `location`, `jobDescription`, etc.
- **Health**: http://localhost:3001/health
+  - enrichments: `status` (`discovered` -> `processing` -> `ready` -> `applied`/`rejected`), `suitabilityScore`, `suitabilityReason`, `tailoredSummary`, `pdfPath`, `notionPageId`
 - `pipeline_runs`: audit log of runs (`running`/`completed`/`failed`, counts, error)
-## Data Persistence
+## Running (Docker)
-All data is stored in the `./data` directory:
+1. Create `.env` at repo root (`cp .env.example .env`) and set:
- `data/jobs.db` - SQLite database
+   - `OPENROUTER_API_KEY`
- `data/pdfs/` - Generated resume PDFs
+   - `RXRESUME_EMAIL`, `RXRESUME_PASSWORD`
   - optional: `NOTION_API_KEY`, `NOTION_DATABASE_ID`, `WEBHOOK_SECRET`
 2. Put your exported RXResume JSON at `resume-generator/base.json`.
 3. Start: `docker compose up -d --build`
 4. Open:
   - Dashboard/UI: `http://localhost:3005`
   - API: `http://localhost:3005/api`
   - Health: `http://localhost:3005/health`
-## Development
+Persistent data lives in `./data` (bind-mounted into the container).
-### Without Docker
+## Running (local dev)
 Prereqs: Node 20+, Python 3.10+, Playwright browsers (Firefox).
 Install Node deps (both packages):
 ```bash
 # Install dependencies
 cd orchestrator && npm install
 cd ../job-extractor && npm install
 ```
-# Set up Python environment for resume generator
+Configure the orchestrator env + DB:
 cd ../resume-generator
 python3 -m venv .venv
 source .venv/bin/activate
 pip install playwright
 playwright install chromium
-# Run orchestrator (from orchestrator folder)
+```bash
 cd ../orchestrator
-cp .env.example .env  # Configure your env
+cp .env.example .env
 npm run db:migrate
 npm run dev
 ```
-### Build Docker Image
+Set up the resume generator (used for PDF export):
 ```bash
-docker build -t job-ops:latest .
+cd ../resume-generator
 python -m venv .venv
 # Windows PowerShell:
 .\.venv\Scripts\Activate.ps1
 # macOS/Linux:
 # source .venv/bin/activate
 pip install playwright
 python -m playwright install firefox
 ```
-### Push to Docker Hub
+If you're on Windows, set `PYTHON_PATH` in `orchestrator/.env` to your venv python (e.g. `..\resume-generator\.venv\Scripts\python.exe`) or use Docker/WSL.
-```bash
+Dev URLs:
-docker tag job-ops:latest yourusername/job-ops:latest
+- API: `http://localhost:3001/api`
-docker push yourusername/job-ops:latest
+- UI (Vite): `http://localhost:5173`
 ```
-## API Endpoints
+## Key endpoints
-| Method | Endpoint | Description |
+- Jobs: `GET /api/jobs`, `POST /api/jobs/:id/process`, `POST /api/jobs/:id/apply`, `POST /api/jobs/:id/reject`, `POST /api/jobs/process-discovered`
-|--------|----------|-------------|
+- Pipeline: `POST /api/pipeline/run`, `GET /api/pipeline/status`, `GET /api/pipeline/progress` (SSE)
-| GET | `/api/jobs` | List all jobs |
+- Webhook: `POST /api/webhook/trigger` (optional auth via `WEBHOOK_SECRET`)
-| GET | `/api/jobs/:id` | Get job details |
+- Ops: `DELETE /api/database` (wipes DB)
 | PATCH | `/api/jobs/:id` | Update job |
 | POST | `/api/jobs/:id/process` | Generate resume for job |
 | POST | `/api/jobs/:id/apply` | Mark as applied |
 | POST | `/api/jobs/:id/reject` | Skip job |
 | POST | `/api/jobs/process-discovered` | Process all discovered jobs |
 | GET | `/api/pipeline/status` | Pipeline status |
 | POST | `/api/pipeline/run` | Trigger pipeline |
 | GET | `/api/pipeline/progress` | SSE progress stream |
 | DELETE | `/api/database` | Clear all data |
-## Architecture
+## Notes / sharp edges
-```
+- **Crawl targets**: edit `job-extractor/src/main.ts` to change the Gradcracker location/role matrix.
-job-ops/
+- **Notion sync is schema-dependent**: `orchestrator/src/server/services/notion.ts` assumes property names; adjust to match your Notion database.
-├── orchestrator/       # Node.js backend + React frontend
+- **Pipeline config knobs**: `POST /api/pipeline/run` accepts `{ topN, minSuitabilityScore }`; `PIPELINE_TOP_N`/`PIPELINE_MIN_SCORE` are used by `npm run pipeline:run` (CLI runner).
-│   ├── src/server/    # Express API, services, DB
+- **Anti-bot reality**: crawling is headless + "humanized", but sites can still block; expect occasional flakiness.
 │   └── src/client/    # React dashboard
 ├── job-extractor/      # Crawlee-based job crawler
 ├── resume-generator/   # Python Playwright automation
 ├── data/               # SQLite DB + generated PDFs
 ├── Dockerfile
 └── docker-compose.yml
 ```
 ## License