feat(jobs): suppress duplicate postings after skip or apply

Dedup by employer+title and description at import; cascade skip on dismiss; hide repeats in the job list. Document product scope and duplicate detection in docs.

Co-authored-by: Cursor <cursoragent@cursor.com>
ilia 2026-05-16 18:50:11 -04:00
parent 5401f384c1
commit 17c4d4490a
15 changed files with 471 additions and 19 deletions

View File

@ -15,6 +15,8 @@ Country tokens are normalized to canonical keys (for example `India` → `india`
Global and remote boards often return roles tagged to countries you do not want (for example India-remote QA listings while you target Canada). This filter applies at import time so those rows never enter your **Discovered** queue.
When **Search cities** (or pipeline geography) is a single country such as **Canada**, JobOps also enforces an **allow-list**: only jobs that clearly hire in that country are kept. Vague `Remote` / `Worldwide` rows with no Canada signal are dropped, as are rows that mention any other country.
## How to use it
1. Open **Settings** and expand **Scoring Settings**.
@ -26,7 +28,8 @@ Global and remote boards often return roles tagged to countries you do not want
### Tips
- Use country names as they appear on listings: `India`, `Poland`, `United Kingdom`, or aliases `UK`, `USA`.
-- Listings with **no recognizable country** in the location field (for example `Remote` only) are **kept**, not blocked.
+- With **no** country-level search geography, listings whose location is only `Remote` / `Worldwide` with no blocked country in the text are **kept**.
+- With search geography set to a country (for example `Canada`), vague remote rows with no signal for that country are **dropped** even if they are not on the blocked list.
- The list is capped in Settings validation (max 50 entries, each up to 100 characters).
- Pair with **Search cities** / **country** settings to narrow what extractors query; blocked countries filter what still comes back from broad boards.
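The keep/drop rules in the tips above can be sketched roughly as follows. This is a minimal illustration, not the shipped filter: `COUNTRY_ALIASES`, `countriesMentioned`, and `keepListing` are hypothetical names, the alias table is deliberately tiny, and the real matching lives in `@shared/blocked-countries`.

```typescript
// Hypothetical alias table for illustration only; keys are canonical country keys.
const COUNTRY_ALIASES: Record<string, string[]> = {
  canada: ["canada"],
  india: ["india"],
  "united kingdom": ["united kingdom", "uk"],
  "united states": ["united states", "usa"],
};

// Return the canonical keys of every country whose alias appears as a whole word.
function countriesMentioned(location: string): Set<string> {
  const words = location.toLowerCase().replace(/[^a-z]+/g, " ").trim();
  const padded = ` ${words} `;
  const found = new Set<string>();
  for (const key of Object.keys(COUNTRY_ALIASES)) {
    if (COUNTRY_ALIASES[key].some((alias) => padded.includes(` ${alias} `))) {
      found.add(key);
    }
  }
  return found;
}

// blocked: canonical keys from the blocked list (assumed already normalized).
// allowOnly: canonical key of the search-geography country, or null when unset.
function keepListing(
  location: string,
  blocked: string[],
  allowOnly: string | null,
): boolean {
  const found = countriesMentioned(location);
  // Any blocked country in the text drops the row.
  if (blocked.some((key) => found.has(key))) return false;
  // No country-level geography: vague rows such as "Remote" are kept.
  if (allowOnly === null) return true;
  // Allow-list mode: require the target country and no other country.
  return found.has(allowOnly) && found.size === 1;
}
```

With these assumptions, `keepListing("Remote", ["india"], null)` keeps the row, while the same location with `allowOnly` set to `"canada"` drops it because there is no Canada signal.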
@ -50,6 +53,7 @@ Global and remote boards often return roles tagged to countries you do not want
## Related pages
- [Duplicate job detection](./duplicate-jobs)
- [Company skip list](./company-skip-list)
- [Settings](/docs/features/settings)
- [Pipeline Run](/docs/features/pipeline-run)

View File

@ -48,6 +48,7 @@ You may want to avoid certain agencies, staffing brands, or employers without ha
## Related pages
- [Duplicate job detection](./duplicate-jobs)
- [Blocked countries](./blocked-countries)
- [Settings](/docs/features/settings)
- [Pipeline Run](/docs/features/pipeline-run)

View File

@ -0,0 +1,79 @@
---
id: duplicate-jobs
title: Duplicate job detection
description: How JobOps deduplicates cross-source postings and hides repeats after you skip or apply.
sidebar_position: 8
---
## What it is
JobOps treats the same role reposted on different boards as **one opportunity**, and remembers when you have already **skipped** or **applied** so you do not see it again.
Duplicate detection uses normalized keys:
- **Employer + title** — strips punctuation, `(Remote)`, trailing city lines, and legal suffixes (`Inc.`, `Ltd.`) so `Acme Inc.` / `SDET (Remote)` matches `Acme` / `SDET`.
- **Employer + description** — when the job description is long enough, the same posting copy under the same company matches even if the title wording differs slightly.
This runs at **import time** (pipeline) and in the **Jobs list** (UI).
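As a rough sketch of the employer+title key described above: the normalizers below are simplified stand-ins (hypothetical `normalizeText`, `normalizeEmployer`, and `employerTitleKey`), not the rules in `@shared/job-fingerprint`, which also handle diacritics and trailing city lines. Only the `employer::title` key shape is taken from the shared module.

```typescript
// Simplified, hypothetical normalizers; the legal-suffix list is an assumption.
const LEGAL_SUFFIX_RE = /\b(inc|ltd|llc|corp|corporation)\b/g;

function normalizeText(value: string): string {
  return value
    .toLowerCase()
    .replace(/\([^)]*\)/g, " ") // drop parenthetical tags such as "(Remote)"
    .replace(/[^a-z0-9]+/g, " ") // punctuation becomes a space
    .replace(/\s+/g, " ")
    .trim();
}

function normalizeEmployer(employer: string): string {
  return normalizeText(employer)
    .replace(LEGAL_SUFFIX_RE, " ") // strip "inc", "ltd", and similar suffixes
    .replace(/\s+/g, " ")
    .trim();
}

// Build the employer+title key, or null when either part normalizes to empty.
function employerTitleKey(employer: string, title: string): string | null {
  const e = normalizeEmployer(employer);
  const t = normalizeText(title);
  return e && t ? `${e}::${t}` : null;
}
```

Under these assumptions, `employerTitleKey("Acme Inc.", "SDET (Remote)")` and `employerTitleKey("Acme", "SDET")` produce the same key, so the two postings count as one opportunity.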
## Why it exists
Employers and aggregators repost the same role across LinkedIn, Indeed, QAJobsBoard, and other boards. Skipping or applying once should mean you do not wade through the same listing again on the next run.
## How to use it
You do not configure duplicate detection separately. It is always on for your profile.
### When you skip a job
1. Skip from the job detail panel or press **`s`** on the Jobs page.
2. JobOps marks that row `skipped`.
3. Any other **Discovered** or **Ready** jobs with the same employer+title (or same employer+description) are **auto-skipped**.
4. Future pipeline imports that match those keys are **not imported**.
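The cascade in steps 2 and 3 can be pictured with an in-memory sketch. The `Row` shape and `cascadeSkip` helper are hypothetical, and each row is assumed to carry precomputed dedup keys; the server works against the database instead.

```typescript
type Status = "discovered" | "ready" | "skipped" | "applied";

interface Row {
  id: string;
  keys: string[]; // precomputed dedup keys (assumption for this sketch)
  status: Status;
}

// Skip the anchor row, then skip every other open row that shares a key with it.
// Returns the ids of the rows that were auto-skipped.
function cascadeSkip(rows: Row[], anchorId: string): string[] {
  const anchor = rows.find((row) => row.id === anchorId);
  if (!anchor) return [];
  anchor.status = "skipped";

  const anchorKeys = new Set(anchor.keys);
  const autoSkipped: string[] = [];
  for (const row of rows) {
    if (row.id === anchorId) continue;
    // Only open rows (Discovered or Ready) are touched.
    if (row.status !== "discovered" && row.status !== "ready") continue;
    if (!row.keys.some((key) => anchorKeys.has(key))) continue;
    row.status = "skipped";
    autoSkipped.push(row.id);
  }
  return autoSkipped;
}
```

Rows that are already `applied` or `skipped` are left alone, which matches the rule that only **Discovered** and **Ready** siblings are auto-skipped.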
### When you mark applied
1. Mark the job applied from the UI.
2. Duplicate open rows are **auto-skipped** (not marked applied) so your Applied tab stays clean.
3. Future imports of the same role are suppressed the same way as for skips.
### Cross-source import
During a pipeline run, if a new posting matches an existing row by URL, source id, or content fingerprint, JobOps **does not create a second row** — the import is counted as skipped in run stats.
### Jobs list
Open jobs that match a prior skip or apply are **hidden** from Discovered, Ready, and All tabs so the queue stays fresh. Skipped and applied rows themselves remain visible in their statuses.
## Defaults and constraints
- Description matching requires at least **80 characters** of normalized text and compares only the first **400** normalized characters; short or empty descriptions fall back to employer+title only.
- Matching is **per profile** (`ownerProfileId`); different login profiles do not share dedup state.
- Dedup does **not** delete existing rows retroactively when you change the skip list or country filters; run discovery again or skip manually for old data.
- Very different titles at the same company (for example `SDET` vs `Product Designer`) are **not** collapsed.
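The description threshold can be sketched as below. The 80-character minimum and the 400-character cap follow the shared fingerprint module in this commit; the punctuation handling is a simplified assumption, and `descriptionKeyPart` is a hypothetical name.

```typescript
const DESCRIPTION_MIN_CHARS = 80;
const DESCRIPTION_KEY_CHARS = 400;

// Return the normalized description prefix used for matching,
// or "" when the text is too short to be a reliable signal.
function descriptionKeyPart(description: string | null | undefined): string {
  if (!description?.trim()) return "";
  const normalized = description
    .toLowerCase()
    .replace(/<[^>]+>/g, " ") // strip HTML tags
    .replace(/[^a-z0-9]+/g, " ") // simplified punctuation rule (assumption)
    .replace(/\s+/g, " ")
    .trim();
  if (normalized.length < DESCRIPTION_MIN_CHARS) return "";
  return normalized.slice(0, DESCRIPTION_KEY_CHARS);
}
```

An empty result means the posting can only match on employer+title.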
## Common problems
### I still see the same job from another source
- Titles or employer names may differ enough that normalization does not match (for example a staffing agency name vs the hiring company).
- Add the employer to the [Company skip list](./company-skip-list) if it is always noise.
- Skip one row — siblings with matching keys should auto-skip and future imports should stop.
### A different role at the same company disappeared
- Employer+title dedup only merges **the same normalized title**. Different roles at one company should remain separate.
- If two titles normalize to the same string, check for overly generic titles on the board.
### Skipped jobs reappear after a pipeline run
- Confirm the skip saved (status `skipped` in the UI).
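- If the repost's employer name normalizes differently, neither key can match, because both keys include the employer. With the same employer, a reworded title combined with a description that is too short (under 80 normalized characters) or rewritten also fails to match. In either case, block the employer or country instead.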
## Related pages
- [Orchestrator](/docs/features/orchestrator)
- [Pipeline Run](/docs/features/pipeline-run)
- [Company skip list](./company-skip-list)
- [Blocked countries](./blocked-countries)
- [Keyboard Shortcuts](/docs/features/keyboard-shortcuts)

View File

@ -27,6 +27,8 @@ Job states:
- `skipped`: explicitly excluded from active queue
- `expired`: deadline passed
When you **skip** or **mark applied**, JobOps also skips matching open duplicates (same company + title or description) and blocks re-import on future runs. See [Duplicate job detection](/docs/features/duplicate-jobs).
## Why it exists
Orchestrator centralizes the transition from discovered opportunities to application-ready artifacts.
@ -132,8 +134,15 @@ curl -X POST "http://localhost:3001/api/jobs/<jobId>/generate-pdf"
- Patch `status` back to `discovered` to return the job to the active queue.
### Duplicate postings
- Skipping one listing auto-skips other **Discovered** / **Ready** rows that match the same normalized employer+title (or employer+description when available).
- The Jobs list hides open rows that match a job you already skipped or applied to.
- Details: [Duplicate job detection](/docs/features/duplicate-jobs).
## Related pages
- [Duplicate job detection](/docs/features/duplicate-jobs)
- [Pipeline Run](/docs/next/features/pipeline-run)
- [Ghostwriter](/docs/next/features/ghostwriter)
- [Reactive Resume](/docs/next/features/reactive-resume)

View File

@ -102,13 +102,15 @@ When new listings are imported, JobOps does not create a second database row if
- a **canonical job URL** (normalizes `http`/`https`, `www`, trailing slashes, common tracking query params, and sorts remaining query keys)
- the pair **`source` + `source_job_id`** when the extractor provides an external id
- a **content fingerprint** (normalized **employer + title**) so the same role from another board is not imported twice
- **skip/apply memory** — imports that match a job you already skipped or applied are not added
Existing jobs keep their stored URL; new imports use the canonical form so the same role is not added again under a slightly different link.
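A canonical URL along the lines described above could be built with the standard `URL` API. This is a sketch only: `canonicalizeUrl` and the `TRACKING_PARAMS` list are assumptions for illustration, and the project's own `canonicalizeJobUrl` may differ in detail.

```typescript
// Hypothetical list of tracking parameters to drop.
const TRACKING_PARAMS = new Set([
  "utm_source",
  "utm_medium",
  "utm_campaign",
  "utm_term",
  "utm_content",
  "gclid",
  "fbclid",
]);

function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.protocol = "https:"; // treat http and https as the same link
  url.hostname = url.hostname.replace(/^www\./, "");

  // Keep non-tracking params, then sort by key for a stable order.
  const kept: [string, string][] = [];
  url.searchParams.forEach((value, key) => {
    if (!TRACKING_PARAMS.has(key.toLowerCase())) kept.push([key, value]);
  });
  kept.sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  url.search = new URLSearchParams(kept).toString();

  // Drop trailing slashes but keep the root path.
  url.pathname = url.pathname.replace(/\/+$/, "") || "/";
  return url.toString();
}
```

Two links that differ only in scheme, `www`, a trailing slash, tracking parameters, or query order then map to the same string, which is what lets the import step recognize them as one job.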
See [Duplicate job detection](./duplicate-jobs) for skip cascades and description matching.
-To drop listings before import, use **Settings → Scoring Settings**:
+To drop listings before import, use **Settings → Scoring Settings** and pipeline geography:
- [Company skip list](./company-skip-list) — blocked **employer** keywords
-- [Blocked countries](./blocked-countries) — drop jobs whose **location** mentions a country you list (for example India)
+- [Blocked countries](./blocked-countries) — block specific countries; when search geography is a country (for example Canada), enforce that country only
## Common problems

View File

@ -8,6 +8,19 @@ slug: /
Welcome to the JobOps documentation. This site contains guides for setup, configuration, and day-to-day usage.
## What JobOps does
JobOps is a self-hosted job search operations stack: it **discovers** roles from many boards, **filters** them to your geography and profile, **scores** fit, **tailors** resumes and PDFs, and **tracks** applications after you apply.
In practice:
1. **Discover** — Run a pipeline against LinkedIn, Indeed, Glassdoor, QAJobsBoard, Canadian boards, and other extractors using your search terms and country (for example Canada, remote-only QA).
2. **Filter** — Drop unwanted countries, companies, co-op/intern patterns, non-matching locations, expired LinkedIn reposts, and duplicate postings you already skipped or applied to.
3. **Review** — Work through **Discovered** and **Ready** in the Orchestrator; skip noise, move strong fits to Ready, generate tailored PDFs.
4. **Apply & track** — Mark applied, sync Gmail for recruiter mail, and use the in-progress board and analytics.
Key filters and quality controls are documented under [Core Features](#feature-documentation) — especially [Blocked countries](/docs/features/blocked-countries), [Company skip list](/docs/features/company-skip-list), and [Duplicate job detection](/docs/features/duplicate-jobs).
## Getting Started
- **<a href="/docs/next/getting-started/self-hosting" data-umami-event="docs_intro_self_hosting_click">Self-Hosting Guide</a>**
@ -56,6 +69,18 @@ Welcome to the JobOps documentation. This site contains guides for setup, config
- `?` shortcut help dialog and `Control` hint bar behavior
- Tab-specific actions like skip, move to ready, and mark applied
- **[Duplicate job detection](/docs/features/duplicate-jobs)**
- Cross-source dedup by employer and title
- Auto-skip repeats when you skip or apply
- Hide duplicate open rows in the Jobs list
- **[Blocked countries](/docs/features/blocked-countries)**
- Block listings that mention specific countries
- Canada-only (and other) search geography enforcement
- **[Company skip list](/docs/features/company-skip-list)**
- Block employers by keyword during discovery
- **[Multi-Select and Bulk Actions](/docs/next/features/multi-select-and-bulk-actions)**
- Select many jobs using row checkboxes or select-all
- Run bulk move, skip, and rescore actions from the floating action bar
@ -117,12 +142,12 @@ Welcome to the JobOps documentation. This site contains guides for setup, config
### Key Features
-1. **Job Discovery**: Automatically find jobs from multiple sources.
-2. **AI Scoring**: Rank jobs by suitability for your profile.
-3. **Resume Tailoring**: Generate custom resumes for each job.
-4. **PDF Export**: Create tailored PDFs via RxResume integration.
-5. **Application Tracking**: Monitor your applied jobs.
-6. **Email Tracking**: Auto-track post-application responses.
+1. **Job discovery**: Find roles from multiple extractors in one pipeline run.
+2. **Geography and quality filters**: Block countries and employers, enforce search-country allow-lists, remote-only runs, and profile deal-breakers.
+3. **Duplicate suppression**: Collapse cross-board reposts; remember skips and applications.
+4. **AI scoring**: Rank jobs by suitability for your profile.
+5. **Resume tailoring**: Generate custom resumes and PDFs per job (RxResume).
+6. **Application tracking**: Applied status, post-application Gmail sync, in-progress board, and analytics.
## Contributing to Documentation

View File

@ -34,6 +34,7 @@ const sidebars: SidebarsConfig = {
"features/settings",
"features/company-skip-list",
"features/blocked-countries",
"features/duplicate-jobs",
"features/reactive-resume",
"features/in-progress-board",
"features/ghostwriter",

View File

@ -0,0 +1,32 @@
import { createJob } from "@shared/testing/factories";
import { describe, expect, it } from "vitest";
import { buildDuplicateDismissHints } from "./job-dedup";
describe("buildDuplicateDismissHints", () => {
it("flags open jobs that match a skipped posting", () => {
const jobs = [
createJob({
id: "skipped-1",
employer: "Acme",
title: "SDET",
status: "skipped",
}),
createJob({
id: "open-1",
employer: "Acme Inc.",
title: "SDET (Remote)",
status: "discovered",
}),
createJob({
id: "open-2",
employer: "Contoso",
title: "QA Engineer",
status: "discovered",
}),
];
const hints = buildDuplicateDismissHints(jobs);
expect(hints.get("open-1")).toBe("skipped");
expect(hints.has("open-2")).toBe(false);
});
});

View File

@ -0,0 +1,50 @@
import { collectJobDedupKeys } from "@shared/job-fingerprint";
import type { JobListItem, JobStatus } from "@shared/types";
export type DuplicateDismissReason = "skipped" | "applied";
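/**
* Map open jobs to a prior skip/apply when the normalized employer+title key matches.
* Only `employer` and `title` are passed to `collectJobDedupKeys` here, so
* description keys are never produced on the client.
*/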
export function buildDuplicateDismissHints(
jobs: readonly JobListItem[],
): Map<string, DuplicateDismissReason> {
const dismissedKeys = new Map<string, DuplicateDismissReason>();
for (const job of jobs) {
if (job.status !== "skipped" && job.status !== "applied") continue;
const reason: DuplicateDismissReason =
job.status === "applied" ? "applied" : "skipped";
for (const key of collectJobDedupKeys({
employer: job.employer,
title: job.title,
})) {
if (!dismissedKeys.has(key)) dismissedKeys.set(key, reason);
}
}
const hints = new Map<string, DuplicateDismissReason>();
const openStatuses = new Set<JobStatus>([
"discovered",
"ready",
"processing",
]);
for (const job of jobs) {
if (!openStatuses.has(job.status)) continue;
for (const key of collectJobDedupKeys({
employer: job.employer,
title: job.title,
})) {
const reason = dismissedKeys.get(key);
if (reason) {
hints.set(job.id, reason);
break;
}
}
}
return hints;
}
export { collectJobDedupKeys };

View File

@ -1,4 +1,5 @@
import { useSettings } from "@client/hooks/useSettings";
import { buildDuplicateDismissHints } from "@client/lib/job-dedup";
import { inferCountryKeyFromSearchGeography } from "@shared/search-cities";
import type React from "react";
import { useCallback, useEffect, useMemo, useState } from "react";
@ -167,6 +168,11 @@ export const OrchestratorPage: React.FC = () => {
[settings?.searchCities?.value],
);
const duplicateDismissHints = useMemo(
() => buildDuplicateDismissHints(jobs),
[jobs],
);
const jobListFilterExtras = useMemo(
() => ({
foundAfterYmd,
@ -177,6 +183,7 @@ export const OrchestratorPage: React.FC = () => {
? settingsSkipEmployerKeywords
: [],
searchGeographyCountryKey,
duplicateDismissHints,
}),
[
foundAfterYmd,
@ -186,6 +193,7 @@ export const OrchestratorPage: React.FC = () => {
applySettingsCompanySkipList,
settingsSkipEmployerKeywords,
searchGeographyCountryKey,
duplicateDismissHints,
],
);

View File

@ -1,3 +1,4 @@
import type { DuplicateDismissReason } from "@client/lib/job-dedup";
import { jobMatchesAllowedCountry } from "@shared/blocked-countries";
import { textMatchesKeyword } from "@shared/keyword-match";
import type { JobListItem, JobSource } from "@shared/types";
@ -19,6 +20,8 @@ export type JobListFilterExtras = {
settingsBlockedEmployerKeywords: string[];
/** When settings search geography is a country (e.g. Canada), hide other countries. */
searchGeographyCountryKey?: string | null;
/** Job ids of open rows to hide because they match a prior skip/apply (normalized company + title). */
duplicateDismissHints?: ReadonlyMap<string, DuplicateDismissReason>;
};
const startOfLocalDayMs = (ymd: string): number =>
@ -64,6 +67,7 @@ export const useFilteredJobs = (
employerExclude: [],
settingsBlockedEmployerKeywords: [],
searchGeographyCountryKey: null,
duplicateDismissHints: undefined,
},
) =>
useMemo(() => {
@ -96,6 +100,11 @@ export const useFilteredJobs = (
filtered = filtered.filter((job) => job.closedAt == null);
}
const duplicateHints = listExtras.duplicateDismissHints;
if (duplicateHints && duplicateHints.size > 0) {
filtered = filtered.filter((job) => !duplicateHints.has(job.id));
}
if (sourcesFilter.length > 0) {
const allow = new Set(sourcesFilter);
filtered = filtered.filter((job) => allow.has(job.source));

View File

@ -389,6 +389,22 @@ async function executeJobActionForJob(
});
}
const alsoSkipped = await jobsRepo.skipOpenJobsWithMatchingDedupKeys(
{
employer: updated.employer,
title: updated.title,
jobDescription: updated.jobDescription,
},
updated.ownerProfileId,
updated.id,
);
if (alsoSkipped > 0) {
logger.info("Auto-skipped duplicate open jobs", {
jobId: updated.id,
alsoSkipped,
});
}
return { jobId, ok: true, job: updated };
}
@ -1383,6 +1399,22 @@ jobsRouter.post("/:id/apply", async (req: Request, res: Response) => {
return fail(res, notFound("Job not found"));
}
const alsoSkipped = await jobsRepo.skipOpenJobsWithMatchingDedupKeys(
{
employer: updatedJob.employer,
title: updatedJob.title,
jobDescription: updatedJob.jobDescription,
},
updatedJob.ownerProfileId,
updatedJob.id,
);
if (alsoSkipped > 0) {
logger.info("Auto-skipped duplicate open jobs after apply", {
jobId: updatedJob.id,
alsoSkipped,
});
}
res.json({ success: true, data: updatedJob });
} catch (error) {
const message = error instanceof Error ? error.message : "Unknown error";

View File

@ -5,9 +5,11 @@
import { randomUUID } from "node:crypto";
import { getJobOwnerProfileId } from "@infra/request-context";
import { DEFAULT_JOB_OWNER_PROFILE_ID } from "@server/infra/job-owner-context";
-import { buildJobContentFingerprint } from "@shared/job-fingerprint";
+import {
+buildJobContentFingerprint,
+collectJobDedupKeys,
+} from "@shared/job-fingerprint";
import { canonicalizeJobUrl } from "@shared/job-url-canonical";
-import { normalizeIsRemote } from "@shared/work-arrangement";
import type {
CreateJobInput,
Job,
@ -16,6 +18,7 @@ import type {
JobsRevisionResponse,
UpdateJobInput,
} from "@shared/types";
+import { normalizeIsRemote } from "@shared/work-arrangement";
import { and, desc, eq, inArray, isNull, lt, ne, sql } from "drizzle-orm";
import { db, schema } from "../db/index";
@ -39,10 +42,13 @@ function resolveOwnerForCreate(input: CreateJobInput): string {
return getJobOwnerProfileId() ?? DEFAULT_JOB_OWNER_PROFILE_ID;
}
const OPEN_JOB_STATUSES: JobStatus[] = ["discovered", "ready"];
async function loadJobDedupIndexes(ownerProfileId: string): Promise<{
existingCanonicalSet: Set<string>;
existingSourceJobKeySet: Set<string>;
existingContentFingerprintSet: Set<string>;
dismissedDedupKeySet: Set<string>;
}> {
const rows = await db
.select({
@ -52,6 +58,8 @@ async function loadJobDedupIndexes(ownerProfileId: string): Promise<{
contentFingerprint: jobs.contentFingerprint,
employer: jobs.employer,
title: jobs.title,
jobDescription: jobs.jobDescription,
status: jobs.status,
})
.from(jobs)
.where(eq(jobs.ownerProfileId, ownerProfileId));
@ -70,12 +78,12 @@ async function loadJobDedupIndexes(ownerProfileId: string): Promise<{
// recomputing it from (employer, title) so legacy rows participate in
// dedup until they're rewritten.
const existingContentFingerprintSet = new Set<string>();
const dismissedDedupKeySet = new Set<string>();
for (const row of rows) {
const stored = row.contentFingerprint?.trim();
if (stored) {
existingContentFingerprintSet.add(stored);
-continue;
-}
+} else {
const recomputed = buildJobContentFingerprint({
employer: row.employer,
title: row.title,
@ -84,13 +92,114 @@ async function loadJobDedupIndexes(ownerProfileId: string): Promise<{
existingContentFingerprintSet.add(recomputed);
}
}
if (row.status === "skipped" || row.status === "applied") {
for (const key of collectJobDedupKeys({
employer: row.employer,
title: row.title,
jobDescription: row.jobDescription,
})) {
dismissedDedupKeySet.add(key);
}
}
}
return {
existingCanonicalSet,
existingSourceJobKeySet,
existingContentFingerprintSet,
dismissedDedupKeySet,
};
}
function inputMatchesDismissedDedupKeys(
input: CreateJobInput,
dismissedDedupKeySet: Set<string>,
): boolean {
if (dismissedDedupKeySet.size === 0) return false;
const keys = collectJobDedupKeys({
employer: input.employer,
title: input.title,
jobDescription: input.jobDescription,
});
return keys.some((key) => dismissedDedupKeySet.has(key));
}
async function findDismissedJobByDedupKeys(
keys: string[],
ownerProfileId: string,
): Promise<Job | null> {
if (keys.length === 0) return null;
const rows = await db
.select()
.from(jobs)
.where(
and(
eq(jobs.ownerProfileId, ownerProfileId),
inArray(jobs.status, ["skipped", "applied"]),
),
);
for (const row of rows) {
const rowKeys = collectJobDedupKeys({
employer: row.employer,
title: row.title,
jobDescription: row.jobDescription,
});
if (rowKeys.some((key) => keys.includes(key))) {
return mapRowToJob(row);
}
}
return null;
}
/**
* Skip other open jobs that match the same employer/title or description keys.
*/
export async function skipOpenJobsWithMatchingDedupKeys(
anchor: {
employer: string;
title: string;
jobDescription?: string | null;
},
ownerProfileId: string,
excludeJobId: string,
): Promise<number> {
const anchorKeys = collectJobDedupKeys(anchor);
if (anchorKeys.length === 0) return 0;
const rows = await db
.select({
id: jobs.id,
employer: jobs.employer,
title: jobs.title,
jobDescription: jobs.jobDescription,
})
.from(jobs)
.where(
and(
eq(jobs.ownerProfileId, ownerProfileId),
inArray(jobs.status, OPEN_JOB_STATUSES),
ne(jobs.id, excludeJobId),
),
);
let skipped = 0;
for (const row of rows) {
const rowKeys = collectJobDedupKeys({
employer: row.employer,
title: row.title,
jobDescription: row.jobDescription,
});
if (!rowKeys.some((key) => anchorKeys.includes(key))) continue;
const updated = await updateJob(
row.id,
{ status: "skipped" },
ownerProfileId,
);
if (updated) skipped += 1;
}
return skipped;
}
async function findJobByCanonicalUrl(
canonical: string,
ownerProfileId: string,
@ -480,8 +589,23 @@ export async function createJobs(
existingCanonicalSet,
existingSourceJobKeySet,
existingContentFingerprintSet,
dismissedDedupKeySet,
} = await loadJobDedupIndexes(ownerProfileId);
if (
inputMatchesDismissedDedupKeys(normalizedWithOwner, dismissedDedupKeySet)
) {
const existing = await findDismissedJobByDedupKeys(
collectJobDedupKeys({
employer: normalized.employer,
title: normalized.title,
jobDescription: normalized.jobDescription,
}),
ownerProfileId,
);
if (existing) return existing;
}
const sid = normalized.sourceJobId?.trim();
if (sid) {
const sk = sourceJobKey(normalized.source, sid);
@ -537,6 +661,7 @@ export async function createJobs(
existingCanonicalSet,
existingSourceJobKeySet,
existingContentFingerprintSet,
dismissedDedupKeySet,
} = await loadJobDedupIndexes(ownerProfileId);
const batchBuckets = new Map<
@ -582,6 +707,10 @@ export async function createJobs(
const sid = input.sourceJobId?.trim();
const sk = sid ? sourceJobKey(input.source, sid) : null;
if (inputMatchesDismissedDedupKeys(input, dismissedDedupKeySet)) {
skipped += count;
continue;
}
if (sk && existingSourceJobKeySet.has(sk)) {
skipped += count;
continue;

View File

@ -1,6 +1,8 @@
import { describe, expect, it } from "vitest";
import {
buildJobContentFingerprint,
buildJobDescriptionFingerprint,
collectJobDedupKeys,
normalizeEmployerForFingerprint,
normalizeTitleForFingerprint,
} from "./job-fingerprint";
@ -64,6 +66,29 @@ describe("buildJobContentFingerprint", () => {
expect(a).not.toBe(b);
});
it("matches reposts with the same employer and description body", () => {
const description =
"We are hiring an Automation Test Engineer to build scalable test frameworks. ".repeat(
4,
);
const a = buildJobDescriptionFingerprint({
employer: "Joveo",
jobDescription: description,
});
const b = buildJobDescriptionFingerprint({
employer: "Joveo",
jobDescription: description,
});
expect(a).toBe(b);
expect(
collectJobDedupKeys({
employer: "Joveo",
title: "SDET",
jobDescription: description,
}),
).toContain(a);
});
describe("normalizers", () => {
it("normalizeEmployerForFingerprint strips legal suffixes", () => {
expect(normalizeEmployerForFingerprint("Acme Corporation")).toBe("acme");

View File

@ -75,3 +75,49 @@ export function buildJobContentFingerprint(args: {
if (!employer || !title) return null;
return `${employer}::${title}`;
}
const DESCRIPTION_MIN_CHARS = 80;
export function normalizeDescriptionForFingerprint(
jobDescription: string | null | undefined,
): string {
if (!jobDescription?.trim()) return "";
let value = stripDiacritics(jobDescription.toLowerCase());
value = value.replace(/<[^>]+>/g, " ");
value = value.replace(PUNCTUATION_RE, " ");
value = value.replace(WHITESPACE_RE, " ").trim();
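// Too short to be a reliable signal; callers fall back to employer+title.
if (value.length < DESCRIPTION_MIN_CHARS) return "";
// Only the first 400 normalized characters form the key.
return value.slice(0, 400);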
}
/**
* Same employer + materially similar description (cross-posted copy).
*/
export function buildJobDescriptionFingerprint(args: {
employer: string | null | undefined;
jobDescription: string | null | undefined;
}): string | null {
const employer = normalizeEmployerForFingerprint(args.employer);
const description = normalizeDescriptionForFingerprint(args.jobDescription);
if (!employer || !description) return null;
return `${employer}::desc::${description}`;
}
export function collectJobDedupKeys(args: {
employer: string | null | undefined;
title: string | null | undefined;
jobDescription?: string | null | undefined;
}): string[] {
const keys = new Set<string>();
const content = buildJobContentFingerprint({
employer: args.employer,
title: args.title,
});
if (content) keys.add(content);
const description = buildJobDescriptionFingerprint({
employer: args.employer,
jobDescription: args.jobDescription,
});
if (description) keys.add(description);
return [...keys];
}