Data Engineering · 2026-05-28 · 11 min read

How We Deduplicate 1M+ Job Listings Across 40 ATS Platforms

A technical deep-dive into how JobDataLake deduplicates job listings at scale — covering syndication chain analysis, ATS ID extraction, fuzzy matching, and the tradeoffs at each layer.

The Problem No One Talks About

Job posting deduplication is one of the most unglamorous problems in the job data space — and one of the most impactful for data quality. When you're ingesting job listings from 40+ ATS platforms, the same open position at a company routinely appears multiple times: once from Greenhouse directly, once from Indeed's syndication of that Greenhouse listing, once from LinkedIn, and potentially multiple more times as smaller aggregators pick up the LinkedIn post.

A naive aggregation of job data from multiple sources produces a dataset where 30–40% of records are duplicates. This isn't an edge case — it's the structural reality of how job posting distribution works. The fix requires a multi-layer approach that operates at significantly different computational costs and accuracy levels.

Here's how we handle it at JobDataLake across 1M+ active listings.

Understanding the Syndication Chain

Before writing any deduplication code, it's worth understanding why duplicates exist in the first place. The typical lifecycle of a job posting:

  1. A recruiter creates the job in their ATS (Greenhouse, Lever, Workday, etc.)
  2. The ATS publishes it to the company's career site and its own job board
  3. The ATS (or the company) pushes it to aggregators: Indeed, LinkedIn, Glassdoor, ZipRecruiter
  4. Smaller aggregators crawl the large aggregators and republish
  5. APIs like ours crawl all of the above

By step 5, a single job opening might exist as 5–20 distinct records with slightly different titles, varying description whitespace, inconsistent location formatting, and different metadata timestamps. Each stop in the chain introduces its own normalization quirks.

The key insight for deduplication: the ATS is the source of truth. Every legitimate duplicate traces back to a single record in a single ATS system. If we can identify that ATS record ID, we can collapse all duplicates to the canonical source.

Layer 1: ATS ID Extraction (Recall: ~60%, Precision: ~99%)

The most reliable deduplication key is the ATS job ID extracted from the application URL. ATSs use consistent, stable URL patterns for their job listings:

function extractATSJobID(url: string): string | null {
  if (!url) return null;

  // Greenhouse: boards.greenhouse.io/company/jobs/1234567
  const gh = url.match(/greenhouse\.io\/[^/]+\/jobs\/(\d+)/);
  if (gh) return `gh:${gh[1]}`;

  // Lever: jobs.lever.co/company/uuid
  const lv = url.match(/jobs\.lever\.co\/[^/]+\/([a-f0-9-]{36})/);
  if (lv) return `lv:${lv[1]}`;

  // Workday: company.wd5.myworkdayjobs.com/.../job/location/title/JobID
  const wd = url.match(/myworkdayjobs\.com.+\/([A-Za-z0-9_-]{8,})/);
  if (wd) return `wd:${wd[1]}`;

  // Ashby: jobs.ashbyhq.com/company/uuid
  const ab = url.match(/ashbyhq\.com\/[^/]+\/([a-f0-9-]{36})/);
  if (ab) return `ab:${ab[1]}`;

  // Rippling: ats.rippling.com/job-descriptions/ID
  const rp = url.match(/rippling\.com\/job-descriptions\/(\d+)/);
  if (rp) return `rp:${rp[1]}`;

  // SmartRecruiters: careers.smartrecruiters.com/company/jobID
  const sr = url.match(/smartrecruiters\.com\/[^/]+\/(\d+)/);
  if (sr) return `sr:${sr[1]}`;

  return null;
}

We maintain ATS URL patterns for 40+ platforms in a pattern library that we update as ATS vendors change their URL schemes (which happens more often than you'd expect). When we extract an ATS ID, we're ~99% confident it's the authoritative identifier for that posting — two records with the same ATS ID are the same job, full stop.
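One way to keep such a pattern library easy to update is to store the patterns as data rather than as branches in a function. The sketch below is an illustration under assumptions, not our production code: the `AtsPattern` shape, the `ATS_PATTERNS` array, and `extractWithRegistry` are hypothetical names, and only two of the patterns shown earlier are included.

```typescript
// Hypothetical registry shape: each entry pairs a key prefix with a URL pattern.
interface AtsPattern {
  prefix: string;  // short namespace so IDs from different ATSs never collide
  pattern: RegExp; // must capture the job ID in group 1
}

// Two entries only, mirroring the Greenhouse and Lever URL formats.
const ATS_PATTERNS: AtsPattern[] = [
  { prefix: 'gh', pattern: /greenhouse\.io\/[^/]+\/jobs\/(\d+)/ },
  { prefix: 'lv', pattern: /jobs\.lever\.co\/[^/]+\/([a-f0-9-]{36})/ },
];

// Try each pattern in order and return the first namespaced ID that matches.
function extractWithRegistry(
  url: string,
  patterns: AtsPattern[] = ATS_PATTERNS,
): string | null {
  for (const { prefix, pattern } of patterns) {
    const match = url.match(pattern);
    if (match) return `${prefix}:${match[1]}`;
  }
  return null;
}
```

With this layout, a vendor URL change becomes a one-line edit to the array, and each pattern can be unit-tested against sample URLs on its own.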

The limitation: this only works when the application URL passes through intact in the scraped data. URL truncation, redirect chains, or sites that don't expose the original apply URL will miss this layer.
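Collapsing records on the extracted key, and passing the rest on to later layers, can be sketched roughly as follows. The `RawListing` and `Layer1Result` shapes and the `clusterByAtsKey` name are assumptions made for illustration; `atsKey` stands for whatever the extraction step returned, with `null` meaning no ATS ID was found.

```typescript
interface RawListing {
  id: string;            // hypothetical internal record ID
  atsKey: string | null; // result of ATS ID extraction, null when it failed
}

interface Layer1Result {
  clusters: Map<string, RawListing[]>; // atsKey -> every record sharing it
  unresolved: RawListing[];            // records left for the later layers
}

// Group listings by ATS key; listings without a key are returned separately.
function clusterByAtsKey(listings: RawListing[]): Layer1Result {
  const clusters = new Map<string, RawListing[]>();
  const unresolved: RawListing[] = [];
  for (const listing of listings) {
    if (listing.atsKey === null) {
      unresolved.push(listing);
      continue;
    }
    const bucket = clusters.get(listing.atsKey);
    if (bucket) bucket.push(listing);
    else clusters.set(listing.atsKey, [listing]);
  }
  return { clusters, unresolved };
}
```

This is a single pass with one map lookup per record, which is why the first layer is the cheapest of the three.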

Layer 2: Structural Hash (Recall: ~85%, Precision: ~95%)

For records where ATS ID extraction fails, we fall back to a structural key: a hash of the normalized company domain, normalized job title, and normalized location.

function normalizeTitle(title: string): string {
  return title
    .toLowerCase()
    // Strip seniority qualifiers: "Senior Software Engineer" and
    // "Software Engineer, Senior" should hash the same. Word boundaries
    // stop "lead" from matching inside words such as "leadership".
    .replace(/\b(?:senior|sr|junior|jr|lead|staff|principal|associate)\b\.?/g, '')
    // Strip common suffixes: "Software Engineer - Remote" and "Software Engineer"
    // are the same. Requiring whitespace before the separator keeps
    // hyphenated words such as "front-end" intact.
    .replace(/\s+[-–|]\s*.+$/, '')
    // Normalize common title variants
    .replace(/\bsoftware development engineer\b/, 'software engineer')
    .replace(/\bsde\b/, 'software engineer')
    .replace(/\bmle\b/, 'machine learning engineer')
    // Collapse commas and extra whitespace left behind by the steps above
    .replace(/[\s,]+/g, ' ')
    .trim();
}

function normalizeCompany(company: string, domain?: string): string {
  // Domain is more reliable than company name for matching
  if (domain) {
    return domain.toLowerCase().replace(/^www\./, '').split('.')[0];
  }
  return company
    .toLowerCase()
    // Strip legal suffixes as whole words so "co" is not removed from "costco"
    .replace(/\b(?:inc|corp|llc|ltd|co)\b\.?/g, '')
    // Collapse commas and whitespace left behind, e.g. "acme, inc." -> "acme"
    .replace(/[\s,]+/g, ' ')
    .trim();
}

import { createHash } from 'crypto';

function generateStructuralKey(job: { title: string; company: string; domain?: string; city?: string }): string {
  const parts = [
    normalizeCompany(job.company, job.domain),
    normalizeTitle(job.title),
    job.city?.toLowerCase() ?? 'remote',
  ].join('|');
  // 16 hex characters (64 bits) of the SHA-256 digest serve as the key
  return createHash('sha256').update(parts).digest('hex').slice(0, 16);
}

The structural key covers the cases where the apply URL is masked or redirected. Its limitation: title normalization can both over-match (treating genuinely different roles as duplicates) and under-match (missing duplicates where titles are phrased very differently). We tune the normalization rules based on precision/recall analysis against our labeled duplicate set.
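The precision/recall analysis mentioned above can be sketched as a pairwise evaluation. This is a minimal illustration, assuming the labeled set is stored as record pairs with a human verdict; `LabeledPair` and `evaluateKey` are hypothetical names, and `keyA`/`keyB` stand for the structural keys computed under the normalization rules being tested.

```typescript
interface LabeledPair {
  keyA: string;         // dedup key computed for the first record
  keyB: string;         // dedup key computed for the second record
  isDuplicate: boolean; // human label for the pair
}

// A pair is "predicted duplicate" when both records produce the same key.
function evaluateKey(pairs: LabeledPair[]): { precision: number; recall: number } {
  let tp = 0; // predicted duplicate, labeled duplicate
  let fp = 0; // predicted duplicate, labeled distinct (over-match)
  let fn = 0; // predicted distinct, labeled duplicate (under-match)
  for (const p of pairs) {
    const predicted = p.keyA === p.keyB;
    if (predicted && p.isDuplicate) tp++;
    else if (predicted && !p.isDuplicate) fp++;
    else if (!predicted && p.isDuplicate) fn++;
  }
  return {
    // With no predictions (or no labeled duplicates) the metric is vacuously 1
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 1 : tp / (tp + fn),
  };
}
```

Re-running this after each rule change shows whether a new normalization rule bought recall at an acceptable cost in precision.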

Layer 3: Fuzzy Description Matching (Recall: ~92%, Precision: ~90%)

The third layer catches cases where neither the ATS ID nor the structural hash matches, often because title formatting is highly inconsistent across sources. We compute the Jaccard similarity between the sets of k-word shingles built from each job description:

function descriptionShingles(text: string, k = 3): Set<string> {
  // Normalize and tokenize
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length > 2);

  // Create k-shingles (k-word sequences)
  const shingles = new Set<string>();
  for (let i = 0; i <= words.length - k; i++) {
    shingles.add(words.slice(i, i + k).join(' '));
  }
  return shingles;
}

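function jaccardSimilarity(a: Set<string>, b: Set<string>): number {
  const intersection = new Set([...a].filter(x => b.has(x)));
  const union = new Set([...a, ...b]);
  // Two empty sets have nothing to compare; treat them as non-matching
  // rather than returning NaN from 0 / 0
  if (union.size === 0) return 0;
  return intersection.size / union.size;
}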

// Two listings are duplicates if similarity > 0.7 AND same company
function isFuzzyDuplicate(jobA: any, jobB: any): boolean {
  if (normalizeCompany(jobA.company) !== normalizeCompany(jobB.company)) return false;
  const shinglesA = descriptionShingles(jobA.description ?? '');
  const shinglesB = descriptionShingles(jobB.description ?? '');
  return jaccardSimilarity(shinglesA, shinglesB) > 0.70;
}

This layer is computationally expensive — O(n²) comparisons in the naive implementation. We use MinHash LSH (Locality-Sensitive Hashing) to reduce it to near-O(n) by grouping candidates before computing exact Jaccard similarity.
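To make the candidate-grouping step concrete, here is a minimal MinHash LSH sketch. It is an illustration under assumptions rather than our production implementation: the function names, the seeded FNV-1a hash, and the band and row counts are all choices made for the example. Each description's shingle set is compressed into a fixed-length signature, the signature is cut into bands, and only records that share at least one identical band are compared with exact Jaccard.

```typescript
// 32-bit FNV-1a string hash, varied by a per-function seed (illustrative choice).
function seededHash(s: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// MinHash signature: for each seed, the minimum hash over all shingles.
function minhashSignature(shingles: Set<string>, numHashes: number): number[] {
  const sig: number[] = new Array(numHashes).fill(0xffffffff);
  shingles.forEach(sh => {
    for (let i = 0; i < numHashes; i++) {
      const h = seededHash(sh, i);
      if (h < sig[i]) sig[i] = h;
    }
  });
  return sig;
}

// LSH banding: records that agree on every row of any one band become a
// candidate pair. Requires bands * rows <= signature length.
function lshCandidatePairs(
  signatures: Map<string, number[]>,
  bands: number,
  rows: number,
): Set<string> {
  const buckets = new Map<string, string[]>();
  signatures.forEach((sig, id) => {
    for (let b = 0; b < bands; b++) {
      const key = b + ':' + sig.slice(b * rows, (b + 1) * rows).join(',');
      const bucket = buckets.get(key);
      if (bucket) bucket.push(id);
      else buckets.set(key, [id]);
    }
  });
  const pairs = new Set<string>();
  buckets.forEach(ids => {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        pairs.add([ids[i], ids[j]].sort().join('|')); // order-independent pair key
      }
    }
  });
  return pairs;
}
```

With b bands of r rows, a pair whose true Jaccard similarity is s becomes a candidate with probability 1 - (1 - s^r)^b, so the band and row counts set how sharply the filter cuts off around the chosen threshold. Candidates are then verified with the exact Jaccard computation, so LSH affects only which pairs get compared, not the final decision.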

The Canonical Selection Problem

Once we identify a duplicate cluster, we need to choose the canonical record — the one that stays, with the richest data. Our priority order:

  1. ATS-direct source with extracted ATS ID (most reliable metadata)
  2. Record with salary data: salary fields are often present in direct ATS sources but stripped in aggregator syndications
  3. Record with most structured skills: prefer longer skills arrays over raw descriptions
  4. Most recently updated record: fresher timestamps beat older ones
  5. Longest description: aggregators sometimes truncate; prefer the full version

function selectCanonical(cluster: Job[]): Job {
  // Sort a copy so the caller's cluster array is not reordered
  return [...cluster].sort((a, b) => {
    // Prefer ATS-direct sources
    const aHasAtsId = extractATSJobID(a.url) !== null ? 1 : 0;
    const bHasAtsId = extractATSJobID(b.url) !== null ? 1 : 0;
    if (aHasAtsId !== bHasAtsId) return bHasAtsId - aHasAtsId;

    // Prefer records with salary
    const aSalary = a.salary_min_usd ? 1 : 0;
    const bSalary = b.salary_min_usd ? 1 : 0;
    if (aSalary !== bSalary) return bSalary - aSalary;

    // Prefer more skills
    const aSkills = (a.required_skills ?? []).length;
    const bSkills = (b.required_skills ?? []).length;
    if (aSkills !== bSkills) return bSkills - aSkills;

    // Prefer more recent (assumes posted_at is a numeric timestamp)
    if (a.posted_at !== b.posted_at) return b.posted_at - a.posted_at;

    // Prefer the longest description
    return (b.description ?? '').length - (a.description ?? '').length;
  })[0];
}

What This Means for Data Quality

Running all three layers together, our deduplication pipeline achieves:

  • ~95% duplicate elimination rate on the ingest pipeline (most duplicates caught before storage)
  • ~2% false positive rate — genuinely different roles marked as duplicates (a known tradeoff with fuzzy matching)
  • ~3% false negative rate — true duplicates that slip through to storage

The false positive rate is the more costly error in practice — incorrectly merging two distinct roles is worse than showing a duplicate. We tune the fuzzy matching threshold toward higher precision (fewer false positives) at the cost of slightly higher false negative rate.
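Tuning the threshold toward precision can be sketched as a sweep over labeled pairs: pick the lowest threshold that still meets a precision target, since a lower threshold keeps more true duplicates. The names `ScoredPair` and `pickThreshold`, the candidate list, and the precision target are assumptions for illustration.

```typescript
interface ScoredPair {
  similarity: number;   // Jaccard similarity of the two descriptions
  isDuplicate: boolean; // human label for the pair
}

// Returns the lowest candidate threshold whose precision meets the target,
// or null when no candidate qualifies.
function pickThreshold(
  pairs: ScoredPair[],
  candidates: number[],
  minPrecision: number,
): number | null {
  const sorted = candidates.slice().sort((a, b) => a - b);
  for (const t of sorted) {
    let tp = 0; // flagged as duplicate and labeled duplicate
    let fp = 0; // flagged as duplicate but labeled distinct
    for (const p of pairs) {
      if (p.similarity > t) {
        if (p.isDuplicate) tp++;
        else fp++;
      }
    }
    if (tp + fp > 0 && tp / (tp + fp) >= minPrecision) return t;
  }
  return null;
}
```

Raising `minPrecision` pushes the chosen threshold up, which is the trade described above: fewer incorrect merges in exchange for more duplicates slipping through.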

The Ongoing Maintenance Problem

Deduplication is not a one-time build. ATS vendors change URL structures. Companies migrate between ATSs. New ATSs enter the market. We actively monitor:

  • ATS ID extraction success rate by platform (drops signal URL structure changes)
  • Duplicate rate by source over time (spikes signal new syndication partner)
  • Manual review queue for flagged potential false positives
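The first of these checks can be sketched as a comparison of per-platform extraction rates against a baseline. This is a minimal illustration: the `DailyStat` shape, the `flagExtractionDrops` name, and the default 10-point drop tolerance are assumptions, not values from our monitoring setup.

```typescript
interface DailyStat {
  platform: string;
  attempted: number; // records seen for this platform
  extracted: number; // records where an ATS ID was successfully extracted
}

// Flags platforms whose extraction success rate fell by more than maxDrop
// (an absolute difference, e.g. 0.1 = ten percentage points) versus baseline.
function flagExtractionDrops(
  baseline: DailyStat[],
  today: DailyStat[],
  maxDrop = 0.1,
): string[] {
  const rate = (s: DailyStat) => (s.attempted === 0 ? 0 : s.extracted / s.attempted);
  const baseRates = new Map<string, number>();
  baseline.forEach(s => baseRates.set(s.platform, rate(s)));

  const flagged: string[] = [];
  for (const s of today) {
    const base = baseRates.get(s.platform);
    // Skip platforms with no baseline or no traffic today
    if (base !== undefined && s.attempted > 0 && base - rate(s) > maxDrop) {
      flagged.push(s.platform);
    }
  }
  return flagged;
}
```

A flagged platform is a prompt to inspect recent URLs from that source and check whether its pattern needs updating.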

The deduplication layer is continuously tuned. It's not a feature you ship once — it's infrastructure you operate.

Frequently Asked Questions

How do you deduplicate job listings at scale?

We use a three-layer approach: ATS ID extraction from apply URLs (highest precision), structural hashing of normalized company+title+location (broad coverage), and fuzzy Jaccard similarity on description shingles (catches edge cases). Each successive layer trades some precision for more recall.

Why are there so many duplicate job listings online?

Job postings originate in an ATS and syndicate through multiple channels — Indeed, LinkedIn, ZipRecruiter, aggregators. Each step republishes the listing, often with slight variations. A single open position can appear 5–20 times across a raw aggregated dataset.

What is the ATS job ID and why does it matter?

The ATS job ID is the unique identifier for a job posting within the originating Applicant Tracking System (Greenhouse, Lever, Workday, etc.). It's embedded in the application URL and is the most reliable deduplication key — two records with the same ATS job ID are definitively the same posting.
