Data Quality · 2026-04-18 · 8 min read

Job Listing Data Quality: Why 30-40% of Listings Are Stale or Duplicated

The hidden data quality problems in job listing aggregation — syndication chains, deduplication strategies, freshness metrics, and how to build for quality.

The Job Listing Quality Problem

If you've built a job board or worked with job data at scale, you've encountered the quality problem: a surprising fraction of the listings in any large dataset are either duplicated, expired, or never real in the first place. Estimates vary, but the industry consensus is that 30–40% of listings in an unprocessed aggregated feed have significant quality issues.

This isn't a minor inconvenience — it's a core user experience problem. A job seeker who applies to three listings and discovers they're all the same role at the same company wastes hours. A job seeker who applies to a role that closed three weeks ago is demoralized. A job board with obvious data quality problems loses user trust quickly.

Understanding where quality problems come from, and how to address them systematically, is essential for anyone building with job data.

Source 1: Syndication Chains

The biggest source of duplication isn't careless data handling — it's the structure of the job board ecosystem itself.

When a company posts a job, the path to any given job board often looks like this: Company ATS → Indeed → JobBoard A → JobBoard B → Aggregator C → Your API → Your Board. At each step in this chain, the listing gets re-indexed, possibly with slight variations in title formatting, location data, and description whitespace.

A single job opening at a company like Google or Stripe might appear as 5–15 distinct records in a large aggregated dataset, each with slightly different metadata because each stop in the syndication chain introduced its own normalization quirks.

Deduplication strategies for syndication-caused duplication:

  • Company + title + location hash: The most basic dedup key. Catches obvious duplicates but misses variations in title normalization ("Sr. Engineer" vs "Senior Engineer")
  • Apply URL dedup: If two records share the same application URL (or the same ATS job ID in the URL), they're definitely the same listing
  • Fuzzy title matching: Normalize titles by stripping seniority qualifiers and comparing the base role name + company
  • Description similarity: For high-quality dedup, compute a content hash or similarity score on the normalized job description text

function extractATSJobId(applyUrl: string): string | null {
  // Greenhouse: /jobs/123456
  const greenhouse = applyUrl.match(/greenhouse\.io.*\/jobs\/(\d+)/);
  if (greenhouse) return `greenhouse:${greenhouse[1]}`;

  // Lever: /apply/uuid
  const lever = applyUrl.match(/jobs\.lever\.co\/[^/]+\/([a-f0-9-]+)/);
  if (lever) return `lever:${lever[1]}`;

  // Workday: various patterns
  const workday = applyUrl.match(/myworkdayjobs\.com.*\/job\/([A-Za-z0-9_-]+)/);
  if (workday) return `workday:${workday[1]}`;

  return null;
}

function generateDeduplicationKeys(job: Job): string[] {
  const keys: string[] = [];

  // ATS-based key (most reliable)
  const atsId = extractATSJobId(job.apply_url);
  if (atsId) keys.push(atsId);

  // Structural key
  const structuralKey = `${job.company_id}|${normalizeTitle(job.title)}|${normalizeLocation(job.location)}`;
  keys.push('struct:' + hashString(structuralKey));

  return keys;
}
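
The structural key above relies on a normalizeTitle helper that the snippet doesn't define. A minimal sketch of one possible implementation, which strips leading seniority qualifiers and punctuation so that "Sr. Engineer" and "Senior Engineer" produce the same key (the qualifier list here is illustrative, not exhaustive):

```typescript
// Hypothetical normalizeTitle: lowercase, drop punctuation, strip
// leading seniority qualifiers so title variants collide on purpose.
const SENIORITY_QUALIFIERS = [
  'sr', 'senior', 'jr', 'junior', 'staff', 'principal', 'lead', 'associate',
];

function normalizeTitle(title: string): string {
  const words = title
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ') // punctuation ("Sr." -> "sr") becomes whitespace
    .split(/\s+/)
    .filter(Boolean);

  // Strip leading seniority qualifiers ("senior staff engineer" -> "engineer")
  let i = 0;
  while (i < words.length && SENIORITY_QUALIFIERS.includes(words[i])) i++;
  return words.slice(i).join(' ');
}
```

With this, `normalizeTitle("Sr. Engineer")` and `normalizeTitle("Senior Engineer")` both reduce to `"engineer"`, so the structural keys match.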

Source 2: Zombie Listings

"Zombie listings" are job postings that remain visible on job boards after the position has been filled, cancelled, or put on hold. This happens for several reasons:

  • The company forgot to take down the posting
  • The ATS doesn't propagate deletions to downstream syndication partners
  • Some companies keep listings live indefinitely to build a candidate pipeline
  • Some job boards don't proactively verify that listings are still active

Zombie listings are particularly damaging to user experience. A well-funded startup that stopped hiring due to an internal reorg may have dozens of stale listings still circulating. Job seekers who apply don't hear back, and assume the problem is with their application rather than the listing.

How to detect and handle zombie listings:

  • Age-based TTL: Most legitimate job postings are filled within 30–60 days. Listings older than 90 days should be treated as suspect and ideally re-verified.
  • HTTP status checking: For listings with direct apply URLs, periodically check that the URL returns a 200 (not a 404 or a redirect to a "position filled" page)
  • Re-confirmation from source: If the job data API re-includes a listing in a fresh crawl, it's still active. If a listing disappears from the API feed, retire it.

async function verifyListingActive(applyUrl: string): Promise<boolean> {
  try {
    const res = await fetch(applyUrl, {
      method: 'HEAD',
      redirect: 'follow',
      signal: AbortSignal.timeout(5000),
    });

    // Some ATSs use 404 for closed roles, others redirect to a "position closed" page
    if (res.status === 404 || res.status === 410) return false;

    // Check for common "position closed" redirect patterns
    const finalUrl = res.url;
    if (finalUrl.includes('position-closed') || finalUrl.includes('job-closed')) return false;

    return res.status === 200;
  } catch {
    // Network error — don't expire, try again later
    return true;
  }
}

Source 3: Ghost Jobs

A ghost job is a listing that was never intended to be filled in the near term — it exists to build a candidate pipeline, for compliance reasons, or simply because someone in HR forgot to deactivate a requisition. Research suggests ghost jobs may account for 10–15% of all listings on major boards.

Ghost jobs are hard to detect programmatically. Some signals:

  • Very generic job descriptions with no specific requirements
  • Listings that have been active for 6+ months at the same company for the same role
  • Listings that appear and disappear cyclically (often caused by re-posting scripts)
  • Companies with an unusually high posting-to-headcount ratio based on their size

For most job board operators, the practical approach is to flag rather than remove suspected ghost jobs, and let user behavior (low click-through, high application bounce) inform ongoing quality scoring.
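
A flag-don't-remove policy can be driven by a simple additive score over the signals above. This is a hypothetical sketch: the field names, weights, and thresholds are illustrative placeholders, not calibrated values:

```typescript
// Hypothetical ghost-job scoring over the signals listed above.
// All thresholds and weights are illustrative assumptions.
interface GhostSignals {
  descriptionWordCount: number;  // proxy for "very generic description"
  daysActive: number;            // continuous time the listing has been live
  repostCount: number;           // times the same req appeared and disappeared
  openingsPerEmployee: number;   // posting-to-headcount ratio
}

function ghostScore(s: GhostSignals): number {
  let score = 0;
  if (s.descriptionWordCount < 100) score += 1; // thin, generic description
  if (s.daysActive > 180) score += 2;           // live 6+ months
  if (s.repostCount >= 3) score += 1;           // cyclic re-posting pattern
  if (s.openingsPerEmployee > 0.2) score += 1;  // unusually high posting ratio
  return score; // flag (don't remove) above a tuned threshold, e.g. >= 3
}
```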

Source 4: Data Normalization Inconsistencies

Even when a listing is real, fresh, and unique, normalization quality affects usability. Common normalization failures:

  • Location: "New York, NY" vs "New York City" vs "NYC" vs "Manhattan" — a user filtering for "New York" should see all of these
  • Seniority: "L5 Engineer" at Google means something different from "Senior Engineer" at a startup, even though both get normalized to "Senior"
  • Salary: Hourly vs. annual rates, base vs. total compensation, US dollars vs. other currencies all need explicit handling
  • Job title: "Software Development Engineer" (Amazon internal title) vs "Software Engineer" — semantically equivalent but textually different
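
The location problem in particular is often handled with an alias table before (or instead of) a full geocoding pass. A minimal sketch, assuming a hand-curated alias map covering the highest-traffic variants (the entries here are examples, not a complete list):

```typescript
// Hypothetical alias-table location normalization. Production systems
// usually back this with a geocoding service; a curated map catches
// the most common variants cheaply.
const LOCATION_ALIASES: Record<string, string> = {
  'nyc': 'New York, NY',
  'new york': 'New York, NY',
  'new york city': 'New York, NY',
  'manhattan': 'New York, NY',
  'sf': 'San Francisco, CA',
  'san francisco': 'San Francisco, CA',
};

function normalizeLocation(raw: string): string {
  const key = raw.trim().toLowerCase();
  return LOCATION_ALIASES[key] ?? raw.trim(); // fall through for unknown locations
}
```

With this table, a user filtering for "New York" sees listings tagged "NYC", "Manhattan", and "New York City" under one canonical value.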

Salary normalization is particularly important. JobDataLake stores salary in thousands (so salary_min: 150 = $150,000), which is a clean convention. But when ingesting from multiple sources, you need to detect and convert hourly rates and foreign currencies:

function normalizeSalary(raw: RawSalary): NormalizedSalary | null {
  if (!raw.min && !raw.max) return null;

  let min = raw.min;
  let max = raw.max;

  // Convert hourly to annual (assuming 2080 work hours/year)
  if (raw.unit === 'HOUR') {
    min = min ? min * 2080 : undefined;
    max = max ? max * 2080 : undefined;
  }

  // Convert to USD if foreign currency
  // (getFXRate: external FX-rate lookup, implementation not shown)
  if (raw.currency && raw.currency !== 'USD') {
    const rate = getFXRate(raw.currency, 'USD');
    min = min ? min * rate : undefined;
    max = max ? max * rate : undefined;
  }

  // Store in thousands for consistency
  return {
    min: min ? Math.round(min / 1000) : null,
    max: max ? Math.round(max / 1000) : null,
    currency: 'USD',
    unit: 'YEAR',
  };
}

Measuring and Monitoring Data Quality

You can't improve what you don't measure. A basic data quality dashboard for a job board should track:

  • Duplication rate: What percentage of ingested jobs were dropped as duplicates?
  • Age distribution: What percentage of live listings are older than 30/60/90 days?
  • Salary coverage: What percentage of listings have at least one salary field?
  • Skills coverage: What percentage of listings have at least 3 skills tagged?
  • Apply URL health: What percentage of apply URLs return a 200?

Track these metrics over time and by source. You'll quickly identify which sources produce high-quality data and which produce noise, allowing you to weight sources accordingly.
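
The per-source metrics above can be computed in one pass over the live listing set. A minimal sketch, where the Listing shape is an assumption for illustration rather than any real API's schema:

```typescript
// Hypothetical one-pass computation of three of the dashboard metrics
// listed above. The Listing shape here is an illustrative assumption.
interface Listing {
  source: string;
  postedAt: Date;
  salaryMin?: number;
  skills: string[];
}

interface QualityMetrics {
  staleOver60Pct: number;    // share of listings older than 60 days
  salaryCoveragePct: number; // share with at least one salary field
  skillsCoveragePct: number; // share with at least 3 skills tagged
}

function computeMetrics(listings: Listing[], now: Date): QualityMetrics {
  const n = listings.length || 1; // avoid divide-by-zero on empty sources
  const ageDays = (d: Date) => (now.getTime() - d.getTime()) / 86_400_000;
  return {
    staleOver60Pct: (100 * listings.filter(l => ageDays(l.postedAt) > 60).length) / n,
    salaryCoveragePct: (100 * listings.filter(l => l.salaryMin != null).length) / n,
    skillsCoveragePct: (100 * listings.filter(l => l.skills.length >= 3).length) / n,
  };
}
```

Running this grouped by `source` is what makes the weighting decision possible: a source whose stale rate climbs month over month is telling you something.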

The Case for API-First Data Sourcing

One of the strongest arguments for using a job data API (rather than scraping directly) is that the vendor has already applied significant quality processing before the data reaches you. A good API provider deduplicates across its sources, normalizes fields consistently, and removes expired listings. You still need to apply your own quality filters, but you're starting from a much cleaner baseline.

When evaluating job data APIs, ask specifically about their deduplication approach, their listing freshness guarantees, and how quickly expired listings are removed from the feed. These are the quality signals that separate vendors worth paying for from those that will silently degrade your user experience.

Frequently Asked Questions

Why are so many job listings duplicated?

Companies post on their ATS, which syndicates to Indeed, LinkedIn, ZipRecruiter, and dozens of aggregators. A single posting can appear 10–50 times across platforms, creating massive duplication in aggregated datasets.

How do I detect stale job listings?

Check the posting date — listings older than 60 days are likely filled. Some APIs like JobDataLake track posting freshness and update hourly. You can also check if the original ATS URL still returns 200.

What percentage of job listings online are duplicates?

Research suggests 30–40% of listings on major aggregators are duplicates or stale. Using a deduplicated source like a job data API with ATS-level deduplication dramatically reduces noise.

Try JobDataLake

1M+ enriched job listings from 20,000+ companies. Free API key with 1,000 credits — no credit card required.