Data Engineering · 2026-04-18 · 10 min read

Job Data Enrichment: What Fields Matter and How to Get Them

A technical guide to job data enrichment — salary normalization, skill extraction, seniority inference, company enrichment, and building a complete job data model.

The Gap Between Raw and Enriched Job Data

A raw job posting is a title, a text description, a company name, and an apply URL. That's the minimum. What makes a job listing actually useful — for job seekers, for search, for analytics, for AI applications — is the structured, enriched layer built on top: normalized salary ranges, standardized skill tags, inferred seniority level, company metadata, and clean location data.

The gap between raw and enriched data is enormous, and it's where most of the value in job data products lives. This article covers each enrichment dimension, why it matters, and how to produce or acquire it.

Salary Normalization

Salary data is simultaneously the most valuable and most inconsistently presented field in job listings. The problems:

  • Some listings show annual salary, some show hourly rate, some show monthly
  • Some show base salary only, some include total compensation (base + equity + bonus)
  • Some show a range, some show a single number ("up to $X" or "from $X")
  • Currency varies for international roles
  • Many listings show no salary at all

A normalized salary model needs to handle all of these:

interface NormalizedSalary {
  min: number | null;        // Annual, base, USD
  max: number | null;        // Annual, base, USD
  currency: string;          // Always 'USD' after normalization
  compensation_type: 'base' | 'total' | 'unknown';
  confidence: 'high' | 'medium' | 'low';
}

// Converting various input formats to annual USD
function normalizeSalaryToAnnual(raw: RawSalaryInput): NormalizedSalary {
  let min = raw.min_value;
  let max = raw.max_value;

  switch (raw.period) {
    case 'HOUR':
      // 2080 hours/year (52 weeks × 40 hours)
      min = min ? min * 2080 : null;
      max = max ? max * 2080 : null;
      break;
    case 'MONTH':
      min = min ? min * 12 : null;
      max = max ? max * 12 : null;
      break;
    case 'WEEK':
      min = min ? min * 52 : null;
      max = max ? max * 52 : null;
      break;
    case 'YEAR':
    default:
      // No conversion needed
      break;
  }

  // Convert non-USD currencies
  if (raw.currency && raw.currency !== 'USD') {
    const rate = getExchangeRate(raw.currency, 'USD');
    min = min ? min * rate : null;
    max = max ? max * rate : null;
  }

  return {
    min: min ? Math.round(min) : null,
    max: max ? Math.round(max) : null,
    currency: 'USD',
    compensation_type: raw.includes_equity ? 'total' : 'base',
    confidence: raw.currency && raw.period ? 'high' : 'low',
  };
}

Note that JobDataLake stores salary in thousands for cleaner API responses and query parameters — salary_min: 150 means $150,000 annually. When you store this in your database, decide on a consistent convention and document it clearly. Multiplying by 1000 in the display layer is easy; inconsistent storage conventions cause expensive bugs.
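As a concrete sketch of that convention, the helpers below convert annual USD to stored thousands and format a stored range for display. The function names are illustrative, not part of any API:

```typescript
// Illustrative helpers for the "store salary in thousands" convention.
function toThousands(annualUsd: number): number {
  return Math.round(annualUsd / 1000); // 150000 -> 150
}

function formatSalaryRange(minK: number | null, maxK: number | null): string {
  const fmt = (k: number) => `$${k}k`;
  if (minK != null && maxK != null) return `${fmt(minK)} - ${fmt(maxK)}`;
  if (minK != null) return `from ${fmt(minK)}`;
  if (maxK != null) return `up to ${fmt(maxK)}`;
  return 'Not disclosed';
}
```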

Skill Extraction and Normalization

Skill tags are perhaps the most impactful enrichment for search and matching applications. A well-tagged job is dramatically more findable than one where skills are buried in prose.

The technical challenge has two parts: extraction (getting skill names from the description) and normalization (resolving aliases to canonical forms).

Extraction Approaches

  • Dictionary matching: Fast and reliable for standard technology names. Build a comprehensive dictionary and match against tokenized description text.
  • Pattern matching: Catches constructions like "5+ years of Python experience" or "proficiency in React"
  • NLP/NER: Named entity recognition models trained on job description data can catch skills that dictionary approaches miss, including emerging tools not yet in your dictionary
  • LLM extraction: Modern LLMs are surprisingly good at extracting structured skills from job descriptions — useful for periodic quality checks and catching dictionary gaps

async function extractSkillsWithLLM(description: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Extract technical skills, tools, and technologies from this job description.
        Return a JSON object of the form {"skills": ["skill1", "skill2"]} with normalized skill names.
        Normalize: "JS" -> "JavaScript", "k8s" -> "Kubernetes", "Postgres" -> "PostgreSQL".
        Only include concrete technical skills, not soft skills like "communication".`
      },
      { role: 'user', content: description.slice(0, 2000) }
    ],
    response_format: { type: 'json_object' },
  });
  const { skills } = JSON.parse(response.choices[0].message.content!);
  return skills;
}
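For comparison, the dictionary-matching approach from the list above can be sketched like this. The dictionary shown is a tiny illustrative subset; real dictionaries contain thousands of entries:

```typescript
// Minimal dictionary matcher over tokenized (lowercased) description text.
const SKILL_DICTIONARY = ['Python', 'TypeScript', 'Kubernetes', 'PostgreSQL', 'React'];

function extractSkillsByDictionary(description: string): string[] {
  const text = description.toLowerCase();
  const found: string[] = [];
  for (const skill of SKILL_DICTIONARY) {
    // Escape regex metacharacters, then require word boundaries so,
    // e.g., "Java" would not fire inside "JavaScript".
    const escaped = skill.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    if (new RegExp(`\\b${escaped}\\b`).test(text)) found.push(skill);
  }
  return found;
}
```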

Normalization Dictionary

Maintain a normalization map that handles common aliases:

const SKILL_NORMALIZATIONS: Record<string, string> = {
  'js': 'JavaScript',
  'javascript': 'JavaScript',
  'ts': 'TypeScript',
  'typescript': 'TypeScript',
  'py': 'Python',
  'python': 'Python',
  'postgres': 'PostgreSQL',
  'postgresql': 'PostgreSQL',
  'k8s': 'Kubernetes',
  'kubernetes': 'Kubernetes',
  'golang': 'Go',
  'node': 'Node.js',
  'nodejs': 'Node.js',
  'node.js': 'Node.js',
  'react': 'React',
  'react.js': 'React',
  'reactjs': 'React',
  'next.js': 'Next.js',
  'nextjs': 'Next.js',
  'aws': 'AWS',
  'amazon web services': 'AWS',
  'gcp': 'Google Cloud',
  'google cloud platform': 'Google Cloud',
};
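Applying the map is a lowercase lookup with a pass-through fallback for skills not yet in the dictionary. Shown here with a small subset of the map so the snippet stands alone:

```typescript
const NORMALIZATIONS: Record<string, string> = {
  'js': 'JavaScript',
  'k8s': 'Kubernetes',
  'postgres': 'PostgreSQL',
};

function normalizeSkill(raw: string): string {
  const key = raw.trim().toLowerCase();
  // Unknown skills pass through unchanged rather than being dropped.
  return NORMALIZATIONS[key] ?? raw.trim();
}
```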

Seniority Level Inference

Seniority level is a critical filter for job seekers and a useful analytical dimension. The challenge: seniority is expressed inconsistently across companies and industries.

Google uses internal leveling (L3, L4, L5...). Startups use "Engineer I/II/III". Some companies use "Associate", "Senior", "Staff", "Principal". Some use years of experience requirements. Some use none of these and just say "Software Engineer".

A practical seniority inference model combines multiple signals:

type SeniorityLevel = 'intern' | 'junior' | 'mid' | 'senior' | 'lead' | 'staff' | 'principal' | 'director' | 'vp' | 'c-level';

function inferSeniority(job: Job): SeniorityLevel {
  const title = job.title.toLowerCase();
  const desc = job.description.toLowerCase();

  // Title-based rules (highest confidence)
  if (/\b(intern|internship|co-op)\b/.test(title)) return 'intern';
  if (/\b(c[teo]o|cpo|cmo|cfo|chief)\b/.test(title)) return 'c-level'; // \b avoids false hits like "doctor" containing "cto"
  if (/\b(vp|vice president)\b/.test(title)) return 'vp';
  if (/\bdirector\b/.test(title)) return 'director';
  if (/\bprincipal\b/.test(title)) return 'principal';
  if (/\bstaff\b/.test(title)) return 'staff';
  if (/\b(lead|tech lead)\b/.test(title)) return 'lead';
  if (/\b(senior|sr\.?)\b/.test(title)) return 'senior';
  if (/\b(junior|jr\.?|associate)\b|entry.?level/.test(title)) return 'junior';

  // Description-based fallback (lower confidence)
  const yearsMatch = desc.match(/(\d+)\+?\s*years?\s+(?:of\s+)?(?:professional\s+)?experience/);
  if (yearsMatch) {
    const years = parseInt(yearsMatch[1], 10);
    if (years === 0) return 'intern';
    if (years <= 2) return 'junior';
    if (years <= 4) return 'mid';
    if (years <= 7) return 'senior';
    return 'lead';
  }

  // Default to mid if no signals
  return 'mid';
}

Location Enrichment

Raw location strings from job postings are messy: "SF Bay Area", "Greater New York City Area", "Austin, TX (Remote Friendly)", "Hybrid - Chicago or Remote". Normalization is essential for location-based filtering and geographic analytics.

The components of a normalized location model:

interface NormalizedLocation {
  raw: string;               // Original string from posting
  city: string | null;       // Canonical city name
  state: string | null;      // 2-letter state code for US
  country: string;           // ISO 3166-1 alpha-2 country code
  latitude: number | null;   // For geo-distance search
  longitude: number | null;
  metro: string | null;      // Metro area (e.g., "San Francisco Bay Area")
  work_mode: 'remote' | 'hybrid' | 'onsite' | 'unknown';
}
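One way to populate the work_mode field is keyword matching on the raw location string. A sketch, with keyword lists that are illustrative rather than exhaustive:

```typescript
function detectWorkMode(raw: string): 'remote' | 'hybrid' | 'onsite' | 'unknown' {
  const text = raw.toLowerCase();
  // Check hybrid first: "Hybrid - Chicago or Remote" should classify as hybrid.
  if (/hybrid/.test(text)) return 'hybrid';
  if (/remote|work from home|wfh/.test(text)) return 'remote';
  if (/on.?site|in.?office/.test(text)) return 'onsite';
  return 'unknown';
}
```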

For geocoding, use the Google Maps Geocoding API, Mapbox, or OpenStreetMap's Nominatim as a free option. Cache aggressively — the same city names repeat millions of times across job postings.
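A sketch of that caching layer follows, with geocodeViaProvider standing in for whichever provider you choose (it is stubbed here so the snippet is self-contained):

```typescript
type LatLng = { lat: number; lng: number } | null;

let providerCalls = 0; // instrumentation for the example only

// Stand-in for a real call to Google, Mapbox, or Nominatim.
async function geocodeViaProvider(query: string): Promise<LatLng> {
  providerCalls++;
  return null; // a real implementation returns coordinates here
}

const geoCache = new Map<string, LatLng>();

async function geocodeCached(location: string): Promise<LatLng> {
  const key = location.trim().toLowerCase(); // normalize the cache key
  if (geoCache.has(key)) return geoCache.get(key)!;
  const result = await geocodeViaProvider(key);
  geoCache.set(key, result); // cache misses too, to avoid re-querying bad strings
  return result;
}
```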

Company Enrichment

A job's company metadata dramatically affects how useful it is. Job seekers filter by company stage, size, and industry. Sales tools need company metadata for account scoring. Analytics products need company firmographic data for market analysis.

Core company fields to enrich:

  • Employee count: Most reliably sourced from LinkedIn, though approximate. Buckets (1-10, 11-50, 51-200, 201-500, 500+) are more reliable than precise numbers.
  • Funding stage: Pre-seed, seed, Series A/B/C+, public, bootstrapped. Crunchbase and PitchBook are the primary sources.
  • Industry/sector: Use a standard taxonomy (NAICS, SIC, or a custom classification)
  • Company logo: Clearbit Logo API is the standard solution — provides logos for most companies via domain lookup
  • Domain: The canonical company domain — essential for deduplication and enrichment lookups

async function enrichCompany(companyName: string, domain?: string): Promise<CompanyData> {
  const lookupKey = domain ?? await resolveDomain(companyName);

  // Clearbit enrichment (if you have access)
  const clearbitData = await clearbit.Company.find({ domain: lookupKey });

  return {
    name: clearbitData?.name ?? companyName,
    domain: lookupKey,
    logoUrl: `https://logo.clearbit.com/${lookupKey}`,
    employeeRange: normalizeHeadcount(clearbitData?.metrics?.employees),
    industry: clearbitData?.category?.industry ?? null,
    fundingStage: inferFundingStage(clearbitData),
    linkedinUrl: clearbitData?.linkedin?.handle
      ? `https://linkedin.com/company/${clearbitData.linkedin.handle}`
      : null,
  };
}
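One plausible implementation of the normalizeHeadcount helper used above, bucketing raw counts into the ranges mentioned earlier. The exact boundaries are a design choice, not a standard:

```typescript
type EmployeeRange = '1-10' | '11-50' | '51-200' | '201-500' | '500+' | 'unknown';

function normalizeHeadcount(employees: number | null | undefined): EmployeeRange {
  if (employees == null || employees < 1) return 'unknown';
  if (employees <= 10) return '1-10';
  if (employees <= 50) return '11-50';
  if (employees <= 200) return '51-200';
  if (employees <= 500) return '201-500';
  return '500+';
}
```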

Building a Complete Job Data Model

Combining all enrichment layers, a complete job data model looks like this:

interface EnrichedJob {
  // Identity
  id: string;
  external_id: string;
  source: string;
  canonical_url: string;
  apply_url: string;

  // Core fields
  title: string;
  description: string;
  description_html: string;

  // Enriched fields
  seniority_level: SeniorityLevel;
  employment_type: EmploymentType;
  skills: string[];           // Normalized skill names
  skills_required: string[];  // Must-have vs nice-to-have split
  skills_preferred: string[];

  // Salary (stored in thousands for cleaner API params)
  salary_min: number | null;  // e.g., 150 = $150k
  salary_max: number | null;
  salary_currency: string;
  salary_type: 'base' | 'total' | 'unknown';

  // Location
  location: NormalizedLocation;
  remote: boolean;

  // Company
  company: CompanyData;

  // Timestamps
  posted_at: string;          // ISO 8601
  expires_at: string | null;
  indexed_at: string;
  refreshed_at: string;

  // Quality signals
  has_salary: boolean;
  has_skills: boolean;
  dedup_key: string;
  data_quality_score: number; // 0-100
}

Enrichment at Scale

Enrichment is computationally expensive. At scale, you need to be strategic:

  • Enrich on write: Run enrichment as part of the ingest pipeline, so every record is enriched before storage
  • Cache company data: Company metadata doesn't change often. Cache it at the company level, not the job level — a company with 100 postings shouldn't trigger 100 Clearbit lookups
  • Prioritize by value: Salary extraction and skill tagging have the highest impact on user experience — prioritize those. Logo fetching can be async and lazy.
  • Use structured source data: When sourcing from an API that already provides enriched fields (like JobDataLake), use their extraction rather than re-doing it yourself. The incremental quality improvement rarely justifies the infrastructure cost.

The goal is a dataset where every record meets a minimum quality bar: salary populated for as many listings as possible, at least 3 skills tagged, seniority inferred, and location normalized. Listings that don't meet this bar should be deprioritized in search ranking, not excluded entirely — but they should be clearly differentiated in the UI so users aren't misled about completeness.
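That quality bar can be turned into the data_quality_score field with a simple weighted checklist. The weights below are illustrative:

```typescript
interface QualitySignals {
  salary_min: number | null;
  skills: string[];
  seniority_level: string | null;
  location_normalized: boolean;
}

function computeQualityScore(job: QualitySignals): number {
  let score = 0;
  if (job.salary_min != null) score += 40;      // salary is the highest-value field
  if (job.skills.length >= 3) score += 30;      // the "at least 3 skills tagged" bar
  if (job.seniority_level != null) score += 15;
  if (job.location_normalized) score += 15;
  return score; // 0-100
}
```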

Frequently Asked Questions

What is job data enrichment?

Job data enrichment is the process of adding structured fields to raw job postings. This includes normalizing salary to annual USD, extracting skills from descriptions, inferring seniority from titles, and adding company metadata like size and funding.

Why is enriched job data more valuable than raw postings?

Raw postings are unstructured text. You cannot filter by salary, skills, or seniority without enrichment. Enriched data enables precise search, analytics, and AI applications that raw descriptions cannot support.

How do you normalize job salaries across currencies?

Convert all salaries to annual USD using current exchange rates. Handle different pay periods (hourly, monthly) by converting to annual. Filter outliers (likely data errors) and store in a consistent unit like thousands.

Try JobDataLake

1M+ enriched job listings from 20,000+ companies. Free API key with 1,000 credits — no credit card required.