---
title: "Data Cleaning & Pre-Processing"
subtitle: "Preparing and Standardizing Lightcast Job Postings Data for Analysis"
bibliography: references.bib
csl: csl/econometrica.csl
format:
html:
code-fold: true
code-tools: true
---
<div class="card reveal">
## Objective
Data cleaning is the most critical phase in any analytics workflow.
Following the AD688 project requirements, our goals for this section are:
- inspect the raw dataset structure,
- identify and remove redundant or outdated columns,
- fix inconsistent formatting (e.g., stringified list fields),
- **standardize salary fields and preserve them even when sparsely reported**,
- standardize datetime fields and create analysis-ready variables,
- handle missing values using appropriate statistical strategies,
- remove duplicate job postings,
- export a clean, consistent dataset used across EDA, ML, and NLP sections.
This ensures accuracy, reproducibility, and coherence across the entire site.
</div>
---
<div class="card reveal">
### Initial Dataset Overview
```{python}
import pandas as pd
import numpy as np
import ast
import re
# Load raw Lightcast job postings
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
# Basic structural overview
# df.head()
# df.info()
# list(df.columns)
```
</div>
---
<div class="card reveal">
### Dropping Redundant or Outdated Columns
Lightcast contains multiple versions of NAICS/SOC classifications, plus metadata fields (tracking URLs, flags, timestamps).
We remove clearly redundant metadata and older classification versions while keeping the 2022 NAICS codes and 2021 SOC codes.
```{python}
columns_to_drop = [
"ID", "URL", "ACTIVE_URLS",
"DUPLICATES", "LAST_UPDATED_TIMESTAMP", "NAICS2",
"NAICS3", "NAICS4", "NAICS5", "NAICS6",
"SOC_2", "SOC_3", "SOC_5"
]
df.drop(
columns=columns_to_drop,
inplace=True,
errors="ignore"
)
# df.head()
```
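As a quick sanity check, we can list which classification columns remain after the drop. The exact NAICS/SOC column names vary by Lightcast extract, so this is only an inspection aid, not part of the cleaning itself.

```{python}
# Optional check: which NAICS / SOC classification columns survived the drop?
remaining_class_cols = sorted(
    c for c in df.columns if "NAICS" in c or "SOC" in c
)
# display(remaining_class_cols)
```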
</div>
---
<div class="card reveal">
### Cleaning Messy List-Type Columns
Many Lightcast columns store lists as text, such as `["Job Board"]` or `["disabledperson.com", "dejobs.org"]`, so we convert these into clean, comma-separated strings.
```{python}
def clean_list(x):
"""
Convert stringified Python lists into comma-separated strings and strip noise.
If parsing fails, fall back to a lightly cleaned string.
"""
if pd.isna(x):
return ""
x = str(x).replace("\n", " ").strip()
try:
obj = ast.literal_eval(x)
if isinstance(obj, list):
return ", ".join(str(i).strip() for i in obj)
except Exception:
pass
# Fallback: remove stray quotes if present
x = re.sub(r'["\']', "", x)
return x.strip()
list_cols = [
"SOURCE_TYPES", "SOURCES",
"SKILLS_NAME", "SPECIALIZED_SKILLS_NAME",
"COMMON_SKILLS_NAME", "SOFTWARE_SKILLS_NAME",
"CERTIFICATIONS_NAME"
]
for col in list_cols:
if col in df.columns:
df[col] = df[col].apply(clean_list)
# display(df.head(3))
```
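A quick check on a made-up example value (not taken from the dataset) confirms the behaviour:

```{python}
# Sanity check on a made-up example value (not from the dataset)
print(clean_list('["disabledperson.com", "dejobs.org"]'))
```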
</div>
---
<div class="card reveal">
### Salary Fields: Cleaning and Preservation
Salary information is central to our job-market analysis, so we explicitly clean and preserve all salary-related fields, even when they are sparsely populated.
We:

- identify key salary-like columns: SALARY_FROM, SALARY_TO, SALARY, ORIGINAL_PAY_PERIOD,
- strip currency symbols, commas, and other non-numeric characters from the numeric fields,
- create a standardized PAY_PERIOD column,
- convert everything to an annualized ANNUAL_SALARY field suitable for EDA and modeling.
```{python}
# Key salary-related columns we want to preserve
salary_cols = [
"SALARY_FROM",
"SALARY_TO",
"SALARY",
"ORIGINAL_PAY_PERIOD"
]
# Ensure the columns exist
for col in salary_cols:
if col not in df.columns:
df[col] = np.nan
# Helper to clean salary values, including ranges like "90k-120k"
def clean_salary_range(x):
    if pd.isna(x):
        return np.nan
    x = str(x).lower().strip()
    # Values written with a "k" suffix (e.g., "90k") are in thousands
    scale = 1000 if re.search(r"\d\s*k", x) else 1
    # Keep digits, commas, dots, minus, and en-dash
    x = re.sub(r"[^\d,.\-–]", "", x)
    # Split on hyphen/en-dash; a single value is just a one-element "range"
    nums = []
    for part in re.split(r"[-–]", x):
        part = part.replace(",", "")
        if not part:
            continue
        try:
            nums.append(float(part) * scale)
        except ValueError:
            continue
    if not nums:
        return np.nan
    # Average the endpoints of a range; a single value averages to itself
    return float(np.mean(nums))
# Clean numeric salary fields (FROM, TO, SALARY)
numeric_salary_fields = [
"SALARY_FROM", "SALARY_TO", "SALARY"
]
for col in numeric_salary_fields:
df[col] = df[col].apply(clean_salary_range)
# Choose the best base salary for each row
def get_base_salary(row):
# 1. If SALARY exists, treat it as already standardized
if pd.notna(row["SALARY"]):
return row["SALARY"]
# 2. If both FROM and TO exist, average them
if pd.notna(row["SALARY_FROM"]) and pd.notna(row["SALARY_TO"]):
return (row["SALARY_FROM"] + row["SALARY_TO"]) / 2
# 3. Fallback to whichever exists
if pd.notna(row["SALARY_FROM"]):
return row["SALARY_FROM"]
if pd.notna(row["SALARY_TO"]):
return row["SALARY_TO"]
return np.nan
# Normalize pay period strings
def normalize_period(x):
if pd.isna(x):
return "year"
x = str(x).lower()
if any(k in x for k in ["year", "yr", "annual"]):
return "year"
if any(k in x for k in ["month", "mo", "per_month"]):
return "month"
if "biweek" in x:
return "biweek"
if any(k in x for k in ["week", "wk", "per_week"]):
return "week"
if any(k in x for k in ["day", "daily"]):
return "day"
if any(k in x for k in ["hour", "hr"]):
return "hour"
return "year"
df["PAY_PERIOD"] = df["ORIGINAL_PAY_PERIOD"].apply(normalize_period)
# Convert base salary to annual salary
def convert_to_annual(row):
salary = get_base_salary(row)
period = row["PAY_PERIOD"]
if pd.isna(salary):
return np.nan
if period == "year":
return salary
if period == "month":
return salary * 12
if period == "week":
return salary * 52
if period == "biweek":
return salary * 26
if period == "day":
return salary * 260 # approx. work days in a year
if period == "hour":
return salary * 2080 # 40 hours * 52 weeks
return salary
df["ANNUAL_SALARY"] = df.apply(convert_to_annual, axis=1)
# Remove implausible values (e.g., < $10k or > $700k per year)
df.loc[df["ANNUAL_SALARY"] < 10000, "ANNUAL_SALARY"] = np.nan
df.loc[df["ANNUAL_SALARY"] > 700000, "ANNUAL_SALARY"] = np.nan
df["ANNUAL_SALARY"] = df["ANNUAL_SALARY"].astype(float)
df[["SALARY_FROM",
"SALARY_TO",
"SALARY",
"PAY_PERIOD",
"ANNUAL_SALARY"]
].head(3)
```
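Because salary fields are sparse, it is worth recording how many postings end up with a usable annual figure. This is a small diagnostic, not part of the cleaning itself.

```{python}
# Share of postings with a usable annualized salary after cleaning
coverage = df["ANNUAL_SALARY"].notna().mean()
print(f"Postings with ANNUAL_SALARY available: {coverage:.1%}")
```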
</div>
---
<div class="card reveal">
### Handling Missing Values
Understanding missingness is important for determining how to impute values and whether key fields are usable.
We focus on essential analytical fields:

- salary fields (now cleaned),
- experience requirements,
- location fields,
- duration.
We drop extremely sparse columns (more than 50% missing) but never drop salary-related fields, even if they exceed this threshold.
```{python}
# Inspect the most-missing columns
missing_summary = df.isna().mean().sort_values(ascending=False).head(15)
# display(missing_summary)
```
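We can also look specifically at the fields listed above. The non-salary column names below (MIN_YEARS_EXPERIENCE, CITY_NAME, DURATION) are assumptions about the Lightcast schema and should be adjusted to whatever the extract actually contains; the check is guarded so missing columns are simply skipped.

```{python}
# Missingness for the fields we care most about. The non-salary column names
# (MIN_YEARS_EXPERIENCE, CITY_NAME, DURATION) are assumed and may need to be
# adjusted to match the actual extract.
key_fields = [
    c for c in [
        "ANNUAL_SALARY", "SALARY_FROM", "SALARY_TO",
        "MIN_YEARS_EXPERIENCE", "CITY_NAME", "DURATION",
    ]
    if c in df.columns
]
# display(df[key_fields].isna().mean().sort_values(ascending=False))
```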
```{python}
# Critical columns that must be preserved
critical_cols = [
c for c in ([
"SALARY_FROM",
"SALARY_TO",
"SALARY",
"ORIGINAL_PAY_PERIOD",
"ANNUAL_SALARY",
"PAY_PERIOD"
])
if c in df.columns
]
threshold = len(df) * 0.5
# Drop columns with >50% missing, except critical salary / pay-period fields
cols_to_drop_missing = [
c for c in df.columns
if c not in critical_cols and df[c].isna().sum() > threshold
]
df.drop(columns=cols_to_drop_missing, inplace=True)
# Fill numeric values with median (EXCLUDING salary-related fields)
num_cols = df.select_dtypes(include=["float", "int"]).columns
num_exclude = set(["SALARY_FROM", "SALARY_TO", "SALARY", "ANNUAL_SALARY"])
num_fill = [c for c in num_cols if c not in num_exclude]
df[num_fill] = df[num_fill].fillna(df[num_fill].median())
# Leave salary-related fields as NaN when missing
# Fill categorical values with "Unknown"
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("Unknown")
print("Shape after missing-value handling:", df.shape)
# df.info()
```
</div>
---
<div class="card reveal">
### Standardizing the POSTED Date and Creating a Month Column
Datetime consistency is essential for trend analysis and for joining with FRED macroeconomic indicators.
```{python}
# df["POSTED"] = pd.to_datetime(df["POSTED"], errors="coerce")
# df["month"] = df["POSTED"].dt.to_period("M").astype(str)
df = df.assign(
POSTED = pd.to_datetime(df["POSTED"], errors="coerce"),
month = lambda x: x["POSTED"].dt.to_period("M").astype(str)
)
display(df[["POSTED", "month"]].head())
```
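For the later joins with monthly FRED series, a quick count of postings per month confirms that the month key is well formed:

```{python}
# Postings per month: a quick check that the month key is usable for joins
monthly_counts = df["month"].value_counts().sort_index()
# display(monthly_counts)
```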
</div>
---
<div class="card reveal">
### Removing Duplicate Job Postings
Duplicate postings can overcount job demand and distort salary statistics.
We remove duplicates based on:
- job title
- company
- location
- original posting date
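Before dropping anything, a small optional diagnostic shows how many rows share the same key combination:

```{python}
# Optional diagnostic: count rows that share the same (title, company,
# location, posting date) key before deduplication
dup_keys = [
    c for c in ["TITLE_NAME", "COMPANY_NAME", "LOCATION", "POSTED"]
    if c in df.columns
]
if dup_keys:
    print("Rows flagged as duplicates:", df.duplicated(subset=dup_keys).sum())
```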
```{python}
subset_cols = [
c for c in [
"TITLE_NAME",
"COMPANY_NAME",
"LOCATION",
"POSTED"]
if c in df.columns]
if subset_cols:
before = df.shape[0]
df = df.drop_duplicates(subset=subset_cols, keep="first")
after = df.shape[0]
print(f"Removed {before - after} duplicate postings.")
else:
print("Duplicate removal skipped: key columns missing.")
print(df.shape)
```
</div>
---
<div class="card reveal">
### Exporting the Final Cleaned Dataset
This cleaned dataset will be used consistently across:

- Exploratory Data Analysis (EDA),
- Skill Gap Analysis,
- ML Methods,
- NLP Methods.
Exporting ensures team alignment and prevents inconsistent results.
```{python}
df.to_csv("data/lightcast_cleaned.csv", index=False)
# df.head()
print(df.shape)
```
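As an optional round-trip check, the exported file can be re-read to confirm it loads with the expected shape:

```{python}
# Optional round-trip check on the exported file
reloaded = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
print("Reloaded shape:", reloaded.shape)
```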
</div>
---
<div class="card reveal">
## Summary
This cleaned dataset establishes the foundation for all downstream analysis.
By explicitly cleaning and preserving salary fields, selectively dropping only non-essential sparse columns, standardizing dates, and removing duplicates, we ensure:
- accuracy and reproducibility,
- consistent salary-based insights,
- reduced rendering load for GitHub Pages,
- and smooth integration with the ML and NLP pipelines.
This completes the Data Cleaning & Pre-Processing phase.
</div>