Gender Disparities in Labor Force Analysis (2024)

Data Cleaning & Pre-Processing

Preparing and Standardizing Lightcast Job Postings Data for Analysis

Objective

Data cleaning is one of the most critical phases in any analytics workflow.
Following the AD688 project requirements, our goals for this section are to:

  • inspect the raw dataset structure,
  • identify and remove redundant or outdated columns,
  • fix inconsistent formatting (e.g., stringified list fields),
  • standardize salary fields and preserve them even when sparsely reported,
  • standardize datetime fields and create analysis-ready variables,
  • handle missing values using appropriate statistical strategies,
  • remove duplicate job postings,
  • export a clean, consistent dataset used across EDA, ML, and NLP sections.

This ensures accuracy, reproducibility, and coherence across the entire site.


Initial Dataset Overview

Code
import pandas as pd
import numpy as np
import ast
import re

# Load raw Lightcast job postings
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)

# Basic structural overview
# df.head()
# df.info()
# list(df.columns)
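
For a compact structural snapshot, a minimal sketch like the one below can replace the full df.head() / df.info() output; it uses only standard pandas calls, and the exact column names it reports depend on the raw Lightcast extract.

Code
# Compact overview: dimensions, dtype mix, and the ten most incomplete columns
print("Rows x columns:", df.shape)
print(df.dtypes.value_counts())
print(df.isna().mean().sort_values(ascending=False).head(10))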

Dropping Redundant or Outdated Columns

Lightcast contains multiple versions of NAICS/SOC classifications, plus metadata fields (tracking URLs, flags, timestamps). We remove clearly redundant metadata and older classification versions while keeping the 2022 NAICS codes and 2021 SOC codes.

Code
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", 
    "DUPLICATES", "LAST_UPDATED_TIMESTAMP", "NAICS2", 
    "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_3", "SOC_5"
]

df.drop(
    columns=columns_to_drop, 
    inplace=True, 
    errors="ignore"
    )
# df.head()
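
As a quick sanity check (a minimal sketch; the surviving column names, e.g. NAICS_2022 or SOC_2021, depend on the Lightcast extract), we can list which classification columns remain after the drop:

Code
# Which industry / occupation classification columns survive the drop
remaining_classification_cols = [
    c for c in df.columns if "NAICS" in c or "SOC" in c
]
print(remaining_classification_cols)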

Cleaning Messy List-Type Columns

Many Lightcast columns store lists as text such as:

["Job Board"]
["disabledperson.com", "dejobs.org"]

We convert these into clean comma-separated strings.

Code
def clean_list(x):
    """
    Convert stringified Python lists into comma-separated strings and strip noise.
    If parsing fails, fall back to a lightly cleaned string.
    """
    if pd.isna(x):
        return ""
    x = str(x).replace("\n", " ").strip()
    try:
        obj = ast.literal_eval(x)
        if isinstance(obj, list):
            return ", ".join(str(i).strip() for i in obj)
    except Exception:
        pass
    # Fallback: remove stray quotes if present
    x = re.sub(r'["\']', "", x)
    return x.strip()

list_cols = [
    "SOURCE_TYPES", "SOURCES",
    "SKILLS_NAME", "SPECIALIZED_SKILLS_NAME",
    "COMMON_SKILLS_NAME", "SOFTWARE_SKILLS_NAME",
    "CERTIFICATIONS_NAME"
]

for col in list_cols:
    if col in df.columns:
        df[col] = df[col].apply(clean_list)

# display(df.head(3))
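
A quick illustration of the helper on the kinds of strings shown above (values are illustrative, not drawn from the dataset):

Code
# Demonstrate clean_list on stringified lists and on a missing value
print(clean_list('["Job Board"]'))                          # Job Board
print(clean_list('["disabledperson.com", "dejobs.org"]'))   # disabledperson.com, dejobs.org
print(clean_list(np.nan))                                   # missing values become empty strings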

Salary Fields: Cleaning and Preservation

Salary information is central to our job-market analysis, so we explicitly clean and preserve all salary-related fields, even when they are sparsely populated.

We:

  • identify key salary-like columns: SALARY_FROM, SALARY_TO, SALARY, ORIGINAL_PAY_PERIOD,
  • strip currency symbols, commas, and non-numeric characters from numeric fields,
  • create a standardized PAY_PERIOD,
  • convert everything to an annualized ANNUAL_SALARY field suitable for EDA and modeling.

Code
# Key salary-related columns we want to preserve
salary_cols = [
    "SALARY_FROM", 
    "SALARY_TO", 
    "SALARY", 
    "ORIGINAL_PAY_PERIOD"
    ]

# Ensure the columns exist
for col in salary_cols:
    if col not in df.columns:
        df[col] = np.nan

# Helper to clean salary values, including ranges like "90,000 - 120,000" (ranges are averaged)
def clean_salary_range(x):
    if pd.isna(x):
        return np.nan
    x = str(x)

    # Keep digits, commas, dot, minus, and en-dash
    x = re.sub(r"[^\d,\.\-\–]", "", x)

    # If there is a range, split and average
    if "-" in x or "–" in x:
        parts = re.split(r"[-–]", x)
        nums = []
        for p in parts:
            p = p.replace(",", "")
            if p == "":
                continue
            try:
                nums.append(float(p))
            except Exception:
                pass
        if len(nums) >= 2:
            return np.mean(nums)
        elif len(nums) == 1:
            return nums[0]
        return np.nan

    # Otherwise, treat as a single value
    x = x.replace(",", "")
    if x == "":
        return np.nan
    try:
        return float(x)
    except Exception:
        return np.nan

# Clean numeric salary fields (FROM, TO, SALARY)
numeric_salary_fields = [
    "SALARY_FROM", "SALARY_TO", "SALARY"
    ]

for col in numeric_salary_fields:
    df[col] = df[col].apply(clean_salary_range)

# Choose the best base salary for each row
def get_base_salary(row):
    # 1. If SALARY exists, treat it as already standardized
    if pd.notna(row["SALARY"]):
        return row["SALARY"]
    # 2. If both FROM and TO exist, average them
    if pd.notna(row["SALARY_FROM"]) and pd.notna(row["SALARY_TO"]):
        return (row["SALARY_FROM"] + row["SALARY_TO"]) / 2
    # 3. Fallback to whichever exists
    if pd.notna(row["SALARY_FROM"]):
        return row["SALARY_FROM"]
    if pd.notna(row["SALARY_TO"]):
        return row["SALARY_TO"]
    return np.nan

# Normalize pay period strings
def normalize_period(x):
    if pd.isna(x):
        return "year"
    x = str(x).lower()

    if any(k in x for k in ["year", "yr", "annual"]):
        return "year"
    if any(k in x for k in ["month", "mo", "per_month"]):
        return "month"
    if "biweek" in x:
        return "biweek"
    if any(k in x for k in ["week", "wk", "per_week"]):
        return "week"
    if any(k in x for k in ["day", "daily"]):
        return "day"
    if any(k in x for k in ["hour", "hr"]):
        return "hour"
    return "year"

df["PAY_PERIOD"] = df["ORIGINAL_PAY_PERIOD"].apply(normalize_period)

# Convert base salary to annual salary
def convert_to_annual(row):
    salary = get_base_salary(row)
    period = row["PAY_PERIOD"]

    if pd.isna(salary):
        return np.nan

    if period == "year":
        return salary
    if period == "month":
        return salary * 12
    if period == "week":
        return salary * 52
    if period == "biweek":
        return salary * 26
    if period == "day":
        return salary * 260   # approx. work days in a year
    if period == "hour":
        return salary * 2080  # 40 hours * 52 weeks
    return salary

df["ANNUAL_SALARY"] = df.apply(convert_to_annual, axis=1)

# Remove implausible values (e.g., < $10k or > $700k per year)
df.loc[df["ANNUAL_SALARY"] < 10000, "ANNUAL_SALARY"] = np.nan
df.loc[df["ANNUAL_SALARY"] > 700000, "ANNUAL_SALARY"] = np.nan

df["ANNUAL_SALARY"] = df["ANNUAL_SALARY"].astype(float)

df[["SALARY_FROM", 
    "SALARY_TO", 
    "SALARY", 
    "PAY_PERIOD",
    "ANNUAL_SALARY"]
    ].head(3)
SALARY_FROM SALARY_TO SALARY PAY_PERIOD ANNUAL_SALARY
0 NaN NaN NaN year NaN
1 NaN NaN NaN year NaN
2 NaN NaN NaN year NaN
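
To make the annualization logic concrete, here is a minimal sketch that applies the same helpers to a few synthetic rows; the values are illustrative only and are not taken from the dataset.

Code
# Illustrative only: how the range-cleaning and annualization helpers behave
print(clean_salary_range("$90,000 - $120,000"))   # 105000.0 (range is averaged)

example = pd.DataFrame({
    "SALARY":      [np.nan, 95000.0, np.nan],
    "SALARY_FROM": [30.0,   np.nan,  4000.0],
    "SALARY_TO":   [40.0,   np.nan,  5000.0],
    "PAY_PERIOD":  ["hour", "year",  "month"],
})
example["ANNUAL_SALARY"] = example.apply(convert_to_annual, axis=1)
print(example)
# hourly (30+40)/2 * 2080 = 72,800; yearly stays 95,000; monthly (4000+5000)/2 * 12 = 54,000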

Handling Missing Values

Understanding missingness is important for determining how to impute values and whether key fields are usable.

We focus on essential analytical fields:

  • salary fields (now cleaned),
  • experience requirements,
  • location fields,
  • duration.

We drop extremely sparse columns (more than 50% missing) but never drop salary-related fields, even if they exceed this threshold.

Code
# Inspect the most-missing columns
missing_summary = df.isna().mean().sort_values(ascending=False).head(15)
# display(missing_summary)
Code
# Critical columns that must be preserved
critical_cols = [
    c for c in ([
        "SALARY_FROM", 
        "SALARY_TO", 
        "SALARY",
        "ORIGINAL_PAY_PERIOD",
        "ANNUAL_SALARY", 
        "PAY_PERIOD"
        ])
    if c in df.columns
]

threshold = len(df) * 0.5

# Drop columns with >50% missing, except critical salary / pay-period fields
cols_to_drop_missing = [
    c for c in df.columns
    if c not in critical_cols and df[c].isna().sum() > threshold
]

df.drop(columns=cols_to_drop_missing, inplace=True)

# Fill numeric values with median (EXCLUDING salary-related fields)
num_cols = df.select_dtypes(include=["float", "int"]).columns
num_exclude = set(["SALARY_FROM", "SALARY_TO", "SALARY", "ANNUAL_SALARY"])
num_fill = [c for c in num_cols if c not in num_exclude]

df[num_fill] = df[num_fill].fillna(df[num_fill].median())

# Leave salary-related fields as NaN when missing

# Fill categorical values with "Unknown"
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("Unknown")

print("Shape after missing-value handling:", df.shape)
# df.info()
Shape after missing-value handling: (72498, 114)
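
A short follow-up check (a sketch that relies on cols_to_drop_missing from the cell above) confirms how many columns were dropped for sparsity and how sparse the preserved salary field remains:

Code
# Columns dropped for >50% missingness, and remaining sparsity of the preserved salary field
print(f"Columns dropped for sparsity: {len(cols_to_drop_missing)}")
print(f"ANNUAL_SALARY still missing: {df['ANNUAL_SALARY'].isna().mean():.1%}")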

Standardizing the POSTED Date and Creating a Month Column

Datetime consistency is essential for trend analysis and for joining with FRED macroeconomic indicators.

Code
# df["POSTED"] = pd.to_datetime(df["POSTED"], errors="coerce")
# df["month"] = df["POSTED"].dt.to_period("M").astype(str)
df = df.assign(
    POSTED = pd.to_datetime(df["POSTED"], errors="coerce"),
    month = lambda x: x["POSTED"].dt.to_period("M").astype(str)
)
display(df[["POSTED", "month"]].head())
POSTED month
0 2024-06-02 2024-06
1 2024-06-02 2024-06
2 2024-06-02 2024-06
3 2024-06-02 2024-06
4 2024-06-02 2024-06
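
As a minimal sketch of the kind of trend summary the month column enables (and the key on which FRED series can later be joined), we can count postings per month:

Code
# Monthly posting volume, keyed on the standardized 'month' column
monthly_counts = df.groupby("month").size().sort_index()
print(monthly_counts.head())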

Removing Duplicate Job Postings

Duplicate postings can overcount job demand and distort salary statistics. We remove duplicates based on:

  • job title
  • company
  • location
  • original posting date
Code
subset_cols = [
    c for c in [
        "TITLE_NAME", 
        "COMPANY_NAME", 
        "LOCATION", 
        "POSTED"]
        if c in df.columns]

if subset_cols:
    before = df.shape[0]
    df = df.drop_duplicates(subset=subset_cols, keep="first")
    after = df.shape[0]
    print(f"Removed {before - after} duplicate postings.")
else:
    print("Duplicate removal skipped: key columns missing.")

print(df.shape)
Removed 3300 duplicate postings.
(69198, 115)
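
For context, a one-line sketch reports the share of postings removed; it assumes the key columns were present so that before and after were defined in the cell above, as in the run shown.

Code
# Share of postings removed as duplicates (before/after come from the dedup cell above)
print(f"Duplicate share removed: {(before - after) / before:.2%}")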

Exporting the Final Cleaned Dataset

This cleaned dataset will be used consistently across:

  • Exploratory Data Analysis (EDA),
  • Skill Gap Analysis,
  • ML Methods,
  • NLP Methods.

Exporting a single cleaned file keeps every team member working from the same data and prevents inconsistent results across sections.

Code
df.to_csv("data/lightcast_cleaned.csv", index=False)
# df.head()
print(df.shape)
(69198, 115)
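
A simple round-trip check (a sketch, not part of the required pipeline) re-reads the exported file and confirms its dimensions; note that CSV round-trips lose dtypes, so downstream sections should re-parse POSTED with pd.to_datetime if they need true datetimes.

Code
# Round-trip check: re-read the exported CSV and confirm its dimensions.
# POSTED comes back as plain strings after a CSV round-trip and must be re-parsed if needed.
check = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
print(check.shape)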

Summary

This cleaned dataset establishes the foundation for all downstream analysis.

By explicitly cleaning and preserving salary fields, selectively dropping only non-essential sparse columns, standardizing dates, and removing duplicates, we ensure:

  • accuracy and reproducibility,
  • consistent salary-based insights,
  • reduced rendering load for GitHub Pages,
  • smooth integration with the ML and NLP pipelines.

This completes the Data Cleaning & Pre-Processing phase.

© 2025 · AD 688 Web Analytics · Boston University

Team 5
