Gender Disparities in Labor Force Analysis (2024)

Gender Dominance Analysis

Distribution of Job Postings and Median Salaries Across Gender-Dominance Categories

Objective

This section cleans and prepares the Lightcast job postings dataset, keeping only postings with a clearly defined industry classification. It then categorizes industries by gender dominance and compares job posting counts, median salaries, and salary distributions across male-dominated, female-dominated, and Mixed sectors. The visualizations show how gender dominance relates to job availability and compensation patterns.

Load the Cleaned Dataset

Code
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
import os

# Load the cleaned Lightcast job postings
df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
#df.head()

Industry Data Preparation

We clean the dataset by removing rows where the industry classification is either “Unknown” or “Unclassified Industry” so that the analysis focuses on clearly defined sectors. Rows labeled “Unknown” are dropped first; before dropping the “Unclassified Industry” rows, we calculate their share of the remaining rows to document their impact on data quality.

Code
# Clean the dataset to analyze available industries
# Remove rows where NAICS2_NAME is "Unknown"
df = df[df["NAICS2_NAME"] != "Unknown"]

# Compute the share of "Unclassified Industry" rows before dropping them
# Count rows before dropping
before_count = len(df)

# Count how many are "Unclassified Industry"
unclassified_count = (df["NAICS2_NAME"] == "Unclassified Industry").sum()

# Percentage of Unclassified Industry rows
percentage = unclassified_count / before_count * 100
print(f"Percentage of Unclassified Industry rows: {percentage:.2f}%")
# Drop those rows
df = df[df["NAICS2_NAME"] != "Unclassified Industry"]

after_count = len(df)
removed_count = before_count - after_count
print("Rows removed:", removed_count)
print("Rows before:", before_count)
print("Rows after:", after_count)
Percentage of Unclassified Industry rows: 13.31%
Rows removed: 9205
Rows before: 69181
Rows after: 59976
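
The cleaning step above reports only the "Unclassified Industry" share, measured after the "Unknown" rows are already gone. A minimal sketch of how both excluded labels could be reported against the full dataset in one pass follows. The toy DataFrame and the names `excluded_labels`, `label_share`, and `toy_clean` are hypothetical and used only for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the real postings (values are hypothetical)
toy = pd.DataFrame({
    "NAICS2_NAME": [
        "Construction", "Unknown", "Unclassified Industry",
        "Utilities", "Unclassified Industry", "Information",
        "Retail Trade", "Manufacturing",
    ]
})

excluded_labels = ["Unknown", "Unclassified Industry"]

# Share of each excluded label, measured against the full, unfiltered frame
label_share = (
    toy["NAICS2_NAME"]
    .value_counts(normalize=True)
    .reindex(excluded_labels, fill_value=0.0)
    * 100
)
print(label_share.round(2))

# Drop both labels in one step
toy_clean = toy[~toy["NAICS2_NAME"].isin(excluded_labels)]
print("Rows before:", len(toy), "| Rows after:", len(toy_clean))
```

Measuring both shares against the same denominator makes the two percentages directly comparable, which the sequential approach does not.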

Exploring Unique Industry Categories (NAICS2_NAME)

We retrieve the list of unique industry categories by extracting all distinct values from the NAICS2_NAME column after removing missing entries.

Code
# Get the list of unique NAICS2_NAME values
naics2_name = df["NAICS2_NAME"].dropna().unique().tolist()
# print(naics2_name)

Classifying Industries by Gender Dominance

We classify each NAICS2_NAME industry into one of three gender-dominance categories based on U.S. labor statistics:

  • Male-dominated industries: Agriculture, Forestry, Fishing and Hunting; Mining, Quarrying, and Oil and Gas Extraction; Utilities; Construction; Manufacturing; Wholesale Trade; Transportation and Warehousing; Professional, Scientific, and Technical Services; Information; Management of Companies and Enterprises.

  • Female-dominated industries: Finance and Insurance; Educational Services; Health Care and Social Assistance; Accommodation and Food Services; Other Services (except Public Administration).

  • Mixed industries: Retail Trade; Administrative and Support and Waste Management and Remediation Services; Real Estate and Rental and Leasing; Arts, Entertainment, and Recreation; Public Administration.

Code
gender_dom_map = {
    # Male-dominated Industries
    "Agriculture, Forestry, Fishing and Hunting": "Male-dominated",
    "Mining, Quarrying, and Oil and Gas Extraction": "Male-dominated",
    "Utilities": "Male-dominated",
    "Construction": "Male-dominated",
    "Manufacturing": "Male-dominated",
    "Wholesale Trade": "Male-dominated",
    "Transportation and Warehousing": "Male-dominated",
    "Professional, Scientific, and Technical Services": "Male-dominated",
    "Information": "Male-dominated",
    "Management of Companies and Enterprises": "Male-dominated",

    # Female-dominated Industries
    "Finance and Insurance": "Female-dominated",
    "Educational Services": "Female-dominated",
    "Health Care and Social Assistance": "Female-dominated",
    "Accommodation and Food Services": "Female-dominated",
    "Other Services (except Public Administration)": "Female-dominated",

    # Mixed Industries
    "Retail Trade": "Mixed",
    "Administrative and Support and Waste Management and Remediation Services": "Mixed",
    "Real Estate and Rental and Leasing": "Mixed",
    "Arts, Entertainment, and Recreation": "Mixed",
    "Public Administration": "Mixed"
}

# Create a GENDER_DOMINANCE column in the dataset
df["GENDER_DOMINANCE"] = df["NAICS2_NAME"].map(gender_dom_map)
gender_dom_summary = (
    df["GENDER_DOMINANCE"]
    .value_counts()
    .reset_index()
)
print("Total Job Posting Counts in Dataset:", len(df))
print("Total Job Posting Counts by Gender Dominance:")
display(gender_dom_summary)
Total Job Posting Counts in Dataset: 59976
Total Job Posting Counts by Gender Dominance:
GENDER_DOMINANCE count
0 Male-dominated 35159
1 Female-dominated 13013
2 Mixed 11804

Of the 59,976 job postings, 35,159 are in male-dominated industries, 13,013 in female-dominated industries, and 11,804 in Mixed industries. Most hiring activity in this dataset is therefore concentrated in male-dominated sectors, with smaller but still substantial demand in the other two categories. Part of this gap follows from the grouping itself: ten industries are classified as male-dominated, against five each for the female-dominated and Mixed categories.
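
The distribution can also be expressed as percentage shares, and the same arithmetic doubles as a check that every posting received a category. The sketch below reuses the counts printed above; the variable names `counts`, `total_postings`, `unmapped_rows`, and `shares` are illustrative assumptions rather than names from the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {"Male-dominated": 35159, "Female-dominated": 13013, "Mixed": 11804}
)
total_postings = 59976  # len(df) after cleaning, from the printed output

# Any NAICS2_NAME label missing from gender_dom_map would map to NaN and be
# skipped by value_counts(), so a non-zero difference signals unmapped rows.
unmapped_rows = total_postings - int(counts.sum())
print("Unmapped rows:", unmapped_rows)

# Percentage share of postings per category
shares = (counts / total_postings * 100).round(2)
print(shares)
```

Here the three counts add up to the dataset total, so no industry label fell outside the mapping.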

Median Salary by Industry with Job Count

Code
# Drop rows with missing salary values
df_salary = df.dropna(subset=["SALARY"])
# Save the new DataFrame as lightcast_gender.csv in the local data folder
df_salary.to_csv("./data/lightcast_gender.csv", index=False)

# Compute job counts
industry_counts = (df_salary.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .rename("JOB_COUNT"))

industry_median_salary = (
    df_salary.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .rename("MEDIAN_SALARY")
    .to_frame()
    .join(industry_counts)
    .sort_values("MEDIAN_SALARY", ascending=False)
    .reset_index()
)
display(industry_median_salary)
NAICS2_NAME MEDIAN_SALARY JOB_COUNT
0 Accommodation and Food Services 144560.0 212
1 Information 132550.0 2166
2 Professional, Scientific, and Technical Services 130000.0 8530
3 Retail Trade 120000.0 755
4 Construction 118097.5 284
5 Manufacturing 117450.0 1552
6 Finance and Insurance 115239.0 3567
7 Utilities 114206.5 306
8 Wholesale Trade 101597.0 851
9 Management of Companies and Enterprises 101400.0 39
10 Mining, Quarrying, and Oil and Gas Extraction 100600.0 36
11 Administrative and Support and Waste Managemen... 98800.0 3584
12 Transportation and Warehousing 97739.0 203
13 Health Care and Social Assistance 95285.0 1280
14 Agriculture, Forestry, Fishing and Hunting 87500.0 28
15 Other Services (except Public Administration) 85000.0 332
16 Real Estate and Rental and Leasing 85000.0 416
17 Arts, Entertainment, and Recreation 82950.0 85
18 Public Administration 79092.0 691
19 Educational Services 75919.0 950
Code
# Group by gender dominance
gender_dom_summary = (
    df_salary.groupby("GENDER_DOMINANCE")
    .agg(
        JOB_COUNT=("SALARY", "count"),
        MEDIAN_SALARY=("SALARY", "median")
    )
    .reset_index()
    .sort_values("MEDIAN_SALARY", ascending=False)
)
display(gender_dom_summary)
GENDER_DOMINANCE JOB_COUNT MEDIAN_SALARY
1 Male-dominated 13995 125900.0
0 Female-dominated 6341 101500.0
2 Mixed 5531 95000.0
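
Dropping rows with a missing SALARY shrinks the sample considerably, from 59,976 postings to 25,867. A quick sketch of the salary coverage per category, using only the counts printed in the two summaries above, is shown here; the column names `ALL_POSTINGS`, `WITH_SALARY`, and `SALARY_COVERAGE_PCT` are hypothetical labels chosen for this illustration.

```python
import pandas as pd

# Posting counts before and after dropping missing salaries (from the outputs above)
coverage = pd.DataFrame(
    {
        "ALL_POSTINGS": [35159, 13013, 11804],
        "WITH_SALARY": [13995, 6341, 5531],
    },
    index=["Male-dominated", "Female-dominated", "Mixed"],
)

# Share of postings in each category that list a salary
coverage["SALARY_COVERAGE_PCT"] = (
    coverage["WITH_SALARY"] / coverage["ALL_POSTINGS"] * 100
).round(1)

# Overall share across all three categories
overall_pct = round(
    coverage["WITH_SALARY"].sum() / coverage["ALL_POSTINGS"].sum() * 100, 1
)

print(coverage)
print("Overall salary coverage (%):", overall_pct)
```

Fewer than half of the postings in any category report a salary, and the male-dominated category has the lowest coverage, so the medians describe only the salaried subset.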

Chart 1: Median Salary by Gender Dominance Category

Code
# Build a color palette for gender dominance
colors = {
    "Male-dominated": "rgba(137, 176, 255, 0.8)",
    "Female-dominated": "rgba(255, 179, 207, 0.8)",
    "Mixed": "rgba(199, 168, 255, 0.8)"
}

fig_gd = go.Figure()
fig_gd.add_trace(
    go.Bar(
        x=gender_dom_summary["GENDER_DOMINANCE"],
        y=gender_dom_summary["MEDIAN_SALARY"],
        text=[
            f"${v:,.0f}" for v in gender_dom_summary["MEDIAN_SALARY"]
            ],
        textposition="outside",
        marker_color=[
            colors[cat] for cat in gender_dom_summary["GENDER_DOMINANCE"]
            ]
    )
)
fig_gd.update_layout(
    title="Median Salary by Gender Dominance Category",
    yaxis_title="Median Salary",
    xaxis_title="Gender Dominance"
    # width=700,
    # height=500
)
fig_gd.show()

Among postings that list a salary, male-dominated industries lead in both number and pay, with 13,995 postings and a median salary of $125,900. Female-dominated sectors have 6,341 postings at a median of $101,500, while Mixed industries have the fewest, 5,531, with a median of $95,000. In this dataset, male-dominated fields are therefore associated with both more salaried postings and higher median pay.
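
To put the differences on a common scale, the medians can be compared against the male-dominated median. This is a small worked calculation on the three figures reported above; the names `medians`, `gaps`, and `pct_of_male` are illustrative assumptions.

```python
# Median salaries by category, taken from the summary table above
medians = {
    "Male-dominated": 125900,
    "Female-dominated": 101500,
    "Mixed": 95000,
}

baseline = medians["Male-dominated"]

# Dollar gap to the male-dominated median
gaps = {cat: baseline - value for cat, value in medians.items()}

# Each median as a percentage of the male-dominated median
pct_of_male = {
    cat: round(value / baseline * 100, 1) for cat, value in medians.items()
}

for cat in medians:
    print(f"{cat}: gap ${gaps[cat]:,} | {pct_of_male[cat]}% of male-dominated median")
```

The female-dominated median is about 81% of the male-dominated median and the Mixed median about 75%.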

Chart 2: Salary Distribution by Gender Dominance

Code
fig_box = go.Figure()
for category in df_salary["GENDER_DOMINANCE"].unique():
    fig_box.add_trace(
        go.Box(
            y=df_salary[
                df_salary["GENDER_DOMINANCE"] == category
                ]["SALARY"],
            name=category,
            marker_color=colors.get(category, "lightgray"),
            boxmean=True
        )
    )

fig_box.update_layout(
    title="Salary Distribution by Gender Dominance",
    yaxis_title="Salary",
    xaxis_title="Gender Dominance"
    # width=800,
    # height=550
)
fig_box.show()

The box plot shows that male-dominated industries offer higher salaries overall, with both a higher median and greater variation at the upper end. Female-dominated industries display a lower median and a tighter distribution, suggesting fewer postings at the top of the pay range. Mixed industries have the lowest median of the three ($95,000) but still include some high-earning positions, though not as many as in male-dominated fields.
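
The spread that a box plot displays can also be read off numerically through quartiles and the interquartile range (IQR). The sketch below illustrates the calculation on a small made-up sample; the salary values are hypothetical and do not come from the Lightcast data, and the column labels `Q1`, `MEDIAN`, `Q3`, and `IQR` are chosen for this example only.

```python
import pandas as pd

# Hypothetical toy salaries, used only to demonstrate the calculation
toy = pd.DataFrame({
    "GENDER_DOMINANCE": ["Male-dominated"] * 5 + ["Female-dominated"] * 5,
    "SALARY": [80000, 100000, 120000, 150000, 200000,
               70000, 90000, 100000, 110000, 130000],
})

# Quartiles per category; unstack() turns the quantile level into columns
quartiles = (
    toy.groupby("GENDER_DOMINANCE")["SALARY"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "MEDIAN", "Q3"]

# A smaller IQR corresponds to a tighter box in the plot
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]
print(quartiles)
```

Applied to `df_salary` in place of the toy frame, the same groupby would give the numbers behind each box.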

Chart 3: Job Count vs. Median Salary by Gender Dominance Category

Code
bubble_fig = px.scatter(
    gender_dom_summary,
    x="JOB_COUNT",
    y="MEDIAN_SALARY",
    size="JOB_COUNT",
    color="GENDER_DOMINANCE",
    color_discrete_map=colors,
    hover_name="GENDER_DOMINANCE",
    size_max=80
)

bubble_fig.update_layout(
    title="Job Count vs. Median Salary by Gender Dominance Category",
    xaxis_title="Job Count",
    yaxis_title="Median Salary"
    # width=850,
    # height=550
)
bubble_fig.show()

Male-dominated industries lead in both job count and median salary, with the largest number of postings (13,995) and the highest median pay ($125,900). Female-dominated industries sit in the middle on both measures (6,341 postings, $101,500), while Mixed industries have the fewest postings (5,531) and the lowest median ($95,000). Overall, the male-dominated category is associated with more postings and higher salaries in this dataset.

Chart 4: Median Salary and Job Count Across Male-Dominated Industries

Code
# Filter for male-dominated industries
male_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Male-dominated"]

# Compute industry-level median salary
male_salary_summary = (
    male_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts
male_job_counts = (male_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(male_salary_summary.index))

# Pastel blue for bars
pastel_blue = "rgba(137, 176, 255, 0.8)"
fig_male = go.Figure()
fig_male.add_trace(
    go.Bar(
        x=male_salary_summary.index,
        y=male_salary_summary.values,
        text=[
            f"${v:,.0f}" for v in male_salary_summary.values
            ],
        textposition="outside",
        marker_color=pastel_blue,
        name="Median Salary"
    )
)

# Add male-dominated industry job counts as a line on the secondary y-axis
fig_male.add_trace(
    go.Scatter(
        x=male_salary_summary.index,
        y=male_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2"
    )
)

fig_male.update_layout(
    title="Median Salary and Job Count Across Male-Dominated Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=700,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1)
)
fig_male.show()

The plot shows that male-dominated industries tend to offer relatively high median salaries, with Information ($132,550) and Professional, Scientific, and Technical Services ($130,000) leading the group. Job availability varies widely among these high-paying sectors: Information has a comparatively modest 2,166 postings, while Professional, Scientific, and Technical Services combines high pay with the largest volume, 8,530 postings. Among hands-on fields, Construction pays well (about $118,100), but Mining ($100,600), Transportation and Warehousing ($97,739), and Agriculture ($87,500) sit at the bottom of the group, which suggests that physically intensive sectors don’t necessarily correspond to higher earnings.
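
Several of these medians rest on very few postings, which makes them less stable than the bars suggest. A minimal sketch for flagging such industries follows, using the male-dominated rows of the industry table above. The cut-off `MIN_POSTINGS = 100` and the `SMALL_SAMPLE` column are hypothetical choices for illustration, not thresholds used in the original analysis.

```python
import pandas as pd

# Male-dominated rows copied from the industry summary table above
male_summary = pd.DataFrame({
    "NAICS2_NAME": [
        "Information",
        "Professional, Scientific, and Technical Services",
        "Construction",
        "Manufacturing",
        "Utilities",
        "Wholesale Trade",
        "Management of Companies and Enterprises",
        "Mining, Quarrying, and Oil and Gas Extraction",
        "Transportation and Warehousing",
        "Agriculture, Forestry, Fishing and Hunting",
    ],
    "MEDIAN_SALARY": [132550.0, 130000.0, 118097.5, 117450.0, 114206.5,
                      101597.0, 101400.0, 100600.0, 97739.0, 87500.0],
    "JOB_COUNT": [2166, 8530, 284, 1552, 306, 851, 39, 36, 203, 28],
})

MIN_POSTINGS = 100  # hypothetical cut-off for a "small" sample

# Flag industries whose median is based on fewer postings than the cut-off
male_summary["SMALL_SAMPLE"] = male_summary["JOB_COUNT"] < MIN_POSTINGS
small = male_summary.loc[male_summary["SMALL_SAMPLE"], "NAICS2_NAME"].tolist()
print(small)
```

With this cut-off, Management of Companies and Enterprises, Mining, and Agriculture are flagged, so their medians are best read with caution.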

Chart 5: Median Salary and Job Count Across Female-Dominated Industries

Code
# Filter for female-dominated industries
female_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Female-dominated"]

# Compute industry-level median salary
female_salary_summary = (
    female_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts
female_job_counts = (female_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(female_salary_summary.index))

# Pastel pink for bars
pastel_pink = "rgba(255, 179, 207, 0.8)"
fig_female = go.Figure()
fig_female.add_trace(
    go.Bar(
        x=female_salary_summary.index,
        y=female_salary_summary.values,
        text=[
            f"${v:,.0f}" for v in female_salary_summary.values
            ],
        textposition="outside",
        marker_color=pastel_pink,
        name="Median Salary"
    )
)

# Add female-dominated industry job counts as a line on the secondary y-axis
fig_female.add_trace(
    go.Scatter(
        x=female_salary_summary.index,
        y=female_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2"
    )
)

fig_female.update_layout(
    title="Median Salary and Job Count Across Female-Dominated Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=700,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1)
)
fig_female.show()

Female-dominated industries show a wider salary spread, with industry medians ranging from about $76k to $145k. Accommodation and Food Services unexpectedly offers the highest median pay ($144,560), though on only 212 postings. Finance and Insurance stands out with strong salaries ($115,239) and by far the most postings (3,567), while Health Care and Social Assistance provides mid-range pay ($95,285) with substantial job availability (1,280). Other Services ($85,000; 332 postings) and Educational Services ($75,919; 950 postings) sit at the lower end, reflecting more service-oriented, lower-wage career tracks within this group.

Chart 6: Median Salary and Job Count Across Mixed Industries

Code
# Filter for Mixed industries
balanced_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Mixed"]

# Compute industry-level median salary
balanced_salary_summary = (
    balanced_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts
balanced_job_counts = (balanced_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(balanced_salary_summary.index))

# Pastel purple for bars
pastel_purple = "rgba(199, 168, 255, 0.8)"
fig_balanced = go.Figure()
fig_balanced.add_trace(
    go.Bar(
        x=balanced_salary_summary.index,
        y=balanced_salary_summary.values,
        text=[f"${v:,.0f}" 
            for v in balanced_salary_summary.values
            ],
        textposition="outside",
        marker_color=pastel_purple,
        name="Median Salary"
    )
)

# Add Mixed industry job counts as a line on the secondary y-axis
fig_balanced.add_trace(
    go.Scatter(
        x=balanced_salary_summary.index,
        y=balanced_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2"
    )
)

fig_balanced.update_layout(
    title="Median Salary and Job Count Across Mixed Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=800,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1)
)
fig_balanced.show()

Mixed industries show a narrower salary range, with industry medians between about $79k and $120k. Retail Trade offers the highest median pay ($120,000) despite moderate job availability (755), while Administrative and Support and Waste Management and Remediation Services has the largest job count (3,584) but a lower median ($98,800). Real Estate, Public Administration, and Arts/Entertainment have lower medians (about $79k to $85k) with smaller job counts, reflecting niche but steady career paths. Overall, these industries appear more even in pay and job availability, without the sharp disparities seen in male-dominated sectors.
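
The "wider" and "narrower" ranges described for the three groups can be checked directly from the industry medians in the summary table. The sketch below computes the gap between the highest and lowest industry median in each category; the names `industry_medians` and `spread` are illustrative assumptions.

```python
# Industry-level median salaries grouped by category, from the table above
industry_medians = {
    "Male-dominated": [132550, 130000, 118097.5, 117450, 114206.5,
                       101597, 101400, 100600, 97739, 87500],
    "Female-dominated": [144560, 115239, 95285, 85000, 75919],
    "Mixed": [120000, 98800, 85000, 82950, 79092],
}

# Range of industry medians within each category (max minus min)
spread = {
    category: max(values) - min(values)
    for category, values in industry_medians.items()
}

for category, value in spread.items():
    print(f"{category}: ${value:,.0f}")
```

The female-dominated group has the widest range of industry medians, and the Mixed group the narrowest, with the male-dominated group only slightly wider than Mixed.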

Conclusion

The analysis shows that male-dominated industries have the most job postings and the highest median salary. Female-dominated industries have fewer postings and lower median pay, with exceptions in high-paying sectors such as Finance and Insurance and Accommodation and Food Services. Mixed industries have the fewest postings and the lowest median salary of the three categories. Overall, in this dataset the gender-dominance category of an industry is associated with both job availability and compensation, highlighting structural differences across sectors.

© 2025 · AD 688 Web Analytics · Boston University

Team 5