Distribution of Job Postings and Median Salaries Across Gender-Dominance Categories
Objective
This section aims to clean and prepare the Lightcast job postings dataset, focusing on industries with clearly defined classifications. It then categorizes industries by gender dominance and analyzes job postings, median salaries, and distributions across male-, female-, and mixed-dominated sectors. The visualizations provide insights into how gender dominance correlates with job availability and compensation patterns.
Load the Cleaned Dataset
Code
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
import os

# Load the cleaned Lightcast job postings
df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
# df.head()
Industry Data Preparation
We clean the dataset by removing rows where the industry classification is either “Unknown” or “Unclassified Industry” so that the analysis focuses on clearly defined sectors. After dropping the “Unknown” rows and before dropping the “Unclassified Industry” rows, we calculate the percentage that the latter represent in the remaining data to document their impact on data quality.
Code
# Clean the dataset to analyze available industries
# Remove rows where NAICS2_NAME is "Unknown"
df = df[df["NAICS2_NAME"] != "Unknown"]

# Measure the share of "Unclassified Industry" rows before dropping them
# Count rows before dropping
before_count = len(df)

# Count how many are "Unclassified Industry"
unclassified_count = (df["NAICS2_NAME"] == "Unclassified Industry").sum()

# Percentage of Unclassified Industry rows
percentage = unclassified_count / before_count * 100
print(f"Percentage of Unclassified Industry rows: {percentage:.2f}%")

# Drop those rows
df = df[df["NAICS2_NAME"] != "Unclassified Industry"]
after_count = len(df)
removed_count = before_count - after_count

print("Rows removed:", removed_count)
print("Rows before:", before_count)
print("Rows after:", after_count)
Percentage of Unclassified Industry rows: 13.31%
Rows removed: 9205
Rows before: 69181
Rows after: 59976
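The count-then-drop pattern above can be wrapped in a small helper so that the share removed is always reported alongside the filtered data. The following is a minimal sketch, not part of the original analysis: the function name `drop_label_with_report` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def drop_label_with_report(df: pd.DataFrame, column: str, label: str):
    """Return (filtered frame, percentage of rows that carried `label`)."""
    before = len(df)
    mask = df[column] == label  # missing values compare as False and are kept
    pct = float(mask.sum()) / before * 100 if before else 0.0
    return df[~mask].copy(), pct


# Toy rows for illustration only, not taken from the Lightcast file
toy = pd.DataFrame(
    {
        "NAICS2_NAME": [
            "Utilities",
            "Unclassified Industry",
            "Construction",
            "Unclassified Industry",
        ]
    }
)

cleaned, pct = drop_label_with_report(toy, "NAICS2_NAME", "Unclassified Industry")
print(f"{pct:.2f}% removed, {len(cleaned)} rows kept")
```

Returning the percentage together with the filtered frame keeps the data-quality figure tied to the exact filter that produced it.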
Exploring Unique Industry Categories (NAICS2_NAME)
We retrieve the list of unique industry categories by extracting all distinct values from the NAICS2_NAME column after removing missing entries.
Code
# Get the list of unique NAICS2_NAME values
naics2_name = df["NAICS2_NAME"].dropna().unique().tolist()
# print(naics2_name)
Classifying Industries by Gender Dominance
We classify each NAICS2_NAME industry into one of three gender-dominance categories based on U.S. labor statistics:
Male-dominated industries: Agriculture, Forestry, Fishing and Hunting; Mining, Quarrying, and Oil and Gas Extraction; Utilities; Construction; Manufacturing; Wholesale Trade; Transportation and Warehousing; Professional, Scientific, and Technical Services; Information; Management of Companies and Enterprises.
Female-dominated industries: Finance and Insurance; Educational Services; Health Care and Social Assistance; Accommodation and Food Services; Other Services (except Public Administration).
Mixed industries: Retail Trade; Administrative and Support and Waste Management and Remediation Services; Real Estate and Rental and Leasing; Arts, Entertainment, and Recreation; Public Administration.
Code
gender_dom_map = {
    # Male-dominated Industries
    "Agriculture, Forestry, Fishing and Hunting": "Male-dominated",
    "Mining, Quarrying, and Oil and Gas Extraction": "Male-dominated",
    "Utilities": "Male-dominated",
    "Construction": "Male-dominated",
    "Manufacturing": "Male-dominated",
    "Wholesale Trade": "Male-dominated",
    "Transportation and Warehousing": "Male-dominated",
    "Professional, Scientific, and Technical Services": "Male-dominated",
    "Information": "Male-dominated",
    "Management of Companies and Enterprises": "Male-dominated",
    # Female-dominated Industries
    "Finance and Insurance": "Female-dominated",
    "Educational Services": "Female-dominated",
    "Health Care and Social Assistance": "Female-dominated",
    "Accommodation and Food Services": "Female-dominated",
    "Other Services (except Public Administration)": "Female-dominated",
    # Mixed Industries
    "Retail Trade": "Mixed",
    "Administrative and Support and Waste Management and Remediation Services": "Mixed",
    "Real Estate and Rental and Leasing": "Mixed",
    "Arts, Entertainment, and Recreation": "Mixed",
    "Public Administration": "Mixed",
}

# Create a GENDER_DOMINANCE column in the dataset
df["GENDER_DOMINANCE"] = df["NAICS2_NAME"].map(gender_dom_map)

gender_dom_summary = (
    df["GENDER_DOMINANCE"]
    .value_counts()
    .reset_index()
)

print("Total Job Posting Counts in Dataset:", len(df))
print("Total Job Posting Counts by Gender Dominance:")
display(gender_dom_summary)
Total Job Posting Counts in Dataset: 59976
Total Job Posting Counts by Gender Dominance:
   GENDER_DOMINANCE   count
0  Male-dominated     35159
1  Female-dominated   13013
2  Mixed              11804
Of the 59,976 job postings, 35,159 fall in male-dominated industries, 13,013 in female-dominated industries, and 11,804 in Mixed industries. Most hiring activity in this dataset is therefore concentrated in male-dominated sectors, with smaller but still substantial demand in the female-dominated and Mixed categories.
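To make the distribution easier to compare, the counts reported above can be turned into percentage shares. This is a small arithmetic sketch using only the three counts from the summary table; the variable names are illustrative.

```python
# Posting counts per category, copied from the summary table above
posting_counts = {
    "Male-dominated": 35159,
    "Female-dominated": 13013,
    "Mixed": 11804,
}

total_postings = sum(posting_counts.values())

# Share of all postings, in percent
posting_shares = {
    category: count / total_postings * 100
    for category, count in posting_counts.items()
}

for category, share in posting_shares.items():
    print(f"{category}: {share:.1f}% of {total_postings} postings")
```

Male-dominated industries account for roughly 59% of postings, with the other two categories at about a fifth each.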
Median Salary by Industry with Job Count
Code
# Drop rows with missing salary values
df_salary = df.dropna(subset=["SALARY"])

# Save the new DataFrame as lightcast_gender.csv in the local data folder
df_salary.to_csv("./data/lightcast_gender.csv", index=False)

# Compute job counts
industry_counts = (
    df_salary.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .rename("JOB_COUNT")
)

# Compute median salary per industry and attach the job counts
industry_median_salary = (
    df_salary.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .rename("MEDIAN_SALARY")
    .to_frame()
    .join(industry_counts)
    .sort_values("MEDIAN_SALARY", ascending=False)
    .reset_index()
)

display(industry_median_salary)
    NAICS2_NAME                                        MEDIAN_SALARY  JOB_COUNT
 0  Accommodation and Food Services                         144560.0        212
 1  Information                                             132550.0       2166
 2  Professional, Scientific, and Technical Services        130000.0       8530
 3  Retail Trade                                            120000.0        755
 4  Construction                                            118097.5        284
 5  Manufacturing                                           117450.0       1552
 6  Finance and Insurance                                   115239.0       3567
 7  Utilities                                               114206.5        306
 8  Wholesale Trade                                         101597.0        851
 9  Management of Companies and Enterprises                 101400.0         39
10  Mining, Quarrying, and Oil and Gas Extraction           100600.0         36
11  Administrative and Support and Waste Managemen...        98800.0       3584
12  Transportation and Warehousing                           97739.0        203
13  Health Care and Social Assistance                        95285.0       1280
14  Agriculture, Forestry, Fishing and Hunting               87500.0         28
15  Other Services (except Public Administration)            85000.0        332
16  Real Estate and Rental and Leasing                       85000.0        416
17  Arts, Entertainment, and Recreation                      82950.0         85
18  Public Administration                                    79092.0        691
19  Educational Services                                     75919.0        950
Code
# Group by gender dominance
gender_dom_summary = (
    df_salary.groupby("GENDER_DOMINANCE")
    .agg(
        JOB_COUNT=("SALARY", "count"),
        MEDIAN_SALARY=("SALARY", "median"),
    )
    .reset_index()
    .sort_values("MEDIAN_SALARY", ascending=False)
)

display(gender_dom_summary)
   GENDER_DOMINANCE   JOB_COUNT  MEDIAN_SALARY
1  Male-dominated         13995       125900.0
0  Female-dominated        6341       101500.0
2  Mixed                   5531        95000.0
Chart 1: Median Salary by Gender Dominance Category
Code
# Build a color palette for gender dominance
colors = {
    "Male-dominated": "rgba(137, 176, 255, 0.8)",
    "Female-dominated": "rgba(255, 179, 207, 0.8)",
    "Mixed": "rgba(199, 168, 255, 0.8)",
}

fig_gd = go.Figure()

fig_gd.add_trace(
    go.Bar(
        x=gender_dom_summary["GENDER_DOMINANCE"],
        y=gender_dom_summary["MEDIAN_SALARY"],
        text=[f"${v:,.0f}" for v in gender_dom_summary["MEDIAN_SALARY"]],
        textposition="outside",
        marker_color=[colors[cat] for cat in gender_dom_summary["GENDER_DOMINANCE"]],
    )
)

fig_gd.update_layout(
    title="Median Salary by Gender Dominance Category",
    yaxis_title="Median Salary",
    xaxis_title="Gender Dominance",
    # width=700,
    # height=500
)

fig_gd.show()
Among postings that list a salary, male-dominated industries lead in both number and pay, with 13,995 postings and a median salary of $125,900. Female-dominated sectors have 6,341 postings at a median of $101,500, while Mixed industries have the fewest, 5,531, with a median of $95,000. In this dataset, male-dominated fields therefore offer both more postings and higher median pay than the other two categories.
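Because the medians are computed only on postings with a non-missing SALARY, it helps to know how much of each category that subset covers. The sketch below divides the salary-bearing counts from the table above by the full posting counts reported earlier; the variable names are illustrative.

```python
# All postings per category (first summary table)
all_postings = {
    "Male-dominated": 35159,
    "Female-dominated": 13013,
    "Mixed": 11804,
}

# Postings with a non-missing SALARY (second summary table)
with_salary = {
    "Male-dominated": 13995,
    "Female-dominated": 6341,
    "Mixed": 5531,
}

# Percentage of postings in each category that report a salary
salary_coverage = {
    category: with_salary[category] / all_postings[category] * 100
    for category in all_postings
}

for category, pct in salary_coverage.items():
    print(f"{category}: {pct:.1f}% of postings list a salary")
```

Coverage is below half in every category, and lowest (about 40%) for male-dominated industries, so the medians describe the postings that disclose pay rather than all postings.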
Chart 2: Salary Distribution by Gender Dominance
Code
fig_box = go.Figure()

# Add one box per gender-dominance category
for category in df_salary["GENDER_DOMINANCE"].unique():
    fig_box.add_trace(
        go.Box(
            y=df_salary[df_salary["GENDER_DOMINANCE"] == category]["SALARY"],
            name=category,
            marker_color=colors.get(category, "lightgray"),
            boxmean=True,
        )
    )

fig_box.update_layout(
    title="Salary Distribution by Gender Dominance",
    yaxis_title="Salary",
    xaxis_title="Gender Dominance",
    # width=800,
    # height=550
)

fig_box.show()
The boxplot shows that male-dominated industries offer higher salaries, with both a higher median and greater variation at the upper end. Female-dominated industries display a lower median and a tighter distribution, suggesting fewer high-salary offers. Mixed industries have the lowest median of the three ($95,000, against $101,500 for female-dominated industries) but still include some high-earning positions, though not as many as male-dominated fields.
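The spread read off a boxplot can also be checked numerically through quartiles and the interquartile range (IQR). The sketch below uses made-up salaries (in thousands) for two hypothetical groups, not the Lightcast data, and only shows the pandas pattern that could be applied to `df_salary`.

```python
import pandas as pd

# Made-up salaries in thousands, for two hypothetical groups
toy = pd.DataFrame(
    {
        "GENDER_DOMINANCE": ["Group A"] * 5 + ["Group B"] * 5,
        "SALARY": [50, 60, 70, 80, 90, 68, 69, 70, 71, 72],
    }
)

# Quartiles per group: one row per group, one column per quantile
quartiles = (
    toy.groupby("GENDER_DOMINANCE")["SALARY"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["Q1", "MEDIAN", "Q3"]

# Interquartile range: width of the box in a boxplot
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]

print(quartiles)
```

Both toy groups share the same median, but Group A has a much larger IQR, which is the kind of difference a "tighter distribution" refers to.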
Chart 3: Job Count vs. Median Salary by Gender Dominance Category
Code
bubble_fig = px.scatter(
    gender_dom_summary,
    x="JOB_COUNT",
    y="MEDIAN_SALARY",
    size="JOB_COUNT",
    color="GENDER_DOMINANCE",
    color_discrete_map=colors,
    hover_name="GENDER_DOMINANCE",
    size_max=80,
)

bubble_fig.update_layout(
    title="Job Count vs. Median Salary by Gender Dominance Category",
    xaxis_title="Job Count",
    yaxis_title="Median Salary",
    # width=850,
    # height=550
)

bubble_fig.show()
Male-dominated industries lead in both job count and median salary, with the largest number of openings (13,995) and the highest pay ($125,900). Female-dominated industries come second on both measures (6,341 postings, $101,500), and Mixed industries come last (5,531 postings, $95,000). In this dataset, the male-dominated category is associated with more postings and higher pay; with only three aggregate points, this is a descriptive pattern rather than a measured relationship.
Chart 4: Median Salary and Job Count Across Male-Dominated Industries
Code
# Filter for male-dominated industries
male_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Male-dominated"]

# Compute industry-level median salary
male_salary_summary = (
    male_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts, in the same order as the medians
male_job_counts = (
    male_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(male_salary_summary.index)
)

# Pastel blue for bars
pastel_blue = "rgba(137, 176, 255, 0.8)"

fig_male = go.Figure()

fig_male.add_trace(
    go.Bar(
        x=male_salary_summary.index,
        y=male_salary_summary.values,
        text=[f"${v:,.0f}" for v in male_salary_summary.values],
        textposition="outside",
        marker_color=pastel_blue,
        name="Median Salary",
    )
)

# Overlay job counts as a line on a secondary y-axis
fig_male.add_trace(
    go.Scatter(
        x=male_salary_summary.index,
        y=male_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2",
    )
)

fig_male.update_layout(
    title="Median Salary and Job Count Across Male-Dominated Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=700,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
    ),
)

fig_male.show()
The plot shows that male-dominated industries tend to offer relatively high median salaries, with Information ($132,550) and Professional, Scientific, and Technical Services ($130,000) leading the group. Job availability varies widely among these high-paying sectors: Information has a comparatively modest 2,166 postings, while Professional, Scientific, and Technical Services combines high pay with 8,530 postings. Among hands-on fields, Mining ($100,600) and Agriculture ($87,500) sit toward the lower end of the group, whereas Construction ($118,098) ranks third, so physically intensive sectors do not necessarily correspond to higher earnings.
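Since the category-level median pools all postings, it is worth checking how concentrated the male-dominated group is in a few industries. The sketch below computes each industry's share of the group's salary-bearing postings from the JOB_COUNT values in the industry table; the variable names are illustrative.

```python
# Salary-bearing postings per male-dominated industry (from the industry table)
male_counts = {
    "Professional, Scientific, and Technical Services": 8530,
    "Information": 2166,
    "Manufacturing": 1552,
    "Wholesale Trade": 851,
    "Utilities": 306,
    "Construction": 284,
    "Transportation and Warehousing": 203,
    "Management of Companies and Enterprises": 39,
    "Mining, Quarrying, and Oil and Gas Extraction": 36,
    "Agriculture, Forestry, Fishing and Hunting": 28,
}

male_total = sum(male_counts.values())

# Share of the group's postings contributed by each industry, in percent
male_shares = {name: n / male_total * 100 for name, n in male_counts.items()}

# Print the three largest contributors
top = sorted(male_shares.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, share in top:
    print(f"{name}: {share:.1f}%")
```

Professional, Scientific, and Technical Services alone supplies roughly 61% of the group's postings, so the group median largely reflects that one industry.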
Chart 5: Median Salary and Job Count Across Female-Dominated Industries
Code
# Filter for female-dominated industries
female_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Female-dominated"]

# Compute industry-level median salary
female_salary_summary = (
    female_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts, in the same order as the medians
female_job_counts = (
    female_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(female_salary_summary.index)
)

# Pastel pink for bars
pastel_pink = "rgba(255, 179, 207, 0.8)"

fig_female = go.Figure()

fig_female.add_trace(
    go.Bar(
        x=female_salary_summary.index,
        y=female_salary_summary.values,
        text=[f"${v:,.0f}" for v in female_salary_summary.values],
        textposition="outside",
        marker_color=pastel_pink,
        name="Median Salary",
    )
)

# Overlay job counts as a line on a secondary y-axis
fig_female.add_trace(
    go.Scatter(
        x=female_salary_summary.index,
        y=female_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2",
    )
)

fig_female.update_layout(
    title="Median Salary and Job Count Across Female-Dominated Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=700,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
    ),
)

fig_female.show()
Female-dominated industries show a wide salary spread, ranging from about $76k to $145k, with Accommodation and Food Services unexpectedly offering the highest median pay ($144,560) despite only 212 postings. Finance and Insurance stands out with strong salaries ($115,239) and by far the largest number of postings (3,567), while Health Care and Social Assistance provides mid-range pay ($95,285) with substantial job availability (1,280). Other Services ($85,000; 332 postings) and Educational Services ($75,919; 950 postings) offer the lowest salaries in the group, reflecting more service-oriented, lower-wage career tracks within this category.
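A median based on few postings, such as the 212 behind Accommodation and Food Services, is more sensitive to a handful of unusual listings. One simple safeguard is to flag industries whose posting count falls below a chosen minimum. The sketch below uses the values from the industry table; the cutoff of 300 is an arbitrary assumption for illustration, not a threshold used in the analysis.

```python
# (median salary, postings with salary) per female-dominated industry,
# copied from the industry table
female_industries = {
    "Accommodation and Food Services": (144560.0, 212),
    "Finance and Insurance": (115239.0, 3567),
    "Health Care and Social Assistance": (95285.0, 1280),
    "Other Services (except Public Administration)": (85000.0, 332),
    "Educational Services": (75919.0, 950),
}

# Hypothetical cutoff: an assumption chosen only for this illustration
MIN_POSTINGS = 300

# Industries whose median rests on fewer postings than the cutoff
small_sample = [
    name
    for name, (_median, count) in female_industries.items()
    if count < MIN_POSTINGS
]

for name in small_sample:
    median, count = female_industries[name]
    print(f"Interpret with caution: {name} (median ${median:,.0f}, n={count})")
```

With this cutoff only Accommodation and Food Services is flagged; a different cutoff would flag a different set, so the value should be chosen to suit the analysis.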
Chart 6: Median Salary and Job Count Across Mixed Industries
Code
# Filter for Mixed industries
balanced_df = df_salary[df_salary["GENDER_DOMINANCE"] == "Mixed"]

# Compute industry-level median salary
balanced_salary_summary = (
    balanced_df.groupby("NAICS2_NAME")["SALARY"]
    .median()
    .sort_values(ascending=False)
)

# Compute industry-level job counts, in the same order as the medians
balanced_job_counts = (
    balanced_df.groupby("NAICS2_NAME")["SALARY"]
    .count()
    .reindex(balanced_salary_summary.index)
)

# Pastel purple for bars
pastel_purple = "rgba(199, 168, 255, 0.8)"

fig_balanced = go.Figure()

fig_balanced.add_trace(
    go.Bar(
        x=balanced_salary_summary.index,
        y=balanced_salary_summary.values,
        text=[f"${v:,.0f}" for v in balanced_salary_summary.values],
        textposition="outside",
        marker_color=pastel_purple,
        name="Median Salary",
    )
)

# Overlay job counts as a line on a secondary y-axis
fig_balanced.add_trace(
    go.Scatter(
        x=balanced_salary_summary.index,
        y=balanced_job_counts.values,
        mode="lines+markers",
        name="Job Count",
        yaxis="y2",
    )
)

fig_balanced.update_layout(
    title="Median Salary and Job Count Across Mixed Industries",
    xaxis_title="Industry",
    yaxis_title="Median Salary",
    yaxis2=dict(
        title="Job Count",
        overlaying="y",
        side="right",
    ),
    xaxis_tickangle=40,
    # width=1100,
    height=800,
    margin=dict(l=50, r=70, t=80, b=150),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
    ),
)

fig_balanced.show()
Mixed industries show a narrower salary range, from about $79k to $120k. Retail Trade offers the highest median pay ($120,000) despite moderate job availability (755), while Administrative and Support and Waste Management and Remediation Services has the largest job count (3,584) but a lower median ($98,800). Real Estate, Public Administration, and Arts/Entertainment have lower medians (about $79k to $85k) with smaller job counts, reflecting niche but steady career paths. Overall, these industries appear somewhat more even in pay and job availability than the male-dominated sectors.
Conclusion
The analysis shows that male-dominated industries have the highest number of job postings and the highest median salary. Female-dominated industries offer fewer postings and lower median pay, with exceptions in high-paying sectors such as Finance and Insurance and Accommodation and Food Services. Mixed industries have the fewest postings and the lowest median salary of the three categories, although their pay varies less from one industry to the next. Overall, gender dominance in an industry is associated with both job availability and compensation patterns in this dataset, highlighting structural differences across sectors.