Modeling the Impact of Industry Gender Composition on Salaries
Overview
This analysis examines how industry-level gender dominance relates to advertised salaries across job postings. To incorporate this factor into our models, we assigned each NAICS 2-digit industry code to one of three gender representation categories based on U.S. labor statistics:
Male-dominated
Mixed
Female-dominated
These categories reflect broad workforce participation trends across major sectors. Male-dominated sectors typically include labor-intensive industries such as construction and transportation, while female-dominated sectors include health care, education, and administrative or professional services. Mixed sectors show more balanced gender representation or significant role-based variation.
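The category assignment can be sketched as a lookup on the 2-digit NAICS code. This is only an illustration: the column name `NAICS2` and the fallback to "Mixed" for unlisted codes are assumptions, and only the sectors named above are filled in. The full assignment used in the analysis is not reproduced here.

```python
import pandas as pd

# Partial, illustrative lookup: only sectors named in the text are listed.
NAICS2_TO_GENDER = {
    23: "Male-dominated",    # Construction
    48: "Male-dominated",    # Transportation and Warehousing
    49: "Male-dominated",    # Transportation and Warehousing
    61: "Female-dominated",  # Educational Services
    62: "Female-dominated",  # Health Care and Social Assistance
}


def assign_gender_dominance(naics2: pd.Series, default: str = "Mixed") -> pd.Series:
    """Map 2-digit NAICS codes to a gender dominance label.

    Codes missing from the lookup fall back to `default` (an assumption
    made for this sketch, not the analysis's actual rule).
    """
    return naics2.map(NAICS2_TO_GENDER).fillna(default)


# Hypothetical postings with a hypothetical NAICS2 column.
postings = pd.DataFrame({"NAICS2": [23, 62, 52, 48]})
postings["GENDER_DOMINANCE"] = assign_gender_dominance(postings["NAICS2"])
print(postings)
```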
Integration into Modeling
These three groups were then encoded numerically as:
0 → Male-dominated
1 → Mixed
2 → Female-dominated
This encoding allows the gender composition of industries to be used directly as a structured feature inside regression and machine learning models.
By examining salary differences across these groups—while controlling for experience requirements, employment type, remote status, staffing-company involvement, and geographic factors—we evaluate whether certain industry gender compositions correspond to systematically higher or lower advertised wages within the professional and technical job postings present in our dataset.
Code
import pandas as pd
import numpy as np

# Load the job postings that already carry a GENDER_DOMINANCE label.
lightcast_jp = pd.read_csv("data/lightcast_gender.csv", low_memory=False)

# Ordinal encoding of the three gender dominance categories.
gender_encoding = {
    "Male-dominated": 0,
    "Mixed": 1,
    "Female-dominated": 2,
}
lightcast_jp["GENDER_DOMINANCE_CODE"] = lightcast_jp["GENDER_DOMINANCE"].map(gender_encoding)
Linear Regression
The goal of this model is to answer the question:
How does the gender dominance of an industry affect the offered salary in job postings?
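After preparing the dataset and encoding categorical features, we trained a linear regression model using an 80/20 train–test split. The features were the gender dominance code plus the controls described above: experience requirements, employment type, remote status, staffing-company involvement, and state.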
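The training step might look roughly like the sketch below. It assumes scikit-learn, which the text does not name explicitly, and it runs on synthetic data with invented column names and arbitrary effect sizes, so the printed coefficients say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the encoded posting data (not the real dataset).
X = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": rng.integers(0, 3, size=n),
    "MIN_YEARS_EXPERIENCE": rng.integers(0, 11, size=n),  # hypothetical column
    "IS_REMOTE": rng.integers(0, 2, size=n),              # hypothetical column
})
# Arbitrary effect sizes chosen only so the example has some signal.
y = (
    90_000
    - 5_000 * X["GENDER_DOMINANCE_CODE"]
    + 4_000 * X["MIN_YEARS_EXPERIENCE"]
    + rng.normal(0, 10_000, size=n)
)

# 80/20 train-test split; the seed is fixed only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# One coefficient per feature, in dollars per unit change of that feature.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(0))
```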
The coefficient for GENDER_DOMINANCE_CODE represents the expected change in salary when moving from one dominance group to the next (male → mixed → female), holding the other features constant.
Moving from a male-dominated to a mixed industry is associated with an average salary decrease of about $7K.
Moving from mixed to female-dominated shows an additional decrease of the same magnitude. This is a property of the encoding: a single coefficient on the 0/1/2 code forces the two steps to be equal.
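The equal-step behavior follows from the model form rather than from the data. The sketch below uses synthetic salaries whose true group gaps are deliberately unequal and fits ordinary least squares with NumPy; the control variable and all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
code = rng.integers(0, 3, size=300)         # 0, 1, 2 ordinal code
experience = rng.integers(0, 11, size=300)  # hypothetical control

# Synthetic salaries with deliberately unequal gaps between groups.
group_effect = np.array([0.0, -12_000.0, -14_000.0])[code]
salary = 80_000 + group_effect + 3_000 * experience + rng.normal(0, 5_000, 300)

# Ordinary least squares with an intercept, the code, and the control.
A = np.column_stack([np.ones_like(code, dtype=float), code, experience])
beta, *_ = np.linalg.lstsq(A, salary, rcond=None)

# Predict at the mean experience for each code value.
mean_exp = experience.mean()
preds = np.array([beta[0] + beta[1] * g + beta[2] * mean_exp for g in (0, 1, 2)])

# Differences between consecutive groups: both equal the code coefficient.
steps = np.diff(preds)
print(steps, beta[1])
```

Even though the simulated gaps differ, the fitted steps are identical, so the linear model's per-step figure should be read as an average slope across the three groups.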
Model Performance Evaluation
To assess the linear regression model, we compute the R² score on the test set. The model explains approximately 24.3% of the variation in advertised salaries. It therefore captures some underlying structure, especially differences driven by experience, industry, and location, but most of the salary variation remains unexplained under a linear framework.
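For reference, R² compares the model's squared error with that of always predicting the mean salary. A minimal sketch of the calculation, with made-up numbers rather than values from the dataset:

```python
import numpy as np


def r2_score_manual(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


# Made-up salaries and predictions for illustration only.
y_true = [60_000, 80_000, 100_000, 120_000]
y_pred = [70_000, 80_000, 100_000, 110_000]
r2 = r2_score_manual(y_true, y_pred)
print(r2)
```

An R² of 0.243 on this scale means the model's squared error is about 76% of what a constant mean prediction would produce.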
Using the average job posting profile in our dataset, we generated predicted salaries for each gender dominance category. The results show a downward trend in salary as industries become more female-dominated, even after controlling for experience, employment type, remote status, staffing firm involvement, and state-level differences. In other words, job postings in male-dominated industries are associated with higher advertised salaries, while female-dominated sectors tend to offer lower salaries for otherwise comparable postings in our dataset.
Code
# Start from the average posting profile in the training data.
base = X_train.mean().copy()

# Predict salary for each gender dominance code, holding other features at their means.
pred = {}
for group_code in [0, 1, 2]:
    base["GENDER_DOMINANCE_CODE"] = group_code
    pred[group_code] = model.predict(base.to_frame().T)[0]

merged = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": list(pred.keys()),
    "PREDICTED_SALARY": list(pred.values()),
})

# Map the numeric codes back to their labels for display.
merged["GENDER_DOMINANCE"] = merged["GENDER_DOMINANCE_CODE"].map(
    {v: k for k, v in gender_encoding.items()}
)
merged = merged[["GENDER_DOMINANCE", "PREDICTED_SALARY"]]
print(merged)
Random Forest
The Random Forest model is used to capture non-linear relationships between job posting characteristics and advertised salary. Unlike linear regression, which assumes a straight-line effect for each feature, Random Forests build many decision trees and combine their predictions. This allows the model to detect more complex interactions across features such as industry, remote work status, experience requirements, and gender dominance.
Model Training
We specify the number of trees, allow trees to grow to full depth, and set a random seed to ensure reproducible results. Averaging over many decision trees makes the model more robust and less prone to overfitting than any single tree.
During training, the forest collectively learns how salary varies with experience level, industry code, employment type, remote type, and other factors.
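The configuration described above could be written as follows. This is a sketch assuming scikit-learn's `RandomForestRegressor`; the tree count of 100, the seed, the column names, and the synthetic data are all placeholders, since the actual values are not stated in the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the encoded posting data (not the real dataset).
X = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": rng.integers(0, 3, size=n),
    "MIN_YEARS_EXPERIENCE": rng.integers(0, 11, size=n),  # hypothetical column
    "IS_REMOTE": rng.integers(0, 2, size=n),              # hypothetical column
})
# A non-linear target: one group differs, the other two are alike.
y = (
    70_000
    + 3_000 * X["MIN_YEARS_EXPERIENCE"]
    - 10_000 * (X["GENDER_DOMINANCE_CODE"] > 0)
    + rng.normal(0, 5_000, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_model = RandomForestRegressor(
    n_estimators=100,  # number of trees (assumed value)
    max_depth=None,    # let each tree grow to full depth
    random_state=42,   # fixed seed for reproducible results
)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print(rf_pred[:5].round(0))
```

With the seed fixed, refitting on the same data yields the same forest and the same predictions.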
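We evaluate the model using two metrics:
R²: Measures how much variance in salary the model explains.
RMSE: Root mean squared error. Measures typical prediction error in dollar terms and weights large errors more heavily than small ones.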
The model achieves an R² of approximately 0.44, indicating moderate predictive power given the complexity and noise typical of job posting salary data.
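RMSE is reported in the same units as salary, which makes it easier to read than R². The sketch below computes it by hand on made-up numbers and contrasts it with the mean absolute error to show how a single large miss pulls RMSE upward.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean absolute error, shown only for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up salaries: three good predictions and one $30,000 miss.
y_true = [60_000, 80_000, 100_000, 120_000]
y_pred = [70_000, 80_000, 100_000, 90_000]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(round(rmse_value, 2), mae_value)
```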
Random Forest provides an estimate of each feature’s importance based on how much it reduces prediction error across the forest of trees. In our results, experience and industry code are the strongest predictors, while gender dominance plays a smaller but measurable role.
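In scikit-learn these scores are exposed through the fitted model's `feature_importances_` attribute, which is impurity-based and normalized to sum to 1. Whether the analysis used this attribute or another method is an assumption here, and the data and column names below are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; experience is given the strongest effect on purpose.
X = pd.DataFrame({
    "MIN_YEARS_EXPERIENCE": rng.integers(0, 11, size=n),  # hypothetical column
    "GENDER_DOMINANCE_CODE": rng.integers(0, 3, size=n),
    "IS_REMOTE": rng.integers(0, 2, size=n),              # hypothetical column
})
y = (
    60_000
    + 5_000 * X["MIN_YEARS_EXPERIENCE"]
    - 3_000 * X["GENDER_DOMINANCE_CODE"]
    + rng.normal(0, 5_000, size=n)
)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its feature name and rank from highest to lowest.
importances = pd.Series(
    rf_model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Importances rank features by their contribution to the forest's splits. They do not indicate the direction of an effect, so they complement rather than replace the group-level predictions.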
The Random Forest model shows substantial salary differences across gender-dominance categories. For the average posting profile, its predictions suggest that male-dominated industries carry a large salary premium, with predicted advertised salaries nearly $56,000 higher than those in mixed or female-dominated industries. In contrast, the predicted salaries for mixed and female-dominated sectors are almost identical.
Code
# Start from the average posting profile in the training data.
base_rf = X_train.mean().copy()

# Predict salary for each gender dominance code with the Random Forest.
pred_rf = {}
for group_code in [0, 1, 2]:
    base_rf["GENDER_DOMINANCE_CODE"] = group_code
    pred_rf[group_code] = rf_model.predict(base_rf.to_frame().T)[0]

rf_merged = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": list(pred_rf.keys()),
    "PREDICTED_SALARY_RF": list(pred_rf.values()),
})

# Map the numeric codes back to their labels for display.
rf_merged["GENDER_DOMINANCE"] = rf_merged["GENDER_DOMINANCE_CODE"].map(
    {v: k for k, v in gender_encoding.items()}
)
rf_merged = rf_merged[["GENDER_DOMINANCE", "PREDICTED_SALARY_RF"]]
print(rf_merged)