Machine Learning Models

Modeling the Impact of Industry Gender Composition on Salaries

Overview

This analysis examines how industry-level gender dominance relates to advertised salaries across job postings. To incorporate this factor into our models, we assigned each NAICS 2-digit industry code to one of three gender representation categories based on U.S. labor statistics:

Male-dominated
Mixed
Female-dominated

These categories reflect broad workforce participation trends across major sectors. Male-dominated sectors typically include labor-intensive industries such as construction and transportation, while female-dominated sectors include health care, education, and administrative or professional services. Mixed sectors show more balanced gender representation or significant role-based variation.

Integration into Modeling

These three groups were then encoded numerically as:

0 → Male-dominated
1 → Mixed
2 → Female-dominated

This encoding allows the gender composition of industries to be used directly as a structured feature inside regression and machine learning models.

By examining salary differences across these groups—while controlling for experience requirements, employment type, remote status, staffing-company involvement, and geographic factors—we evaluate whether certain industry gender compositions correspond to systematically higher or lower advertised wages within the professional and technical job postings present in our dataset.

Code

import pandas as pd
import numpy as np

lightcast_jp = pd.read_csv(
    "data/lightcast_gender.csv",
    low_memory=False
)

gender_encoding = {
    "Male-dominated": 0,
    "Mixed": 1,
    "Female-dominated": 2,
}
lightcast_jp["GENDER_DOMINANCE_CODE"] = (
    lightcast_jp["GENDER_DOMINANCE"].map(gender_encoding)
)
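
Note that Series.map returns NaN for any label that is missing from the dictionary, so a quick check after encoding can catch misspelled or unexpected categories. The following is a minimal sketch on a toy frame; the `toy` data and the "Unknown" label are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

gender_encoding = {"Male-dominated": 0, "Mixed": 1, "Female-dominated": 2}

# Toy data; "Unknown" is a hypothetical label that is absent from the mapping.
toy = pd.DataFrame(
    {"GENDER_DOMINANCE": ["Male-dominated", "Mixed", "Unknown", "Female-dominated"]}
)
toy["GENDER_DOMINANCE_CODE"] = toy["GENDER_DOMINANCE"].map(gender_encoding)

# Labels that did not receive a code show up as NaN after .map().
unmapped = toy.loc[toy["GENDER_DOMINANCE_CODE"].isna(), "GENDER_DOMINANCE"].unique()
print("Unmapped labels:", list(unmapped))
```

Running the same check on the real column before modeling would show whether any postings were left without a code.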

Linear Regression

The goal of this model is to answer the question:

How does the gender dominance of an industry affect the offered salary in job postings?

We trained a linear regression model using the following features:

Gender dominance
Years of Experience
Employment type
Remote type
Internship indicator
Staffing company indicator
State
NAICS code

Code

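# Columns used for modeling: the target (SALARY) plus the predictors.
model_columns = [
    "SALARY", "GENDER_DOMINANCE_CODE", "MIN_YEARS_EXPERIENCE",
    "EMPLOYMENT_TYPE", "REMOTE_TYPE", "IS_INTERNSHIP",
    "COMPANY_IS_STAFFING", "STATE_NAME", "NAICS_2022_2",
]

# Copy so that later transformations do not modify lightcast_jp.
df_model = lightcast_jp[model_columns].copy()
# df_model.isna().sum()

# One-hot encode the indicator and state columns; the remaining
# predictors stay as single numeric columns.
df_model = pd.get_dummies(
    df_model,
    columns=["IS_INTERNSHIP", "COMPANY_IS_STAFFING", "STATE_NAME"],
    drop_first=True
)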

After preparing the dataset and encoding categorical features, we trained a linear regression model using an 80/20 train-test split.
The coefficient for GENDER_DOMINANCE_CODE represents the expected change in salary when moving one step along the dominance scale (male → mixed → female), holding the other features constant. Because the groups enter the model as a single numeric code, one coefficient describes both steps.

Model Training

Code

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df_model.drop("SALARY", axis=1)
y = df_model["SALARY"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=79
)

model = LinearRegression()
model.fit(X_train, y_train)

coef = model.coef_[list(X.columns).index("GENDER_DOMINANCE_CODE")]
print(f"Salary change per category step: {coef:.3f}")

Salary change per category step: -6998.673

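Moving one step along the scale, from male-dominated to mixed or from mixed to female-dominated, is associated with an advertised salary about $7,000 lower, holding the other features constant.
The two steps are equal by construction: the groups enter the model as a single numeric code, so the regression cannot show whether one step is larger than the other.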
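
To see what a single slope can hide, the sketch below compares the ordinal coding with dummy (one-hot) coding, which gives each group its own offset. This is an alternative specification, not the one used in this analysis, and the data are synthetic: the salary levels and noise are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic salaries: only group 0 is paid more; groups 1 and 2 share a level.
code = rng.integers(0, 3, size=3000)
salary = np.where(code == 0, 120_000, 70_000) + rng.normal(0, 5_000, size=3000)

# Ordinal coding: one slope, so both steps are forced to be equal.
ordinal = LinearRegression().fit(code.reshape(-1, 1), salary)
step = ordinal.coef_[0]

# Dummy coding (group 0 as the reference): each group gets its own offset.
dummies = pd.get_dummies(pd.Series(code, name="group"), prefix="group", drop_first=True)
onehot = LinearRegression().fit(dummies.astype(float), salary)
offsets = dict(zip(dummies.columns, onehot.coef_))

print(f"Ordinal step (applies to 0 to 1 and 1 to 2): {step:,.0f}")
print({name: round(float(value)) for name, value in offsets.items()})
```

In this synthetic setup the dummy-coded model recovers one large drop followed by no further change, while the ordinal model spreads the difference evenly over two steps.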

Model Performance Evaluation

To assess the linear regression model, we compute R² and RMSE on the test set. The model explains approximately 24.3% of the variation in advertised salaries, with an RMSE of about $39,200. It therefore captures some underlying structure (especially differences driven by experience, industry, and location), but most of the salary variation remains unexplained under a linear framework.

Code

from sklearn.metrics import r2_score, mean_squared_error

y_pred_lr = model.predict(X_test)
lr_r2 = r2_score(y_test, y_pred_lr)
lr_rmse = mean_squared_error(y_test, y_pred_lr) ** 0.5
print(f"R²: {lr_r2:.4f}")
print(f"RMSE: {lr_rmse:.4f}")

R²: 0.2432
RMSE: 39206.1659
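
For readers who want to see what these two numbers measure, the sketch below computes R² and RMSE by hand and compares them with the scikit-learn functions. The four salary values and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up salaries and predictions, for illustration only.
y_true = np.array([60_000.0, 85_000.0, 110_000.0, 150_000.0])
y_hat = np.array([70_000.0, 80_000.0, 120_000.0, 130_000.0])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                 # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean

r2_manual = 1 - ss_res / ss_tot
rmse_manual = np.sqrt(np.mean(residuals ** 2))

print(f"R² (manual): {r2_manual:.4f}  R² (sklearn): {r2_score(y_true, y_hat):.4f}")
print(f"RMSE (manual): {rmse_manual:,.2f}")
```

R² compares the model's squared errors with those of always predicting the mean salary, while RMSE reports the typical error size in the same units as the target.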

Predicted Salary by Dominance Group

Using the average job posting profile in the training data (the mean of each feature), we generated predicted salaries for each gender dominance category, changing only the dominance code. The results show a downward trend in salary as industries become more female-dominated, even after controlling for experience, employment type, remote status, staffing firm involvement, and state-level differences. The predictions fall by about $7,000 per step, which mirrors the single gender dominance coefficient of the linear model. In other words, job postings in male-dominated industries are associated with higher advertised salaries, while female-dominated sectors tend to offer lower salaries for otherwise comparable postings in our dataset.

Code

base = X_train.mean().copy()

pred = {}
for group_code in [0, 1, 2]:
    base["GENDER_DOMINANCE_CODE"] = group_code
    pred[group_code] = model.predict(base.to_frame().T)[0]

merged = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": list(pred.keys()),
    "PREDICTED_SALARY": list(pred.values())
})
merged["GENDER_DOMINANCE"] = merged["GENDER_DOMINANCE_CODE"].map(
    {v: k for k, v in gender_encoding.items()}
)
merged = merged[["GENDER_DOMINANCE", "PREDICTED_SALARY"]]
print(merged)

   GENDER_DOMINANCE  PREDICTED_SALARY
0    Male-dominated     123564.708984
1             Mixed     116566.035994
2  Female-dominated     109567.363003

Random Forest

The Random Forest model is used to capture non-linear relationships between job posting characteristics and advertised salary. Unlike linear regression, which assumes a straight-line effect for each feature, Random Forests build many decision trees and combine their predictions. This allows the model to detect more complex interactions across features such as industry, remote work status, experience requirements, and gender dominance.
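
As a small illustration of how the trees' predictions are combined, the sketch below fits a forest on synthetic data and checks that its prediction equals the average of the individual trees' predictions. The data and the names `X_toy` and `y_toy` are invented for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data: a step-shaped effect that a single straight line fits poorly.
X_toy = rng.uniform(0, 10, size=(500, 2))
y_toy = np.where(X_toy[:, 0] > 5, 100.0, 60.0) + rng.normal(0, 5, size=500)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Each fitted tree predicts on its own; the forest reports the average.
tree_preds = np.stack([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_toy[:5])))
```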

Model Training

We use 300 trees, allow each tree to grow to full depth (max_depth=None), and set a random seed to ensure reproducible results. A single fully grown tree tends to overfit, but averaging many trees fit on bootstrap samples reduces the variance of the combined prediction and makes the model more robust.

During training, the forest collectively learns how salary varies with experience level, industry code, employment type, remote type, and other factors.

Code

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    random_state=79,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
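
One way to gauge how much a forest of fully grown trees overfits is to compare its R² on the training data with its out-of-bag (OOB) and test-set R². The sketch below does this on synthetic data. The data are invented, and `oob_score=True` is an option that the model above does not set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic stand-in for the posting data: one informative feature plus noise.
X_syn = rng.normal(size=(1000, 5))
y_syn = 10 * X_syn[:, 0] + rng.normal(0, 10, size=1000)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=79
)

forest = RandomForestRegressor(
    n_estimators=100, max_depth=None, oob_score=True, random_state=79
).fit(Xs_train, ys_train)

# score() returns R²; fully grown trees usually look much better on training data.
train_r2 = forest.score(Xs_train, ys_train)
test_r2 = forest.score(Xs_test, ys_test)
print(f"Train R²: {train_r2:.3f}")
print(f"OOB R²:   {forest.oob_score_:.3f}")
print(f"Test R²:  {test_r2:.3f}")
```

A large gap between the training score and the OOB or test score suggests that the individual trees memorize noise, which is why the held-out metrics are the ones to report.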

Model Performance Evaluation

We evaluate the model using two metrics:

R²: Measures how much variance in salary the model explains.
RMSE: The square root of the average squared prediction error, expressed in dollars; large errors count more heavily than small ones.

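The model achieves an R² of approximately 0.44 and an RMSE of about $33,800 on the test set, compared with 0.24 and about $39,200 for the linear regression. This indicates moderate predictive power given the complexity and noise typical of job posting salary data.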

Code

rf_r2 = r2_score(y_test, y_pred_rf)
rf_rmse = mean_squared_error(y_test, y_pred_rf) ** 0.5

print(f"R²: {rf_r2:.4f}")
print(f"RMSE: {rf_rmse:,.4f}")

R²: 0.4381
RMSE: 33,780.1637

Feature Importance Analysis

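Random Forest provides an estimate of each feature’s importance. The feature_importances_ attribute in scikit-learn is impurity-based: it measures how much splits on a feature reduce the variance of salary within tree nodes, averaged across the forest and normalized so that the values sum to 1. In our results, experience (0.38) and industry code (0.16) are the strongest predictors, while gender dominance plays a smaller but measurable role (0.06). Because the gender dominance code is derived from the NAICS 2-digit code, the two features carry overlapping information, so their individual importances should be read with caution.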

Code

rf_importances = (
    pd.Series(rf_model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(rf_importances.head(10))

MIN_YEARS_EXPERIENCE        0.375126
NAICS_2022_2                0.157024
REMOTE_TYPE                 0.068725
GENDER_DOMINANCE_CODE       0.056740
EMPLOYMENT_TYPE             0.040388
IS_INTERNSHIP_True          0.025948
STATE_NAME_California       0.025122
COMPANY_IS_STAFFING_True    0.023374
STATE_NAME_Texas            0.014256
STATE_NAME_New York         0.013872
dtype: float64
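
Permutation importance on held-out data is a common cross-check for the importances reported by the forest. It is not part of the analysis above; the sketch below shows the idea on synthetic data, with invented column names and values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic data: "experience" drives the target, "noise" is unrelated to it.
X_demo = pd.DataFrame({
    "experience": rng.uniform(0, 10, size=800),
    "noise": rng.normal(size=800),
})
y_demo = 8_000 * X_demo["experience"] + rng.normal(0, 5_000, size=800)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=79
)
forest = RandomForestRegressor(n_estimators=100, random_state=79).fit(Xd_train, yd_train)

# Shuffle one column at a time on held-out data and record the drop in R².
result = permutation_importance(forest, Xd_test, yd_test, n_repeats=10, random_state=79)
perm_importances = pd.Series(result.importances_mean, index=X_demo.columns)
print(perm_importances.sort_values(ascending=False))
```

Because it is measured on data the model has not seen, this score is less likely to reward features that the trees merely used to fit noise.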

Predicted Salary by Dominance Group

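The Random Forest model shows a different pattern from the linear regression. At the average posting profile, the predicted salary for male-dominated industries is roughly $56,000 higher than for mixed or female-dominated industries, while the predictions for mixed and female-dominated sectors are almost identical (about $71,700 and $72,000). These predictions should be interpreted with caution: the gender dominance code is derived from the NAICS code, so changing it while holding the NAICS code at its mean produces feature combinations that may not occur in the data.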

Code

base_rf = X_train.mean().copy()

pred_rf = {}
for group_code in [0, 1, 2]:
    base_rf["GENDER_DOMINANCE_CODE"] = group_code
    pred_rf[group_code] = rf_model.predict(base_rf.to_frame().T)[0]

rf_merged = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": list(pred_rf.keys()),
    "PREDICTED_SALARY_RF": list(pred_rf.values())
})
rf_merged["GENDER_DOMINANCE"] = rf_merged["GENDER_DOMINANCE_CODE"].map(
    {v: k for k, v in gender_encoding.items()}
)
rf_merged = rf_merged[["GENDER_DOMINANCE", "PREDICTED_SALARY_RF"]]
print(rf_merged)

   GENDER_DOMINANCE  PREDICTED_SALARY_RF
0    Male-dominated        128001.509402
1             Mixed         71684.453611
2  Female-dominated         72021.826390
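
The predictions above are evaluated at a single averaged profile. A different summary, not the one used in this analysis, sets the dominance code to each value for every observed row and averages the resulting predictions, in the style of a partial dependence estimate. The sketch below illustrates the idea on synthetic data; the two-column design matrix and the salary formula are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Synthetic stand-in for the design matrix; values are invented for illustration.
X_obs = pd.DataFrame({
    "GENDER_DOMINANCE_CODE": rng.integers(0, 3, size=600),
    "MIN_YEARS_EXPERIENCE": rng.integers(0, 11, size=600),
})
y_obs = (
    60_000
    + 5_000 * X_obs["MIN_YEARS_EXPERIENCE"]
    + np.where(X_obs["GENDER_DOMINANCE_CODE"] == 0, 20_000, 0)
    + rng.normal(0, 5_000, size=600)
)
forest = RandomForestRegressor(n_estimators=100, random_state=79).fit(X_obs, y_obs)

# Set every row to one group, predict, and average over the observed rows.
avg_pred = {}
for group_code in [0, 1, 2]:
    X_counterfactual = X_obs.copy()
    X_counterfactual["GENDER_DOMINANCE_CODE"] = group_code
    avg_pred[group_code] = forest.predict(X_counterfactual).mean()

print({code: round(float(value)) for code, value in avg_pred.items()})
```

Averaging over real rows avoids predicting for a profile with fractional dummy values, although it still changes the dominance code independently of the other features.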