Gender Disparities in Labor Force Analysis (2024)

On this page

  • EDA Overview
  • Load the Cleaned Dataset
  • Job Postings by Industry (Top Sectors Hiring)
  • Annual Salary - Trend Over Time
  • Salary Distribution by Industry
  • Remote vs. On-Site Job Patterns
  • Geographic Distribution of Jobs (by State)
  • Top Software Skills in Job Postings
  • EDA Summary

Exploratory Data Analysis

Job Postings, Salaries, Geography, and Remote Work Patterns

EDA Overview

This page explores hiring trends, salary patterns, geographic variation, remote work availability, and software skill demands using the cleaned Lightcast dataset produced earlier.

Each visualization is selected to answer job seeker–focused questions:

  • Which industries are hiring the most?
  • How do salaries vary across sectors?
  • How common are remote roles?
  • How has demand changed over time?
  • Which software skills are most requested?

Load the Cleaned Dataset

Code
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re

df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
#df.head()
#df.info()
#list(df.columns)

Job Postings by Industry (Top Sectors Hiring)

Code
industry_candidates = [
    "NAICS_2022_2_NAME", 
    "INDUSTRY_NAME", 
    "NAICS_2022_3_NAME"]
industry_col = next(
    (c for c in industry_candidates if c in df.columns), 
    None)
industry_col
'NAICS_2022_2_NAME'
Code
if industry_col:
    industry_counts = (
        df[industry_col]
        .value_counts()
        .head(15)
        .reset_index()
    )
    industry_counts.columns = ["Industry", "Postings"]
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=industry_counts["Postings"],
        y=industry_counts["Industry"],
        orientation="h"
    ))
    fig.update_layout(
        title=f"Top 15 Industries by Job Postings ({industry_col})",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ))
    fig.show()

else:
    print("No industry column found.")

Insight:

Professional, Scientific & Technical Services and Administrative & Support Services account for the largest share of job postings, indicating sustained demand for analytical, operational, and client-facing roles. These sectors typically exhibit continuous hiring cycles driven by project-based work, business expansion, and comparatively high turnover. For job seekers, they represent strong entry points into the labor market with broad role variety and relatively frequent openings.
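To put a number on "largest share", the posting counts can be converted to percentages. The snippet below is a minimal sketch on invented toy data (the industry labels and counts are placeholders, not results from the Lightcast file); only the column name `NAICS_2022_2_NAME` comes from the analysis above.

```python
import pandas as pd

# Toy stand-in for df["NAICS_2022_2_NAME"]; labels and counts are invented.
industries = pd.Series(
    ["Professional Services"] * 5
    + ["Administrative Services"] * 3
    + ["Retail Trade"] * 2,
    name="NAICS_2022_2_NAME",
)

# normalize=True returns proportions instead of raw counts.
share_pct = industries.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```

On the real data, the same call on `df[industry_col]` would show what fraction of all postings the top two sectors actually represent.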


Annual Salary - Trend Over Time

Code
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df2 = df.copy()
df2 = df2.dropna(subset=["ANNUAL_SALARY", "month"])

# Group by month (string)
agg = (
    df2.groupby("month")
    .agg(
        postings=("ANNUAL_SALARY", "size"), 
        avg_salary=("ANNUAL_SALARY", "mean")
        ).reset_index().sort_values("month")
)

# Dual-axis chart
# fig = make_subplots(specs=[[{"secondary_y": True}]])
fig = go.Figure()
fig.add_trace(go.Bar(
    x=agg["month"],
    y=agg["postings"],
    name="Job Postings",
    marker_color="#ff7043",
    yaxis="y1",
    hovertemplate="Job postings: %{y:,}<extra></extra>"
))
fig.add_trace(go.Scatter(
    x=agg["month"],
    y=agg["avg_salary"],
    name="Annual Salary",
    mode="lines+markers",
    yaxis="y2",
    line=dict(color="#1f77b4", width=3),
    hovertemplate="Salary: $%{y:,.0f}<extra></extra>"
))

fig.update_layout(
    title="Annual Salary - Trend Over Time",
    template="plotly_white",
    hovermode="x unified",
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1),
    xaxis=dict(
        title="Month",
        tickformat="%b %Y",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis=dict(
        title="Job Postings",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis2=dict(
        title="Annual Salary($)",
        overlaying="y",
        side="right",
        tickprefix="$",
        showgrid=False
    )
)
fig.show()

Insight:

The chart points to a seasonal pattern in the job market over the months covered, May through September. Job postings stay fairly strong throughout, with a small dip in July before demand rises again in late summer. Advertised salaries move very differently: they drop steadily from May to July, reach their lowest point in mid-summer, and then jump sharply in September to the highest level in the entire period.

This contrast suggests that early summer is dominated by lower-paying or junior roles, while late summer and early fall bring a return of higher-value positions. The most important takeaway is that September offers both strong hiring volume and significantly higher salaries, making it a particularly favorable time for job seekers aiming for better-paid opportunities.
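Because the chart plots the monthly mean, a handful of very high-paying postings can move the line on their own. One way to check whether a jump reflects a broad shift or a few outliers is to compare the mean with the median per month. This is a minimal sketch on invented toy data; the column names `month` and `ANNUAL_SALARY` follow the code above, while the values are placeholders.

```python
import pandas as pd

# Toy data: September contains one extreme posting.
toy = pd.DataFrame({
    "month": ["2024-08"] * 3 + ["2024-09"] * 3,
    "ANNUAL_SALARY": [80000, 90000, 100000, 80000, 90000, 400000],
})

monthly = (
    toy.groupby("month")["ANNUAL_SALARY"]
       .agg(postings="size", avg_salary="mean", median_salary="median")
       .reset_index()
)
print(monthly)
```

In this toy example the September mean more than doubles while the median is unchanged, which would indicate an outlier effect rather than a general rise in pay.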

Salary Distribution by Industry

Salary variation across industries answers: “Which sectors pay more, and how unequal are wages within each industry?” We use the annualized salary field (ANNUAL_SALARY), which consolidates raw salary data from SALARY_FROM, SALARY_TO, and SALARY and converts it into a consistent yearly format based on reported pay period. This provides a clean, comparable measure of compensation across all industries.
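The consolidation described above can be sketched roughly as follows. This is an illustration, not the project's actual cleaning code: the pay-period column name `PAY_PERIOD`, its labels, the multipliers (2,080 hours, 52 weeks, 12 months per year), and the rule of falling back to the midpoint of `SALARY_FROM` and `SALARY_TO` when `SALARY` is missing are all assumptions.

```python
import numpy as np
import pandas as pd

# Assumed multipliers for converting a pay rate to a yearly figure.
PERIODS_PER_YEAR = {"hour": 2080, "week": 52, "month": 12, "year": 1}

def annualize(df: pd.DataFrame) -> pd.Series:
    """Return an annual salary per row (NaN where it cannot be computed)."""
    # Prefer the single SALARY value; otherwise average whichever range bounds exist.
    midpoint = df[["SALARY_FROM", "SALARY_TO"]].mean(axis=1)
    base = df["SALARY"].fillna(midpoint)
    # Unknown pay periods map to NaN so they are not silently treated as yearly.
    factor = df["PAY_PERIOD"].map(PERIODS_PER_YEAR)
    return base * factor

# Hypothetical rows: an hourly range, a yearly figure, and a monthly range.
toy = pd.DataFrame({
    "SALARY": [np.nan, 90000.0, np.nan],
    "SALARY_FROM": [20.0, np.nan, 4000.0],
    "SALARY_TO": [30.0, np.nan, 6000.0],
    "PAY_PERIOD": ["hour", "year", "month"],
})
toy["ANNUAL_SALARY"] = annualize(toy)
print(toy[["PAY_PERIOD", "ANNUAL_SALARY"]])
```

Mapping unrecognized pay periods to NaN, rather than defaulting to yearly, keeps ambiguous rows out of the salary charts instead of distorting them.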

Code
# industry_col was detected in the industry section above and is reused here
salary_col = "ANNUAL_SALARY" if "ANNUAL_SALARY" in df.columns else None

if industry_col and salary_col:

    # Filter valid salaries
    df_salary = df[[industry_col, salary_col]].dropna()

    # Remove noise categories that distort analysis
    df_salary = df_salary[df_salary[industry_col] != "Unclassified Industry"]

    # Cap extreme values at 300k for readability (still accurate for 95% of postings)
    df_salary[salary_col] = df_salary[salary_col].clip(upper=300_000)

    # Compute median salary per industry for ordering
    medians = (
        df_salary.groupby(industry_col)[salary_col]
        .median()
        .sort_values(ascending=True)
    )

    ordered_categories = medians.index.tolist()

    fig = px.box(
        df_salary,
        y=industry_col,
        x=salary_col,
        category_orders={industry_col: ordered_categories},
        title="Annual Salary Distribution by Industry",
        color=industry_col,
        color_discrete_sequence=px.colors.qualitative.Vivid,
        height=900
    )

    fig.update_layout(
        showlegend=False,
        xaxis_title="Annual Salary (USD)",
        yaxis_title="Industry",
        margin=dict(l=120)
    )

    fig.show()

else:
    print("Salary chart not generated: Missing industry or salary column.")

Insight:

Annual salary levels differ substantially across industries, with Finance, Information, and Professional Services exhibiting the highest median compensation. The wide dispersion of salaries within these sectors reflects variation in seniority, specialization, and credential requirements. In contrast, service-oriented industries such as Retail and Accommodation & Food Services show lower but more compressed salary ranges, consistent with narrower variation in role complexity. These patterns suggest that targeted upskilling toward high-skill roles in Finance, Tech, and Professional Services is likely to yield the greatest earnings uplift.
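The difference between "wide dispersion" and "compressed" ranges can be quantified with the interquartile range (IQR) per industry, which is the width of each box in the plot. The following is a minimal sketch on invented toy data; the column names follow the analysis above, and the salary values are placeholders.

```python
import pandas as pd

# Toy data: one industry with a wide salary spread, one with a narrow spread.
toy = pd.DataFrame({
    "NAICS_2022_2_NAME": ["Finance"] * 4 + ["Retail"] * 4,
    "ANNUAL_SALARY": [60000, 100000, 140000, 180000,
                      30000, 34000, 38000, 42000],
})

# Quartiles per industry, reshaped to one row per industry.
quartiles = (
    toy.groupby("NAICS_2022_2_NAME")["ANNUAL_SALARY"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles.sort_values("median", ascending=False))
```

Sorting the real table by `iqr` instead of `median` would rank industries by within-industry pay inequality rather than by pay level.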

Remote vs. On-Site Job Patterns

Remote work availability is a crucial factor for job seekers.
This chart shows how postings split across remote, hybrid, and in-person arrangements overall; it does not break the split down by industry. Here, [None] marks postings for which Lightcast provided no remote-work tag, which typically corresponds to on-site or unspecified arrangements.

Code
remote_candidates = ["REMOTE_TYPE_NAME", "REMOTE_TYPE"]
remote_col = next((c for c in remote_candidates if c in df.columns), None)
Code
if remote_col:
    remote_counts = df[remote_col].value_counts().reset_index()
    remote_counts.columns = ["Work Arrangement", "Postings"]

    fig = px.pie(
        remote_counts,
        names="Work Arrangement",
        values="Postings",
        title="Remote vs On-Site Job Postings",
        color_discrete_sequence=px.colors.qualitative.Pastel1
    )
    fig.show()
else:
    print("No remote work field found.")

Insight:

The majority of postings either do not explicitly tag remote work or are implicitly on-site, with only a modest share labeled as Remote or Hybrid. Where remote options are available, they are concentrated in data, IT, and knowledge-intensive roles that can be performed digitally. For job seekers prioritizing location flexibility, this suggests focusing on analytically oriented roles and employers that have formal remote-work policies rather than assuming remote options are widespread across all occupations.
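The pie chart reports the overall split only. To see how work arrangements vary by industry, a row-normalized cross-tabulation is one option. This is a minimal sketch on invented toy data: the column names match the dataset fields used above, but the category labels and counts are placeholders.

```python
import pandas as pd

# Toy data standing in for df[[industry_col, remote_col]]; values are invented.
toy = pd.DataFrame({
    "NAICS_2022_2_NAME": ["Information"] * 4 + ["Retail Trade"] * 4,
    "REMOTE_TYPE_NAME": ["Remote", "Remote", "Hybrid Remote", "[None]",
                         "[None]", "[None]", "[None]", "Remote"],
})

# normalize="index" makes each industry's row sum to 1.
remote_share = pd.crosstab(
    toy["NAICS_2022_2_NAME"],
    toy["REMOTE_TYPE_NAME"],
    normalize="index",
)
print(remote_share.round(2))
```

Each row then reads as "of this industry's postings, what fraction carries each remote-work tag", which allows a direct comparison across sectors of different size.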

Geographic Distribution of Jobs (by State)

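Geographic EDA answers: “Which states have the most job opportunities?” We count postings per city and state (CITY_NAME, STATE_NAME), keep the 100 cities with the most postings, and sum those cities' postings by state to shade the map. State totals therefore reflect only these top cities. City bubbles are spread on a small ring around each state's centroid, not placed at actual city coordinates.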
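For reference, per-state totals can also be computed directly from every posting with a group-by on `STATE_NAME`. Note that this counts all postings, whereas the map code that follows sums only the 100 largest cities, so the two can differ. A minimal sketch on invented toy data:

```python
import pandas as pd

# Toy postings; one row has no city but still counts toward its state.
toy = pd.DataFrame({
    "CITY_NAME": ["Austin, TX", "Austin, TX", "Dallas, TX", "Boston, MA", None],
    "STATE_NAME": ["Texas", "Texas", "Texas", "Massachusetts", "Texas"],
})

state_totals = (
    toy.dropna(subset=["STATE_NAME"])
       .groupby("STATE_NAME")
       .size()
       .sort_values(ascending=False)
       .rename("total_postings")
       .reset_index()
)
print(state_totals)
```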

Code
STATE_CENTROIDS = {
    "AL": (32.806, -86.791), "AK": (64.200, -149.493), "AZ": (34.049, -111.094),
    "AR": (35.201, -91.832), "CA": (36.778, -119.418), "CO": (39.550, -105.782),
    "CT": (41.603, -73.087), "DE": (38.910, -75.527), "DC": (38.907, -77.037),
    "FL": (27.664, -81.516), "GA": (32.165, -82.900), "HI": (19.896, -155.582),
    "ID": (44.068, -114.742), "IL": (40.633, -89.398), "IN": (40.267, -86.134),
    "IA": (41.878, -93.098), "KS": (39.012, -98.484), "KY": (37.839, -84.270),
    "LA": (30.984, -91.963), "ME": (45.254, -69.445), "MD": (39.045, -76.641),
    "MA": (42.407, -71.382), "MI": (44.314, -85.602), "MN": (46.729, -94.685),
    "MS": (32.355, -89.398), "MO": (37.964, -91.832), "MT": (46.879, -110.362),
    "NE": (41.492, -99.901), "NV": (38.803, -116.419), "NH": (43.193, -71.572),
    "NJ": (40.058, -74.406), "NM": (34.519, -105.870), "NY": (43.000, -75.000),
    "NC": (35.759, -79.019), "ND": (47.551, -101.002), "OH": (40.417, -82.907),
    "OK": (35.468, -97.516), "OR": (43.804, -120.554), "PA": (41.203, -77.194),
    "RI": (41.580, -71.477), "SC": (33.837, -81.164), "SD": (43.969, -99.901),
    "TN": (35.518, -86.580), "TX": (31.968, -99.901), "UT": (39.321, -111.094),
    "VT": (44.558, -72.577), "VA": (37.431, -78.657), "WA": (47.751, -120.740),
    "WV": (38.597, -80.454), "WI": (43.785, -88.787), "WY": (43.076, -107.290)
}
STATE_NAMES = {
    "AL": "Alabama", "AK": "Alaska", "AZ": "Arizona", 
    "AR": "Arkansas", "CA": "California", "CO": "Colorado",
    "CT": "Connecticut", "DE": "Delaware", 
    "DC": "District of Columbia", "FL": "Florida", 
    "GA": "Georgia", "HI": "Hawaii", "ID": "Idaho", 
    "IL": "Illinois", "IN": "Indiana", "IA": "Iowa", 
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", 
    "ME": "Maine", "MD": "Maryland", "MA": "Massachusetts", 
    "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi", 
    "MO": "Missouri", "MT": "Montana", "NE": "Nebraska", 
    "NV": "Nevada", "NH": "New Hampshire", "NJ": "New Jersey", 
    "NM": "New Mexico", "NY": "New York", "NC": "North Carolina", 
    "ND": "North Dakota", "OH": "Ohio", "OK": "Oklahoma", 
    "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island",
    "SC": "South Carolina", "SD": "South Dakota", "TN": "Tennessee", 
    "TX": "Texas", "UT": "Utah", "VT": "Vermont", 
    "VA": "Virginia", "WA": "Washington", "WV": "West Virginia", 
    "WI": "Wisconsin", "WY": "Wyoming"
}
STATE_NAME_TO_CODE = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR",
    "California": "CA", "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE",
    "District of Columbia": "DC", "Florida": "FL", "Georgia": "GA",
    "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN",
    "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA",
    "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI",
    "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT",
    "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ",
    "New Mexico": "NM", "New York": "NY", "North Carolina": "NC",
    "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR",
    "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC",
    "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT",
    "Vermont": "VT", "Virginia": "VA", "Washington": "WA",
    "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"
}
Code
# Note: this reassigns df, so the sections below also exclude postings with an unknown city
df = df[~df["CITY_NAME"].str.contains("Unknown", na=False)].copy()
df_city = (
    df
    .dropna(subset=["CITY_NAME", "STATE_NAME"])
    .groupby(["CITY_NAME", "STATE_NAME"], as_index=False)
    .agg(total_job_postings=("CITY_NAME", "size"))
)
df_city["STATE_CODE"] = df_city["STATE_NAME"].map(STATE_NAME_TO_CODE)
df_city = df_city.dropna(subset=["STATE_CODE"])
df_city = df_city.nlargest(100, "total_job_postings")

df_state = (
    df_city.groupby("STATE_CODE", as_index=False)
           .agg(total_postings=("total_job_postings", "sum"))
)
# city coordinates
lats, lons = [], []
counts = df_city.groupby("STATE_CODE")["CITY_NAME"].transform("count")
idxs = df_city.groupby("STATE_CODE").cumcount()

for st_code, k, idx in zip(df_city["STATE_CODE"], counts, idxs):
    base = STATE_CENTROIDS.get(st_code, (37.0, -96.0))  # fallback center of US
    angle = 2 * np.pi * (idx / max(k, 1))
    radius = 0.8

    lats.append(base[0] + radius * np.cos(angle))
    lons.append(base[1] + radius * np.sin(angle))

df_city["lat"] = lats
df_city["lon"] = lons

# Bubble size scaling
max_total = float(df_city["total_job_postings"].max())
size_scale = 40.0 / np.sqrt(max_total) if max_total > 0 else 1.0

fig = go.Figure()
fig.add_trace(go.Choropleth(
    locations=df_state["STATE_CODE"],
    z=df_state["total_postings"],
    locationmode="USA-states",
    colorscale="Blues",
    marker_line_color="white",
    marker_line_width=0.6,
    colorbar_title="Total postings",
    hovertemplate="<b>%{location}</b><br>Total: %{z:,}<extra></extra>",
    zmin=0,
    zmax=df_state["total_postings"].max()
))

# City bubbles
fig.add_trace(go.Scattergeo(
    lon=df_city["lon"],
    lat=df_city["lat"],
    text=df_city["CITY_NAME"],
    mode="markers",
    marker=dict(
        size=np.sqrt(df_city["total_job_postings"]) * size_scale,
        sizemin=5,
        opacity=0.85,
        color="#22c55e",
        line=dict(width=0.8, color="white")
    ),
    customdata=df_city["total_job_postings"],
    hovertemplate=(
        "<b>%{text}</b><br>"
        "Total postings: %{customdata:,}"
        "<extra></extra>"
    ),
    showlegend=False,
    name="City postings"
))

# State labels
label_lats, label_lons, label_names = [], [], []
for code in df_state["STATE_CODE"]:
    if code in STATE_CENTROIDS:
        lat, lon = STATE_CENTROIDS[code]
        label_lats.append(lat)
        label_lons.append(lon)
        label_names.append(STATE_NAMES.get(code, code))

fig.add_trace(go.Scattergeo(
    lat=label_lats,
    lon=label_lons,
    mode="text",
    text=label_names,
    textfont=dict(size=9, color="rgba(0,0,0,0.75)"),
    hoverinfo="skip",
    showlegend=False
))

fig.update_layout(
    title="Job postings by Location",
    geo=dict(
        scope="usa",
        projection=go.layout.geo.Projection(type="albers usa"),
        showland=True,
        landcolor="#fafafa"
    ),
    # margin=dict(l=100, r=20, t=60, b=60),
    paper_bgcolor="white",
    plot_bgcolor="white"
)
fig.update_traces(
    selector=dict(type="choropleth"),
    colorbar=dict(
        orientation="h",
        x=0.5,
        xanchor="center",
        y=-0.01,
        yanchor="top",
        thickness=12,
        len=0.6
    )
)
fig.show()

Insight:

Texas, California, and Florida emerge as national hiring hubs, reflecting their large and diversified economies, population growth, and strong business ecosystems. Many of the remaining top states, such as Virginia, New York, New Jersey, and North Carolina, are anchored by federal agencies, financial centers, or technology clusters. Concentrating job search efforts in these high-volume states can materially increase the number of suitable opportunities available to candidates.

Top Software Skills in Job Postings

To support the Skill Gap Analysis page, we identify which software tools are requested most often. This uses the cleaned SOFTWARE_SKILLS_NAME column produced in the Data Cleaning step.

Code
if "SOFTWARE_SKILLS_NAME" in df.columns:
    skills = (
        df["SOFTWARE_SKILLS_NAME"]
        .astype(str)
        .str.split(",")
        .explode()
        .str.strip()
    )
    skills = skills[(skills != "") & (skills != "nan")]

    top_skills = skills.value_counts().head(15).reset_index()
    top_skills.columns = ["Skill", "Postings"]
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=top_skills["Postings"],
        y=top_skills["Skill"],
        orientation="h"
    ))
    fig.update_layout(
        title="Top 15 Software Skills",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ),
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ))
    fig.show()
else:
    print("Software skills column not available.")

Insight:

SQL, Python, and Excel appear as the most universally requested tools, underscoring their status as core competencies for analytics, BI, and operations roles. Tableau and Power BI, together with dashboarding skills more broadly, highlight the premium placed on communicating data insights visually. Cloud and platform skills such as AWS, Azure, and Oracle Cloud are less frequent but still prominent, indicating growing demand for candidates who can work across both traditional analytics stacks and modern cloud-based architectures.
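The bar chart counts skill mentions, so a posting that lists the same tool twice contributes twice. If the share of postings requesting each skill is of interest instead, repeats within a posting can be dropped first. This is a minimal sketch on invented toy data; it assumes, as the code above does, that `SOFTWARE_SKILLS_NAME` holds comma-separated skill names.

```python
import pandas as pd

# Toy postings; the second row repeats "SQL", the third has no skills listed.
toy = pd.DataFrame({
    "SOFTWARE_SKILLS_NAME": ["SQL, Python", "SQL, SQL, Excel", None, "Python"],
})

skills = (
    toy["SOFTWARE_SKILLS_NAME"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
skills = skills[skills != ""]

# After explode() the index still identifies the source posting,
# so (index, skill) pairs can be de-duplicated per posting.
per_posting = skills.reset_index().drop_duplicates()

n_postings = toy["SOFTWARE_SKILLS_NAME"].notna().sum()
share = per_posting["SOFTWARE_SKILLS_NAME"].value_counts() / n_postings
print(share.round(2))
```

The denominator here is the number of postings that list at least one skill; dividing by `len(toy)` instead would give the share among all postings.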


EDA Summary

The exploratory analysis reveals five consistent themes in the 2024 job market:

  • Industry concentration: Hiring is heavily concentrated in Professional Services, Administrative & Support Services, and key service sectors, which together generate a large share of postings.

  • Compensation differences: Median annual salaries vary sharply by industry, with Finance, Information, and Professional Services offering the strongest earnings potential.

  • Remote work structure: Explicitly remote and hybrid roles remain a minority but are disproportionately found in digital and analytics-intensive occupations.

  • Seasonality: Job posting activity dips slightly in July and rebounds in late summer, while average advertised salaries bottom out in mid-summer and peak in September, highlighting the importance of timing in job search strategies.

  • Skill demand: SQL, Python, Excel, BI tools, and cloud platforms dominate technical requirements, defining a clear “baseline stack” for aspiring data and business analytics professionals.

These insights directly feed into the subsequent Skill Gap Analysis, Machine-Learning Models, and Career Strategy sections by identifying which industries, skills, and time windows matter most for job seekers in 2024.

© 2025 · AD 688 Web Analytics · Boston University

Team 5

Source Code
---
title: "Exploratory Data Analysis"
subtitle: "Job Postings, Salaries, Geography, and Remote Work Patterns"
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    code-fold: true
    code-tools: true
---

<div class="card reveal">

### EDA Overview

This page explores hiring trends, salary patterns, geographic variation, remote work availability, and software skill demands using the **cleaned Lightcast dataset** produced earlier.

Each visualization is selected to answer job seeker–focused questions:

- **Which industries are hiring the most?**  
- **How do salaries vary across sectors?**  
- **How common are remote roles?**  
- **How has demand changed over time?**  
- **Which software skills are most requested?**

</div>

---

<div class="card reveal">

### Load the Cleaned Dataset

```{python}
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re

df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
#df.head()
#df.info()
#list(df.columns)
```

</div>

---

<div class="card reveal">

### Job Postings by Industry (Top Sectors Hiring)

```{python}
industry_candidates = [
    "NAICS_2022_2_NAME", 
    "INDUSTRY_NAME", 
    "NAICS_2022_3_NAME"]
industry_col = next(
    (c for c in industry_candidates if c in df.columns), 
    None)
industry_col
```

```{python}
if industry_col:
    industry_counts = (
        df[industry_col]
        .value_counts()
        .head(15)
        .reset_index()
    )
    industry_counts.columns = ["Industry", "Postings"]
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=industry_counts["Postings"],
        y=industry_counts["Industry"],
        orientation="h"
    ))
    fig.update_layout(
        title=f"Top 15 Industries by Job Postings ({industry_col})",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ))
    fig.show()

else:
    print("No industry column found.")
```


**Insight:**
  
Professional, Scientific & Technical Services and Administrative & Support Services account for the largest share of job postings, indicating sustained demand for analytical, operational, and client-facing roles. These sectors typically exhibit continuous hiring cycles driven by project-based work, business expansion, and comparatively high turnover. For job seekers, they represent strong entry points into the labor market with broad role variety and relatively frequent openings.

</div>

---

<div class="card reveal">

### Annual Salary - Trend Over Time

```{python}
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df2 = df.copy()
df2 = df2.dropna(subset=["ANNUAL_SALARY", "month"])

# Group by month (string)
agg = (
    df2.groupby("month")
    .agg(
        postings=("ANNUAL_SALARY", "size"), 
        avg_salary=("ANNUAL_SALARY", "mean")
        ).reset_index().sort_values("month")
)

# Dual-axis chart
# fig = make_subplots(specs=[[{"secondary_y": True}]])
fig = go.Figure()
fig.add_trace(go.Bar(
    x=agg["month"],
    y=agg["postings"],
    name="Job Postings",
    marker_color="#ff7043",
    yaxis="y1",
    hovertemplate="Job postings: %{y:,}<extra></extra>"
))
fig.add_trace(go.Scatter(
    x=agg["month"],
    y=agg["avg_salary"],
    name="Annual Salary",
    mode="lines+markers",
    yaxis="y2",
    line=dict(color="#1f77b4", width=3),
    hovertemplate="Salary: $%{y:,.0f}<extra></extra>"
))

fig.update_layout(
    title="Annual Salary - Trend Over Time",
    template="plotly_white",
    hovermode="x unified",
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1),
    xaxis=dict(
        title="Month",
        tickformat="%b %Y",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis=dict(
        title="Job Postings",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis2=dict(
        title="Annual Salary($)",
        overlaying="y",
        side="right",
        tickprefix="$",
        showgrid=False
    )
)
fig.show()
```

**Insight:**

The chart points to a seasonal pattern in the job market over the months covered, May through September. Job postings stay fairly strong throughout, with a small dip in July before demand rises again in late summer. Advertised salaries move very differently: they drop steadily from May to July, reach their lowest point in mid-summer, and then jump sharply in September to the highest level in the entire period.

This contrast suggests that early summer is dominated by lower-paying or junior roles, while late summer and early fall bring a return of higher-value positions. The most important takeaway is that September offers both strong hiring volume and significantly higher salaries, making it a particularly favorable time for job seekers aiming for better-paid opportunities.

</div>

<div class="card reveal">

### Salary Distribution by Industry

Salary variation across industries answers: “Which sectors pay more, and how unequal are wages within each industry?”
We use the annualized salary field (ANNUAL_SALARY), which consolidates raw salary data from SALARY_FROM, SALARY_TO, and SALARY and converts it into a consistent yearly format based on reported pay period.
This provides a clean, comparable measure of compensation across all industries.

```{python}
# industry_col was detected in the industry section above and is reused here
salary_col = "ANNUAL_SALARY" if "ANNUAL_SALARY" in df.columns else None

if industry_col and salary_col:

    # Filter valid salaries
    df_salary = df[[industry_col, salary_col]].dropna()

    # Remove noise categories that distort analysis
    df_salary = df_salary[df_salary[industry_col] != "Unclassified Industry"]

    # Cap extreme values at 300k for readability (still accurate for 95% of postings)
    df_salary[salary_col] = df_salary[salary_col].clip(upper=300_000)

    # Compute median salary per industry for ordering
    medians = (
        df_salary.groupby(industry_col)[salary_col]
        .median()
        .sort_values(ascending=True)
    )

    ordered_categories = medians.index.tolist()

    fig = px.box(
        df_salary,
        y=industry_col,
        x=salary_col,
        category_orders={industry_col: ordered_categories},
        title="Annual Salary Distribution by Industry",
        color=industry_col,
        color_discrete_sequence=px.colors.qualitative.Vivid,
        height=900
    )

    fig.update_layout(
        showlegend=False,
        xaxis_title="Annual Salary (USD)",
        yaxis_title="Industry",
        margin=dict(l=120)
    )

    fig.show()

else:
    print("Salary chart not generated: Missing industry or salary column.")
```

**Insight:**

Annual salary levels differ substantially across industries, with Finance, Information, and Professional Services exhibiting the highest median compensation. The wide dispersion of salaries within these sectors reflects variation in seniority, specialization, and credential requirements. In contrast, service-oriented industries such as Retail and Accommodation & Food Services show lower but more compressed salary ranges, consistent with narrower variation in role complexity. These patterns suggest that targeted upskilling toward high-skill roles in Finance, Tech, and Professional Services is likely to yield the greatest earnings uplift.

</div>

<div class="card reveal">

### Remote vs. On-Site Job Patterns

Remote work availability is a crucial factor for job seekers.  
This chart shows how postings split across remote, hybrid, and in-person arrangements overall; it does not break the split down by industry.
Here, [None] marks postings for which Lightcast provided no remote-work tag, which typically corresponds to on-site or unspecified arrangements.

```{python}
remote_candidates = ["REMOTE_TYPE_NAME", "REMOTE_TYPE"]
remote_col = next((c for c in remote_candidates if c in df.columns), None)
```

```{python}
if remote_col:
    remote_counts = df[remote_col].value_counts().reset_index()
    remote_counts.columns = ["Work Arrangement", "Postings"]

    fig = px.pie(
        remote_counts,
        names="Work Arrangement",
        values="Postings",
        title="Remote vs On-Site Job Postings",
        color_discrete_sequence=px.colors.qualitative.Pastel1
    )
    fig.show()
else:
    print("No remote work field found.")
```

**Insight:**

The majority of postings either do not explicitly tag remote work or are implicitly on-site, with only a modest share labeled as Remote or Hybrid. Where remote options are available, they are concentrated in data, IT, and knowledge-intensive roles that can be performed digitally. For job seekers prioritizing location flexibility, this suggests focusing on analytically oriented roles and employers that have formal remote-work policies rather than assuming remote options are widespread across all occupations.

</div>

<div class="card reveal">

### Geographic Distribution of Jobs (by State)

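Geographic EDA answers: “Which states have the most job opportunities?”
We count postings per city and state (CITY_NAME, STATE_NAME), keep the 100 cities with the most postings, and sum those cities' postings by state to shade the map. State totals therefore reflect only these top cities. City bubbles are spread on a small ring around each state's centroid, not placed at actual city coordinates.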

```{python}
STATE_CENTROIDS = {
    "AL": (32.806, -86.791), "AK": (64.200, -149.493), "AZ": (34.049, -111.094),
    "AR": (35.201, -91.832), "CA": (36.778, -119.418), "CO": (39.550, -105.782),
    "CT": (41.603, -73.087), "DE": (38.910, -75.527), "DC": (38.907, -77.037),
    "FL": (27.664, -81.516), "GA": (32.165, -82.900), "HI": (19.896, -155.582),
    "ID": (44.068, -114.742), "IL": (40.633, -89.398), "IN": (40.267, -86.134),
    "IA": (41.878, -93.098), "KS": (39.012, -98.484), "KY": (37.839, -84.270),
    "LA": (30.984, -91.963), "ME": (45.254, -69.445), "MD": (39.045, -76.641),
    "MA": (42.407, -71.382), "MI": (44.314, -85.602), "MN": (46.729, -94.685),
    "MS": (32.355, -89.398), "MO": (37.964, -91.832), "MT": (46.879, -110.362),
    "NE": (41.492, -99.901), "NV": (38.803, -116.419), "NH": (43.193, -71.572),
    "NJ": (40.058, -74.406), "NM": (34.519, -105.870), "NY": (43.000, -75.000),
    "NC": (35.759, -79.019), "ND": (47.551, -101.002), "OH": (40.417, -82.907),
    "OK": (35.468, -97.516), "OR": (43.804, -120.554), "PA": (41.203, -77.194),
    "RI": (41.580, -71.477), "SC": (33.837, -81.164), "SD": (43.969, -99.901),
    "TN": (35.518, -86.580), "TX": (31.968, -99.901), "UT": (39.321, -111.094),
    "VT": (44.558, -72.577), "VA": (37.431, -78.657), "WA": (47.751, -120.740),
    "WV": (38.597, -80.454), "WI": (43.785, -88.787), "WY": (43.076, -107.290)
}
STATE_NAMES = {
    "AL": "Alabama", "AK": "Alaska", "AZ": "Arizona", 
    "AR": "Arkansas", "CA": "California", "CO": "Colorado",
    "CT": "Connecticut", "DE": "Delaware", 
    "DC": "District of Columbia", "FL": "Florida", 
    "GA": "Georgia", "HI": "Hawaii", "ID": "Idaho", 
    "IL": "Illinois", "IN": "Indiana", "IA": "Iowa", 
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", 
    "ME": "Maine", "MD": "Maryland", "MA": "Massachusetts", 
    "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi", 
    "MO": "Missouri", "MT": "Montana", "NE": "Nebraska", 
    "NV": "Nevada", "NH": "New Hampshire", "NJ": "New Jersey", 
    "NM": "New Mexico", "NY": "New York", "NC": "North Carolina", 
    "ND": "North Dakota", "OH": "Ohio", "OK": "Oklahoma", 
    "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island",
    "SC": "South Carolina", "SD": "South Dakota", "TN": "Tennessee", 
    "TX": "Texas", "UT": "Utah", "VT": "Vermont", 
    "VA": "Virginia", "WA": "Washington", "WV": "West Virginia", 
    "WI": "Wisconsin", "WY": "Wyoming"
}
STATE_NAME_TO_CODE = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR",
    "California": "CA", "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE",
    "District of Columbia": "DC", "Florida": "FL", "Georgia": "GA",
    "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN",
    "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA",
    "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI",
    "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT",
    "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ",
    "New Mexico": "NM", "New York": "NY", "North Carolina": "NC",
    "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR",
    "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC",
    "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT",
    "Vermont": "VT", "Virginia": "VA", "Washington": "WA",
    "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"
}
```

```{python}
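# Filter placeholder cities into a separate frame so later sections still see every posting
df_geo = df[~df["CITY_NAME"].str.contains("Unknown", na=False)].copy()

# Count postings per (city, state) pair
df_city = (
    df_geo
    .dropna(subset=["CITY_NAME", "STATE_NAME"])
    .groupby(["CITY_NAME", "STATE_NAME"], as_index=False)
    .agg(total_job_postings=("CITY_NAME", "size"))
)
# Map full state names to two-letter codes; rows without a match are dropped
df_city["STATE_CODE"] = df_city["STATE_NAME"].map(STATE_NAME_TO_CODE)
df_city = df_city.dropna(subset=["STATE_CODE"])
# Keep only the 100 cities with the most postings
df_city = df_city.nlargest(100, "total_job_postings")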

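# State totals are summed over the top-100 cities only, not over every posting
df_state = (
    df_city.groupby("STATE_CODE", as_index=False)
           .agg(total_postings=("total_job_postings", "sum"))
)

# Approximate city positions: no city-level coordinates are used here, so each
# city is placed on a small circle around its state's centroid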
lats, lons = [], []
counts = df_city.groupby("STATE_CODE")["CITY_NAME"].transform("count")
idxs = df_city.groupby("STATE_CODE").cumcount()

for st_code, k, idx in zip(df_city["STATE_CODE"], counts, idxs):
    base = STATE_CENTROIDS.get(st_code, (37.0, -96.0))  # fallback center of US
    angle = 2 * np.pi * (idx / max(k, 1))
    radius = 0.8

    lats.append(base[0] + radius * np.cos(angle))
    lons.append(base[1] + radius * np.sin(angle))

df_city["lat"] = lats
df_city["lon"] = lons

# Bubble size scaling
max_total = float(df_city["total_job_postings"].max())
size_scale = 40.0 / np.sqrt(max_total) if max_total > 0 else 1.0

fig = go.Figure()
fig.add_trace(go.Choropleth(
    locations=df_state["STATE_CODE"],
    z=df_state["total_postings"],
    locationmode="USA-states",
    colorscale="Blues",
    marker_line_color="white",
    marker_line_width=0.6,
    colorbar_title="Total postings",
    hovertemplate="<b>%{location}</b><br>Total: %{z:,}<extra></extra>",
    zmin=0,
    zmax=df_state["total_postings"].max()
))

# City bubbles
fig.add_trace(go.Scattergeo(
    lon=df_city["lon"],
    lat=df_city["lat"],
    text=df_city["CITY_NAME"],
    mode="markers",
    marker=dict(
        size=np.sqrt(df_city["total_job_postings"]) * size_scale,
        sizemin=5,
        opacity=0.85,
        color="#22c55e",
        line=dict(width=0.8, color="white")
    ),
    customdata=df_city["total_job_postings"],
    hovertemplate=(
        "<b>%{text}</b><br>"
        "Total postings: %{customdata:,}"
        "<extra></extra>"
    ),
    showlegend=False,
    name="City postings"
))

# State labels
label_lats, label_lons, label_names = [], [], []
for code in df_state["STATE_CODE"]:
    if code in STATE_CENTROIDS:
        lat, lon = STATE_CENTROIDS[code]
        label_lats.append(lat)
        label_lons.append(lon)
        label_names.append(STATE_NAMES.get(code, code))

fig.add_trace(go.Scattergeo(
    lat=label_lats,
    lon=label_lons,
    mode="text",
    text=label_names,
    textfont=dict(size=9, color="rgba(0,0,0,0.75)"),
    hoverinfo="skip",
    showlegend=False
))

fig.update_layout(
    title="Job Postings by Location",
    geo=dict(
        scope="usa",
        projection=go.layout.geo.Projection(type="albers usa"),
        showland=True,
        landcolor="#fafafa"
    ),
    # margin=dict(l=100, r=20, t=60, b=60),
    paper_bgcolor="white",
    plot_bgcolor="white"
)
fig.update_traces(
    selector=dict(type="choropleth"),
    colorbar=dict(
        orientation="h",
        x=0.5,
        xanchor="center",
        y=-0.01,
        yanchor="top",
        thickness=12,
        len=0.6
    )
)
fig.show()
```
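
The bubble sizing in the map code takes the square root of each city's posting count before scaling. Plotly's default marker `size` is a diameter, so this keeps marker area roughly proportional to posting volume rather than exaggerating large cities. A minimal sketch of that scaling on made-up counts; the helper name `bubble_diameters` and the toy values are illustrative assumptions, not part of the original analysis:

```python
import numpy as np

def bubble_diameters(counts, max_diameter=40.0):
    """Scale counts to marker diameters so that marker area tracks the count.

    Hypothetical helper mirroring the sqrt scaling used for the city bubbles.
    """
    counts = np.asarray(counts, dtype=float)
    max_count = counts.max() if counts.size else 0.0
    # The largest count maps to max_diameter; guard against an all-zero input
    scale = max_diameter / np.sqrt(max_count) if max_count > 0 else 1.0
    return np.sqrt(counts) * scale

# Toy posting counts (assumed values for illustration only)
toy_counts = [100, 400, 1600]
diameters = bubble_diameters(toy_counts)

# Area grows with diameter squared, so area ratios match count ratios
area_ratios = (diameters / diameters[0]) ** 2
```

With these toy counts the diameters come out as 10, 20, and 40, so a city with 16 times the postings gets 16 times the marker area but only 4 times the diameter.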

**Insight:**

Texas, California, and Florida emerge as national hiring hubs, reflecting their large and diversified economies, population growth, and strong business ecosystems. Many of the remaining top states, such as Virginia, New York, New Jersey, and North Carolina, are anchored by federal agencies, financial centers, or technology clusters. Concentrating job search efforts in these high-volume states can materially increase the number of suitable opportunities available to candidates.
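
Because the choropleth sums postings over the 100 largest cities only, the state ranking is worth checking against all postings. The following is a minimal sketch on toy data, assuming a `STATE_NAME` column like the one used in the map code; the toy rows and the names `state_totals` and `state_share` are illustrative and not part of the original analysis:

```python
import pandas as pd

# Toy postings (assumed values); in the report this would be the cleaned dataset
toy = pd.DataFrame({
    "CITY_NAME": ["Austin", "Dallas", "Miami", "Boise", "Austin"],
    "STATE_NAME": ["Texas", "Texas", "Florida", "Idaho", "Texas"],
})

# Count every posting per state, without restricting to the largest cities
state_totals = (
    toy.dropna(subset=["STATE_NAME"])
       .groupby("STATE_NAME")
       .size()
       .sort_values(ascending=False)
       .rename("total_postings")
)

# Share of all postings that each state accounts for
state_share = (state_totals / state_totals.sum()).round(3)
```

Comparing `state_share` with the map would show whether states with many mid-sized cities are understated by the top-100 cutoff.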

</div>

<div class="card reveal">

### Top Software Skills in Job Postings

To support the Skill Gap Analysis page, we identify which software tools are requested most often.
The chart uses the cleaned `SOFTWARE_SKILLS_NAME` column produced in the Data Cleaning step.

```{python}
if "SOFTWARE_SKILLS_NAME" in df.columns:
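    # Split the comma-separated skill strings into one row per skill mention
    skills = (
        df["SOFTWARE_SKILLS_NAME"]
        .astype(str)
        .str.split(",")
        .explode()
        .str.strip()
    )
    # Drop empty fragments and the "nan" strings that astype(str) creates for missing values
    skills = skills[(skills != "") & (skills != "nan")]

    # Rank skills by number of mentions and keep the 15 most frequent
    top_skills = skills.value_counts().head(15).reset_index()
    top_skills.columns = ["Skill", "Postings"]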
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=top_skills["Postings"],
        y=top_skills["Skill"],
        orientation="h"
    ))
    fig.update_layout(
        title="Top 15 Software Skills",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ),
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ))
    fig.show()
else:
    print("Software skills column not available.")
```

**Insight:**

SQL, Python, and Excel appear as the most universally requested tools, underscoring their status as core competencies for analytics, BI, and operations roles. Tableau and Power BI, together with dashboarding skills more broadly, highlight the premium placed on communicating data insights visually. Cloud and platform skills such as AWS, Azure, and Oracle Cloud are less frequent but still prominent, indicating growing demand for candidates who can work across both traditional analytics stacks and modern cloud-based architectures.
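
The bar chart counts skill mentions, so a skill listed twice in one posting is counted twice. A complementary view is the share of postings that mention each skill at least once. The sketch below shows one way to compute it on toy data, assuming the same comma-separated `SOFTWARE_SKILLS_NAME` format; the toy rows, the `posting_id` and `skill` column names, and the choice to keep postings without skills in the denominator are all assumptions for illustration:

```python
import pandas as pd

# Toy postings (assumed values); one row per posting
toy = pd.DataFrame({
    "SOFTWARE_SKILLS_NAME": ["SQL, Python", "SQL, Excel", None, "Python, SQL, SQL"],
})

# One row per skill mention; the index still identifies the source posting
skills_long = (
    toy["SOFTWARE_SKILLS_NAME"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
skills_long = skills_long[skills_long != ""]

# Keep each (posting, skill) pair once so repeated mentions are not double counted
per_posting = skills_long.reset_index().drop_duplicates()
per_posting.columns = ["posting_id", "skill"]

# Share of all postings, including those with no listed skills, that mention each skill
posting_share = per_posting["skill"].value_counts() / len(toy)
```

In this toy example SQL appears in three of four postings, so its share is 0.75 even though it is mentioned four times.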

</div>

---

<div class="card reveal">

### EDA Summary

The exploratory analysis reveals five consistent themes in the 2024 job market:

- **Industry concentration**: Hiring is heavily concentrated in Professional Services, Administrative & Support Services, and key service sectors, which together generate a large share of postings.

- **Compensation differences**: Median annual salaries vary sharply by industry, with Finance, Information, and Professional Services offering the strongest earnings potential.

- **Remote work structure**: Explicitly remote and hybrid roles remain a minority but are disproportionately found in digital and analytics-intensive occupations.

- **Seasonality**: Job posting activity dips around late June and rebounds sharply in late summer, highlighting the importance of timing in job search strategies.

- **Skill demand**: SQL, Python, Excel, BI tools, and cloud platforms dominate technical requirements, defining a clear “baseline stack” for aspiring data and business analytics professionals.

These insights feed directly into the subsequent Skill Gap Analysis, Machine-Learning Models, and Career Strategy sections by identifying which industries, skills, and time windows matter most for job seekers in 2024.

</div>