Exploratory Data Analysis

Job Postings, Salaries, Geography, and Remote Work Patterns

EDA Overview

This page explores hiring trends, salary patterns, geographic variation, remote work availability, and software skill demands using the cleaned Lightcast dataset produced earlier.

Each visualization is selected to answer job seeker–focused questions:

Which industries are hiring the most?
How do salaries vary across sectors?
How common are remote roles?
How has demand changed over time?
Which software skills are most requested?

Load the Cleaned Dataset

Code

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import re

df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)
#df.head()
#df.info()
#list(df.columns)

Job Postings by Industry (Top Sectors Hiring)

Code

industry_candidates = [
    "NAICS_2022_2_NAME", 
    "INDUSTRY_NAME", 
    "NAICS_2022_3_NAME"]
industry_col = next(
    (c for c in industry_candidates if c in df.columns), 
    None)
industry_col

'NAICS_2022_2_NAME'

Code

if industry_col:
    industry_counts = (
        df[industry_col]
        .value_counts()
        .head(15)
        .reset_index()
    )
    industry_counts.columns = ["Industry", "Postings"]
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=industry_counts["Postings"],
        y=industry_counts["Industry"],
        orientation="h"
    ))
    fig.update_layout(
        title=f"Top 15 Industries by Job Postings ({industry_col})",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ))
    fig.show()

else:
    print("No industry column found.")

Insight:

Professional, Scientific & Technical Services and Administrative & Support Services account for the largest share of job postings, indicating sustained demand for analytical, operational, and client-facing roles. These sectors typically exhibit continuous hiring cycles driven by project-based work, business expansion, and comparatively high turnover. For job seekers, they represent strong entry points into the labor market with broad role variety and relatively frequent openings.
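To put a number on "largest share", the counts behind the bar chart can be turned into percentages and a running total. The sketch below is illustrative only: the industry labels and counts are toy values, not figures from the Lightcast file, and on the real data the same call would presumably be applied to df[industry_col].

```python
import pandas as pd

# Toy stand-in for df[industry_col]; labels and counts are made up.
toy = pd.Series(
    ["Professional Services"] * 5
    + ["Administrative & Support"] * 3
    + ["Finance"] * 2,
    name="Industry",
)

# normalize=True returns each category's fraction of all non-null rows.
share = toy.value_counts(normalize=True).mul(100).round(1)

# The cumulative share shows how quickly the top sectors add up.
summary = pd.DataFrame({"share_pct": share, "cum_share_pct": share.cumsum()})
print(summary)
```

A cumulative column makes statements such as "the top two sectors account for most postings" directly checkable.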

Annual Salary - Trend Over Time

Code

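# plotly.graph_objects (go) is already imported in the loading step above

# Keep only postings that report both a salary and a posting month
df2 = df.dropna(subset=["ANNUAL_SALARY", "month"]).copy()

# Group by month (string); "postings" therefore counts only postings that list a salary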
agg = (
    df2.groupby("month")
    .agg(
        postings=("ANNUAL_SALARY", "size"), 
        avg_salary=("ANNUAL_SALARY", "mean")
        ).reset_index().sort_values("month")
)

# Dual-axis chart: bars use the left axis (y), the salary line uses the right axis (y2)
fig = go.Figure()
fig.add_trace(go.Bar(
    x=agg["month"],
    y=agg["postings"],
    name="Job Postings",
    marker_color="#ff7043",
    yaxis="y1",
    hovertemplate="Job postings: %{y:,}<extra></extra>"
))
fig.add_trace(go.Scatter(
    x=agg["month"],
    y=agg["avg_salary"],
    name="Average Annual Salary",
    mode="lines+markers",
    yaxis="y2",
    line=dict(color="#1f77b4", width=3),
    hovertemplate="Salary: $%{y:,.0f}<extra></extra>"
))

fig.update_layout(
    title="Annual Salary - Trend Over Time",
    template="plotly_white",
    hovermode="x unified",
    legend=dict(
        orientation="h", 
        yanchor="bottom", 
        y=1.02, 
        xanchor="right", 
        x=1),
    xaxis=dict(
        title="Month",
        tickformat="%b %Y",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis=dict(
        title="Job Postings",
        showgrid=True,
        gridcolor="#eaeaea"
    ),
    yaxis2=dict(
        title="Average Annual Salary ($)",
        overlaying="y",
        side="right",
        tickprefix="$",
        showgrid=False
    )
)
fig.show()

Insight:

The chart suggests a seasonal pattern in the job market. Job postings stay fairly strong from May through September, with a small dip in July before demand rises again in late summer. Advertised salaries move differently: they fall steadily from May to July, reach their lowest point in mid-summer, and then jump sharply in September to the highest level in the period.

This contrast suggests that early-summer postings skew toward lower-paying or junior roles, while late summer and early fall bring back higher-paying positions. In this dataset, September combines strong hiring volume with the highest average advertised salaries, which makes it a favorable window for job seekers targeting better-paid roles.
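One way to probe whether the September jump reflects a broad shift in the role mix or a handful of very high postings is to compare the monthly mean with the monthly median. The sketch below uses made-up months and salaries, not values from the Lightcast file; only the column names month and ANNUAL_SALARY follow the code above.

```python
import pandas as pd

# Toy data: hypothetical months and salaries, for illustration only.
toy = pd.DataFrame({
    "month": ["2024-08"] * 4 + ["2024-09"] * 4,
    "ANNUAL_SALARY": [80_000, 90_000, 100_000, 110_000,
                      80_000, 90_000, 100_000, 400_000],
})

monthly = (
    toy.groupby("month")["ANNUAL_SALARY"]
    .agg(postings="size", mean_salary="mean", median_salary="median")
    .reset_index()
)

# A large gap between mean and median points to a few very high salaries.
monthly["mean_minus_median"] = monthly["mean_salary"] - monthly["median_salary"]
print(monthly)
```

If the median rises along with the mean, the increase is broad-based; if only the mean moves, a small number of outliers is the more likely explanation.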

Salary Distribution by Industry

Salary variation across industries answers: “Which sectors pay more, and how unequal are wages within each industry?” We use the annualized salary field (ANNUAL_SALARY), which consolidates raw salary data from SALARY_FROM, SALARY_TO, and SALARY and converts it into a consistent yearly format based on reported pay period. This provides a clean, comparable measure of compensation across all industries.
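The annualization itself happens in the Data Cleaning step and is not shown on this page. As a rough sketch of what such a consolidation could look like, the function below takes the midpoint of a posted range when both ends exist, otherwise falls back to a single value, and scales by pay period. The pay-period labels and the multipliers (for example 2,080 hours per year) are assumptions made for illustration, not the rules actually used to build ANNUAL_SALARY.

```python
import math
from typing import Optional

# Assumed conversion factors; the labels are hypothetical.
# "hour" assumes 40 hours per week for 52 weeks.
PERIODS_PER_YEAR = {"year": 1, "month": 12, "week": 52, "day": 260, "hour": 2080}


def _missing(value) -> bool:
    """True for None or NaN, the two ways a salary field can be empty."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def annualize(salary_from, salary_to, salary, pay_period) -> Optional[float]:
    """Return a yearly salary, or None when it cannot be derived."""
    factor = PERIODS_PER_YEAR.get(pay_period)
    if factor is None:
        return None  # unknown pay period: leave the salary missing
    if not _missing(salary_from) and not _missing(salary_to):
        base = (salary_from + salary_to) / 2  # midpoint of the posted range
    else:
        # Fall back to the first field that is present.
        candidates = (salary, salary_from, salary_to)
        base = next((v for v in candidates if not _missing(v)), None)
    return None if base is None else float(base * factor)


print(annualize(20, 30, None, "hour"))
print(annualize(None, None, 5000, "month"))
```

Using the midpoint keeps wide ranges from being represented by their most optimistic end, at the cost of hiding the width of the range itself.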

Code

# industry_col was detected in the industry section above
salary_col = "ANNUAL_SALARY" if "ANNUAL_SALARY" in df.columns else None

if industry_col and salary_col:

    # Filter valid salaries
    df_salary = df[[industry_col, salary_col]].dropna()

    # Remove noise categories that distort analysis (copy to avoid chained-assignment warnings)
    df_salary = df_salary[df_salary[industry_col] != "Unclassified Industry"].copy()

    # Cap extreme values at 300k for readability (still accurate for 95% of postings)
    df_salary[salary_col] = df_salary[salary_col].clip(upper=300000)

    # Compute median salary per industry for ordering
    medians = (
        df_salary.groupby(industry_col)[salary_col]
        .median()
        .sort_values(ascending=True)
    )

    ordered_categories = medians.index.tolist()

    fig = px.box(
        df_salary,
        y=industry_col,
        x=salary_col,
        category_orders={industry_col: ordered_categories},
        title="Annual Salary Distribution by Industry",
        color=industry_col,
        color_discrete_sequence=px.colors.qualitative.Vivid,
        height=900
    )

    fig.update_layout(
        showlegend=False,
        xaxis_title="Annual Salary (USD)",
        yaxis_title="Industry",
        margin=dict(l=120)
    )

    fig.show()

else:
    print("Salary chart not generated: Missing industry or salary column.")

Insight:

Annual salary levels differ substantially across industries, with Finance, Information, and Professional Services exhibiting the highest median compensation. The wide dispersion of salaries within these sectors reflects variation in seniority, specialization, and credential requirements. In contrast, service-oriented industries such as Retail and Accommodation & Food Services show lower but more compressed salary ranges, consistent with narrower variation in role complexity. These patterns suggest that targeted upskilling toward high-skill roles in Finance, Information, and Professional Services is likely to yield the greatest earnings uplift.
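The "wide" versus "compressed" comparison can be made concrete with the interquartile range (IQR), which is the width of each box in the plot. The sketch below runs on toy salaries with a hypothetical industry column; the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy data: two hypothetical industries with made-up salaries.
toy = pd.DataFrame({
    "industry": ["Finance"] * 5 + ["Retail"] * 5,
    "ANNUAL_SALARY": [60_000, 90_000, 120_000, 150_000, 180_000,
                      30_000, 35_000, 40_000, 45_000, 50_000],
})

grouped = toy.groupby("industry")["ANNUAL_SALARY"]
spread = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})

# IQR = Q3 - Q1: the salary band covering the middle half of postings.
spread["iqr"] = spread["q3"] - spread["q1"]
print(spread.sort_values("median", ascending=False))
```

Because the IQR ignores the tails, it is not affected by the 300k cap applied before plotting unless a quartile itself exceeds the cap.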

Remote vs. On-Site Job Patterns

Remote work availability is a crucial factor for job seekers.
This chart shows how all postings split across remote, hybrid, and in-person arrangements. Here, [None] represents postings where Lightcast did not provide a remote-work tag, which typically corresponds to on-site or unspecified arrangements.

Code

remote_candidates = ["REMOTE_TYPE_NAME", "REMOTE_TYPE"]
remote_col = next((c for c in remote_candidates if c in df.columns), None)

Code

if remote_col:
    remote_counts = df[remote_col].value_counts().reset_index()
    remote_counts.columns = ["Work Arrangement", "Postings"]

    fig = px.pie(
        remote_counts,
        names="Work Arrangement",
        values="Postings",
        title="Remote vs On-Site Job Postings",
        color_discrete_sequence=px.colors.qualitative.Pastel1
    )
    fig.show()
else:
    print("No remote work field found.")

Insight:

The majority of postings either do not explicitly tag remote work or are implicitly on-site, with only a modest share labeled as Remote or Hybrid. Where remote options are available, they are concentrated in data, IT, and knowledge-intensive roles that can be performed digitally. For job seekers prioritizing location flexibility, this suggests focusing on analytically oriented roles and employers that have formal remote-work policies rather than assuming remote options are widespread across all occupations.
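The pie chart gives the overall split only. To check where remote roles concentrate, one option is a row-normalized cross-tabulation of industry against work arrangement. The sketch below uses toy rows; the column names industry and remote_type and the labels are placeholders, and on the real data the columns would presumably be industry_col and remote_col.

```python
import pandas as pd

# Toy rows: hypothetical industries and work-arrangement labels.
toy = pd.DataFrame({
    "industry": ["Information"] * 4 + ["Retail"] * 4,
    "remote_type": ["Remote", "Remote", "Hybrid Remote", "[None]",
                    "[None]", "[None]", "[None]", "Remote"],
})

# normalize="index" makes each row sum to 1, so values are within-industry shares.
mix = (
    pd.crosstab(toy["industry"], toy["remote_type"], normalize="index")
    .mul(100)
    .round(1)
)
print(mix)
```

Normalizing by row rather than by the grand total keeps large industries from dominating the comparison.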

Geographic Distribution of Jobs (by State)

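Geographic EDA answers: “Which states have the most job opportunities?” We use STATE_NAME (cleaned) to compute the total number of postings per state and overlay the 100 cities with the most postings as bubbles. The bubbles are arranged around each state's centroid rather than at true city coordinates.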

Code

STATE_CENTROIDS = {
    "AL": (32.806, -86.791), "AK": (64.200, -149.493), "AZ": (34.049, -111.094),
    "AR": (35.201, -91.832), "CA": (36.778, -119.418), "CO": (39.550, -105.782),
    "CT": (41.603, -73.087), "DE": (38.910, -75.527), "DC": (38.907, -77.037),
    "FL": (27.664, -81.516), "GA": (32.165, -82.900), "HI": (19.896, -155.582),
    "ID": (44.068, -114.742), "IL": (40.633, -89.398), "IN": (40.267, -86.134),
    "IA": (41.878, -93.098), "KS": (39.012, -98.484), "KY": (37.839, -84.270),
    "LA": (30.984, -91.963), "ME": (45.254, -69.445), "MD": (39.045, -76.641),
    "MA": (42.407, -71.382), "MI": (44.314, -85.602), "MN": (46.729, -94.685),
    "MS": (32.355, -89.398), "MO": (37.964, -91.832), "MT": (46.879, -110.362),
    "NE": (41.492, -99.901), "NV": (38.803, -116.419), "NH": (43.193, -71.572),
    "NJ": (40.058, -74.406), "NM": (34.519, -105.870), "NY": (43.000, -75.000),
    "NC": (35.759, -79.019), "ND": (47.551, -101.002), "OH": (40.417, -82.907),
    "OK": (35.468, -97.516), "OR": (43.804, -120.554), "PA": (41.203, -77.194),
    "RI": (41.580, -71.477), "SC": (33.837, -81.164), "SD": (43.969, -99.901),
    "TN": (35.518, -86.580), "TX": (31.968, -99.901), "UT": (39.321, -111.094),
    "VT": (44.558, -72.577), "VA": (37.431, -78.657), "WA": (47.751, -120.740),
    "WV": (38.597, -80.454), "WI": (43.785, -88.787), "WY": (43.076, -107.290)
}
STATE_NAMES = {
    "AL": "Alabama", "AK": "Alaska", "AZ": "Arizona", 
    "AR": "Arkansas", "CA": "California", "CO": "Colorado",
    "CT": "Connecticut", "DE": "Delaware", 
    "DC": "District of Columbia", "FL": "Florida", 
    "GA": "Georgia", "HI": "Hawaii", "ID": "Idaho", 
    "IL": "Illinois", "IN": "Indiana", "IA": "Iowa", 
    "KS": "Kansas", "KY": "Kentucky", "LA": "Louisiana", 
    "ME": "Maine", "MD": "Maryland", "MA": "Massachusetts", 
    "MI": "Michigan", "MN": "Minnesota", "MS": "Mississippi", 
    "MO": "Missouri", "MT": "Montana", "NE": "Nebraska", 
    "NV": "Nevada", "NH": "New Hampshire", "NJ": "New Jersey", 
    "NM": "New Mexico", "NY": "New York", "NC": "North Carolina", 
    "ND": "North Dakota", "OH": "Ohio", "OK": "Oklahoma", 
    "OR": "Oregon", "PA": "Pennsylvania", "RI": "Rhode Island",
    "SC": "South Carolina", "SD": "South Dakota", "TN": "Tennessee", 
    "TX": "Texas", "UT": "Utah", "VT": "Vermont", 
    "VA": "Virginia", "WA": "Washington", "WV": "West Virginia", 
    "WI": "Wisconsin", "WY": "Wyoming"
}
STATE_NAME_TO_CODE = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR",
    "California": "CA", "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE",
    "District of Columbia": "DC", "Florida": "FL", "Georgia": "GA",
    "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN",
    "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA",
    "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI",
    "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT",
    "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ",
    "New Mexico": "NM", "New York": "NY", "North Carolina": "NC",
    "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR",
    "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC",
    "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT",
    "Vermont": "VT", "Virginia": "VA", "Washington": "WA",
    "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"
}

Code

# Filter into a separate frame so the "Unknown" city filter does not alter df for later sections
df_geo = df[~df["CITY_NAME"].str.contains("Unknown", na=False)].copy()
df_city = (
    df_geo
    .dropna(subset=["CITY_NAME", "STATE_NAME"])
    .groupby(["CITY_NAME", "STATE_NAME"], as_index=False)
    .agg(total_job_postings=("CITY_NAME", "size"))
)
df_city["STATE_CODE"] = df_city["STATE_NAME"].map(STATE_NAME_TO_CODE)
df_city = df_city.dropna(subset=["STATE_CODE"])

# State totals use every city, so compute them before trimming to the top 100 for the bubbles
df_state = (
    df_city.groupby("STATE_CODE", as_index=False)
           .agg(total_postings=("total_job_postings", "sum"))
)
df_city = df_city.nlargest(100, "total_job_postings")
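# Synthetic bubble positions (not true city coordinates): spread each state's cities
# evenly on a circle of radius 0.8 degrees around the state centroid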
lats, lons = [], []
counts = df_city.groupby("STATE_CODE")["CITY_NAME"].transform("count")
idxs = df_city.groupby("STATE_CODE").cumcount()

for st_code, k, idx in zip(df_city["STATE_CODE"], counts, idxs):
    base = STATE_CENTROIDS.get(st_code, (37.0, -96.0))  # fallback center of US
    angle = 2 * np.pi * (idx / max(k, 1))
    radius = 0.8

    lats.append(base[0] + radius * np.cos(angle))
    lons.append(base[1] + radius * np.sin(angle))

df_city["lat"] = lats
df_city["lon"] = lons

# Bubble size scaling
max_total = float(df_city["total_job_postings"].max())
size_scale = 40.0 / np.sqrt(max_total) if max_total > 0 else 1.0

fig = go.Figure()
fig.add_trace(go.Choropleth(
    locations=df_state["STATE_CODE"],
    z=df_state["total_postings"],
    locationmode="USA-states",
    colorscale="Blues",
    marker_line_color="white",
    marker_line_width=0.6,
    colorbar_title="Total postings",
    hovertemplate="<b>%{location}</b><br>Total: %{z:,}<extra></extra>",
    zmin=0,
    zmax=df_state["total_postings"].max()
))

# City bubbles
fig.add_trace(go.Scattergeo(
    lon=df_city["lon"],
    lat=df_city["lat"],
    text=df_city["CITY_NAME"],
    mode="markers",
    marker=dict(
        size=np.sqrt(df_city["total_job_postings"]) * size_scale,
        sizemin=5,
        opacity=0.85,
        color="#22c55e",
        line=dict(width=0.8, color="white")
    ),
    customdata=df_city["total_job_postings"],
    hovertemplate=(
        "<b>%{text}</b><br>"
        "Total postings: %{customdata:,}"
        "<extra></extra>"
    ),
    showlegend=False,
    name="City postings"
))

# State labels
label_lats, label_lons, label_names = [], [], []
for code in df_state["STATE_CODE"]:
    if code in STATE_CENTROIDS:
        lat, lon = STATE_CENTROIDS[code]
        label_lats.append(lat)
        label_lons.append(lon)
        label_names.append(STATE_NAMES.get(code, code))

fig.add_trace(go.Scattergeo(
    lat=label_lats,
    lon=label_lons,
    mode="text",
    text=label_names,
    textfont=dict(size=9, color="rgba(0,0,0,0.75)"),
    hoverinfo="skip",
    showlegend=False
))

fig.update_layout(
    title="Job postings by Location",
    geo=dict(
        scope="usa",
        projection=go.layout.geo.Projection(type="albers usa"),
        showland=True,
        landcolor="#fafafa"
    ),
    # margin=dict(l=100, r=20, t=60, b=60),
    paper_bgcolor="white",
    plot_bgcolor="white"
)
fig.update_traces(
    selector=dict(type="choropleth"),
    colorbar=dict(
        orientation="h",
        x=0.5,
        xanchor="center",
        y=-0.01,
        yanchor="top",
        thickness=12,
        len=0.6
    )
)
fig.show()

Insight:

Texas, California, and Florida emerge as national hiring hubs, reflecting their large and diversified economies, population growth, and strong business ecosystems. Many of the remaining top states, such as Virginia, New York, New Jersey, and North Carolina, are anchored by federal agencies, financial centers, or technology clusters. Concentrating job search efforts in these high-volume states can materially increase the number of suitable opportunities available to candidates.
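How concentrated hiring is in the leading states can be summarized by the share of postings held by the top few. The sketch below counts postings per state on toy rows; the counts are invented, and only the STATE_NAME column name follows the dataset.

```python
import pandas as pd

# Toy rows: one row per posting, with made-up state frequencies.
toy = pd.DataFrame({
    "STATE_NAME": ["Texas"] * 4 + ["California"] * 3 + ["Florida"] * 2 + ["Ohio"],
})

# value_counts sorts states from most to fewest postings.
state_totals = toy["STATE_NAME"].value_counts()

# Share of all postings held by the three largest states.
top3_share = state_totals.head(3).sum() / state_totals.sum() * 100
print(state_totals)
print(f"Top 3 states: {top3_share:.1f}% of postings")
```

Raw counts favor populous states, so a per-capita version would need an external population table, which this page does not use.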

Top Software Skills in Job Postings

To support the Skill Gap Analysis page, we identify which software tools are requested most often. This uses the cleaned SOFTWARE_SKILLS_NAME column produced in the Data Cleaning step.

Code

if "SOFTWARE_SKILLS_NAME" in df.columns:
    skills = (
        df["SOFTWARE_SKILLS_NAME"]
        .astype(str)
        .str.split(",")
        .explode()
        .str.strip()
    )
    skills = skills[(skills != "") & (skills != "nan")]

    top_skills = skills.value_counts().head(15).reset_index()
    top_skills.columns = ["Skill", "Postings"]

    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=top_skills["Postings"],
        y=top_skills["Skill"],
        orientation="h"
    ))
    fig.update_layout(
        title="Top 15 Software Skills",
        plot_bgcolor="white",
        margin=dict(l=10, r=10, t=60, b=10),
        height=500,
        xaxis=dict(
            showgrid=True,
            gridcolor="#eaeaea",
        ),
        yaxis=dict(
            categoryorder='total ascending',
        ),
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ))
    fig.show()

else:
    print("Software skills column not available.")

Insight:

SQL, Python, and Excel appear as the most universally requested tools, underscoring their status as core competencies for analytics, BI, and operations roles. Tableau and Power BI, together with dashboarding skills more broadly, highlight the premium placed on communicating data insights visually. Cloud and platform skills such as AWS, Azure, and Oracle Cloud are less frequent but still prominent, indicating growing demand for candidates who can work across both traditional analytics stacks and modern cloud-based architectures.

EDA Summary

The exploratory analysis reveals five consistent themes in the 2024 job market:

Industry concentration: Hiring is heavily concentrated in Professional Services, Administrative & Support Services, and key service sectors, which together generate a large share of postings.
Compensation differences: Median annual salaries vary sharply by industry, with Finance, Information, and Professional Services offering the strongest earnings potential.
Remote work structure: Explicitly remote and hybrid roles remain a minority but are disproportionately found in digital and analytics-intensive occupations.
Seasonality: Job posting activity dips slightly in July and rebounds in late summer, highlighting the importance of timing in job search strategies.
Skill demand: SQL, Python, Excel, BI tools, and cloud platforms dominate technical requirements, defining a clear “baseline stack” for aspiring data and business analytics professionals.

These insights directly feed into the subsequent Skill Gap Analysis, ML Methods, and Career Strategy sections by identifying which industries, skills, and time windows matter most for job seekers in 2024.