Gender Disparities in Labor Force Analysis (2024)


NLP Methods

Text Mining and Skill Extraction from Job Descriptions

Overview

Natural Language Processing (NLP) techniques allow us to extract patterns, themes, and linguistic signals embedded in job descriptions.
While structured data captures industry, salary, skills, and location, unstructured job text reveals employers’ expectations, behavioral traits, and role-specific competencies that are not always encoded in Lightcast’s structured fields.

In this section, we:

  • clean and normalize job description text
  • tokenize and filter low-value words
  • compute word frequencies and visualize them
  • apply TF–IDF to identify high-information terms
  • quantify the presence of technical keywords in descriptions

These insights complement earlier EDA and Skill Gap Analysis results by highlighting how employers talk about data and analytics roles in practice.


Load the Dataset

Code
import pandas as pd
import numpy as np
import re
from collections import Counter
import plotly.express as px
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/lightcast_cleaned.csv", low_memory=False)

# Keep only the columns we need for NLP
df_text = df[["TITLE_NAME", "BODY", "NAICS_2022_2_NAME", "NAICS_2022_2"]].copy()
# df_text.head()

Text Cleaning Pipeline

We apply a lightweight but robust text cleaning pipeline:

  1. Convert all text to lowercase
  2. Remove punctuation, digits, and symbols
  3. Collapse repeated whitespace
  4. Tokenize into individual words
  5. Remove very common “filler” words using a custom stopword list
  6. Keep only alphabetic tokens with length ≥ 4

This produces a semantically meaningful set of tokens for each job description while avoiding heavy external dependencies.

Code
# Simple tokenizer using regex + Python only (no NLTK)
def simple_tokenize(text: str):
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip()
    tokens = text.split()
    return [w for w in tokens if len(w) > 3]

# Custom stopword list (focused on generic English terms)
custom_stopwords = {
    "this","that","with","from","about","there","their","which","have","were",
    "been","also","into","such","they","them","your","will","would","could",
    "should","other","than","some","more","when","what","where","these","those",
    "just","here","very","much","many","most","over","under","while","after",
    "before","still","next","only","each","every","then","because","within",
    "including","using","across","through"
}

def clean_text(t: str) -> str:
    t = str(t).lower()
    t = re.sub(r"[^a-z\s]", " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

# Clean + tokenize BODY text
df_text["clean_body"] = df_text["BODY"].fillna("").astype(str).apply(clean_text)

df_text["tokens"] = df_text["clean_body"].apply(
    lambda text: [w for w in simple_tokenize(text) if w not in custom_stopwords]
)

df_text[["TITLE_NAME", "tokens"]].head()
TITLE_NAME tokens
0 Enterprise Analysts [enterprise, analyst, merchandising, dorado, a...
1 Oracle Consultants [oracle, consultant, reports, augusta, maine, ...
2 Data Analysts [taking, care, people, heart, everything, star...
3 Management Analysts [role, wells, fargo, looking, platform, tools,...
4 Unclassified [comisiones, semana, comiensa, rapido, modesto...
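
As a quick sanity check, the snippet below runs the two helpers on a made-up posting sentence (the sample text is illustrative, not drawn from the dataset). Note that the length filter drops very short tokens such as "sql"; those terms are still captured later by the keyword scan, which searches clean_body directly.

Code
# Illustrative sentence (not from the Lightcast data)
sample = "Seeking a Data Analyst with 3+ years of SQL & Python experience!"

print(clean_text(sample))
# -> seeking a data analyst with years of sql python experience

print([w for w in simple_tokenize(sample) if w not in custom_stopwords])
# -> ['seeking', 'data', 'analyst', 'years', 'python', 'experience']
# "sql" is dropped by the len > 3 filter; "with" is in the custom stopword list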

Word Frequency Extraction

We aggregate a global vocabulary across all postings and compute the most frequently occurring terms.

Code
# Flatten all token lists into a single global vocabulary list
all_words = [w for tokens in df_text["tokens"] for w in tokens]

word_freq = Counter(all_words).most_common(20)
freq_df = pd.DataFrame(word_freq, columns=["word", "count"])
freq_df.head()
word count
0 data 512950
1 experience 361495
2 business 301725
3 work 234020
4 skills 181358

The bar chart below highlights the 20 most common lexical terms appearing across job descriptions.

Code
fig_freq = px.bar(
    freq_df.sort_values("count", ascending=True),
    x="count",
    y="word",
    orientation="h",
    title="Most Common Terms in Job Descriptions",
    labels={"count": "Frequency", "word": "Term"}
)
fig_freq.update_layout(
    xaxis=dict(
        showgrid=True,
        gridcolor="#eaeaea",
    ),
    yaxis=dict(
        categoryorder='total ascending',
    ),
    plot_bgcolor="white",
    paper_bgcolor="white",
    showlegend=False 
)
fig_freq.show()

Word Cloud

A word cloud provides a quick, intuitive view of recurring terms: the larger a word is drawn, the more frequently it appears in the corpus, giving a qualitative impression of what employers emphasize.

Code
wc = WordCloud(
    width=900,
    height=500,
    background_color="white",
    colormap="viridis"
).generate(" ".join(all_words))

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Job Description Terms", fontsize=16)
plt.show()

TF–IDF: High-Information Terms Across Job Descriptions

Raw word counts can be dominated by generic terms (for example, “team,” “support,” “experience”). To focus on high-information terms that distinguish postings, we apply TF–IDF (Term Frequency–Inverse Document Frequency).

TF–IDF assigns higher scores to words that

  • appear frequently within a given posting, but
  • appear in relatively few other postings.
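
To make this intuition concrete, the short toy example below (three made-up one-line documents, not the Lightcast data) shows that a term occurring in every document receives a lower weight than a term unique to one document.

Code
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "data" appears in every document, "tableau" in only one
toy_docs = [
    "data analyst sql reporting",
    "data engineer pipeline airflow",
    "data scientist tableau visualization",
]
toy_vec = TfidfVectorizer()
toy_scores = toy_vec.fit_transform(toy_docs)

vocab = toy_vec.vocabulary_
print(toy_scores[2, vocab["data"]])     # lower weight: term occurs in all documents
print(toy_scores[2, vocab["tableau"]])  # higher weight: term is unique to this document
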
Code
# For performance, optionally sample a subset if the dataset is huge
MAX_DOCS = 15000
if len(df_text) > MAX_DOCS:
    df_tfidf = df_text.sample(n=MAX_DOCS, random_state=42).copy()
else:
    df_tfidf = df_text.copy()

tfidf_vectorizer = TfidfVectorizer(
    max_features=3000,
    min_df=5,
    max_df=0.6,
    stop_words=None  # note: the custom stopword list above was applied to the token lists only, not to clean_body
)

tfidf_matrix = tfidf_vectorizer.fit_transform(df_tfidf["clean_body"])
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
# tfidf_matrix.shape
Code
# Compute the average TF–IDF score per term across all sampled documents;
# .A1 flattens the resulting 1 x n_terms matrix into a 1-D NumPy array
avg_tfidf = tfidf_matrix.mean(axis=0).A1

tfidf_df = (
    pd.DataFrame({"term": feature_names, "score": avg_tfidf})
    .sort_values("score", ascending=False)
)
# tfidf_top = tfidf_df.head(25)
tfidf_df.head(3)
term score
2394 sap 0.053556
1899 oracle 0.035554
2995 your 0.031992
Code
fig_tfidf = px.bar(
    tfidf_df.head(25).sort_values("score", ascending=True),
    x="score",
    y="term",
    orientation="h",
    title="Top 25 High-Information Terms (TF–IDF)",
    labels={"score": "Average TF–IDF Score", "term": "Term"},
    # color="term",  # categorical coloring
    # color_discrete_sequence=px.colors.qualitative.Set3,  # vibrant, premium palette
    # height=700
)
fig_tfidf.update_layout(
    xaxis=dict(
        showgrid=True,
        gridcolor="#eaeaea",
    ),
    yaxis=dict(
        categoryorder='total ascending',
    ),
    plot_bgcolor="white",
    paper_bgcolor="white",
    showlegend=False
)
fig_tfidf.show()

TF–IDF by Industry (Macro Lens)

To connect language patterns with industry context, we compare TF–IDF profiles across a few major NAICS sectors:

  • Information (51)
  • Finance and Insurance (52)
  • Professional, Scientific, and Technical Services (54)
Code
# Map NAICS_2022_2 to readable sector labels
sector_map = {
    51.0: "Information",
    52.0: "Finance & Insurance",
    54.0: "Professional, Scientific & Technical"
}

df_text["NAICS_2022_2"] = df["NAICS_2022_2"].astype(float)
df_text["sector"] = df_text["NAICS_2022_2"].map(sector_map)

df_sector = df_text.dropna(subset=["sector", "clean_body"]).copy()
df_sector["sector"].value_counts()
sector
Professional, Scientific & Technical    22339
Finance & Insurance                      6990
Information                              3771
Name: count, dtype: int64
Code
# Build separate corpora per sector
sector_texts = (
    df_sector
    .groupby("sector")["clean_body"]
    .apply(lambda s: " ".join(s.tolist()))
)

sector_texts
sector
Finance & Insurance                     taking care of people is at the heart of every...
Information                             about lumen lumen connects the world we are ig...
Professional, Scientific & Technical    sr marketing analyst united states ny new york...
Name: clean_body, dtype: object
Code
# Separate vectorizer for sector-level TF–IDF
sector_vectorizer = TfidfVectorizer(
    max_features=2000,
    min_df=1,    # allow terms that appear in at least one sector corpus
    max_df=1.0,  # do not drop terms that appear across multiple sectors
    stop_words=None
)

sector_tfidf = sector_vectorizer.fit_transform(sector_texts.values)
sector_terms = np.array(sector_vectorizer.get_feature_names_out())

sector_tfidf.shape
(3, 2000)
Code
# For each sector, extract its top 15 TF–IDF terms
rows = []
for i, sector_name in enumerate(sector_texts.index):
    row_vec = sector_tfidf[i].toarray().ravel()
    top_idx = row_vec.argsort()[-15:][::-1]
    for idx in top_idx:
        rows.append({
            "sector": sector_name,
            "term": sector_terms[idx],
            "score": row_vec[idx]
        })

sector_tfidf_df = pd.DataFrame(rows)
sector_tfidf_df.head()
sector term score
0 Finance & Insurance and 0.671229
1 Finance & Insurance to 0.368418
2 Finance & Insurance the 0.301178
3 Finance & Insurance of 0.272638
4 Finance & Insurance in 0.171988
Code
fig_sect = px.bar(
    sector_tfidf_df,
    x="score",
    y="term",
    color="sector",
    barmode="group",
    orientation="h",
    title="High-Information Terms by Sector (TF–IDF – Top 15 per Sector)",
    labels={"score": "TF–IDF Score", "term": "Term", "sector": "Sector"}
    # height=800
)
fig_sect.update_layout(
    xaxis=dict(
        showgrid=True,
        gridcolor="#eaeaea",
    ),
    yaxis=dict(
        categoryorder='total ascending',
    ),
    legend=dict(orientation="v", yanchor="top", y=0.25, xanchor="left", x=0.75)
    )
fig_sect.show()
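
The grouped chart above is dominated by generic function words ("and," "to," "the") because clean_body retains them; the custom stopword list was applied only to the token lists. As a minimal sketch (reusing sector_texts and the same vectorizer settings, but adding scikit-learn's built-in English stopword list), the variant below shows how more content-bearing sector terms could be surfaced. The _sw variable names are illustrative, and this variant is not used for the chart shown here.

Code
# Sketch: sector-level TF-IDF with generic English stopwords removed
sector_vectorizer_sw = TfidfVectorizer(
    max_features=2000,
    min_df=1,
    max_df=1.0,
    stop_words="english"   # scikit-learn's built-in English stopword list
)
sector_tfidf_sw = sector_vectorizer_sw.fit_transform(sector_texts.values)
sector_terms_sw = np.array(sector_vectorizer_sw.get_feature_names_out())

# Inspect the top 10 content terms for each sector
for i, sector_name in enumerate(sector_texts.index):
    row_vec = sector_tfidf_sw[i].toarray().ravel()
    top_terms = sector_terms_sw[row_vec.argsort()[-10:][::-1]]
    print(sector_name, "->", list(top_terms))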

Technical Terms Directly from Descriptions

Although Lightcast provides structured software skill fields, job descriptions often repeat these terms inside free text. To quantify this, we scan cleaned descriptions for common technical keywords.

Code
technical_terms = [
    "python", "sql", "excel", "tableau", "powerbi", "power bi",
    "aws", "azure", "cloud", "machine learning",
    "analytics", "analysis", "business intelligence"
]

tech_counts = {}
for term in technical_terms:
    # Match the term as a whole word/phrase; clean_body keeps its spaces, so
    # multi-word phrases such as "power bi" are searched as written, and the
    # word boundaries keep e.g. "excel" from matching "excellent"
    pattern = r"\b" + re.escape(term) + r"\b"
    tech_counts[term] = df_text["clean_body"].str.contains(pattern, regex=True).sum()

tech_df = (
    pd.DataFrame.from_dict(tech_counts, orient="index", columns=["count"])
    .sort_values("count", ascending=True)
    .reset_index()
    .rename(columns={"index": "term"})
)
tech_df.head()
term count
0 machine learning 22
1 business intelligence 56
2 powerbi 3115
3 power bi 3115
4 azure 5580
Code
fig_tech = px.bar(
    tech_df,
    x="count",
    y="term",
    orientation="h",
    title="Technical Terms Found in Job Descriptions",
    labels={"count": "Mentions", "term": "Term"},
    # color="count",
    # color_continuous_scale=px.colors.sequential.GnBu,
    # height=600
)
fig_tech.update_layout(
    xaxis=dict(
        showgrid=True,
        gridcolor="#eaeaea",
    ),
    yaxis=dict(
        categoryorder='total ascending',
    ))
fig_tech.show()

Interpretation & Insights

  • Dominant Themes in Employer Language
    • Global frequency and word clouds show recurring emphasis on experience, support, team, management, and responsibilities.
    • This indicates that employers value not only technical competence but also the ability to operate in collaborative, process-oriented environments.
  • High-Information Terms (TF–IDF)
    • TF–IDF surfaces more specialized vocabulary (e.g., “pipeline,” “analytics,” “visualization,” “governance”) that differentiates advanced analytics roles from generic postings.
    • These terms often correspond to specific project responsibilities or technical domains within data teams.
  • Sector-Specific Vocabulary
    • In Finance & Insurance, TF–IDF highlights concepts related to risk, portfolios, credit, and regulatory reporting.
    • In Professional, Scientific & Technical Services, terms linked to modeling, experimentation, and client delivery become more prominent.
    • In Information, we see emphasis on platforms, content, and digital products.
    • This demonstrates how sector context shapes the language of “data work.”
  • Technical Signal in Text
    • The repeated presence of terms such as SQL, Python, Excel, Tableau, Power BI, AWS, and Azure inside descriptions reinforces the structured skill patterns observed in the EDA section.
    • These technologies function as “linguistic anchors” that clearly distinguish analytics roles from general business positions.
  • Implications for Job Seekers
    • Candidates aligning with these text-level signals (SQL + Python + BI + cloud) are better positioned for data-intensive roles.
    • Beyond tools, employers repeatedly emphasize concepts related to analysis, insights, decision-making, and stakeholder communication — confirming that soft skills and interpretation capabilities matter as much as raw coding ability.
    • Understanding how employers write about roles helps job seekers tailor resumes, cover letters, and LinkedIn profiles using the same vocabulary that appears in successful job descriptions.

Taken together, the NLP results provide a qualitative, language-based complement to our structured EDA and Skill Gap Analysis, strengthening the case for a hybrid skill profile: technical depth, sector awareness, and communication skills that translate data into decisions.
