library(tidyverse)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
vg <- read.csv('vgchartz-2024.csv')
Classification Tree
Developed a classification model to predict video game sales performance using a real-world video game dataset. Preprocessed the data through binning, factor conversion, and top-category filtering. Built, visualized, and pruned decision trees using cross-validation, and evaluated model performance using confusion matrices.
This work is part of an assignment for the AD699 Data Mining course.
This dataset contains information about video games, including the title, the console it was developed for, genre, publisher, developer, critic rating, various sales figures, release date, and, if available, the last update date. While some titles appear multiple times, other variables indicate that these entries correspond to the same game released on different consoles or in different versions, such as remakes. The critic_score represents the overall rating of a game, with a maximum score of 10. The dataset also includes sales data across different regions (NA, JP, PAL, Other), and the differences between these figures suggest how a game's popularity varies across geographical areas. Furthermore, the dataset contains a large number of NAs.
After applying an equal-frequency binning method to our outcome variable, we can see that the data is distributed roughly evenly across three groups: Low, Medium, and High.
vg$total_sales <- cut(vg$total_sales,
                      breaks = quantile(vg$total_sales, probs = seq(0, 1, length.out = 4), na.rm = TRUE),
                      include.lowest = TRUE,
                      labels = c("Low", "Medium", "High"))
table(vg$total_sales)
Low Medium High
6339 6386 6197
str(vg)
'data.frame': 64016 obs. of 14 variables:
$ img : chr "/games/boxart/full_6510540AmericaFrontccc.jpg" "/games/boxart/full_5563178AmericaFrontccc.jpg" "/games/boxart/827563ccc.jpg" "/games/boxart/full_9218923AmericaFrontccc.jpg" ...
$ title : chr "Grand Theft Auto V" "Grand Theft Auto V" "Grand Theft Auto: Vice City" "Grand Theft Auto V" ...
$ console : chr "PS3" "PS4" "PS2" "X360" ...
$ genre : chr "Action" "Action" "Action" "Action" ...
$ publisher : chr "Rockstar Games" "Rockstar Games" "Rockstar Games" "Rockstar Games" ...
$ developer : chr "Rockstar North" "Rockstar North" "Rockstar North" "Rockstar North" ...
$ critic_score: num 9.4 9.7 9.6 NA 8.1 8.7 8.8 9.8 8.4 8 ...
$ total_sales : Factor w/ 3 levels "Low","Medium",..: 3 3 3 3 3 3 3 3 3 3 ...
$ na_sales : num 6.37 6.06 8.41 9.06 6.18 9.07 9.76 5.26 8.27 4.99 ...
$ jp_sales : num 0.99 0.6 0.47 0.06 0.41 0.13 0.11 0.21 0.07 0.65 ...
$ pal_sales : num 9.85 9.71 5.49 5.33 6.05 4.29 3.73 6.21 4.32 5.88 ...
$ other_sales : num 3.12 3.02 1.78 1.42 2.44 1.33 1.14 2.26 1.2 2.28 ...
$ release_date: chr "2013-09-17" "2014-11-18" "2002-10-28" "2013-09-17" ...
$ last_update : chr "" "2018-01-03" "" "" ...
vg$img <- as.factor(vg$img)
vg$title <- as.factor(vg$title)
vg$console <- as.factor(vg$console)
vg$genre <- as.factor(vg$genre)
vg$publisher <- as.factor(vg$publisher)
vg$developer <- as.factor(vg$developer)
vg$release_date <- as.factor(vg$release_date)
vg$last_update <- as.factor(vg$last_update)
str(vg)
'data.frame': 64016 obs. of 14 variables:
$ img : Factor w/ 56177 levels "/games/boxart/1000386ccc.jpg",..: 31526 27433 6702 42887 25079 47407 47389 23589 12344 23571 ...
$ title : Factor w/ 39798 levels "_summer ##","- Arcane preRaise -",..: 13767 13767 13779 13767 5281 5291 5276 27100 5283 5283 ...
$ console : Factor w/ 81 levels "2600","3DO","3DS",..: 57 58 56 76 58 76 76 58 76 57 ...
$ genre : Factor w/ 20 levels "Action","Action-Adventure",..: 1 1 1 1 16 16 16 2 16 16 ...
$ publisher : Factor w/ 3383 levels "][ Games","@unepic_fran",..: 2514 2514 2514 2514 84 84 84 2514 84 84 ...
$ developer : Factor w/ 8863 levels "",".theprodukkt",..: 6580 6580 6580 6580 8072 3787 8072 6576 8072 8072 ...
$ critic_score: num 9.4 9.7 9.6 NA 8.1 8.7 8.8 9.8 8.4 8 ...
$ total_sales : Factor w/ 3 levels "Low","Medium",..: 3 3 3 3 3 3 3 3 3 3 ...
$ na_sales : num 6.37 6.06 8.41 9.06 6.18 9.07 9.76 5.26 8.27 4.99 ...
$ jp_sales : num 0.99 0.6 0.47 0.06 0.41 0.13 0.11 0.21 0.07 0.65 ...
$ pal_sales : num 9.85 9.71 5.49 5.33 6.05 4.29 3.73 6.21 4.32 5.88 ...
$ other_sales : num 3.12 3.02 1.78 1.42 2.44 1.33 1.14 2.26 1.2 2.28 ...
$ release_date: Factor w/ 7923 levels "","1971-12-03",..: 6007 6357 2845 6007 6573 5506 5197 7244 5769 5769 ...
$ last_update : Factor w/ 1546 levels "","2017-11-28",..: 1 15 1 1 26 1 1 285 109 109 ...
After converting all character variables to factors, our dataset contains only numeric and factor variables.
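As a side note, the same conversion can be written more compactly with dplyr; a minimal sketch, assuming the tidyverse is already loaded:

# Equivalent shortcut: convert every remaining character column to a factor
vg <- vg %>% mutate(across(where(is.character), as.factor))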
Next we are going to do some “topN” filtering.
top_7_Console <- vg %>%
  count(console, sort = TRUE) %>%
  slice(1:7)
top_7_Console
console n
1 PC 12617
2 PS2 3565
3 DS 3288
4 PS4 2878
5 PS 2707
6 NS 2337
7 XBL 2120
The 7 most popular console types are PC, PS2, DS, PS4, PS, NS, and XBL. There is also a huge gap between PC and the remaining values, making PC by far the most common platform. This is likely due to its open ecosystem, which allows users to install and download games freely without strict platform restrictions.
top_7_Genre <- vg %>%
  count(genre, sort = TRUE) %>%
  slice(1:7)
top_7_Genre
genre n
1 Misc 9304
2 Action 8557
3 Adventure 6260
4 Role-Playing 5721
5 Sports 5586
6 Shooter 5410
7 Platform 4001
The 7 most common genres are Misc, Action, Adventure, Role-Playing, Sports, Shooter, and Platform. The most prevalent genre is Miscellaneous, likely because it absorbs games that don't fit into traditional categories or are poorly documented. Action games are the second most common, with Adventure games ranking third. The popularity of platform games (e.g., Mario or Sonic) shows that traditional gameplay styles still have a strong place in the market.
top_7_Publisher <- vg %>%
  count(publisher, sort = TRUE) %>%
  slice(1:7)
top_7_Publisher
publisher n
1 Unknown 8842
2 Sega 2207
3 Ubisoft 1663
4 Electronic Arts 1619
5 Activision 1582
6 Konami 1544
7 Nintendo 1476
The 7 most common publishers are Unknown, Sega, Ubisoft, Electronic Arts, Activision, Konami, and Nintendo. Notably, 8,842 records lack publisher information and are categorized under "Unknown." Among named publishers, Sega remains a key player, while Ubisoft, EA, and Activision have published many of the most popular games of recent years.
top_7_Developer <- vg %>%
  count(developer, sort = TRUE) %>%
  slice(1:7)
top_7_Developer
developer n
1 Unknown 4435
2 Konami 976
3 Sega 915
4 Capcom 870
5 Namco 489
6 Square Enix 425
7 SNK Corporation 408
The 7 most common developers are Unknown, Konami, Sega, Capcom, Namco, Square Enix, and SNK Corporation. Similarly, 4,435 records lack developer information. Several major companies, such as Sega and Konami, also appear among the top publishers, reinforcing their influence in the industry.
vg <- vg %>% filter(console %in% top_7_Console$console &
                    genre %in% top_7_Genre$genre &
                    publisher %in% top_7_Publisher$publisher &
                    developer %in% top_7_Developer$developer)
nrow(vg)
[1] 939
After applying these filters, the dataset is reduced to 939 rows.
vg <- droplevels(vg)
nrow(vg)
[1] 939
The dataset still contains 939 records; droplevels() removes only the unused factor levels left over from filtering, not any rows.
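A quick sanity check (expected values, not output from the original run) confirms that the unused levels are gone:

# After droplevels(), each filtered factor should have at most 7 levels
nlevels(vg$console)
nlevels(vg$genre)
nlevels(vg$publisher)
nlevels(vg$developer)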
set.seed(79)
vg.index <- sample(c(1:nrow(vg)), nrow(vg)*0.6)
vg_train <- vg[vg.index, ]
vg_valid <- vg[-vg.index, ]
Next we will use rpart.plot to display a classification tree that depicts our model.
model <- rpart(total_sales ~ console + genre + publisher + developer + critic_score, method = 'class', data = vg_train)
rpart.plot(model)
The initial plot uses default parameters, with the tree rooted at a split on the console variable. Each node represents a total_sales class (Low, Medium, High), differentiated by color. Within each node, we see the class probabilities and the percentage of observations classified at that node.
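The same node information can also be read as text by printing the fitted model; a quick alternative view:

# Text view of the tree: split conditions, n, loss, fitted class, and class probabilities
print(model)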
rpart.plot(model, extra=2, fallen.leaves = FALSE)
With extra=2, the new plot shows the classification rate at each node: the number of correctly classified observations out of the total number of observations in that node. For instance, in the second left node, 50 of the 102 observations at that node are correctly classified into the Low sales category. Setting fallen.leaves = FALSE keeps the leaf nodes at their natural depth instead of aligning them at the bottom, changing the layout of the tree.
rpart.plot(model, type=5, extra = 2, fallen.leaves=FALSE)
The third plot uses type=5, which prints the split variable name in each split box, with class labels appearing only at the leaves. This layout is more intuitive, as it avoids overwhelming detail such as extra numbers and labels.
The root node splits on the console variable, with the rule console = DS or PC. The root node is the starting point of the model and has the highest impact on the outcome variable; in our model, the type of console is the primary factor in determining the total sales class. This insight can help a game publisher strategize future plans by focusing on high-performing consoles.
We see that of the 5 input variables (console, genre, publisher, developer, critic_score), only 2 (console and genre) appear in the tree as useful predictors of total sales. Variable subset selection is automatic, since it is part of the split selection.
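rpart also stores importance scores for the candidate variables; a quick way to inspect them (output not shown in the original):

# Importance accumulated over primary and surrogate splits; variables
# that never appear in the tree can still receive a nonzero score
model$variable.importance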
rpart.rules(model)
total_sales Low Medi High
Low [ .56 .27 .16] when console is DS or NS or PC or PS or PS2 & genre is Action or Adventure or Misc or Role-Playing
Medium [ .19 .57 .23] when console is DS or NS or PC or PS or PS2 & genre is Sports
High [ .11 .22 .67] when console is DS or NS or PC or PS or PS2 & genre is Platform or Shooter
High [ .00 .00 1.00] when console is PS4
From the rules of our model, let's describe the second rule (index = 10): a video game released for the PS console in the Sports genre will, by following the tree's nodes, be classified into the Medium sales group.
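To sanity-check this rule, one can push a hypothetical game through the fitted tree. The specific values below (a Sega-published Sports title with a critic score of 8) are illustrative assumptions, not records from the data:

# Hypothetical new game matching rule 2: PS console, Sports genre
new_game <- data.frame(console      = factor("PS", levels = levels(vg$console)),
                       genre        = factor("Sports", levels = levels(vg$genre)),
                       publisher    = factor("Sega", levels = levels(vg$publisher)),
                       developer    = factor("Sega", levels = levels(vg$developer)),
                       critic_score = 8.0)
predict(model, new_game, type = "class")  # expected: Medium, per the rules above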
complex_model <- rpart(total_sales ~ console + genre + publisher + developer + critic_score, method="class", cp=0, minsplit=2, data = vg_train)
rpart.plot(complex_model, extra=1, fallen.leaves=FALSE)
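With cp=0 and minsplit=2, the tree grows essentially until its leaves are pure. To quantify how large it gets, one can count its terminal nodes; a small sketch using rpart internals:

# Number of leaves in the fully grown tree
sum(complex_model$frame$var == "<leaf>")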
five_fold_cv <- rpart(total_sales ~ console + genre + publisher + developer + critic_score, method="class",
                      cp=0.00001, minsplit=5, xval=5, data=vg_train)
a <- printcp(five_fold_cv)
Classification tree:
rpart(formula = total_sales ~ console + genre + publisher + developer +
critic_score, data = vg_train, method = "class", cp = 1e-05,
minsplit = 5, xval = 5)
Variables actually used in tree construction:
[1] console critic_score genre
Root node error: 81/127 = 0.6378
n=127 (436 observations deleted due to missingness)
CP nsplit rel error xerror xstd
1 0.1666667 0 1.00000 1.08642 0.064178
2 0.0925926 2 0.66667 0.85185 0.069303
3 0.0493827 4 0.48148 0.85185 0.069303
4 0.0370370 5 0.43210 0.77778 0.069562
5 0.0246914 6 0.39506 0.76543 0.069545
6 0.0123457 7 0.37037 0.77778 0.069562
7 0.0061728 10 0.33333 0.77778 0.069562
8 0.0000100 12 0.32099 0.80247 0.069545
a <- data.frame(a)
a
CP nsplit rel.error xerror xstd
1 0.16666667 0 1.0000000 1.0864198 0.06417809
2 0.09259259 2 0.6666667 0.8518519 0.06930294
3 0.04938272 4 0.4814815 0.8518519 0.06930294
4 0.03703704 5 0.4320988 0.7777778 0.06956221
5 0.02469136 6 0.3950617 0.7654321 0.06954496
6 0.01234568 7 0.3703704 0.7777778 0.06956221
7 0.00617284 10 0.3333333 0.7777778 0.06956221
8 0.00001000 12 0.3209877 0.8024691 0.06954496
From the complexity parameter table for the eight trees, we see that the xerror value decreases to a minimum and then starts to increase again. This minimum cross-validation error (0.7654321) identifies the optimal cp value; in our case, cp = 0.02469136.
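This lookup can be done programmatically by pulling the minimizing row out of the CP table, which is exactly what the prune() call below does; an explicit sketch:

# Row of the CP table with the smallest cross-validated error
best <- five_fold_cv$cptable[which.min(five_fold_cv$cptable[, "xerror"]), ]
best["CP"]      # 0.02469136
best["xerror"]  # 0.7654321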
pruned.ct <- prune(five_fold_cv,
                   cp=five_fold_cv$cptable[which.min(five_fold_cv$cptable[,"xerror"]),"CP"])
rpart.plot(pruned.ct,type=5, extra = 2, fallen.leaves=FALSE)
# Huge tree results
complex_model.pred <- predict(complex_model, vg_train, type="class")
confusionMatrix(complex_model.pred, vg_train$total_sales)
Confusion Matrix and Statistics
Reference
Prediction Low Medium High
Low 39 10 4
Medium 5 33 9
High 1 3 23
Overall Statistics
Accuracy : 0.748
95% CI : (0.6633, 0.8208)
No Information Rate : 0.3622
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.617
Mcnemar's Test P-Value : 0.09099
Statistics by Class:
Class: Low Class: Medium Class: High
Sensitivity 0.8667 0.7174 0.6389
Specificity 0.8293 0.8272 0.9560
Pos Pred Value 0.7358 0.7021 0.8519
Neg Pred Value 0.9189 0.8375 0.8700
Prevalence 0.3543 0.3622 0.2835
Detection Rate 0.3071 0.2598 0.1811
Detection Prevalence 0.4173 0.3701 0.2126
Balanced Accuracy 0.8480 0.7723 0.7975
complex_model.pred2 <- predict(complex_model, vg_valid, type="class")
confusionMatrix(complex_model.pred2, vg_valid$total_sales)
Confusion Matrix and Statistics
Reference
Prediction Low Medium High
Low 18 15 3
Medium 4 8 5
High 5 2 7
Overall Statistics
Accuracy : 0.4925
95% CI : (0.3682, 0.6176)
No Information Rate : 0.403
P-Value [Acc > NIR] : 0.08620
Kappa : 0.2096
Mcnemar's Test P-Value : 0.04293
Statistics by Class:
Class: Low Class: Medium Class: High
Sensitivity 0.6667 0.3200 0.4667
Specificity 0.5500 0.7857 0.8654
Pos Pred Value 0.5000 0.4706 0.5000
Neg Pred Value 0.7097 0.6600 0.8491
Prevalence 0.4030 0.3731 0.2239
Detection Rate 0.2687 0.1194 0.1045
Detection Prevalence 0.5373 0.2537 0.2090
Balanced Accuracy 0.6083 0.5529 0.6660
The fully grown tree has high accuracy on the training data (74.8%), but its performance drops sharply on the validation data (49.25%). This suggests that the model is overfitting: it has learned the noise and idiosyncratic patterns of the training set, which do not generalize to new, unseen data.
# Pruned tree results
pruned.ct.pred <- predict(pruned.ct, vg_train, type="class")
confusionMatrix(pruned.ct.pred, vg_train$total_sales)
Confusion Matrix and Statistics
Reference
Prediction Low Medium High
Low 32 10 4
Medium 12 34 17
High 1 2 15
Overall Statistics
Accuracy : 0.6378
95% CI : (0.5478, 0.7212)
No Information Rate : 0.3622
P-Value [Acc > NIR] : 2.688e-10
Kappa : 0.4443
Mcnemar's Test P-Value : 0.003155
Statistics by Class:
Class: Low Class: Medium Class: High
Sensitivity 0.7111 0.7391 0.4167
Specificity 0.8293 0.6420 0.9670
Pos Pred Value 0.6957 0.5397 0.8333
Neg Pred Value 0.8395 0.8125 0.8073
Prevalence 0.3543 0.3622 0.2835
Detection Rate 0.2520 0.2677 0.1181
Detection Prevalence 0.3622 0.4961 0.1417
Balanced Accuracy 0.7702 0.6906 0.6918
pruned.ct.pred2 <- predict(pruned.ct, vg_valid, type="class")
confusionMatrix(pruned.ct.pred2, vg_valid$total_sales)
Confusion Matrix and Statistics
Reference
Prediction Low Medium High
Low 14 13 2
Medium 9 10 10
High 4 2 3
Overall Statistics
Accuracy : 0.403
95% CI : (0.2849, 0.53)
No Information Rate : 0.403
P-Value [Acc > NIR] : 0.54633
Kappa : 0.0583
Mcnemar's Test P-Value : 0.08112
Statistics by Class:
Class: Low Class: Medium Class: High
Sensitivity 0.5185 0.4000 0.20000
Specificity 0.6250 0.5476 0.88462
Pos Pred Value 0.4828 0.3448 0.33333
Neg Pred Value 0.6579 0.6053 0.79310
Prevalence 0.4030 0.3731 0.22388
Detection Rate 0.2090 0.1493 0.04478
Detection Prevalence 0.4328 0.4328 0.13433
Balanced Accuracy 0.5718 0.4738 0.54231
By comparison, the pruned tree shows lower accuracy on both the training set (63.78%) and the validation set (40.3%). While the overall accuracy is lower, the smaller gap between training and validation performance indicates that the pruned tree generalizes better and is less prone to overfitting, even though the fully grown tree appears to perform better on raw accuracy. The pruned model is therefore more reliable for making predictions on new data, at the cost of somewhat lower accuracy. When a model has more than two outcome classes, accuracy alone is not always enough to evaluate performance: the more classes there are, the more ways the model can predict the wrong class, and the no-information baseline also drops, making high accuracy harder to achieve.
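For a multiclass problem, the per-class statistics that caret already reports are often more informative than overall accuracy; for example, the balanced accuracy of each class can be extracted directly from the confusion-matrix object:

# Per-class balanced accuracy from the validation confusion matrix
cm <- confusionMatrix(pruned.ct.pred2, vg_valid$total_sales)
cm$byClass[, "Balanced Accuracy"]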
When using a pruned tree, the difference between training and validation accuracy shrinks because pruning reduces overfitting. By removing unnecessary splits, the model captures meaningful patterns rather than noise. As a result, training accuracy decreases, but validation accuracy remains more stable, making the model more reliable on new data.
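To make this comparison concrete, the four accuracy figures reported above can be laid out side by side; a summary sketch (numbers copied from the confusion matrices, not recomputed):

# Train/validation accuracy gap for each model
acc <- data.frame(model = c("fully grown", "pruned"),
                  train = c(0.7480, 0.6378),
                  valid = c(0.4925, 0.4030))
acc$gap <- acc$train - acc$valid
acc  # the pruned tree's smaller gap signals less overfitting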