Last year, LeBron James’ 11-year All-NBA 1st team streak came to an end. The Lakers’ lack of success combined with the fact that James played only 55 games dropped him to a 3rd team spot. This year, LeBron is a lock to make the All-NBA 1st team. LeBron leads the league in assists and sits squarely in the MVP conversation with the Lakers in first in the West.
Aside from James, the 1st team seems as predictable as ever. Giannis – the current MVP race leader – will get the forward spot alongside LeBron. Davis will claim the center spot, with Harden taking one guard spot. The second guard spot may go to either Doncic or Lillard. If Lillard continues his incredible hot streak and leads the Blazers to the playoffs while the Mavs falter, the spot may go to Lillard. (Some may consider Doncic a forward. He’s listed as a guard on the All-Star team, so we assume he counts as a guard.)
Last year, we saw a much different race. Giannis and Harden were clear 1st team players the entire season. Voters unanimously placed both players on the 1st team. Curry and George seemed likely to secure the other guard and forward spots. However, those spots were far from certain. Furthermore, Embiid and Jokic were in a tight race for the 1st team center spot.
Though this year’s first team is predictable, the remaining spots seem up in the air. For example, who deserves the 2nd team center spot? Embiid’s injuries likely prevent him from getting the spot. Rudy Gobert, Nikola Jokic, and Bam Adebayo all have strong cases to earn a spot. Given Davis will be on the 1st team, these 4 players must battle it out for the remaining 2 center spots. And that’s not even including Karl-Anthony Towns.
This tight centers race presents only one of the many close calls this year. At this point in the season, players’ stats start to stabilize, allowing us to evaluate their performance. To predict the All-NBA teams, we created a deep neural network.
Last year’s performance
Last year, we created 4 models to predict the All-NBA teams. The models performed exceptionally well. The table below shows the predicted teams from the average of the 4 models:
|Team||Guard||Guard||Forward||Forward||Center|
|1st team||Damian Lillard (0.947)*||James Harden (0.997)||Giannis Antetokounmpo (1.000)||Kevin Durant (0.984)||Joel Embiid (0.900)|
|2nd team||Stephen Curry (0.947)*||Russell Westbrook (0.811)||Paul George (0.963)||Kawhi Leonard (0.884)||Nikola Jokic (0.897)|
|3rd team||Kyrie Irving (0.713)||Kemba Walker (0.417)||LeBron James (0.551)||Blake Griffin (0.420)||Rudy Gobert (0.854)|
The asterisk indicates that Lillard and Curry had identical probabilities. We noted this spot would go to Curry. Furthermore, though Durant had a higher probability than George, we said narrative would push George over Durant. If we consider the Lillard/Curry prediction as correct, then we correctly predicted 11/15 spots. (Note that each “switch” necessary counts as 2 incorrect predictions. So, when we predicted George to be 2nd team and Durant 1st team, and the real result was flipped, we counted that as 2 incorrect predictions even though it’s technically 1 switch.) Each player we predicted to make an All-NBA team made an All-NBA team. The only switches necessary to have perfect predictions are George/Durant and Irving/Westbrook.
Our models’ success last year shows that we can model All-NBA selections with stats. The models looked at historical data starting from the 1979-1980 season (introduction of the 3-point line). Each year, our data set consisted of all the players to make an All-NBA team or the All-Star team. So, we had a small and arbitrary data set. Though All-Stars generally overlap with All-NBA players, there’s no reason to condition on being an All-Star. Using this data, we trained our models with the following features:
|Counting stats||Advanced stats||Team stats||Other|
These inputs also aren’t perfect. For example, there’s no reason to use FG% instead of TS% or eFG%. There’s also no logic for the specific advanced stats we used. Lastly, team success may be unnecessary. Factoring in team success creates a feedback loop. Great players make teams great, so most All-NBA-level players will play on good teams. Though there are a few exceptions (like Trae Young), team stats won’t give us much information.
We used the models’ prediction probabilities to create the All-NBA teams. The player with the highest prediction probability went into the highest available slot in his position.
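As a rough sketch of that slot-filling rule, the snippet below sorts players by probability and drops each one into the highest team that still has an open slot at his position. The function name, the player names, and the probabilities are all hypothetical; the only thing taken from the All-NBA format is that each team has 2 guards, 2 forwards, and 1 center.

```python
from typing import Dict, List, Tuple

# Slots per team: two guards, two forwards, one center.
SLOTS_PER_TEAM = {"G": 2, "F": 2, "C": 1}


def build_teams(
    players: List[Tuple[str, str, float]], n_teams: int = 3
) -> List[Dict[str, List[str]]]:
    """Greedily place players into teams by descending probability.

    `players` holds (name, position, probability) tuples. The values used
    below are illustrative, not model output.
    """
    teams = [{pos: [] for pos in SLOTS_PER_TEAM} for _ in range(n_teams)]
    ranked = sorted(players, key=lambda p: p[2], reverse=True)
    for name, pos, _prob in ranked:
        for team in teams:
            if len(team[pos]) < SLOTS_PER_TEAM[pos]:
                team[pos].append(name)
                break  # placed in the highest team with an open slot
    return teams


example = [
    ("Guard A", "G", 0.99), ("Guard B", "G", 0.95), ("Guard C", "G", 0.80),
    ("Forward A", "F", 1.00), ("Forward B", "F", 0.90), ("Forward C", "F", 0.60),
    ("Center A", "C", 0.92), ("Center B", "C", 0.85),
]
teams = build_teams(example)
```

Here `teams[0]` is the 1st team, `teams[1]` the 2nd, and so on. A player who finds every slot at his position already filled is simply left off.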
This year’s methods
This year, we improved our methods in several ways. We’ll discuss each of them below.
This year, we used much more data. We collected data for every player whose rookie season came on or after the 1979-1980 season. This boundary ensures that every player played their entire career with the 3-point line.
We split our data into 3 parts: a training set, validation set, and testing set. The training set consists of 50% of the data, the validation set 25%, and the testing set 25%. We split up the data sets randomly in a stratified way, meaning we preserve class balance.
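To make the stratified 50/25/25 split concrete, here is a minimal sketch in plain Python. It shuffles each class separately and slices it into the three sets, so every set keeps the same class mix. The function name, the seed, and the toy labels are assumptions for illustration, not our actual pipeline.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return (train, val, test) index lists that preserve class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains
    return train, val, test


# Toy labels: 1 = All-NBA, 0 = not. The 40/960 mix is made up.
labels = [1] * 40 + [0] * 960
train, val, test = stratified_split(labels)
```

With these toy labels, the training set gets 20 of the 40 positive samples and the validation and testing sets get 10 each.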
Using every player season gives us over 15,000 samples. However, among these 15,000 samples, only 511 of them made an All-NBA team. So, our data set is unbalanced. To combat this, we performed something called SMOTE.
SMOTE (synthetic minority over-sampling technique) is a way to create fake (synthetic) data points to balance our classes. SMOTE creates each synthetic point by picking a minority-class sample and one of its nearest minority-class neighbors, then placing a new point between them using a random weight between 0 and 1.
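The interpolation step can be sketched in a few lines of plain Python. This is basic SMOTE only, not the borderline variant, and the function name, the neighbor count `k`, and the toy points are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (not itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random weight in [0, 1)
        # New point lies on the segment between base and neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Four made-up minority samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority, k=2, n_new=10)
```

Every synthetic point sits on a line segment between two real minority samples, so it never leaves the region those samples already cover.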
We used borderline-SMOTE, which is a special type of SMOTE. Borderline-SMOTE generates synthetic data using borderline minority examples. So, we’re generating data from samples that are close to the majority examples. The hope is that the synthetic data helps the model differentiate between these borderline cases better.
Our initial data set consists of about 15,000 player samples. On top of these, we generated synthetic All-NBA player samples until the classes were balanced.
One super important thing with SMOTE is to do it after splitting up the data. If we perform SMOTE before splitting our data into training/validation/testing sets, we’ll get misleading accuracy. This is because our model will leak information; samples in the train set could be generated from samples in the test set, and vice versa. This is a common mistake that leads to artificially high accuracy and bad models. Furthermore, it’s important not to use SMOTE on our testing and validation sets, because we want those sets to resemble the data we’ll use to make new predictions.
We only performed SMOTE on our training set. Our validation and testing sets remain unchanged.
This year, we’re using a much broader range of features. We hope these features better reflect a player’s skill.
The only features we had last year that we don’t have this year are team-based features, such as team wins and seed.
The table below lists all the features we used in our model.
|G||FG, FGA, FG%||ORB, ORB%||AST, AST%||STL, STL%||PER|
|GS||2P, 2PA, 2P%||DRB, DRB%||TOV, TOV%||BLK, BLK%||OWS, DWS, WS, WS/48|
|MP||3P, 3PA, 3P%||TRB, TRB%||USG%||PF||OBPM, DBPM, BPM|
| ||FT, FTA, FT%|| || || ||VORP|
| ||eFG%|| || || ||RAPTOR (off. and def.)|
In total, we have 47 features. We see that a lot of them are collinear (meaning some features directly predict another). If we construct the model in a specific way, this isn’t a problem (more on this later).
We used Keras to create a deep neural network. The model consisted of 6 layers. The first layer had 47 nodes, as we had 47 features. The next layer had 32 nodes. From there, each subsequent layer had half the nodes of the previous layer until we reached 4 nodes. From then, the next layer was the output layer (only 1 node).
Each layer except for the last one uses a leaky ReLU activation. The final layer uses a sigmoid activation such that all our probabilities fall between 0 and 1.
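The layer sizes and activations described above can be written out in plain Python as a quick sanity check. This is not the Keras code itself, just a sketch of the shapes and functions involved. The leaky ReLU slope of 0.01 is an assumption, since we don't state the value here.

```python
import math


def layer_widths(n_features=47, first_hidden=32, smallest_hidden=4):
    """Widths of each layer: input, halving hidden layers, single output."""
    widths = [n_features]
    width = first_hidden
    while width >= smallest_hidden:
        widths.append(width)
        width //= 2  # each layer has half the nodes of the previous one
    widths.append(1)  # output node
    return widths


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU. The slope value is an assumption, not a stated setting."""
    return x if x > 0 else negative_slope * x


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


widths = layer_widths()
```

Unlike a plain ReLU, the leaky version passes a small fraction of negative inputs through instead of zeroing them.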
A common issue in machine learning is overfitting. This occurs when a model learns the training data too aggressively. While this results in great training accuracy, it makes the model bad at predicting results from new data. In previous machine learning projects, we’d check for overfitting by looking at cross-validated accuracy metrics. Here, we can take a better approach. We designed the model to avoid overfitting.
The first way we avoid overfitting is with weight regularization. If a model has very large weights, it’s likely to overfit. A small change in input data results in a huge change in the output if we have large weights. Weight regularization fixes this issue.
There are two ways to regularize weights: L1 regularization and L2 regularization. They regularize in different ways. L1 uses the l-1 norm to regularize weights. This results in sparse coefficients, meaning we’re making several weights 0 and keeping the important ones. L2 regularization uses the l-2 norm, which results in shrinking coefficients to smaller values.
The first 3 layers of our model use L1 regularization. This helps fix any issues we’d have with collinearity, as features that don’t contribute to the model will shrink to 0. The final 3 layers of our model had L2 regularization. This shrinks the remaining non-zero coefficients.
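Written out, both penalties add a term to the loss the model minimizes. The sketch below uses symbols we introduce only for illustration: $L_0$ is the unpenalized loss, $w_i$ are the weights of the regularized layer, and $\lambda$ is the regularization strength (its value is not given here).

$$
L_{\text{L1}} = L_0 + \lambda \sum_i |w_i|, \qquad
L_{\text{L2}} = L_0 + \lambda \sum_i w_i^2
$$

The absolute value in the L1 term penalizes small weights just as steeply as large ones, which is what pushes unhelpful weights all the way to 0. The squared term in L2 penalizes large weights the most, so it shrinks them without forcing them to exactly 0.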
Overfitting: early stopping
In a neural network, we’re trying to find the weights and biases that minimize a given loss function. Each time we make a full cycle through the training data, the model updates its weights. We call each cycle an epoch. We pre-set the number of epochs (in this case, 200).
The model calculates the loss function from the training data. Because we’re concerned with making predictions on new data, this could be a problem. Our new data might not mimic our training data. (In fact, it doesn’t because our training data has synthetic observations). This means that we should care about the validation loss too.
In early stopping, we stop the model from completing all the epochs to avoid overfitting. We do this by minimizing the validation loss. If the validation loss rises in a given epoch, then that weight update isn’t making our models more accurate. So, early stopping finds this minimum validation loss and uses those weights.
Our early stopping looks forward 25 epochs before stopping. This means that once we find a validation loss minimum, we look forward 25 epochs. If none of these epochs have lower validation loss, then we use the weights from the previous minimum. This allows us to optimize our weights for both our training and validation data.
Let’s visualize this. The graph below shows the model’s loss on the training and validation data throughout all 200 epochs.
Notice that at some epochs, the loss in the validation data increases. Validation loss at epoch 200 isn’t at a minimum, so our model could be better.
The graph below shows the model’s loss with early stopping.
Notice that at epoch 5, our validation loss is minimized. So, the model completes the next 25 epochs (until epoch 30). None of these epochs result in lower validation loss, so the model stops learning and uses the weights from this earlier epoch.
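The 25-epoch look-ahead rule can be sketched without any deep learning library. The function below takes a list of validation losses and reports which epoch had the best loss and when training would halt. The function name and the loss values are made up for illustration; they only mimic the pattern described above.

```python
def early_stopping(val_losses, patience=25):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training halts once `patience` consecutive epochs fail to beat the best
    validation loss seen so far. The weights from `best_epoch` are kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Made-up validation losses: falling until epoch 5, then drifting upward.
losses = [0.50, 0.40, 0.30, 0.25, 0.20] + [0.21 + 0.001 * i for i in range(195)]
best_epoch, stop_epoch = early_stopping(losses)
```

With these losses, the minimum comes at epoch 5 and training stops at epoch 30 instead of running all 200 epochs.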
The final method we’ll use to avoid overfitting is dropout. Dropout occurs when we randomly drop (or ignore) a certain subset of nodes in a layer. This lets us mimic creating a bunch of different neural network structures in a simple way to see which performs best. Each time we drop a unit, we also drop its connections. This causes other units to adjust their weights accordingly.
Like L1 regularization, dropout can make our coefficients sparse, helping us avoid overfitting. After each layer except for the final output layer, we randomly dropped 20% of the nodes.
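Here is a minimal sketch of what a 20% dropout does to one layer's outputs during training. Surviving values are scaled up by 1/(1 - rate) so the expected total stays the same, which is the usual "inverted dropout" convention. The function name and the toy activations are assumptions for the example.

```python
import random


def dropout(activations, rate=0.2, rng=None):
    """Zero each activation with probability `rate`; rescale the survivors."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# 1,000 made-up activations, all equal to 1.0, to make the effect visible.
layer_output = [1.0] * 1000
dropped = dropout(layer_output, rate=0.2, rng=random.Random(0))
n_zeroed = sum(1 for v in dropped if v == 0.0)
```

Roughly 1 in 5 values comes back as 0 and the rest come back as 1.25. Dropout applies only while training; at prediction time every node stays active.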
Now that we understand the model’s structure, let’s evaluate its performance. We’ll use some of the usual classification metrics we present, along with some new ones. We create these metrics by comparing our predicted values from the testing set to the real values from the testing set.
Note that a model guessing at random would have about 50% accuracy on the testing set. The training set has a true 50/50 split of classes, so a model that learned nothing from it would have no reason to favor either class, even though the testing set has a large class imbalance. So, high accuracy is indicative of strong performance here.
First, we’ll look at the confusion matrix.
We see that the model has only 5 false negatives (predicted 0 but actual 1). This is a great sign, as we’re most interested in finding the All-NBA worthy players. So, over-predicting these (having lots of predicted 1s but actual 0s) is better than under-predicting them. Therefore, the high number of false positives is not alarming. In fact, it’s expected given that the training data included many synthetic observations of the positive class. Because of the synthetic data, we would expect the model to over-predict the positive class.
We see that the model has accuracy 0.948, recall 0.961, and precision 0.387 (and consequently, F1 0.552). Though the precision is low, we’d prefer high recall and low precision instead of the other way around here.
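For reference, the sketch below shows how these metrics follow from confusion matrix counts, and checks that the reported precision and recall give the reported F1. The helper names are ours, and any counts you pass to `precision_recall` would be your own; only the 0.387 and 0.961 values come from the results above.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp)  # share of predicted 1s that are real 1s
    recall = tp / (tp + fn)     # share of real 1s that we predicted as 1
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall from the testing set.
reported_f1 = f1_score(0.387, 0.961)
```

Because F1 is a harmonic mean, the low precision pulls it well below the recall.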
To create the teams, we use the same method as last year. The player with the highest predicted probability goes into the highest slot in his position. Because we care about prediction probabilities, we’ll look at some probability-focused metrics.
The model has a log loss of 0.120 and a Brier score of 0.039. Both metrics are quite strong. Next, we’ll look at the ROC curve and the area under it.
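Both metrics score the predicted probabilities directly, and lower is better for each. A minimal sketch of the two formulas follows. The function names mirror common usage but are defined here from scratch, and the four toy predictions are made up.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example: 1 = All-NBA, 0 = not.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.3]
toy_log_loss = log_loss(y_true, y_prob)
toy_brier = brier_score(y_true, y_prob)
```

Log loss punishes confident wrong predictions much more heavily than the Brier score does, so the two are worth reading together.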
The model has a near-perfect area under the ROC curve of 0.99. This means the model is very strong at differentiating between the two classes.
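One way to read the area under the ROC curve: it equals the chance that a randomly chosen All-NBA player gets a higher probability than a randomly chosen non-All-NBA player. The sketch below computes it that way by comparing every positive/negative pair. The function name and toy data are ours, and this pairwise approach is fine for small examples but slow on large data sets.

```python
def roc_auc(y_true, y_prob):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.35) is ranked below one negative (0.40).
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example, 3 of the 4 positive/negative pairs are ordered correctly, so the AUC is 0.75.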
Finally, we’ll look at 2 additional graphs to show our model’s prediction probability strength. First, we’ll look at the cumulative gain.
This shows the percentage of the total cases in a class “gained” by targeting a percentage of the total sample. The dotted black line represents the baseline, or essentially what you’d expect from a true random sample. So, any point along the baseline says “the top x% of the sample contains x% of this class.” This means that the higher we are above the baseline, the better the model is at identifying that class.
We see that the cumulative gain for class 1 is very high. By the top 15% of the sample, we’ve captured almost 100% of class 1. Given that class 1 is All-NBA players, it’s very good that the top 15% of players by All-NBA probability contains nearly every All-NBA player. Though the model isn’t much better than random at identifying class 0 (not All-NBA players), this is not our main concern.
Next, we’ll look at the lift curve. This presents a similar idea in a different way.
This essentially divides each point on the cumulative gains chart by the baseline. So, the “lift” is the ratio of the cumulative gain to the baseline. For example, at the point (0.2, 5), we see that the model is 5x more likely to identify an All-NBA player in the top 20% of the sample than a random model. Among the highest prediction probabilities, the model is about 30x more likely to identify an All-NBA player than the baseline.
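To tie the two charts together, here is a small sketch that computes one point on each. Cumulative gain is the share of all positives found in the top fraction of the ranked sample, and lift divides that by the fraction itself. The function names and the 10-player toy data are assumptions for illustration.

```python
def cumulative_gain(y_true, y_prob, fraction):
    """Share of all positives captured in the top `fraction` by probability."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = round(fraction * len(ranked))
    captured = sum(y for _, y in ranked[:top_n])
    return captured / sum(y_true)


def lift(y_true, y_prob, fraction):
    """Cumulative gain divided by the baseline (the fraction itself)."""
    return cumulative_gain(y_true, y_prob, fraction) / fraction


# Toy sample of 10 players; the 2 positives have the 2 highest probabilities.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gain_at_20 = cumulative_gain(y_true, y_prob, 0.2)
lift_at_20 = lift(y_true, y_prob, 0.2)
```

In this toy case, the top 20% of the sample holds every positive, so the gain is 1.0 and the lift is 5. That matches how the point (0.2, 5) on the lift curve should be read.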
Each of our metrics shows the model’s strength, particularly in predicting the positive class (All-NBA players). We believe the model creates strong predictions for All-NBA players.
On 2/3/2020, before any games were played, we collected all the necessary data for every player who played a game in the 2019-20 NBA season. We scaled the necessary stats (like games, win shares, etc.) up to a full season. Then, we fed the model this data to make predictions.
As described before, we created the All-NBA teams with prediction probabilities. When a player’s position was unclear, we used All-Star game position. If the player did not make the All-Star game, we used the position at which the player played the most minutes. So, Luka Doncic counts as a guard. DeMar DeRozan and Jimmy Butler count as forwards.
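One simple way to scale a cumulative stat up to a full season is to multiply it by 82 divided by the number of games the player's team has played so far. The sketch below assumes that proportional approach; the exact scaling isn't spelled out here, and the function name and example numbers are hypothetical.

```python
FULL_SEASON_GAMES = 82  # length of a regular NBA season


def scale_to_full_season(value, team_games_played):
    """Project a cumulative stat to 82 games (assumes a constant pace)."""
    return value * FULL_SEASON_GAMES / team_games_played


# Hypothetical player: 5.0 win shares after his team's 41st game.
projected_ws = scale_to_full_season(5.0, team_games_played=41)
```

Rate stats such as FG% or USG% would not need this step, since they don't grow with games played.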
The table below shows our results.
|Team||Guard||Guard||Forward||Forward||Center|
|1||James Harden||Damian Lillard||Giannis Antetokounmpo||LeBron James||Anthony Davis|
|2||Luka Doncic||Ben Simmons||Jimmy Butler||Kawhi Leonard||Rudy Gobert|
|3||Devin Booker||Trae Young||Pascal Siakam||DeMar DeRozan||Nikola Jokic|
To view each player’s probability, look at the table below.
We see that several players have small differences in their All-NBA probabilities. This will likely stabilize as the season progresses.
Last year, our models almost perfectly predicted the All-NBA teams. This year, we’re taking a different – and hopefully better – approach. So far, our models return what many project for the All-NBA teams.
To improve the model, we could take a few different steps. First, several players have extremely close All-NBA probabilities. This is due to how we constructed the problem. For example, using SMOTE gives us more positive class examples, but it means that the All-NBA worthy players all have high probabilities. Furthermore, different model structures – such as using different numbers of layers, regularization, or activation functions – may help with this. Nevertheless, the model’s historical performance is extremely strong. Therefore, it is likely a robust predictor of this year’s All-NBA teams.