Under The Radar: An NBA/NCAA Player & Prospect Similarity Model Utilizing a Factor Analysis & Radar

Dominic Samangy
Nov 8, 2020
Updated: Jun 20, 2021

Thanks/Inspiration

First, I would like to give a massive thanks to Eoin O’Brien (Twitter: @Eoin_O’Brien_) for his help on the similarity model used in this paper. His prior work in similarity scores for soccer players was a big inspiration for my work and he even shared his R code with me, which I used for this study with slight modifications to fit my data. Also, thanks to FC rStats (Twitter: @FC_rstats) for their radar chart tutorial, which serves as the foundation of my own.

Introduction

As an avid basketball fan and an analytically inclined person, I constantly find myself asking questions while watching a game. Recently, while I was watching a show focused on the upcoming NBA draft, the analysts were making player comparisons for NCAA draft prospects. So I thought: could I build a model that identifies similar players based on their on-court statistical output? After researching and analyzing prior work in both basketball and soccer, I have done so using basketball-reference data, a factor analysis of eight style metrics, and data visualization through radar charts. All of this was done in Microsoft Excel and R Studio.

Data

Data Used

Statistically comparing players across the professional/collegiate barrier is a daunting task. There will never be a perfect way to do so, because the players’ output comes against different competition at different levels of the sport, but I decided to take a shot at producing the most reasonable model possible. The first step was to identify the data sample for the study. Using basketball-reference.com, I pulled data from the 2019-20 season for both NBA and NCAA players, including totals, per game, and advanced stats for each player. To avoid skewed data in my sample, I set a minimum of 200 minutes played for NCAA players and 656 for NBA players, which approximately represents the role of a bench player (16.67% of available minutes). While this eliminates some players who could otherwise have been used as comparisons, it is for the better, as it removes players whose numbers may be distorted by a small sample of minutes played. The result is two datasets ready to be analyzed: one of 788 NCAA players and the other of 530 NBA players.
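The minutes thresholds can be sanity-checked with a little arithmetic. The sketch below is written in Python for illustration (the study itself used Excel and R), and it assumes schedule lengths that the article does not state: an 82-game NBA season of 48-minute games and roughly 30 NCAA games of 40 minutes each. Under those assumptions, one sixth of the available minutes reproduces both cutoffs.

```python
# Assumed schedule lengths (not stated in the article): 82 NBA games of
# 48 minutes and roughly 30 NCAA games of 40 minutes.
BENCH_SHARE = 1 / 6  # about 16.67% of available minutes, per the article


def minutes_threshold(games: int, minutes_per_game: int, share: float = BENCH_SHARE) -> int:
    """Return the minimum minutes implied by a share of a full schedule."""
    return round(games * minutes_per_game * share)


nba_minimum = minutes_threshold(games=82, minutes_per_game=48)
ncaa_minimum = minutes_threshold(games=30, minutes_per_game=40)
print(nba_minimum, ncaa_minimum)  # prints: 656 200
```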

Stylistic Metrics

With the data now collected, I turned my attention to determining how to analyze it in a way that could accurately describe and project a player. I will cover it in much more depth later in the paper, but I will be using a factor analysis to produce the similarity scores for each player. To do so, the two datasets must share an identical set of variables so that they can be compared. First, I split all players into three position groupings: guards, wings/forwards, and centers/bigs. Because basketball-reference labels positions differently for the NBA and NCAA, there may be some discrepancies. The NCAA groupings are guards (PG, SG), wings/forwards (F), and centers/bigs (C), while the NBA groupings are guards (PG, SG), wings/forwards (SF, SF-SG, SF-PF), and centers/bigs (PF, C-PF, C, PF-C). With the data now cleaned and grouped properly, I created eight style metrics for each player, which depend on the player’s grouping.
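The position groupings above amount to a simple lookup. As a rough Python sketch (not the code used in the study), the position labels below are copied from the text, while the dictionary and function names are hypothetical.

```python
# Position labels are taken from the article; the names NCAA_GROUPS,
# NBA_GROUPS, and position_group are hypothetical.
NCAA_GROUPS = {
    "guards": {"PG", "SG"},
    "wings/forwards": {"F"},
    "centers/bigs": {"C"},
}
NBA_GROUPS = {
    "guards": {"PG", "SG"},
    "wings/forwards": {"SF", "SF-SG", "SF-PF"},
    "centers/bigs": {"PF", "C-PF", "C", "PF-C"},
}


def position_group(position: str, league: str):
    """Map a basketball-reference position label to one of the three groupings."""
    if league not in ("NBA", "NCAA"):
        raise ValueError(f"unknown league: {league}")
    groups = NBA_GROUPS if league == "NBA" else NCAA_GROUPS
    for name, labels in groups.items():
        if position in labels:
            return name
    return None  # label not covered by the groupings listed in the article


print(position_group("PF-C", "NBA"))
print(position_group("F", "NCAA"))
```

Returning None for an unlisted label makes it easy to spot players whose position string falls outside the groupings, which is one place where the labeling discrepancies mentioned above would surface.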

To best define and visualize a player’s value in terms of statistical output, I opted to define contributions in three key areas: offensive, on-ball, and defensive. While the three groupings share the same contribution areas, the style metrics under each area differ to best represent each grouping’s role. To rank players on the eight metrics, I take the percentile rank of each. This means that the 100th percentile represents the best player(s) in the metric while the 0th percentile represents the worst. For example, below are the percentile rankings in the 3-Point Specialist category for NBA guards. According to the model, Ben McLemore is the top-ranked 3-point specialist while Ben Simmons ranks last. It is important to note that this does not identify the best or worst perimeter shooters, but rather the players whose scoring and play style are most and least dominated by 3-point shooting.

Table 1: Guard 3-Point Specialist Percentile Rank in Descending/Ascending Order
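The percentile ranking can be sketched in a few lines of Python. The exact formula used in the study is not stated, so this assumes a common definition: the share of other players with a strictly lower value, scaled so that the best player lands on 100 and the worst on 0. The player names and metric values are made up for illustration.

```python
def percentile_ranks(values):
    """Percentile rank of each value: 100 for the highest, 0 for the lowest.

    Assumed definition: players with a strictly lower value divided by (n - 1).
    Tied players receive the same rank.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two players to rank")
    return [100.0 * sum(other < v for other in values) / (n - 1) for v in values]


# Hypothetical 3-Point Specialist metric values, not data from the study.
players = ["Player A", "Player B", "Player C", "Player D"]
metric = [0.62, 0.41, 0.05, 0.41]

ranks = percentile_ranks(metric)
for name, rank in sorted(zip(players, ranks), key=lambda pair: -pair[1]):
    print(f"{name}: {rank:.1f}")
```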

While this will give us a good idea of how each player performs on the court, it is important to remember that these metrics don’t account for everything that occurs during play. The box score statistics that make up these style metrics will never capture everything, but with the correct implementation, we can make sound statistical comparisons based on the similarities between players. The following three tables list the formulas for the eight metrics under each position grouping, which I feel describe the optimal traits for each.

Table 2: Guard Style Metrics

First, for the guards grouping of players, the style metrics are Scoring, 3-Point Specialist, Shooting Efficiency (offensive contribution), Ball Dominance, Playmaking, Turnover Prone (on-ball contribution), Perimeter Defense, and Rebounding (defensive contribution).

Table 3: Wings/Forwards Style Metrics

For the wings/forwards grouping of players, the style metrics are Scoring, 3-Point Specialist (offensive contribution), Ball Dominance, Turnover Prone, Playmaking (on-ball contribution), Perimeter Defense, Rim Protection, and Rebounding (defensive contribution).

Table 4: Center/Bigs Style Metrics

For the centers/bigs grouping of players, the style metrics are Scoring, Stretch Big (3P), Paint Dominant (offensive contribution), Passing Big, Turnover Prone, Fouls Drawn (on-ball contribution), Rim Protection, and Rebounding (defensive contribution).

Methodology

Factor Analysis/Model

With each of our three groupings now having its own eight style metrics, we turn our attention to analyzing them in order to model similarities between the players. To do so, I performed a factor analysis in R Studio using a modified version of Eoin O’Brien’s (Twitter: @Eoin_O’Brien_) code to eventually create a similarity score model (O’Brien). A factor analysis is “a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors” (Neto, 2014). In terms of our research, the factor analysis captures the variance and correlation among the eight variables for players, which are then used to produce the similarity scores.

The first step of the factor analysis is to identify the observed and latent variables of the model. The observed variables are the box score statistics (PTS, AST%, 3P%, etc.) that are used to model the latent, or unobserved, variables, which are the eight metrics used in each of the groupings (Scoring, 3-Point Specialist, Rebounding, etc.). As explained before, these have already been identified and created.

Next, we need to identify the number of factors, or components, to use based on the eigenvalues. Each component, or style metric in this study, has an assigned eigenvalue, meaning that there are eight eigenvalues in the analysis. An eigenvalue “represents the total amount of variance that can be explained by a given principal component. Starting from the first component, each subsequent component is obtained from partialling out the previous component” (Exploratory Factor Analysis, N/A). This means that the first factor explains the most variance while the eighth factor explains the least. Below we can see that the first factor accounts for 24% (0.24) of the total variance and that, in total, the four factors used account for 72.9% (0.729). These values appear under “Proportion Var” (Factor 1) and “Cumulative Var” (Factor 4).

Figure 1: Factor Analysis of The 8 Style Metrics

While the variance of the analysis is explained above, it is crucial to understand why the number of factors was set at four. The choice is somewhat subjective but is based on the scree plot of the factor analysis, which plots the eigenvalue of each factor. For example, as seen below, the eigenvalue of factor 1 is 2.40, which corresponds to the 24% of total variance captured by factor 1.

Figure 2: Factor Analysis’ Scree Plot

According to the aforementioned research, commonly used criteria to determine the number of factors to extract are (refer to black line) (Exploratory Factor Analysis, N/A):

1. Use all factors with an eigenvalue greater than 1 (1, 2, & 3 in this example)

2. The number of factors before the “elbow joint”, or where the biggest drop in variance explained is seen (see plot). The common interpretation is to take all factors before this point (the first 3 factors in this case) or to take as many as are needed to account for a large amount of variance.

3. The last common criterion is that the extracted factors together should account for between 70% and 80% of the total variance.

Based on these criteria, along with personal preference and reasoning, I decided to extract four factors. Criteria 1 and 2 would advise using only three factors, as factor 4 has an eigenvalue less than 1 and sits on the elbow joint. However, in agreement with criterion 3, the inclusion of factor 4 brings the total variance explained by the analysis from 58.7% to 72.9%, which falls within the ideal range of 70%-80% described previously. The amount of variance explained by factor 4, 14.2%, was too large to leave out in my opinion, which led to my final decision to extract four factors for the analysis.
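Two of the three criteria are mechanical enough to sketch in code. The Python below applies the eigenvalue-greater-than-1 rule and the cumulative-variance rule to a list of eigenvalues. The eigenvalues are hypothetical placeholders that sum to 8 (one per style metric); they are not the values from this study and do not reproduce its percentages. The elbow criterion is left out because it relies on visually reading the scree plot.

```python
from itertools import accumulate

# Hypothetical eigenvalues for eight variables (they sum to 8).
# These are NOT the study's values; they only illustrate the two rules.
eigenvalues = [2.4, 1.5, 1.1, 0.9, 0.7, 0.6, 0.5, 0.3]


def kaiser_count(eigs):
    """Criterion 1: number of factors with an eigenvalue greater than 1."""
    return sum(1 for e in eigs if e > 1.0)


def count_for_variance(eigs, target=0.70):
    """Criterion 3: fewest factors whose cumulative share of variance reaches target."""
    ordered = sorted(eigs, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    return len(ordered)


print("Eigenvalue > 1 rule:", kaiser_count(eigenvalues))
print("70% variance rule:", count_for_variance(eigenvalues))
```

With these placeholder values the two rules disagree by one factor, which is the same kind of judgment call described above.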

Similarity Model/Table

Next, with a factor analysis of four factors, we can calculate the factor scores of each player. With these scores, we can then find the most similar players based on the smallest Euclidean distance between them. For example, if we project Anthony Edwards’ NBA comparisons based on 2019/20 season data, the model output is as shown below.

Table 5: Anthony Edwards’ NBA Similarity Scores

As a final product, by incorporating four factors in the analysis, the similarity model returns names that pass the comparison eye test. Edwards’ #1 match, with a similarity score of 87.1%, is Utah Jazz guard Donovan Mitchell, who is known for his dynamic athleticism and knack for scoring, the same traits that have propelled Edwards to be the rumored #1 pick in the 2020 NBA Draft. It is very important to understand that the similarity percentage does not directly measure similarity in play style; it is based on similar statistical output. With a properly implemented factor analysis like the one in this study, the model can still point to similar playing styles, as in Edwards’ example, but we must not assume that a high similarity score automatically means similar playing styles. Combining this model with the eye test is the best approach to making decisions, and the Edwards/Mitchell comparison is a perfect example. Overall, comparisons like this one bode well for the validity and accuracy of the model.
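The distance step of the similarity model can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study. The factor scores and player labels are hypothetical, and the conversion from distance to a percentage is an assumption, since the article does not state how the similarity percentage is derived: here the farthest candidate is set to 0% and a distance of zero to 100%.

```python
import math

# Hypothetical four-factor scores; labels and numbers are placeholders.
prospect = (1.2, -0.3, 0.8, 0.1)
nba_players = {
    "NBA Player A": (1.0, -0.2, 0.9, 0.0),
    "NBA Player B": (-0.5, 1.4, 0.2, 0.7),
    "NBA Player C": (0.3, 0.1, -1.1, 1.5),
}


def rank_by_similarity(target, candidates):
    """Return (name, distance, similarity %) tuples, most similar first."""
    distances = {name: math.dist(target, scores) for name, scores in candidates.items()}
    farthest = max(distances.values())
    ranked = sorted(distances.items(), key=lambda item: item[1])
    results = []
    for name, d in ranked:
        # Assumed scaling: 100% at zero distance, 0% for the farthest candidate.
        similarity = 100.0 * (1 - d / farthest) if farthest > 0 else 100.0
        results.append((name, d, similarity))
    return results


results = rank_by_similarity(prospect, nba_players)
for name, d, similarity in results:
    print(f"{name}: distance {d:.3f}, similarity {similarity:.1f}%")
```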

Results

Radar Plot Comparisons

Finally, to best describe the model’s results visually, I used R Studio to create radar plots that compare the eight metrics between two players while also showing the similarity score. A radar plot is a way of visualizing at least three variables on the same axes in two dimensions. Sentiment towards radar plots varies in data visualization because the ordering of the variables can be manipulated to display different information, but if used correctly, a radar plot is a powerful tool for visualizing many variables at once. It is easy on the eye and can be interpreted with little explanation, even by a person with little data visualization experience. Below is an example of Anthony Edwards versus Donovan Mitchell.

Figure 3: Radar Plot Comparison of Edwards & Mitchell

In the example, each variable, or style metric, is plotted as a point on a scale of 0-100, representing the player’s percentile rank. The eight points are then connected by a line of the same color and shaded in, which creates the final radar plot for each player. The middle of the circle represents 0 while the outer edge represents 100. The easiest example to identify is the ball dominance value for Anthony Edwards, which at 98.2 is slightly below the maximum of 100. In accordance with the 87.1% similarity score, the gaps between the two plots (red and blue) are very small, which shows why Edwards and Mitchell score so closely.
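The geometry behind a radar plot is straightforward: each metric gets its own equally spaced axis, and the percentile sets how far along that axis the point sits. The Python sketch below computes the polygon vertices for one player. Only the 98.2 ball dominance value comes from the article; every other percentile is a placeholder, and the starting angle and axis order are assumptions.

```python
import math

# Guard style metrics in the order listed in the article.
GUARD_METRICS = [
    "Scoring", "3-Point Specialist", "Shooting Efficiency", "Ball Dominance",
    "Playmaking", "Turnover Prone", "Perimeter Defense", "Rebounding",
]

# Placeholder percentiles; only the Ball Dominance value (98.2) is from the article.
percentiles = [50.0, 50.0, 50.0, 98.2, 50.0, 50.0, 50.0, 50.0]


def radar_vertices(values):
    """Convert 0-100 percentiles to (x, y) points on equally spaced axes."""
    n = len(values)
    vertices = []
    for i, value in enumerate(values):
        angle = 2 * math.pi * i / n  # axis i; the first axis points along +x
        radius = value / 100.0       # 0 at the center, 1 at the outer edge
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


vertices = radar_vertices(percentiles)
for metric, (x, y) in zip(GUARD_METRICS, vertices):
    print(f"{metric}: ({x:.3f}, {y:.3f})")
```

Connecting the vertices in order and shading the enclosed area gives the filled shape for one player; reordering the metrics changes that shape, which is the manipulation concern noted above.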

Application/Limitations

How It Can Be Used

This model/tool is a valuable asset that can be used for prospect scouting and player recruitment for NBA teams. Organizations can receive almost instant feedback on a specific prospect or NBA player with comparisons based on similarity scores. They can opt to find college prospects similar to an NBA player or to find NBA players similar to college prospects. NBA teams can also analyze players across the league, which can aid in identifying players who can fill a needed role on the roster. For example, below are Damian Lillard’s NBA and NCAA comparisons. While Lillard is a player few expect to be traded any time soon, this dashboard shows how the model can be used effectively to identify players who fit a specific role/mold.

Figure 4: Damian Lillard NBA & NCAA Comparisons

My future plans include creating an R Shiny app in which fans will be able to choose two players, either NBA or NCAA, and compare their radar charts in a dashboard similar to the one above. There will also be an option to choose a single player and the league in which they wish to see similar players. As shown above, fans or analysts will be able to choose whether they want to see Lillard’s NBA or NCAA similarity rankings.

Limitations

While the versatility of the tool is evident and it certainly offers a different perspective on player analysis, several limitations should be noted. First and foremost, this model is NOT an end-all decision-making tool. An organization should never target and pursue a player based on a similarity score alone, as this model is most effective when it is used in addition to other forms of analysis, such as the eye test and player intel reports.

In terms of the data collected, it is important to remember that NCAA players face collegiate competition while NBA players face professionals. Therefore, their individual statistical output does not account for competition strength when NCAA prospects are compared and projected to NBA players. Also, the totals, per game, and advanced statistics used to create the eight style metrics do not account for every aspect of the game that affects on-court play. With box score statistics like these, we will never be able to perfectly model player contributions, but it can be done effectively if they are utilized properly. The data used for the model is from the 2019/20 NCAA and NBA seasons, which inadvertently leaves injured players and NCAA sit-out transfers out of the samples. Notable omissions include Sam Hauser, Quentin Grimes, Stephen Curry, Klay Thompson, and Kevin Durant. Lastly, it should be noted that the position labeling of NCAA players is less accurate than that of NBA players, which may have led to collegiate players being placed into the wrong position groupings.

The methodology used in this study also contains some important limitations. First, the use of percentiles to rank players has both advantages and disadvantages. With a large enough sample size like ours, it allows us to rank the players on a 0-100 scale, which is ideal for the radar visualizations. However, the distance between starters and superstar players towards the top end of the sample is compressed. For example, if attempting to rank NBA players based on points per game, a player in the 90th percentile may average 20 points per game while the 100th percentile player averages 30 per game. This difference in scoring is massive: the latter player scores 50% more points per game. However, the percentile ranking puts only a 10-point difference between them.
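The compression in that example is easy to verify. The short Python sketch below uses only the illustrative numbers from the paragraph above (20 and 30 points per game at the 90th and 100th percentiles) and compares the relative gap in raw scoring with the gap in percentile points.

```python
def relative_gap(lower: float, higher: float) -> float:
    """Percent by which the higher value exceeds the lower one."""
    return 100.0 * (higher - lower) / lower


# Illustrative numbers from the example above, not real player data.
ppg_gap = relative_gap(20.0, 30.0)  # raw scoring gap, in percent
percentile_gap = 100 - 90           # gap on the 0-100 percentile scale

print(f"Raw scoring gap: {ppg_gap:.0f}%")
print(f"Percentile gap: {percentile_gap} points")
```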

Conclusion/Future Plan

After a few months of planning, research, and trial and error, I am very pleased with the current state and results of the model. In the future, I plan on continuing to update the model if needed while also updating the database to include NCAA and NBA season data prior to the 2019/20 season. This will improve the accuracy of the model while also increasing the number of comparisons available to make. Also, as mentioned before, I plan on creating an R Shiny app that will allow users to create their own comparison dashboards and similarity tables instantly. Any feedback/recommendations are more than welcome, and it is greatly appreciated if you made it this far! Feel free to reach out to me at dsamangy@syr.edu or on LinkedIn.

Sources

“A Practical Introduction to Factor Analysis: Exploratory Factor Analysis.” IDRE Stats, stats.idre.ucla.edu/spss/seminars/introduction-to-factor-analysis/a-practical-introduction-to-factor-analysis/.

O'Brien, Eoin. “EoinOBrien94/FactorAnalysis.” GitHub, github.com/EoinOBrien94/FactorAnalysis.

Neto, João. Factor Analysis, Oct. 2014, www.di.fc.ul.pt/~jpn/r/factoranalysis/factoranalysis.html.
