In this recipe I’d like you to chow down on a Smash Bros. analytical approach to selecting your main character. The approach I’m going to introduce to you puts an emphasis on what makes a character unique.
Before I start diving into the Smash Bros. data, let’s discuss the k-means clustering approach. K-means helps paint a clear picture of our data; in this case it will group Smash Bros. characters by their attributes to create a picture of who your main should be. Our characters will be assigned into segments
(tiers… everyone loves to put tiers around Smash Characters but they’re based solely on opinion and player preference)
based on trends in our data and how close a character is to a group.
Take the above picture: without applying this approach we are in the top left quadrant, with only a faint idea of who our main should be. As we apply more segments and find more trends in the data, we’ll eventually end up in the bottom left quadrant: a clear picture of who our main should be.
Now, I keep mentioning trends in our data. How do we find trends in data where attributes are, on the surface, completely skewed and non-normalized? Take, for instance, a character’s weight: as a whole number it will be far larger than a character’s acceleration rate in the air (aerial attacks).
We can surface these trends by standardizing our variables, setting them all to have a mean of zero. In doing so, the analysis focuses strictly on the trends in our data, and we can have a pretty interesting discussion: e.g. Yoshi is more similar to Kirby than he is to Pac-Man.
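If you want to see the mechanics, here’s a minimal sketch of z-score standardization in Python. The attribute values below are made up for illustration, not actual game data:

```python
# Minimal sketch of z-score standardization on two hypothetical Smash
# attributes (weight and air acceleration); the numbers are illustrative.
from statistics import mean, stdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

weights = [104, 79, 135, 95, 87]             # whole-number character weights
air_accel = [0.07, 0.11, 0.03, 0.09, 0.12]   # much smaller raw scale

z_weights = standardize(weights)
z_air = standardize(air_accel)
# After standardizing, both attributes live on the same scale, so neither
# dominates the distance calculation in k-means.
```

Once every attribute is on a mean-zero scale, a heavyweight’s weight and a featherweight’s air acceleration contribute comparably to the clustering.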
In preparation for this data story I came across the following article on Business Insider: “These are the 11 best ‘Super Smash Bros. Ultimate’ characters, according to the world’s number-one ranked player”
Here’s an excerpt from the article:
And here is ZeRo being named the best overall player:
This sparked a thought. I haven’t done this on the Pancakes Analytics page yet, but typically you would put a k-means cluster into production and re-score your segments on an agreed-upon cadence. In this case I’ll treat the release of a new game as the cadence.
I’ll run a k-means clustering on the character attributes in the Wii U version, and then a k-means clustering on the same character attributes for the Switch version.
While going through this process I’ll only be including characters who were in both games and whose data is clean: i.e. all characters have a weight and all characters have available acceleration data. Sorry Inkling, you’re not in this segmentation.
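For readers who want to try this at home, here is a bare-bones sketch of the k-means idea (Lloyd’s algorithm) in pure Python. The two-dimensional points stand in for standardized character attributes and are entirely made up; the real analysis uses the full attribute set:

```python
import random

def kmeans(points, k, iters=50, seed=7):
    """Plain Lloyd's algorithm on equal-length feature tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    # Final labels after convergence.
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(d.index(min(d)))
    return centroids, labels

# Two obvious blobs stand in for (standardized) character attributes.
data = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(data, k=2)
```

In practice you’d reach for a library implementation, but the two alternating steps above are all k-means is: assign, then re-center, until the segments stop moving.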
Above are both segmentation cadences and characters will be split into these segment tiers:
- Floaters (Far right circle)
- Jack of all Trades (Smack in the middle)
- Dashers (Faster than your Jack of all Trades segment but not fast enough to be elite in that attribute)
- Air Tanks (The bottom left circle)
- Speedsters (Top left circle)
These aren’t ranked by which tier is best, but we can make some assumptions. In the Jack of All Trades segment, you most likely won’t win matches often, but you’ll be competitive.
Smash Bros. is a unique fighting game in that characters have a weight to them. Being lightweight does have its advantages, but the learning curve of playing as a Speedster might be too high-risk, high-reward for you.
As for the Floaters: if you select someone with a weight advantage in this group, you’re likely to win your match, but you have to master the move set (your smash move).
Air Tanks are, I think, a no-brainer for any skill level. If you want a high likelihood of lasting until time runs out, be an Air Tank (this won’t guarantee a win; that really depends on your competition).
I’m hoping this visual stood out to you, the reader: Ganondorf made a large leap from the Air Tanks to the Floaters. This speaks not only to Ganondorf but also tells you something about Bowser.
When I explain this to clients and those wanting to learn about a particular dataset, this is how it translates:
Ganondorf has more in common with Jigglypuff than he does with Bowser. The reason is that he’s quicker and adapts better to aerial attacks and falling than Bowser does.
On the flip side, I can also say that in Super Smash Bros. Ultimate, Bowser more accurately represents how he’s portrayed in the Super Mario franchise.
Neither one of these characters was “nerfed”; they were only re-calibrated so there’s a distinct difference between the two.
What do you do with this information? If your main is a Floater, Ganondorf would be a good transitional character if you were looking to play as a character with more weight. Or say you always play as an Air Tank because you assume anyone who mains Kirby shouldn’t be playing Smash Bros.: then Ganondorf is a good transitional main for when you eventually give in and select Kirby, “by accident”.
Below are the segments and a brief overview of the characters within each segment:
This segment has high variability, and you can see this from the oblong shape of the circle. Ganondorf and Jigglypuff are driving this shape; although they are in the same segment and are more similar to each other than to characters in other segments, they are the furthest apart within this segment.
Now hold up… wait a second. Didn’t I just try to prove a point about how similar they are? Yes, but in relation to who’s more similar to Ganondorf: Jigglypuff or Bowser. If I posed the question of who is more similar to Ganondorf, Jigglypuff or Kirby… the answer is Kirby.
On average, this group is the slowest by run speed and the lightest by weight… they Float.
This segment is the middle of everything; there’s no uniquely distinct trend in their data. Playing as Pikachu vs. Mega Man would have some game-play differences, but statistically speaking you are starting with the same underlying stats.
If you’re new to the series, this is a good group to start with… they’re a Jack of All Trades.
The Dasher segment is very similar to the Jack of All Trades segment, only slightly faster. Playing in this group, you could potentially do more harm than good if you’re selecting it because you want to stay middle ground. You could… Dash yourself off the stage.
Air Tanks are fast in their aerial attacks… and the heaviest? I anticipate this group will be re-calibrated by the next release. In other words… Bowser has no business being as effective as he is in the air at his weight; normally these two variables don’t correlate. I guess all that time battling a plumber who can flip and jump is finally paying off.
This is your high-risk, high-reward group. Characters in this segment are the fastest and the lightest. I personally am awful playing as Sonic; he’s too fast for my playing level, but a seasoned player could probably mop the floor with Sonic.
So who should be your main? For this part I rely on industry knowledge as well (ZeRo’s tiers as the dependent variable). I’ll build a propensity score with the following independent variables:
- Change in air acceleration
- Base air acceleration
- Base speed in the air
- Base Run Speed
- Character Weight
- Ultimate Smash Bros. Cluster
- Wii-U Smash Bros. Cluster
The output will give me the likelihood that ZeRo would rank the character as a top tier character. The highest influencer on predictability was:
Change in air acceleration
The lowest influencers were:
Base air acceleration
Ultimate Smash Bros. Cluster (this highlights a bias toward the Wii U stats influencing ZeRo’s rankings)
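A propensity model like this is typically a logistic regression whose output is the probability of the positive label (here, a top tier ranking). The sketch below shows the mechanics with entirely hypothetical data: two made-up standardized features and invented labels, not the real attribute set or ZeRo’s actual tiers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression; returns weights, bias."""
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            # Prediction error drives the gradient for every weight.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_feat):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical rows: [base run speed, character weight] (standardized),
# invented label: 1 if the character was ranked top tier.
X = [[1.2, -0.5], [0.9, -0.8], [1.1, -0.2],
     [-0.7, 1.1], [-1.0, 0.9], [-0.5, 0.4]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)

# Propensity for a new fast, light character.
score = sigmoid(sum(wj * xj for wj, xj in zip(w, [1.0, -0.6])) + b)
```

The fitted coefficients play the role of the “influencers”: the larger a feature’s weight in the model, the more it moves the predicted likelihood.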
Drum roll please….
You should have your main be one of the above three. This is the data solution to selecting your main.
Really looking forward to the comments section on this one 🙂
Before I dive into this week’s data story, let me state why I love the Nintendo Switch. I personally feel there’s a need for video games to be a social event, and couch co-op is a must-have feature. The Nintendo Switch offers several games which meet this need.
My family loves playing video games and most of all we love playing video games together.
Of all the Nintendo games I’ve grown up on and played over the years, Mario Kart is by far one of my favorites. I’ll admit my wife shows me how it’s done.
What I do find interesting about the Nintendo Switch is the Joy-Con controllers: there’s a learning curve (though they’re a huge improvement on the Wii Remote), and most veteran gamers prefer an alternative.
One alternative is the wireless controller, very similar to the Xbox controller format. I picked up the Yoshi version for my wife; she loves it and personally feels it improves her game-play.
I thought it was time to put this notion to the test: what impact, if any, does a wireless controller have on game-play performance versus using a Joy-Con?
Mario Kart seemed like the logical choice for this experiment: it’s a multiplayer game, you can standardize your users (via ride type and modifications), and performance is measured in a continuous variable of points.
A total of 8 trials were run under these conditions:
- 50cc length races
Halfway through the trials, one gamer switched to the wired controller (test group) while the other gamer stayed on the single Joy-Con (control group).
Results were documented and the ETL process began; points scored each race would be used as the key performance indicator.
I next ran a linear regression (great for evaluating an A/B test), with my dependent variable being the points scored after the event (introducing the wired controller) and two independent variables: Treatment and Pre Points Scored.
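As a rough sketch of the mechanics, the regression can be fit via the normal equations. The scores below are made up (not the actual trial data): I simulated an exact 6-point treatment effect on top of a one-point baseline drift, which the fit then recovers:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(rows, y):
    """Least squares via the normal equations (X'X) beta = (X'y)."""
    X = [[1.0] + row for row in rows]   # prepend an intercept column
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
           for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical per-race rows: [treatment (1 = wired controller), pre points].
rows = [[1, 42], [1, 45], [1, 40], [1, 48],
        [0, 43], [0, 44], [0, 41], [0, 47]]
# Simulated post scores: post = 1 + 6 * treatment + pre (made up).
post = [49, 52, 47, 55, 44, 45, 42, 48]
intercept, treat_effect, pre_coef = ols(rows, post)
```

The coefficient on Treatment is the estimated controller effect, with Pre Points Scored controlling for each gamer’s baseline skill; on the simulated data the fit recovers the 6-point effect exactly.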
In this model I wasn’t concerned with the r-squared value or the significance level of each variable; the sample was not large enough, as this was a closed-circuit, small-market test.
The model itself did show to be significant, which is a good indicator that I can continue with the results. Evaluating my Q-Q plot, I see the model fits well; the trend goes through all the data points.
In my summary fit I notice there is a positive relationship between treatment (group) and post points scored. At first glance this says you improve your Mario Kart game-play performance if you play with a wireless controller.
To complete this story, I want to use my upper confidence bound to know by how many points, and whether that’s enough to move me up the rankings.
Using a wired controller has the potential to increase a gamer’s point performance by over six points each race.
The average points differential between race placement is 1.2 points. This 6-point increase is enough to move you roughly 4 places, depending on your historic placement.
What have we learned from diving into the Mario Kart Data?
The controller you play with matters: switching to a traditional wired controller can potentially improve your point score by 6.5 points, which, depending on your average race placement, can move you up 4 places in the final standings.
Observing the CPU-controlled racers, Shy Guy performed the best with an average final placement of 2.8. The heavy class was the weakest group overall, but without Bowser it could have been worse: Bowser’s average final placement was 4th.
After you have consumed this meal, I hope you take these findings and enjoy your next Mario Kart Grand Prix. Also as always enjoy the featured pancake recipe below!
It all started with a mouse, and that mouse is turning 90 this year: Mickey Mouse has made his impact on society. To celebrate, what better meal to cook up this week than Walt Disney World data? I’ll be challenging myself to identify influencers on the Parks and Resorts Division’s yearly revenue.
My first approach was to look at what happens during the year the revenue occurs:
The number of Animated Movies released by Disney
The number of Animated Movies featuring Disney Princesses
The number of attractions added at all four main theme parks, and then parsing this information out by individual park
The first run was not an effective model: most of the variability in the data was not accounted for, and there were no independent variables of significance.
So my next approach was: how do I capture word of mouth on movies and attractions? And secondly, how do I incorporate the point at which Disney starts charging admission for children (currently, children 2 years and younger enter the parks for free)?
To knock out two birds with one stone, I settled on testing a rolling 3-year average of all behaviors. The results were very favorable: 67% of the variability is explained, and I have interesting significant independent variables to make a telling data story.
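A trailing 3-year average is straightforward to compute. Here’s a minimal sketch with hypothetical yearly release counts (illustrative only, not the actual Disney data):

```python
def rolling_mean(values, window=3):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)           # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

# Hypothetical yearly counts of animated releases.
releases = [2, 1, 3, 2, 4]
smoothed = rolling_mean(releases)      # each year now reflects 3 years of behavior
```

Each smoothed value blends the current year with the two before it, which is exactly how the model lets word of mouth (and aging toddlers) carry over into later years’ revenue.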
If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q-Q plots.
There are some curls at the tails, but most of the data fits well, so there won’t be a need to run a more complex model.
Let’s take a bite into the initial read before assessing the financial impact of all these fun Disney variables.
I’ll caveat this: significance is in the eye of the beholder and is up to the interpretation of the storyteller and data scientist. The first read shows the 3-year average of total park attractions having the strongest relationship to revenue; inversely, the number of attractions opened at EPCOT is significant but has a negative impact on yearly revenue.
I’ll dive more into the individual impacts later, but I want to utilize my upper and lower bounds.
The output of this model shows the impact in millions USD. Analyzing the cone, this is where our fairy tale begins to take shape.
Potentially, the average number of attractions introduced at all four major parks can drive $1.6 million USD.
With the Magic Kingdom driving most of this impact:
New attractions added at the Magic Kingdom can drive $4.5 million USD.
The average number of Disney Princess movies has more of an impact than counting any Disney animated release as the only criterion. What’s intriguing is the variability of our upper and lower bounds: there is a possibility of a loss of $50.6M.
What could be driving the inverse effect? Multiple reasons:
1. The quality of the movie releases
2. The presence, or in this case absence, of a meet-and-greet at the theme park
3. The global economic climate (less international travel impacts this!)
What have we learned from diving into the Walt Disney data?
There’s a reason WDW is investing in new IP-based rides at Epcot and Hollywood Studios: they’ve been launching rides that are out of step with their audience, and those rides currently drive the lowest impact on yearly revenue. I anticipate Epcot will see steady growth in impact once Guardians of the Galaxy and Ratatouille open and a few years have passed.
Finally, a Princess animated movie drives $1 million USD more than a regular animated movie release.
What could be the reasoning? I’d guesstimate that rides introduced at the Magic Kingdom (which drive +$4.5M USD) are having a downstream effect on the Princess impact: most Princess interactions take place at the Magic Kingdom.
After you have consumed this meal, I hope you take these findings and wish Mickey Mouse a Happy 90th Birthday! 🙂 Also, as always, enjoy the featured pancake recipe below!
The format of this post will be slightly different from previous recipes. Think of it as a Yelp review: I’ll be sharing the paper I presented at the SESUG 2018 SAS Conference. This will be wordier than usual, but I’ll start with the recipe card per usual and then we’ll dive deep into the paper. At the end of this post you’ll have a full belly of a new approach to building an NBA team, one that can be applied to one of my favorite game modes in the 2K series… Franchise mode.
SESUG Paper 234-2018 Data Driven Approach in the NBA Pace and Space Era
Whether you’re an NBA executive, a Fantasy Basketball owner, or a casual fan, you can’t help but ask: who is a top tier player? Who are currently the best players in the NBA? How do you compare a nuts-and-glue defensive player to a high-volume scorer? The answer to all these questions lies within segmenting basketball performance data.
A k-means cluster is a commonly used unsupervised machine learning approach to grouping data. I will apply this method to human performance. This case study will focus on NBA individual performance data. The goal at the end of this case study is to apply a k-means cluster to identify similar players to use in team construction.
My childhood was spent in Brooklyn, New York. I’m a die-hard New York Knicks fan. My formative years were spent watching my favorite team get handled by arguably the greatest basketball player of all time, Michael Jordan. At several moments throughout my life, and to this day, it crosses my mind: if only we had that player on our team. Over time I have come to terms with the fact that we would never have Michael Jordan or a player of his caliber, but wouldn’t it be interesting if an NBA team could find complementary parts or look-alike players? This is why I’m writing a paper about finding these look-alikes, these diamonds in the rough, or as the current term goes, “Unicorns”. Let’s begin this journey together in search of a cluster of basketball unicorns.
WATCHING THE GAME TAPE
What do high-level performers have in common? In most cases you’ll find they study their sport, study their own game performance, study their opponents, and study the performance of other athletes they strive to be like. The data analyst equivalent of watching game tape is to gather as many independent and dependent variables as possible to perform an analysis. For the NBA data used in this k-means cluster analysis, I took the approach of asking what contributes to success in winning a game. Outscoring your opponent was a no-brainer starting point, but I needed to dig deeper: in how many ways, and by what methods, can you outscore an opponent? The avid basketball fan would agree that how a player scores a basket (i.e. a field goal vs. behind the three-point line) determines how they fit into an offensive scheme and defines their game plan.

Beyond scoring there are other, equally important contributors to basketball performance. This is where I began to think about how many hustle and defensive metrics I could gather (i.e. rebounds, assists, steals, blocks, etc.). Could I normalize all of these metrics to get a baseline on player efficiency and, more importantly, effectively identify an individual player’s role in a team’s overall performance? To normalize my metrics I decided to produce my raw data on a per-minute level; this way I wouldn’t show bias toward high-usage or low-usage players. To identify how a player fits into an offensive scheme and their scoring tendencies, I calculated at an individual level what percent of points scored comes from each method of scoring (i.e. free throws, three pointers made, two point field goals). Once I went through all of my data analyst game tape, I was ready to hold practice and cluster.
Practice makes perfect, but everything in moderation (i.e. the New York Knicks of the 1990s overworked themselves during practice and would lose steam in long games). Just as I wouldn’t want to over-fit a model on sample data, I won’t get too complicated with my approach to standardizing my variables. Utilizing proc standard, I’ll standardize my clustering variables to have a mean of 0 and a standard deviation of 1. After standardizing the variables I’ll run the data analyst version of a zone defense (proc fastclus, with a macro to create cluster solutions from 1 through 9). I don’t anticipate using a 9-cluster solution once I run the game plan and evaluate my game-time results; ideally I want to keep the number of clusters small and manageable while still showing a striking difference between the groups. To decide on a final solution, I’ll extract the r-square values from each cluster solution and merge them to plot an elbow curve. Using proc gplot to create my elbow curve, I’ll want to observe where the line begins to curve (creating an elbow). Finally, before we’re kicked off the court for another team’s practice, I’ll use proc anova to validate my clusters. As a validation metric I’ll use the variable ttl_pts_per_m; this should help identify the difference between a team’s “go-to” option and a player who is, at best, a complementary piece.
RUNNING GAME PLAN AND GAME TIME RESULTS
A k-means cluster analysis was conducted to identify underlying subgroups of National Basketball Association athletes based on the similarity of their responses on 11 variables representing characteristics that could have an impact on 2016-17 regular season performance and play type. Clustering variables included quantitative variables measuring:
- perc_pts_ft (percentage of points scored from free throws)
- perc_pts_2pts (percentage of points scored from 2 pt field goals)
- perc_pts_3pts (percentage of points scored from 3 pt field goals)
- ‘3pts_made_per_m’N (3 point field goals made per minute)
- reb_per_min (rebounds per minute)
- asst_per_min (assists per minute)
- stl_per_min (steals per minute)
- blk_per_min (blocks per minute)
- fg_att_per_m (field goals attempted per minute)
- ft_att_per_min (free throws attempted per minute)
- fg_made_per_m (field goals made per minute)
- ft_made_per_m (free throws made per minute)
- to_per_min (turnovers per minute)
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data was randomly split into a training set that included 70% of the observations (N=341) and a test set that included 30% of the observations (N=145). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve (see Figure 1 below) to provide guidance for choosing the number of clusters to interpret.
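For readers outside SAS, the r-square plotted on the elbow curve is simply the share of total variance explained by a given cluster assignment. A minimal Python sketch, with toy 2-D points standing in for the standardized clustering variables:

```python
def cluster_rsquare(points, labels):
    """r2 = 1 - SS_within / SS_total: the per-k statistic an elbow curve plots."""
    n = len(points)
    dims = range(len(points[0]))
    grand = [sum(p[d] for p in points) / n for d in dims]
    ss_total = sum(sum((p[d] - grand[d]) ** 2 for d in dims) for p in points)
    ss_within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        cen = [sum(m[d] for m in members) / len(members) for d in dims]
        ss_within += sum(sum((m[d] - cen[d]) ** 2 for d in dims)
                         for m in members)
    return 1 - ss_within / ss_total

pts = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0)]
r2_good = cluster_rsquare(pts, [0, 0, 1, 1])   # matches the true blobs
r2_bad = cluster_rsquare(pts, [0, 1, 0, 1])    # mixes the blobs
```

Computing this for each k and plotting it against k reproduces the elbow curve: r-square always rises as clusters are added, and the “elbow” marks where the gains flatten out.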
Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatter-plot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in cluster 3 are the most densely packed, with relatively low within-cluster variance, and did not overlap much with the other clusters. Cluster 1’s observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 2 have relatively low within-cluster variance, but a few observations overlap.
The means on the clustering variables showed that athletes in each cluster have uniquely different playing styles.
These athletes have high values for percentage of points from free throws, moderate values for percentage of points from 3 point field goals, and low values for percentage of points from 2 point field goals. They attempt more field goals per minute and free throws per minute, make more 3 point field goals per minute, and have the highest value for assists per minute; these athletes are focal points of a team’s offensive strategy.
Athletes in this cluster: Kevin Durant, Anthony Davis, Stephen Curry
These athletes have extremely high values for percentage of points from 2 point field goals, moderate values for percentage of points from free throws, and extremely low values for percentage of points from 3 point field goals. These athletes rarely make perimeter shots and have low values for assists.
Athletes in this cluster: Rudy Gobert, Hassan Whiteside, Myles Turner
These athletes have high values for percentage of points from 3 point field goals, and low values for 2 point field goals and free throws. These athletes stay on the perimeter (high values for 3 point field goals made) but are a secondary option at best, as observed by low field goal attempts per minute.
Athletes in this cluster: Otto Porter, Klay Thompson, Al Horford
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on total points scored per minute (ttl_pts_per_m). A Tukey test was used for post hoc comparisons between the clusters. The results indicated significant differences between the clusters on ttl_pts_per_m (F(2, 340)=86.67, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on ttl_pts_per_m, with the exception that clusters 2 and 3 were not significantly different from each other. Athletes in cluster 1 had the highest ttl_pts_per_m (mean=.541, sd=0.141), and cluster 3 had the lowest (mean=.341, sd=0.096).
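The one-way ANOVA F statistic reported above is the ratio of between-cluster to within-cluster mean squares. A minimal Python sketch, using hypothetical points-per-minute samples (not the actual cluster data):

```python
def anova_f(groups):
    """One-way ANOVA F: between-group over within-group mean squares."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical points-per-minute samples for three clusters.
cluster1 = [0.55, 0.50, 0.60, 0.52]
cluster2 = [0.42, 0.40, 0.45, 0.43]
cluster3 = [0.33, 0.35, 0.30, 0.36]
F = anova_f([cluster1, cluster2, cluster3])   # large F => cluster means differ
```

A large F relative to its critical value says the clusters really do differ on the validation metric, which is exactly what proc anova confirmed for ttl_pts_per_m.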
Using a k-means cluster is a data-driven approach to grouping basketball player performance. This method can be used in constructing a team when the salary budget is constrained. The elephant in the room is that this is essentially human behavior, which is why the validation step using proc anova is critical. The approach I’ve applied to the NBA data is an unsupervised machine learning approach.