Classification Tree, E-Sports, K-Means Clustering, Logistic Regression, NBA2k, nintendo, Overwatch, Propensity Modeling, Regression Modeling, Super Mario, Tree Based Models

TBCC 2019 Player One, Power Ups, & Probabilities: Panel Recap



This panel was held on: Saturday, August 3, 2019 at 3 PM – 4 PM
And here was the pitch:
Join the data science debate pitting the most critically acclaimed video games against the nostalgia of the games we grew up with. The data science team at Pancake Breakfast: A Stack Of Stats will be serving up supporting data and driving the discussion for both sides of the debate. Panelists will debate: greatest video game of all time, or overrated?
The panelists were Stephen (an indie game developer) and me.  Obviously Steve had the advantage going into this debate, but it was really fun and the audience was very engaged; it was probably one of our best Q&A sessions of all time.

Video Game Recommendation Engine – This is how we do it

These are data science panels, so we started this one off with a video game recommendation engine.  I had Stephen fill out a survey prior to the panel, and from his results I built a recommendation model. The goal was to select games he has not played (he’s played a lot of games, so not an easy task) and would rate above average.


How are we going to build this recommendation?  Through Propensity scoring!

A propensity score is the estimated probability that a data point will have the outcome being predicted. Here, that outcome is the panelist rating a game above average.

  • One of our panelists completed a survey and had to rank video games they have played
  • Their responses were linked to our ancillary data (critics score, user score, and genres)
  • Our model shot out a score between 0 and 1. The closer to 1 the more likely this game would be enjoyed by the panelist.
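For readers who want to see the mechanics, here is a minimal sketch of propensity scoring with a logistic regression fitted by plain gradient descent. The post does not say which model or software produced the panel's scores, so the model choice, the feature layout, the survey rows and the game names below are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic-regression weights with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def propensity(x, weights, bias):
    """Score between 0 and 1: closer to 1 means more likely to be enjoyed."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical survey rows: [critic score / 100, user score / 10, is action adventure]
played = [
    [0.95, 0.90, 1], [0.90, 0.85, 1], [0.70, 0.88, 1], [0.92, 0.60, 0],
    [0.85, 0.55, 0], [0.60, 0.50, 0], [0.75, 0.92, 0], [0.88, 0.45, 1],
]
liked = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = rated above the panelist's average

weights, bias = fit_logistic(played, liked)

# Hypothetical games the panelist has not played yet.
unplayed = {"Game A": [0.80, 0.93, 1], "Game B": [0.94, 0.50, 0]}
scores = {name: propensity(x, weights, bias) for name, x in unplayed.items()}
```

Sorting `scores` from high to low gives the recommendation order. In this made-up data the panelist follows user scores more than critic scores, so Game A outranks Game B.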


Video Game Recommendation Engine – The Output


For this panelist, the survey told us this about their gaming preferences:

They value User Score more than Critics Score.

Their preferred genre is Action Adventure.

Their preferred platform is the PS2.



Video Game Debate: Overview


On the screen will be a video game, with some profiling data.

Panelists will debate the impact, perceived value, and replay value of the featured game.

The crowd will decide who made the better argument.

This is the meat of the panel. The IGN review headline and rating were also on the screen, and Stephen and I took turns arguing whether the game deserved its rating.


Goldeneye 007


Stephen went first and argued that Goldeneye does not deserve this high a rating; his key point was replay value.  I argued that it should be valued as of its time of release.  The crowd sided with Stephen.


Pokémon Gold & Silver


I went first this round and argued for the rating; this was a very pro-Pokémon crowd.  Stephen brought up good points on where he thinks the series should go, and why adding another region is not the answer.  The crowd sided with me.


Ultimate Marvel vs. Capcom 3


Stephen chose to argue for this game; I wanted to throw a curve-ball into this debate.  It would have been very obvious, and too easy, if we had chosen Marvel vs. Capcom 2.  I argued that it wasn’t even the best in the series, and that the best in the series is actually X-Men vs. Street Fighter.


Halo Combat Evolved


Stephen was on team Halo for this one. I love Halo as well, but the crowd did not.  That was a shock to us, but maybe Halo doesn’t have replay value?  Or maybe everyone is getting tired of the series.


Battle Dome: Overview


Two games go in… only one comes out

Panelists will argue for a game, they cannot both argue for the same game

The crowd decides who had the best argument

This was a fun and challenging section of our panel.  I won’t go into details on this section, but I do want to try something out.  As a test to see who is interacting with my page by reading the data stories, I have a special giveaway.

Here are the rules: you must have an Instagram account, and you must be following my Instagram account: @pancake_analytics.

To enter, you need to read through the Battle Dome section, screenshot your favorite match-up and post it to Instagram.

In this post I want you to tag @pancake_analytics and caption the post with “Who do you have in this Battle Dome match-up?”.

This giveaway will end on December 31st, 2019 and the winner will receive a GameStop gift card from me, to use on your next video game purchase in the new year!

Here’s the disclaimer I have to post:

Per Instagram rules, we must mention this is in no way sponsored, administered, or associated with Instagram, Inc. By entering, entrants confirm they are 13+ years of age, release Instagram of responsibility, and agree to Instagram’s terms of use. Good luck!!!!!

Here are the Battle Dome match-ups:


[Images: three Battle Dome match-up slides]


I want to personally thank everyone who attended the panel in Tampa, at the Tampa Comic Convention.  I look forward to meeting again in 2020.



K-Means Clustering, Logistic Regression, nintendo, Propensity Modeling, Regression Modeling, Super Mario

TBCC 2019 Smash Brothers, Segmentation & Strategy: Panel Recap



This Panel was held on:

Friday, August 2, 2019 at 7:30 PM – 8:30 PM

During the Tampa Bay Comic Convention 2019, held at the Tampa Convention Center.

The Panelists were:

Tom Ferrara (@pancake_analytics), Kalyn Hundley (@kehundley08), Andy Polak (@polak_andy)


I want to take a quick moment to discuss the panelists.  I love giving as many different points of view as possible at these data science panels.  Without this variety of viewpoints it’s more of a lecture and less of a discussion.  This mix of panelists gave the audience the data science view, the tech industry view and the biological sciences view.  The best part is that Smash Brothers brought us all together.


Changing the Tier Conversation


One of the main objectives of this panel was to get a discussion going on tier selection in Smash: how do we ground tier selection in data science, and how do we validate our findings against one of the best players in the game?

A k-means cluster uncovers trends within our Smash Brothers data so we can understand the relational similarities and differences on key in-game attributes.

The more clusters we use, the more detailed our picture becomes and the deeper we can understand the pros and cons of each main selection, though each added cluster also makes the segments harder to interpret.



A brief overview of a k-means cluster:

  • Standardize your variables
  • Analyze your elbow curve
  • Validate your clusters
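The first two steps above can be sketched in a few lines of plain Python. This is only an illustration: the fighter stats are invented, the farthest-point starting rule is a choice made here to keep the run repeatable, and the panel's actual clustering was not necessarily built this way.

```python
from statistics import fmean, pstdev

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(row, centroids):
    """Index of the centroid closest to row."""
    return min(range(len(centroids)), key=lambda j: sq_dist(row, centroids[j]))

def kmeans(rows, k, iterations=25):
    """Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    for _ in range(iterations):
        labels = [nearest(r, centroids) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [fmean(col) for col in zip(*members)]
    labels = [nearest(r, centroids) for r in rows]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, wss

# Hypothetical fighters: [run speed, weight, air speed]
fighters = [
    [1.20, 70, 1.30], [1.25, 72, 1.28], [1.22, 71, 1.32],     # slow and light
    [1.70, 120, 0.95], [1.75, 118, 0.98], [1.72, 122, 0.96],  # heavy
    [2.40, 80, 1.15], [2.45, 82, 1.12], [2.42, 79, 1.14],     # fast
]
scaled = standardize(fighters)

# Within-cluster sum of squares for k = 1..5: the values behind an elbow curve.
wss_by_k = {k: kmeans(scaled, k)[1] for k in range(1, 6)}
labels, _ = kmeans(scaled, 3)
```

Plotting `wss_by_k` against k gives the elbow curve: the within-cluster sum of squares falls steeply up to three clusters in this toy data and then flattens. The validation step is left out of this sketch.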

Treat each game release as a new product launch or a change in the market.

You would re-score your data to understand the current market, which lets you track how characters migrate between clusters and how the meta-game has changed.



We end up with five unique clusters:

Floaters:

This group is the slowest by run speed and lightest by weight.

Jack Of All Trades:

They are middle of the pack on everything; there is no distinct trend.

Dashers:

Like the Jack of All Trades group but faster.

Air Tanks:

Fast in aerial attacks and the heaviest of the characters.

Speedsters:

This group is the fastest and the lightest.



A propensity model is a statistical scorecard that is used to predict the behavior of your customer or prospect base. Propensity models are often used to identify those most likely to respond to an offer, or to focus retention activity on those most likely to churn.

So who should be your main?  In this segment I rely on industry knowledge as well (ZeRo’s tiers as the dependent variable).   I’ll build a propensity score with the following independent variables:

  • Change in air acceleration
  • Base air acceleration
  • Base speed in the air
  • Base Run Speed
  • Character Weight
  • Ultimate Smash Bros. Cluster
  • Wii-U Smash Bros. Cluster



What makes these three (Wario, Palutena and Yoshi) stand above the crowd?

They are middle ground on weight and fast air accelerators.

What are the differences between the three?

Wario has a slow run speed.

Palutena is the lightest.

Yoshi is the middle ground of this group.


The Curious Case of Ganondorf


Ganondorf has more in common with Jigglypuff than he does with Bowser.

The reason is that he’s quicker and adapts better in aerial attacks and in falling than Bowser does.

On the flip side, I can also say that Bowser in Super Smash Bros. Ultimate more accurately represents how he’s viewed in the Super Mario franchise.


Game Time: Name that segment: Overview


I personally feel one of the best ways to reinforce learning is through a game.  For this panel I decided to reinforce the k-means segmentation by having volunteers guess which segment the 3 characters on the screen fall into.

Here was the overview:

5 Volunteers

On the screen will be 3 characters

All 3 characters belong to the same segment

Volunteers will do their best to convince the panel of which segment the characters fall into:

  • Floaters
  • Jack of All Trades
  • Dashers
  • Air Tanks
  • Speedsters

For participating, volunteers receive a fabulous prize.

For this particular game the prize was an amiibo of their choice that works with Smash Ultimate for the Nintendo Switch.


I want to personally thank everyone who attended the panel in Tampa, at the Tampa Comic Convention.  I look forward to meeting again in 2020.



Classification Tree, Game of Thrones, Tree Based Models

Recipe: 009 Game of Thrones Survival of the Fittest



“When you play the game of thrones, you win or you die.” — Cersei

Let’s bring this quote to life in what I like to call a survival tree of the fittest.  This week’s analysis will focus on character survival in Game of Thrones.  Chow down and enjoy!



Winter is coming and you’d like to know your chances of survival in the Game of Thrones universe.

Let’s learn from those who have survived to this point and those who have met an unkind fate.

To do this I’ll build a classification tree with my event set to whether the character is alive (1 for yes, 0 for no).  At each step a classification tree splits the data on the variable that best separates survivors from non-survivors.  When we reach my tree visualization, I’ll assign the color red to leaves where a character death is highly probable.  Green leaves will indicate it’s highly probable a character survives… as long as all of the criteria are met.

Think of this tree as a really morbid family tree, but since the data is Game of Thrones it fits right into place.

The variables readily available to me (hopefully they have importance) are as follows:

  • House Affiliation
  • Member of nobility
  • Marital Status
  • Gender
  • Family history of deaths
  • Popularity


From the initial read, I see that whether a character is popular among fans and whether they are male hold the highest importance in determining survival.

Also, using the variables I have available, the tree classifies about 75% of characters correctly (a 25% misclassification rate).

Let’s say you moved to Westeros: out of the gate you have a 25.4% chance of meeting your end.  At those odds I’m taking my chances, but I should stay under the radar as much as I can, because the data warrants it.

If you become a popular character or are an integral part of the story, your death becomes more meaningful and your probability of survival is worse than a coin flip.

So let’s say you’re a likable character (you can’t help it). Not all is lost, as long as you’re female.  The highest survival rate belongs to the popular female character group.  This is a classic tale of high risk, high reward.


A classification tree is a great way to visualize your data, and now I’ll walk us through this Game of Thrones survival tree.

Let’s start at the very top: the tree assumes everyone has a 75% chance of survival.  As the tree splits, the interesting part begins and our data story starts to unfold.

If you are a popular character you flow to the left side of the tree, and your survival rate of 75% drops to 48%.

Staying on the left side of the tree, there is another important split: are you male or female?  Female characters have a higher probability of surviving (87% if you’re popular and 76% if you’re under the radar).

If you’re male and you’re popular, you have a 42% chance of survival (we’re looking at you, Peter Dinklage).

Now here’s the largest caveat to take with this classification tree: I’m assuming it will no longer be relevant after the final season.  Winter is coming, and most likely our characters will meet their end at the hands of White Walkers.



What have we learned from diving into the Game of Thrones Data?

Everyone starts off at a 75% survival rate, and as your popularity grows your survival rate lessens by 27 percentage points, to 48%.  If you’re also male, your survival rate falls to 42%, which is 33 points below where you started.  If you’re a popular female character, your survival rate is 45 points higher than that of your male counterparts (87% versus 42%).
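The gaps quoted in this recap are differences between the survival rates read off the tree (75%, 48%, 42% and 87%), so they are percentage points rather than relative changes. A small sketch of the arithmetic, with helper names invented here for illustration:

```python
baseline = 0.75        # survival rate at the top of the tree
popular = 0.48         # popular characters
popular_male = 0.42    # popular and male
popular_female = 0.87  # popular and female

def point_gap(a, b):
    """Difference between two rates, in percentage points."""
    return round((a - b) * 100, 1)

def relative_change(new, old):
    """Change from old to new, as a percent of old."""
    return round((new - old) / old * 100, 1)

popularity_gap = point_gap(popular, baseline)                  # -27.0 points
popular_male_gap = point_gap(popular_male, baseline)           # -33.0 points
female_vs_male_gap = point_gap(popular_female, popular_male)   # 45.0 points

# The same comparisons expressed as relative changes look quite different.
popularity_relative = relative_change(popular, baseline)                 # -36.0 percent
female_vs_male_relative = relative_change(popular_female, popular_male)  # 107.1 percent
```

Read in relative terms, a popular female character is a little more than twice as likely to survive as a popular male one.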

An interesting tidbit…If you become popular and you are a female (hopefully the mother of dragons) you boast the highest survival rate of anyone in this universe, 87%.

 

After you have consumed this meal, I hope you take these findings and enjoy your episode of Game of Thrones.  Also, as always, enjoy the featured pancake recipe below!



https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki



disney, Mickey Mouse, Regression Modeling, Theme Parks

Recipe: 006 Walt Disney World Parks and Resorts Revenue Influencer


It all started with a mouse.  This mouse is turning 90 this year, and Mickey Mouse has made his impact on society.  To celebrate, what better meal to cook up this week than Walt Disney World data?  I’ll be challenging myself to identify influencers on the Parks and Resorts Division’s yearly revenue.




My first approach was to identify what happens during the year the revenue occurs:

The number of animated movies released by Disney

The number of animated movies featuring Disney Princesses

The number of attractions added at all four main theme parks, with this information then parsed out by individual park

The first run was not an effective model: most of the variability in the data was not accounted for, and there were no independent variables of significance.

So my next approach was to ask: how do I capture word of mouth on movies and attractions?  Secondly, how do I account for when Disney starts charging admission for children (currently, children 2 years and younger enter the parks for free)?

To kill two birds with one stone, I settled on testing a rolling 3-year average of all behaviors.  The results were very favorable: 67% of the variability is explained, and I have interesting independent variables of significance to make a telling data story.
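As a quick illustration of the rolling 3-year average, here is a minimal sketch. It assumes a trailing window (the current year plus the two before it), which the post does not spell out, and the yearly attraction counts are made up.

```python
def rolling_mean(values, window=3):
    """Trailing moving average; entries without enough history are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window: i + 1]) / window)
    return out

# Hypothetical counts of new attractions opened per year.
years = [2010, 2011, 2012, 2013, 2014, 2015]
new_attractions = [2, 0, 4, 1, 1, 3]

smoothed = rolling_mean(new_attractions)
by_year = dict(zip(years, smoothed))
```

Each smoothed value then stands in for the raw yearly count as a model input, so a ride that opened two years ago still contributes to this year's figure.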



If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q-Q graphs.

There are some curls at the tails, but most of the data fits well, so there won’t be a need to run a more complex model.
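For anyone curious how such a graph is built, here is a minimal sketch that pairs sorted model residuals with the quantiles a normal distribution would produce. The residuals are invented, and the `(i + 0.5) / n` plotting position is one common convention chosen here, not necessarily the one behind the original chart.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(residuals):
    """Pair each sorted, standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    mean, sd = fmean(residuals), pstdev(residuals)
    observed = sorted((r - mean) / sd for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

# Hypothetical regression residuals, in millions USD.
residuals = [-3.1, -1.4, -0.8, -0.2, 0.1, 0.5, 0.9, 1.6, 2.4]
points = qq_points(residuals)
```

If the residuals are roughly normal, the points fall close to a straight diagonal line; curls at either end mean the tails are heavier or lighter than a normal distribution would give.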

Let’s take a bite into the initial read before assessing the financial impact of all these fun Disney variables.

I’ll caveat this: significance is in the eye of the beholder and is up to the interpretation of the storyteller and data scientist.  The first read shows the 3-year average of total park attractions having the strongest relationship to revenue; conversely, the number of attractions opened at EPCOT is significant but has a negative impact on yearly revenue.

I’ll dive more into the individual impacts later, but first I want to utilize my upper and lower bounds.



The output of this model shows the impact in millions USD.  Analyzing the cone, this is where our fairy tale begins to take shape.

Potentially, the average number of attractions introduced at all four major parks can drive in $1.6 million USD.

With the Magic Kingdom driving most of this impact:

New attractions added at the Magic Kingdom can drive in $4.5 million USD.

The average number of Disney Princess movies has more of an impact than an animated movie release alone.  What’s intriguing is the width between our upper and lower bounds: there is a possibility of a loss of $50.6M.


What could be driving the inverse effect?  Multiple reasons:

1. The quality of the movie releases

2. The presence, or in this case absence, of a meet and greet at the theme park

3. The global economic climate (less international travel impacts this!)



What have we learned from diving into the Walt Disney Data?

There’s a reason WDW is investing in new IP-based rides at Epcot and Hollywood Studios: the rides at those parks have become outdated with their audience, and they currently drive the lowest impact on yearly revenue.  I anticipate Epcot will see steady growth in impact once Guardians of the Galaxy and Ratatouille open and a few years have passed.

Finally, a Princess animated movie drives in 1 million USD more than a regular animated movie release.


What could be the reasoning?  I’d guesstimate that rides introduced at the Magic Kingdom (driving in +4.5M USD) are having a downstream effect on the Princess impact.  Most Princess interactions take place at the Magic Kingdom.

After you have consumed this meal, I hope you take these findings and wish Mickey Mouse a Happy 90th Birthday.  Also, as always, enjoy the featured pancake recipe below!



https://disneyworld.disney.go.com/



E-Sports, Logistic Regression, Overwatch

Recipe: 005 Overwatch League Inaugural Season Logistic Regression


I’m excited to tackle the Overwatch League and my first dig into E-sports in general.  I’ve attended several conventions, including gaming conventions, and I will get this out of the way now:

I thought I was decent at video games… these athletes have shown I’m a very casual player.  This is a good thing; it was a pleasure to witness their craft.

The focus of this week is the probability of an individual player making the playoffs.  Thrown into this meal are statistics based around player preferences and game-play performance.  To determine the variables for the final mix, I threw in some confounding factors and profiling stats before going very heavy on player performance.



https://overwatchleague.com/en-us/

https://playoverwatch.com/en-us/



K-Means Clustering, NBA2k

Recipe: 004 A Data Driven Approach During the NBA Pace and Space Era


The format of this post will be slightly different from previous recipes.  Think of this as a Yelp review: I’ll be sharing the paper I presented during the SESUG 2018 SAS Conference.  This will be wordier than usual, but I will start with the recipe card per usual and then we’ll dive deep into the paper.  At the end of this post you’ll have a full belly of a new approach to building an NBA team, one that can be applied to one of my favorite game modes in the 2K series… Franchise mode.




SESUG Paper 234-2018 Data Driven Approach in the NBA Pace and Space Era

ABSTRACT

Whether you’re an NBA executive, a Fantasy Basketball owner or a casual fan, you can’t help but begin the conversation: who is a top-tier player? Who are currently the best players in the NBA? How do you compare a nuts and glue defensive player to a high-volume scorer? The answer to all these questions lies within segmenting basketball performance data.

OVERVIEW

A k-means cluster is a commonly used unsupervised machine learning approach to grouping data. I will apply this method to human performance. This case study will focus on NBA individual performance data. The goal at the end of this case study will be to apply a k-means cluster to identify similar players to use in team construction.

INTRODUCTION 

My childhood was spent in Brooklyn, New York. I’m a die-hard New York Knicks fan. My formative years were spent watching my favorite team get handled by arguably the greatest basketball player of all time, Michael Jordan. At several moments throughout my life, and to this day, it has crossed my mind: if only we had that player on our team. Over time I have come to terms with the fact that we would never have Michael Jordan or a player of his caliber, but wouldn’t it be interesting if an NBA team could find complementary parts or look-alike players? This is why I’m writing a paper about finding these look-alikes, these diamonds in the rough, or as the current term goes, “Unicorns”. Let’s begin this journey together in search of a cluster of basketball unicorns.

WATCHING THE GAME TAPE

What do high level performers have in common? In most cases you’ll find they study their sport, study their own game performance, study their opponents and study the performance of other athletes they strive to be like. The data analyst equivalent of watching game tape is to gather as many independent and dependent variables as possible to perform an analysis. For the NBA data used in this k-means cluster analysis, I started from what contributes to winning a game. Outscoring your opponent was a no-brainer starting point, but I needed to dig deeper. In how many ways, and by what methods, can you outscore an opponent? The avid basketball fan would agree that how a player scores a basket (i.e. a two-point field goal vs. a shot from behind the three-point line) determines how they fit into an offensive scheme and defines their game plan. Beyond scoring there are other equally important contributors to basketball performance. This is where I began to think about how many hustle and defensive metrics I could gather (i.e. rebounds, assists, steals, blocks, etc.). Could I normalize all of these metrics to get a baseline on player efficiency and, more importantly, effectively identify an individual player’s role in a team’s overall performance? To normalize my metrics I made the decision to produce my raw data at a per-minute level; this way I wouldn’t show bias toward high-usage or low-usage players. To identify how a player fits into an offensive scheme and their scoring tendencies, I calculated at the individual level what percent of points scored comes from each method of scoring (i.e. free throws, three-point field goals, two-point field goals). Once I went through all of my data analyst game tape, I was ready to hold practice and cluster.
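The per-minute and scoring-mix features described in this section can be sketched as follows. The paper's work was done in SAS, so this Python version is only an illustration. The input field names and the season totals are hypothetical, and the sketch assumes field goals made already include three-pointers, as in a standard box score.

```python
def per_minute_profile(stats):
    """Turn season totals into per-minute rates and a scoring-mix breakdown."""
    minutes = stats["minutes"]
    ft_pts = stats["ft_made"]                               # 1 point each
    two_pts = 2 * (stats["fg_made"] - stats["three_made"])  # assumes fg_made includes threes
    three_pts = 3 * stats["three_made"]
    total = ft_pts + two_pts + three_pts
    return {
        "perc_pts_ft": ft_pts / total,
        "perc_pts_2pts": two_pts / total,
        "perc_pts_3pts": three_pts / total,
        "reb_per_min": stats["rebounds"] / minutes,
        "asst_per_min": stats["assists"] / minutes,
        "ttl_pts_per_m": total / minutes,
    }

# Hypothetical season totals for one player.
player = {
    "minutes": 2000, "fg_made": 500, "three_made": 150,
    "ft_made": 300, "rebounds": 400, "assists": 300,
}
profile = per_minute_profile(player)
```

Because every rate is divided by minutes played, a bench player and a starter with the same style of play end up with similar profiles.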

HOLDING PRACTICE

Practice makes perfect, but everything in moderation (i.e. the New York Knicks of the 1990s overworked themselves during practice and would lose steam in long games). Just as I wouldn’t want to over-fit a model on sample data, I won’t get too complicated with my approach to standardizing my variables. Utilizing proc standard, I’ll standardize my clustering variables to have a mean of 0 and a standard deviation of 1. After standardizing the variables I’ll run the data analyst version of a zone defense (proc fastclus, with a macro to create cluster solutions from 1 through 9 clusters). I don’t anticipate using a 9-cluster solution once I run the game plan and evaluate my game time results. Ideally I want to keep the number of clusters small and manageable while still showing a striking difference between the groups. To evaluate how many clusters to carry into a final solution, I’ll extract the r-square values from each cluster solution and then merge them to plot an elbow curve. Using proc gplot to create my elbow curve, I’ll want to observe where the line begins to bend (creating an elbow). Finally, before we’re kicked off the court for another team’s practice, I’ll use proc anova to validate my clusters. As a validation metric I’ll use the variable “ttl_pts_per_m”; this should help identify the difference between a team’s “go-to” option and a player who is more of a complementary piece at best.
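The r-square behind an elbow curve is the share of total variance that a cluster solution explains, one minus the within-cluster sum of squares divided by the total sum of squares. Here is a minimal Python sketch of that calculation; the paper itself used SAS procedures, and the four toy points and their cluster labels below are made up.

```python
from statistics import fmean

def sum_sq(rows):
    """Sum of squared distances from each row to the mean of the rows."""
    center = [fmean(col) for col in zip(*rows)]
    return sum((v - c) ** 2 for row in rows for v, c in zip(row, center))

def r_square(rows, labels):
    """Share of total variance explained by a cluster assignment."""
    within = 0.0
    for cluster in set(labels):
        within += sum_sq([r for r, lab in zip(rows, labels) if lab == cluster])
    return 1.0 - within / sum_sq(rows)

def marginal_gains(r2_by_k):
    """Extra r-square bought by each added cluster; the elbow is where this flattens."""
    ks = sorted(r2_by_k)
    return {k: r2_by_k[k] - r2_by_k[prev] for prev, k in zip(ks, ks[1:])}

# Toy data: two tight pairs of points that sit far apart from each other.
rows = [[0, 0], [0, 2], [10, 0], [10, 2]]
r2_by_k = {
    1: r_square(rows, [0, 0, 0, 0]),
    2: r_square(rows, [0, 0, 1, 1]),
    3: r_square(rows, [0, 0, 1, 2]),
    4: r_square(rows, [0, 1, 2, 3]),
}
gains = marginal_gains(r2_by_k)
```

In this toy example nearly all of the gain arrives with the second cluster, which is the bend an elbow curve is meant to reveal.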

RUNNING GAME PLAN AND GAME TIME RESULTS

A k-means cluster analysis was conducted to identify underlying subgroups of National Basketball Association athletes based on their similarity across the variables listed below, which represent characteristics that could have an impact on 2016-17 regular season performance and play type. Clustering variables included quantitative variables measuring:

  • perc_pts_ft (percentage of points scored from free throws)
  • perc_pts_2pts (percentage of points scored from 2 pt field goals)
  • perc_pts_3pts (percentage of points scored from 3 pt field goals)
  • ‘3pts_made_per_m’N (3 point field goals made per minute)
  • reb_per_min (rebounds per minute)
  • asst_per_min (assists per minute)
  • stl_per_min (steals per minute)
  • blk_per_min (blocks per minute)
  • fg_att_per_m (field goals attempted per minute)
  • ft_att_per_min (free throws attempted per minute)
  • fg_made_per_m (field goals made per minute)
  • ft_made_per_m (free throws made per minute)
  • to_per_min (turnovers per minute)

All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data was randomly split into a training set that included 70% of the observations (N=341) and a test set that included 30% of the observations (N=145). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve (see figure 1 below) to provide guidance for choosing the number of clusters to interpret.
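A random 70/30 split like the one described can be sketched as below. The paper does not say how its split was drawn, so the seed, the rounding rule and the stand-in player ids are assumptions. Rounding the training share up happens to reproduce the reported 341 and 145 for 486 players.

```python
import math
import random

def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = math.ceil(len(shuffled) * train_share)  # assumption: round the training size up
    return shuffled[:cut], shuffled[cut:]

players = list(range(486))  # stand-in ids for the 341 + 145 athletes
train, test = train_test_split(players)
```

The clusters are then fitted on `train` only, with `test` held back to check that the segments hold up on players the model has not seen.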

[Figure 1: elbow curve of r-square by number of clusters]

Canonical discriminant analysis was used to reduce the clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatter-plot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in cluster 3 were the most densely packed, with relatively low within-cluster variance, and did not overlap very much with the other clusters. Cluster 1’s observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 2 had relatively low within-cluster variance, but a few observations overlapped with other clusters.

[Figure 2: scatter-plot of the first two canonical variables by cluster]

The means on the clustering variables showed that athletes in each cluster have distinctly different playing styles.

Cluster 1:

These athletes have high values for percentage of points from free throws, moderate on percentage points from 3 point field goals and low on percentage of points from 2 point field goals. These athletes attempt more field goals per minute, free throws per minute, make more 3 point field goals per minute and have the highest value for assists per minute; these athletes are focal points of a team’s offensive strategy.

Athletes in this cluster: Kevin Durant, Anthony Davis, Stephen Curry

Cluster 2:

The athletes have extremely high values for percentage of points from 2 point field goals, moderate on percentage points from free throws, and extremely low values for percentage of points from 3 point field goals. These athletes rarely make perimeter shots and have low values for assists.

Athletes in this cluster: Rudy Gobert, Hassan Whiteside, Myles Turner

Cluster 3:

The athletes have high values for percentage of points from 3 point field goals, and low values for percentage of points from 2 point field goals and free throws. These athletes stay on the perimeter (high values for 3 point field goals made) but are a secondary option at best, as shown by low field goal attempts per minute.

Athletes in this cluster: Otto Porter, Klay Thompson, Al Horford

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on total points scored per minute (ttl_pts_per_m). A Tukey test was used for post hoc comparisons between the clusters. The results indicated significant differences between the clusters on ttl_pts_per_m (F(2, 340)=86.67, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on ttl_pts_per_m, with the exception that clusters 2 and 3 were not significantly different from each other. Athletes in cluster 1 had the highest ttl_pts_per_m (mean=.541, sd=0.141), and cluster 3 had the lowest ttl_pts_per_m (mean=.341, sd=0.096).
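The F statistic reported by a one-way ANOVA compares the spread between cluster means to the spread within clusters. Here is a minimal Python sketch of that calculation; the paper used proc anova, the points-per-minute values below are invented, and the Tukey post hoc step is not covered.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per cluster."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical points per minute for three small clusters.
clusters = [
    [0.5, 0.6, 0.7],
    [0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3],
]
f_stat, df_between, df_within = one_way_anova(clusters)
```

A large F means the cluster means differ by more than the scatter inside the clusters would explain, which is what the validation step is looking for.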

CONCLUSION

Using a k-means cluster is a data-driven approach to grouping basketball player performance. This method can be used in constructing a team when the salary budget is constrained. The elephant in the room is that this is essentially human behavior; therefore the validation step using proc anova is critical. The approach I’ve applied to the NBA data is an unsupervised machine learning approach.



https://www.nba2k.com/

http://www.sesug.org/SESUG2018/index.php



Classification Tree, Harry Potter, Tree Based Models

Recipe: 003 Harry Potter: Did Voldemort Get-cha? Classification Tree

 

“It does not do well to dwell on dreams and forget to live.” – Albus Dumbledore – Harry Potter and the Sorcerer’s Stone

In this post we won’t dwell but we’ll analyze and learn.  I ask that you play along and imagine yourself receiving your acceptance letter to Hogwarts (well let’s be honest here we’ve all imagined this at one point or another).

So you’ve hopped off the Hogwarts Express, ready for your studies and the fight against the dark arts. Oh wait… nobody told you about the dark arts and all the threats looming your way? Ever wonder whether the budget only allowed for owls to deliver acceptance letters? This week we’ll dive into the greatest threat in the Harry Potter Universe, Lord Voldemort.

