Marvel Comics, Propensity Modeling, Regression Modeling

Recipe 013: Marvel Comics Propensity Score

FerraraTom

How crazy would it be if I told you Howard the Duck and Old Man Logan are closer to each other in skill sets than they are to any other Marvel characters?  Or how about Thor and Dr. Octopus are lookalikes as well?  Let’s answer these questions together by wrangling some readily available data.


 

008

 


 

001

If I’ve learned anything from my career in data science it’s this: 80% of the work is data gathering and etl work, and 20% is analysis.

Nothing holds truer to this statement than finding data of Marvel characters skills set, on a normalized scale.  In this data story I’ll be using data from Marvel Contests of Champions (power index levels, health and attack) and the Marvel Battle Royale (a twitter fan poll of greatest superheroes).

A few more variables I’ll need to calculate around the results of the Marvel Battle Royale Twitter Fan Poll:

Total votes per each round

Average Total votes

A flag for if they were higher than average total votes per marvel character

This flag I’ll use as my dependent variable and my independent variables will be the Marvel Contest of Champions statistics.

What will this do?  This will predict the likelihood a Marvel Character would receive higher than the average total votes in the Marvel Battle Royale.

Once this is calculated I’ll receive an output of coefficients which I can apply to the rest of the Marvel Characters whom weren’t in the Marvel Battle Royale to create a propensity score.


 

002

Now let’s back track a little bit and see why I’m going with a propensity model as opposed to a grouping by opinion.  I.e. Let’s put all the top attackers in the same category.

The top 3 characters based on Attack are Rocket Raccoon, Spider-man (Symbiote), and Blade.

In the above histogram, if you look all the way to the far right you’ll notice they are the data points on their own little island.


 

 

003

Well what if I just grouped everyone by Health?  This data visualization looks more promising but mostly likely there would overlap on the other attributes and you wouldn’t be able to implement this successfully.


 

004

The power index by definition could be suitable but from the top 3 selected on power index I can tell this rating wasn’t an index in the vein of what I would typically use an index for (time-series forecasting) and it looks to be similar to the Pokemon Go Combat Point System, the ability to use their full potential.


 

005

One use of a propensity score is to create similar groups, based on the likelihood of performing a behavior.

In this case Doctor Octopus and Thor (Ragnarok) statistically the same in the Marvel Contest of Champions skill set.  For those of you want to go down and interesting rabbit whole, you can find YouTube videos on why Doctor Octopus should be in a demi-god tier.

This propensity score approach literally put Doctor Octopus in the same tier as a demi-god!


 

006

Medusa by power index alone would be close to Thanos but factoring all skill sets, she is statistically closer to Gwenpool, Cable, and Nightcrawler than she is to the Mad Titan.


 

007

Now for the crazy but statistically significant section.  Howard the Duck (I’m hoping he gets a show on Disney+) and Old Man Logan are a propensity score match.

An example like this where many begin to argue in data science, when does subject material expertise come into play?  We can argue significance forever, on any topic, but we can agree on all Marvel Champions have a value if played correctly.


006

009


 

005

010


003_008

nintendo, Regression Modeling, Super Mario

Recipe 010: Mario Kart Game-play Improvement Controller Trials

FerraraTom

Before I dive into this week’s data story, let me state why I love the Nintendo Switch.  I personally feel there’s a need for video games to be a social event, and couch co-op is a must have feature.  The Nintendo Switch offers several games which meet this need.

My family loves playing video games and most of all we love playing video games together.

Most of the Nintendo games I’ve grown up on and have played over the years, Mario Kart by far is one of my favorites.  I’ll admit my wife shows me how it’s done.


001


002


003

What I do find interesting about the Nintendo Switch is the joy con controllers, there’s a learning curve (but a huge improvement on the Wii-mote) and most veteran gamers prefer an alternative.

One alternative is the wireless controller, very similar to the X-box controller format.  I did pick up the Yoshi version for my wife and she loves it and personally feels it improves her game-play.

I’d thought it was time to put this notion to the test, what impact if any does a wireless controller has on game-play performance versus using a joy con.

Mario Kart seemed like the logical choice for this is experiment, it’s a multiplayer game, you can standardize your users (via ride type and modifications), and performance is measured in a continuous variable of points.

004


005

A total of 8 trails were ran under these conditions:

-Standard Kart

-Standard Wheels

-Standard Flyer

-Mushroom Cup

-50 cc length race

-2 gamers

Half through the trial one gamer switched to the wired controller (Test group) while the other gamer stayed on the single joy con (Control group).

Results were documented, and the etl. process began, points scored each race would be used as the key performance indicator.

I next ran a linear regression (great for evaluating an A/B test), with my dependent variable being the points scored after the event (introducing the wired controller) the two independent variables: Treatment and Pre Points Scored.

006


007.png

In this model I wasn’t concerned with the r-squared value or the significance level of each variable.  The sample data was not large enough, this was closed circuit small market test.

The model itself did show to be significant, which is a good indicator I can continue with the results.  Evaluating my Q’s graph, I see the model fits well, the trend goes through all the data points.

In my summary fit I notice there is a positive relationship between treatment (group) and post points scores.  At first glance this says you improve your Mario Kart game-play performance if you play with a wireless controller.

To complete this story I want to know my upper confidence level to be able to know by how many points and is this enough to move me up the rankings.

Using a wired controller has the potential to increase a gamers point performance by over six points each race.

The average points differential between race placement is 1.2 points.  This 6-point increase is enough to move you roughly 4 places, depending on your historic placement.

008.png


009

What have we learned from diving into the Mario Kart Data?

The controller you play with matters, switching to a traditional wired controller can potentially improve your point score by 6.5 points,

which depending on your average race placement can move you up 4 places in the final standings.

Observing the CPU controlled racers, Shy Guy performed the best with an average final placement of 2.8.  The heavy class overall was the weakest group but without Bowser, it could have been worse.  Bowser’s average final placement was 4th.

 

After you have consumed this meal, I hope you take these findings and enjoy your next Mario Kart Grand Prix.  Also as always enjoy the featured pancake recipe below!


006

010


005

011


003_008

Classification Tree, Game of Thrones, Tree Based Models

Recipe: 009 Game of Thrones Survival of the Fittest

FerraraTom


“When you play the game of thrones, you win or you die.” — Cersei

Let’s bring this quote to life in what I like to call a survival tree of the fittest.  This week’s analysis will focus on the character survival in Game of thrones.  Chow down and enjoy!


001


002


003

Winter is coming and you’d like to know your chances of survival in the Game of Thrones universe.

Let’s learn from those who have survived to this point and those who have met their unkindly fate.

To do this I’ll build a classification tree with my event being set to is the character alive (1 for yes, 2 for zero).  Classification trees in general test the null hypothesis, when we reach my tree visualization I’ll assign the color red to instances of were it’s highly probable of a character death.  Green leaves will indicate it’s highly probable a character survives… as long as all this criteria is met.

Think of this tree as a really morbid family tree, but since the data is Game of thrones it fits right into place.

The variables have readily available to me (hopefully they have importance) are as follows:

  • House Affiliation
  • Member of nobility
  • Marital Status
  • Gender
  • Family history of deaths
  • Popularity

004


005

From the initial read I see knowing if a character is popular among fans and if they are male hold the highest importance in determining survival.

Also the variables I have available account for 75% of the variability (a 25% miss-classification rate).

Let’s say you moved to Westeros, out of the gate you have a 25.4% chance of meeting your end.  At those odds I’m taking my chances but I should stay under the radar as much as I can, because the data warrants it.

If you become a popular character or are an integral part of the story, your death becomes more meaningful and your probability of survival is worse than a coin flip.

So let’s say you’re a like-able character (you can’t help it), not all is loss, as long as you’re a female.  The highest survival rate is the popular female character group.  This is a classic tale of high risk high reward.

006


007

A classification tree is a great way to visual your data and now I’ll walk us through this Game of Thrones survival tree.

Let’s start at the very top, the tree assumes everyone has a 75% of survival.  Now as the tree splits this Is where the interesting part begins, and our data story begins to unfold.

If you are a popular character you flow to the left side of the tree, your survival rate of 75% now drops to 48%.

Staying to the left side of the tree there is another important split, are you a male or female?  Female characters have a higher probability of surviving (87% if you’re popular and 76% if you’re under the radar).

If you’re a male and you’re popular you have a 42% chance of survival (We’re looking at your Peter Dinklage).

Now here’s the largest caveat to take with this classification tree: I’m assuming it will no longer be relevant after the final season.  Winter is coming and most likely our characters will see their end by hands of White Walkers.


008

What have we learned from diving into the Game of Thrones Data?

Everyone has starts off at a 75% survival rate and as your popularity grows your survival rate lessons by 27%.  If you’re a male your survival drops again by 33%.  If you’re a popular female character you are 45% more likely to survive versus your male counterparts.

An interesting tidbit…If you become popular and you are a female (hopefully the mother of dragons) you boast the highest survival rate of anyone in this universe, 87%.

 

After you have consumed this meal, I hope you take these findings and enjoy your episode of Game of Thrones. J  Also as always enjoy the featured pancake recipe below!


006

https://gameofthrones.fandom.com/wiki/Game_of_Thrones_Wiki


005

009


003_008

disney, Mickey Mouse, Regression Modeling, Theme Parks

Recipe: 006 Walt Disney World Parks and Resorts Revenue Influencer

FerraraTom

It all started with a mouse.  This mouse is turning 90 this year and Mickey Mouse has made his impact on society.  To celebrate, what better meal to cook us this week than Walt Disney World Data?  I’ll be challenging myself to

identify influencers on the Parks and Resorts Division’s yearly revenue.


001


002


003

004

With Mickey Mouse turning 90 years old this year, what better meal to cook us this week than Walt Disney World Data?  I’ll be challenging myself to identify influencers on the Parks and Resorts Division’s yearly revenue.

My first approach was to identify what happens during the year the revenue occurs?

The number of Animated Movies released by Disney

The number of Animated Movies featuring Disney Princesses

The number of Attractions add at all four main theme parks and then parsing this information out by the individual park

The first run was not an effective model: most of the variability in the data was not accounted for, and there were no independent variables of significance.

So my next approach was how do I capture word of mouth on movies and attractions?  Secondly, how do I incorporate when Disney starts charging admission to children (currently 2 yrs and younger, enter the parks for free)?

To knock out two birds with one stone, I settled on let me test a rolling 3-year average of all behaviors.  The results were very favorable, 67% of the variability is explained and I have interesting independent variables of significance to make a telling data story


005

If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q graphs.

There’s some curls at the tails but most of the data fits well, so there won’t be a need to run a more complex model.

Let’s take a bite into the initial read before accessing the financial impact of all these fun Disney variables.

I’ll caveat this, significance is in the eye of the beholder, and is up to interpretation of the  storyteller and data scientist.  The first read shows the 3-year average of total park attractions having the highest relationship to revenue and inversely the amount of attractions opened at EPCOT has significance but a negative impact on yearly revenue.

I’ll dive more into the individual impacts later, but I want to utilize my upper and lower bounds.


006

The output of this model shows the impact in millions USD.  Analyzing the cone, this is where our fairy tale begins to take shape.

Potentially the average amount of attractions introduced at the all four major parks can drive in $1.6 million USD.

With the Magic Kingdom driving most of this impact:

New attractions added at the Magic Kingdom can drive in $4.5 million USD.

The average amount of the Disney Princess movies does have more of an impact than factoring Disney releasing an animated movie as the only criteria.  What’s intriguing is the variability of our upper and lower bounds, there is a possibility there could be a loss of $50.6M.

007

What could be driving the inverse affect?  Multiple reasons:

1.The quality of the movie releases

2.The presence or in this case non-presence of a meet and greet at the theme park

3.The global economic climate (Less international travel impacts this!)


008

What have learned from diving into the Walt Disney Data?

There’s a reason WDW is investing in new IP based rides at Epcot and Hollywood Studios: they’ve been launching the rides outdated with their audience and they drive the lowest impact currently on yearly revenue.  I anticipate Epcot to see a steady growth on impact when Guardians of the Galaxy and Ratatouille open and a few years have passed.

Finally a Princess Animated Movie drives in 1 million USD more than a regular animated move release.

009

What could be the reasoning?  I’d guesstimate rides introduced at the Magic Kingdom (drives in +4.5M USD) is having a downstream affect on the Princess impact.  Most Princess interactions take place at the Magic Kingdom.

After you have consumed this meal, I hope you take these findings and with Mickey Mouse a Happy 90th Birthday. J  Also as always enjoy the featured pancake recipe below!


005

010

006

https://disneyworld.disney.go.com/


003_008

E-Sports, Logistic Regression, Overwatch

Recipe: 005 Overwatch League Inaugural Season Logistic Regression

FerraraTom

I’m excited to tackle the Overwatch League and my first dig into E-sports in general.  I’ve attended several conventions, including gaming conventions, and I will get this out of the way now:

I thought I was decent at video games… these athletes have shown I’m a very causal player.  This is a good thing, it was a pleasure to witness their craft.

The focus of this week is the probability of an individual player making the playoffs.  Throw into this meal where statistics based around player preferences and game-play performance.  To determine the variables throw into the final mix I threw in some confounding factors and profiling stats before going very heavy on player performance.


001


002


003


004


005


006


006

https://overwatchleague.com/en-us/

https://playoverwatch.com/en-us/


005

 

007


003_008

K-Means Clustering, NBA2k

Recipe: 004 A Data Driven Approach During the NBA Pace and Space Era

FerraraTom

The format of this post will be slightly different from previous recipes.  Think of this as a yelp review, I’ll be going sharing the paper I presented during the SESUG 2018 SAS Conference.  This will be wordy than usual, but I will start with the recipe card per usual and then we’ll dive deep into the paper.  At the end of this post you’ll be a full belly of a new approach to building a NBA team, can be applied to one of my favorite game modes in the 2K series… Franchise mode.


001


002


SESUG Paper 234-2018 Data Driven Approach in the NBA Pace and Space Era

ABSTRACT

Whether you’re an NBA executive or Fantasy Basketball owner or a casual fan, you can’t help but begin the conversation of who is a top tier player? Currently who are the best players in the NBA? How do you compare a nuts and glue defensive player to a high volume scorer? The answer to all these questions lies within segmenting basketball performance data.

OVERVIEW

A k-means cluster is a commonly used guided machine learning approach to grouping data. I will apply this method to human performance. This case study will focus on NBA basketball individual performance data. The goal at the end of this case study will be to apply a k-means cluster to identify similar players to use in team construction.

INTRODUCTION 

My childhood was spent in Brooklyn, New York. I’m a die-hard New York Knicks fan. My formative years were spent watching my favorite team get handled by arguably the greatest basketball player of all time, Michael Jordan. Several moments throughout my life and to this day it crosses my mind, only if we had that player on our team. Over time I have come to terms with we would never have Michael Jordan or player of his caliber, but wouldn’t it be interesting if a NBA team could find complimentary parts or look-a-like players? This is why I’m writing a paper about finding these look-a-likes, these diamonds in the rough, or as the current term is “Unicorns”. Let’s begin this journey together in search for a cluster of basketball unicorns.

WATCHING THE GAME TAPE

What do high level performers have in common? In most cases you’ll find they study their sport, study their own game performance, study their opponents and study the performance of other athletes they strive to be like. The data analyst equivalent to watching game tape would be to gather as many independent and dependent variables as possible to perform an analysis. For the NBA data used in this k-means cluster analysis, I took the approach of what contributes to success in winning a game. Outscoring your opponent was a no-brainer starting point, but I’ll need to dig deeper. How many ways can and what methods can you outscore an opponent? The avid basketball fan would agree how a player scores a basket (i.e. field goal vs behind the three point line) will determine how they fit into an offensive scheme and defines their game plan. Beyond scoring there are other equally as important contributors to basketball performance. This is where I began to think of how much hustle and defensive metrics could I gather (i.e. rebounds, assists, steals, blocks, etc.). Could I normalize all of these metrics to come to get a baseline on player efficiency and more importantly effectively identify an individual player’s role in a team’s overall performance? To normalize my metrics I made the decision to produce my raw data on a per minute level, this way I wouldn’t show biases to high usage players or low usage players. To identify how a player fits into an offensive scheme and their scoring tendencies I calculated an individual level what percent of points scored comes from all methods of scoring (i.e. free throw percentage, three pointers made, two point field goals). Once I went through all of my data analyst game tape, I was ready to hold practice and cluster.

HOLDING PRACTICE

Practice makes perfect, but everything in moderation (i.e. the New York Knicks of the 1990’s overworked themselves during practice, they would lose steam in long games). Similar to I wouldn’t want to over-fit a model on sample data, I won’t get too complicated with my approach to standardizing my variables. Utilizing proc standard, I’ll standardize my clustering variables to have a mean of 0 and a standard deviation of 1. After standardizing the variables I’ll run the data analyst version of a zone defense (proc fastclus and use a macro to create max clusters from 1 through 9). I don’t anticipate to use a 9 cluster solution once running the game plan and evaluating my game time results. Ideally I want to keep my cluster size to small manageable number while still showing a striking difference between the groups. To evaluate how many cluster I’ll analyze to come to a final solution, I’ll extract the r-square values from each cluster solution and then merge them to plot an elbow curve. Using proc gplot to create my elbow curve, I’ll want to observe where the line begins to curve (creating an elbow). Finally, before we’re kicked off the court for another team’s practice, I’ll use proc anova to validate my clusters. As a validate metric I’ll use the variable “ttll_pts_per_m” this should help identify the difference between a team’s “go-to” option and a player whom is more of a complimentary piece at best.

RUNNING GAME PLAN AND GAME TIME RESULTS

A k-means cluster analysis was conducted to identify underlying subgroups of National Basketball Association athletes based on their similarity of responses on 11 variables that represent characteristics that could have an impact on 2016-17 regular season performance and play type. Clustering variables included quantitative variables measuring: perc_pts_ft (percentage of points scored from free throws) perc_pts_2pts (percentage of points scored from 2 pt field goals) perc_pts_3pts (percentage of points scored from 3 pt field goals) ‘3pts_made_per_m’N (3 point field goals made per minute) reb_per_min (rebounds per minute) asst_per_min (assists per minute) stl_per_min (steals per minute) blk_per_min (blocks per minute) fg_att_per_m (field goals attempted per minute) ft_att_per_min (free throws attempted per minute) fg_made_per_m (field goals made per minute) ft_made_per_m (free throws made per minute) to_per_min (turnovers per minute) All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data was randomly split into a training set that included 70% of the observations (N=341) and a test set that included 30% of the observations (N=145). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve (see figure 1 below) to provide guidance for choosing the number of clusters to interpret.

003

Canonical discriminant analyses was used to reduce the 11 clustering variable down a few variables that accounted for most of the variance in the clustering variables. A scatter-plot of the first two canonical variables by cluster (Figure 2 shown below) indicated that the observations in cluster 3 is the most densely packed with relatively low within cluster variance, and did not overlap very much with the other clusters. Cluster 1’s observations had greater spread suggesting higher within cluster variance. Observations in cluster 2 have relatively low cluster variance but there are a few observations with overlap.

004

The means on the clustering variables showed that, athletes in each cluster have uniquely different playing styles.

Cluster 1:

These athletes have high values for percentage of points from free throws, moderate on percentage points from 3 point field goals and low on percentage of points from 2 point field goals. These athletes attempt more field goals per minute, free throws per minute, make more 3 point field goals per minute and have the highest value for assists per minute; these athletes are focal points of a team’s offensive strategy.

Athletes in this cluster: Kevin Durant ,Anthony Davis, Stephen Curry

Cluster 2:

The athletes have extremely high values for percentage of points from 2 point field goals, moderate on percentage points from free throws, and extremely low values for percentage of points from 3 point field goals. These athletes rarely make perimeter shots and have low values for assists.

Athletes in this cluster: Rudy Gobert, Hassan Whiteside, Myles Turner

Cluster 3:

The athletes have high values for percentage of points from 3 point field goals, and low values for point 2 point field goals and free throws. These athletes stay on the perimeter (high values for 3 point field goals made) but are a secondary option at best, observed by a low field goal attempts per minute.

Athletes in this cluster: Otto Porter, Klay Thompson, Al Horford

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducting to test for significant differences between the clusters on total points scored per minute (ttl_pts_per_m). A tukey test was used for post hoc comparisons between the clusters. The results indicated significant differences between the clusters on ttl_pts_per_m (F(2, 340)=86.67, p<.0001). The tukey post hoc comparisons showed significant differences between clusters on ttl_pts_per_m, with the exception that clusters 2 and 3 were not significantly different from each other. Athletes in cluster 1 had the highest ttl_pts_per_m (mean=.541, sd=0.141), and cluster 3 had the lowest ttl_pts_per_m (mean=.341, sd=0.096).

CONCLUSION

Using a k-means cluster is a data driven approach to grouping basketball player performance. This method can be used in constructing a team when a salary budget is constricted. The elephant in the room is this essentially is human behavior, therefore the validation step using proc anova is critical. The approach I’ve applied to the NBA data is a guide machine learning approach.


005007


006

https://www.nba2k.com/

http://www.sesug.org/SESUG2018/index.php


003_008

Classification Tree, Harry Potter, Tree Based Models

Recipe: 003 Harry Potter: Did Voldemort Get-cha? Classification Tree

 

FerraraTom“It does not do well to dwell on dreams and forget to live.” – Albus Dumbledore – Harry Potter and the Sorcerer’s Stone

In this post we won’t dwell but we’ll analyze and learn.  I ask that you play along and imagine yourself receiving your acceptance letter to Hogwarts (well let’s be honest here we’ve all imagined this at one point or another).

So you’ve hopped off the Hogwarts’s Express, ready for your studies and the fight the dark arts. Oh wait… nobody told you about the dark arts and all the threats looming your way? Ever wonder was the budget only allowed for owls to deliver acceptance letters? This week we’ll dive into the greatest threat in the Harry Potter Universe, Lord Voldemort.


003_001


003_002


003_003


003_004


003_005


003_006


003_007


003_008