Recipes

Board Games, Logistic Regression, Regression Modeling

Recipe: 008 Likelihood a Board Game Is Universally Loved

FerraraTom

For this week’s analysis I’m taking a different approach to the introduction.  I reached out to @missionboardgame to write the foreword.  They are a couple from Turkey who try their best to inspire people to join the board game community.  Without further ado, here is their overview of the modern board gaming climate:

We think a successful modern board game should include the following features:

✔️Your decisions should have an impact on the game progress.
✔️Minimal randomness.
✔️As little player elimination as possible throughout the game.

In addition to those, theme, artwork and mechanics are also significant for our decisions while purchasing board games. Therefore, our favorite game is Robinson Crusoe: Adventures on the Cursed Island. It is a cooperative survival game where you are trapped on a deserted island. Each decision you have made previously has an outcome afterwards. The harmony between the theme and the rules is perfectly arranged so that you feel very integrated into the game. In this way, every action you take seems meaningful and logical. Also we love feeling the cooperation between us since we are usually 2 players. – Mission Board Game


Countless nights I’ve played board games with friends and family.  Every New Year’s Eve my family and I play Monopoly, for a few reasons: the game-play length, the number of players it supports, and the simplified game-play.  I have 5 siblings, so saying it’s difficult to find a game for all of us to play is an understatement.

Why we enjoy board games is an interesting topic.  Is it the theme of the game?  Is it the number of players required?  Has the game received universal praise from critics and fans alike?  Is it a common game most households own, and we grew up playing?

I’ll throw all the above-mentioned variables into a logistic regression model and use Bayes’ theorem to estimate the probability that a board game player will rank a game higher than the average score.
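For readers who want to try something similar at home, here is a minimal sketch of this kind of model in Python. The data and feature names (fantasy theme, minimum players, play time, critic score) are made-up stand-ins for the real variables, and scikit-learn stands in for the original tooling:

```python
# A hedged sketch of a logistic model for "rated above the site average".
# All data here is synthetic; feature names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: fantasy theme flag, min players,
# play time in hours, and a critic-score proxy.
X = np.column_stack([
    rng.integers(0, 2, n),          # fantasy_theme
    rng.integers(2, 7, n),          # min_players
    rng.uniform(0.5, 4.0, n),       # play_time_hours
    rng.normal(7.0, 1.0, n),        # critic_score
])

# Synthetic target wired to play time, critic score, and theme.
logits = -6 + 0.8 * X[:, 2] + 0.7 * X[:, 3] - 0.5 * X[:, 0]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]   # P(rated above average) per game
print(f"training accuracy: {model.score(X, y):.2f}")
```

The fitted coefficients play the same role as the positive and negative relationships discussed below: a negative coefficient on the fantasy flag, for example, would pull a game's predicted probability down.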

During the first read I see the model is statistically significant, with p-values below .05.   A few things stand out to me immediately:

1.) Not all variables have a positive relationship to a highly scored board game.

2.) There are some strong social elements going on here (i.e. the longer the play time, the higher the impact, which may imply games that encourage discussion are rated higher).

3.) Fantasy-themed board games are not ranked highly (I have a D&D and video games impact theory).


Before jumping into the positive relationships, I’d like to touch briefly on the negative relationship independent variables.

1.) Fantasy Theme: I included this variable in the model expecting to see a very high positive correlation, but I was very wrong.  To quote Rick and Morty: “Sometimes science is more art than science.”  In the spirit of the quote, I’ll assume there are threats to the fantasy-themed board game genre in the form of role-playing video games.  The storytelling in that medium has progressed so much in the last decade that it outpaces anything a board game could offer.

In other words the target audience is leaving.

2.) Majority of voters:  This variable is all about the number of users who share their ranking.  A rule of thumb for rankings, reviews, and ratings is that those who go through the effort of expressing their opinion either love or hate the product.  The upper and lower confidence limits mirror each other because of this skewness.


Next, I’ll discuss the positive relationship independent variables (focusing on those with the highest impact):

1.) Board games with an average game-play of two hours or more have the highest positive impact on a user rating a board game above average.  What makes a game have long game-play?

Multiple reasons: more players involved, more game-play mechanics, and most importantly more discussion.  The soul of any good board game is bringing people together.

2.) The second highest impact comes from the average score displayed on BoardGameGeek.  The reason behind this is users see this rating before submitting their own.  Think of it as the Rotten Tomatoes effect: people want to feel like they have universally accepted opinions.  Take the beginning of this data story, for example: I mentioned Monopoly is a family tradition of mine, which potentially could have swayed your opinion of the game.  Based on this model output, you could plausibly rate it higher than, say, a fantasy-themed game.

For your own reference, this model has an accuracy rate above 70%.


What have we learned from diving into the Board Game Data? 

Board games are most successful when they encourage the spirit and soul of “game night”: a gathering of friends and family discussing and enjoying each other’s time.  Adventure and exploration themes make up the majority of the top ten most successful board game genres.  Longer game-play does not mean the game is like pulling teeth or the pace is slow.

It is more an indicator of the number of players required and the storytelling the game uses to drive a great game night experience.

 

After you have consumed this meal, I hope you take these findings and enjoy your next game night.  Also as always enjoy the featured pancake recipe below!



https://boardgamegeek.com/



Cosplay

Recipe: 007 Comic Con Cosplay and the Drivers of Instagram Engagement

FerraraTom


Halloween has recently passed, and it’s a good transition into this week’s analysis.

Let’s face it: dressing up on Halloween is the first step to cosplaying at your local comic con.

Cosplay can be a lucrative business if done correctly, and many people do it.  As you read through this week’s analysis, I urge you to respect and treat cosplayers as you would any other professional.  It takes a lot of hard work and dedication to master a craft as they have.



A staple at any comic con is the Cosplay culture.  Fans show their appreciation and passion for beloved characters.  Cosplay can also be a lucrative business if you have a strong work ethic, are consistent, and are dedicated to your craft.

Get out the hot glue gun and let’s start forming the foam!

I’ve gathered a random selection of Cosplay data from Instagram.  The cosplayers ranged from over 3 million followers to below 2K.  This alone posed an interesting challenge: how do I normalize and standardize my data to fit into a model?

My solution was to factor in key performance indicators of Instagram success (regardless of the cosplay niche) and implement an engagement score for each cosplayer (like a customer value score).

To prevent confounding (variables with a direct correlation to each other), I elected to exclude from the model everything that went into the engagement score.
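The post doesn’t spell out the exact engagement formula, so here is one common definition, sketched as an assumption: average interactions per post, normalized by follower count so that a 2K-follower account and a 3-million-follower account land on a comparable scale.

```python
# A hypothetical engagement score along the lines described.
# The exact formula used in the analysis is not given; this is
# one common definition, not the author's actual metric.
def engagement_score(likes, comments, followers):
    """Mean (likes + comments) per post, as a percentage of followers."""
    per_post = [(l + c) / followers for l, c in zip(likes, comments)]
    return 100 * sum(per_post) / len(per_post)

# A small account with strong engagement can out-score a huge one.
small = engagement_score(likes=[180, 220], comments=[20, 30], followers=2_000)
big = engagement_score(likes=[40_000, 50_000], comments=[1_000, 1_500],
                       followers=3_000_000)
print(round(small, 2), round(big, 2))  # 11.25 1.54
```

Normalizing by followers is what lets wildly different account sizes sit in the same model without the biggest accounts dominating every fit.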


My initial read shows this model is very predictive of the data sample gathered from Instagram, and the most significant driver is whether the cosplayer’s images are exposed (think NSFW, but tasteful).  The impact of hashtag count was skewed by a correlation where accounts with more followers use few to no hashtags.



If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q-Q plots.

There’s a large curl at both tails, but most of the data fits well, so there won’t be a need to run a more complex model.
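For anyone wanting a numeric stand-in for eyeballing these quantile plots, SciPy’s `probplot` returns the ordered sample against theoretical normal quantiles, plus the correlation of the best-fit line; an r near 1 means the points hug the diagonal, while heavy curls at the tails pull it down. This is a generic sketch on synthetic residuals, not the blog’s actual data:

```python
# Numeric Q-Q check: how well do residuals track a normal distribution?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, 300)   # stand-in model residuals

(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q fit r = {r:.3f}")       # close to 1 for normal residuals
```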


What could be causing these extreme values towards the end of each tail?

While gathering and visualizing my data, I observed an interesting behavior:

The number of hashtags varies widely and has almost no correlation with engagement.

Two factors drive this skewness:

Newer cosplay accounts use fewer hashtags at the beginning.

Well-established cosplay accounts use little to zero hashtags in their most recent posts.



Our data story isn’t complete; once we take the exposed variable to the profiling stage and begin to extrapolate the engagement impact, a telling data story begins to form.

For example, this table reads as:

DC Comics-themed cosplayers who also happen to be exposed potentially drive nearly 700 more likes than fully clothed cosplay images.

As for what has the highest impact?  We can chalk Nintendo up as the champion, with most of it coming from the Bowsette trend, potentially driving in a whopping +61K likes.

Interestingly enough, the runner-up from a potential engagement impact standpoint is Scooby-Doo (mostly Velma), and the gap is less than 10K likes.

Does being exposed boost all themes of cosplay?  There is one theme in this sample with a negative relationship: Anime.  The possible reason behind this relationship is the niche fan base and the attention to detail Anime fans have, i.e. it’s hard to go as Sailor Moon without the bow.


 


What have we learned from diving into the Cosplay Data?

Being a top cosplayer on Instagram is as delicate as any social media fame.  Every post, every composition, every hashtag, every theme… can make or break your brand.  Not all cosplay needs a level of exposure to be successful, but exposure is a huge driver of engagement.

A few uses of this analysis: if you’re going to theme as Scooby-Doo, lean towards Velma; there’s enough out there for comparison.

If you’re looking for a large impact and are a fan of video games, take a dive at Bowsette (a potential +61K likes).

Finally, more hashtags do not mean more likes.

There’s more value in posting a cosplay of a character you are passionate about, with relevant hashtags, for more organic likes.

After you have consumed this meal, I hope you take these findings and improve your cosplay engagement.  Also as always enjoy the featured pancake recipe below!



 




https://www.inquisitr.com/5035455/the-5-sexiest-female-cosplayers-to-follow-on-instagram/


 

 



 

disney, Mickey Mouse, Regression Modeling, Theme Parks

Recipe: 006 Walt Disney World Parks and Resorts Revenue Influencer

FerraraTom

It all started with a mouse.  That mouse turns 90 this year, and Mickey Mouse has made his mark on society.  To celebrate, what better meal to cook up this week than Walt Disney World data?  I’ll be challenging myself to identify influencers on the Parks and Resorts Division’s yearly revenue.



With Mickey Mouse turning 90 years old this year, what better meal to cook up this week than Walt Disney World data?  I’ll be challenging myself to identify influencers on the Parks and Resorts Division’s yearly revenue.

My first approach was to identify what happens during the year the revenue occurs:

The number of animated movies released by Disney

The number of animated movies featuring Disney Princesses

The number of attractions added at all four main theme parks, then parsed out by individual park

The first run was not an effective model: most of the variability in the data was not accounted for, and there were no independent variables of significance.

So my next question was: how do I capture word of mouth on movies and attractions?  Secondly, how do I incorporate the lag before Disney starts charging admission for children (currently, those 2 and younger enter the parks for free)?

To kill two birds with one stone, I settled on testing a rolling 3-year average of all behaviors.  The results were very favorable: 67% of the variability is explained, and I have interesting independent variables of significance to make a telling data story.
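A rolling 3-year average is a one-liner in pandas; the yearly counts below are made up for illustration, not the actual Disney data:

```python
# Rolling 3-year average of a yearly count, as a sketch.
# The values and years are hypothetical stand-ins.
import pandas as pd

yearly = pd.Series(
    [2, 1, 3, 0, 2, 4],
    index=range(2012, 2018),
    name="attractions_opened",
)
rolling3 = yearly.rolling(window=3).mean()
print(rolling3)
# Each year from 2014 on is the mean of that year and the two prior;
# the first two years are NaN because a full 3-year window isn't available.
```

The smoothing is what lets a single year’s movie release keep influencing revenue for the next couple of years, which is exactly the word-of-mouth effect described above.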



If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q-Q plots.

There are some curls at the tails, but most of the data fits well, so there won’t be a need to run a more complex model.

Let’s take a bite into the initial read before assessing the financial impact of all these fun Disney variables.

I’ll caveat this: significance is in the eye of the beholder and is up to the interpretation of the storyteller and data scientist.  The first read shows the 3-year average of total park attractions has the strongest relationship to revenue, and, inversely, the number of attractions opened at EPCOT is significant but has a negative impact on yearly revenue.

I’ll dive more into the individual impacts later, but first I want to utilize my upper and lower bounds.



The output of this model shows the impact in millions USD.  Analyzing the cone of upper and lower bounds, this is where our fairy tale begins to take shape.

Potentially, the average number of attractions introduced at all four major parks can drive in $1.6 million USD.

With the Magic Kingdom driving most of this impact:

New attractions added at the Magic Kingdom can drive in $4.5 million USD.

The average number of Disney Princess movies does have more of an impact than counting any Disney animated release as the only criterion.  What’s intriguing is the variability of our upper and lower bounds: there is a possibility of a loss of $50.6M.


What could be driving the inverse effect?  Multiple reasons:

1.) The quality of the movie releases

2.) The presence, or in this case non-presence, of a meet and greet at the theme park

3.) The global economic climate (less international travel impacts this!)



What have we learned from diving into the Walt Disney Data?

There’s a reason WDW is investing in new IP-based rides at Epcot and Hollywood Studios: the rides they’ve been launching are outdated for their audience and currently drive the lowest impact on yearly revenue.  I anticipate Epcot will see steady growth in impact once Guardians of the Galaxy and Ratatouille open and a few years have passed.

Finally, a Princess animated movie drives in $1 million USD more than a regular animated movie release.


What could be the reasoning?  I’d guesstimate that rides introduced at the Magic Kingdom (driving in +$4.5M USD) are having a downstream effect on the Princess impact.  Most Princess interactions take place at the Magic Kingdom.

After you have consumed this meal, I hope you take these findings and wish Mickey Mouse a Happy 90th Birthday!  Also as always enjoy the featured pancake recipe below!



https://disneyworld.disney.go.com/



E-Sports, Logistic Regression, Overwatch

Recipe: 005 Overwatch League Inaugural Season Logistic Regression

FerraraTom

I’m excited to tackle the Overwatch League and my first dig into E-sports in general.  I’ve attended several conventions, including gaming conventions, and I will get this out of the way now:

I thought I was decent at video games… these athletes have shown me I’m a very casual player.  This is a good thing; it was a pleasure to witness their craft.

The focus this week is the probability of an individual player making the playoffs.  Thrown into this meal were statistics based on player preferences and game-play performance.  To determine the variables for the final mix, I tested some confounding factors and profiling stats before going heavy on player performance.



https://overwatchleague.com/en-us/

https://playoverwatch.com/en-us/



K-Means Clustering, NBA2k

Recipe: 004 A Data Driven Approach During the NBA Pace and Space Era

FerraraTom

The format of this post will be slightly different from previous recipes.  Think of this as a Yelp review: I’ll be sharing the paper I presented during the SESUG 2018 SAS Conference.  This will be wordier than usual, but I will start with the recipe card per usual and then we’ll dive deep into the paper.  At the end of this post you’ll have a full belly of a new approach to building an NBA team, which can be applied to one of my favorite game modes in the 2K series… Franchise mode.




SESUG Paper 234-2018 Data Driven Approach in the NBA Pace and Space Era

ABSTRACT

Whether you’re an NBA executive, a Fantasy Basketball owner, or a casual fan, you can’t help but begin the conversation of who is a top tier player. Who are currently the best players in the NBA? How do you compare a nuts-and-glue defensive player to a high volume scorer? The answer to all these questions lies within segmenting basketball performance data.

OVERVIEW

A k-means cluster is a commonly used unsupervised machine learning approach to grouping data. I will apply this method to human performance. This case study will focus on NBA basketball individual performance data. The goal at the end of this case study will be to apply a k-means cluster to identify similar players to use in team construction.

INTRODUCTION 

My childhood was spent in Brooklyn, New York. I’m a die-hard New York Knicks fan. My formative years were spent watching my favorite team get handled by arguably the greatest basketball player of all time, Michael Jordan. At several moments throughout my life, and to this day, it crosses my mind: if only we had that player on our team. Over time I have come to terms with the fact that we would never have Michael Jordan or a player of his caliber, but wouldn’t it be interesting if an NBA team could find complementary parts or look-a-like players? This is why I’m writing a paper about finding these look-a-likes, these diamonds in the rough, or as the current term goes, “Unicorns”. Let’s begin this journey together in search of a cluster of basketball unicorns.

WATCHING THE GAME TAPE

What do high level performers have in common? In most cases you’ll find they study their sport, study their own game performance, study their opponents, and study the performance of other athletes they strive to be like. The data analyst equivalent of watching game tape would be to gather as many independent and dependent variables as possible to perform an analysis. For the NBA data used in this k-means cluster analysis, I took the approach of asking what contributes to success in winning a game. Outscoring your opponent was a no-brainer starting point, but I needed to dig deeper. In how many ways, and by what methods, can you outscore an opponent? The avid basketball fan would agree that how a player scores a basket (i.e. two point field goal vs. behind the three point line) will determine how they fit into an offensive scheme and defines their game plan. Beyond scoring there are other equally important contributors to basketball performance. This is where I began to think about how many hustle and defensive metrics I could gather (i.e. rebounds, assists, steals, blocks, etc.). Could I normalize all of these metrics to get a baseline on player efficiency and, more importantly, effectively identify an individual player’s role in a team’s overall performance? To normalize my metrics I made the decision to produce my raw data on a per minute level; this way I wouldn’t show bias toward high usage or low usage players. To identify how a player fits into an offensive scheme and their scoring tendencies, I calculated at an individual level what percent of points scored comes from each method of scoring (i.e. free throws, three pointers made, two point field goals). Once I went through all of my data analyst game tape, I was ready to hold practice and cluster.

HOLDING PRACTICE

Practice makes perfect, but everything in moderation (i.e. the New York Knicks of the 1990’s overworked themselves during practice and would lose steam in long games). Just as I wouldn’t want to over-fit a model on sample data, I won’t get too complicated with my approach to standardizing my variables. Utilizing proc standard, I’ll standardize my clustering variables to have a mean of 0 and a standard deviation of 1. After standardizing the variables I’ll run the data analyst version of a zone defense (proc fastclus, with a macro to create cluster solutions from 1 through 9). I don’t anticipate using a 9-cluster solution once I run the game plan and evaluate my game time results. Ideally I want to keep the number of clusters to a small, manageable number while still showing a striking difference between the groups. To evaluate how many clusters to analyze before coming to a final solution, I’ll extract the r-square values from each cluster solution and then merge them to plot an elbow curve. Using proc gplot to create my elbow curve, I’ll want to observe where the line begins to curve (creating an elbow). Finally, before we’re kicked off the court for another team’s practice, I’ll use proc anova to validate my clusters. As a validation metric I’ll use the variable “ttl_pts_per_m”; this should help identify the difference between a team’s “go-to” option and a player who is more of a complementary piece at best.

RUNNING GAME PLAN AND GAME TIME RESULTS

A k-means cluster analysis was conducted to identify underlying subgroups of National Basketball Association athletes based on their similarity of responses on 11 variables that represent characteristics that could have an impact on 2016-17 regular season performance and play type. Clustering variables included quantitative variables measuring: perc_pts_ft (percentage of points scored from free throws); perc_pts_2pts (percentage of points scored from 2 pt field goals); perc_pts_3pts (percentage of points scored from 3 pt field goals); ‘3pts_made_per_m’N (3 point field goals made per minute); reb_per_min (rebounds per minute); asst_per_min (assists per minute); stl_per_min (steals per minute); blk_per_min (blocks per minute); fg_att_per_m (field goals attempted per minute); ft_att_per_min (free throws attempted per minute); fg_made_per_m (field goals made per minute); ft_made_per_m (free throws made per minute); and to_per_min (turnovers per minute). All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data was randomly split into a training set that included 70% of the observations (N=341) and a test set that included 30% of the observations (N=145). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve (see Figure 1 below) to provide guidance for choosing the number of clusters to interpret.

Figure 1: Elbow curve of r-square values by number of clusters

Canonical discriminant analysis was used to reduce the 11 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatter-plot of the first two canonical variables by cluster (Figure 2, shown below) indicated that the observations in cluster 3 were the most densely packed, with relatively low within-cluster variance, and did not overlap very much with the other clusters. Cluster 1’s observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 2 have relatively low within-cluster variance, but there are a few overlapping observations.

Figure 2: Scatter-plot of the first two canonical variables by cluster

The means on the clustering variables showed that athletes in each cluster have uniquely different playing styles.

Cluster 1:

These athletes have high values for percentage of points from free throws, moderate on percentage points from 3 point field goals and low on percentage of points from 2 point field goals. These athletes attempt more field goals per minute, free throws per minute, make more 3 point field goals per minute and have the highest value for assists per minute; these athletes are focal points of a team’s offensive strategy.

Athletes in this cluster: Kevin Durant, Anthony Davis, Stephen Curry

Cluster 2:

The athletes have extremely high values for percentage of points from 2 point field goals, moderate on percentage points from free throws, and extremely low values for percentage of points from 3 point field goals. These athletes rarely make perimeter shots and have low values for assists.

Athletes in this cluster: Rudy Gobert, Hassan Whiteside, Myles Turner

Cluster 3:

The athletes have high values for percentage of points from 3 point field goals, and low values for 2 point field goals and free throws. These athletes stay on the perimeter (high values for 3 point field goals made) but are a secondary option at best, as observed by low field goal attempts per minute.

Athletes in this cluster: Otto Porter, Klay Thompson, Al Horford

In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on total points scored per minute (ttl_pts_per_m). A Tukey test was used for post hoc comparisons between the clusters. The results indicated significant differences between the clusters on ttl_pts_per_m (F(2, 340)=86.67, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on ttl_pts_per_m, with the exception that clusters 2 and 3 were not significantly different from each other. Athletes in cluster 1 had the highest ttl_pts_per_m (mean=.541, sd=0.141), and cluster 3 had the lowest (mean=.341, sd=0.096).
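The validation step above was run with proc anova; a rough Python equivalent uses SciPy’s one-way ANOVA. The cluster means below loosely echo the reported ttl_pts_per_m means, but the samples themselves are synthetic, so the F statistic will not match the paper’s:

```python
# One-way ANOVA across clusters on a scoring metric (sketch).
# Means loosely echo the paper's reported values; samples are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cluster1 = rng.normal(0.541, 0.141, 120)   # go-to scorers
cluster2 = rng.normal(0.420, 0.100, 110)
cluster3 = rng.normal(0.341, 0.096, 111)

f_stat, p_value = stats.f_oneway(cluster1, cluster2, cluster3)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")  # tiny p => clusters differ
```

For the Tukey post hoc comparisons, statsmodels’ `pairwise_tukeyhsd` plays the role the paper assigns to SAS’s Tukey option.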

CONCLUSION

Using a k-means cluster is a data driven approach to grouping basketball player performance. This method can be used in constructing a team when a salary budget is constricted. The elephant in the room is that this is essentially human behavior, therefore the validation step using proc anova is critical. The approach I’ve applied to the NBA data is an unsupervised machine learning approach.



https://www.nba2k.com/

http://www.sesug.org/SESUG2018/index.php



Classification Tree, Harry Potter, Tree Based Models

Recipe: 003 Harry Potter: Did Voldemort Get-cha? Classification Tree

 

FerraraTom

“It does not do to dwell on dreams and forget to live.” – Albus Dumbledore, Harry Potter and the Sorcerer’s Stone

In this post we won’t dwell, but we’ll analyze and learn.  I ask that you play along and imagine yourself receiving your acceptance letter to Hogwarts (well, let’s be honest, we’ve all imagined this at one point or another).

So you’ve hopped off the Hogwarts Express, ready for your studies and to fight the dark arts. Oh wait… nobody told you about the dark arts and all the threats looming your way? Ever wonder why the budget only allowed for owls to deliver acceptance letters? This week we’ll dive into the greatest threat in the Harry Potter universe, Lord Voldemort.



Regression Modeling

Recipe: 002 Marvel Cinematic Universe Regression Model

 


FerraraTom

There is no argument against the Marvel Cinematic Universe being a financial success.  I’ll try to identify variables which can equate to box office success. The goal is to fit a regression model to Box Office USD for Marvel Cinematic Universe movie releases.

*At the time of cooking, Ant-Man and the Wasp did not have finalized Box Office USD data (this movie was excluded). – TF




Thanks for stopping by and chowing down on this Recipe (click the link for a reader-friendly PDF version of this recipe).

Now try this delicious pancake recipe (with the Iron Man gold and red finish) courtesy of Crème De La Crumb (link below):
