Marvel Comics, Propensity Modeling, Regression Modeling

TBCC 2019 Avengers, Algorithms, and Analytics: Panel Recap

012


002


This Panel was held on:

Friday, August 2, 2019 at 9 PM – 10 PM

During the Tampa Bay Comic Convention 2019, held at the Tampa Convention Center.

The Panelists were:

Tom Ferrara (@pancake_analytics), Kalyn Hundley (@kehundley08), Andy Polak (@polak_andy)

013

 

I want to take a quick moment to discuss the panelists.  I love bringing as many different points of view as possible to these data science panels.  Without that variety, it’s more of a lecture and less of a discussion.  This mix of panelists gave the audience the data science view, the tech industry view and the biological sciences view.  The best part is that the Avengers brought us all together.


003

When I pitched this panel, the idea was: what happens when a data scientist gets hold of the Infinity Gauntlet?  Pictured above is a visual representation of how I’m going to use each stone.

Use the Time Stone to predict the box office sales for the MCU and determine the top influencers for success.

Use the Power Stone to eliminate low-hanging fruit.

Use the Soul Stone to uncover the underlying attributes of the Marvel Universe.

Use the Space Stone to transport the Marvel Universe to their closest match.

Use the Reality Stone to show you the Marvel Universe in a new light, perfectly balanced.

Use the Mind Stone to convince you this matching worked.


Time and Power Stones: What is influencing the MCU box office success?

004.png

I walked those in attendance through the output of a regression model I built to unlock the key influences on Marvel Cinematic Universe box office sales.

Considered influencers:

  • Rotten Tomatoes Scores (Critic and Audience)
  • Movie Release
  • Time since last MCU release
  • Solo Movie Releases
  • Was Iron Man in the movie?

Two Key Influencers stand out:

Having Iron Man in an MCU movie drives in $100.5MM.

Being further along in the series drives in at least $216.8MM.  Story development matters, and here’s the statistical proof!
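For anyone who wants to see how a yes/no influencer like "Was Iron Man in the movie?" ends up with a dollar value, here is a minimal sketch of an ordinary least squares fit. It is not the model from the panel: the rows are made-up toy numbers, and the helper names `solve` and `fit_ols` are my own for illustration.

```python
# Hypothetical illustration: ordinary least squares with a yes/no indicator.
# The rows below are made-up numbers, not real MCU box office figures.

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_ols(rows, y):
    """Return [intercept, coef_1, ...] minimizing squared error."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Columns: release order in the series, Iron Man appears (1) or not (0).
features = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1), (7, 0), (8, 1)]
# Toy target built as 100 + 20 * order + 50 * iron_man, in $MM.
box_office = [100 + 20 * order + 50 * iron for order, iron in features]

intercept, per_release, iron_man_lift = fit_ols(features, box_office)
```

The coefficient on the indicator column (`iron_man_lift`) is read the same way as the Iron Man figure above: the expected change in box office when the flag flips from 0 to 1, holding the other inputs fixed.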


Soul and Space Stones: Refitting the Marvel Power Scale

005

During this panel I walked the crowd through the output of a second machine learning algorithm, a propensity score.

Ingredients in the batter:

  • Marvel Contest of Champions (MCC) Power Index Levels
  • MCC Health
  • MCC Attack
  • Marvel Battle Royale (MBR) Twitter Poll: total votes per round and average total votes

Flipping the pancakes:

Predict the likelihood Twitter would vote for a character

Repurpose this score by applying it to characters not in the MBR Twitter Poll


Reality and Mind Stones: Perfectly Balancing the Marvel Universe

006

This approach goes beyond ranking by attack or defense.  It takes all those attributes together, along with fan opinion.

If you only look at attack… you get skewed results

If you only look at defense… you get skewed results

A little bit of good… a little bit of crazy…

Old Man Howard the Duck?

Doctor Octopus the Demi-God?


Marvel Rapid Fire: Marvel Analytics Comparisons

007.png

This was one of my all-time favorite segments out of all the comic cons I’ve had the pleasure of paneling at.  I would quickly show the audience an analytics technique and then its Marvel equivalent.  I think this technique is very effective in reinforcing our learning and opening up data science to a new audience.

Everything we just went through was a machine learning technique

Machine Learning is the Taskmaster of Data Science

Learns from past data, trains, and attempts to apply this training to new data

When something new is introduced it takes time to catch up


A/B Testing and Incremental ROI is the plot of Civil War

008


A neural network is Ultron… learns from observational data & figures out its own solution

009


Dr. Strange ran a logistic regression to find out the odds on Titan

010


Into the Spider-Verse was the perfect implementation of a random forest

011


Game Time: Marvel Team-Up: Overview

012


One of the best ways to reinforce learning is through a game.  During this panel I wanted to reinforce the learning from the propensity score.

I asked for 5 volunteers.  On the screen were 3 Marvel characters, 2 of which were lookalikes (statistically speaking).  Volunteers did their best to convince the panel which two characters should “Team-Up”, or in other words, identify the 2 statistically closest characters.

For participating, all volunteers received a HeroClix figure of their choice.


I want to personally thank everyone who attended the panel in Tampa, at the Tampa Bay Comic Convention.  I look forward to meeting again in 2020.


003_008

Marvel Comics, Propensity Modeling, Regression Modeling

Recipe 013: Marvel Comics Propensity Score

logo

How crazy would it be if I told you Howard the Duck and Old Man Logan are closer to each other in skill sets than they are to any other Marvel characters?  Or that Thor and Dr. Octopus are lookalikes as well?  Let’s answer these questions together by wrangling some readily available data.


 

008

 


 

001

If I’ve learned anything from my career in data science it’s this: 80% of the work is data gathering and ETL work, and 20% is analysis.

Nothing proves this statement truer than finding data on Marvel characters’ skill sets on a normalized scale.  In this data story I’ll be using data from Marvel Contest of Champions (power index levels, health and attack) and the Marvel Battle Royale (a Twitter fan poll of the greatest superheroes).

A few more variables I’ll need to calculate from the results of the Marvel Battle Royale Twitter Fan Poll:

Total votes per round

Average total votes

A flag for whether a character received higher than the average total votes per Marvel character

This flag I’ll use as my dependent variable and my independent variables will be the Marvel Contest of Champions statistics.

What will this do?  This will predict the likelihood a Marvel Character would receive higher than the average total votes in the Marvel Battle Royale.

Once this is calculated, I’ll have a set of coefficients which I can apply to the rest of the Marvel characters who weren’t in the Marvel Battle Royale to create a propensity score.
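The steps above can be sketched in a few lines: build the above-average flag, fit a logistic regression on the polled characters, then apply the coefficients to characters who were never in the poll. This is only an illustration under my own assumptions: the character names, scaled stats and vote totals are hypothetical, and the fit uses plain gradient descent rather than whatever statistical package you prefer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=5000):
    """Fit weights [bias, w1, ...] by batch gradient descent on log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            x = [1.0] + list(row)
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - label
            for i, v in enumerate(x):
                grads[i] += error * v
        weights = [w - lr * g / n for w, g in zip(weights, grads)]
    return weights

def propensity(weights, row):
    """Apply fitted coefficients to one character's stats."""
    x = [1.0] + list(row)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

# Hypothetical polled characters: (power index, health, attack) scaled to 0-1,
# plus total poll votes. None of these numbers come from the real data.
polled = {
    "Hero A": ((0.9, 0.8, 0.7), 1200),
    "Hero B": ((0.8, 0.9, 0.6), 950),
    "Hero C": ((0.3, 0.2, 0.4), 300),
    "Hero D": ((0.2, 0.4, 0.3), 150),
    "Hero E": ((0.7, 0.6, 0.9), 800),
    "Hero F": ((0.4, 0.3, 0.2), 400),
}

# Dependent variable: did the character beat the average total votes?
average_votes = sum(votes for _, votes in polled.values()) / len(polled)
flags = [1 if votes > average_votes else 0 for _, votes in polled.values()]
stats = [row for row, _ in polled.values()]

weights = fit_logistic(stats, flags)

# Characters who were never in the poll still get a score from the coefficients.
not_polled = {"Hero X": (0.85, 0.75, 0.8), "Hero Y": (0.25, 0.3, 0.3)}
scores = {name: propensity(weights, row) for name, row in not_polled.items()}
```

Scaling the stats to a common 0-1 range before fitting is a design choice in this sketch; it keeps one large-valued attribute from dominating the gradient steps.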


 

002

Now let’s backtrack a little bit and see why I’m going with a propensity model as opposed to a grouping by opinion, e.g. putting all the top attackers in the same category.

The top 3 characters based on Attack are Rocket Raccoon, Spider-Man (Symbiote), and Blade.

In the above histogram, if you look all the way to the far right you’ll notice they are the data points on their own little island.


 

 

003

Well, what if I just grouped everyone by Health?  This data visualization looks more promising, but most likely there would be overlap on the other attributes and you wouldn’t be able to implement this successfully.


 

004

The power index by definition could be suitable, but from the top 3 selected on power index I can tell this rating isn’t an index in the vein of what I would typically use an index for (time-series forecasting).  It looks to be similar to the Pokémon GO Combat Power (CP) system: a measure of a character’s ability to use their full potential.


 

005

One use of a propensity score is to create similar groups, based on the likelihood of performing a behavior.
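As a rough sketch of what creating similar groups can look like once every character has a score, the snippet below pairs each character with whoever sits closest on the propensity scale. The names and scores are placeholders I made up, and nearest-score matching is just one simple way to do it.

```python
def closest_match(name, scores):
    """Return the other character whose propensity score is nearest."""
    others = (n for n in scores if n != name)
    return min(others, key=lambda n: abs(scores[n] - scores[name]))

# Hypothetical propensity scores, for illustration only.
scores = {
    "Character A": 0.91,
    "Character B": 0.35,
    "Character C": 0.89,
    "Character D": 0.40,
}

matches = {name: closest_match(name, scores) for name in scores}
```

Two characters can look nothing alike on attack or health alone and still land next to each other here, because the score folds every input into a single likelihood.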

In this case Doctor Octopus and Thor (Ragnarok) are statistically the same in the Marvel Contest of Champions skill set.  For those of you who want to go down an interesting rabbit hole, you can find YouTube videos on why Doctor Octopus should be in a demi-god tier.

This propensity score approach literally put Doctor Octopus in the same tier as a demi-god!


 

006

Medusa by power index alone would be close to Thanos, but factoring in all skill sets, she is statistically closer to Gwenpool, Cable, and Nightcrawler than she is to the Mad Titan.


 

007

Now for the crazy but statistically significant section.  Howard the Duck (I’m hoping he gets a show on Disney+) and Old Man Logan are a propensity score match.

An example like this is where many in data science begin to argue: when does subject matter expertise come into play?  We can argue significance forever, on any topic, but we can agree that all Marvel Champions have value if played correctly.


006

009


 

005

010



DC Comics, K-Means Clustering, Logistic Regression, Propensity Modeling

Recipe 011: DC Super Hero Throw Down: Propensity Modeling


I want you to remember, Clark…In all the years to come… in your most private moments… I want you to remember my hand at your throat… I want you to remember the one man who beat you.

Chilling quote, isn’t it?  That was said by Batman to Superman in The Dark Knight Returns, a comic book miniseries written and drawn by Frank Miller.

One of the greatest debates in comic book lore, and a fun discussion to have, is pitting two superheroes against each other… Who wins and why?  The data story below will introduce a data science approach to answering this debate.  To have fun with it… I’ve thrown characters from the video game Injustice 2 into a Superhero Throw Down Tournament.


012_pic

 

 


010_pic

Before we dive into the tournament and the results of the throw down, I’d like to touch on the approach: Propensity modeling.

Propensity modeling has been around since 1983 and is a statistical approach to measuring uplift (think return on investment).  The goal is to measure the uplift between similar, or matched, groups.

The heart of this approach lies within two machine learning techniques: segmentation and probability.

Why propensity modeling for this exercise?  I wanted to rank my superheroes for the bracket using statistics (i.e. Batman is not getting a number one seed.)

35 characters were segmented on strength, ability, defense and health.  For the propensity score I gathered ranking information from crowd-sourced websites and surveys.  Using this I was able to give each character an intangible skill score.  The reasoning was that I wanted the medium of comics to do the majority of the work for me.  Comics are stories and the narrative drives the inner core of a character.  The higher a character ranks on a fan-sourced website, the more I’m assuming they are written well and are timeless.
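This recipe is tagged with k-means clustering, so here is a hedged sketch of how a segmentation on strength, ability, defense and health might look. The stat values, the character labels and the choice of two starting centroids are all my assumptions for illustration, not the actual Injustice 2 data or settings.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical (strength, ability, defense, health) ratings on a 0-10 scale.
stats = {
    "Bruiser 1": (9, 3, 8, 9),
    "Bruiser 2": (8, 4, 9, 8),
    "Trickster 1": (3, 9, 2, 4),
    "Trickster 2": (2, 8, 3, 3),
}

points = list(stats.values())
# Start from two of the observed characters so the run is repeatable.
assignments, centroids = kmeans(points, [points[0], points[2]])
segments = dict(zip(stats, assignments))
```

Seeding the centroids with fixed points keeps the example deterministic; a real run would usually try several random starts and compare the results.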

Next step was to take the mean of the intangible skill score and flag those characters above the average (this will be my dependent variable for my logistic regression to calculate a propensity score).

What was thrown into the propensity model?  The skill sets gathered from the Injustice game.  The assumption here is that a character with Superman’s skill set would be written much differently than, say, Catwoman.

011_pic


Now it’s time for our throw down.

001_pic

The top four characters by propensity score were:

Cyborg

Supergirl

Aquaman

Black Adam

To determine a winner in the throw-downs characters were put up against each other in 11 categories.


Round 1 Takeaways:

002_pic

Our number one seed Cyborg nearly lost to Atrocitus.  The result was 6-2-5, which reads as 6 wins, 2 ties and 5 losses.
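A win-tie-loss record like this is just a category-by-category comparison. Here is a small sketch of that tally; the category names and ratings are invented for illustration and are not the categories used in the tournament.

```python
def head_to_head(a, b):
    """Compare two characters category by category; return (wins, ties, losses) for a."""
    wins = sum(1 for k in a if a[k] > b[k])
    ties = sum(1 for k in a if a[k] == b[k])
    losses = sum(1 for k in a if a[k] < b[k])
    return wins, ties, losses

# Hypothetical category ratings for two characters.
top_seed = {"strength": 7, "ability": 9, "defense": 6, "health": 8, "speed": 5}
challenger = {"strength": 8, "ability": 6, "defense": 6, "health": 7, "speed": 4}

record = head_to_head(top_seed, challenger)
advances = record[0] > record[2]  # more category wins than losses
```

In this toy match-up the record comes out as 3-1-1, so the top seed advances.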

There were no upsets in the first round of play.  A few characters did not win a single category in their match-ups:

Harley Quinn (vs. Captain Cold)

Green Arrow (vs. Batman)

Black Manta (vs. Black Canary)

These three characters were ill-equipped to take on their opponents; it is possible they would have advanced given a different opponent.

003_pic


Round 2 Takeaways:

004_pic

Cyborg (our number one seed) defeated Captain Cold by a larger margin (+3 winning categories) than in the previous match-up against Atrocitus, but he scored one fewer win.

We begin to see upsets in Round 2:

Robin defeated Black Adam by 1 winning category.  Wonder Woman defeated Firestorm by 4 winning categories.  Batman defeated Supergirl by 3 winning categories.

On propensity scores these were upsets, but from a comic book debate standpoint you could argue for them, e.g. given enough time to prepare, Batman could defeat Supergirl.

005_pic


Round 3 Takeaways:

006_pic

Cyborg falls to Superman, losing by 4 categories.  This was the biggest fight Superman had been given in this tournament to date (in both previous rounds he had 9 winning categories).

The upsets keep coming in:

Robin sneaks in a win again by 1 winning category (over Brainiac). Wonder Woman defeats the top seed in her region of the bracket (Aquaman) by 4 winning categories.  Batman defeated Green Lantern by 3 winning categories.

007_pic


Final 4 Takeaways:

008_pic

Robin’s Cinderella story comes to an end at the hands of Superman (winning in 9 categories).  Robin did fare better than the previous opponents who gave up 9 category wins to Superman… Robin won in 2 categories.

Batman was able to upset Wonder Woman by 2 winning categories.  We’re set for a championship round, the original “who wins” debate… Batman Versus Superman!

batman-vs-superman-movie


Our winner is…

009_pic

Superman defeats Batman, but not in a landslide.  Batman lost by two categories, but he was able to win in 5 categories.  Previously the most categories anyone had won against Superman was 3.


What did we learn from diving into the DC data?  Comic book writing and fan perception go a long way in determining who wins a throw down debate.  If we use propensity modeling we can have a more even playing field and limit the number of unfair battles.


005

SupermanPancakesW



Cosplay

Recipe: 007 Comic Con Cosplay and the Drivers of Instagram Engagement



Halloween has recently passed and it’s a good transition into this week’s analysis.

Let’s face it: dressing up on Halloween is the first step to cosplaying at your local comic con.

Cosplay can be a lucrative business if done correctly, and many people do it well.  As you read through this week’s analysis, I urge you to respect and treat cosplayers as you would any other professional.  It takes a lot of hard work and dedication to master their craft as they have.


meal_specs_cosplay


 

meal_card


 

cosplay_group_001

A staple at any comic con is the Cosplay culture.  Fans show their appreciation and passion for beloved characters.  Cosplay can also be a lucrative business if you have a strong work ethic, are consistent, and dedicated to your craft.

Get out the hot glue gun and let’s start forming the foam!

I’ve gathered a random selection of Cosplay data from Instagram.  The cosplayers ranged from more than 3 million followers to fewer than 2K.  This alone posed an interesting challenge: how do I normalize and standardize my data to fit into a model?

My solution was to factor in key performance indicators of Instagram success (regardless of being in the realm of cosplay) and implement an engagement score for each cosplayer (like a customer value score).

To prevent confounding variables (influencers with a direct correlation to each other), I elected to exclude everything which went into the engagement score.
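The exact KPIs behind the engagement score aren’t listed here, so treat the following as one possible shape for it rather than the formula used in this recipe. It assumes a simple interactions-per-follower ratio, and the account names and counts are invented.

```python
def engagement_score(likes, comments, followers):
    """Interactions per follower, so large and small accounts share one scale."""
    if followers <= 0:
        raise ValueError("followers must be positive")
    return (likes + comments) / followers

# Hypothetical accounts at opposite ends of the follower range.
accounts = {
    "big_account": {"likes": 90_000, "comments": 1_500, "followers": 3_000_000},
    "small_account": {"likes": 180, "comments": 20, "followers": 2_000},
}

scores = {name: engagement_score(**row) for name, row in accounts.items()}
```

Dividing by followers is what lets a 2K-follower account and a 3-million-follower account sit in the same model: in this toy data the small account is actually the more engaged of the two.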

for_blg_002

My initial read shows this model is very predictive of the data sample gathered from Instagram, and the highest influencer with significance is images of the cosplayer where they are exposed (think NSFW but tasteful).  The impact of the number of hashtags was skewed by a correlation: the more followers an account has, the fewer hashtags (down to none) it uses.


cosplay_group_002

If you’re a subscriber to this blog and enjoy the Stacks of Stats, you’ll recognize my preference for Q-Q graphs.

There’s a large curl at both tails but most of the data fits well, so there won’t be a need to run a more complex model.
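For readers new to this kind of graph, here is a minimal sketch of how the points on a Q-Q plot are built: sort the residuals and pair each one with the value a normal distribution would expect at that rank. The residuals below are made up, with one extreme value at each end to mimic the curl at the tails.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted residual with the normal quantile expected at its rank."""
    ordered = sorted(residuals)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common plotting-position convention.
    theoretical = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical model residuals, for illustration only.
residuals = [-9.0, -1.2, -0.8, -0.3, 0.0, 0.2, 0.7, 1.1, 1.4, 8.5]

points = qq_points(residuals)
# Gap between observed and expected values at the two extremes.
tail_gaps = [abs(observed - expected) for expected, observed in (points[0], points[-1])]
```

If the residuals were perfectly normal, every pair would fall on a straight line; points that peel away at either end are the extreme values worth investigating.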

for_blg_001

What could be causing these extreme values towards the end of each tail?

While gathering and visualizing my data, I observed an interesting behavior:

The number of hashtags varies widely and has almost no correlation with engagement.

Two factors are driving the skewness:

Newer cosplay accounts use fewer hashtags at the beginning

Well-established cosplay accounts use few to zero hashtags in their most recent posts.


for_blg

Our data story isn’t complete.  Once I take the exposed variable to the profiling stage and begin to extrapolate the engagement impact, a telling data story begins to form.

For example, this table reads as:

DC Comics themed cosplayers who also happen to be exposed potentially drive nearly 700 more likes than fully clothed cosplay images.

So what has the highest impact?  We can crown Nintendo the champion, and most of it is from the Bowsette trend, potentially driving in a whopping +61K likes.

Interestingly enough, the runner-up from a potential engagement impact standpoint is Scooby-Doo (Velma mostly), and the gap is less than 10K likes.

Does being exposed help boost all themes of cosplay?  There is one theme in this sample with a negative relationship: Anime.  A possible reason behind this relationship is the niche fan base and the attention to detail Anime fans have, i.e. it’s hard to go as Sailor Moon without the bow.


 

for_blg_003

What have we learned from diving into the Cosplay data?

Being a top cosplayer on Instagram is as delicate as any social media fame.  Every post, every composition, every hashtag, every theme… can make or break your brand.  Not all cosplay needs to have a level of exposure to be successful, but it is a huge driver in engagement.

A few uses of this analysis: if you’re going to theme as Scooby-Doo, lean towards Velma; there’s enough out there for comparison.

If you’re looking for a large impact and are a fan of video games, take a dive at Bowsette (drives in a potential +61K likes).

Finally, more hashtags do not mean more likes.

There’s more value in posting a cosplay of a character you are passionate about and using relevant hashtags for more organic likes.

After you have consumed this meal, I hope you take these findings and improve your cosplay engagement.  Also, as always, enjoy the featured pancake recipe below!


005

 

for_blg_004


006

https://www.inquisitr.com/5035455/the-5-sexiest-female-cosplayers-to-follow-on-instagram/


 

 



 

Regression Modeling

Recipe: 002 Marvel Cinematic Universe Regression Model



There is no argument against the Marvel Cinematic Universe being a financial success.  I’ll try to identify variables which can explain box office success.  The goal is to fit a regression model to Box Office USD for Marvel Cinematic Universe movie releases.
*At the time of cooking, Ant-Man and the Wasp did not have finalized Box Office USD data, so this movie was excluded. – TF


002001


002002


002003


002004


002005


002006


Thanks for stopping by and chowing down on this Recipe (click the link for a reader-friendly pdf version of this recipe)

Now try this delicious pancake recipe (with the Iron Man Gold and Red finish) courtesy of Crème De La Crumb (Link Below):

002007