TBCC 2019 The Pokemon Journey Panel


Welcome to the first recap of the Comic Con Data Science panels run by the crew at Pancake Analytics. Before I dive into the recap of The Pokemon Journey panel held at the Tampa Bay Comic Convention 2019, I’d like to give a quick overview of why I’ve chosen this path.

One question I get asked often is where I got the idea to apply the fundamentals of data science to comics, video games and all things fandom.

The answer is simple to me and is a core pillar of Pancake Analytics.  I want to teach, share, engage and learn from the comic con family.

I want to TEACH those who attend my panels or interact with this page the basics of data science and how it can improve the areas of your life you are passionate about.

I want to SHARE my years of analytics experience with aspiring analysts and those scared of statistics.

I want to ENGAGE with fans of comics, video games, anime, theme parks, all things geek! I’m one of you and love our conversations.

I want to LEARN your point of view on the topics I discuss. How do we have a high-level discussion about data that doesn’t feel like a math class?

If any of these core pillars resonate with you, I hope you enjoy the content I produce and continue to join the discussions.


The Pokemon Journey at TBCC2019 was held on Saturday, August 3, 2019, from 7:30 PM to 8:30 PM.
The pitch of the panel was as follows:
Going to Tampa Bay Comic Con⁉️

Join us for a lighthearted data science discussion of Pokémon. Journey from Kanto to the Alola region through machine learning. This panel is more helpful than a Pokédex.

The panelists were myself and Steve (an indie game developer). Here’s a commissioned piece I got from a comic con artist:

Above is a visual representation of the Pokemon Journey we are about to embark on.

The steps on our Pokémon Journey:

  • New Point Of View on Pokémon
  • Field Researchers & Learning from them
  • Pokémon Team Recommendations

During the new point of view on Pokémon section, I walked the audience through a k-means clustering algorithm to reset Pokémon tiers and move us away from only grouping Pokémon together by typing.

During the Field Researchers & Learning from them section, I walked the audience through how to use survey data to build a recommendation engine (companies as large as Amazon and Netflix use this technique).

During the Pokémon Team Recommendations section, I walked the audience through the output of the recommendation model and real-life scenarios of recommended teams.



K-means clustering uncovers trends within our Pokémon data so we can understand the relational similarities and differences on key in-game attributes.

The more clusters the clearer our picture becomes and the deeper we can understand the Pokémon throughout our journey.

When you pick up a Pokémon game for the first time ever you are in the left square. Running this algorithm will get you to the bottom right sooner: a clear picture.



A brief overview of the approach:

Standardize your variables (rescale each variable to a mean of zero and a standard deviation of one)

Analyze your elbow curve

Validate your clusters
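For anyone who wants to try the three steps above at home, here is a minimal Python sketch using only the standard library. The stat rows are made up for illustration (they are not real Pokémon values), the farthest-first seeding is a simplification chosen so the result is repeatable, and "validate" is interpreted here as profiling each cluster on the original scale. In practice a library such as scikit-learn would usually do the heavy lifting.

```python
from statistics import mean, pstdev

# Made-up (hp, attack, defense) rows for illustration only; not real Pokémon stats.
stats = [
    [45, 50, 45], [50, 55, 40], [48, 52, 50],        # low across the board
    [70, 80, 120], [65, 75, 130], [75, 85, 125],     # defensive
    [100, 130, 90], [95, 140, 85], [105, 135, 95],   # strong attackers
]

def standardize(rows):
    """Step 1: z-score each column (mean 0, standard deviation 1)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero on constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then keep adding the point farthest from the seeds."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm. Returns (labels, centroids, within-cluster sum of squares)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing, so we have converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [mean(col) for col in zip(*members)]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss

z = standardize(stats)

# Step 2: the elbow curve is the within-cluster sum of squares for each candidate k.
elbow = {k: kmeans(z, k)[2] for k in range(1, 6)}

# Step 3: validate by profiling each cluster on the original scale.
labels, _, _ = kmeans(z, 3)
for j in sorted(set(labels)):
    members = [row for row, lab in zip(stats, labels) if lab == j]
    print(j, [round(mean(col), 1) for col in zip(*members)])
```

The printed profile (average hp, attack and defense per cluster) is what lets you give the groups names like High, Medium and Low.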

3 Distinct Groups:

High – Highest in all categories except for base defense and hp

Medium – Highest on defense, middle ground in everything else

Low – Only high on hp



What does this tell us about the starters?

The output of the k-means clusters can be used to help determine your approach from the very beginning.

Reading the pyramid:

Easy path:

Greninja, Swampert, & Sceptile

Hard path:

Serperior, Meganium, Torterra, & Chesnaught



How do we implement this scoring?

I needed more data to implement this approach.

5 Questions:

What’s your ideal team of 6 Pokémon?

What year did you start playing Pokémon?

Do you play Pokémon GO?

How many Pokémon games have you played?

Do you play the Pokémon TCG?


This approach recommends a new squad of Pokémon to the field researcher!

Implementing the scoring: Trust The Process

A propensity model is a statistical scorecard that is used to predict the behavior of your customer or prospect base. Propensity models are often used to identify those most likely to respond to an offer, or to focus retention activity on those most likely to churn.

This is the whole Pokémon journey coming full circle.

The Pokémon Professor has done their own research and builds a model.

The field research team assists the Pokémon Professor with gathering new data.

The Pokémon Professor uses the model to assist the field research team.

Here’s the model at work, the input and recommendations:





During my data science panels I like to reinforce the learning through a game, and participants get a prize from my own personal collection. For this specific panel participants received an unopened pack of Team Up from the Pokemon TCG and an individual Pokemon EX TCG card.

Here’s an overview of the game:

5 Volunteers

On the screen will be 3 Pokémon

2 Characters are look-a-likes (statistically speaking)

Volunteers will do their best to convince the panel of which two characters are look-a-likes and who should be wonder traded

For participating, volunteers receive a fabulous prize

I want to personally thank everyone who attended the panel at the Tampa Bay Comic Convention. I look forward to meeting again in 2020.


K-Means Clustering, Pokemon Go

Recipe 015: Pokemon Gen 3 K-means Clustering


Take Charge of your Destiny!

In this data story I’ll be showing you how a self-guided (unsupervised) machine learning algorithm can select the best Pokemon squad for the Hoenn region.

At the end of this data story you’ll have six Pokemon to look out for in Pokemon GO, as well as an understanding of why the Bagon Community Day was the best to date!



As seen in the generation 2 games, the generation 3 games brought a wave of changes, especially to the data structure.

Listed below are what I feel to be some of the major changes which affect the data of Hoenn region Pokemon.

Main features added since Generation 2:
A complete overhaul of the Pokemon data structure:
Individual personality value
Abilities and Nature
The IV system went from 0-15 to 0-31
Passive damage such as Poison, Burn and Leech Seed is resolved at the end of the turn instead of immediately
135 new Pokemon introduced
103 new moves were introduced
Weather can now be found on the field and activate at the start of a battle
Double Battles


I’d like to call out double battles, as one of the main ingredients in my Pokemon evaluation soup is experience growth rate.

Double battles allow for more and quicker experience.

In other words all Pokemon can gain more experience earlier on in the game.


If you recall, when I looked at the data of the Johto Pokemon we were introduced to some very strong bugs.

Now in the Hoenn region we are introduced to weaker bugs.

This was done to counteract the impact of Heracross and Shuckle.

Catch the bugs below for Pokédex completion, but you’re not going to have them on your main team.


Weak bugs aside, you do get one of the most powerful dragons (if not the most powerful): Salamence. If you play Pokemon Go, you most likely took advantage of what was, in my opinion, the best Pokemon Go Community Day to date (held on 4/13/2019).


One of my favorite sayings and mottos is “Stay away from the brand names.” What does this mean and how does it apply to Pokemon? It means don’t buy into popular opinion; let the facts and data support your choices.

What’s all you hear about on community days? If you screamed “shinies” then yes… that’s all you hear about. How many shinies did you catch?

What’s your highest CP shiny? I’ll trade for shinies. Don’t be distracted by the brand name of community day; go for more than shinies. Play in an area with several PokéStops and cover from the weather.

During the Bagon community day you should have been catching every Bagon spawn, not only clicking in to see if it’s shiny. Salamence is the goal; you want to be the mother of dragons (yes, I’m hype for Game of Thrones).



Sticking to the theme of “Stay Away from Brand Names”, applying a k-means clustering algorithm will look for trends in the data and give us a group of Elite Pokemon we should replay Pokemon Ruby and Sapphire with and keep an eye out for in Pokemon Go.

How do we get to the ideal Pokemon team? By applying a self-guided machine learning approach: k-means clustering. Now you can’t jump ahead and run the algorithm against your raw data. The first step is to standardize your data, because you want to give each of your attributes an equal weight.

Take for instance:

I want a well-balanced team; I don’t want a team that is elite on attack but weak on defense.

After the data is standardized and I run the k-means algorithm, you can see the scatter plot above. The top right and far right cluster is the segment I want to build my team out of. You can win with all the other segments, but you can’t 100% steamroll the competition.

Below I’ve included a visual representation of the top attackers and defenders in each cluster.





This is great, I love infographics… but what do we do with this knowledge? Well, we can build a team.

Your team building begins from the very beginning.

I’ll cut to the chase… you should choose Torchic (sorry Swampert fans)


Why Torchic? Well, I’m concerned about team structure and, most importantly, a showdown with Slaking (Fighting moves are a must). Below you can see the full recommendation of what your final team should look like. You should also target all of these in Pokemon GO.






nintendo, Propensity Modeling, Super Mario

Recipe 014: Smash Brothers Main Selection


In this recipe I’d like you to chow down on a Smash Brothers analytical approach to selecting your main character. The approach I’m going to introduce puts an emphasis on what makes a character unique.




Before I start diving into the Smash Brothers data, let’s discuss the k-means clustering approach. K-means clustering helps paint a clear picture of our data; in this case it will group Smash Brothers characters by their attributes to create a picture of who your main should be. Our characters will be assigned to segments (tiers… everyone loves to put tiers around Smash characters, but they’re based solely on opinion and player preference) based on trends in our data and how close a character is to a group.

Take the above picture: without applying this approach we are in the top left quadrant, with only a faint idea of who should be our main. As we apply more segments and more trends in the data we’ll eventually end up in the bottom left quadrant: a clear picture of who our main should be.

Now I keep mentioning trends in our data. How do we find trends in data where attributes are, on the surface, on completely different scales and non-normalized? Take for instance a character’s weight: as a whole number it will always be larger than a character’s acceleration rate in the air (which matters for aerial attacks).

We can find these trends by standardizing our variables, rescaling each one to a mean of zero and a standard deviation of one. In doing so this analysis focuses strictly on the trends in our data and we can have a pretty interesting discussion: i.e. Yoshi is more similar to Kirby than he is to Pac-Man.
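To see why the rescaling matters, here is a small Python sketch with hypothetical fighters and made-up weight and air acceleration values (none of these numbers come from the game data). It finds each fighter's nearest neighbor twice: once on the raw numbers, where weight drowns out everything else, and once on standardized values.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical fighters with made-up (weight, air acceleration) values, for illustration only.
fighters = {
    "Fighter A": (100, 0.04),
    "Fighter B": (102, 0.10),
    "Fighter C": (108, 0.05),
    "Fighter D": (80, 0.09),
    "Fighter E": (120, 0.07),
}

def standardize(table):
    """Z-score every attribute so each one has mean 0 and standard deviation 1."""
    cols = list(zip(*table.values()))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return {n: tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for n, row in table.items()}

def nearest(table, name):
    """The other fighter with the smallest Euclidean distance to `name`."""
    return min((n for n in table if n != name), key=lambda n: dist(table[n], table[name]))

raw_match = nearest(fighters, "Fighter A")              # decided almost entirely by weight
z_match = nearest(standardize(fighters), "Fighter A")   # both attributes count equally
print(raw_match, z_match)
```

On the raw numbers Fighter A pairs with the fighter closest in weight, even though their air acceleration is very different. After standardizing, the match shifts to the fighter who is close on both attributes, which is the kind of "who is more similar to whom" statement the clustering relies on.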


Super Smash Bros Ultimate Mural


In preparation for this data story I came across the following article on Business Insider: “These are the 11 best ‘Super Smash Bros. Ultimate’ characters, according to the world’s number-one ranked player”

Here’s an excerpt from the article:


And here is ZeRo being named the best overall player:


This triggered a thought in my head. I haven’t done this on the Pancake Analytics page yet, but typically you would bring a k-means clustering into production and re-score your segments on an agreed-upon cadence. In this case I’ll treat the release of a new game as the cadence.

I’ll run a k-means clustering on the character attributes in the Wii-U version and then a k-means clustering on the same character attributes for the Switch version.

While going through this process I’ll only be including those characters who were in both games and where the data is clean: i.e. all characters have a weight and all characters have available acceleration data.  Sorry Inkling, you’re not in this segmentation.
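As a rough Python sketch of that data prep and re-scoring step: keep only characters present in both releases with complete data, then list whoever changed segment. The attribute values and field names below are made up, "Example Fighter" is hypothetical, and the segment assignments are placeholders apart from Ganondorf's move from Air Tanks to Floaters, which is the one reported in this post.

```python
REQUIRED = ("weight", "air_accel")  # assumed field names

# Illustrative rows only: attribute values are made up, and None marks a missing value.
wii_u = {
    "Bowser": {"weight": 130, "air_accel": 0.05},
    "Ganondorf": {"weight": 110, "air_accel": 0.03},
    "Jigglypuff": {"weight": 70, "air_accel": 0.08},
    "Example Fighter": {"weight": 90, "air_accel": None},
}
switch = {
    "Bowser": {"weight": 130, "air_accel": 0.05},
    "Ganondorf": {"weight": 110, "air_accel": 0.04},
    "Jigglypuff": {"weight": 70, "air_accel": 0.08},
    "Example Fighter": {"weight": 90, "air_accel": 0.06},
    "Inkling": {"weight": 95, "air_accel": 0.07},
}

def complete(row):
    return all(row.get(field) is not None for field in REQUIRED)

def eligible(old, new):
    """Characters present in both releases with every required attribute filled in."""
    return sorted(n for n in old.keys() & new.keys() if complete(old[n]) and complete(new[n]))

def movers(old_segments, new_segments, names):
    """Characters whose named segment changed between the two scoring runs."""
    return {n: (old_segments[n], new_segments[n])
            for n in names if old_segments[n] != new_segments[n]}

names = eligible(wii_u, switch)

# Segment names assigned after each clustering run (placeholders, see the note above).
wii_u_segments = {"Bowser": "Air Tanks", "Ganondorf": "Air Tanks", "Jigglypuff": "Floaters"}
switch_segments = {"Bowser": "Air Tanks", "Ganondorf": "Floaters", "Jigglypuff": "Floaters"}

changed = movers(wii_u_segments, switch_segments, names)
print(names, changed)
```

Note that the comparison uses the human-assigned segment names rather than the raw cluster numbers, because the numbering from two separate k-means runs is arbitrary and will not line up on its own.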



Above are both segmentation cadences and characters will be split into these segment tiers:

  • Floaters (Far right circle)
  • Jack of all Trades (Smack in the middle)
  • Dashers (Faster than your Jack of all Trades segment but not fast enough to be elite in that attribute)
  • Air Tanks (The bottom left circle)
  • Speedsters  (Top left circle)

These aren’t ranked by what tier is the best, but we can make some assumptions.  The Jack of All Trades segment, most likely you won’t be winning matches often but you’ll be competitive.

Smash Brothers is a unique fighting game, so characters do have a weight to them. Being lightweight does have its advantages, but the learning curve of playing as a Speedster might be too high risk, high reward for you.

The Floaters: if you select someone with a weight advantage in this group, you’re likely to win your match, but you have to master the move set (your smash move).

Air Tanks are a no-brainer, I think, for any skill level. If you want to have a high likelihood of lasting till time runs out, be an Air Tank (this won’t guarantee a win; that really depends on your competition).



I’m hoping this visual stood out to you, the reader: Ganondorf made a large leap from the Air Tanks to the Floaters. This doesn’t only speak to Ganondorf; it also tells you something about Bowser.

When I speak about this to clients and those wanting to learn about a particular data set, this is how it translates:

Ganondorf has more in common with Jigglypuff than he does with Bowser. The reason is that he’s quicker and adapts better to aerial attacks and falling than Bowser does.

On the flip side, I can also say that in Super Smash Bros. Ultimate, Bowser more accurately represents how he’s viewed in the Super Mario franchise.

Neither one of these characters was “nerfed”, only re-calibrated so there’s a distinct difference between the two.

What do you do with this information? If your main is a Floater, Ganondorf would be a good transitional character if you were looking to play as a character with more weight. Or say you always play as an Air Tank because you assume anyone who has Kirby as a main shouldn’t be playing Smash Bros.; then Ganondorf is a good transitional main for when you eventually give in and select Kirby “by accident”.



Below are the segments, with a brief overview of the characters within each segment:


This segment has high variability, and you can see this from the oblong shape of the circle. Ganondorf and Jigglypuff are driving this shape: although they are in the same segment and are more similar to each other than they are to other segments, they are the furthest apart within this segment.

Now hold up… wait a second. Didn’t I just try to prove a point of how similar they are? Yes, but in relation to who’s more similar to Ganondorf: Jigglypuff or Bowser. If I posed the question of who is more similar to Ganondorf, Jigglypuff or Kirby… that answer is Kirby.

This group is, on average, the slowest by run speed and lightest by weight… they Float.



This segment is the medium of everything. There’s no uniquely distinct trend in their data. Now, playing as Pikachu vs Mega Man would have some gameplay differences, but statistically speaking you are starting with the same underlying stats.

If you’re new to the series this is a good group to start with… they’re a Jack of All Trades.




The Dasher segment is very similar to the Jack of All Trades segment, only slightly faster. Playing in this group you could potentially do more harm than good if you’re selecting it because you want to stay middle ground. You could… Dash yourself off the arena.



Air Tanks are fast in aerial attacks… and the heaviest? I’m anticipating this group will be re-calibrated by the next release. In other words… Bowser has no business being as effective in the air as he is at his weight; normally these two variables don’t correlate. I guess all the time battling a plumber who can flip and jump is finally paying off.



This is your high risk, high reward group. Characters in this segment are the fastest and the lightest. I personally am awful playing as Sonic; he’s too fast for my playing level, but a seasoned player could probably mop the floor with Sonic.


So who should be your main? In this section I rely on industry knowledge as well (ZeRo’s tiers as the dependent variable). I’ll build a propensity score with the following independent variables:

  • Change in air acceleration
  • Base air acceleration
  • Base speed in the air
  • Base Run Speed
  • Character Weight
  • Ultimate Smash Bros. Cluster
  • Wii-U Smash Bros. Cluster


The output will give me the likelihood ZeRo would rank the character as a top tier character.  The highest influencers on predictability were:

Change in air acceleration

Run speed

The lowest influencers were:

Base air acceleration

Ultimate Smash Bros. Cluster (this highlights the bias towards the Wii-U stats, influencing ZeRo’s rankings)

Drum roll please….




You should have your main be one of the above three.  This is the data solution to selecting your main.

Really looking forward to the comments section on this one 🙂




Marvel Comics, Propensity Modeling, Regression Modeling

Recipe 013: Marvel Comics Propensity Score


How crazy would it be if I told you Howard the Duck and Old Man Logan are closer to each other in skill sets than they are to any other Marvel characters?  Or how about Thor and Dr. Octopus are lookalikes as well?  Let’s answer these questions together by wrangling some readily available data.






If I’ve learned anything from my career in data science it’s this: 80% of the work is data gathering and ETL work, and 20% is analysis.

Nothing proves this statement better than finding data on Marvel characters’ skill sets on a normalized scale. In this data story I’ll be using data from Marvel Contest of Champions (power index levels, health and attack) and the Marvel Battle Royale (a Twitter fan poll of the greatest superheroes).

A few more variables I’ll need to calculate around the results of the Marvel Battle Royale Twitter Fan Poll:

Total votes per each round

Average Total votes

A flag for whether a character received higher than the average total votes per Marvel character

This flag I’ll use as my dependent variable and my independent variables will be the Marvel Contest of Champions statistics.

What will this do?  This will predict the likelihood a Marvel Character would receive higher than the average total votes in the Marvel Battle Royale.

Once this is calculated I’ll receive an output of coefficients, which I can apply to the rest of the Marvel characters who weren’t in the Marvel Battle Royale to create a propensity score.
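Here is a minimal Python sketch of that workflow, under some loud assumptions: the champions, stats and vote counts are all made up, the model is a plain logistic regression fitted by gradient descent with a small ridge penalty (one common way to build a propensity score, not necessarily the exact model used for this recipe), and only the standard library is used.

```python
from math import exp
from statistics import mean, pstdev

# Made-up data for illustration: (health, attack) for every champion, poll votes per round for some.
stats = {
    "Hero A": (9000, 700), "Hero B": (8500, 650), "Hero C": (6000, 400),
    "Hero D": (5500, 450), "Hero E": (8800, 680), "Hero F": (5800, 420),
}
votes_by_round = {
    "Hero A": [1200, 1800, 2200], "Hero B": [1500, 2600],
    "Hero C": [900], "Hero D": [1200],
}

# Dependent variable: 1 if a champion's total votes beat the average total, else 0.
totals = {name: sum(rounds) for name, rounds in votes_by_round.items()}
avg_total = mean(list(totals.values()))
flag = {name: int(total > avg_total) for name, total in totals.items()}

# Standardize the independent variables using every champion, polled or not.
cols = list(zip(*stats.values()))
mus, sds = [mean(c) for c in cols], [pstdev(c) for c in cols]
z = {n: [(v - m) / s for v, m, s in zip(row, mus, sds)] for n, row in stats.items()}

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def fit_logistic(rows, targets, lr=0.1, steps=2000, l2=0.01):
    """Logistic regression by gradient descent with a small ridge penalty.
    Returns (intercept, coefficients)."""
    b, w = 0.0, [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        errors = [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
                  for x, y in zip(rows, targets)]
        b -= lr * sum(errors) / n
        w = [wi - lr * (sum(e * x[i] for e, x in zip(errors, rows)) / n + l2 * wi)
             for i, wi in enumerate(w)]
    return b, w

polled = list(flag)
intercept, coefs = fit_logistic([z[n] for n in polled], [flag[n] for n in polled])

# Apply the coefficients to every champion, including the ones who were never in the poll.
propensity = {n: sigmoid(intercept + sum(c * x for c, x in zip(coefs, z[n]))) for n in stats}
print({n: round(p, 2) for n, p in propensity.items()})
```

Hero E and Hero F never appeared in the made-up poll, yet they still receive a score, because the coefficients only need the game stats. The ridge penalty is there because a toy data set this small separates perfectly, which would otherwise let the coefficients grow without limit.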



Now let’s backtrack a little bit and see why I’m going with a propensity model as opposed to a grouping by opinion, e.g. putting all the top attackers in the same category.

The top 3 characters based on Attack are Rocket Raccoon, Spider-Man (Symbiote), and Blade.

In the above histogram, if you look all the way to the far right you’ll notice they are data points on their own little island.




Well, what if I just grouped everyone by Health? This data visualization looks more promising, but most likely there would be overlap on the other attributes and you wouldn’t be able to implement this successfully.



The power index could be suitable by definition, but from the top 3 selected on power index I can tell this rating isn’t an index in the sense I would typically use one (time-series forecasting). It looks more like the Pokemon Go Combat Power (CP) system: a single number summarizing how strong a character is at full potential.



One use of a propensity score is to create similar groups, based on the likelihood of performing a behavior.
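In code, "similar groups" can be as simple as pairing each character with whoever has the closest score. The Python sketch below uses placeholder scores chosen only to mirror the pairings described in this post; they are not real model output, and the caliper (the largest score gap still counted as a match) is an assumed setting.

```python
# Placeholder propensity scores, chosen to mirror the pairings in this post; not real model output.
scores = {
    "Doctor Octopus": 0.81,
    "Thor (Ragnarok)": 0.80,
    "Howard the Duck": 0.35,
    "Old Man Logan": 0.36,
    "Medusa": 0.55,
    "Gwenpool": 0.57,
    "Thanos": 0.99,
}

def closest_match(scores, name, caliper=0.05):
    """Nearest other character by propensity score, or None if nobody is within the caliper."""
    others = [n for n in scores if n != name]
    best = min(others, key=lambda n: abs(scores[n] - scores[name]))
    return best if abs(scores[best] - scores[name]) <= caliper else None

for name in scores:
    print(name, "->", closest_match(scores, name))
```

The caliper keeps an outlier from being force-matched: with these placeholder numbers the top-scoring character has no lookalike at all, rather than being paired with the next best score.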

In this case Doctor Octopus and Thor (Ragnarok) are statistically the same in the Marvel Contest of Champions skill set. For those of you who want to go down an interesting rabbit hole, you can find YouTube videos on why Doctor Octopus should be in a demi-god tier.

This propensity score approach literally put Doctor Octopus in the same tier as a demi-god!



Medusa by power index alone would be close to Thanos but factoring all skill sets, she is statistically closer to Gwenpool, Cable, and Nightcrawler than she is to the Mad Titan.



Now for the crazy but statistically significant section.  Howard the Duck (I’m hoping he gets a show on Disney+) and Old Man Logan are a propensity score match.

An example like this is where many in data science begin to argue: when does subject matter expertise come into play? We can argue significance forever, on any topic, but we can agree that all Marvel Champions have value if played correctly.








Recipe 012: Pokemon Gen 2 K-means Clustering


Thanks for coming for a bite; let’s dig into some pancakes and the data science behind the Pokemon of the Johto Region. How do they differ from the Kanto Region? What’s the importance of introducing two new Pokemon types? Finally, how will speaking about the trends in our data help us understand the relational differences and similarities beyond general Pokemon typing?




Pokemon Gold and Silver ushered in a new era for the Pokemon series, and listed below are a few changes (not all the influential game changes are listed in this post) which still have a large influence to this day:

The introduction of Shinies (Shiny Gyarados shown above)

Gender types

Eggs, breeding and babies

The experience bar

Two new Pokemon types: Dark and Steel


I want to touch specifically on two items in the above list and how they affect the overall re-balancing of the Pokemon universe (see above the increase of stronger Bug type Pokemon) and how they’re driving the differences between generation 1 and generation 2.

Eggs, breeding and babies

Two new Pokemon types: Dark and Steel

How do Eggs, breeding and babies influence the trends in our data? For instance there are more Normal types added to the mix (+1%), but the average base attack (-8%) and base defense (-2%, even with the introduction of Blissey!) have both declined versus generation 1 (Red and Blue).


How does the introduction of two new Pokemon types, Dark and Steel, influence our trends? Those of you who have played Gold and/or Silver know this is the longest game in the Pokemon series to date, because you also travel back to the Kanto region (where Psychic and Ice types reign supreme!).

Dark type Pokemon are super effective against Psychic and Ghost types.  They’re vulnerable to Fighting, Bug and Fairy types.

Steel type Pokemon are super effective against Rock, Ice and Fairy types, and they resist Dragon type moves. They’re vulnerable to Fighting, Ground, and Fire.

Bug type Pokemon are super effective against Psychic, Grass, and Dark.  They’re vulnerable to Flying, Rock and Fire.

Dark and Steel types were introduced to re-balance the game and give the player the tools to be prepared for the Kanto region challenges. In doing so, new and stronger Bug type Pokemon (think Heracross and Shuckle) were introduced to add a check on those trainers who go on a full-on attack against Psychic type Pokemon (Dark and Grass types [counters to Mewtwo]).

Now that we’ve dug into the differences in our data between generation 2 and generation 1, we can begin focusing on generation 2 and how we can apply a self-guided machine learning approach to building the best Pokemon Johto team we can!


While training this model, I uncovered a segment full of only legendary Pokemon. Although you can get these Pokemon in the game, I will be removing them from this analysis for a few reasons:

They’re overpowered compared to the rest of the population.

They’re meant as a reward.

It’s not very insightful to know the legendary dogs have more in common with other legendary Pokemon than with a baby Pokemon.

Let’s continue…


In my segmentation I’ll be throwing in several key performance indicators for Pokemon value throughout the game ranging from base attack to experience growth rate.  How do I get these vastly different attributes on the same scale?

Through standardization! Standardizing my variables to a mean of zero and a standard deviation of one will put a heavier weight on the trends within the data, as opposed to the individual scale of each variable.


How do I determine the proper number of clusters? I’ll analyze this elbow graph and look for the point where my within-cluster sum of squares begins to bend (as an elbow would).

At first glance I begin to see the shift at 4 groups, then a slight change at 5 groups and a vast difference at 6 groups. What does this tell me? Possibly one of the clusters has high deviation and variability on the attributes selected for clustering.
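Reading an elbow graph by eye can also be backed up with a little arithmetic. The Python sketch below assumes you already have a within-cluster sum of squares for each number of clusters; the values are hypothetical placeholders (not the ones behind this graph), and "sharpest bend" is just one simple heuristic.

```python
# Hypothetical within-cluster sums of squares for k = 1..7 (placeholders, not this post's values).
wss = {1: 1000.0, 2: 700.0, 3: 450.0, 4: 220.0, 5: 200.0, 6: 120.0, 7: 110.0}

def drops(wss):
    """Improvement gained by going from k-1 to k clusters."""
    ks = sorted(wss)
    return {k: wss[prev] - wss[k] for prev, k in zip(ks, ks[1:])}

def elbow(wss):
    """k with the sharpest bend: a big drop going into k followed by a small drop leaving it.
    Assumes the candidate k values are consecutive integers."""
    d = drops(wss)
    ks = sorted(d)
    return max(ks[:-1], key=lambda k: d[k] - d[k + 1])

print(drops(wss))
print(elbow(wss))
```

With these placeholder numbers the improvement is large up to 4 clusters, nearly flat at 5, then picks up again at 6. A second dip like that is the kind of pattern that hints at one cluster with high variability that a larger k would split.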


Understanding I might have a group with high variability and seeing there isn’t a large difference from 4 groups to 5 groups, I decide to plot a 4 cluster solution.

Visualizing our data in this way (plotting the top two components [which account for 60.33% of the variability in the data]) shows me two things:

The relationships between Pokemon beyond general type.

If I ran a 6 cluster solution, my group to the far right would have a large overlap and possibly a smaller cluster smack in the middle of it.
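If you are curious where a figure like "the top two components account for X% of the variability" comes from, here is a minimal Python sketch. It assumes the components are principal components of the standardized attributes, uses made-up stat rows (so it will not reproduce the 60.33% above), and finds the largest eigenvalues with simple power iteration to stay within the standard library.

```python
from statistics import mean, pstdev

# Made-up (hp, attack, defense, speed) rows for illustration only.
rows = [
    [40, 45, 50, 42], [48, 50, 52, 55], [55, 62, 58, 50], [63, 60, 70, 66],
    [70, 78, 72, 68], [82, 80, 85, 75], [90, 95, 88, 92], [100, 98, 105, 96],
]

def standardize(data):
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in data]

def covariance(centered):
    """Covariance matrix of rows that already have mean zero."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]

def top_eigen(matrix, iters=500):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-uniform starting direction
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return value, v

def explained_by_top(data, components=2):
    """Share of total variance captured by the first `components` principal components."""
    cov = covariance(standardize(data))
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    kept = 0.0
    for _ in range(components):
        value, v = top_eigen(cov)
        kept += value
        # Deflate: remove the component just found so the next pass finds the runner-up.
        cov = [[cov[i][j] - value * v[i] * v[j] for j in range(d)] for i in range(d)]
    return kept / total

share = explained_by_top(rows, 2)
print(round(share, 4))
```

The closer this share is to 1, the more faithfully a two-dimensional scatter plot represents the full set of attributes. A share around 60% means the plot is useful but some of the separation between clusters lives in dimensions you cannot see.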

Now that we’ve done this let’s learn about the Johto Pokemon…


My top tiered Pokemon group is a clustering of elite scored attributes, which explains the variability. Above you can see the type breakdown and the top base attack and top base defense Pokemon within the group. I like this display because it puts the emphasis on how introducing Dark, Steel, and stronger bugs has influenced the Pokemon universe. In a previous analysis (which can be found in the kitchen!) I took the same approach for the Kanto Pokemon, and Psychic types were the top attackers.


The next tier is the Valuable tier, Pokemon fall in this tier because they are borderline elite in one attribute but overall well balanced.  Think of these Pokemon as the Jack of All Trades.


The Medium value tier has more variability on Pokemon type, and are Pokemon which evolve in most cases (all three starters fall in this group) but not all (see Dunsparce).  Pokemon in this tier if left as is and never evolve…. will never migrate to the upper tiers.


All Pokemon have value when trained to their full potential and this is why my bottom tier is called Low Value.  Pokemon in this tier will take time and patience but do offer unique attribute scores which can be useful at higher levels.  As seen above Granbull’s family tree begins in this tier.  There’s an opportunity to migrate from the Low value tier to the Valuable tier if you train, train, train!!!


Now that we’ve gone through this exercise what unique findings can we come up with?  Possibly something you didn’t already know.

Shuckle has more in common with Tyranitar than Miltank.

Shuckle’s unique combination of elite base defense and hp outweighs its lower-scored attacks, letting it take its place among the Pokemon powerhouses of the Johto region.

Thank you for reading this data story and if you have follow-ups or would like to continue the discussion direct message me on Instagram @pancake_analytics !

Enjoy your breakfast!