Faceting: Quickly comprehend what’s inside your data

Ritvvij Parrikh

Faceting splits a large dataset into multiple smaller subsets (“facets”).

Why it matters: Faceting reveals variations and similarities across different facets. It allows you to:

  • Do a detailed comparison without losing the broader context.
  • Understand interactions between variables, spot outliers within facets, and uncover hidden trends that might be missed in aggregated data.

In this blog, we’ll help you apply ideas and concepts covered in Fundamentals of data and Data audit and cleaning. We’ll follow five steps:

1. Look at the data

2. See the columns

3. Imagine the hypothesis

4. Evaluate the hypothesis

5. Be ready for unknowns


1 – Look at the data

Download this dataset: IPL-transformed.csv

Applying the framework from ritvvij.parrikh.com/fundamentals-of-data/14068/#what, identify three different aspects:

  • Skills to store and access the data: The data is in CSV format and hence it is machine-readable. The file is small enough to be opened on a computer.
  • Domain knowledge to understand the nuance: The data is about the Indian Premier League (IPL), a cricket tournament. As with most two-team games, the CSV has the following data points:
    • Match number: One season has many matches. This tells you which match it was.
    • City: Where the match was played.
    • Toss winner: Which team won the toss and got to choose whether to bat or field first.
    • Winning team: Who won the match.
    • Margin: How large the win was.
  • Skills to use the data: This is what we’ll explore in this blog!

Data Provenance


2 – See the columns

Theory:

  • What is the color of an apple? Normally, it is red. Hence, normal means what is generally expected.
    • In numerical columns, you can find what is normal using Mean and Median.
    • In categorical columns, you can find what is normal using Mode.
  • But then there are green and yellow apples too! You can measure how different a particular numerical data point is using Standard Deviation and Median Absolute Deviation.
  • What if you find a purple apple!? That’s an absolutely special find. You can find special numerical data points like this using Z-score.
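The apple analogy can be made concrete with a few lines of Python. This is a minimal sketch using only the standard library. The margins and cities below are made-up values for illustration, not rows from IPL-transformed.csv, and the cut-off of 2 standard deviations is an arbitrary assumption (3 is also common).

```python
from statistics import mean, median, mode, pstdev

# Made-up win margins for illustration only (not taken from IPL-transformed.csv).
margins = [4, 6, 7, 5, 6, 8, 5, 6, 60]

def median_absolute_deviation(values):
    """Median of the absolute distances from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)

def z_scores(values):
    """How many standard deviations each value sits from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# "Red apples": what is normal in a numerical column.
typical_mean = mean(margins)
typical_median = median(margins)

# "Green and yellow apples": how much values usually differ from normal.
spread_sd = pstdev(margins)
spread_mad = median_absolute_deviation(margins)

# "Purple apples": points more than 2 standard deviations from the mean
# (the threshold of 2 is an assumption, not a rule).
special = [v for v, z in zip(margins, z_scores(margins)) if abs(z) > 2]

# Mode gives what is normal in a categorical column.
cities = ["Mumbai", "Delhi", "Mumbai", "Chennai", "Mumbai"]
most_common_city = mode(cities)
```

Notice that the single extreme value pulls the mean well above the median, while the median and the Median Absolute Deviation barely move. That is why it helps to look at both.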

We’ll be using an open source tool released by Google called ‘Facets Overview.’ It does basic summary analysis and other cleanliness checks for all the columns in your dataset.

Here’s how to use it.

  • The tool we are going to use only works with small datasets because it processes the data in your browser itself. Go to pair-code.github.io/facets/#facets-overview in Chrome or Microsoft Edge.
  • Scroll down to the upload button and upload the CSV file you downloaded.

Here’s what we observe.

Numerical columns

  • We can ignore the ID column because it is a unique identifier and doesn’t hold insights.
  • If we look at the margin column, the histogram shows that most wins have been narrow, with only a few extreme outcomes.

Categorical columns

  • City: Most matches were played in Mumbai.
  • Toss Winner: Most tosses were won by Mumbai.
  • Winning Team: Most matches were won by Mumbai.
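If you want to reproduce these column summaries outside the browser, something like the following would work. It is a rough sketch, not what Facets Overview does internally. The rows and the column names (city, toss_winner, winning_team, margin) are invented for illustration, and the real headers in IPL-transformed.csv may differ.

```python
import csv
import io
from collections import Counter

# Invented sample rows; the real file's headers and values may differ.
SAMPLE_CSV = """id,city,toss_winner,winning_team,margin
1,Mumbai,Mumbai,Mumbai,5
2,Delhi,Mumbai,Delhi,12
3,Mumbai,Chennai,Mumbai,3
4,Chennai,Mumbai,Chennai,48
5,Mumbai,Delhi,Mumbai,7
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

def most_common_value(rows, column):
    """Return (value, count) for the most frequent value in a categorical column."""
    return Counter(row[column] for row in rows).most_common(1)[0]

def histogram(rows, column, bucket_width):
    """Count numeric values per bucket, keyed by each bucket's lower bound."""
    buckets = Counter(
        (int(row[column]) // bucket_width) * bucket_width for row in rows
    )
    return dict(sorted(buckets.items()))

top_city = most_common_value(rows, "city")
top_toss_winner = most_common_value(rows, "toss_winner")
margin_histogram = histogram(rows, "margin", bucket_width=10)
```

To run it on the real file, replace io.StringIO(SAMPLE_CSV) with an opened file handle and adjust the column names to match the actual headers.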

3 – Imagine the hypothesis

Below are some of my hypotheses:

  • Margin of winning should be higher early on because early in any tournament there will be weak teams and strong teams.
  • Certain stadiums have a tendency to generate outsized wins because the type of soil and the weather can affect the game.
  • Teams win more matches in their home stadium.
  • Teams that win the toss tend to win the match.
  • Teams that win the toss in their home stadium tend to win the match.

4 – Evaluate the hypothesis

We’ll be using an open source tool released by Google called ‘Facets Dive.’ It helps you explore relationships between data points across all of the different features of a dataset. Each dot is a row in your data.

Here’s how to use it.

  • The tool we are going to use only works with small datasets because it processes the data in your browser itself. Go to pair-code.github.io/facets/#facets-dive in Chrome or Microsoft Edge.
  • Scroll down to the upload button and upload your data.

Now let’s test some hypotheses:

Hypothesis: Margin of winning should be higher early on.

Answer: Not true.

Hypothesis: Certain stadiums have a tendency to generate outsized wins.

Answer: Mumbai tends to have big wins. Bangalore has fewer but extremely large wins!

Hypothesis: Teams win more matches in their home stadium.

Answer: True.

Hypothesis: Teams that win the toss tend to win the match.

Answer: True.

Hypothesis: Teams that win the toss in their home stadium tend to win the match.

Answer: Not always true.
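The same kind of check can be sketched in code: split the rows into facets, then compute the same measure for each facet and for the whole dataset. The matches and field names below are invented, so the numbers say nothing about the real IPL data. They only show the mechanics of testing "teams that win the toss tend to win the match", overall and per city.

```python
from collections import defaultdict

# Invented matches; field names are assumptions, not the real CSV headers.
matches = [
    {"city": "Mumbai", "toss_winner": "Mumbai", "winning_team": "Mumbai"},
    {"city": "Mumbai", "toss_winner": "Delhi", "winning_team": "Mumbai"},
    {"city": "Delhi", "toss_winner": "Delhi", "winning_team": "Delhi"},
    {"city": "Delhi", "toss_winner": "Mumbai", "winning_team": "Delhi"},
    {"city": "Chennai", "toss_winner": "Chennai", "winning_team": "Delhi"},
]

def facet(rows, key):
    """Split rows into subsets (facets) that share the same value for `key`."""
    facets = defaultdict(list)
    for row in rows:
        facets[row[key]].append(row)
    return dict(facets)

def toss_win_rate(rows):
    """Share of matches in which the toss winner also won the match."""
    wins = sum(1 for row in rows if row["toss_winner"] == row["winning_team"])
    return wins / len(rows)

# The aggregated view, then the same measure inside each facet.
overall_rate = toss_win_rate(matches)
rate_by_city = {
    city: toss_win_rate(rows) for city, rows in facet(matches, "city").items()
}
```

Comparing overall_rate with rate_by_city is the code equivalent of what Facets Dive shows visually: the broader context and the per-facet detail side by side.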

Theory:

  • You can find a linear relationship between two numerical columns using Correlation.
  • You can try to compare an apple and an orange by abstracting both to fruits using Normalization.
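Both ideas fit in a few lines. Below is a minimal sketch of Pearson correlation and min-max normalization (one common way to normalize; others exist). The match numbers and margins are invented so that the arithmetic is easy to check by hand. They do not reflect the result found in the IPL data above.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Linear relationship between two numerical columns, from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def min_max_normalize(values):
    """Rescale values to the 0..1 range so different units become comparable."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Invented data: margins that shrink steadily as the season goes on.
match_numbers = [1, 2, 3, 4, 5]
margins = [50, 40, 30, 20, 10]

r = pearson_correlation(match_numbers, margins)
scaled = min_max_normalize(margins)
```

A value of r close to -1 means margins fall as match number rises, close to +1 means they rise together, and close to 0 means no linear relationship. Note that min_max_normalize assumes the column has at least two distinct values. Otherwise it would divide by zero.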

5 – Be ready for unknowns

Just when you think you know everything, a domain expert (a CEO, an analyst, or a beat reporter) with deeper judgment will tell you something that makes you look at things again!

Here we’ve bucketed win margin (a number) by “won by”. Prima facie, one could conclude that when a match is won by wickets, the win margin is smaller!

However, that’s not true. This is where domain knowledge comes in.

In cricket, if a team wins by wickets, that means that the team batting second chased the target. In such scenarios, the winning margin is not reported in runs (min: 0 to max: 200+) but in wickets (min: 0 to max: 10).
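One way to guard against this trap is to facet the margin by its unit before summarizing it. The sketch below uses invented results and assumed field names (won_by, margin). It contrasts a misleading summary that mixes both units with a per-unit summary.

```python
from collections import defaultdict
from statistics import median

# Invented results; "won_by" and "margin" are assumed field names.
results = [
    {"won_by": "runs", "margin": 24},
    {"won_by": "runs", "margin": 60},
    {"won_by": "runs", "margin": 5},
    {"won_by": "wickets", "margin": 7},
    {"won_by": "wickets", "margin": 4},
    {"won_by": "wickets", "margin": 9},
]

# Facet the margins by the unit they are reported in.
margins_by_unit = defaultdict(list)
for row in results:
    margins_by_unit[row["won_by"]].append(row["margin"])

# Misleading: mixes runs and wickets as if they were the same unit.
naive_median = median(row["margin"] for row in results)

# Safer: one summary per unit, never compared directly with the other.
median_by_unit = {
    unit: median(values) for unit, values in margins_by_unit.items()
}
```

The two medians in median_by_unit live on different scales, so a smaller number for wickets does not mean a narrower win. It only means the margin was counted in a different unit.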