Anscombe's quartet, and why summary statistics don't tell the whole story

Analytics

Let’s say you’re looking at a spreadsheet of your customers. You have data about how many times they’ve logged in, how much revenue you’ve earned from them, and so on. You can immediately calculate several compelling summary statistics: what’s the average number of logins per customer? What’s the average revenue? What’s the correlation between number of logins and revenue?

Summary statistics allow you to describe a vast, complex dataset using just a few key numbers. They give you something easy to optimize against and use as a barometer for your business.

But there’s a danger in relying only on summary statistics and ignoring the overall distribution. Read on to learn how summary statistics can be misleading—and why calculating summary statistics should only be one piece of your data analysis pipeline.

Get deeper insights with Contentsquare

Visualize your data and contextualize it with qualitative user insights from Contentsquare.

What is Anscombe’s quartet?

Perhaps the most elegant demonstration of the dangers of summary statistics is Anscombe’s quartet. It’s a group of 4 datasets that look nearly identical under typical summary statistics, yet tell 4 different stories when graphed.

Each dataset consists of eleven (x, y) pairs as follows:

Dataset I       Dataset II      Dataset III     Dataset IV
x      y        x      y        x      y        x      y
10.0   8.04     10.0   9.14     10.0   7.46     8.0    6.58
8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
13.0   7.58     13.0   8.74     13.0   12.74    8.0    7.71
9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
11.0   8.33     11.0   9.26     11.0   7.81     8.0    8.47
14.0   9.96     14.0   8.10     14.0   8.84     8.0    7.04
6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
4.0    4.26     4.0    3.10     4.0    5.39     19.0   12.50
12.0   10.84    12.0   9.13     12.0   8.15     8.0    5.56
7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89

All the summary statistics you’d think to compute are close to identical:

  • The average x value is 9 for each dataset

  • The average y value is 7.50 for each dataset

  • The sample variance of x is 11 and the sample variance of y is approximately 4.12 for each dataset

  • The correlation between x and y is approximately 0.816 for each dataset

  • A linear regression (line of best fit) for each dataset comes out to approximately y = 0.5x + 3
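To check these figures yourself, here is a minimal sketch in Python using only the standard library. The names `QUARTET`, `summarize`, and `SUMMARIES` are illustrative choices, not part of any library, and the values are transcribed from the table above.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from the table above.
# Datasets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "var_y": variance(ys),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four rows should be very hard to tell apart, which is exactly the problem.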

So far these 4 datasets appear to be pretty similar. But when you plot them on an x/y coordinate plane, you get the following results:

[visual] Anscombe's Quartet graph

Now the real relationships in the datasets start to emerge. Dataset I consists of a set of points that appear to follow a rough linear relationship with some variance.

Dataset II fits a neat curve but doesn’t follow a linear relationship (maybe it’s quadratic?).

Dataset III looks like a tight linear relationship between x and y, except for one large outlier. In Dataset IV, x stays constant at 8 for every point except a single outlier at x = 19.

Neither computing summary statistics nor staring at the raw numbers would have told you any of these stories. Instead, it’s important to visualize the data to get a clear picture of what’s going on.
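In practice you would probably reach for a plotting library here. As a dependency-free sketch, the snippet below draws each dataset as a rough character-grid scatter plot. The helper `text_scatter`, its axis ranges, and its grid size are assumptions chosen for this illustration.

```python
# Anscombe's quartet, transcribed from the table above.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def text_scatter(xs, ys, x_range=(2, 20), y_range=(2, 14), width=37, height=13):
    """Return the points drawn as '*' on a character grid, with y increasing upward."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate to the nearest grid cell.
        col = round((x - x_min) * (width - 1) / (x_max - x_min))
        row = round((y - y_min) * (height - 1) / (y_max - y_min))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top line
    return "\n".join("|" + "".join(cells).rstrip() for cells in grid)


PLOTS = {name: text_scatter(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, plot in PLOTS.items():
    print(f"Dataset {name}")
    print(plot)
    print("+" + "-" * 37)
```

Even at this coarse resolution, the curve in Dataset II and the lone outliers in Datasets III and IV should stand out in a way the summary statistics never suggested.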

How summary statistics can mislead you: a real-world example

Let’s look at a real dataset that shows exactly how summary statistics can be dangerous.

A great example is the distribution of starting salaries for new law graduates. The National Association for Law Placement (NALP) reports that in 2012, new lawyers earned an average starting salary of $80,798. However, examining the salary distribution shows what law salaries really look like:

[visual] Distribution of starting salaries for new law graduates

It turns out, law graduates usually fall into 1 of 2 groups. The majority of new lawyers make somewhere between $35,000 and $75,000 per year, and a sizable minority earns $160,000 per year. 

This is a bimodal distribution: there are 2 peaks that arise from 2 distinct distributions happening within the same dataset. 

The $80,798 figure reported as the average falls into the trough between the 2 peaks, and few lawyers have salaries near that number. A much more accurate statement would be that most law graduates make around $50,000 on average, and those who go to one of the top law schools make $160,000 on average.
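The same effect is easy to reproduce with made-up numbers. The sketch below uses synthetic salaries invented purely for illustration (they are not the NALP data): a larger group spread between $40,000 and $70,000 and a smaller group at exactly $160,000. The bucket size and variable names are likewise assumptions.

```python
from collections import Counter
from statistics import mean, median

# Synthetic salaries for illustration only; NOT the NALP data.
# 70 graduates spread across $40k-$70k, 30 graduates at exactly $160k.
lower_group = [40_000 + 10_000 * (i % 4) for i in range(70)]
upper_group = [160_000] * 30
salaries = lower_group + upper_group

avg = mean(salaries)

# Group salaries into $20k-wide buckets to get a crude histogram.
BUCKET = 20_000
histogram = Counter((s // BUCKET) * BUCKET for s in salaries)
avg_bucket = (int(avg) // BUCKET) * BUCKET

print(f"mean = {avg:,.0f}, median = {median(salaries):,.0f}")
for start in range(0, 180_000, BUCKET):
    marker = "  <- mean falls here" if start == avg_bucket else ""
    print(f"{start:>7,} {'#' * histogram[start]}{marker}")
```

In this toy dataset the mean lands in a bucket that contains no salaries at all, so the "average" graduate does not exist.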

There’s also something else happening here that you wouldn’t have observed if you hadn’t plotted the data. There’s a giant spike at exactly $160,000 in starting salary, rather than a peak with some variance. Why is $160,000 such a popular number for law salaries? 

As it turns out, this data isn’t based on actual legal salaries, but on what law schools report to the NALP as their students’ median starting salaries. There’s a lot of skepticism about the $160,000 figure, and third-party data shows that the distribution might not be so skewed.

Visualizing the data helped in 2 ways. It revealed a better picture of what realistic starting law salaries look like, and also prompted a follow-up question that exposed a potential flaw in the data.

When should you use summary statistics?

This isn’t to say that summary statistics are useless—they can just be misleading on their own. It’s important to use these as just one tool in a larger data analysis process.

Visualizing the data allows you to revisit summary statistics and recontextualize them as needed. 

For example, Dataset II from Anscombe’s quartet demonstrates a strong relationship between x and y; it just doesn’t appear to be linear. In this case, a linear regression was the wrong tool to use, and you can try other kinds of regression instead.
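As one possible next step, the sketch below fits both a straight line and a quadratic to Dataset II and compares how much of the variance in y each one explains. The quadratic is only a candidate suggested by the shape of the plot, and the helpers `solve`, `polyfit`, and `r_squared` are hand-rolled for this example rather than taken from a library.

```python
def solve(a, b):
    """Solve the linear system a * coeffs = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    powers = range(degree + 1)
    a = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    return solve(a, b)


def r_squared(xs, ys, coeffs):
    """Fraction of the variance in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Dataset II, transcribed from the table above.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

linear = polyfit(xs, ys, 1)
quadratic = polyfit(xs, ys, 2)
r2_linear = r_squared(xs, ys, linear)
r2_quadratic = r_squared(xs, ys, quadratic)

print(f"linear:    R^2 = {r2_linear:.4f}, coefficients = {linear}")
print(f"quadratic: R^2 = {r2_quadratic:.4f}, coefficients = {quadratic}")
```

The linear fit explains roughly two thirds of the variance, while the quadratic fit should explain nearly all of it, which supports what the plot already hinted at.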

By iterating between visualizing and modeling in this way, you’ll eventually arrive at a model that does a great job of describing your data and has a high degree of predictive power for future observations.

Ravi Parikh
Co-founder of Heap

Ravi Parikh co-founded Heap in 2013, and is a co-founder of Airplane, a developer platform for internal tools.