9.5

Scatter diagrams

1. Overview

Scatter diagrams are used to visually explore the relationship (correlation) between two variables. By plotting data points on a graph, you can quickly identify if there's a trend, and whether that trend is positive (both variables increase together), negative (one variable increases as the other decreases), or if there's no relationship at all. This helps in making predictions and understanding how different factors might influence each other.


Key Definitions

  • Bivariate Data: Data that involves two variables (e.g., height and weight).
  • Scatter Diagram: A graph where individual data points are plotted as coordinates $(x, y)$ to show the relationship between variables.
  • Correlation: A measure of the strength and direction of the relationship between two variables.
  • Line of Best Fit: A straight line drawn through the center of the data points to represent the general trend.
  • Outlier: A data point that lies significantly far away from the general pattern of the other points.
  • Interpolation: Estimating a value within the range of the plotted data.
  • Extrapolation: Estimating a value outside the range of the plotted data (this is often unreliable).

Core Content

1. Drawing and Interpreting Scatter Diagrams

To draw a scatter diagram, you plot each pair of values as a point on a grid. The horizontal axis ($x$) is usually the independent variable, and the vertical axis ($y$) is the dependent variable.

How to Draw:

  1. Choose a sensible scale for both axes.
  2. Label both axes clearly with units.
  3. Plot each pair of values accurately as a single point, using a small 'x' or a dot.
  4. Do not join the points together.

Figure: A scatter diagram with 'Hours Spent Studying' on the x-axis and 'Exam Score (%)' on the y-axis. The points trend upward from left to right.

2. Understanding Correlation

Correlation describes how the variables are related. There are three main types:

  • Positive Correlation: As $x$ increases, $y$ increases. The points move from bottom-left to top-right.
    • Example: Height and Shoe Size.
  • Negative Correlation: As $x$ increases, $y$ decreases. The points move from top-left to bottom-right.
    • Example: Age of a car and its Value.
  • Zero (No) Correlation: There is no discernible pattern; points are scattered randomly.
    • Example: Hair length and Math marks.

Strength of Correlation:

  • Strong: Points are very close to forming a straight line.
  • Weak: Points follow a general direction but are spread out.

Figure: Three side-by-side plots. 1. Strong positive (points close to a rising line). 2. Weak negative (loose cluster sloping downward). 3. Zero correlation (randomly scattered points).
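
Direction and strength can also be summarised with a single number. The sketch below computes the product-moment correlation coefficient $r$, which is not required at this level and is shown only to make the idea concrete. The data set and the 0.8 and 0.3 cut-offs used to label a correlation "strong" or "weak" are illustrative assumptions, not official boundaries.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the product-moment correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, strong=0.8, weak=0.3):
    """Turn r into words. The default cut-offs are assumptions for illustration."""
    if abs(r) < weak:
        return "no clear correlation"
    strength = "strong" if abs(r) >= strong else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Made-up data: hours spent studying and exam score (%).
hours = [1, 2, 3, 4, 5, 6]
scores = [42, 50, 55, 63, 70, 78]
print(describe(pearson_r(hours, scores)))  # strong positive correlation
```

A value of $r$ near $1$ or $-1$ corresponds to points lying close to a straight line, and a value near $0$ corresponds to a random scatter.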

3. The Line of Best Fit

A line of best fit is a straight line that passes through the "middle" of the data points.

Rules for drawing "by eye":

  • It must follow the trend of the data (positive or negative).
  • Try to have a roughly equal number of points above and below the line.
  • It does not have to go through the origin $(0,0)$ unless the data suggests it.
  • The line should be long enough to cover the full range of points.

Worked Example 1 — Estimating Values from a Scatter Diagram

The data below show the temperature in $^\circ\text{C}$ ($x$) and the number of ice cream sales ($y$).

  • Data points: $(15, 100), (20, 250), (25, 400), (30, 550)$.
  • Task: Estimate sales when the temperature is $22^\circ\text{C}$.

Step-by-Step Working:

  1. Plot the points: Mark the coordinates on the grid.
  2. Draw the line: Use a ruler to draw a straight line through the center of the points.
  3. Find the value:
    • Locate $22^\circ\text{C}$ on the x-axis.
    • Draw a vertical dashed line up to meet your line of best fit.
    • Draw a horizontal dashed line from that point to the y-axis.
  4. Read the result: The y-axis shows approximately $310$ sales.
    • Answer: Estimated sales = $310$.
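
The graph reading can be checked with a short calculation. In this data set the four points happen to lie exactly on one straight line, so the sketch below simply takes the line through the first and last points. The function name `line_through` is a hypothetical helper, and with real (scattered) data the points would not all sit on the line like this.

```python
def line_through(p, q):
    """Return (gradient, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    gradient = (y2 - y1) / (x2 - x1)
    intercept = y1 - gradient * x1
    return gradient, intercept

points = [(15, 100), (20, 250), (25, 400), (30, 550)]
m, c = line_through(points[0], points[-1])

# Confirm that every data point lies on the line y = m*x + c.
all_on_line = all(m * x + c == y for x, y in points)

estimate = m * 22 + c
print(m, c, all_on_line, estimate)  # 30.0 -350.0 True 310.0
```

The gradient of $30$ means each extra degree adds about $30$ sales, which matches the reading of $310$ at $22^\circ\text{C}$.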

Worked Example 2 — Describing Correlation and Estimating

A study recorded the number of hours students spent playing video games per week ($x$) and their average test score in mathematics ($y$). The data is as follows: $(5, 75), (10, 60), (15, 50), (20, 40), (25, 30)$.

a) Describe the correlation between the number of hours spent playing video games and the average test score.

b) Draw a scatter diagram of the data.

c) Draw a line of best fit on your scatter diagram.

d) Use your line of best fit to estimate the average test score of a student who spends 12 hours per week playing video games.

Step-by-Step Working:

a) Describe the correlation: As the number of hours spent playing video games increases, the average test score decreases. Therefore, there is a negative correlation.

b) Draw the scatter diagram:

  1. Draw and label the x-axis (Hours spent playing video games) and the y-axis (Average test score). Choose appropriate scales for each axis.
  2. Plot each data point $(x, y)$ on the graph.

c) Draw the line of best fit:

  1. Use a ruler to draw a straight line that best represents the trend of the data. Aim to have approximately the same number of points above and below the line. The line does not necessarily need to pass through the origin.

d) Estimate the average test score:

  1. Locate 12 hours on the x-axis.
  2. Draw a vertical line from 12 hours up to the line of best fit.
  3. Draw a horizontal line from the point where the vertical line intersects the line of best fit to the y-axis.
  4. Read the value on the y-axis. This is the estimated average test score.

Answer: Estimated average test score ≈ $57$. Because the line is drawn by eye, readings between about $55$ and $59$ are reasonable.
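
As a cross-check on the graph reading, the line of best fit for this data can be calculated rather than drawn. The sketch below uses least squares regression, which is a swapped-in technique and not the by-eye method the question asks for. The function name `least_squares` is a hypothetical helper. A line drawn by eye will usually give a slightly different answer.

```python
def least_squares(points):
    """Return (gradient, intercept) of the least squares regression line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    gradient = sxy / sxx
    # The regression line passes through the mean point (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

points = [(5, 75), (10, 60), (15, 50), (20, 40), (25, 30)]
m, c = least_squares(points)
estimate = m * 12 + c
print(round(m, 2), round(c, 2), round(estimate, 1))  # -2.2 84.0 57.6
```

The negative gradient confirms the negative correlation described in part a): each extra hour of gaming per week goes with a drop of about $2.2$ marks.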


Extended Content (Extended Only)

The core concepts of scatter diagrams are the same for Core and Extended students, but Extended students are expected to apply them in more complex problem-solving scenarios. This often involves interpreting scatter diagrams in real-world contexts and making more nuanced judgments about the strength and reliability of correlations.

Drawing the line of best fit "by eye" is sufficient for Core. Extended students should also understand that statistical methods such as least squares regression exist to calculate the line of best fit mathematically. You won't be required to perform these calculations by hand, but understanding the idea behind them helps you appreciate the limitations of a line drawn by eye. For example, different people might draw slightly different lines, leading to slightly different estimates. The closer the points are to a perfect straight line, the less the lines drawn by different people will vary.
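
To make the idea behind least squares concrete, the sketch below compares two candidate lines for the video game data from Worked Example 2. It adds up the squared vertical distances from each point to the line, and least squares regression picks the line that makes this total as small as possible. Line A is a hypothetical line someone might draw by eye, and line B is the least squares line for this data. The function name is an assumption for illustration.

```python
def sum_squared_residuals(points, gradient, intercept):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in points)

points = [(5, 75), (10, 60), (15, 50), (20, 40), (25, 30)]

# Two candidate lines, each stored as (gradient, intercept).
line_a = (-2.0, 80.0)   # hypothetical line drawn by eye
line_b = (-2.2, 84.0)   # least squares line for this data

for name, (m, c) in [("A", line_a), ("B", line_b)]:
    total = sum_squared_residuals(points, m, c)
    estimate_at_12 = m * 12 + c
    print(name, round(total, 2), round(estimate_at_12, 1))
# A 25.0 56.0
# B 10.0 57.6
```

Line A passes exactly through four of the five points yet has the larger total, because it misses the first point by a wide margin. The two lines also give slightly different estimates at 12 hours, which is the kind of variation to expect between lines drawn by different people.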


Key Equations

There are no specific formulas on the IGCSE formula sheet for scatter diagrams. However, understanding the concept of the mean point can be helpful.

Mean Point $(\bar{x}, \bar{y})$:

  • $\bar{x} = \frac{\sum x}{n}$ (Sum of all $x$ values divided by the number of points)
  • $\bar{y} = \frac{\sum y}{n}$ (Sum of all $y$ values divided by the number of points)

Note: The least squares regression line always passes through the mean point $(\bar{x}, \bar{y})$, so a line of best fit drawn by eye should pass through it as well.
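
As a small worked instance, the sketch below finds the mean point for the temperature and ice cream data from Worked Example 1, then checks that it lies on the line $y = 30x - 350$, which passes through all four of those points. The function name `mean_point` is a hypothetical helper.

```python
def mean_point(points):
    """Return (x_bar, y_bar) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return x_bar, y_bar

points = [(15, 100), (20, 250), (25, 400), (30, 550)]
x_bar, y_bar = mean_point(points)
print(x_bar, y_bar)               # 22.5 325.0

# The mean point satisfies the equation of the line through the data.
print(30 * x_bar - 350 == y_bar)  # True
```

Plotting $(22.5, 325)$ first and pivoting the ruler about that point is a practical way to place a by-eye line.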


Common Mistakes to Avoid

  • Connecting the dots: Students often join the points like a line graph. Scatter diagrams show a relationship, not a chronological sequence. ✓ Plot individual points: Each point represents a pair of data values; do not connect them.
  • Forcing the line through $(0,0)$: Only pass the line through the origin if it fits the trend of the data. ✓ Consider the data: The line of best fit should reflect the overall trend, even if it doesn't start at the origin.
  • Drawing a line without a ruler: Free-hand lines will lose marks. Use a clear plastic ruler so you can see the points underneath. ✓ Use a ruler: Always use a ruler to draw a straight line of best fit.
  • Ignoring outliers: While the line shouldn't be "pulled" drastically toward an outlier, you must still plot it if it is in the data set. ✓ Plot all data points: Include all data points on the scatter diagram, even if they appear to be outliers.
  • Reversing the axes: Confusing the independent and dependent variables when plotting the data. ✓ Label axes correctly: Ensure the independent variable is on the x-axis and the dependent variable is on the y-axis.

Exam Tips

  • Command Words:
    • "Describe the relationship": You should state the type of correlation (e.g., "There is a strong positive correlation").
    • "Estimate the value": Use your line of best fit and show your method by drawing dashed lines on the graph.
  • Accuracy: Examiners allow a small tolerance in the position of the line of best fit, but it must be a single, thin, straight line.
  • Real-world Context: Expect questions on topics like engine size vs. fuel consumption or rainfall vs. umbrella sales.
  • Calculator Tip: You won't usually need a calculator to draw the graph, but you will need one to calculate the mean point $(\bar{x}, \bar{y})$ if the question asks for it.
  • Reliability: If an exam question asks "How reliable is this estimate?", check if the point is within the data range. Estimates outside the range (extrapolation) are less reliable.
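
The reliability check in the last tip comes down to comparing the $x$ value with the smallest and largest $x$ values in the data. A minimal sketch, using the temperatures from Worked Example 1 and a hypothetical helper named `estimate_type`:

```python
def estimate_type(x_value, x_data):
    """Classify an estimate by whether x_value lies inside the range of the data."""
    if min(x_data) <= x_value <= max(x_data):
        return "interpolation"
    return "extrapolation"

temperatures = [15, 20, 25, 30]
print(estimate_type(22, temperatures))  # interpolation
print(estimate_type(40, temperatures))  # extrapolation
```

An estimate at $22^\circ\text{C}$ sits inside the data range and is reasonably reliable. An estimate at $40^\circ\text{C}$ relies on the trend continuing beyond the data, which may not happen.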

