Visualizing Individual Data Points Using Scatter Plots

Artificial data for the scatter plot

A typical application of scatter plots is for visualizing the correlation between two variables. For this purpose, we’ll create a function that generates correlated measurements.

set.seed(1)
generator <- function(n = 1000) {
    x <- rnorm(n)
    # add some noise to x
    noise <- rnorm(n, 0, sd = 0.5)
    y <- x + noise
    df <- data.frame(x = x, y = y)
    return(df)
}
df <- generator(1000)

Scatter plots in native R

In native R, you can create a simple scatter plot in this way:

plot(df$x, df$y)
# show diagonal line indicating perfect correlation
dg <- par("usr") 
segments(dg[1],dg[3],dg[2],dg[4], col='red') 

As we can see, the two variables are strongly correlated (correlation of 0.89).

If you want to visualize the correlation of several variables simultaneously, you can use the pairs function in order to create a scatter plot matrix. This matrix represents the correlation for all pairs of variables in a data frame that are selected by the formula argument.

data(mtcars)
pairs(~wt+hp+qsec, data = mtcars)

In this case, you would find that there is some correlation between the features wt and hp.

Scatter plots with ggplot

If you prefer using ggplot, then you can create a scatter plot using geom_point. To give the plot more of a nice touch, you can also include the correlation.

library(ggplot2)
cor.val <- round(cor(df$x, df$y), 2)
cor.label <- paste0("Correlation: ", cor.val)
ggplot(df, aes(x = x, y = y)) + geom_point() + 
    geom_abline(color = "red", slope = 1) + 
    annotate(x = -2.25, y = 3.5,  geom = "text", 
            label = cor.label, size = 5)

To show the distribution of the data in more detail, you can also draw a 2D density.

ggplot(df, aes(x = x, y = y)) + geom_point() + 
    geom_density_2d()

The ellipses of the density indicate where the values are concentrated and allow you to whether a sufficient range of values has been sampled. In this case, the values span the possible range of values well and there are few outliers, so we can be confident about the identified correlation.