Improving the whiskey distillery data set

I have previously used a data set describing the characteristics of whiskeys to draw radar plots. Here, I present how I cleaned and augmented the original data from the University of Strathclyde, resulting in an improved version of the whiskey data set.

Loading the whiskey data set

The original data set can be loaded from the web in the following way:

library(RCurl)
# load data as character
f <- getURL('https://www.datascienceblog.net/data-sets/whiskies.txt')
# read table from text connection
df <- read.csv(textConnection(f), header = TRUE)

Fixing the post codes

As the output of head(df) shows, the post codes contain stray spaces and tab characters, which we remove with gsub:

head(df)
##   RowID  Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy
## 1     1   Aberfeldy    2         2     2         0       0     2     1
## 2     2    Aberlour    3         3     1         0       0     4     3
## 3     3      AnCnoc    1         3     2         0       0     2     0
## 4     4      Ardbeg    4         1     4         4       0     0     2
## 5     5     Ardmore    2         2     2         0       0     1     1
## 6     6 ArranIsleOf    2         3     1         1       0     1     1
##   Winey Nutty Malty Fruity Floral    Postcode Latitude Longitude
## 1     2     2     2      2      2  \tPH15 2EB   286580    749680
## 2     2     2     3      3      2  \tAB38 9PJ   326340    842570
## 3     0     2     2      3      2   \tAB5 5LI   352960    839320
## 4     0     1     2      1      0  \tPA42 7EB   141560    646220
## 5     1     2     3      1      1  \tAB54 4NH   355350    829140
## 6     1     0     1      1      2    KA27 8HJ   194050    649950
# remove all spaces and tabs (this also drops the space inside each post code)
df$Postcode <- gsub("[ \t]+", "", df$Postcode)
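
The gsub call above removes every space and tab, including the single space that separates the two halves of a UK post code. If that inner space should be kept, a minimal sketch using base R's trimws could look like this. It is applied to two hypothetical sample strings rather than to the data set itself:

```r
# hypothetical sample values mimicking the raw Postcode column
raw <- c("  \tPH15 2EB", "KA27 8HJ")
# strip leading and trailing whitespace only (spaces, tabs, newlines)
trimmed <- trimws(raw)
# for comparison: remove every space and tab, as in the call above
squeezed <- gsub("[ \t]+", "", raw)
print(trimmed)   # "PH15 2EB" "KA27 8HJ"
print(squeezed)  # "PH152EB" "KA278HJ"
```

Which variant is preferable depends on how the post codes are used later, for example whether they are matched against an external source that stores the space.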

Annotating the locations of the distilleries

A blog post by Koki Ando gives a nice overview of how UTM data can be handled. The Latitude and Longitude columns of the data set do not contain degrees but projected coordinates in metres. In the following code snippet, we use the raster and sp packages to create a SpatialPoints object from these coordinates. Then, we assign the British National Grid as the coordinate reference system by specifying +init=epsg:27700 (see epsg.io for other reference systems). Finally, we call spTransform with WGS84 (+init=epsg:4326) to convert the coordinates to the World Geodetic System, which is used by GPS.

# transform projected coordinates (in metres) to longitude/latitude in degrees
# NB: the "Latitude" column holds the x values (eastings), "Longitude" the y values (northings)
geo.df <- df[, c("Latitude", "Longitude")]
colnames(geo.df) <- c("lat", "long") # switch for plotting
library(raster)
# create 'SpatialPoints' object; the first term of the formula is used as x
coordinates(geo.df) <- ~lat + long
# assign the coordinate reference system (CRS) of the input: British National Grid
proj4string(geo.df) <- CRS("+init=epsg:27700")
# transform to new coordinate system (WGS84)
# NB: getting rgdal working on old systems is tough due to libgdal dependency
library(rgdal)
geo.df <- spTransform(geo.df, CRS("+init=epsg:4326"))
map.df <- data.frame("Distillery" = df[, "Distillery"], geo.df)
# NB: "lat" now holds the x values (longitude in degrees), "long" the y values (latitude in degrees)
df <- cbind(df, map.df[, c("lat", "long")])
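
The same conversion can be sketched with the sf package, which may be easier to install on current systems now that rgdal has been retired from CRAN. This is an alternative under stated assumptions, not the code used to build the data set: the sample data frame and the output column names longitude and latitude are made up for illustration, and the two input rows are copied from the head(df) output shown earlier.

```r
library(sf)

# hypothetical sample: the first two rows of the data set
sample.df <- data.frame(
  Distillery = c("Aberfeldy", "Aberlour"),
  Latitude   = c(286580, 326340),  # x values (eastings) in metres, despite the name
  Longitude  = c(749680, 842570)   # y values (northings) in metres
)
# build point geometries in the British National Grid (EPSG:27700); x column first
pts <- st_as_sf(sample.df, coords = c("Latitude", "Longitude"), crs = 27700)
# reproject to WGS84 (EPSG:4326)
pts.wgs84 <- st_transform(pts, 4326)
# matrix with columns X (longitude) and Y (latitude), both in degrees
deg <- st_coordinates(pts.wgs84)
sample.df$longitude <- deg[, "X"]
sample.df$latitude  <- deg[, "Y"]
# sanity check: both points should fall inside a rough bounding box around Scotland
in.scotland <- deg[, "X"] > -8 & deg[, "X"] < 0 & deg[, "Y"] > 54 & deg[, "Y"] < 61
print(all(in.scotland))
```

A bounding-box check like the one in the last lines is a cheap way to notice swapped axes, since swapped coordinates would land far outside Scotland.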

Other annotations

To annotate the regions in which the distilleries are situated, I manually assigned regions based on a list of Scottish distilleries available on Wikipedia. I also fixed some spelling errors in the distillery names.
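
The assignment itself was done by hand, but the step of attaching such labels to the data frame can be sketched with a lookup table. In the following hypothetical example, the Region column name and the three-entry lookup are assumptions for illustration; the improved data set may use different names and labels.

```r
# hypothetical excerpt of the data frame
sample.df <- data.frame(
  Distillery = c("Aberfeldy", "Aberlour", "Ardbeg"),
  stringsAsFactors = FALSE
)
# manually curated lookup: distillery name -> region
region.lookup <- c(
  Aberfeldy = "Highlands",
  Aberlour  = "Speyside",
  Ardbeg    = "Islay"
)
# index the lookup by name; distilleries without an entry would become NA
sample.df$Region <- unname(region.lookup[sample.df$Distillery])
# list distilleries that still lack a region
missing <- sample.df$Distillery[is.na(sample.df$Region)]
print(sample.df)
```

Keeping the lookup as a separate object makes it easy to review the manual assignments and to spot distilleries whose names do not match, for example because of a spelling error.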

The improved whiskey data set

The improved whiskey data set is available here.

Author: Matthias Döring
