Data Science Blog: Understand. Implement. Succeed.

Implementing Polls Using Staticman

In a previous post, I described how to set up your own Staticman instance and use it to run a commenting system. Since Staticman is not limited to bringing comments to static sites, I decided to implement polls with Staticman as well.

Overview

To get polls working, follow these steps:

1. Adjust your Staticman configuration to include a configuration for polls
2. Create two subfolders in your data folder: one for storing the votes and one for setting up the polls
3. Implement the Hugo template logic for the polls in your partials
4. Implement JavaScript/CSS to allow for participating in the poll and viewing the results

Staticman configuration

Configuring Staticman for polls is relatively straightforward.

Dimensionality Reduction for Visualization and Prediction

Dimensionality reduction has two primary use cases: data exploration and machine learning. It is useful for data exploration because reducing the data to a few dimensions (e.g. 2 or 3) makes it possible to visualize the samples. Such a visualization can then be used to obtain insights from the data (e.g. to detect clusters and identify outliers). For machine learning, dimensionality reduction is useful because models often generalize better when fewer features are used during the fitting process.

Improving the whiskey distillery data set

I previously used a data set describing the characteristics of whiskeys to draw radar plots. Here, I present how I cleaned and augmented the original data from the University of Strathclyde, resulting in an improved version of the whiskey data set.

Radar plots

Radar plots visualize several variables using a radial layout. This type of plot is most suitable for visualizing and comparing the properties associated with individual objects. In the following, we will use a radar plot to compare the characteristics of whiskeys from different distilleries.

A data set on whiskey

Some of you may already know that radar plots are well-suited for visualizing whiskey flavors. I first saw this type of visualization when I visited the Talisker distillery, the only whiskey distillery on the Isle of Skye.

Interpreting Generalized Linear Models

Interpreting generalized linear models (GLMs) obtained through glm is similar to interpreting conventional linear models. Here, we will discuss the differences that need to be considered.

Basics of GLMs

GLMs enable the use of linear models in cases where the response variable has an error distribution that is non-normal. Each distribution is associated with a specific canonical link function. A link function \(g\) fulfills \(X \beta = g(\mu)\), where \(\mu\) is the expected value of the response. For example, for a Poisson distribution, the canonical link function is \(g(\mu) = \ln(\mu)\).
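The relation \(X \beta = g(\mu)\) for the Poisson case can be sketched in a few lines. This is only an illustration in Python rather than R's glm, and the coefficients and feature values are made up for the example. It shows that applying the log link to the mean recovers the linear predictor, and that the inverse link maps the linear predictor back to a positive mean.

```python
import math


def linear_predictor(x, beta):
    """Compute eta = x . beta for one observation."""
    return sum(x_j * beta_j for x_j, beta_j in zip(x, beta))


def log_link(mu):
    """Canonical link for the Poisson distribution: g(mu) = ln(mu)."""
    return math.log(mu)


def inverse_log_link(eta):
    """Inverse of the log link: mu = exp(eta), always positive."""
    return math.exp(eta)


# Hypothetical coefficients (intercept, slope); not taken from any fitted model.
beta = [0.5, 0.2]
# One observation: a constant 1.0 for the intercept and a single feature value.
x = [1.0, 3.0]

eta = linear_predictor(x, beta)   # linear predictor X * beta
mu = inverse_log_link(eta)        # expected count on the response scale

print(f"eta = {eta:.3f}, mu = {mu:.3f}, g(mu) = {log_link(mu):.3f}")
```

Because the link is logarithmic, increasing the feature by one unit multiplies the expected count by exp(0.2) instead of adding a constant to it. This is the main difference from interpreting a conventional linear model.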

Finding a Suitable Linear Model for Ozone Prediction

In a previous post, I introduced the airquality data set in order to demonstrate how linear models are interpreted. In this post, I will start with a basic linear model and, from there, try to find a linear model with a better fit.

Data preprocessing

Since the airquality data set contains some missing values, we will remove those before we begin to fit models. We then select 70% of the samples for training and use the remainder for testing.
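The preprocessing described above (drop incomplete samples, then split 70/30) could look roughly like the following. This is not the code from the post, which works in R. It is a minimal Python sketch in which the helper name `split_complete_cases`, the seed, and the toy rows are all assumptions made for illustration. The toy values are not real airquality measurements.

```python
import random


def split_complete_cases(rows, train_fraction=0.7, seed=1):
    """Drop rows with missing values, then split the rest into train/test."""
    # Keep only rows in which every field is present.
    complete = [row for row in rows if all(v is not None for v in row.values())]
    # Shuffle indices with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    indices = list(range(len(complete)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(complete))
    train = [complete[i] for i in indices[:n_train]]
    test = [complete[i] for i in indices[n_train:]]
    return train, test


# Toy rows for illustration only; two of the twelve have a missing Ozone value.
rows = [
    {"Ozone": None if i in (3, 7) else 20.0 + i, "Wind": 10.0 + i}
    for i in range(12)
]

train, test = split_complete_cases(rows)
print(len(train), "training samples,", len(test), "test samples")
```

Fixing the seed keeps the split stable between runs, which makes it easier to compare different models on the same test samples.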

Interpreting Linear Prediction Models

Although linear models are among the simplest machine learning techniques, they remain a powerful tool for prediction, not least because they are easy to interpret. Here, I discuss the most important aspects of interpreting linear models, using ordinary least-squares regression on the airquality data set as an example.

The airquality data set

The airquality data set contains 153 measurements of the following four air quality metrics as obtained in New York:

Getting Your Point Across with Infographics

Nowadays, infographics are everywhere. Fortunately, you do not have to be a professional designer to create them because there are several free platforms that assist you in creating engaging infographics. In this post, I compare three freely available tools for creating static infographics: Venngage, easelly, and Infogram. Each of the tools is reviewed according to three criteria:

- Customizability: number of available templates, graphics, fonts, and so on.
- User experience: how easy is it to design/deploy infographics?

Box Plot Alternatives: Beeswarm and Violin Plots

Box plots are useful because they indicate not only the median but also the spread of the measurements in terms of the 1st and 3rd quartiles. There are, however, plots that provide additional information. Here, we take a closer look at two alternatives to the box plot: the beeswarm plot and the violin plot.

The beeswarm plot

An implementation of the beeswarm plot is available via the beeswarm package.
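To make concrete what a box plot summarizes, the following sketch computes the quartiles that define the box. It is written in Python with the standard library and is not taken from the post. The helper name `box_summary` and the sample values are made up. The `inclusive` method is chosen here because it interpolates linearly between data points.

```python
import statistics


def box_summary(values):
    """Return the 1st quartile, median, 3rd quartile, and interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Made-up measurements for illustration.
measurements = [2, 4, 4, 5, 7, 9, 10, 12]
summary = box_summary(measurements)
print(summary)
```

The box plot shows only these summary values, whereas a beeswarm plot draws every individual measurement and a violin plot shows the estimated density.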

Visualizing Time-Series Data with Line Plots

The line plot is the go-to plot for visualizing time-series data (i.e. measurements taken at several points in time) because it shows trends over time. Here, we’ll use stock market data to show how line plots can be created using native R, the MTS package, and ggplot.

Posts

Implementing Polls Using Staticman

Dimensionality Reduction for Visualization and Prediction

Improving the whiskey distillery data set

Radar plots

Interpreting Generalized Linear Models

Finding a Suitable Linear Model for Ozone Prediction

Interpreting Linear Prediction Models

Getting Your Point Across with Infographics

Box Plot Alternatives: Beeswarm and Violin Plots

Visualizing Time-Series Data with Line Plots