Posts on datascienceblog.net: R for Data Science
https://www.datascienceblog.net/post/
Recent content in Posts on datascienceblog.net: R for Data Science

An Introduction to Forecasting
https://www.datascienceblog.net/post/machine-learning/forecasting-an-introduction/
Tue, 18 Dec 2018 00:00:00 +0000
Forecasting is concerned with making predictions about future observations by relying on past measurements. In this article, I will give an introduction to how ARMA, ARIMA (Box-Jenkins), SARIMA, and ARIMAX models can be used for forecasting given time-series data.
Preliminaries
Before we can talk about models for time-series data, we have to introduce two concepts.
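As a preview of the first concept, the backshift operation amounts to shifting a series back by one or more steps. A minimal base R sketch (the function name `backshift` is my own, not from the post):

```r
# Apply the backshift (lag) operator k times: B^k y_t = y_{t-k}
backshift <- function(y, k = 1) {
  c(rep(NA, k), head(y, -k))
}

y <- c(3, 5, 7, 9)
backshift(y)     # NA  3  5  7
backshift(y, 2)  # NA NA  3  5
```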
The backshift operator
Given the time series \(y = \{y_1, y_2, \ldots \}\), the backshift operator (also called lag operator) is defined as \(B y_t = y_{t-1}\).

Prediction vs Forecasting
https://www.datascienceblog.net/post/machine-learning/forecasting_vs_prediction/
Sun, 09 Dec 2018 00:00:00 +0000
In supervised learning, we are often concerned with prediction. However, there is also the concept of forecasting. Here, I will discuss the differences between the two concepts so that we can answer the question of why weather forecasting is not called weather prediction.
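As a sketch of the prediction setting (using the `cars` data set that ships with R, not data from the post):

```r
# Fit an estimator \hat{f} on training data
fit <- lm(dist ~ speed, data = cars)

# Use it to predict the outcomes for new, unseen samples x
new_x <- data.frame(speed = c(10, 20))
predict(fit, newdata = new_x)
```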
Prediction and forecasting
Prediction is concerned with estimating the outcomes for unseen data. For this purpose, you fit a model to a training data set, which results in an estimator \(\hat{f}(x)\) that can make predictions for new samples \(x\).

Interpreting ROC Curves, Precision-Recall Curves, and AUCs
https://www.datascienceblog.net/post/machine-learning/interpreting-roc-curves-auc/
Sat, 08 Dec 2018 00:00:00 +0000
Receiver operating characteristic (ROC) curves are probably the most commonly used measure for evaluating the predictive performance of scoring classifiers.
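To make this concrete, here is a small base R sketch (toy scores and labels of my own invention) computing one ROC point, i.e. the true positive rate against the false positive rate, per cutoff:

```r
# Toy classifier scores and true class labels (+1 / -1)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, -1, 1, -1, 1, -1, -1)

# One ROC point per cutoff: TPR (sensitivity) vs. FPR (1 - specificity)
roc_point <- function(cutoff) {
  pred <- ifelse(scores >= cutoff, 1, -1)
  c(TPR = sum(pred == 1 & labels == 1) / sum(labels == 1),
    FPR = sum(pred == 1 & labels == -1) / sum(labels == -1))
}

t(sapply(c(0.25, 0.5, 0.75), roc_point))
```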
The confusion matrix of a classifier that predicts a positive class (+1) and a negative class (-1) has the following structure:
                 Reference +1   Reference -1
Prediction +1    TP             FP
Prediction -1    FN             TN

Here, TP indicates the number of true positives (model predicts positive class correctly), FP indicates the number of false positives (model incorrectly predicts positive class), FN indicates the number of false negatives (model incorrectly predicts negative class), and TN indicates the number of true negatives (model correctly predicts negative class).

Inference vs Prediction
https://www.datascienceblog.net/post/commentary/inference-vs-prediction/
Fri, 07 Dec 2018 00:00:00 +0000
The terms inference and prediction both describe tasks where we learn from data in a supervised manner in order to find a model that describes the relationship between the independent variables and the outcome. Inference and prediction, however, diverge when it comes to the use of the resulting model:
Inference: Use the model to learn about the data generation process.
Prediction: Use the model to predict the outcomes for new data points.

Performance Measures for Multi-Class Problems
https://www.datascienceblog.net/post/machine-learning/performance-measures-multi-class-problems/
Tue, 04 Dec 2018 00:00:00 +0000
For classification problems, classifier performance is typically defined according to the confusion matrix associated with the classifier. Based on the entries of the matrix, it is possible to compute sensitivity (recall), specificity, and precision. For a single cutoff, these quantities lead to balanced accuracy (sensitivity and specificity) or to the F1-score (recall and precision). To evaluate a scoring classifier at multiple cutoffs, these quantities can be used to determine the area under the ROC curve (AUC) or the area under the precision-recall curve (AUCPR).

Behind the Scenes: The First Month of datascienceblog.net
https://www.datascienceblog.net/post/other/blog-summary-month-01/
Sun, 02 Dec 2018 00:00:00 +0000
By now, datascienceblog.net has existed for one month, with the first post dating back to the 16th of October, 2018. I would like to use this opportunity to reflect on how the blog has developed since its inception.
Content
I am quite happy with the amount of content I could produce over the last couple of weeks. Especially when starting a blog, high-quality content is the most important criterion for developing a user base.

Linear, Quadratic, and Regularized Discriminant Analysis
https://www.datascienceblog.net/post/machine-learning/linear-discriminant-analysis/
Fri, 30 Nov 2018 00:00:00 +0000
Discriminant analysis encompasses methods that can be used for both classification and dimensionality reduction. Linear discriminant analysis (LDA) is particularly popular because it is both a classifier and a dimensionality reduction technique. Quadratic discriminant analysis (QDA) is a variant of LDA that allows for non-linear separation of data. Finally, regularized discriminant analysis (RDA) is a compromise between LDA and QDA.
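A minimal illustration with the MASS package (using the iris data, which is not necessarily the data used in the post):

```r
library(MASS)  # provides lda() and qda()

# LDA assumes a covariance matrix shared across classes...
fit_lda <- lda(Species ~ ., data = iris)
# ...while QDA estimates one covariance matrix per class
fit_qda <- qda(Species ~ ., data = iris)

# In-sample accuracy of the LDA classifier
mean(predict(fit_lda)$class == iris$Species)
```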
This post focuses mostly on LDA and explores its use as a classification and visualization technique, both in theory and in practice.

An Introduction to Probabilistic Programming with Stan in R
https://www.datascienceblog.net/post/machine-learning/probabilistic_programming/
Wed, 28 Nov 2018 00:00:00 +0000
Probabilistic programming enables us to implement statistical models without having to worry about the technical details. It is particularly useful for Bayesian models that are based on MCMC sampling. In this article, I investigate how Stan can be used through its implementation in R, RStan. This post is largely based on the GitHub documentation of RStan and its vignette.
Introduction to Stan
Stan is a C++ library for Bayesian inference. It is based on the No-U-Turn sampler (NUTS), which is used for estimating the posterior distribution according to a user-specified model and data.

Performance Measures for Feature Selection
https://www.datascienceblog.net/post/machine-learning/performance-measures-feature-selection/
Sun, 25 Nov 2018 00:00:00 +0000
In a recent post, I have discussed performance measures for model selection. This time, I write about a related topic: performance measures that are suitable for selecting models when performing feature selection. Since feature selection is concerned with reducing the number of independent variables, suitable performance measures evaluate the trade-off between the number of features, \(p\), and the fit of the model.
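For instance, adjusted \(R^2\) and AIC both penalize the number of parameters; a base R sketch with the built-in mtcars data (whether these are the exact measures the post recommends is not implied here):

```r
# Two nested models with different numbers of features p
small <- lm(mpg ~ wt + hp, data = mtcars)
full  <- lm(mpg ~ wt + hp + drat + qsec + gear, data = mtcars)

# Adjusted R^2 and AIC trade goodness of fit off against p
c(small = summary(small)$adj.r.squared, full = summary(full)$adj.r.squared)
c(small = AIC(small), full = AIC(full))
```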
Performance measures for regression
Mean squared error (MSE) and \(R^2\) are unsuited for comparing models during feature selection.

The Case Against Precision as a Model Selection Criterion
https://www.datascienceblog.net/post/machine-learning/specificity-vs-precision/
Wed, 21 Nov 2018 00:00:00 +0000
Recently, I have introduced sensitivity and specificity as performance measures for model selection. Besides these measures, there is also the notion of recall and precision. Precision and recall originate from information retrieval but are also used in machine learning settings. However, the use of precision and recall can be problematic in some situations. In this post, I discuss the shortcomings of recall and precision and show why sensitivity and specificity are generally more useful.

Performance Measures for Model Selection
https://www.datascienceblog.net/post/machine-learning/performance-measures-model-selection/
Mon, 19 Nov 2018 00:00:00 +0000
There are several performance measures for describing the quality of a machine learning model. However, the question is: which is the right measure for which problem? Here, I discuss the most important performance measures for selecting regression and classification models. Note that the performance measures introduced here should not be used for feature selection as they do not take model complexity into account.
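As a reference, two of these regression measures, RMSE and \(R^2\), can be computed by hand in base R (shown here for a linear model on the built-in cars data):

```r
fit <- lm(dist ~ speed, data = cars)
pred <- predict(fit)
res <- cars$dist - pred

# Root mean squared error and coefficient of determination
rmse <- sqrt(mean(res^2))
r2 <- 1 - sum(res^2) / sum((cars$dist - mean(cars$dist))^2)
```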
Performance measures for regression
For models that are based on the same set of features, RMSE and \(R^2\) are typically used for model selection.

Statistical Nomenclature for Variables
https://www.datascienceblog.net/post/basic-statistics/variable_nomenclature/
Mon, 19 Nov 2018 00:00:00 +0000
Variables can be differentiated by two characteristics. The first characteristic is the scale of the variable (i.e. the values that the variable can assume). The second is the role that the variable fulfills in a statistical model.
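In R, the scale distinction maps onto numeric vectors and factors (a trivial sketch with made-up values):

```r
x_quant <- c(1.2, 3.4, 5.6)        # quantitative: differences are meaningful
x_cat <- factor(c("a", "b", "a"))  # categorical (nominal): discrete groups

is.numeric(x_quant)  # TRUE
levels(x_cat)        # "a" "b"
```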
Measurement scales of variables
Variables can be on the following scales:
Quantitative variables: Variables indicating numeric values for which pairwise differences are meaningful.
Categorical variables: Variables representing a discrete set of groups. Categorical variables are also called nominal variables.

Implementing Polls Using Staticman
https://www.datascienceblog.net/post/other/staticman_polls/
Fri, 16 Nov 2018 00:00:00 +0000
In a previous post, I have described how to set up your own Staticman instance and use it to run a commenting system. Since Staticman is not limited to bringing comments to static sites, I decided to implement polls with Staticman as well.
Overview
In order to get polls working, the following steps need to be followed:
Adjust your Staticman configuration to include a configuration for polls
Create two subfolders in your data folder: one for storing the votes and one for setting up the polls
Implement the Hugo template logic for the polls in your partials
Implement JavaScript/CSS to allow for participating in the poll and viewing the results

Staticman configuration
Configuring Staticman for polls is relatively straightforward.

Dimensionality Reduction for Visualization and Prediction
https://www.datascienceblog.net/post/machine-learning/dimensionality-reduction/
Wed, 14 Nov 2018 00:00:00 +0000
Dimensionality reduction has two primary use cases: data exploration and machine learning. It is useful for data exploration because dimensionality reduction to few dimensions (e.g. 2 or 3 dimensions) allows for visualizing the samples. Such a visualization can then be used to obtain insights from the data (e.g. detect clusters and identify outliers). For machine learning, dimensionality reduction is useful because oftentimes models generalize better when fewer features are used during the fitting process.

Improving the whiskey distillery data set
https://www.datascienceblog.net/post/other/whiskey-data-annotation/
Tue, 13 Nov 2018 18:00:00 +0000
I have previously used a data set describing the characteristics of whiskeys to draw radar plots. Here, I present how I cleaned and augmented the original data from the University of Strathclyde, resulting in an improved version of the whiskey data set.
Loading the whiskey data set
The original data set can be loaded from the web in the following way:
library(RCurl)
# load data as character
f <- getURL('https://www.datascienceblog.net/data-sets/whiskies.txt')
# read table from text connection (the call is truncated in the feed;
# read.csv is one plausible completion)
df <- read.csv(textConnection(f))

Radar plots
https://www.datascienceblog.net/post/data-visualization/radar-plot/
Tue, 13 Nov 2018 00:00:00 +0000
Radar plots visualize several variables using a radial layout. This plot is most suitable for visualizing and comparing the properties associated with individual objects. In the following, we will use a radar plot for comparing the characteristics of whiskeys from different distilleries.
A data set on whiskey
Some of you may already know that radar plots are well-suited for visualizing whiskey flavors. I first saw this type of visualization when I visited the Talisker distillery, the only whiskey distillery on the Isle of Skye.

Interpreting Generalized Linear Models
https://www.datascienceblog.net/post/machine-learning/interpreting_generalized_linear_models/
Fri, 09 Nov 2018 20:00:00 +0000
Interpreting generalized linear models (GLMs) obtained through glm is similar to interpreting conventional linear models. Here, we will discuss the differences that need to be considered.
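For example, a Poisson GLM with its canonical log link (toy count data of my own invention):

```r
# Toy count data
x <- 1:6
counts <- c(2, 3, 6, 7, 12, 19)

# family = poisson uses the canonical log link by default
fit <- glm(counts ~ x, family = poisson(link = "log"))

# On the response scale, coefficients act multiplicatively
exp(coef(fit))
```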
Basics of GLMs
GLMs enable the use of linear models in cases where the response variable has an error distribution that is non-normal. Each distribution is associated with a specific canonical link function. A link function \(g\) fulfills \(X \beta = g(\mu)\). For example, for a Poisson distribution, the canonical link function is \(g(\mu) = \text{ln}(\mu)\).

Finding a Suitable Linear Model for Ozone Prediction
https://www.datascienceblog.net/post/machine-learning/improving_ozone_prediction/
Wed, 07 Nov 2018 15:00:00 +0000
In a previous post, I have introduced the airquality data set in order to demonstrate how linear models are interpreted. In this post, I will start with a basic linear model and, from there, try to find a linear model with a better fit.
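One way to carry out the preprocessing described in the following (removing missing values, then a 70/30 train/test split) could be:

```r
aq <- na.omit(airquality)  # drop rows with missing values

set.seed(1)  # for a reproducible split
train_idx <- sample(nrow(aq), floor(0.7 * nrow(aq)))
train <- aq[train_idx, ]
test  <- aq[-train_idx, ]
```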
Data preprocessing
Since the airquality data set contains some missing values, we will remove those before we begin to fit models, select 70% of the samples for training, and use the remainder for testing:

Interpreting Linear Prediction Models
https://www.datascienceblog.net/post/machine-learning/linear_models/
Tue, 06 Nov 2018 15:00:00 +0000
Although linear models are one of the simplest machine learning techniques, they are still a powerful tool for predictions. This is largely because linear models are especially easy to interpret. Here, I discuss the most important aspects when interpreting linear models by example of ordinary least-squares regression using the airquality data set.
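A minimal example of this kind of model (ozone regressed on three of the metrics; the exact specification used in the post may differ):

```r
# Ordinary least squares on the airquality data; rows with NAs are
# dropped automatically via the default na.action
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Estimates, standard errors, t statistics, and p values
coef(summary(fit))
```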
The airquality data set
The airquality data set contains 153 measurements of the following four air quality metrics as obtained in New York:

Getting Your Point Across with Infographics
https://www.datascienceblog.net/post/data-visualization/infographics/
Tue, 06 Nov 2018 00:00:00 +0000
Nowadays, infographics are everywhere. Fortunately, you do not have to be a professional designer to create them because there are several free platforms that assist you in creating engaging infographics. In this post, I compare three freely available tools for creating static infographics: Venngage, easelly, and Infogram. Each of the tools is reviewed according to three criteria:
Customizability: number of available templates, graphics, fonts, and so on.
User experience: how easy is it to design/deploy infographics?