R for applications in data science

Two Environment Variables for More Robust R Code

I was recently alerted that my Bioconductor package openPrimeR was failing the automated package tests. The cause was a new environment variable that the Bioconductor team had decided to set when testing packages.

An Introduction to Forecasting

Forecasting is concerned with making predictions about future observations by relying on past measurements. In this article, I give an introduction to how ARMA, ARIMA (Box-Jenkins), SARIMA, and ARIMAX models can be used for forecasting given time-series data. Preliminaries: before we can talk about models for time-series data, we have to introduce two concepts. The backshift operator: given the time series \(y = \{y_1, y_2, \ldots \}\), the backshift operator (also called lag operator) is defined as
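Under the usual textbook convention, the backshift operator maps each observation to its predecessor, \(B y_t = y_{t-1}\), and applying it \(k\) times gives \(B^k y_t = y_{t-k}\). The following base-R sketch illustrates this; `backshift` is a hypothetical helper written for this example (not a function from any package), and the series values are made up.

```r
# Hypothetical helper: shift a series back by k steps (B^k y_t = y_{t-k}).
# The first k positions have no predecessor and are filled with NA.
backshift <- function(y, k = 1) {
  stopifnot(k >= 0, k <= length(y))
  c(rep(NA, k), head(y, length(y) - k))
}

y <- c(5, 7, 9, 12, 16)   # made-up observations y_1, ..., y_5

By  <- backshift(y)       # y_{t-1}
B2y <- backshift(y, 2)    # y_{t-2}

# First difference (1 - B) y_t, the operation behind the "I" in ARIMA
d1 <- y - By
```

Apart from the leading `NA`, `d1` agrees with what base R's `diff(y)` returns.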

Interpreting ROC Curves, Precision-Recall Curves, and AUCs

Receiver operating characteristic (ROC) curves are probably the most commonly used measure for evaluating the predictive performance of scoring classifiers. The confusion matrix of a classifier that predicts a positive class (+1) and a negative class (-1) has the following structure:

| Prediction / Reference | +1 | -1 |
|------------------------|----|----|
| +1                     | TP | FP |
| -1                     | FN | TN |

Here, TP indicates the number of true positives (model predicts positive class correctly), FP indicates the number of false positives (model incorrectly predicts positive class), FN indicates the number of false negatives (model incorrectly predicts negative class), and TN indicates the number of true negatives (model correctly predicts negative class).
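As a minimal sketch of how such a confusion matrix could be tabulated in base R, the following example cross-tabulates predictions against reference labels. The label vectors are hypothetical and serve only as an illustration.

```r
# Hypothetical reference labels and predictions (+1 / -1), for illustration only
reference  <- c( 1,  1,  1, -1, -1, -1, -1,  1)
prediction <- c( 1, -1,  1, -1,  1, -1, -1,  1)

# Fix the level order so that the positive class comes first
lv <- c(1, -1)
cm <- table(Prediction = factor(prediction, levels = lv),
            Reference  = factor(reference,  levels = lv))

# Rows are predictions, columns are reference classes
TP <- cm["1",  "1"]    # predicted +1, reference +1
FP <- cm["1",  "-1"]   # predicted +1, reference -1
FN <- cm["-1", "1"]    # predicted -1, reference +1
TN <- cm["-1", "-1"]   # predicted -1, reference -1
```

Setting the factor levels explicitly keeps all four cells in the table even if a class never occurs among the predictions.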

Performance Measures for Multi-Class Problems

For classification problems, classifier performance is typically defined according to the confusion matrix associated with the classifier. Based on the entries of the matrix, it is possible to compute sensitivity (recall), specificity, and precision. For a single cutoff, these quantities lead to balanced accuracy (sensitivity and specificity) or to the F1-score (recall and precision). To evaluate a scoring classifier at multiple cutoffs, these quantities can be used to determine the area under the ROC curve (AUC) or the area under the precision-recall curve (AUCPR).
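The single-cutoff quantities mentioned above can be sketched in a few lines of base R using their standard definitions. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical confusion-matrix counts for a binary classifier
TP <- 40; FP <- 10; FN <- 20; TN <- 30

sensitivity <- TP / (TP + FN)   # recall: share of positives that are found
specificity <- TN / (TN + FP)   # share of negatives that are found
precision   <- TP / (TP + FP)   # share of positive predictions that are correct

# Balanced accuracy: mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

# F1-score: harmonic mean of precision and recall
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
```

Note that the F1-score does not use TN at all, whereas balanced accuracy depends on all four cells of the confusion matrix.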

Linear, Quadratic, and Regularized Discriminant Analysis

Discriminant analysis encompasses methods that can be used for both classification and dimensionality reduction. Linear discriminant analysis (LDA) is particularly popular because it is both a classifier and a dimensionality reduction technique. Quadratic discriminant analysis (QDA) is a variant of LDA that allows for non-linear separation of data. Finally, regularized discriminant analysis (RDA) is a compromise between LDA and QDA. This post focuses mostly on LDA and explores its use as a classification and visualization technique, both in theory and in practice.

An Introduction to Probabilistic Programming with Stan in R

Probabilistic programming enables us to implement statistical models without having to worry about the technical details. It is particularly useful for Bayesian models that are based on MCMC sampling. In this article, I investigate how Stan can be used from R through its R interface, RStan. This post is largely based on the GitHub documentation of RStan and its vignette. Introduction to Stan: Stan is a C++ library for Bayesian inference.

Performance Measures for Feature Selection

In a recent post, I discussed performance measures for model selection. This time, I write about a related topic: performance measures that are suitable for selecting models when performing feature selection. Since feature selection is concerned with reducing the number of independent variables, suitable performance measures evaluate the trade-off between the number of features, \(p\), and the fit of the model. Performance measures for regression: mean squared error (MSE) and \(R^2\) are unsuited for comparing models during feature selection.
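One way to see why \(R^2\) is unsuited here: for nested linear models fitted by least squares, adding a feature can never decrease \(R^2\), even when the feature is pure noise. The sketch below uses simulated data; the choice of adjusted \(R^2\) as a measure that penalizes \(p\) is an assumption made for this illustration and is not taken from the excerpt.

```r
set.seed(1)
n <- 50
x1    <- rnorm(n)            # informative feature
noise <- rnorm(n)            # irrelevant feature, unrelated to y
y     <- 2 * x1 + rnorm(n)   # simulated response

fit_small <- summary(lm(y ~ x1))
fit_large <- summary(lm(y ~ x1 + noise))

# Plain R^2 cannot decrease when a feature is added
r2_small <- fit_small$r.squared
r2_large <- fit_large$r.squared

# Adjusted R^2 penalizes the number of features p:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_small <- fit_small$adj.r.squared
adj_large <- fit_large$adj.r.squared
```

Comparing `r2_large - r2_small` with `adj_large - adj_small` shows how much of the apparent gain is absorbed by the penalty on \(p\).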

The Case Against Precision as a Model Selection Criterion

I recently introduced sensitivity and specificity as performance measures for model selection. Besides these measures, there is also the notion of recall and precision. Precision and recall originate from information retrieval but are also used in machine learning settings. However, the use of precision and recall can be problematic in some situations. In this post, I discuss the shortcomings of recall and precision and show why sensitivity and specificity are generally more useful.

Dimensionality Reduction for Visualization and Prediction

Dimensionality reduction has two primary use cases: data exploration and machine learning. It is useful for data exploration because reducing the data to a few dimensions (e.g., 2 or 3) allows for visualizing the samples. Such a visualization can then be used to obtain insights from the data (e.g., to detect clusters and identify outliers). For machine learning, dimensionality reduction is useful because models often generalize better when fewer features are used during the fitting process.

Improving the whiskey distillery data set

I previously used a data set describing the characteristics of whiskeys to draw radar plots. Here, I present how I cleaned and augmented the original data from the University of Strathclyde, resulting in an improved version of the whiskey data set.


Posts about R

Two Environment Variables for More Robust R Code

An Introduction to Forecasting

Interpreting ROC Curves, Precision-Recall Curves, and AUCs

Performance Measures for Multi-Class Problems

Linear, Quadratic, and Regularized Discriminant Analysis

An Introduction to Probabilistic Programming with Stan in R

Performance Measures for Feature Selection

The Case Against Precision as a Model Selection Criterion

Dimensionality Reduction for Visualization and Prediction

Improving the whiskey distillery data set