One of the main criteria indicating the quality of a machine learning model is its predictive performance. However, suitable performance measures differ depending on the prediction task. This post investigates the quantities most commonly used for selecting regression and classification models.
Variables can be characterized by their values as well as their roles. Depending on their values, variables are categorized as quantitative, categorical, or ordinal. Moreover, when variables are used in statistical models, additional terms such as dependent, independent, and confounding variable indicate their role.
With Staticman, it is possible to integrate user-generated content into static sites. Here, I demonstrate how Staticman can be used to add polls to websites that are generated with Hugo.
Dimensionality reduction is primarily used for exploring data and for reducing the feature space in machine learning applications. In this post, I investigate techniques such as PCA to obtain insights from a whiskey data set and show how PCA can be used to improve supervised approaches. Finally, I introduce the notion of the whiskey twilight zone.
In this post, I clean up and augment a data set of whiskey distilleries that provides their taste characteristics. The improved data set adds the regions where the distilleries are situated, as well as their geographical location in terms of longitude and latitude.
Radar plots are exceptional for visualizing the properties of individual objects. Here, I demonstrate how to draw radar plots in R by plotting the properties of whiskeys from several distilleries.
Generalized linear models (GLMs) are related to conventional linear models, but there are some important differences. For example, GLMs are based on the deviance rather than on conventional residuals, and they enable the use of different distributions and link functions. This post investigates how these aspects influence the interpretation of GLMs.
Although ordinary least-squares regression is often used, it is not appropriate for all types of data. Using the airquality data set, I try to find a generalized linear model that fits the data better. For this purpose, I use the following methods: weighted regression, Poisson regression, and imputation.
Linear machine learning models are very convenient for interpretation. This post discusses the following aspects: residuals, coefficients, standard errors, p-values, the F-statistic, and much more.
People without technical backgrounds can have a hard time understanding plots. Infographics offer a less formal and more easily understood means of conveying information. This post compares several free tools for creating engaging infographics.