Data Science Glossary

Machine Learning Glossary

General Terms

Data Point Data Wrangling Machine Learning
Model Complexity Model Reinforcement Learning
Supervised Learning Unsupervised Learning

Reinforcement Learning

Action Agent Environment
Observation Policy Reward
State

Supervised Learning

AUC Categorical Outcome Class
Classifier Confusion Matrix Cross-Validation
Dependent Variable Estimate False Positive Rate
Feature Engineering Forecasting Gold Standard
Ground Truth Independent Variable Inference
Interpretability Label
Linear Model Model Validation Non-Linear Model
Observation Outcome Performance
Prediction Quantitative Outcome Regressor
Sensitivity Specificity Test Data
Training Data True Positive Rate Validation Data

Unsupervised Learning

Clustering Dimensionality Reduction k-means
PCA

Action

In reinforcement learning, agents try to perform actions that maximize the reward. Each action changes the learning environment and thus yields a new state.

Agent

In reinforcement learning, an agent is the learner that interacts with the environment. Based on a given state, the agent selects an appropriate action by considering past earned rewards. The policy of an agent determines the actions that should be executed for each state.

AUC

AUC means area under the curve. When evaluating scoring classifiers, the term AUC usually refers to the ROC (receiver operating characteristic)-AUC. The ROC curve plots the true positive rate against the false positive rate for all cutoffs on the scores. When scores are available, the ROC-AUC is often preferable to measures such as sensitivity and specificity because it does not depend on a single cutoff.
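
As a minimal sketch, the ROC-AUC can be computed through its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The function name roc_auc and the example labels and scores are illustrative assumptions and not part of this glossary.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: +1 (positive class) or -1 (negative class)
    scores: classifier scores, where higher means "more positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive sample ranked above negative sample
            elif p == n:
                wins += 0.5   # ties count as one half
    return wins / (len(pos) * len(neg))


# Hypothetical example data.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; for large data sets a rank-based computation would be the more usual choice.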

Categorical Outcome

See Outcome.

Classifier

Classifiers (classification models) are used for the prediction of categorical outcomes. Classifiers that output quantitative scores are called scoring classifiers and are more interpretable than non-scoring classifiers.

Clustering

Clustering, one of the main applications of unsupervised learning, is used to assign each sample to a group of samples. These groups of samples are called clusters. Clustering can be used for the visual exploration of data or for the automated identification of outliers. One of the simplest and most well-known clustering algorithms is k-means.

Confusion Matrix

The confusion matrix is used to evaluate the predictive performance of a classifier. The name confusion matrix stems from the fact that the table illustrates which predictions are confused among the two classes. For binary classifiers, which differentiate between a positive (+1) and a negative (-1) class, the confusion matrix is a 2x2 table of the following form:

                   Ground Truth
Predicted Class     +1     -1
      +1            TP     FP
      -1            FN     TN

The entries are defined as follows:

  • TP (true positives): The number of samples from the positive class that were correctly predicted as positive
  • FP (false positives): The number of samples from the negative class that were falsely predicted as positive
  • FN (false negatives): The number of samples from the positive class that were falsely predicted as negative
  • TN (true negatives): The number of samples from the negative class that were correctly predicted as negative

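From the confusion matrix, one can determine performance metrics such as sensitivity and specificity. Computing the AUC additionally requires the confusion matrices for all cutoffs on the scores of the classifier.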
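
The four entries of the confusion matrix can be counted directly from the ground truth and the predictions. The following is a minimal sketch; the function name confusion_counts and the example vectors are illustrative assumptions, and labels are assumed to be coded as +1 and -1.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels coded as +1 / -1."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # positive sample, predicted positive
        elif truth == -1 and pred == 1:
            fp += 1   # negative sample, predicted positive
        elif truth == 1 and pred == -1:
            fn += 1   # positive sample, predicted negative
        elif truth == -1 and pred == -1:
            tn += 1   # negative sample, predicted negative
        else:
            raise ValueError("labels must be +1 or -1")
    return tp, fp, fn, tn


# Hypothetical example data.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
```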

Cross-Validation

Cross-validation is a strategy for evaluating the predictive performance of a model. In k-fold cross-validation, the data set is split into k folds such that each fold is used for testing exactly once, while the remaining k-1 folds are used for training. Nested cross-validation adds an inner cross-validation loop within each training set, which is used for selecting the model (e.g. tuning its hyperparameters) that is then evaluated on the outer test fold.
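
The following sketch generates the index sets for k-fold cross-validation, where each fold serves as the test set exactly once and the remaining folds form the training set. The function name k_fold_indices is a hypothetical helper. In practice, the indices would usually be shuffled before splitting, which is omitted here for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 10 samples split into 3 folds of sizes 4, 3, and 3.
splits = list(k_fold_indices(10, 3))
```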

Data Point

See Observation.

Data Wrangling

Data wrangling describes the unpopular task of transforming data into a machine-readable format. For example, data wrangling could entail transforming semi-structured data (e.g. from spreadsheets) to the CSV (comma-separated values) format. Data wrangling is often performed via automated scripts but may also involve manual steps. Note that data wrangling does not involve feature engineering.
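
As an illustration of a scripted wrangling step, the following sketch converts records with inconsistent column names and stray whitespace into CSV text using Python's standard csv module. The record contents, the column names, and the helper records_to_csv are made-up assumptions.

```python
import csv
import io


def records_to_csv(records, columns):
    """Normalize keys and values, then write the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # fill missing columns with empty strings
        extrasaction="ignore",  # drop columns that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        cleaned = {key.strip().lower(): str(value).strip()
                   for key, value in record.items()}
        writer.writerow(cleaned)
    return buffer.getvalue()


# Hypothetical semi-structured input, e.g. exported from a spreadsheet.
raw = [
    {" Date": "2021-01-01 ", "Precipitation": 2.5},
    {"date": "2021-01-02", "CLOUDINESS": "high", "note": "manual entry"},
]
text = records_to_csv(raw, ["date", "precipitation", "cloudiness"])
```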

Dependent Variable

See Outcome.

Dimensionality Reduction

In dimensionality reduction, data are projected to a low-dimensional subspace. This is either done in order to obtain better data visualizations or during feature engineering in the context of supervised learning. Dimensionality reduction techniques such as PCA are unsupervised methods.

Environment

In reinforcement learning, the environment determines the observable states and the actions that an agent can perform. A popular framework for specifying environments is OpenAI’s Gym.

Estimate

See Prediction.

False Positive Rate

Given a classifier, the false positive rate represents the ratio of false positive predictions among all samples from the negative class:

FPR = FP / (FP + TN)

See also Confusion Matrix.

Feature Matrix

See Features.

Features

Features are the independent variables in the supervised learning scenario. The columns of a feature matrix, \(X \in \mathbb{R}^{n \times p}\), represent the values of the p features, while its n rows are the observations. For example, to predict the weather, two possible features are the level of precipitation and the cloudiness.

Feature Engineering

Supervised learning aims at learning the general associations between features and outcomes. However, in their original form, the input data are often not well-suited for this purpose. Feature engineering is concerned with transforming the data such that machine learning models can easily learn from the data.
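
Two common feature engineering transformations are standardizing a quantitative feature and one-hot encoding a categorical feature. The following is a minimal sketch of both; the helper names and the example values are illustrative assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a quantitative feature to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical feature as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


# Hypothetical weather features.
precipitation = [0.0, 2.0, 4.0]
cloudiness = ["low", "high", "low"]
scaled = standardize(precipitation)
encoded, categories = one_hot(cloudiness)
```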

Forecasting

See Prediction.

Gold Standard

See Ground Truth.

Ground Truth

In order to perform supervised learning, it is necessary that the outcome for each data point is known. The measured outcome should reflect the ground truth. Otherwise, models are optimized with respect to the wrong values (garbage in, garbage out).

Independent Variable

See Features.

Inference

See Prediction.

Interpretability

Interpretability describes whether a model is able to produce results that humans can easily interpret. Interpretability is closely tied to model complexity (i.e. the effective number of model parameters). Simple models such as linear models have few parameters and can easily be understood and interpreted. Complex models such as deep neural networks have large numbers of parameters, which makes them hard to understand and interpret.

There are many application scenarios in which it is acceptable to sacrifice some predictive performance in favor of greater interpretability. This is because in machine learning applications such as decision support systems, it is key that human operators can understand the reasoning behind the predictions of the model.

k-means

k-means is a simple yet powerful clustering algorithm that identifies k cluster centers in the data. The algorithm terminates when the cluster centers have converged.
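
A minimal sketch of the standard k-means iteration (Lloyd's algorithm) is shown below: samples are assigned to the nearest center, each center is moved to the mean of its assigned samples, and the procedure stops once the centers no longer change. Initializing the centers with randomly sampled data points, the function name k_means, and the example points are assumptions made for illustration.

```python
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k data points
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for cluster, old in zip(clusters, centers):
            if cluster:
                new_centers.append(
                    tuple(sum(col) / len(cluster) for col in zip(*cluster)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers did not move
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, k=2)
```

Because the result of k-means depends on the initialization, the algorithm is usually run several times with different starting centers in practice.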

Label

In classification, labels are the values that are used to differentiate between individual classes. For example, one could use Sunny and Cloudy as labels for observations that have been made on sunny and cloudy days, respectively. However, many implementations of supervised learning algorithms require numeric labels such as +1 and -1.
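
Converting between textual and numeric labels can be sketched with a simple lookup table. Which class is mapped to +1 is an arbitrary choice made for this example.

```python
# Assumed mapping: Sunny is treated as the positive class.
label_to_code = {"Sunny": 1, "Cloudy": -1}
code_to_label = {code: label for label, code in label_to_code.items()}

observed = ["Sunny", "Cloudy", "Cloudy", "Sunny"]
encoded = [label_to_code[label] for label in observed]  # numeric labels
decoded = [code_to_label[code] for code in encoded]     # back to text
```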

Linear Model

See Model.

Machine Learning

Machine learning encompasses artificial intelligence approaches that are concerned with learning from data. There are three machine learning areas: supervised learning, unsupervised learning, and reinforcement learning.

Once a model has been fitted to the data, it is possible to make predictions given new data points (supervised learning), structure data (unsupervised learning), or select optimal actions in a dynamic environment (reinforcement learning).

Model

Models are mathematical approximations of real-world phenomena. In supervised learning, models are constructed using pairs of input data and observed outcomes. In unsupervised learning, the outcomes are not available, so only the structure of the data is modeled. In reinforcement learning, models are constructed according to states, actions, and rewards.

Besides these machine learning approaches, which use optimization algorithms to fit models to data, there is a host of other models that are useful for specific tasks, for example, hidden Markov models, epidemiological models, and Bayesian models.

It is possible to differentiate between linear and non-linear models. While linear models assume a linear relationship between the features and the outcome, non-linear models can capture non-linear relationships.

One should always remember the following famous quote from British statistician George E.P. Box:

All models are wrong but some are useful.

Model Complexity

Model complexity is defined by the effective number of parameters that make up a model. For example, deep learning models with many parameters are more complex than simple models, such as linear models. Complex models should be avoided if sufficient training data are not available.

Model Validation

Model validation entails the following steps:

  1. Fitting the model to a set of training data
  2. Tuning the hyperparameters of the model using a set of validation data
  3. Evaluating predictive performance on an independent test data set

The two most popular approaches for validation are:

  1. Splitting the data into a training, validation, and test set
  2. Using cross-validation, in which the model is trained on various subsets of the data.
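
The first approach, a single split into training, validation, and test sets, can be sketched as follows. The split fractions, the fixed random seed, and the function name train_val_test_split are assumptions chosen for illustration.

```python
import random


def train_val_test_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into three disjoint subsets."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("fractions must leave room for training data")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# 100 hypothetical observations, identified by their index.
train, val, test = train_val_test_split(range(100))
```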

Non-Linear Model

See Model.

Outcome

In supervised learning, the outcome is a measurement of the ground truth. Principal types of outcomes are categorical outcomes (class labels) and quantitative outcomes. For example, when predicting the weather, Sunny and Cloudy would be categorical outcomes, while the amount of precipitation would be a quantitative outcome.

The underlying variable associated with the outcome is called the dependent variable.

Observation

In supervised learning, observations are the rows of the feature matrix. Observations are also called data points or samples. The number of observations is usually denoted by n.

For the use of the term observation in reinforcement learning, see State.

PCA

Principal component analysis (PCA) is a standard dimensionality reduction technique. It is based on finding a projection to orthogonal coordinates (the principal components) that retains as much variance as possible.
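
The following sketch estimates only the first principal component, i.e. the direction of greatest variance, by applying power iteration to the sample covariance matrix. Power iteration is used here to keep the example free of dependencies; PCA implementations typically rely on an eigendecomposition or a singular value decomposition instead. The function name and the example data are illustrative assumptions, and the sketch assumes that the data have non-zero variance.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return the first principal direction and the projected scores."""
    n, p = len(X), len(X[0])
    # Center each feature (column) at zero.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    # Sample covariance matrix of the features (p x p).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centered data onto the principal direction.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Hypothetical data in which the second feature is roughly twice the first.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
direction, scores = first_principal_component(X)
```

Projecting the two-dimensional data onto the returned direction yields a one-dimensional representation that retains most of the variance of this example.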

Policy

In reinforcement learning, the policy of an agent is a mapping from states to actions. This means that the policy defines the behavior of the agent in the environment. There are on-policy and off-policy reinforcement learning algorithms.

Performance

In supervised learning, predictive performance is the ability of a model to correctly predict the outcomes of observations. To quantify the predictive performance of classifiers, metrics such as the AUC can be utilized.

Prediction

Prediction is the act of applying a model to a new data point in order to determine the estimated outcome. Inference is often used synonymously, although inference is geared towards learning about the data generation process. Forecasting is a special form of prediction in which time series are used as the input.

The term estimate is a synonym for prediction that is popular in the statistical community because it underlines the fact that predictions are only approximations of reality.

Quantitative Outcome

See Outcome.

Regressor

Regressors (regression models) are used to predict the outcomes for quantitative variables. Compared to classifiers, they allow for fine-grained predictions.

Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning in which one or multiple agents perform actions in an environment, after observing the state. Once an action has been performed, the agent receives a reward. By balancing exploration (finding novel states) and exploitation (reaping rewards), RL agents can learn an optimal policy, which identifies the best action to take for every state.

In recent years, reinforcement learning has gained in popularity due to the emergence of deep RL, in which deep neural networks are used to learn which states are associated with the greatest rewards.
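
To make the interplay of agent, environment, state, action, reward, and policy concrete, the following sketch applies tabular Q-learning with epsilon-greedy exploration to a tiny corridor environment. The environment, the hyperparameters, and all names are made up for illustration. Q-learning is only one of many RL algorithms and is chosen here because it fits in a few lines.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, 1)   # move left or move right


def step(state, action):
    """Environment: return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2,
               max_steps=100, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # exploration
            else:
                # Exploitation: pick a best-valued action, ties at random.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate towards reward + discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)])
            state = next_state
            if done:
                break
    # Greedy policy: the best-valued action for each non-terminal state.
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in range(N_STATES - 1)}


policy = q_learning()
```

The learned policy maps every non-terminal state to the action of moving right, which is the shortest path to the reward in this environment.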

Reward

In reinforcement learning, agents obtain rewards after performing an action. Agents adjust their policy in order to maximize the reward.

Sample

For its use in supervised learning, see Observation.

Sensitivity

The sensitivity of a classifier is defined by its true positive rate:

sensitivity = TPR = TP/(TP+FN).

See also Confusion Matrix.

Specificity

Specificity indicates the true negative rate of a classifier:

specificity = TNR = TN / (TN + FP) = 1 - FP / (FP + TN) = 1 - FPR

Since specificity considers the number of false positives (FP), it allows for conclusions about the false positive rate (FPR). See also Confusion Matrix.
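
Using the entries of the confusion matrix, sensitivity, specificity, and the false positive rate can be computed as in the following sketch. The function names and the example counts are illustrative assumptions.

```python
def sensitivity(tp, fn):
    """True positive rate: share of positive samples predicted as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of negative samples predicted as negative."""
    return tn / (tn + fp)


def false_positive_rate(fp, tn):
    """Share of negative samples that were predicted as positive."""
    return fp / (fp + tn)


# Hypothetical counts from a confusion matrix.
tp, fp, fn, tn = 40, 5, 10, 45
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
fpr = false_positive_rate(fp, tn)
```

With these counts, the sensitivity is 0.8, the specificity is 0.9, and the false positive rate is 0.1, so specificity and false positive rate add up to one.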

State

In reinforcement learning, the state indicates the observations that an agent has made at a given point in time. States are usually represented by numeric vectors or matrices. Crafting appropriate states is a form of feature engineering.

Supervised Learning

Supervised learning is an area of machine learning that is concerned with learning from pairs of input data and associated outcomes. Once a model has been trained on a set of training data, it is tuned using a validation data set, and, finally, evaluated on an independent test data set. The application of a supervised learning model on new data is called prediction, inference, or forecasting.

Models that are trained on labeled data (i.e. categorical outcomes) are called classifiers. Models that are trained on quantitative outcomes are called regressors.

Test Data

Test data refers to the set of data that is used for evaluating the predictive performance of a model.

Training Data

Training data refers to the set of data that is used for fitting a model.

True Positive Rate

See Sensitivity.

Unsupervised Learning

Unsupervised learning is an area of machine learning that is concerned with the identification of models that are capable of representing the properties of the data in a condensed manner, which allows for greater interpretability.

Evaluating the performance of unsupervised learning methods is more challenging than for supervised learning because there are no outcomes that provide the ground truth. Popular unsupervised methods include k-means and PCA.

Validation Data

Validation data refers to the set of data that is used for tuning the hyperparameters of a model.