Data Science Glossary
Machine Learning Glossary
General Terms
Data Point | Data Wrangling | Machine Learning |
Model Complexity | Model | Reinforcement Learning |
Supervised Learning | Unsupervised Learning |
Reinforcement Learning
Action | Agent | Environment |
Observation | Policy | Reward |
State |
Supervised Learning
Unsupervised Learning
Clustering | Dimensionality Reduction | k-means |
PCA |
Action
In reinforcement learning, agents try to perform actions that maximize the reward. Each action changes the learning environment and thus yields a new state.
Agent
In reinforcement learning, an agent is the learner that interacts with the environment. Based on a given state, the agent selects an appropriate action by considering past earned rewards. The policy of an agent determines the actions that should be executed for each state.
AUC
AUC means area under the curve. When evaluating scoring classifiers, the term AUC usually refers to the ROC (receiver operating characteristic)-AUC. The ROC curve plots the true positive rate against the false positive rate for all cutoffs on the scores. When available, the ROC-AUC is preferable to other measures such as sensitivity and specificity because it does not depend on a single cutoff.
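As a minimal sketch, the ROC-AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample (ties count one half). The function name `roc_auc` and the example data are illustrative assumptions, and labels are assumed to be coded as +1 and -1.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: sequence of +1 / -1 class labels
    scores: sequence of classifier scores (higher means "more positive")
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    # Count the pairs in which the positive sample is ranked above the negative one.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 8 of the 9 positive/negative pairs are ordered correctly.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores))
```

The pairwise formulation is quadratic in the number of samples, so it is only meant to illustrate the definition on small data sets.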
Categorical Outcome
See Outcome.
Class
See categorical outcomes.
Classifier
Classifiers (classification models) are used for the prediction of categorical outcomes. Classifiers that output quantitative outcomes are called scoring classifiers and are more interpretable than non-scoring classifiers.
Clustering
Clustering, one of the main applications of unsupervised learning, is used to assign each sample to a group of samples. These groups of samples are called clusters. Clustering can be used for the visual exploration of data or for the automated identification of outliers. One of the simplest and most well-known clustering algorithms is k-means.
Confusion Matrix
The confusion matrix is used to evaluate the predictive performance of a classifier. The name confusion matrix stems from the fact that the table illustrates which predictions are confused among the two classes. For binary classifiers, which differentiate between a positive (+1) and a negative (-1) class, the confusion matrix is a 2x2 table of the following form:
Predicted Class | Ground Truth: +1 | Ground Truth: -1
---|---|---
+1 | TP | FP
-1 | FN | TN
The entries are defined as follows:
- TP (true positives): The number of samples from the positive class that were correctly predicted as positive
- FP (false positives): The number of samples from the negative class that were incorrectly predicted as positive
- FN (false negatives): The number of samples from the positive class that were incorrectly predicted as negative
- TN (true negatives): The number of samples from the negative class that were correctly predicted as negative
From the confusion matrix, one can determine performance metrics such as sensitivity and specificity. Computing the AUC additionally requires the confusion matrices for all cutoffs on the scores.
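The four entries can be counted directly from paired ground-truth and predicted labels. The following is a minimal sketch that assumes labels coded as +1 and -1; the function name `confusion_matrix` and the example labels are illustrative.

```python
def confusion_matrix(y_true, y_pred):
    """Return the counts (TP, FP, FN, TN) for labels coded as +1 / -1."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # positive sample, predicted positive
        elif pred == 1 and truth == -1:
            fp += 1  # negative sample, predicted positive
        elif pred == -1 and truth == 1:
            fn += 1  # positive sample, predicted negative
        else:
            tn += 1  # negative sample, predicted negative
    return tp, fp, fn, tn


y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]
print(confusion_matrix(y_true, y_pred))  # (3, 1, 1, 3)
```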
Cross-Validation
Cross-validation is a strategy for evaluating the predictive performance of a model. In k-fold cross-validation, the data set is split into k folds such that each fold is used for testing once, while the remaining data are used for training. Nested cross-validation introduces another layer: within each training portion, an inner cross-validation is used for selecting the model that is then evaluated on the outer test fold.
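The splitting logic of k-fold cross-validation can be sketched as follows. The helper `k_fold_indices` is a hypothetical name, and the sketch assigns contiguous index blocks to the folds; in practice the samples would typically be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Distribute samples as evenly as possible:
    # the first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # this fold is held out
        train = indices[:start] + indices[start + size:]  # all remaining samples
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so each data point contributes to the performance estimate exactly once.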
Data Point
See Observation.
Data Wrangling
Data wrangling describes the unpopular task of transforming data into a machine-readable format. For example, data wrangling could entail transforming semi-structured data (e.g. from spreadsheets) to the CSV (comma-separated values) format. Data wrangling is often performed via automated scripts but may also involve manual steps. Note that data wrangling does not involve feature engineering.
Dependent Variable
See Outcome.
Dimensionality Reduction
In dimensionality reduction, data are projected to a low-dimensional subspace. This is either done in order to obtain better data visualizations or during feature engineering in the context of supervised learning. Dimensionality reduction techniques such as PCA are unsupervised methods.
Environment
In reinforcement learning, the environment determines the observable states and the actions that an agent can perform. A popular framework for specifying environments is OpenAI’s Gym.
Estimate
See Prediction.
False Positive Rate
Given a classifier, the false positive rate represents the ratio of false positive predictions among all samples from the negative class:
FPR = FP / (FP + TN)
See also Confusion Matrix.
Feature matrix
See Features.
Features
Features are the independent variables in the supervised learning scenario. The columns of a feature matrix, \(X \in \mathbb{R}^{n \times p}\), represent the values of the p features. For example, to predict the weather, two possible features are the level of precipitation and the cloudiness.
Feature Engineering
Supervised learning aims at learning the general associations between features and outcomes. However, in their original form, the input data are often not well-suited for this purpose. Feature engineering is concerned with transforming the data such that machine learning models can easily learn from the data.
Forecasting
See Prediction.
Gold Standard
See Ground Truth.
Ground Truth
In order to perform supervised learning, it is necessary that the outcome for each data point is known. The measured outcome should reflect the ground truth. Otherwise, models are optimized with respect to the wrong values, a problem known as garbage in, garbage out.
Independent Variable
See Features.
Inference
See Prediction.
Interpretability
Interpretability describes whether a model is able to produce results that humans can easily interpret. Interpretability is closely tied to model complexity (i.e. the effective number of model parameters). Simple models such as linear models have few parameters and can easily be understood and interpreted. Complex models such as deep neural networks have large numbers of parameters, which makes them hard to understand and interpret.
There are many application scenarios in which it is acceptable to sacrifice some predictive performance in favor of greater interpretability. This is because in machine learning applications such as decision support systems, it is key that human operators can understand the intentions of the model.
k-means
k-means is a simple yet powerful clustering algorithm that identifies k cluster centers in the data. The algorithm terminates when the cluster centers have converged.
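A minimal sketch of the standard iteration behind k-means (Lloyd's algorithm) is shown below. It alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points, and stops once the centers no longer change. Using the first k points as initial centers is an illustrative assumption; practical implementations use more careful initialization.

```python
import math


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Returns the final cluster centers and the points assigned to each center.
    """
    centers = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: the centers no longer move
            break
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, k=2)
print(centers)
```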
Label
In classification, labels are the values that are used to differentiate between individual classes. For example, one could use Sunny and Cloudy as labels for observations that have been made on sunny and cloudy days, respectively. However, to apply supervised learning algorithms, numeric labels such as +1 and -1 would be necessary.
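Converting the weather labels from this example into numeric labels could look as follows. The helper name `encode_labels` is hypothetical, and treating Sunny as the positive class is an arbitrary choice made for illustration.

```python
def encode_labels(labels, positive_class):
    """Map the positive class to +1 and every other label to -1."""
    return [1 if label == positive_class else -1 for label in labels]


weather = ["Sunny", "Cloudy", "Cloudy", "Sunny"]
print(encode_labels(weather, positive_class="Sunny"))  # [1, -1, -1, 1]
```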
Linear Model
See Model.
Machine Learning
Machine learning encompasses artificial intelligence approaches that are concerned with learning from data. There are three machine learning areas: supervised learning, unsupervised learning, and reinforcement learning.
Once a model has been fitted to the data, it is possible to make predictions given new data points (supervised learning), structure data (unsupervised learning), or select optimal actions in a dynamic environment (reinforcement learning).
Model
Models are mathematical approximations of real-world phenomena. In supervised learning, models are constructed using pairs of input data and observed outcomes. In unsupervised learning, the outcomes are not available, so only the structure of the data is modeled. In reinforcement learning, models are constructed according to states, actions, and rewards.
Besides these machine learning approaches, which use optimization algorithms to fit models to data, there is a host of other models that are useful for specific tasks, for example, hidden Markov models, epidemiological models, and Bayesian models.
It is possible to differentiate between linear and non-linear models. While linear models assume a linear relationship between the features and the outcome, non-linear models can capture non-linear relationships.
One should always remember the following famous quote from British statistician George E.P. Box:
All models are wrong but some are useful.
Model Complexity
Model complexity is defined by the effective number of parameters that make up a model. For example, deep learning models with many parameters are more complex than simple models, such as linear models. Complex models should be avoided if insufficient training data are available.
Model Validation
Model validation entails the following steps:
- Fitting the model to a set of training data
- Tuning the hyperparameters of the model using a set of validation data
- Evaluating predictive performance on an independent test data set
The two most popular approaches for validation are:
- Splitting the data into a training, validation, and test set
- Using cross-validation, in which the model is trained on various subsets of the data.
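The first approach, splitting the data into a training, validation, and test set, can be sketched as follows. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed random seed are illustrative assumptions rather than recommendations.

```python
import random


def train_val_test_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into training, validation, and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded shuffle keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]                # held out for the final evaluation
    val = items[n_test:n_test + n_val]   # used for tuning hyperparameters
    train = items[n_test + n_val:]       # used for fitting the model
    return train, val, test


train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```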
Non-Linear Model
See Model.
Outcome
In supervised learning, the outcome is a measurement of the ground truth.
Principal types of outcomes are categorical outcomes (class labels) and quantitative outcomes. For example, when predicting the weather, Sunny and Cloudy would be categorical outcomes, while the amount of precipitation would be a quantitative outcome.
The underlying variable associated with the outcome is called the dependent variable.
Observation
In supervised learning, observations are the rows of the feature matrix. Observations are also called data points or samples. The number of observations is usually denoted by n.
For the use of the term observation in reinforcement learning, see State.
PCA
Principal component analysis (PCA) is a standard dimensionality reduction technique. It is based on finding a projection to orthogonal coordinates that maintain as much variance as possible.
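To illustrate the idea, the direction of maximal variance (the first principal component) can be found as the dominant eigenvector of the covariance matrix. The sketch below uses power iteration in plain Python; this is one of several ways to compute it, not the method of any particular library. The function name and the example data are assumptions, and the fixed start vector means the sketch fails if that vector happens to be orthogonal to the sought direction.

```python
import math


def first_principal_component(data, iterations=100):
    """Return the unit vector along which the data have maximal variance."""
    n, p = len(data), len(data[0])
    # Center each feature at zero.
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Points on the line y = x / 2: all variance lies along the direction (2, 1).
data = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0)]
print(first_principal_component(data))
```

Projecting the centered data onto this vector yields a one-dimensional representation that retains as much variance as possible.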
Policy
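In reinforcement learning, the policy of an agent is a mapping from states to actions. This means that the policy defines the behavior of the agent in the environment. Reinforcement learning algorithms are either on-policy (they learn about the policy that is currently being executed) or off-policy (they learn about a policy that differs from the one generating the behavior).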
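In the simplest case, a policy can be represented as a lookup table from states to actions. The sketch below derives such a table greedily from estimated action values; the state names, action names, and values are made up for illustration.

```python
def greedy_policy(action_values):
    """Map each state to the action with the highest estimated value."""
    return {
        state: max(values, key=values.get)
        for state, values in action_values.items()
    }


# Hypothetical action-value estimates for a thermostat agent.
action_values = {
    "cold": {"heat": 1.0, "wait": 0.2},
    "warm": {"heat": -0.5, "wait": 0.8},
}
policy = greedy_policy(action_values)
print(policy)  # {'cold': 'heat', 'warm': 'wait'}
```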
Performance
In supervised learning, predictive performance is the ability of a model to correctly predict the outcomes of observations. To quantify the predictive performance of classifiers, metrics such as the AUC can be utilized.
Prediction
Prediction is the act of applying a model to a new data point in order to determine the estimated outcome. Inference is often used synonymously, although inference is geared towards learning about the data generation process. Forecasting is a special form of prediction in which time series are used as the input.
The term estimate is a synonym for prediction that is popular in the statistical community because it underlines the fact that predictions are only approximations of reality.
Quantitative Outcome
See Outcome.
Regressor
Regressors (regression models) are used to predict the outcomes for quantitative variables. Compared to classifiers, they allow for fine-grained predictions.
Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning in which one or multiple agents perform actions in an environment, after observing the state. Once an action has been performed, the agent receives a reward. By balancing exploration (finding novel states) and exploitation (reaping rewards), RL agents can learn an optimal policy, which identifies the best action to take for every state.
In recent years, reinforcement learning has gained in popularity due to the emergence of deep RL, in which deep neural networks are used to learn which states are associated with the greatest rewards.
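The balance between exploration and exploitation can be illustrated with an epsilon-greedy agent. The sketch below makes strong simplifying assumptions: the environment has a single state, and each action yields a fixed reward (a deterministic multi-armed bandit). The function name `run_bandit` and all parameter values are illustrative.

```python
import random


def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose actions yield fixed rewards.

    Returns the estimated value of each action and how often it was chosen.
    """
    rng = random.Random(seed)
    n_actions = len(rewards)
    estimates = [0.0] * n_actions  # running mean of the reward per action
    counts = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # exploration: try a random action
        else:
            # Exploitation: take the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = rewards[action]  # the environment responds with a reward
        counts[action] += 1
        # Incremental update of the mean reward observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.0, 1.0])
print(estimates, counts)
```

Without exploration (epsilon set to 0), this agent would keep choosing the first action forever and would never discover that the second action yields a higher reward.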
Reward
In reinforcement learning, agents obtain rewards after performing an action. Agents adjust their policy in order to maximize the reward.
Sample
For its use in supervised learning, see Observation.
Sensitivity
The sensitivity of a classifier is defined by its true positive rate:
sensitivity = TPR = TP / (TP + FN)
See also Confusion Matrix.
Specificity
Specificity indicates the true negative rate of a classifier:
specificity = TNR = TN / (TN + FP) = 1 - FP / (FP + TN) = 1 - FPR
Since specificity considers the number of false positives (FP), it allows for conclusions about the false positive rate (FPR). See also Confusion Matrix.
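As a small worked example, the following sketch computes sensitivity, specificity, and the false positive rate from the entries of a confusion matrix and checks that specificity and FPR add up to one. The counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of positive samples predicted as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of negative samples predicted as negative."""
    return tn / (tn + fp)


def false_positive_rate(fp, tn):
    """Fraction of negative samples predicted as positive."""
    return fp / (fp + tn)


# Hypothetical confusion matrix entries.
tp, fp, fn, tn = 40, 10, 20, 30
print(sensitivity(tp, fn))          # 40 / 60
print(specificity(tn, fp))          # 30 / 40 = 0.75
print(false_positive_rate(fp, tn))  # 10 / 40 = 0.25
```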
State
In reinforcement learning, the state indicates the observations that an agent has made at a given point in time. States are usually represented by numeric vectors or matrices. Crafting appropriate states is a form of feature engineering.
Supervised Learning
Supervised learning is an area of machine learning that is concerned with learning from pairs of input data and associated outcomes. Once a model has been trained on a set of training data, it is tuned using a validation data set, and, finally, evaluated on an independent test data set. The application of a supervised learning model on new data is called prediction, inference, or forecasting.
Models that are trained on labeled data (i.e. categorical outcomes) are called classifiers. Models that are trained on quantitative outcomes are called regressors.
Test Data
Test data refers to the set of data that is used for evaluating the predictive performance of a model.
Training Data
Training data refers to the set of data that is used for fitting a model.
True Positive Rate
See Sensitivity.
Unsupervised Learning
Unsupervised learning is an area of machine learning that is concerned with the identification of models that are capable of representing the properties of the data in a condensed manner, which allows for greater interpretability.
Evaluating the performance of unsupervised learning methods is more challenging than for supervised learning because there are no outcomes that provide the ground truth. Popular unsupervised methods include k-means and PCA.
Validation Data
Validation data refers to the set of data that is used for tuning the hyperparameters of a model.